
Digital Technical Journal

SPIRALOG LOG-STRUCTURED FILE SYSTEM
OPENVMS FOR 64-BIT ADDRESSABLE VIRTUAL MEMORY
HIGH-PERFORMANCE MESSAGE PASSING FOR CLUSTERS
SPEECH RECOGNITION SOFTWARE

Volume 8 Number 2
1996


Editorial
Jane C. Blake, Managing Editor
Kathleen M. Stetson, Editor
Helen L. Patterson, Editor

Circulation
Catherine M. Phillips, Administrator
Dorothea B. Cassady, Secretary

Production
Terri Autieri, Production Editor
Anne S. Katzeff, Typographer
Peter R. Woodbury, Illustrator

Advisory Board
Samuel H. Fuller, Chairman
Richard W. Beane
Donald Z. Harbert
William R. Hawe
Richard J. Hollingsworth
William A. Laing
Richard F. Lary
Alan G. Nemeth
Pauline A. Nist
Robert M. Supnik

Cover Design
Digital's new Spiralog file system, a featured topic in the issue, supports full 64-bit system capability and fast backup and is integrated with the OpenVMS 64-bit version 7.0 operating system. The cover graphic captures the inspired character of the Spiralog design effort and illustrates a concept taken from University of California research in which the whole disk is treated as a single, sequential log and all file system modifications are appended to the tail of the log.

The cover was designed by Lucinda O'Neill of Digital's Design Group using images from Photo Disc, Inc., copyright 1996.

The Digital Technical Journal is a refereed journal published quarterly by Digital Equipment Corporation, 30 Porter Road LJO2/D10, Littleton, MA 01460.

Subscriptions can be ordered by sending a check in U.S. funds (made payable to Digital Equipment Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. Digital's customers may qualify for gift subscriptions and are encouraged to contact their account representatives. Single copies and back issues are available for $16.00 (non-U.S. $18) each and can be ordered by sending the requested issue's volume and number and a check to the published-by address. See the Further Readings section in the back of this issue for a complete listing. Recent issues are also available on the Internet at http://www.digital.com/info/dtj.

Digital employees may order subscriptions through Readers Choice at URL http://webrc.das.dec.com or by entering VTX PROFILE at the system prompt.

Inquiries, address changes, and complimentary subscription orders can be sent to the Digital Technical Journal at the published-by address or the electronic mail address, dtj@digital.com. Inquiries can also be made by calling the Journal office at 508-486-2538.

Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or electronic mail address.

Copyright © 1996 Digital Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of Digital Equipment Corporation's authorship is permitted.

The information in the Journal is subject to change without notice and should not be construed as a commitment by Digital Equipment Corporation or by the companies herein represented. Digital Equipment Corporation assumes no responsibility for any errors that may appear in the Journal.

ISSN 0898-901X

Documentation Number EC-N6992-18

Book production was done by Quantic Communications, Inc.

The following are trademarks of Digital Equipment Corporation: AlphaServer, DEC, DECtalk, Digital, the DIGITAL logo, HSC, OpenVMS, PATHWORKS, POLYCENTER, RZ, TruCluster, VAX, and VAXcluster.

BBN Hark is a trademark of Bolt Beranek and Newman Inc.

Encore is a registered trademark and MEMORY CHANNEL is a trademark of Encore Computer Corporation.

FAServer is a trademark of Network Appliance Corporation.

Listen for Windows is a trademark of Verbex Voice Systems, Inc.

Microsoft and Win32 are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation.

MIPSpro is a trademark of MIPS Technologies, Inc., a wholly owned subsidiary of Silicon Graphics, Inc.

Netscape Navigator is a trademark of Netscape Communications Corporation.

PAL is a registered trademark of Advanced Micro Devices, Inc.

UNIX is a registered trademark in the United States and in other countries, licensed exclusively through X/Open Company Ltd.

VoiceAssist is a trademark of Creative Labs, Inc.

X Window System is a trademark of the Massachusetts Institute of Technology.


Contents

Foreword
Rich Marcello

SPIRALOG LOG-STRUCTURED FILE SYSTEM

Overview of the Spiralog File System
James E. Johnson and William A. Laing

Design of the Server for the Spiralog File System
Christopher Whitaker, J. Stuart Bayley, and Rod D. W. Widdowson

Designing a Fast, On-line Backup System for a Log-structured File System

Integrating the Spiralog File System into the OpenVMS Operating System

OPENVMS FOR 64-BIT ADDRESSABLE VIRTUAL MEMORY

Extending OpenVMS for 64-bit Addressable Virtual Memory

The OpenVMS Mixed Pointer Size Environment



Editor's Introduction

This past spring when we surveyed Journal subscribers, readers took the time to comment on the particular value of the issues featuring Digital's 64-bit Alpha technology. The engineering described in those two issues continues, with ever higher levels of performance in Alpha microprocessors, servers, clusters, and systems software. This issue presents recent developments: a log-structured file system, called Spiralog; the OpenVMS operating system extended to take full advantage of 64-bit addressable virtual memory; high-performance message passing for Alpha clusters; and speech recognition software for Alpha workstations.

Spiralog is a wholly new clusterwide file system integrated with the new 64-bit OpenVMS version 7.0 operating system and is designed for high data availability and high performance. The first of four papers about Spiralog is written by Jim Johnson and Bill Laing, who introduce log-structured file system (LFS) concepts, the university research behind the design, and design innovations.

Intended for scientific users, the parallel-programming tool described next does not run on the OpenVMS Alpha system but instead on UNIX clusters connected with MEMORY CHANNEL technology. Jim Lawton, John Brosnan, Morgan Doyle, Seosamh Ó Riordáin, and Tim Reddin review the challenges in designing the TruCluster MEMORY CHANNEL Software product, which is a message-passing system intended for builders of parallel software libraries and implementers of parallel compilers. The product reduces communications latency to less than 10 µs in shared memory systems.

Finally, Bernie Rozmovits presents the design of user interfaces for the Digital Speech Recognition Software (DSRS) product. Although DSRS is targeted for Digital's Alpha workstations running UNIX, the implementers' work to ensure the product's ease of use can be generalized.


Foreword

Rich Marcello
Vice President, OpenVMS Systems Software Group

The papers you will read in this issue of the Journal describe how we in the OpenVMS engineering community set out to bring the OpenVMS operating system and our loyal customer base into the twenty-first century. The papers present both the development issues and the technical challenges faced by the engineers who delivered the OpenVMS operating system version 7.0 and the Spiralog file system, a new log-structured file system for OpenVMS.

We are extremely proud of the results of these efforts. In December 1995 at U.S. Fall DECUS (Digital Equipment Computer Users Society), Digital announced OpenVMS version 7.0 and the Spiralog file system as part of a first wave of product deliveries for the OpenVMS Windows NT Affinity Program. OpenVMS version 7.0 provides the "unlimited high end" on which our customers can build their distributed computing environments and move toward the next millennium.

The release of OpenVMS version 7.0 in January of this year represents the most significant engineering enhancement to the OpenVMS operating system since Digital released the VAXcluster system in 1983. OpenVMS version 7.0 extends the 32-bit architecture of OpenVMS to a 64-bit architecture.



In fact, Digital has two 64-bit operating systems with this power: the OpenVMS and the Digital UNIX operating systems.

As noted above, Digital introduced the OpenVMS operating system with support for full 64-bit virtual addressing at the same time it introduced the Spiralog file system, in December 1995. The Spiralog design is based on the Sprite log-structured file system from the University of California, Berkeley. With its use of this log-structured approach, Spiralog offers major new performance features, including fast, application-consistent, on-line backup. Further, it is fully compatible with customers' existing Files-11 file systems, and applications that run on Files-11 will run on Spiralog with no modification. To deliver all of the features we felt were essential to meet the needs of our loyal customer base, the Spiralog team examined and resolved a number of technical issues. The papers in this issue describe some of the challenges they faced, including the decision to design a Files-11 file system emulation.

The delivery of the OpenVMS version 7.0 operating system and the Spiralog file system is part of Digital's continued commitment to the OpenVMS customer base. These products represent the work of dedicated, talented engineering teams that have deployed state-of-the-art technology in products that will help our customers remain competitive for years to come.

In the OpenVMS group as elsewhere in Digital, we are committed to excellence in the development and delivery of business computing solutions. We will continue to maintain and enhance a product portfolio that meets our customers' need for true 24-hour by 365-day access to their data, full integration with Microsoft Windows NT environments, and the full complement of network solutions and application software for today and well into the next millennium.


Overview of the Spiralog File System

The OpenVMS Alpha environment requires a file system that supports its full 64-bit capabilities. The Spiralog file system was developed to increase the capabilities of Digital's Files-11 file system for OpenVMS. It incorporates ideas from a log-structured file system and an ordered write-back model. The Spiralog file system provides improvements in data availability, scaling of the amount of storage easily managed, support for very large volume sizes, support for applications that are either write-operation or file-system-operation intensive, and support for heterogeneous file system client types. The Spiralog technology, which matches or exceeds the reliability and device independence of the Files-11 system, was then integrated into the OpenVMS operating system.

James E. Johnson
William A. Laing

Digital's Spiralog product is a log-structured, clusterwide file system with integrated, on-line backup and restore capability and support for multiple file system personalities. It incorporates a number of recent ideas from the research community, including the log-structured file system (LFS) from the Sprite file system and the ordered write back from the Echo file system.1,2

The Spiralog file system is fully integrated into the OpenVMS operating system, providing compatibility with the current OpenVMS file system, Files-11. It supports a coherent, clusterwide write-behind cache and provides high-performance, on-line backup and per-file and per-volume restore functions.

In this paper, we first discuss the evolution of file systems and the requirements for many of the basic designs in the Spiralog file system. Next we describe the overall architecture of the Spiralog file system, identifying its major components and outlining their designs. Then we discuss the project's results: what worked well and what did not work so well. Finally, we present some conclusions and ideas for future work.

Some of the major components, i.e., the backup and restore facility, the LFS server, and OpenVMS integration, are described in greater detail in companion papers in this issue.3-5

The Evolution of File Systems

File systems have existed throughout much of the history of computing. The need for libraries or services that help to manage the collection of data on long-term storage devices was recognized many years ago. The early support libraries have evolved into the file systems of today. During their evolution, they have responded to the industry's improved hardware capabilities and to users' increased expectations. Hardware has continued to decrease in price and improve in its price/performance ratio. Consequently, ever larger amounts of data are stored and manipulated by users in ever more sophisticated ways. As more and more data are stored on-line, the need to access that data 24 hours a day, 365 days a year has also escalated.


Significant improvements to file systems have been made in the following areas:

• Directory structures to ease locating data
• Device independence of data access through the file system
• Accessibility of the data to users on other systems
• Availability of the data, despite either planned or unplanned service outages
• Reliability of the stored data and the performance of the data access

Requirements of the OpenVMS File System

Since 1977, the OpenVMS operating system has offered a stable, robust file system known as Files-11. This file system is considered to be very successful in the areas of reliability and device independence. Recent customer feedback, however, indicated that the areas of data availability, scaling of the amount of storage easily managed, support for very large volume sizes, and support for heterogeneous file system client types were in need of improvement.

The Spiralog project was initiated in response to customers' needs. We designed the Spiralog file system to match or somewhat exceed the Files-11 system in its reliability and device independence. The focus of the Spiralog project was on those areas that were due for improvement, notably:

• Data availability, especially during planned operations, such as backup.

If the storage device needs to be taken offline to perform a backup, even at a very high backup rate of 20 megabytes per second (MB/s), almost 14 hours are needed to back up 1 terabyte. This length of service outage is clearly unacceptable. More typical backup rates of 1 to 2 MB/s can take several days, which, of course, is not acceptable.

• Greatly increased scaling in total amount of on-line storage, without greatly increasing the cost to manage that storage.

For example, 1 terabyte of disk storage currently costs approximately $250,000, which is well within the budget of many large computing centers. However, the cost in staff and time to manage such amounts of storage can be many times that of the storage.6 The cost of storage continues to fall, while the cost of managing it continues to rise.

• Effective scaling as more processing and storage resources become available.

For example, OpenVMS Cluster systems allow processing power and storage capacity to be added incrementally. It is crucial that the software supporting the file system scale as the processing power, bandwidth to storage, and storage capacity increase.

• Improved performance for applications that are either write-operation or file-system-operation intensive.

As file system caches in main memory have increased in capacity, data reads and file system read operations have become satisfied more and more from the cache. At the same time, many applications write large amounts of data or create and manipulate large numbers of files. The use of redundant arrays of inexpensive disks (RAID) storage has increased the available bandwidth for data writes and file system writes. Most file system operations, on the other hand, are small writes and are spread across the disk at random, often negating the benefits of RAID storage.

• Improved ability to transparently access the stored data across several dissimilar client types.

Computing environments have become increasingly heterogeneous. Different client systems, such as the Windows or the UNIX operating system, store their files on and share their files with server systems such as the OpenVMS server. It has become necessary to support the syntax and semantics of several different file system personalities on a common file server.

These needs were central to many design decisions we made for the Spiralog file system.

The members of the Spiralog project evaluated much of the ongoing work in file systems, databases, and storage architectures. RAID storage makes high bandwidth available to disk storage, but it requires large writes to be effective. Databases have exploited logs and the grouping of writes together to minimize the number of disk I/Os and disk seeks required. Databases and transaction systems have also exploited the technique of copying the tail of the log to effect backups or data replication. The Sprite project at Berkeley had brought together a log-structured file system and RAID storage to good effect.1

By drawing from the above ideas, particularly the insight of how a log structure could support on-line, high-performance backup, we began our development effort. We designed and built a distributed file system that made extensive use of the processor and memory near the application and used log-structured storage in the server.

Spiralog File System Design

The main execution stack of the Spiralog file system consists of three distinct layers. Figure 1 shows the overall structure. At the top, nearest the user, is the file system client layer.


Figure 1
Spiralog Structure Overview (the figure shows the F64 and FSLIB personalities layered over the VPI services to form the file system client, together with the backup user interface)

It consists of a number of file system personalities and the underlying personality-independent services, which we call the VPI.

Two file system personalities dominate the Spiralog design. The F64 personality is an emulation of the Files-11 file system. The file system library (FSLIB) personality is an implementation of Microsoft's New Technology Advanced Server (NTAS) file services for use by the PATHWORKS for OpenVMS file server.

The next layer, present on all systems, is the clerk layer. It supports a distributed cache and ordered write back to the LFS server, giving single-system semantics in a cluster configuration.

The LFS server, the third layer, is present on all designated server systems. This component is responsible for maintaining the on-disk log structure; it includes the cleaner, and it is accessed by multiple clerks. Disks can be connected to more than one LFS server, but they are served only by one LFS server at a time. Transparent failover, from the point of view of the file system client layer, is achieved by cooperation between the clerks and the surviving LFS servers.

The backup engine is present on a system with an active LFS server. It uses the LFS server to access the on-disk data, and it interfaces to the clerk to ensure that the backup or restore operations are consistent with the clerk's cache.

Figure 2 shows a typical Spiralog cluster configuration. In this cluster, the clerks on nodes A and B are accessing the Spiralog volumes. Normally, they use the LFS server on node C to access their data. If node C should fail, the LFS server on node D would immediately provide access to the volumes. The clerks on nodes A and B would use the LFS server on node D, retrying all their outstanding operations. Neither user application would detect any failure. Once node C had recovered, it would become the standby LFS server.

Figure 2
Spiralog Cluster Configuration (user applications and Spiralog clerks run on nodes A and B; they communicate over the Ethernet with the LFS servers on nodes C and D, which connect to the Spiralog volumes)

File System Client Design

The file system client is responsible for the traditional file system functions. This layer provides files, directories, access arbitration, and file naming rules. It also provides the services that the user calls to access the file system.

VPI Services Layer  The VPI layer provides an underlying primitive file system interface, based on the UNIX VFS switch. The VPI layer has two overall goals:

1. To support multiple file system personalities
2. To effectively scale to very large volumes of data and very large numbers of files

To meet the first goal, the VPI layer provides

• File names of 256 Unicode characters, with no reserved characters
• No restriction on directory depth
• Up to 255 sparse data streams per file, each with 64-bit addressing
• Attributes with 255 Unicode character names, containing values of up to 1,024 bytes
• Files and directories that are freely shared among file system personality modules

To meet the second goal, the VPI layer provides

• File identifiers stored as 64-bit integers
• Directories through a B-tree, rather than a simple linear structure, for log(n) file name lookup time

The VPI layer is only a base for file system personalities. Therefore it requires that such personalities are trusted components of the operating system. Moreover, it requires them to implement file access security (although there is a convention for storing access control list information) and to perform all necessary cleanup when a process or image terminates.
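The paper does not list the VPI calls themselves; the sketch below is a hypothetical C rendering of the kind of primitives just described (64-bit object identifiers, named attribute cells, sparse 64-bit-addressable streams, and B-tree directories). Every name and signature here is an illustrative assumption, not the actual OpenVMS interface.

    /* Hypothetical sketch of VPI-style primitives; names and types are
     * illustrative assumptions and do not reproduce the real interface. */
    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t vpi_object_id;   /* file identifiers are 64-bit integers */
    typedef int      vpi_status;      /* 0 = success, nonzero = failure (assumed) */

    /* Create an object, the basis of a file or directory. */
    vpi_status vpi_create_object(vpi_object_id *oid);

    /* Read or write a named cell: variable-length data handled as a single unit. */
    vpi_status vpi_read_cell(vpi_object_id oid, const char *cell_name,
                             void *buf, size_t buf_len, size_t *data_len);
    vpi_status vpi_write_cell(vpi_object_id oid, const char *cell_name,
                              const void *data, size_t data_len);

    /* Read or write part of a sparse stream, addressed by a 64-bit byte offset. */
    vpi_status vpi_read_stream(vpi_object_id oid, uint32_t stream_id,
                               uint64_t offset, void *buf, size_t len);
    vpi_status vpi_write_stream(vpi_object_id oid, uint32_t stream_id,
                                uint64_t offset, const void *data, size_t len);

    /* Read a batch of entries from a directory object (stored as a B-tree);
     * *entries_returned is the number of entries placed in entry_buf. */
    vpi_status vpi_read_dir_entries(vpi_object_id dir_oid, void *entry_buf,
                                    size_t buf_len, size_t *entries_returned,
                                    uint64_t *resume_context);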


F64 File System Personality  As previously stated, the Spiralog product includes two file system personalities, F64 and FSLIB. The F64 personality provides a service that emulates the Files-11 file system. Its functions, services, available file attributes, and execution behaviors are similar to those in the Files-11 file system. Minor differences are isolated into areas that receive little use from most applications.

For instance, the Spiralog file system supports the various Files-11 queued I/O ($QIO) parameters for returning file attribute information, because they are used implicitly or explicitly by most user applications. On the other hand, the Files-11 method of reading the file header directly is little used. A file open, for example, must locate the file header from the file name, by

- Reading some directory entries (VPI)
- Searching the directory buffer for the file name
- Exiting the loop, if the match is found

It then continues to

• Access the target file (VPI)
• Read the file's attributes (VPI)
• Audit the file open attempt
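Read together with the hypothetical primitives sketched earlier, an F64-style open by name might look roughly like the loop below; again, every identifier (including the "attributes" cell name and the dir_entry layout) is an assumption used only to make the listed sequence concrete.

    /* Illustrative only: an open-by-name built from the hypothetical VPI
     * primitives sketched above, following the steps listed in the text. */
    #include <string.h>

    typedef struct { char name[256]; vpi_object_id oid; } dir_entry;

    vpi_status f64_open_by_name(vpi_object_id dir_oid, const char *name,
                                vpi_object_id *file_oid,
                                void *attrs, size_t attrs_len)
    {
        dir_entry entries[64];
        size_t    count = 0, data_len = 0;
        uint64_t  resume = 0;
        int       found = 0;

        while (!found) {                       /* loop until the name is found */
            /* Read some directory entries (VPI). */
            if (vpi_read_dir_entries(dir_oid, entries, sizeof(entries),
                                     &count, &resume) != 0 || count == 0)
                return -1;                     /* name not present */
            /* Search the directory buffer for the file name. */
            for (size_t i = 0; i < count; i++) {
                if (strcmp(entries[i].name, name) == 0) {
                    *file_oid = entries[i].oid;
                    found = 1;                 /* exit the loop on a match */
                    break;
                }
            }
        }
        /* Access the target file and read its attributes (VPI). */
        if (vpi_read_cell(*file_oid, "attributes", attrs, attrs_len, &data_len) != 0)
            return -1;
        /* Audit the file open attempt (not shown). */
        return 0;
    }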

FSLIB File System Personality  The FSLIB file system personality is a specialized file system to support the PATHWORKS for OpenVMS file server. Its two major goals are to support the file names, attributes, and behaviors found in Microsoft's NTAS file access protocols, and to provide low run-time cost for processing NTAS file system requests.

The PATHWORKS server implements a file service for personal computer (PC) clients layered on top of the Files-11 file system services. When NTAS service behaviors or attributes do not match those of Files-11, the PATHWORKS server has to emulate them. This can lead to checking security access permissions twice, mapping file names, and emulating file attributes. Many of these problems can be avoided if the VPI interface is used directly. For instance, because the FSLIB personality does not layer on top of a Files-11 personality, security access checks do not need to be performed twice. Furthermore, in a straightforward design, there is no need to map across different file system personalities.

Clerk Design

The clerks are responsible for managing the caches, determining the order of writes out of the cache to the LFS server, and maintaining cache coherency within a cluster. The caches are write behind in a manner that preserves the order of dependent operations.

The clerk-server protocol controls the transfer of data to and from stable storage. Data can be sent as a multiblock atomic write, and operations that change multiple data items, such as a file rename, can be made atomically. If a server fails during a request, the clerk treats the request as if it were lost and retries the request.

The clerk-server protocol is idempotent. Idempotent operations can be applied repeatedly with no effects other than the desired one. Thus, after any number of server failures or server failovers, it is always safe to reissue an operation. Clerk-to-server write operations always leave the file system state consistent.
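As a sketch of what the idempotency guarantee buys the clerk, the retry loop below simply reissues a write until some server, possibly a failover server, acknowledges it. The helper names are invented for illustration; they are not part of the Spiralog interfaces.

    /* Illustrative clerk-side retry: because clerk-to-server operations are
     * idempotent, a request that may or may not have been applied before a
     * server failure can simply be reissued against the surviving server. */
    #include <stdint.h>

    typedef struct server_handle server_handle;   /* opaque, assumed */

    /* Assumed helpers: send a write and wait for its acknowledgment, or
     * return the server currently serving the volume (possibly a failover). */
    extern int            send_write(server_handle *srv, uint64_t addr,
                                     const void *data, uint32_t len);
    extern server_handle *current_server(void);

    int clerk_write(uint64_t addr, const void *data, uint32_t len)
    {
        for (;;) {
            server_handle *srv = current_server();   /* may be the standby server */
            if (send_write(srv, addr, data, len) == 0)
                return 0;                             /* acknowledged once: done */
            /* Treat the request as lost and retry; repeating an idempotent
             * write has no effect beyond the desired one, so this is safe
             * even if the failed server had already applied it. */
        }
    }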

The clerk-clerk protocol protects the user data and file system metadata cached by the clerks. Cache coherency information, rather than the cached data itself, is exchanged between the clerks. The write-behind cache allows the clerks

• To avoid writing data that is quickly deleted, such as temporary files
• To combine multiple small writes into larger transfers

The clerks order dependent writes from the cache to the server; consequently, other clerks never see "impossible" states, and related writes never overtake each other. For instance, the deletion of a file cannot happen before a rename that was previously issued to the same file. Related data writes are caused by a partial overwrite, or an explicit linking of operations passed into the clerk by the VPI layer, or an implicit linking due to the clerk-clerk coherency protocol.

The ordering between writes is kept as a directed graph. As the clerks traverse these graphs, they issue the writes in order or collapse the graph when writes can be safely combined or eliminated.
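A small worked example may help here. In the hypothetical sketch below, each cached write records the writes it depends on, and a flush walks the graph so that every write is issued only after everything it depends on; a rename therefore always reaches the server before a later delete of the same file. The structures are assumptions, and the collapsing of combinable writes mentioned in the text is omitted.

    /* Illustrative dependency-ordered flush: cached writes form a directed
     * graph, and a depth-first walk issues each write only after its
     * dependencies have been issued. Structure and names are assumed. */
    #include <stddef.h>

    #define MAX_DEPS 8

    typedef struct cached_write {
        struct cached_write *deps[MAX_DEPS];  /* writes that must reach the server first */
        int    ndeps;
        int    issued;                         /* already sent to the LFS server */
        void (*issue)(struct cached_write *w); /* hand this write to the server  */
    } cached_write;

    static void flush_write(cached_write *w)
    {
        if (w == NULL || w->issued)
            return;
        for (int i = 0; i < w->ndeps; i++)
            flush_write(w->deps[i]);           /* dependencies go out first */
        w->issue(w);
        w->issued = 1;
        /* A real clerk would also collapse the graph where writes can be
         * safely combined or eliminated, which this sketch does not show. */
    }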

LFS Server Design

The Spiralog file system uses a log-structured, on-disk format for storing data within a volume, yet presents a traditional, update-in-place file system to its users.

Figure 3
Spiralog Address Mapping (user I/Os on file virtual blocks in the file system address space are mapped by the VPI and clerk into the file address space, by the LFS B-tree into the log address space, and by the LFS log driver layer onto the physical address space of the disk)

Recently, log-structured file systems, such as Sprite, have been an area of active research.1

Within the LFS server, support is provided for the log-structured, on-disk format and for mapping that format to an update-in-place model. Specifically, this component is responsible for

• Mapping the incoming read and write operations from their simple address space to positions in an open-ended log
• Mapping the open-ended log onto a finite amount of disk space
• Reclaiming disk space by cleaning (garbage collecting) the obsolete (overwritten) sections of the log

Figure 3 shows the various mapping layers in the Spiralog file system, including those handled by the LFS server.

Incoming read and write operations are based on a single, large address space. Initially, the LFS server transforms the address ranges in the incoming operations into equivalent address ranges in an open-ended log. This log supports a very large, write-once address space.




A read operation looks up its location in the open-ended log and proceeds. On the other hand, a write operation makes obsolete its current address range and appends its new value to the tail of the log.

In turn, locations in the open-ended log are then mapped into locations on the (finite-sized) disk. This additional mapping allows disk blocks to be reused once their original contents have become obsolete.

Physically, the log is divided into log segments, each of which is 256 kilobytes (KB) in length. The log segment is used as the transfer unit for the backup engine. It is also used by the cleaner for reclaiming obsolete log space.
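A minimal sketch of the write path just described: a new value is appended at the log tail, and the address range it supersedes is marked obsolete so that the 256 KB segment holding the old value eventually becomes a cleaning candidate. The structures are invented for illustration and are not the Spiralog on-disk format; bounds checks and segment wrap are omitted.

    /* Illustrative append-to-tail write for a log-structured store. */
    #include <stdint.h>
    #include <string.h>

    #define SEGMENT_SIZE (256 * 1024)      /* log segments are 256 KB */

    typedef struct {
        uint8_t  data[1 << 20];            /* toy in-memory "log"                    */
        uint64_t tail;                     /* next free log address                  */
        uint64_t live_bytes[16];           /* live data per segment, for the cleaner */
    } toy_log;

    /* Append a new value and make the location it supersedes obsolete. */
    uint64_t log_write(toy_log *log, uint64_t old_addr, uint64_t old_len,
                       const void *value, uint64_t len)
    {
        uint64_t new_addr = log->tail;
        memcpy(&log->data[new_addr], value, len);        /* write at the tail */
        log->tail += len;
        log->live_bytes[new_addr / SEGMENT_SIZE] += len;
        if (old_len != 0)                                 /* old range becomes obsolete;  */
            log->live_bytes[old_addr / SEGMENT_SIZE] -= old_len; /* its segment may be cleaned */
        return new_addr;                                  /* recorded by the mapping layer */
    }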

Backup Design

The Spiralog backup design responds to customers' requirements for high-performance, on-line backup save operations. We also met their needs for per-file and per-volume restores and an ongoing need for simplicity and reduction in operating costs.


To provide per-file restore capabilities, the backup utility and the LFS server ensure that the appropriate file header information is stored during the save operation. The saved file system data, including file headers, log mapping information, and user data, are stored in a file known as a saveset. Each saveset, regardless of the number of tapes it requires, represents a single save operation.

To reduce the complexity of file restore operations, the Spiralog file system provides an online saveset merge feature. This allows the system manager to merge several savesets, either full or incremental, to form a new, single saveset. With this feature, system managers can have a workable backup save plan that never calls for an on-line full backup, thus further reducing the load on their production systems. Also, this feature can be used to ensure that file restore operations can be accomplished with a small, bounded set of savesets.
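One way to picture why merging works is that, for each item of saved data, only its most recent copy across the input savesets needs to survive. The sketch below uses invented structures to show that idea; the real saveset format also carries file headers and log mapping information and is not described in this paper.

    /* Illustrative saveset merge: for each item identifier, keep the copy
     * written latest. Structures and the merge policy are assumptions. */
    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t item_id;      /* identifies a piece of saved file system data */
        uint64_t written_at;   /* log position (and hence time) when it was saved */
    } saved_item;

    /* Merge 'n' items concatenated from several savesets (full or incremental)
     * in place, keeping the newest copy of each item_id. Returns the new count. */
    size_t merge_savesets(saved_item *items, size_t n)
    {
        size_t out = 0;
        for (size_t i = 0; i < n; i++) {
            size_t j;
            for (j = 0; j < out; j++) {
                if (items[j].item_id == items[i].item_id) {
                    if (items[i].written_at > items[j].written_at)
                        items[j] = items[i];      /* newer copy wins */
                    break;
                }
            }
            if (j == out)
                items[out++] = items[i];          /* first occurrence of this item */
        }
        return out;
    }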

The Spiralog backup facility is described in detail in this issue.3

Project Results

The Spiralog file system contains a number of innovations in the areas of on-line backup, log-structured storage, clusterwide ordered write-behind caching, and multiple-file-system client support.

The use of log structuring as an on-disk format is very effective in supporting high-performance, on-line backup. The Spiralog file system retains the previously documented benefits of LFS, such as fast write performance that scales with the disk size and throughput that increases as large read caches are used to offset disk reads.1

It should also be noted that the Files-11 file system sets a high standard for data reliability and robustness. The Spiralog technology met this challenge very well: as a result of the idempotent protocol, the cluster failover design, and the recovery capability of the log, we encountered few data reliability problems during development.

In any large, complex project, many technical decisions are necessary to convert research technology into a product. In this section, we discuss why certain decisions were made during the development of the Spiralog subsystems.

VPI Results

The VPI file system was generally successful in providing the underlying support necessary for different file system personalities. We found that it was possible to construct a set of primitive operations that could be used to build complex, user-level, file system operations.


By using these primitives, the Spiralog project members were able to successfully design two distinctly different personality modules. Neither was a functional superset of the other, and neither was layered on top of the other. However, there was an important second-order problem.

The FSLIB file system personality did not have a full mapping to the Files-11 file system. As a consequence, file management was rather difficult, because all the data management tools on the OpenVMS operating system assumed compliance with a Files-11, rather than a VPI, file system.

This problem led to the decision not to proceed with the original design for the FSLIB personality in version 1.0 of Spiralog. Instead, we developed an FSLIB file system personality that was fully compatible with the F64 personality, even when that compatibility forced us to accept an additional execution cost.

We also found an execution cost to the primitive VPI operations. Generally, there was little overhead for



First, the log contains obsolete areas that are eligible for cleaning but have not yet been processed. These obsolete areas are also saved by the backup engine. Although they have no effect on the logical state of the log, they do require the backup engine to move more data to backup storage than might otherwise be necessary.

Second, the design initially prohibited the cleaner from running against a log with snapshots. Consequently, the cleaner was disabled during a save operation, which had the following effects: (1) The amount of available free space in the log was artificially depressed during a backup. (2) Once the backup was finished, the activated cleaner would discover that a great number of log segments were now eligible for cleaning. As a result, the cleaner underwent a sudden surge in cleaning activity soon after the backup had completed.

We addressed this problem by reducing the area of the log that was off limits to the cleaner to only the part that the backup engine would read. This limited snapshot window allowed more segments to remain eligible for cleaning, thus greatly alleviating the shortage of cleanable space during the backup and eliminating the postbackup cleaning surge. For an 8-gigabyte time-sharing volume, this change typically reduced the period of high cleaner activity from 40 seconds to less than one-half of a second.
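A sketch of the eligibility test implied by this fix follows: only segments overlapping the window that the backup engine will read are protected, rather than the whole log. The representation of the window is an assumption made for illustration.

    /* Illustrative check used by a cleaner while an on-line backup runs:
     * a segment may be cleaned unless it lies in the limited snapshot
     * window, i.e., the part of the log the backup engine will read. */
    #include <stdint.h>
    #include <stdbool.h>

    typedef struct {
        uint64_t window_start;   /* first log address the backup engine will read */
        uint64_t window_end;     /* last log address the backup engine will read  */
    } snapshot_window;

    bool segment_cleanable(uint64_t seg_start, uint64_t seg_end,
                           const snapshot_window *w)
    {
        /* Off limits only if the segment overlaps the limited snapshot window. */
        bool overlaps = (seg_end > w->window_start) && (seg_start < w->window_end);
        return !overlaps;
    }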

We have not yet experimented with different cleaner algorithms. More work needs to be done in this area to see if the cleaning efficiency, cost, and interactions with backup can be improved.

The current mapping transformation from the incoming operation address space to locations in the open-ended log is more expensive in CPU time than we would like. More work is needed to optimize the code path.

Finally, the LFS server is generally successful at providing the appearance of a traditional, update-in-place file system. However, as the unused space in a volume nears zero, the ability to behave with semantics that meet users' expectations in a log-structured file system proved more difficult than we had anticipated and required significant effort to correct.

The LFS server is described in much more detail in this issue.4

Backup Performance Results

We took a new approach to the backup design in the Spiralog system, resulting in a very fast and very low impact backup that can be used to create consistent copies of the file system while applications are actively modifying data.

Table 2
Performance Comparison of the Backup Save Operation

File System     Elapsed Time
Spiralog
Files-11


Table 3
Performance Comparison of the Individual File Restore Operation

File System     Elapsed Time (Minutes:Seconds)
Spiralog        01:06
Files-11        03:35

The longer code path length is mitigated by continuing improvements in processor performance.

We learned a great deal during the Spiralog project and made the following observations:

• Volume full semantics and fine-tuning the cleaner were more complex than we anticipated and will require future refinement.
• A heavily layered architecture extends the CPU path length and the fan-out of procedure calls. We focused too much attention on reducing I/Os and not enough attention on reducing the resource usage of some critical code paths.
• Although elegant, the memory abstraction for the interface to the cache was not as good a fit to file system operations as we had expected. Furthermore, a block abstraction for the data space would have been more suitable.

In summary, the project team delivered a new file system for the OpenVMS operating system. The Spiralog file system offers single-system semantics in a cluster, is compatible with the current OpenVMS file system, and supports on-line backup.

Future Work

During the Spiralog version 1.0 project, we pursued a number of new technologies and found four areas that warrant future work:

• Support is needed from storage and file-management tools for multiple, dissimilar file system personalities.
• The cleaner represents another area of ongoing innovation and complex dynamics. We believe a better understanding of these dynamics is needed, and design alternatives should be studied.
• The on-line backup engine, coupled with the log-structured file system technology, offers many areas for potential development. For instance, one area for investigation is continuous backup operation, either to a local backup device or to a remote replica.
• Finally, we do not believe the higher-than-expected code path length is intrinsic to the basic file system design. We expect to be working on this resource usage in the near future.

Acknowledgments

We would like to take this opportunity to thank the many individuals who contributed to the Spiralog project. Don Harbert and Rich Marcello, OpenVMS vice presidents, supported this work over the lifetime of the project. Dan Doherty and Jack Fallon, the OpenVMS managers in Livingston, Scotland, had day-to-day management responsibility. Cathy Foley kept the project moving toward the goal of shipping. Janis Horn and Clare Wells, the product managers who helped us understand our customers' needs, were eloquent in explaining our project and goal to others. Near the end of the project, Yehia Beyh and Paul Mosteika gave us valuable testing support, without which the product would certainly be less stable than it is today. Finally, and not least, we would like to acknowledge the members of the development team: Alasdair Baird, Stuart Bayley, Rob Burke, Ian Compton, Chris Davies, Stuart Deans, Alan Dewar, Campbell Fraser, Russ Green, Peter Hancock, Steve Hirst, Jim Hogg, Mark Howell, Mike Johnson, Robert Landau, Douglas McLaggan, Rudi Martin, Conor Morrison, Julian Palmer, Judy Parsons, Ian Pattison, Alan Paxton, Nancy Phan, Kevin Porter, Alan Potter, Russell Robles, Chris Whitaker, and Rod Widdowson.

References

1. M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 10, no. 1 (February 1992): 26-52.

2. T. Mann, A. Birrell, A. Hisgen, C. Jerian, and G. Swart, "A Coherent Distributed File Cache with Directory Write-behind," Digital Systems Research Center, Research Report 103 (June 1993).

3. R. Green, A. Baird, and J. Davies, "Designing a Fast, On-line Backup System for a Log-structured File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 32-45.

4. C. Whitaker, J. Bayley, and R. Widdowson, "Design of the Server for the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 15-31.

5. M. Howell and J. Palmer, "Integrating the Spiralog File System into the OpenVMS Operating System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 46-56.

6. R. Wrenn, "Why the Real Cost of Storage Is More Than $1/MB," presented at U.S. DECUS Symposium, St. Louis, Mo., June 3-6, 1996.


Biographies

James E. Johnson
Jim Johnson, a consulting software engineer, has been working for Digital since 1984. He was a member of the OpenVMS Engineering Group, where he contributed in several areas, including RMS, transaction processing services, the port of OpenVMS to the Alpha architecture, file systems, and system management. Jim recently joined the Internet Software Business Unit and is working on the application of X.500 directory services. Jim holds two patents on transaction commit protocol optimizations and maintains a keen interest in this area.

William A. Laing
Bill Laing, a corporate consulting engineer, is the technical director of the Internet Software Business Unit. Bill joined Digital in 1981; he worked in the United States for five years before transferring to Europe. During his career at Digital, Bill has worked on VMS systems performance analysis, VAXcluster design and development, operating systems development, and transaction processing. He was the technical director of OpenVMS engineering, the technical director for engineering in Europe, and most recently was focusing on software in the Technology and Architecture Group of the Computer Systems Division. Prior to joining Digital, Bill held research and teaching posts in operating systems at the University of Edinburgh, where he worked on the EMAS operating system. He was also part of the start-up of European Silicon Structures (ES2), an ambitious pan-European company. He holds undergraduate and postgraduate degrees in computer science from the University of Edinburgh.



Design of the Server for the Spiralog File System

The Spiralog file system uses a log-structured, on-disk format inspired by the Sprite log-structured file system (LFS) from the University of California, Berkeley. Log-structured file systems promise a number of performance and functional benefits over conventional, update-in-place file systems, such as the Files-11 file system developed for the OpenVMS operating system or the FFS file system on the UNIX operating system. The Spiralog server combines log-structured technology with more traditional B-tree technology to provide a general server abstraction. The B-tree mapping mechanism uses write-ahead logging to give stability and recoverability guarantees. By combining write-ahead logging with a log-structured, on-disk format, the Spiralog server merges file system data and recovery log records into a single, sequential write stream.

Christopher Whitaker
J. Stuart Bayley
Rod D. W. Widdowson

The goal of the Spiralog file system project team was to produce a high-performance, highly available, and robust file system with a high-performance, on-line backup capability for the OpenVMS Alpha operating system. The server component of the Spiralog file system is responsible for reading data from and writing data to persistent storage. It must provide fast write performance, scalability, and rapid recovery from system failures. In addition, the server must allow an on-line backup utility to copy a consistent snapshot of the file system to another location, while allowing normal file system operations to continue in parallel.

In this paper, we describe the log-structured file system (LFS) technology and its particular implementation in the Spiralog file system. We also describe the novel way in which the Spiralog server maps the log to provide a rich address space in which files and directories are constructed. Finally, we review some of the opportunities and challenges presented by the design we chose.

Background

All file systems must trade off performance against availability in different ways to provide the throughput required during normal operations and to protect data from corruption during system failures. Traditionally, file systems fall into two categories, careful write and check on recovery.

• Careful writing policies are designed to provide a fail-safe mechanism for the file system structures in the event of a system failure; however, they suffer from the need to serialize several I/Os during file system operations.
• Some file systems forego the need to serialize file system updates. After a system failure, however, they require a complete disk scan to reconstruct a consistent file system. This requirement becomes a problem as disk sizes increase.

Modern file systems such as Cedar, Episode, Microsoft's New Technology File System (NTFS), and Digital's POLYCENTER Advanced File System use logging to overcome the problems inherent in these two approaches.1,2 Logging file system metadata removes the need to serialize I/Os and allows a simple


and bounded mechanism for reconstructing the file system after a failure. Researchers at the University of California, Berkeley, took this process one stage further and treated the whole disk as a single, sequential log where all file system modifications are appended to the tail of the log.

Log-structured file system technology is particularly appropriate to the Spiralog file system, because it is designed as a clusterwide file system. The server must support a large number of file system clerks, each of which may be reading and writing data to the disk. The clerks use large write-back caches to reduce the need to read data from the server. The caches also allow the clerks to buffer write requests destined for the server. A log-structured design allows multiple concurrent writes to be grouped together into large, sequential I/Os to the disk. This I/O pattern reduces disk head movement during writing and allows the size of the writes to be matched to characteristics of the underlying disk. This is particularly beneficial for storage devices with redundant arrays of inexpensive disks (RAID).4

The use of a log-structured, on-disk format greatly simplifies the implementation of an on-line backup capability. Here, the challenge is to provide a consistent snapshot of the file system that can be copied to the backup media while normal operations continue to modify the file system. Because an LFS appends all data to the tail of a log, all data writes within the log are temporally ordered. A complete snapshot of the file system corresponds to the contents of the sequential log up to the point at which the snapshot is taken. The Spiralog server differs from earlier log-structured implementations in several respects, most notably in the way it maps data into the log and in the design of its cleaner. In subsequent sections of this paper, we compare these differences with existing implementations where appropriate.

Spiralog File System Server Architecture

The Spiralog file system employs a client-server architecture. Each node in the cluster that mounts a Spiralog volume runs a file system clerk. The term clerk is used in this paper to distinguish the client component of the file system from clients of the file system as a whole. Clerks implement all the file functions associated with maintaining the file system state with the exception of persistent storage of file system and user data. This latter responsibility falls on the Spiralog server. There is exactly one server for each volume, which must run on a node that has a direct connection to the disk containing the volume. This distribution of function, where the majority of file system processing takes place on the clerk, is similar to that of the Echo file system. The reasons for choosing this architecture are described in more detail in the paper "Overview of the Spiralog File System," elsewhere in this issue.

Spiralog clerks build files and directories in a structured address space called the file address space. This address space is internal to the file system and is only loosely related to that perceived by clients of the file system. The server provides an interface that allows the clerks to persistently map data into the file address space. Internally, the server uses a logically infinite log structure, built on top of a physical disk, to store the file system data and the structures necessary to locate the data. Figure 1 shows the relationship between the clerks and the server and the relationships among the major components within the server.

Figure 1
Server Architecture


The mapping layer is responsible for maintaining the mapping between the file address space used by the clerks and the address space of the log. The server directly supports the file address space so that it can make use of information about the relative performance sensitivity of parts of the address space that is implicit within its structure. Although this results in the mapping layer being relatively complex, it reduces the complexity of the clerks and aids performance. The mapping layer is the primary point of contact with the server. Here, read and write requests from clerks are received and translated into operations on the log address space.

The log driver (LD) creates the illusion of an infinite log on top of the physical disk. The LD transforms read and write requests from the mapping layer that are cast in terms of a location in the log address space into read and write requests to physical addresses on the underlying disk. Hiding the implementation of the log from the mapping layer allows the organization of the log to be altered transparently to the mapping layer. For example, parts of the log can be migrated to other physical devices without involving the mapping layer.

Figure 2
Address Translation (file system address space, clerk file address space, B-tree, log address space, log driver layer, physical address space)

Although the log exported by the LD layer is conceptually<br />

infinite, disks have a finite size. The cleaner<br />

is responsible for garbage collecting or coalescing free<br />

space within the log.<br />

Figure 2 shows the relationship between the various<br />

address spaces making up the Spiralog file system. In<br />

the next three sections, we examine each of the components<br />

of the server.<br />

Mapping Layer<br />

The mapping layer implements the mapping between

the file address space used by the file system clerks<br />

and the log address space maintained by the LD.<br />

It exports an interface to the clerks that they use to<br />

read data from locations in the file address space,<br />

to write new data to the file address space, and to specify

which previously written data is no longer required.<br />

The interface also allows clerks to group sets of dependent<br />

writes into units that succeed or fail as if they<br />

were a single write. In this section, we introduce the<br />

file address space and describe the data structure used<br />

to map it. Then we explain the method used to handle<br />

clerk requests to modify the address space.




File Address Space

The file address space is a structured address space. At its highest level it is divided into objects, each of which has a numeric object identifier (OID). An object may have any number of named cells associated with it and up to 2^32 - 1 streams. A named cell may contain a variable amount of data, but it is read and written as a single unit. A stream is a sequence of bytes that are addressed by their offset from the start of the stream, up to a maximum of 2^64 - 1. Fundamentally, there are two forms of addresses defined by the file address space: named addresses, of the form <OID, cell name>, specify an individual named cell within an object, and numeric addresses, of the form <OID, stream, offset, length>, specify a sequence of length contiguous bytes in an individual stream belonging to an object.
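As a concrete illustration of these two address forms, the following C sketch shows one possible representation. The type names, field widths, and the fixed-size cell name are assumptions made for the example, not definitions taken from the Spiralog sources.

#include <stdint.h>

typedef uint64_t oid_t;                 /* numeric object identifier (OID) */

struct named_address {                  /* <OID, cell name>                */
    oid_t    oid;
    char     cell_name[64];             /* e.g., a directory link name     */
};

struct numeric_address {                /* <OID, stream, offset, length>   */
    oid_t    oid;
    uint32_t stream;                    /* stream within the object        */
    uint64_t offset;                    /* byte offset from stream start   */
    uint64_t length;                    /* contiguous bytes addressed      */
};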

The clerks use named cells and streams to build files and directories. In the Spiralog file system version 1.0, a file is represented by an object, a named cell containing its attributes, and a single stream that is used to store the file's data. A directory is represented by an object that contains a number of named cells. Each named cell represents a link in that directory and contains what a traditional file system refers to as a directory entry. Figure 3 shows how data files and directories are built from named cells and streams.

The mapping layer provides three principal operations for manipulating the file address space: read, write, and clear. The read operation allows a clerk to read the contents of a named cell, a contiguous range of bytes from a stream, or all the named cells for a particular object that fall into a specified search range. The write operation allows a clerk to write to a contiguous range of bytes in a stream or an individual named cell.

Figure 3
File System Structures (a data file and a directory built from objects, named cells, and byte streams)

The clear operation allows a clerk to remove a named cell or a number of bytes from an object.

Mapping the File Address Space

We looked at a variety of indexing structures for mapping the file address space onto the log address space. We chose a derivative of the B-tree for the following reasons. For a uniform address space, B-trees provide predictable worst-case access times because the tree is balanced across all the keys it maps. A B-tree scales well as the number of keys mapped increases. In other words, as more keys are added, the B-tree grows in width and in depth. Deep B-trees carry an obvious performance penalty, and a single B-tree must map all the files stored in the log. To solve this problem, we limited the keys for an object to a single B-tree leaf node. With this restriction, several small files can be accommodated in a single leaf node. Files with a large number of extents (or large directories) are supported by allowing individual streams to be spawned into subtrees. The subtrees are balanced across the keys within the subtree. An object can never span more than a single leaf node of the main B-tree; therefore, nonleaf nodes of the main B-tree only need to contain OIDs. This allows the main B-tree to be very compact. Figure 4 shows the relationship between the main B-tree and its subtrees.

Figure 4
Mapping B-tree Structure


To reduce the time required to open a file, data for small extents and small named cells are stored directly in the leaf node that maps them. For larger extents (greater than one disk block in size in the current implementation), the data item is written into the log and a pointer to it is stored in the node. This pointer is an address in the log address space. Figure 5 illustrates how the B-tree maps a small file and a file with several large extents.
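A leaf record of this kind can be sketched in C as a record that either embeds the data or points at its position in the log. This is an illustrative sketch only; the field names and the one-block inline limit are assumptions, not the actual on-disk layout.

#include <stdint.h>

#define INLINE_LIMIT 512                 /* one disk block in this sketch   */

struct map_record {
    uint64_t oid;                        /* object being mapped             */
    uint32_t stream;                     /* stream within the object        */
    uint64_t offset;                     /* first byte this record maps     */
    uint32_t length;                     /* bytes covered by the record     */
    int      data_in_node;               /* nonzero: payload stored inline  */
    union {
        uint64_t log_address;            /* extent's position in the log    */
        uint8_t  payload[INLINE_LIMIT];  /* small data copied into the node */
    } u;
};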

Processing Read Requests

The clerks submit read requests that may be for a sequence of bytes from a stream (reading data from a file), a single named cell (reading a file's attributes), or a set of named cells (reading directory contents). To fulfill a given read request, the server must consult the B-tree to translate from the address in the file address space supplied by the clerk to the position in the log address space where the data is stored. The extents making up a stream are created when the file data is written. If an application writes 8 kilobytes (KB) of data in 1-KB chunks, the B-tree would contain 8 extents, one for each 1-KB write. The server may need to collect data from several different parts of the log address space to fulfill a single read request.

Figure 5
Mapping B-tree Detail

Read requests share access to the B-tree in much the same way as processes share access to the CPU of a multiprocessing computer system. Read requests arriving from clerks are placed in a first in first out (FIFO) work queue and are started in order of their arrival. All operations on the B-tree are performed by a single worker thread in each volume. This avoids the need for heavyweight locking on individual nodes in the B-tree, which significantly reduces the complexity of the tree manipulation algorithms and removes the potential for deadlocks on tree nodes. This reduction in complexity comes at the cost of the design not scaling with the number of processors in a symmetric multiprocessing (SMP) system. So far we have no evidence to show that this design decision represents a major performance limitation on the server.

The worker thread takes a request from the head of the work queue and traverses the B-tree until it reaches a leaf node that maps the address range of the read request. Upon reaching a leaf node, it may discover that the node contains

• Records that map part or all of the address of the read request to locations in the log, and/or

• Records that map part or all of the address of the read request to data stored directly in the node, and/or

• No records mapping part or all of the address of the read request


Data that is stored in the node is simply copied to the output buffer. When data is stored in the log, the worker thread issues requests to the LD to read the data into the output buffer. Once all the reads have been issued, the original request is placed on a pending queue until they complete; then the results are returned to the clerk. When no data is stored for all or part of the read request, the server zero-fills the corresponding part of the output buffer.

The process described above is complicated by the fact that the B-tree is itself stored in the log. The mapping layer contains a node cache that ensures that commonly referenced nodes are normally found in memory. When the worker thread needs to traverse through a tree node that is not in memory, it must arrange for the node to be read into the cache. The address of the node in the log is the value of the pointer to it from its parent node. The worker thread uses this to issue a request to the LD to read the node into a cache buffer. While the node read request is in progress, the original clerk operation is placed on a pending queue and the worker thread proceeds to the next request on the work queue. When the node is resident in memory, the pending read request is placed back on the work queue to be restarted. In this way, multiple read requests can be in progress at any given time.
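The single-threaded worker with its pending queue can be sketched in C as follows. The structure and helper names are hypothetical; the sketch only illustrates how a request that misses the node cache is parked and later restarted, rather than showing the actual Spiralog code.

#include <stddef.h>

struct request;                     /* a queued clerk read or B-tree update */
struct node;                        /* an in-memory B-tree node             */
struct queue { struct request *head, *tail; };
struct server { struct queue work_queue, pending_queue; };

/* Hypothetical helpers assumed to exist elsewhere in the server. */
struct request *dequeue(struct queue *q);
void            enqueue(struct queue *q, struct request *r);
void            remove_from(struct queue *q, struct request *r);
struct node    *btree_descend(struct server *s, struct request *r,
                              unsigned long long *missing_addr);
void            start_node_read(struct server *s, unsigned long long addr,
                                struct request *r);
void            satisfy_from_leaf(struct server *s, struct node *leaf,
                                  struct request *r);

void worker_loop(struct server *srv)
{
    for (;;) {
        struct request *req = dequeue(&srv->work_queue);     /* FIFO order */
        unsigned long long missing;
        struct node *leaf = btree_descend(srv, req, &missing);

        if (leaf == NULL) {
            /* Needed node is not in the cache: start reading it from the
             * log and park the request so other work can proceed.         */
            start_node_read(srv, missing, req);
            enqueue(&srv->pending_queue, req);
            continue;
        }
        satisfy_from_leaf(srv, leaf, req);  /* copy data or issue LD reads */
    }
}

/* When the node read completes, the parked request re-enters the work
 * queue and is restarted from the top of its traversal.                   */
void node_read_complete(struct server *srv, struct request *req)
{
    remove_from(&srv->pending_queue, req);
    enqueue(&srv->work_queue, req);
}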

Processing Write Requests

Write requests received by the server arrive in groups consisting of a number of data items corresponding to updates to noncontiguous addresses in the file address space. Each group must be written as a single failure-atomic unit, which means that all the parts of the write request must be made stable or none of them must become stable. Such groups of writes are called wunners and are used by the clerk to encapsulate complex file system operations.

Before the server can complete a wunner, that is, before an acknowledgment can be sent back to the clerk indicating that the wunner was successful, the server must make two guarantees:

1. All parts of the wunner are stably stored in the log so that the entire wunner is persistent in the event of a system failure.

2. All data items described by the wunner are visible to subsequent read requests.

The wunner is made persistent by writing each data item to the log. Each data item is tagged with a log record that identifies its corresponding file space address. This allows the data to be recovered in the event of a system failure. All individual writes are made as part of a single compound atomic operation (CAO). This method is provided by the LD layer to bracket a set of writes that must be recovered as an atomic unit. Once all the writes for the wunner have been issued to the log, the mapping layer instructs the LD layer to end (or commit) the CAO.

The wunner can be made visible to subsequent read operations by updating the B-tree to reflect the location of the new data. Unfortunately, this would cause writes to incur a significant latency since updating the B-tree involves traversing the B-tree and potentially reading B-tree nodes into memory from the log. Instead, the server completes a write operation before the B-tree is updated. By doing this, however, it must take additional steps to ensure that the data is visible to subsequent read requests.

Before completing the wunner, the mapping layer queues the B-tree updates resulting from writing the wunner to the same FIFO work queue as read requests. All items are queued atomically, that is, no other read or write operation can be interleaved with the individual wunner updates. In this way, the ordering between the writes making up the wunner and subsequent read or write operations is maintained. Work cannot begin on a subsequent read request until work has started on the B-tree updates ahead of it in the queue.

Once the B-tree updates have been queued to the server work queue and the mapping layer has been notified that the CAO for the writes has committed, both of the guarantees that the server gives on write completion hold. The data is persistent, and the writes are visible to subsequent operations; therefore, the server can send an acknowledgment back to the clerk.
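The sequence just described can be summarized in a small C sketch. The LD and mapping-layer entry points shown here are hypothetical names used only to make the ordering explicit; they are not the product's actual interfaces.

#include <stddef.h>

struct ld;                                   /* log driver instance         */
struct data_item {                           /* one write within a wunner   */
    const void        *bytes;
    unsigned long      length;
    unsigned long long file_address;         /* file-space address (tag)    */
};

/* Hypothetical LD and mapping-layer entry points assumed for this sketch. */
void cao_begin(struct ld *ld);
void cao_write(struct ld *ld, const struct data_item *item); /* tagged write */
void cao_commit(struct ld *ld);              /* make the whole CAO stable   */
void queue_btree_updates_atomically(const struct data_item *items, size_t n);
void acknowledge_clerk(void);

void commit_wunner(struct ld *ld, const struct data_item *items, size_t n)
{
    cao_begin(ld);
    for (size_t i = 0; i < n; i++)
        cao_write(ld, &items[i]);            /* data plus self-identifying tag */

    /* Queue the resulting B-tree updates before completion so that later
     * reads, which sit behind them in the FIFO queue, will see the data.   */
    queue_btree_updates_atomically(items, n);

    cao_commit(ld);                          /* all-or-nothing on recovery  */
    acknowledge_clerk();                     /* both guarantees now hold    */
}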

Updating the B-tree

The worker thread processes a B-tree update request in much the same way as a read request. The update request traverses the B-tree until either it reaches the node that maps the appropriate part of the file address space, or it fails to find a node in memory.

Once the leaf node is reached, it is updated to point at the location of the data in the log (if the data is to be stored directly in the node, the data is copied into the node). The node is now dirty in memory and must be written to the log at some point. Rather than writing the node immediately, the mapping layer writes a log record describing the change, locks the node into the cache, and places a flush operation for the node on the mapping layer's flush queue. The flush operation describes the location of the node in the tree and records the need to write it to the log at some point in the future.
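A minimal sketch of this deferred-write step follows; the helper names and the flush-operation structure are assumptions used to illustrate the order of events, not the server's real code.

#include <stdlib.h>

struct node;                                 /* in-memory B-tree node       */
struct flush_op { struct node *target; int level; };

/* Hypothetical helpers assumed for this sketch. */
void write_change_record(struct node *leaf, const void *update); /* to log  */
void apply_update_to_leaf(struct node *leaf, const void *update);
void lock_node_in_cache(struct node *leaf);
void queue_flush_op(struct flush_op *op);    /* mapping layer's flush queue */

void update_leaf(struct node *leaf, const void *update, int level)
{
    write_change_record(leaf, update);   /* redo record describing change   */
    apply_update_to_leaf(leaf, update);  /* node is now dirty in memory     */
    lock_node_in_cache(leaf);            /* stays resident until flushed    */

    struct flush_op *op = malloc(sizeof *op);
    if (op == NULL)
        return;                          /* error handling elided           */
    op->target = leaf;
    op->level  = level;                  /* distance from the leaves        */
    queue_flush_op(op);                  /* write the node to the log later */
}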

If, on its way to the leaf node, the write operation reaches a node that is not in memory, the worker thread arranges for it to be read from the log and the write operation is placed on a pending queue as with a read operation. Because the write has been acknowledged to the clerk, the new data must remain visible while the node is being read, so the stalled update's log record is attached to the in-memory copy of the node. If a read operation reaches the node with records of stalled updates, it must check whether any of these records contains data that should be returned. The record contains either a pointer to the data in the log or the actual data itself. If a read operation finds a record that can satisfy all or part of the request, the read request uses the information in the record to fetch the data. This preserves the guarantee that the clerk must see all data for which the write request has been acknowledged.

Once the node is read in from the log, the stalled updates are restarted. Each update removes its log record from the node and recommences traversing the B-tree from that point.

Writing B-tree Nodes to the Log<br />

Writing nodes consumes bandwidth to the disk that<br />

might otherwise be used for writing or reading user<br />

data, so the server tries to avoid doing so until<br />

absolutely necessary. Two conditions make it necessary<br />

to begin writing nodes:<br />

1. There are a large number of dirty nodes in the<br />

cache.<br />

2. A checkpoint is in progress.<br />

In the first condition, most of the memory available to the server has been given over to nodes that are locked in memory and waiting to be written to the log. Read and update operations begin to back up, waiting for available memory to store nodes. In the

second condition, the LD has requested a checkpoint<br />

in order to bound recovery time (see the section<br />

Checkpointing later in this paper).<br />

When either of these conditions occurs, the mapping layer switches into flush mode, during which it only writes nodes, until the condition changes. In flush mode, the worker thread processes flush operations from the mapping layer's flush queue in depth order, that is, starting with the nodes furthest from the root of the B-tree. For each flush operation, it traverses the B-tree until it finds the target node and its parent. The target node is identified by the keys it maps and its level. The level of a node is its distance from the leaf of the B-tree (or subtree). Unlike its depth, which is its distance from the root of the B-tree, a node's level does not change as the B-tree grows and shrinks.

Once it has reached its destination, the flush operation writes out the target node and updates the parent with the new log address. The modifications made to the parent are, in turn, logged and flushed in the same way, propagating toward the root of the tree.


Figure 7
Subcomponents of the LD Layer (segment array and segment writer over the disk, with Spiralog backup as an alternate source of segments on tape)

The Segment Writer

The segment writer is responsible for all I/Os to the log. It groups together writes it receives from the mapping layer into large, sequential I/Os where possible. This increases write throughput, but at the potential cost of increasing the latency of individual operations when the disk is lightly loaded.

As shown in Figure 8, the segment writer is responsible for the internal organization of segments written to the disk. Segments are divided into two sections, a data area and a much smaller commit record area. Writing a piece of data requires two operations to the segment at the tail of the log. First the data item is written to the data area of the segment. Once this I/O has completed successfully, a record describing that data is written to the commit record area. Only when the write to the commit record area is complete can the original request be considered stable.

Figure 8
Organization of a Segment (data area holding user data or B-tree nodes, plus a commit record area)

The need for two writes to disk (potentially, with a rotational delay between) to commit a single data write is clearly a disadvantage. Normally, however, the segment writer receives a set of related writes from the mapping layer which are tagged as part of a single CAO. Since the mapping layer is interested in the completion of the whole CAO and not the writes within it, the segment writer is able to buffer additions to the commit record area in memory and then write them with a single I/O. Under a normal write load, this reduces the number of I/Os for a single data write to very close to one.
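The batching of commit records per CAO can be sketched as follows. The function and structure names are hypothetical, and the sketch assumes synchronous completion of the data-area writes; it is meant only to show why the cost approaches one I/O per data write.

#include <stddef.h>

struct segment;                                /* segment at the log tail   */

/* Hypothetical I/O helpers assumed for the sketch. */
void disk_write_data_area(struct segment *seg, const void *buf, size_t len);
void buffer_commit_record(struct segment *seg, const void *rec, size_t len);
void disk_write_commit_area(struct segment *seg); /* one I/O for the batch  */

void segment_write_cao(struct segment *seg,
                       const void *const *items, const size_t *lens,
                       const void *const *recs,  const size_t *rec_lens,
                       size_t n)
{
    for (size_t i = 0; i < n; i++) {
        disk_write_data_area(seg, items[i], lens[i]);     /* must finish first */
        buffer_commit_record(seg, recs[i], rec_lens[i]);  /* not yet on disk   */
    }
    /* The CAO is stable only once its commit records reach the disk; writing
     * them once per CAO keeps the cost close to one I/O per data write.      */
    disk_write_commit_area(seg);
}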

The boundary between the commit record area and the data area is fixed. Inevitably, this wastes space in either the commit record area or data area when the other fills. Choosing a size for the commit record area that minimizes this waste requires some care. After analysis of segments that had been subjected to a typical OpenVMS load, we chose 24 KB as the value for the commit record area.

This segment organization permits the segment writer to have complete control over the contents of

the commit record area, which allows the segment<br />

writer to accomplish two important recovery tasks:<br />

• Detect the end of the log<br />

• Detect multiblock write failure<br />

When physical segments are reused to extend the log, they are not scrubbed and their commit record areas contain stale (but comprehensible) records. The recovery manager must be able to distinguish these stale records from current ones: the sequence number of a segment is an attribute of the physical slot that is assigned to it. The sequence number for a physical slot is incremented each time the slot is reused, allowing the recovery manager to detect blocks that do not belong to the segment stored in the physical slot.

The cost of resubstituting the stolen bytes is incurred<br />

only during recovery and cleaning, because this is<br />

the only time that the commit record area is read.<br />

In hindsight, the partitioning of segments into data<br />

and commit areas was probably a mistake. A layout<br />

that intermingles the data and commit records and that<br />

allows them to be written in one I/O would offer better

latency at low throughput. If combined with careful<br />

writing, command tag queuing, and other optimizations<br />

becoming more prevalent in disk hardware and<br />

controllers, such an on-disk structure could offer significant<br />

improvements in latency and throughput.<br />

Cleaner

The cleaner's job is to turn free space in segments in the log into empty, unassigned physical slots that can be used to extend the log. Areas of free space appear in segments when the corresponding data decays; that is, it is either deleted or replaced.

The cleaner rewrites the live data contained in partially full segments. Essentially, the cleaner forces the segments to decay completely. If the rate at which data is written to the log matches the rate at which it is deleted, segments eventually become empty of their own accord. When the log is full (fullness depends on the distribution of file longevity), it is necessary to proactively clean segments. As the cleaner continues to consume more of the disk bandwidth, performance can be expected to decline. Our design goal was that performance should be maintained up to a point at which the log is 85 percent full. Beyond this, it was acceptable for performance to degrade significantly.

Bytes Die Young

Recently written data is more likely to decay than old data.14,15 Segments that were written a short time ago are likely to decay further, after which the cost of cleaning them will be less. In our design, the cleaner selects candidate segments that were written some time ago and are more likely to have undergone this initial decay.

Mixing data cleaned from older segments with data from the current stream of new writes is likely to produce a segment that will need to be cleaned again once the new data has undergone its initial decay. To avoid mixing cleaned data and data from the current write stream, the cleaner builds its output segments separately and then passes them to the LD to be threaded in at the tail of the log. This has two important benefits:

• The recovery information in the output segment is minimal, consisting only of the self-describing tags on the data. As a result, the cleaner is unlikely to waste space in the data area by virtue of having filled the commit record area.

• By constructing the output segment offline, the cleaner has as much time as it needs to look for data chunks that best fill the segment.

Remapping the Output Segment

The data items contained in the cleaner's output segment receive new addresses. The cleaner informs the mapping layer of the change of location by submitting a B-tree update operation for each piece of data it copied. The mapping layer handles this update operation in much the same way as it would a normal overwrite. This update does have one special property: the cleaner writes are conditional. In other words, the mapping layer will update the B-tree to point to the copy created by the cleaner as long as no change has been made to the data since the cleaner took its copy. This allows the cleaner to work asynchronously to file system activity and avoids any locking protocol between the cleaner and any other part of the Spiralog file system.
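A conditional remap of this kind amounts to a compare-and-update against the current mapping. The following C sketch, with hypothetical mapping-layer helpers, shows the idea; it is not the Spiralog implementation.

#include <stdbool.h>
#include <stdint.h>

struct btree;

/* Hypothetical mapping-layer helpers assumed for this sketch. */
uint64_t current_log_address(struct btree *bt, uint64_t oid,
                             uint32_t stream, uint64_t offset);
void     remap(struct btree *bt, uint64_t oid, uint32_t stream,
               uint64_t offset, uint64_t new_log_address);

/* Update the mapping only if it still points at the address the cleaner
 * copied from; a concurrent overwrite simply wins and the copy is dropped. */
bool cleaner_remap(struct btree *bt, uint64_t oid, uint32_t stream,
                   uint64_t offset, uint64_t old_addr, uint64_t new_addr)
{
    if (current_log_address(bt, oid, stream, offset) != old_addr)
        return false;              /* data changed since the cleaner's copy */
    remap(bt, oid, stream, offset, new_addr);
    return true;
}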

To avoid modifying the mapping layer directly, the cleaner does not copy B-tree nodes to its output segment. Instead, it requests the mapping layer to flush the nodes that occur in its input segments (i.e., rewrite them to the tail of the log). This also avoids wasting space in the cleaner output segment on nodes that map data in the cleaner's input segments. These nodes are guaranteed to decay as soon as the cleaner's B-tree updates are processed.

Figure 9 shows how the cleaner constructs an output segment from a number of input segments. The cleaner keeps selecting input segments until either the output segment is full, or there are no more input segments. Figure 9 also shows the set of operations that are generated by the cleaner. In this example, the output segment is filled with the contents of two full segments and part of a third segment. This will cause the third input segment to decay still further, and the remaining data and B-tree nodes will be cleaned when that segment is selected to create another output segment.

Figure 9
Cleaner Operation

Cleaner Policies

A set of heuristics governs the cleaner's operation. One of our fundamental design decisions was to separate the cleaner policies from the mechanisms that implement them.

When to clean?

Our design explicitly avoids cleaning until it is required. This design appears to be a good match for


a workload on the OpenVMS system. On our timesharing system, the cleaner was entirely inactive for the first three months of 1996; although segments were used and reused repeatedly, they always decayed entirely to empty of their own accord. The trade-off in avoiding cleaning is that although performance is improved (no cleaner activity), the size of the full savesnaps created by backup is increased. This is because backup copies whole segments, regardless of how much live data they contain.

When the cleaner is not running, the live data in the volume tends to be distributed across a large number of partially full segments. To avoid this problem, we have added a control to allow the system manager to manually start and stop the cleaner. Forcing the cleaner to run before performing a full backup compacts the live data in the log and reduces the size of the savesnap.

In normal operation, the cleaner will start cleaning when the number of free segments available to extend the log falls below a fixed threshold (300 in the current implementation). In making this calculation, the cleaner takes into account the amount of space in the log that will be consumed by writing data currently held in the clerks' write-behind caches. Thus, accepting data into the cache causes the cleaner to "clear the way" for the subsequent write request from the clerk.
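The start-cleaning test just described reduces to a simple comparison, sketched below with hypothetical accessor names; only the 300-segment threshold comes from the text above.

#include <stdbool.h>

#define CLEAN_THRESHOLD 300  /* free segments, per the current implementation */

/* Hypothetical accessors assumed to exist elsewhere in the server. */
long free_segment_count(void);
long segments_needed_for_cached_writes(void);

/* Discount the free-segment count by the space that data already accepted
 * into the clerks' write-behind caches will consume when it is written.    */
bool should_start_cleaning(void)
{
    long effectively_free =
        free_segment_count() - segments_needed_for_cached_writes();
    return effectively_free < CLEAN_THRESHOLD;
}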

When the cleaner starts, it is possible that the amount of live data in the log is approaching the capacity of the underlying disk, so the cleaner may find nothing to do. It is more likely, however, that there will be free space it can reclaim. Because the cleaner works by forcing the data in its input segments to decay by rewriting, it is important that it begins work while free segments are available. Delaying the decision to start cleaning could result in the cleaner being unable to proceed.

A fixed number was chosen for the cleaning threshold rather than one based on the size of the disk. The size of the disk does not affect the urgency of cleaning at any particular point in time. A more effective indicator of urgency is the time taken for the disk to fill at the maximum rate of writing. Writing to the log at 10 MB per second will use 300 segments in about 8 seconds. With hindsight, we realize that a threshold based on a measurement of the speed of the disk might have been a more appropriate choice.

Input Segment Selection

The cleaner divides segments into four distinct groups:

1. Empty. These segments contain no live data and are available to the LD to extend the log.

2. Noncleanable. These segments are not candidates for cleaning for one of two reasons:

• The segment contains information that would be required by the recovery manager in the event of a system failure. Segments in this group are always close to the tail of the log and therefore likely to undergo further decay, making them poor candidates for cleaning.

• The segment is part of a snapshot.5 The snapshot represents a reference to the segment, so it cannot be reused even though it may no longer contain live data.

3. Preferred noncleanable. These segments have recently experienced some natural decay. The supposition is that they may decay further in the near future and so are not good candidates for cleaning.

4. Cleanable. These segments have not decayed for some time. Their stability makes them good candidates for cleaning.

The transitions between the groups are illustrated in<br />

Figure 10. It should be noted that the cleaner itself<br />

does not have to execute to transfer segments into the<br />

empty state.<br />

The cleaner's job is to fill output segments, not to<br />

empty input segments. Once it has been started, the<br />

cleaner works to entirely fill one segment. When that

segment has been filled, it is threaded into the log;<br />

if appropriate, the cleaner will then repeat the process<br />

with a new output segment and a new set of input<br />

segments. The cleaner will commit a partially full<br />

output segment only under circumstances of extreme<br />

resource depletion.<br />

Figure 10
Segment States

The cleaner fills the output segment by copying chunks of data forward from segments taken from the cleanable group. The members of this group are held on a list sorted in order of emptiness. Thus, the first cleaner cycle will always cause the greatest number of segments to decay. As the output segment fills, the smallest chunk of data in the segment at the head of the cleanable list may be larger than the space left in the output segment. In this case, the cleaner performs a limited search down the cleanable list for segments containing a suitable chunk. The required information is kept in memory, so this is a reasonably cheap operation. As each input segment is processed, the cleaner

temporarily removes it from the cleanable list. This allows the mapping layer to process the operations the cleaner submitted to it and thereby cause decay to occur before the cleaner again considers the segment as a candidate for cleaning. As the volume fills, the ratio between the number of segments in the cleanable and preferred noncleanable groups is adjusted so that the size of the preferred noncleanable group is reduced and segments are inserted into the cleanable list. If appropriate, a segment in the cleanable list that experiences decay will be moved to the preferred noncleanable list. The preferred noncleanable list is kept in order of least recently decayed. Hence, as it is emptied, the segments that are least likely to experience further decay are moved to the cleanable group.
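The fill loop described above can be sketched in C as follows. The helper names are hypothetical and the loop is deliberately simplified; it is only meant to show the "fill the output segment, not empty the inputs" shape of the cleaner.

#include <stddef.h>

struct segment;                              /* cleaner output segment      */
struct chunk;                                /* one piece of live data      */

/* Hypothetical helpers; the cleanable list is assumed to be sorted with
 * the emptiest segments first, as described above.                         */
size_t        space_left_in(struct segment *out);
struct chunk *find_chunk_that_fits(size_t space_left);  /* limited search   */
void          copy_chunk(struct segment *out, struct chunk *c);
void          submit_conditional_btree_update(struct chunk *c);
void          thread_into_log(struct segment *out);     /* hand to the LD   */

void fill_output_segment(struct segment *out)
{
    while (space_left_in(out) > 0) {
        struct chunk *c = find_chunk_that_fits(space_left_in(out));
        if (c == NULL)
            break;                           /* no suitable chunk remains   */
        copy_chunk(out, c);                  /* data receives a new address */
        submit_conditional_btree_update(c);  /* remap unless overwritten    */
    }
    thread_into_log(out);                    /* threaded in at the log tail */
}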

Recovery

The goal of recovery of any file system is to rebuild the file system state after a system failure. This section describes how the server reconstructs state, both in memory and in the log. It then describes checkpointing, the mechanism by which the server bounds the amount of time it takes to recover the file system state.

Recovery Process

In normal operation, a single update to the server can be viewed as several stages:

1. The user data is written to the log. It is tagged with a self-identifying record that describes its position in the file address space. A B-tree update operation is generated that drives stage 2 of the update process.


2. The leaf nodes of the B-tree are modified in memory, and corresponding change records are written to the log to reflect the position of the new data. A flush operation is generated and queued and then starts stage 3.

3. The B-tree is written out level by level until the root node has been rewritten. As one node is written to the log, the parent of that node must be modified, and a corresponding change record is written to the log. As a parent node is changed, a further flush operation is generated for the parent node and so on up to the root node.

Stage 2 of this process, logging changes to the leaf nodes of the B-tree, is actually redundant. The self-identifying tags that are written with the user data are sufficient to act as change records for the leaf nodes of the B-tree. When we started to design the server, we chose a simple implementation based on physiological write-ahead logging.9 As time progressed, we moved more toward operational logging. The records written in stage 2 are a holdover from the earlier implementation, which we may remove in a future release of the Spiralog file system.

At each stage of the process, a change record is written to the log and an in-memory operation is generated to drive the update through the next stage. In effect, the change record describes the set of changes made to an in-memory copy of a node and an in-memory operation associated with that change.

Figure 11 shows the log and the in-memory work queue at each stage of a write request.

Figure 11
Stages of a Write Request


After a system failure, the server's goal is to reconstruct the file system state to the point of the last write that was written to the log at the time of the crash. This recovery process involves rebuilding, in memory, those B-tree nodes that were dirty and generating any operations that were outstanding when the system failed. The outstanding operations can be scheduled in the normal way to make the changes that they represent permanent, thus avoiding the need to recover them in the event of a future system failure. The recovery process itself does not write to the log.

The mapping layer work queue and the flush lists are rebuilt, and the nodes are fetched into memory by reading the sequential log from the recovery start position (see the section Checkpointing) to the end of the log in a single pass.

The B-tree update operations are regenerated using the self-identifying tag that was written with each piece of data. When the recovery process finds a node, a copy of the node is stored in memory. As log records for node changes are read, they are attached to the nodes in memory and a flush operation is generated for the node. If a log record is read for a node that has not yet been seen, the log record is attached to a placeholder node that is marked as not-yet-seen. The recovery process does not perform reads to fetch in nodes

that are not part of the recovery scan. Changes to B-tree nodes are a consequence of operations that happened earlier in the log; therefore, a B-tree node log record has the effect of committing a prior modification. Recovery uses this fact to throw away update operations that have been committed; they no longer need to be applied.

Figure 12
Recovering a Log

Figure 12 shows a log with change records and B-tree nodes along with the in-memory state of the B-tree node cache and the operations that are regenerated. In this example, change record 1 for node A is superseded or committed by the new version of node A (node A'). The new copy of node C (node C') supersedes change records 3 and 5. This example also shows the effect of finding a log record without seeing a copy of the node during recovery. The log record for node B is attached to an in-memory version of the node that is marked as not-yet-seen. The data record with self-identifying tag Data 1 generates a B-tree update record that is placed on the work queue for processing. As a final pass, the recovery process generates the set of flush operations that was outstanding when the system failed. The set of flush requests is defined as the set of nodes in the B-tree node cache that has log records attached when the recovery scan is complete. In this case, flush operations for nodes A' and B are generated.

The server guarantees that a node is never written to the log with uncommitted changes, which means that we only need to log redo records.9,16 In addition, when we see a node during the recovery scan, any log records that are attached to the previous version of the node in memory can be discarded.
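The single-pass scan can be summarized in the following C sketch. The record classification and helper names are hypothetical; the sketch only captures the rules above: node copies supersede earlier change records, change records attach to (possibly placeholder) nodes, and tagged data regenerates B-tree update operations.

#include <stddef.h>

struct log_item;     /* one record read from the sequential recovery scan  */
struct node;         /* in-memory copy (or not-yet-seen placeholder)       */

/* Hypothetical helpers assumed for the sketch. */
struct log_item *next_log_item(void);            /* NULL at the tail        */
int   item_is_node_copy(const struct log_item *item);
int   item_is_change_record(const struct log_item *item);
int   item_is_tagged_data(const struct log_item *item);
struct node *cache_lookup_or_placeholder(const struct log_item *item);
void  install_node_copy(struct node *n, const struct log_item *item);
void  discard_attached_records(struct node *n);   /* superseded by new copy */
void  attach_change_record(struct node *n, const struct log_item *item);
void  queue_btree_update_from_tag(const struct log_item *item);
void  generate_flush_ops_for_dirty_nodes(void);

void recover_log(void)
{
    struct log_item *item;
    while ((item = next_log_item()) != NULL) {
        if (item_is_node_copy(item)) {
            struct node *n = cache_lookup_or_placeholder(item);
            discard_attached_records(n);       /* already committed changes */
            install_node_copy(n, item);
        } else if (item_is_change_record(item)) {
            struct node *n = cache_lookup_or_placeholder(item);
            attach_change_record(n, item);     /* will be flushed later     */
        } else if (item_is_tagged_data(item)) {
            queue_btree_update_from_tag(item); /* redo the stage-2 update   */
        }
    }
    /* Final pass: every node that still has records attached was dirty at
     * the time of the failure and needs a flush operation.                 */
    generate_flush_ops_for_dirty_nodes();
}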


Operations generated during recovery are posted to the work queues as they would be in normal running. Normal operation is not allowed to begin until the recovery pass has completed; however, when recovery reaches the end of the log, the server is able to service operations from clerks. Thus new requests from the clerk can be serviced, potentially in parallel with the operations that were generated by the recovery process.

Log records are not applied to nodes during recovery for a number of reasons:

• Less processing time is needed to scan the log and therefore the server can start servicing new user requests sooner.

• Recovery will not have seen copies of all nodes for which it has log records. To apply the log records, the B-tree node must be read from the log. This would result in random read requests during the sequential scan of the log, and again would result in a longer period before user requests could be serviced.

• There may be a copy of the node later in the recovery scan. This would make the additional I/O operation redundant.

Checkpointing

As we have shown, recovering an LFS log is implemented by a single-pass sequential scan of all records in the log from the recovery start position to the tail of the log. This section defines a recovery start position and describes how it can be moved forward to reduce the amount of log that has to be scanned to recover the file system state.

To reconstruct the in-memory state when a system crashed, recovery must see something in the log that represents each operation or change of state that was represented in memory but not yet made stable. This means that at time t, the recovery start position is defined as a point in the log after which all operations that are not stably stored have a log record associated with them. Operations obtain the association by scanning the log sequentially from the beginning to the end. The recovery position then becomes the start of the log, which has two important problems:

1. In the worst case, it would be necessary to sequentially scan the entire log to perform recovery. For large disks, a sequential read of the entire log consumes a great deal of time.

2. Recovery must process every log record written between the recovery start position and the end of the log. As a consequence, segments between the start of recovery and the end of the log cannot be cleaned and reused.

To restrict the amount of time to recover the log and to allow segments to be released by cleaning, the recovery position must be moved forward from time to time, so that it is always close to the tail of the log. For logs that predominantly service read requests, the recovery time tends toward this limit. In these cases, it may be more appropriate to add timer-based checkpoints.

Managing Free Space<br />

A traditional, update-in-place file system overwrites superseded data by writing to the same physical location on disk. An LFS, in contrast, must reserve enough free space in the log to write out those parts of the B-tree that are needed to make the file system structures consistent. This means that we can never have dirty B-tree nodes in memory that cannot be flushed to the log.

The server must carefully manage the amount of free space in the log. It must provide two guarantees:

1. A write will be accepted by the server only if there is sufficient free space in the log to hold the data and rewrite the mapping B-tree to describe it. This guarantee must hold regardless of how much space the cleaner may subsequently reclaim.

2. At the higher levels of the file system, if an I/O operation is accepted, even if that operation is stored in the write-behind cache, the data will be written to the log. This guarantee holds except in the event of a system failure.

The server provides these guarantees using the same mechanism. As shown in Figure 13, the free space and the reserved space in the log are modeled using an escrow function.17 The total number of blocks that contain live, valid data is maintained by the server.
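The essence of an escrow-style space account is that a write reserves, up front, enough blocks for both its data and the B-tree updates that describe it, and the reservation is later either consumed or returned. The following C sketch is a single-threaded illustration under those assumptions; the names and arithmetic are not the server's actual escrow implementation.

#include <stdbool.h>

struct space_account {
    long free_blocks;       /* blocks not holding live data or reservations */
    long reserved_blocks;   /* promised to accepted but unwritten requests  */
};

bool reserve_space(struct space_account *a, long data_blocks,
                   long btree_blocks)
{
    long needed = data_blocks + btree_blocks;
    if (a->free_blocks < needed)
        return false;             /* reject the write: guarantee 1 above    */
    a->free_blocks     -= needed;
    a->reserved_blocks += needed;
    return true;                  /* the write can now always be completed  */
}

void consume_reservation(struct space_account *a, long blocks_written,
                         long blocks_reserved)
{
    a->reserved_blocks -= blocks_reserved;
    a->free_blocks     += blocks_reserved - blocks_written; /* return excess */
}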


Compression

Adding file compression to an update-in-place file system presents a particular problem: what to do when a data item is overwritten with a new version that does not compress to the same size. Since all updates take place at the tail of the log, an LFS avoids this problem entirely. In addition, the amount of space consumed by a data item is determined by its size and is not influenced by the cluster size of the disk. For this reason, an LFS does not need to employ file compaction to make efficient use of large disks or RAID sets.19

Future Improvements<br />

The existing implementation can be improved in a<br />

number of areas, many of which involve resource consumption.<br />

The B-tree mapping mechanism, although<br />

general and flexible, has high CPU overheads and<br />

requires complex recovery algorithms. The segment<br />

layout needs to be revisited to remove the need for serialized<br />

I/Os when committing write operations and thus

further reduce the write latency.<br />

For the Spiralog file system version 1.0, we chose to<br />

keep complete information about live data and data that<br />

was no longer valid for every segment in the log. This<br />

mechanism allows us to reduce the overhead of the<br />

cleaner; however, it does so at the expense of memory<br />

and disk space and consequently does not scale well to<br />

multi-terabyte disks.<br />

A Final Word<br />

Log structuring is a relatively new and exciting technology.<br />

Building <strong>Digital</strong>'s first product using this<br />

technology has been both a considerable challenge and<br />

a great deal of fun. Our experience during the construction<br />

of the Spiralog product has led us to believe<br />

that LFS technology has an important role to play in<br />

the future of file systems and storage management.<br />

Acknowledgments

We would like to take this opportunity to acknowledge the contributions of the many individuals who helped during the design of the Spiralog server. Alan Paxton was responsible for initial investigations into LFS technology and laid the foundation for our understanding. Mike Johnson made a significant contribution to the cleaner design and was a key member of the team that built the final server. We are very grateful to colleagues who reviewed the design at various stages, in particular, Bill Laing, Dave Thiel, Andy Goldstein, and Dave Lomet. Finally, we would like to thank Jim Johnson and Cathy Foley for their continued loyalty, enthusiasm, and direction during what has been a long and sometimes hard journey.


References

1. D. Gifford, R. Needham, and M. Schroeder, "The Cedar File System," Communications of the ACM, vol. 31, no. 3 (March 1988).

2. S. Chutani, O. Anderson, M. Kazar, and B. Leverett, "The Episode File System," Proceedings of the Winter 1992 USENIX Technical Conference (January 1992).

3. M. Rosenblum, "The Design and Implementation of a Log-Structured File System," Report No. UCB/CSD 92/696, University of California, Berkeley (June 1992).

4. J. Ousterhout and F. Douglis, "Beating the I/O Bottleneck: The Case for Log-Structured File Systems," Operating Systems Review (January 1989).

5. R. Green, A. Baird, and J. Davies, "Designing a Fast, On-line Backup System for a Log-structured File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 32-45.

6. J. Ousterhout et al., "A Comparison of Logging and Clustering," Computer Science Department, University of California, Berkeley (March 1994).

7. M. Seltzer, K. Bostic, M. McKusick, and C. Staelin, "An Implementation of a Log-Structured File System for UNIX," Proceedings of the Winter 1993 USENIX Technical Conference (January 1993).

8. W. de Jonge, F. Kaashoek, and W.-C. Hsieh, "The Logical Disk: A New Approach to Improving File Systems," ACM SIGOPS '93 (December 1993).

9. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques (San Mateo, Calif.: Morgan Kaufmann Publishers, 1993), ISBN 1-55860-190-2.

10. A. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart, "The Echo Distributed File System," Digital Systems Research Center, Research Report 111 (September 1993).

11. J. Johnson and W. Laing, "Overview of the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 5-14.

12. A. Sweeney et al., "Scalability in the XFS File System," Proceedings of the Winter 1996 USENIX Technical Conference (January 1996).

13. J. Kohl, "Highlight: Using a Log-structured File System for Tertiary Storage Management," USENIX Association Conference Proceedings (January 1993).

14. M. Baker et al., "Measurements of a Distributed File System," Symposium on Operating System Principles (SOSP) 13 (October 1991).

15. J. Ousterhout et al., "A Trace-driven Analysis of the UNIX 4.2 BSD File System," Symposium on Operating System Principles (SOSP) 10 (December 1985).

16. D. Lomet and B. Salzberg, "Concurrency and Recovery for Index Trees," Digital Cambridge Research Laboratory, Technical Report (August 1991).

17. P. O'Neil, "The Escrow Transactional Method," ACM Transactions on Database Systems, vol. 11 (December 1986).

18. Volume Shadowing for OpenVMS AXP Version 6.1 (Maynard, Mass.: Digital Equipment Corp., 1994).

19. M. Burrows et al., "On-line Data Compression in a Log-structured File System," Digital Systems Research Center, Research Report 85 (April 1992).

Biographies

Christopher Whitaker
Chris Whitaker joined Digital in 1988 after receiving a B.Sc. Eng. (honours, 1st class) in computer science from the Imperial College of Science and Technology, University of London. He is a principal software engineer with the OpenVMS File System Development Group located near Edinburgh, Scotland. Chris was the team leader for the LFS server component of the Spiralog file system. Prior to this, Chris worked on the distributed transaction management services (DECdtm) for OpenVMS and the port of the OpenVMS record management services (RMS and RMS journaling) to Alpha.

J. Stuart Bayley
Stuart Bayley is a member of the OpenVMS File System Development Group, located near Edinburgh, Scotland. He joined Digital in 1990 and, prior to becoming a member of the Spiralog LFS server team, worked on OpenVMS DECdtm services and the OpenVMS XQP file system. Stuart graduated from King's College, University of London, with a B.Sc. (honours) in physics in 1986.

Rod D. W. Widdowson
Rod Widdowson received a B.Sc. (1984) and a Ph.D. (1987) in computer science from Edinburgh University. He joined Digital in 1990 and is a principal software engineer with the OpenVMS File System Development Group located near Edinburgh, Scotland. Rod worked on the implementation of LFS and cluster distribution components of the Spiralog file system. Prior to this, Rod worked on the port of the OpenVMS XQP file system to Alpha. Rod is a charter member of the British Computer Society.


Designing a Fast, On-line Backup System for a Log-structured File System

The Spiralog file system for the OpenVMS operating system incorporates a new technical approach to backing up data. The fast, low-impact backup can be used to create consistent copies of the file system while applications are actively modifying data. The Spiralog backup uses the log-structured file system to solve the backup problem. The physical on-disk structure allows data to be saved at near-maximum device throughput with little processing of data. The backup system achieves this level of performance without compromising functionality such as incremental backup or fast, selective restore.


Russell J. Green<br />

Alasdair C. Baird<br />

J. Christopher Davies<br />

Most computer users want to be able to recover data lost through user error, software or media failure, or site disaster but are unwilling to devote system resources or downtime to make backup copies of the data. Furthermore, with the rapid growth in the use of data storage and the tendency to move systems toward complete utilization (i.e., 24-hour by 7-day operation), the practice of taking the system off line to back up data is no longer feasible.

The Spiralog file system, an optional component of the OpenVMS Alpha operating system, incorporates a new approach to the backup process (called simply backup), resulting in a number of substantial customer benefits. By exploiting the features of log-structured storage, the backup system combines the advantages of two different traditional approaches to performing backup: the flexibility of file-based backup and the high performance of physically oriented backup.

The design goal for the Spiralog backup system was to provide customers with a fast, application-consistent, on-line backup. In this paper, we explain the features of the Spiralog file system that helped achieve this goal and outline the design of the major backup functions, namely volume save, volume restore, file restore, and incremental management. We then present some performance results obtained using Spiralog version 1.1. The paper concludes with a discussion of other design approaches and areas for future work.

Background

File system data may be lost for many reasons, including

• User error - A user may mistakenly delete data.
• Software failure - An application may execute incorrectly.
• Media failure - The computing equipment may malfunction because of poor design, old age, etc.
• Site disaster - Computing facilities may […]


The ability to save backup copies of all or part of a file system's information in a form that allows it to be restored is essential to most customers who use computing resources. To understand the backup capability needed in the Spiralog file system, we spoke to a number of customers - five directly and several hundred through public forums. Each ran a different type of system in a distinct environment, ranging from research and development to finance on OpenVMS and other systems. Our survey revealed the following set of customer requirements for the Spiralog backup system:

1. Backup copies of data must be consistent with respect to the applications that use the data.

2. Data must be continuously available to applications. Downtime for the purpose of backup is unacceptable. A backup must copy all data of interest as it exists at an instant in time; however, the application should also be allowed to modify the data during the copying process. Performing backup in such a way as to satisfy these constraints is often called hot backup or on-line backup. Figure 1 illustrates how data inconsistency can occur during an on-line backup.

3. The backup operations, particularly the save operation, must be fast. That is, copying data from the system or restoring data to the system must be accomplished in the time available.

4. The backup system must allow an incremental backup operation, i.e., an operation that captures only the changes made to data since the last backup.

The Spiralog backup team set out to design and implement a backup system that would meet the four customer requirements. The following section discusses the features of the implementation of a log-structured file system (LFS) that allowed us to use a new approach to performing backup. Note that throughout this paper we use disk to describe the physical media used to store data and volume to describe the abstraction of the disk as presented by the Spiralog file system.

[Figure 1. Example of an On-line Backup That Results in Inconsistent Data. The timeline reads: the initial file contains two blocks; backup starts and copies the first block; the application rewrites the file; backup proceeds and copies the second block. The resulting backup copy is corrupt because the first block is inconsistent with the latest rewritten file.]

Spiralog Features

The Spiralog file system is an implementation of a log-structured file system. An LFS is characterized by the use of disk storage as a sequential, never-ending repository of data. We generally refer to this organization of data as a log. Johnson and Laing describe in detail the design of the Spiralog implementation of an LFS and how files are maintained in this implementation.1 Some features unique to a log-structured file system are of particular interest in the design of a backup system.2-4 These features are

• Segments, where a segment is the fundamental unit of storage
• The no-overwrite nature of the system
• The temporal ordering of on-disk data structures
• The means by which files are constructed

This section of the paper discusses the relevance of these features; a later section explains how these features are exploited in the backup design.

Segments

In this paper, the term segment refers to a logical entity that is uniquely identified and never overwritten. This definition is distinct from the physical storage of a segment. The only physical feature of interest to backup with regard to segments is that they are efficient to read in their entirety.

Using log-structured storage in a file system allows efficient writing irrespective of the write patterns or load to the file system. All write operations are grouped in segment-sized chunks. The segment size is chosen to be sufficiently large that the time required to read or write the segment is significantly greater than the time required to access the segment, i.e., the time required for a head seek and rotational delay on a magnetic disk. All data (except the LFS homeblock and checkpoint information used to locate the end of the data log) is stored in segments, and all segments are known to the file system. From a backup point of view, this means that the entire contents of a volume can be copied by reading the segments. The segments are large enough to allow efficient reading, resulting in a near-maximum transfer rate of the device.
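As a rough illustration of this sizing argument, the fraction of time the disk spends transferring data rather than seeking can be estimated as follows. The access time, transfer rate, and segment size used below are illustrative assumptions, not figures from the Spiralog design.

    #include <stdio.h>

    /* Illustrative estimate of disk efficiency for a given segment size.
     * Assumed figures (not from the paper): 12 ms average access time
     * (head seek plus rotational delay) and 5 MB/s sustained transfer. */
    int main(void)
    {
        const double access_s  = 0.012;           /* seek + rotational delay */
        const double rate_bps  = 5.0e6;           /* bytes per second        */
        const double seg_bytes = 256.0 * 1024.0;  /* candidate segment size  */

        double xfer_s     = seg_bytes / rate_bps;
        double efficiency = xfer_s / (access_s + xfer_s);

        /* With these assumptions a 256 KB segment keeps the disk busy
         * transferring data roughly 80 percent of the time, whereas a
         * 4 KB block would manage only about 6 percent.                */
        printf("transfer time = %.1f ms, efficiency = %.0f%%\n",
               xfer_s * 1000.0, efficiency * 100.0);
        return 0;
    }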

No Overwrite

In a log-structured file system, in which the segments are never overwritten, all data is written to new, empty segments. Each new segment is given a segment identifier (segid) allocated in a monotonically increasing manner. At any point in time, the entire contents and state of a volume can be described in terms of a (checkpoint position, segment list) pair. At the physical level, a volume consists of a list of segments and a position within a segment that defines the end of the log. Rosenblum describes the concept of time travel, where an old state of the file system can be revisited by creating and maintaining a snapshot of the file system for future access.3 Allowing time travel in this way requires maintaining an old checkpoint and disabling the reuse of disk space by the cleaner. The cleaner is a mechanism used to reclaim disk space occupied by obsolete data in a log, i.e., disk space no longer referenced in the file system. The contents of a snapshot are independent of operations undertaken on the live version of the file system. Modifying or deleting a file affects only the live version of the file system (see Figure 2). Because of the no-overwrite nature of the LFS, previously written data remains unchanged.
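The (checkpoint position, segment list) description of a volume, and a snapshot as simply a preserved copy of that description whose segments are protected from the cleaner, can be sketched with a few data structures. This is a simplified illustration only; the type and field names below are invented and do not reflect the actual Spiralog on-disk format.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t segid_t;

    /* A position in the log: which segment, and the offset within it. */
    struct log_position {
        segid_t  segid;
        uint32_t offset;
    };

    /* Entire volume state: the list of segments that make up the log,
     * plus the checkpoint position that marks the end of the log.     */
    struct volume_state {
        struct log_position checkpoint;
        segid_t            *segments;       /* segids in increasing order */
        size_t              segment_count;
    };

    /* A snapshot is a preserved (checkpoint position, segment list) pair.
     * Because segments are never overwritten, the snapshot stays valid
     * for as long as the cleaner is prevented from reusing its segments. */
    struct snapshot {
        struct volume_state frozen;
    };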

Other mechanisms specific to a particular backup algorithm have been developed to achieve on-line consistency.5 The snapshot model as described above allows a more general solution with respect to multiple concurrent backups and the choice of the save algorithm. A read-only version of the file system at an instant in time is precisely what is required for application consistency in on-line backup. This snapshot approach to attaining consistency in on-line backup has been used in other systems.6,7 As explained in the following sections, the Spiralog file system combines the snapshot technique with features of log-structured storage to obtain both on-line backup consistency and performance benefits for backup.

[Figure 2. Data Accessible to the Snapshot and to the Live File System: one region of the log is visible only to the snapshot, one is shared by the snapshot and the live file system, and one is new live data written since the snapshot was taken.]

Temporal Ordering

As mentioned earlier, all data, i.e., user data and file system metadata (data that describes the user data in the file system), is stored in segments and there is no overwrite of segments. All on-disk data structures that refer to physical placement of data use pointers, namely (segid, offset) pairs, to describe the location of the data. Each (segid, offset) pair specifies the segment and where within that segment the data is stored. Together, these imply the following two properties of data structures, which are key features of an LFS:


1. On-disk structure pointers, namely (segid, offset) pairs, are relatively time ordered. Specifically, data stored at (s2, o2) was written more recently than data stored at (s1, o1) if and only if s2 is greater than s1, or s2 equals s1 and o2 is greater than o1. Thus, new data would appear to the right in the data structure depicted in Figure 3. (A comparison sketch follows this list.)

2. Any data structure that uses on-disk pointers stored within the segments (the mapping data structure implementing the LFS index) must be time ordered; that is, all pointers must refer to data written prior to the pointer. Referring again to Figure 3, only data structures that point to the left are valid.
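Property 1 amounts to a simple comparison on (segid, offset) pairs. The sketch below (reusing the hypothetical log_position type from the earlier sketch) shows the "written more recently than" test that the temporal-ordering argument relies on.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t segid_t;

    struct log_position {
        segid_t  segid;
        uint32_t offset;
    };

    /* Returns true if the data at position b was written more recently
     * than the data at position a: b is in a later segment, or in the
     * same segment at a larger offset (property 1).                    */
    static bool written_after(struct log_position a, struct log_position b)
    {
        return (b.segid > a.segid) ||
               (b.segid == a.segid && b.offset > a.offset);
    }

    /* Property 2 then requires that a pointer stored at position p only
     * refer to a position q for which written_after(q, p) holds.       */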

These properties of on-disk data structures are of interest when designing backup systems. Such data structures can be traversed so that segments are read in reverse time order. To understand this concept, consider the root of some on-disk data structure. This root must have been written after any of the data to which it refers (property 2). A data item that the root references must have been written before the root and so must have been stored in a segment with a segid less than or equal to that of the segment in which the root is stored (property 1). A similar inductive argument can be used to show that […]


[…] data contained in a file defined by a file identifier can be recovered, independent of how the file was created, without any knowledge of the file system structure. Consequently, together with the temporal ordering of data in an LFS, files can be recovered using an ordered linear scan of the segments of a volume, provided the on-disk data structures are traversed correctly. This mechanism allows efficient file restore from a sequence of segments. In particular, a set of files can be restored in a single pass of a saved volume stored on tape.

Existing Approaches to Backup

The design of the Spiralog backup attempts to combine the advantages of file-based backup tools such as Files-11 backup, UNIX tar, and Windows NT backup, and physical backup tools such as UNIX dd, Files-11 backup/PHYSICAL, and HSC backup (a controller-based backup for OpenVMS volumes).9

File-based Backup

A file-based backup system has two main advantages: (1) the system can explicitly name files to be saved, and (2) the system can restore individual files. In this paper, the file or structure that contains the output data of a backup save operation is called a saveset. Individual file restore is achieved by scanning a saveset for the file and then recreating the file using the saved contents. Incremental file-based backup usually entails keeping a record of when the last backup was made (either on a per-file basis or on a per-volume basis) and copying only those files and directories that have been created or modified since a previous backup time.

The penalty associated with these features of a file-based backup system is that of save performance. In effect, the backup system performs a considerable amount of work to lay out data in the saveset to allow simple restore. All files are segregated to a much greater extent than they are in the file system on-disk structure. The limiting factor in the performance of a file-based save operation is the rate at which data can be read from the source disk. Although there are some ways to improve performance, in the case of a volume that has a large number of files, read performance is always costly. Figure 4 illustrates the layouts of three different types of savesets.

Physical Backup

In contrast to the file-based approach to backup, a physical backup system copies the actual blocks of data on the source disk to a saveset. The backup system is able to read the disk optimally, which allows an implementation to achieve data throughput near the disk's maximum transfer rate. Physical backups typically allow neither individual file restore nor incremental backup. The overhead required to include sufficient information for these features usually erodes the performance benefits offered by the physical copy. In addition, a physical backup usually requires that the entire volume be saved regardless of how much of the volume is used to store data.

[Figure 4. Layouts of Three Different Types of Saveset. In a physical backup saveset, blocks are laid out contiguously on tape; file restore is not possible without random access. In a file backup saveset, files are laid out contiguously on tape; to create this sort of saveset, files need to be read individually from disk, which generally means suboptimal disk access. In a Spiralog backup saveset, a directory (DIR) and a segment table (SEGT) allow file restore from segments; segments are large enough to allow near-optimal disk access.]

How Spiralog Backup Exploits the LFS

Spiralog backup uses the snapshot mechanism to achieve on-line consistency for backup. This section describes how Spiralog attains high-performance backup with respect to the various save and restore operations.

Volume Save Operation

The save operation of Spiralog creates a snapshot and then physically copies it to a tape or disk structure called a savesnap. (This term is chosen to be different from saveset to emphasize that it holds a consistent snapshot of the data.) This physical copy operation allows high-performance data transfer with minimal processing.10 In addition, the temporal ordering of data stored by Spiralog means that this physical copy operation can also be an incremental operation.

The savesnap is a file that contains, among other information, a list of segments exactly as they exist in the log. The structure of the savesnap allows the efficient implementation of volume restore and file restore (see Figure 5 and Figure 6).
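Drawing on the structure shown in Figures 5 and 6, one plausible way to picture the savesnap layout is sketched below: a header, a compact directory representation, a segment table, and then the segments themselves in decreasing segid order, packed into fixed-size records with zero padding. The record format and names are illustrative assumptions, not the actual Spiralog on-tape format.

    #include <stdint.h>

    /* Record types a savesnap might carry (illustrative only). */
    enum savesnap_record_type {
        SNAP_HEADER,          /* snapshot information                     */
        SNAP_DIRECTORY_INFO,  /* compact form of the volume directory     */
        SNAP_SEGMENT_TABLE,   /* list of the segids the savesnap contains */
        SNAP_SEGMENT          /* a segment copied exactly as in the log,  */
                              /* written in decreasing segid order        */
    };

    /* Fixed-size physical record; the size is the same for the entire
     * savesnap, and unused space at the end of a record is zero padded. */
    struct savesnap_record {
        uint32_t type;     /* one of savesnap_record_type                */
        uint32_t length;   /* bytes of payload that are meaningful       */
        /* payload follows, padded with zeros to the fixed record size   */
    };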

The steps of a full save operation are as follows:

1. Create a snapshot and mount it. This mounted snapshot looks like a separate, read-only file system. Read information about the snapshot.


[Figure 5. Savesnap Structure. The savesnap holds directory information and metadata segments in decreasing segid order, written as fixed-size physical savesnap records with zero padding, in the direction in which the log is written.]

[Figure 6. Correspondence between Segments on Disk and in the Savesnap. Used segments from the log (for example, segids 105, 104, 102, and 101) are copied to the savesnap in the direction in which the tape is written; unused segments are skipped.]

2. Write the header to the savesnap, including snapshot information […]



[…] disk from which the savesnap was made, providing the target container is large enough to hold the restored segments.

4. Backup declares the volume restore as complete (no more segments will be written to the volume). Backup tells the file system how to mount the volume by supplying the snapshot checkpoint location.

A Spiralog restore operation treats an incremental savesnap and all the preceding savesnaps upon which it depends as a single savesnap. For savesnaps other than the most recent savesnap (the base savesnap), the snapshot information and directory information are ignored. The sole purpose of these savesnaps is to provide segments to the base savesnap.

To restore a volume from a set of incremental savesnaps, the Spiralog backup system performs steps 1 and 2 using the base savesnap. In step 3, the restore copies all the segments in the snapshot defined by the base savesnap to the target disk. (Note that there is a one-to-one correspondence between snapshots and savesnaps.) The savesnaps are processed in reverse chronological order. The contents of the segment table in the base savesnap define the list of segments in the snapshot to be restored. Although the volume restore operation copies all the segments in the base savesnap, not all segments in the savesnaps […] recovery from volume failure or site disaster.
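A minimal sketch of the restore-time bookkeeping described above, under the stated assumptions: the segment table of the base savesnap names the segments that are wanted, and the savesnaps are then processed in reverse chronological order, copying each wanted segment once. The function and type names are invented for the illustration and are not the real Spiralog interfaces.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t segid_t;

    struct savesnap;        /* opaque handles for the sketch */
    struct target_volume;

    /* Assumed helpers, not real Spiralog calls. */
    extern size_t base_segment_table(const struct savesnap *base,
                                     segid_t *wanted, size_t max);
    extern bool   savesnap_has_segment(const struct savesnap *snap, segid_t id);
    extern void   copy_segment(const struct savesnap *snap, segid_t id,
                               struct target_volume *out);

    /* Restore the snapshot defined by the base savesnap from a chain of
     * savesnaps ordered newest first (the base savesnap at index 0).    */
    static void restore_volume(const struct savesnap *const *chain, size_t nsnaps,
                               struct target_volume *out,
                               segid_t *wanted, bool *copied, size_t max)
    {
        size_t nwanted = base_segment_table(chain[0], wanted, max);

        for (size_t i = 0; i < nwanted; i++)
            copied[i] = false;

        for (size_t s = 0; s < nsnaps; s++) {            /* reverse chronological */
            for (size_t i = 0; i < nwanted; i++) {
                if (!copied[i] && savesnap_has_segment(chain[s], wanted[i])) {
                    copy_segment(chain[s], wanted[i], out);
                    copied[i] = true;                     /* copy each segment once */
                }
            }
        }
    }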


File Restore Operation

The purpose of a file restore operation is to provide a fast and efficient way to retrieve a small number of files from a savesnap without performing a full volume restore. Typically, file restore is used to recover files that have been inadvertently deleted. To achieve high-performance file restore, we imposed the following requirements on the design:

• A file restore session must process as few savesnaps as possible; it should skip savesnaps that do not contain data needed by the session.
• When processing a savesnap, the file restore must scan the savesnap linearly, in a single pass.


[…]relevant to a particular file identifier that is held within a given segment. The mapping information returned through this interface describes either mapping information held elsewhere or real file data. One characteristic of the log is that anything to which such mapping information points must occur earlier in the log, that is, in a subsequent saved segment. Recall property 2 of the LFS on-disk data structures. Consequently, the file restore session will progress through the savesnaps in the desired linear fashion provided that requests are presented to the interface in the correct order. The correct order is determined by the allocation of segids. Since segids increase monotonically over time, it is necessary only to ensure that requests are presented in a decreasing segid order.

The file restore interface operates on an object called a context. The context is a tuple that contains a location in the log, namely (segid, offset), and a type field. When supplied with a file identifier and a context, the core function of the interface inspects the segment determined by the context and returns the set of contexts that enumerate all available mapping information for the file identifier held at the location given by the initial context.

The type of context returned indicates one of the following situations:


• The location contains real file data.
• The location given by the context holds more mapping information. In this case, the core function can be applied repeatedly to determine the precise location of the file's data.

A work list of contexts in decreasing segid order drives the file restore process. The procedure for retrieving the data for a single file identifier is as follows. At the outset of the file restore operation, the work list holds a single context that identifies the root of the map (the tail of the log). As items are taken from the head of the list, the file restore must perform one of two actions. If the context is a pointer to real file data, then the file restore reads the data at that location. If the context holds the location of mapping information, then the core function must be applied to enumerate all possible further mapping information held there. The file restore operation places all returned contexts in the work list in the correct order prior to picking the next work item. This simple procedure, which is illustrated in Figure 8, continues until the work list is empty and all the file's data has been read.

[Figure 8. File Restore Session in Progress. The savesnap holds segments 633, 555, 478, 195, 69, and 59. The shaded areas represent the file data to be restored and the file system metadata that needs to be accessed to retrieve that data. The restore session has thus far processed segment 478. Part A of the file has been recovered into the target file system; parts B and C are still to come. After processing segment 478, the file restore visits the next known parts of the log, segments 69 and 59. Items that describe metadata in segment 69 and data in segment 59 will be on the work list. The next segment that the file restore will read is segment 69, so the session can skip the intervening segment (segment 195).]



To cope with more than one file, the file restore operation extends this procedure by converting the work list so that it associates a particular file identifier with each context. File restore initializes the work list to hold a pointer to the root of the map (the tail of the log) for each file identifier to be restored. The effect is to interleave requests to read more than one file while maintaining the correct segid ordering.
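The work-list procedure just described can be sketched as follows. The context type, the work-list operations, and the expand_mapping call are invented names standing in for the real file restore interface; the essential points are that items are consumed in decreasing segid order and that each item either yields file data or expands into further contexts that lie earlier in the log.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t segid_t;
    typedef uint64_t fileid_t;

    enum ctx_type { CTX_MAPPING, CTX_FILE_DATA };

    struct context {             /* a location in the log plus a type */
        segid_t       segid;
        uint32_t      offset;
        enum ctx_type type;
        fileid_t      fileid;    /* the file this context belongs to  */
    };

    /* Assumed helpers, not the real Spiralog file restore interface.  */
    extern bool           worklist_empty(void);
    extern struct context worklist_pop_max(void);       /* largest segid first */
    extern void           worklist_insert(struct context c);
    extern size_t         expand_mapping(struct context c,   /* core function  */
                                         struct context *out, size_t max);
    extern void           write_file_data(struct context c);

    /* The work list has already been primed with one root-of-map context
     * (the tail of the log) per file identifier to be restored.          */
    static void restore_files(void)
    {
        struct context expanded[64];

        while (!worklist_empty()) {
            struct context c = worklist_pop_max();   /* decreasing segid order */

            if (c.type == CTX_FILE_DATA) {
                write_file_data(c);                  /* data read from savesnap,
                                                        written to the target   */
            } else {
                size_t n = expand_mapping(c, expanded, 64);
                for (size_t i = 0; i < n; i++)       /* all refer to earlier    */
                    worklist_insert(expanded[i]);    /* positions in the log    */
            }
        }
    }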

A further subtlety occurs when the context at the head of the work list is found to refer to a segment outside the current savesnap. The ordering imposed on the work list implies that all subsequent items of work must also be outside the current savesnap. This follows from the temporal ordering properties of LFS on-disk structures and the way in which incremental savesnaps are defined. When this situation occurs, the work list is saved. When the next savesnap is ready for processing, the file restore session can be restarted using the saved work list as the starting point.

During this step, the file restore writes the pieces of files to the target volume as they are read from the savesnap. Since the file restore process allocates file identifiers on a per-volume basis, restore must allocate new file identifiers in the target volume […]


[Figure 9. Merging Savesnaps. Backups are taken on Monday (full), Wednesday (incremental), and Friday (incremental); merging the three savesnaps produces one new savesnap equivalent to a full savesnap taken on Friday.]

Table 1
Performance Comparison of the Spiralog and Files-11 Backup Save Operations

Backup System     Elapsed Time        Savesnap or Saveset Size   Throughput
                  (Minutes:seconds)   (Megabytes)                (Megabytes/second)
Spiralog save     05:20               339                        1.05
Files-11 backup   10:14               297                        0.48

[…] phases of the save operation. During the directory scan phase (typically up to 20 percent of the total elapsed save time), the only tape output is a compact representation of the volume directory graph. In comparison, the segment writing phase is usually bound by the tape throughput rate. In this configuration, the tape is the throughput bottleneck; its maximum raw data throughput is 1.25 MB/s (uncompressed).11

Overall, the Spiralog volume save operation is nearly twice as fast as the Files-11 backup volume save operation in this type of computing environment. Note that the Spiralog savesnap is larger than the corresponding Files-11 saveset. The Spiralog savesnap is less efficient at holding user data than the packed per-file representation of the Files-11 saveset. In many cases, though, the higher performance of the Spiralog save operation more than outweighs this inefficiency, particularly when it is taken into account that the Spiralog save operation can be performed on-line.

File Restore Performance<br />

To determine tile restore performance, we measured<br />

how long it took to restore a single fi le from the<br />

savesets created in the save benchmark tests. The hardware<br />

and software configurations were identical to<br />

those used tor the save measurements. \Ve deleted<br />

a single 3-kilobyte (J(li ) ti le ti·om the source volume<br />

and then restored the ti le. We repeated this operation<br />

nine times, each time measuring the time it took to<br />

restore the fi le. Table 2 shows the results.<br />


Table 2
Performance Comparison of the Spiralog and Files-11 Individual File Restore Operations

Backup System           Elapsed Time (Minutes:seconds)
Spiralog file restore   01:06
Files-11 backup         03:35

The Spiralog backup system achieves such good performance for file restore by using its knowledge of the way the segments are laid out on tape. The file restore process needs to read only those segments required to restore the file; the restore skips the intervening segments using tape skip commands. In the example presented in Figure 8, the restore can skip segments 555 and 195. In contrast, a file-based backup such as Files-11 usually does not have accurate indexing information to minimize tape I/O. Spiralog's tape-skipping benefit is particularly noticeable when restoring small numbers of files from very large savesnaps; however, as shown in Table 2, even with small savesets, individual file restore using Spiralog backup is three times as fast as using Files-11.

Table 3 presents a comparison of the save performance and features of the Spiralog, file-based, and physical backup systems.


Table 3
Comparison of Spiralog, File-based, and Physical Backup Systems

Save performance (the number of I/Os required to save the source volume):
  Spiralog backup system - O(number of segments that contain live data) plus O(number of directories) I/Os
  File-based backup system - O(number of files) I/Os to read the file data plus O(number of directories) I/Os
  Physical backup system - O(size of the disk) I/Os

File restore:
  Spiralog backup system - Yes
  File-based backup system - Yes
  Physical backup system - No

Volume restore:
  Spiralog backup system - Yes, fast
  File-based backup system - Yes
  Physical backup system - Yes, fast but limited to disks of the same size

Incremental save:
  Spiralog backup system - Yes, physical
  File-based backup system - Yes, entire files that have changed
  Physical backup system - No

Note that this table uses "big oh" notation to bound a value. O(n), which is pronounced "order of n," means that the value represented is no greater than Cn for some constant C, regardless of the value of n. Informally, this means that O(n) can be thought of as some constant multiple of n.

Other Approaches and Future Work

This section outlines some other design options we considered for the Spiralog backup system. Our approach offers further possibilities in a number of areas. We describe some of the opportunities available.

Backup and the Cleaner

The benefits of the write performance gains in an LFS are attained at the cost of having to clean segments. An opportunity appears to exist in combining the cleaner and backup functions to reduce the amount of work done by either or both of these components; however, the aims of backup and the cleaner are quite different. Backup needs to read all segments written since a specific time (in the case of a full backup, since the birth of the volume). The cleaner needs to defragment the free space on the volume. This is done most efficiently by relocating data held in certain segments. These segments are those that are sufficiently empty to be worth scavenging for free space. The data in these segments should also be stable in the sense that the data is unlikely to be deleted or outdated immediately after relocation.

The only real benefit that can be exacted by looking at these functions together is to clean some segments while performing backup. For example, once a segment has been read to copy to a savesnap, it can be cleaned. This approach is probably not a good one because it reduces system performance in the following ways: additional processing required in cleaning removes CPU and memory resources available to applications, and the cleaner generates write operations that reduce the backup read rate.


There are two other areas in which backup and the cleaner mechanism interact that warrant further investigation.

1. The save operation copies segments in their entirety. That is, the operation copies both "stale" (old) data and live data to a savesnap. The cost of extra storage media for this extraneous data is traded off against the performance penalty in trying to copy only live data. It appears that the file system should run the cleaner vigorously prior to a backup to minimize the stale data copied.

2. Incremental savesnaps contain cleaned data. This means that an incremental savesnap contains a copy of data that already exists in one of the savesnaps on which it depends. This is an apparent waste of effort and storage space.

It is best to undertake a full backup after a thorough cleaning of the volume. A single strategy for incremental backups is less easy to define. On one hand, the size of an incremental backup is increased if much cleaning is performed before the backup. On the other hand, restore operations from a large incremental backup (particularly selective file restores) are likely to be more efficient. The larger the incremental backup, the more data it contains. Consequently, the chance of restoring a single file from just the base savesnap increases with the size of the incremental backup. Studying the interactions between the backup and the cleaner may offer some insight into how to improve either or both of these components.

A continuous backup system can take copies of segments from disk using policies similar to the cleaner. This is explored in Kohl's paper.12


Separating the Backup Save Operation into a Snapshot and a Copy

The design of the save operation involves the creation of a snapshot followed by the fast copy of the snapshot to some separate storage. The Spiralog version 1.1 implementation of the save operation combines these steps. A snapshot can exist only during a backup save operation.

System administrators and applications have significantly more flexibility if the split in these two functions of backup is visible. The ability to create snapshots that can be mounted to look like read-only versions of a file system may eliminate the need for the large number of backups performed today. Indeed, some file systems offer this feature.6,7 The additional advantage that Spiralog offers is to allow the very efficient copying of individual snapshots to offline media.

Improving the Consistency and Availability of On-line Backup

There are a number of ways to improve application consistency […]

[…] data indexing. The Spiralog backup implementation provides an on-line backup save operation with significantly improved performance over existing offerings. Performance of individual file restore is also improved.

Acknowledgments

We would like to thank the following people whose efforts were vital in bringing the Spiralog backup system to fruition: Nancy Phan, who helped us develop the product and worked relentlessly to get it right; Judy Parsons, who helped us clarify, describe, and document our work; Clare Wells, who helped us focus on the real customer problems; Alan Paxton, who was involved in the early design ideas and later specification of some of the implementation; and, finally, Cathy Foley, our engineering manager, who supported us throughout the project.

References

1. J. Johnson and W. Laing, "Overview of the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 5-14.

2. M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 10, no. 1 (February 1992): 26-52.

3. M. Rosenblum, "The Design and Implementation of a Log-Structured File System," Report No. UCB/CSD 92/696 (Berkeley, Calif.: University of California, Berkeley, 1992).

4. M. Seltzer, K. Bostic, M. McKusick, and C. Staelin, "An Implementation of a Log-Structured File System for UNIX," Proceedings of the USENIX Winter 1993 Technical Conference, San Diego, Calif. (January 1993).

5. K. Walls, "File Backup System for Producing a Backup Copy of a File Which May Be Updated during Backup," U.S. Patent No. 5,163,148.

6. D. Hitz, J. Lau, and M. Malcolm, "File System Design for an NFS File Server Appliance," Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, Calif. (January 1994).

7. S. Chutani, O. Anderson, M. Kazar, and B. Leverett, "The Episode File System," Proceedings of the USENIX Winter 1992 Technical Conference, San Francisco, Calif. (January 1992).

8. C. Whitaker, J. Bayley, and R. Widdowson, "Design of the Server for the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 15-31.

9. OpenVMS System Management Utilities Reference Manual A-L, Order No. AA-PV5PC-TK (Maynard, Mass.: Digital Equipment Corporation, 1995).



Alasdair C. Baird
Alasdair Baird joined Digital in 1988 to work for the ULTRIX Engineering group in Reading, U.K. He is a senior software engineer and has been a member of Digital's OpenVMS Engineering group since 1991. He worked on the design of the Spiralog file system and then contributed to the Spiralog backup system, particularly the file restore component. Currently, he is involved in Spiralog […]



Integrating the Spiralog File System into the OpenVMS Operating System

Digital's Spiralog file system is a log-structured file system that makes extensive use of write-back caching. Its technology is substantially different from that of the traditional OpenVMS file system, known as Files-11. The integration of the Spiralog file system into the OpenVMS environment had to ensure that existing applications ran unchanged and at the same time had to expose the benefits of the new file system. Application compatibility was attained through an emulation of the existing Files-11 file system interface. The Spiralog file system provides an ordered write-behind cache that allows applications to control write order through the barrier primitive. This form of caching gives the benefits of write-back caching and protects data integrity.


Mark A. Howell
Julian M. Palmer

The Spiralog file system is based on a log-structuring method that offers fast writes and a fast, on-line backup capability. The integration of the Spiralog file system into the OpenVMS operating system presented many challenges. Its programming interface and its extensive use of write-back caching were substantially different from those of the existing OpenVMS file system, known as Files-11.

To encourage use of the Spiralog file system, we had to ensure that existing applications ran unchanged in the OpenVMS environment. A file system emulation layer provided the necessary compatibility by mapping the Files-11 file system interface onto the Spiralog file system. Before we could build the emulation layer, we needed to understand how these applications used the file system interface. The approach taken to understanding application requirements led to a file system emulation layer that exceeded the original compatibility expectations.

The first part of this paper deals with the approach to integrating a new file system into the OpenVMS environment and preserving application compatibility. It describes the various levels at which the file system could have been integrated and the decision to emulate the low-level file system interface. Techniques such as tracing, source code scanning, and functional analysis of the Files-11 file system helped determine which features should be supported by the emulation.

The Spiralog file system uses extensive write-back caching to gain performance over the write-through cache on the Files-11 file system. Applications have relied on the ordering of writes implied by write-through caching to maintain on-disk consistency in the event of system failures. The lack of ordering guarantees prevented the implementation of such careful write policies in write-back environments. The Spiralog file system uses a write-behind cache (introduced in the Echo file system) to allow applications to take advantage of write-back caching performance while preserving careful write policies. This feature is unique in a commercial file system. The second part of this paper describes the difficulties of integrating write-back caching into a write-through environment and how a write-behind cache addressed these problems.


Providing a Compatible File System Interface

Application compatibility can be described in two ways: compatibility at the file system interface and compatibility of the on-disk structure. Since only specialized applications use knowledge of the on-disk structure and maintaining compatibility at the interface level is a feature of the OpenVMS system, the Spiralog file system preserves compatibility at the file system interface level only. In the section Files-11 and the Spiralog File System On-disk Structures, we give an overview of the major on-disk differences between the two file systems.

The level of interface compatibility would have a large impact on how well users adopted the Spiralog file system. If data and applications could be moved to a Spiralog volume and run unchanged, the file system would be better accepted. The goal for the Spiralog file system was to achieve 100 percent interface compatibility for the majority of existing applications. The implementation of a log-structured file system, however, meant that certain features and operations of the Files-11 file system could not be supported.

The OpenVMS operating system provides a number of file system interfaces that are called by applications. This section describes how we chose the most compatible file system interface. The OpenVMS operating system directly supports a system-level call interface (QIO) to the file system, which is an extremely complex interface. The QIO interface is very specific to the OpenVMS system and is difficult to map directly onto a modern file system interface. This interface is used infrequently by applications but is used extensively by OpenVMS utilities.

OpenVMS File System Environment

This section gives an overview of the general OpenVMS file system environment, and the existing OpenVMS and the new Spiralog file system interfaces. To emulate the Files-11 file system, it was important to understand the way it is used by applications in the OpenVMS environment. A brief description of the Files-11 and the Spiralog file system interfaces gives an indication of the problems in mapping one interface onto the other. These problems are discussed later in the section Compatibility Problems.

In the OpenVMS environment, applications interact with the file system through various interfaces, ranging from high-level language interfaces to direct file system calls. Figure 1 shows the organization of interfaces within the OpenVMS environment, including both the Spiralog and the Files-11 file systems. The following briefly describes the levels of interface to the file system.

• High-level language (HLL) libraries. HLL libraries provide file system functions for high-level languages such as the Standard C library and FORTRAN I/O functions.

• OpenVMS language-specific libraries. These libraries offer OpenVMS-specific file system functions at a high level. For example, lib$create_dir() creates a new directory with specific OpenVMS security attributes such as ownership.

• Record Management Services. The OpenVMS Record Management Services (RMS) are a set of complex routines that form part of the OpenVMS kernel. These routines […]



• Files-11 file system interface. The OpenVMS operating system has traditionally provided the Files-11 file system for applications. It provides a low-level file system interface so that applications can request file system operations from the kernel. Each file system call can be composed of multiple subcalls. These subcalls can be combined in numerous permutations to form a complex file system operation. The number of permutations of calls and subcalls makes the file system interface extremely difficult to understand and use.

• File system emulation layer. This layer provides a compatible interface between the Spiralog file system and existing applications. Calls to export the new features available in the Spiralog file system are also included in this layer. An important new feature, the write-behind cache, is described in the section Overview of Caching.

• The Spiralog file system interface. The Spiralog file system provides a generic file system interface. This interface was designed to provide a superset of the features that are typically available in file systems used in the UNIX operating system. File system emulation layers, such as the one written for Files-11, could also be written for many different file systems. Features that could not be provided generically, for example, the implementation of security policies, are implemented in the file system emulation layer.

The Spiralog file system's interface is based on the Virtual File System (VFS), which provides a file system interface similar to those found on UNIX systems.7 Functions available are at a higher level than the Files-11 file system interface. For example, an atomic rename function is provided.

Files-11 and the Spiralog File System On-disk Structures

A major difference between the Files-11 and the Spiralog file systems is the way data is laid out on the disk. The Files-11 system is a conventional, update-in-place file system. Here, space is reserved for file data, and updates to that data are written back to the same location on the disk. Given this knowledge, applications could place data on Files-11 volumes to take advantage of the disk's geometry. For example, the Files-11 file system allows applications to place files on cylinder boundaries to reduce seek times.

The Spiralog file system is a log-structured file system (LFS). The entire volume is treated as a continuous log with updates to files being appended to the tail of the log. In effect, files do not have a fixed home location on a volume. Updates to files, or cleaner activity, will change the location of data on a volume. Applications do not have to be concerned where their data is placed on the disk; LFS provides this mapping.

[…] for example, security information. Although few in number, those applications that do call the Files-11 file system directly are significant ones. If the only interface supported was RMS, then these utilities, such as SET FILE and OpenVMS Backup, would need significant modification. This class of utilities represents a large number of the OpenVMS utilities that maintain the file system.

To provide support for the widest range of applications, we selected the low-level Files-11 interface for use by the file system emulation layer. By selecting this interface, we decreased the amount of work needed for its emulation. However, this gain was offset by the increased complexity in the interface emulation. Problems caused by this interface selection are described in the next section.

Interface Compatibility

Once the file system interface was selected, choices had to be made about the level of support provided by the emulation layer. Due to the nature of the log-structured file system, described in the section Files-11 and the Spiralog File System On-disk Structures, full compatibility of all features in the emulation layer was not possible. This section discusses some of the decisions made concerning interface compatibility.


An initial decision was made to support documented low-level Files-11 calls through the emulation layer as often as possible. This would enable all well-behaved applications to run unchanged on the Spiralog file system. Examples of well-behaved applications are those that make use of HLL library calls. The following categories of access to the file system would not be supported:

• Those directly accessing the disk without going through the file system
• Those making use of specific on-disk structure information
• Those making use of undocumented file system features

A very small number of applications fell into these categories. Examples of applications that make use of on-disk structure knowledge are the OpenVMS boot code, disk structure analyzers, and disk defragmenters.

The majority of OpenVMS applications make file system calls through the RMS interface. Using file system call-tracing techniques, described in the section Investigation Techniques, a full set of file system calls made by RMS could be constructed. After analysis of this trace data, it was clear that RMS used a small set of well-structured calls to the low-level file system interface. Further, detailed analysis of these calls showed that all RMS operations could be fully emulated on the Spiralog file system.

The support of OpenVMS file system utilities that made direct calls to the low-level Files-11 interface was important if we were to minimize the amount of code change required in the OpenVMS code base. Analysis of these utilities showed that the majority of them could be supported through the emulation layer. Very few applications made use of features of the Files-11 file system that could not be emulated. This enabled a high number of applications to run unchanged on the Spiralog file system.

Table 1
Categorization of File System Features

Supported. The operation requested was completed, and a success status was returned.
  Example - Requests to create a file or open a file.
  Notes - Most calls made by applications belong in the supported category.

Ignored. The operation requested was ignored, and a success status was returned.
  Example - A request to place a file in a specific position on the disk to improve performance.
  Notes - This type of feature is incompatible with a log-structured file system. It is very infrequently used and not available through HLL libraries. It could be safely ignored.

Not supported. The operation requested was ignored, and a failure status was returned.
  Example - A request to directly read the on-disk structure.
  Notes - This type of request is specific to the Files-11 file system and could be allowed to fail because the application would not work on the Spiralog file system. It is used only by a few specialized applications.

Compatibility Problems

This section describes some of the compatibility problems that we encountered in developing the emulation layer and how we resolved them.

When considering the compatibility of the Spiralog file system with the Files-11 file system, we placed the features of the file system into three categories: supported, ignored, and not supported. Table 1 gives examples and descriptions of these categories. A feature was recategorized only if it could be supported but was not used, or if it could not be easily supported but was used by a wide range of applications.

The majority of OpenVMS applications make supported file system calls. These applications will run as intended on the Spiralog file system. Few applications make calls that could be safely ignored. These applications would run successfully but could not make use of these features. Very few applications made calls that were not supported. Unfortunately, some of these applications were very important to the success of the Spiralog file system, for example, system management utilities that were optimized for the Files-11 system.

Analysis of applications that made unsupported calls showed the following categories of use:

• Those that accessed the file header, a structure used to store a file's attributes. This method was used to return multiple file attributes in one call. The supported mechanism involved an individual call for each attribute. This was solved by returning an emulated file header to applications that contained the majority of information interesting to applications. (A sketch of this approach follows this list.)

• Those reading directory files. This method was used to perform fast directory scans. The supported mechanism involved a file system call for each name. This was solved by providing a bulk directory reading interface call. This call was similar to the getdirentries() call on the UNIX system and was



straightforward to replace in applications that directly read directories.

The OpenVMS Backup utility was an example of a system management utility that directly read directory files. The backup utility was changed to use the directory reading call on Spiralog volumes.

• Those accessing reserved files. The existing file system stores all its metadata in normal files that can be read by applications. These files are called reserved files and are created when a volume is initialized. No reserved files are created on a Spiralog volume, with the exception of the master file directory (MFD). Applications that read reserved files make specific use of on-disk structure information and are not supported with the Spiralog file system. The MFD is used as the root directory and performs directory traversals. This file was virtually emulated. It appears in directory listings of a Spiralog volume and can be used to start a directory traversal, but it does not exist on the volume as a real file.
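As a concrete illustration of the first resolution above (the emulated file header), the sketch below shows the general shape of building such a header from individual attribute requests. The attribute names, the spiralog_get_attr call, and the header fields are invented for the example and are not the actual emulation-layer interfaces.

    #include <stdint.h>

    typedef uint64_t fileid_t;

    /* A few of the attributes an emulated header might carry (illustrative). */
    struct emulated_file_header {
        fileid_t fileid;
        uint64_t allocated_bytes;
        uint64_t used_bytes;
        uint32_t owner_uic;
        uint32_t protection;
        uint64_t creation_time;
        uint64_t revision_time;
    };

    /* Assumed per-attribute query into the Spiralog file system; not a real call. */
    enum attr { ATTR_ALLOC, ATTR_USED, ATTR_OWNER, ATTR_PROT,
                ATTR_CREATED, ATTR_REVISED };
    extern uint64_t spiralog_get_attr(fileid_t f, enum attr a);

    /* Build a Files-11 style header by issuing one supported attribute
     * call per field, so an application that expects a whole header in
     * a single request still receives one.                              */
    static struct emulated_file_header build_header(fileid_t f)
    {
        struct emulated_file_header h = { .fileid = f };
        h.allocated_bytes = spiralog_get_attr(f, ATTR_ALLOC);
        h.used_bytes      = spiralog_get_attr(f, ATTR_USED);
        h.owner_uic       = (uint32_t)spiralog_get_attr(f, ATTR_OWNER);
        h.protection      = (uint32_t)spiralog_get_attr(f, ATTR_PROT);
        h.creation_time   = spiralog_get_attr(f, ATTR_CREATED);
        h.revision_time   = spiralog_get_attr(f, ATTR_REVISED);
        return h;
    }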

Investigation Techniques

This section describes the approach taken to investigate the interface and compatibility problems described earlier. It focused on understanding how applications called the file system and the semantics of the calls. A number of techniques were used in lieu of design documentation […] alerting the other maintainers about the impact of the Spiralog file system. More than 2,000 surveys were sent out, but fewer than 25 useful results were returned. Sadly, most application maintainers were not aware of how their product used the file system.

• Automated applic:tion source code sc:rching<br />

Autom:1tcd source code searching quickly checks<br />

a large amount ofsourcc code. This technique 1vas<br />

most usdi.d when :tn:lyzing lilc svstcm calls m:de bv<br />

the OpenYt'viS opcDting system or utilities. However,<br />

this does not 1\'0rk we ll \\'hen :pplic:1tions<br />

make dvnamic oils to the ti le wstcm


Ta ble 2<br />

Verification of Compatibility<br />

Test Suite<br />

RMS regression tests<br />

OpenVMS regression tests<br />

Files-1 1 compatibility tests<br />

C2 security test suite<br />

C language tests<br />

FORTRAN ianguage tes<br />

<strong>Number</strong> of Te sts<br />

-500<br />

-100<br />

-100<br />

-50 discrete tests<br />

-2,000<br />

-100<br />

Write-back caching provides very difterent semantics<br />

to the model of write-through caching used on the<br />

Files-l l ti le syste m. The goal of the Spiralog project<br />

members was to provide write- back caching<br />

in a way that was compatible with existing Open VMS<br />

applications.<br />

This section compares write-through and write-back<br />

caching and shows how some important OpenVMS<br />

applicatiom rely on write-through semantics to pro­<br />

tect data ti·om system fa ilure. It describes the ordered<br />

wri te-back cache as introduced in the Echo file system<br />

and exp la ins how this model of caching (known as<br />

write-hehind caching) is particularly suited to the envi­<br />

ronment of Open VMS Cluster systems and the<br />

Spira log log-structured ti le svstem.<br />

Overview of Caching<br />

During the last kw years, CPU pcrf{mnance improvements<br />

have continued to ou tpace performance<br />

improvements t( lr disks. As a result, the 1/0 bottle­<br />

neck has worsc iH.:d rather than improved. One of<br />

the most succcssfi.d tech niques used to alleviate this<br />

problem is caching. Caching means holding a copy of<br />

data that has been recently read ri·om , or written to,<br />

the disk in memory, giving appl ications access to that<br />

data at memory speeds rather than at disk speeds.<br />

VVrite-through and write- back caching are two<br />

difrerent models h·cqucnrly used in ti le srstems.<br />

• Write-through caching. In a write-through cache,<br />

data read ti·om the disk is stored in the in-memory<br />

cache. When data is written, a copy is placed in<br />

the cache, but the write request does not return<br />

until the data is on the disk. Wri te-through caches<br />

improve the pcrtormance of read requests but not<br />

write requests.<br />

• Write- back caching. A write-back cache improves<br />

the pert(ml1


.'i 2<br />

Figure 2<br />

Caching Models<br />

NO CACHE<br />

WRITE-BACK<br />

CACHE<br />

I I ..._<br />

MICROSECONDS<br />

Caching is more important to the Spira log ti lt<br />

system than it is to convcntionll ti le systems. Log­<br />

structu red ti le systems have inherently worse read<br />

pertonn1ncc rhan com·enrional, update-in-place ti le<br />

svstems, due to the need to locate the data in the log.<br />

As described in another paper in rhisjour11a/. locating<br />

data in the l og requires more disk I/Os than ;1 11<br />

update-in-place ti le system 2 The Spiralog tile system<br />

uses large read caches to other this extra read cost.<br />

Careful Writing<br />

The Filcs-1 1 ti le system prm·ides \\'rite -through<br />

semantics. Key Open VMS applications such as tr:msac­<br />

tion processing and the OpenVJ'v!S Record Man::tge­<br />

ment Services ( RMS) have come to rely on the implicit<br />

orderi ng of write-through. Thcv use a technique<br />

kno\\'n as c::trcful writing to pre\'(:llt data corruption<br />

fo llo\\'ing l S\'Stem fa ilure.<br />

C1reful \\'riring allo\\'s an 1 pplicnion to ensure rh:n<br />

rhe data on rhe disk is never in :tn inconsistent or<br />

invalid st:ttc. This guarantee


I I I A I B<br />

I t t<br />

I I I<br />

Figure 3<br />

r: x1mplc of a Careti.il Write<br />

Write-behind Caching<br />

A I B<br />

----------<br />

I t t<br />

J I A B I<br />

---L--------L----L.-...J. ..<br />

I<br />

A'<br />

__ . ...J. .. .<br />

,<br />

A careful write policy relics totally on being able to<br />

control the order of writes to the disk. This Gmnot be<br />

achieved on a write-back cache because the write-back<br />

method docs not preserve the order of write requests.<br />

Reordering writes in a write- back cache would risk cor­<br />

rupting the data that applications using careful writing<br />

were seeking to protect. This is untclrtunatc because<br />

the pcrtcm11ancc benefits of deferring the write to the<br />

disk Jrc compatible with a careful write polin•. Cm.:ful<br />

writing docs not need to know when the data is written<br />

to the disk, on ly the order it is wJittcn .<br />

To allow these applications to gain the pcrt(mnancc<br />

of the write-back cache bur still prorcct their data on<br />

disk, the Spiralog ti le system uses a variation on write­<br />

back oching known as write-behind caching. Intro­<br />

duced in the Echo tile system, write-behind e1ching is<br />

essentially write-back caching with ordering guaran­<br />

tees.' The cache allows the application to spccit)r which<br />

writes must be ordered and the order in which they<br />

must be written to the disk.<br />

This is achieved bv providing the barrier primitive to<br />

applications. Barrier defines an order or dcpendencv<br />

between write operations. For example, consider the<br />

diagram in Figure 4: Here, writes arc represented as<br />

a time-ordered queue, with later writes being added<br />

Figure 4<br />

B


54<br />

...<br />

I ..,. ....<br />

.<br />

I _____ ....<br />

I •<br />

.... .... ..,<br />

Figure 5<br />

Ex


closed . The semantics of fu ll write-behind caching<br />

arc best suited to applications that can easily regen­<br />

erate lost data at any time. Temporary ti les from a<br />

compiler arc a good example. Should the system<br />

fa il, the compilation can be restarted without any<br />

loss of data.<br />

3. Flush-on-close caching policy. The fl ush-on-close<br />

policy provides a restricted level of write-behind<br />

caching f(x applications. Here, all updates to the file<br />

are treated as write behind, but when the file is<br />

closed, all changes are forced to the disk. This gives<br />

the performance of write-behind but, in addition,<br />

provides a known point when the data is on the disk.<br />

This torm of caching is particularly suitable tor appli­<br />

cations that can easily re-create data in the event of<br />

a system crash but need to know that data is on the<br />

disk at a specific ti me. For example, a mail store-and­<br />

forward system receiving an incoming message must<br />

know the data is on the disk when it acknowledges<br />

receipt of the message to the forwarder. Once the<br />

acknowledgment is sent, the message has been for­<br />

mally passed on, and the forwarder may delete its<br />

copy. In this example, the data need not be on the<br />

disk until that acknowledgment is sent, because that<br />

is the point at which the message receipt is commit­<br />

ted . Should the system fail before the acknowledg­<br />

ment is sent, all dirty data in the cache would be lost.<br />

In that event, the sender can easily re-create the data<br />

by sending the message again.<br />

Figure 6 shows the results of a performance com­<br />

parison of the three caching policies. The test was run<br />

on a dual-CPU DEC 7000 Alpha system with 384<br />

megabytes of memory on a RAID-S disk. The test<br />

repeated the fo llowing sequence tor the different fi le<br />

SIZeS.<br />

l. Create and open a file of the required size and set<br />

its caching policy.<br />

2. Write data to the whole ti le in l ,024-byte 1/0s.<br />

3. Close the ti le.<br />

4. Delete the ti le.<br />

With small files, the number of file operations (create,<br />

close, delete) dominates. The leftmost side of the<br />

graph therefore shows the time per operation for tile<br />

operations. vVith time, the files increase in size, and the<br />

data 1/0s become prevalent. Hence, the rightmost<br />

side of Figure 6 is displaying the time per operation fo r<br />

data ljOs.<br />

Figure 6 clearly shows that an ordered write-behind<br />

cache provides the highest performance of the three<br />

caching models. For file operations, the write-behind<br />

cache is almost 30 percent faster than the write­<br />

through cache. Data operations are approximately<br />

three times faster than the corresponding operation<br />

with write-through caching.<br />

(jJ<br />

0<br />

z<br />

0<br />

w<br />

'!]_<br />

z<br />

Q<br />

f- :<br />

0:<br />

w<br />

o._<br />

0<br />

0:<br />

w<br />

o._<br />

w<br />

<br />

f-<br />

KEY<br />

0.156<br />

0.138<br />

0.121<br />

0.104<br />

0.086<br />

0.069<br />

0.052<br />

0.035<br />

0.017<br />

0.000<br />

\<br />

\<br />

'<br />

'<br />

- - - - - - -<br />

1,024 2,048 4,096 8.192 16,384 32,768<br />

WRITE-BEHIND CACHE<br />

FLUSH-ON-CLOSE CACHE<br />

WRITE-THROUGH CACHE<br />

FILE SIZE (BYTES)<br />

Figure 6<br />

Performance Comparison of Caching Policies<br />

Summary and Conclusions<br />

The task of integrating a log-structu red ti le system<br />

into the Open VMS environment was a significant<br />

challenge tor the Spiralog project members. Our<br />

approach of carefully determining the interface to<br />

emulate and the level of compatibility was important<br />

to ensure that the majority of applications worked<br />

unchanged .<br />

vVe have shown that an existing update-in-place tile<br />

system can be replaced by a log-structured ti le system.<br />

Initial effo rt in the analysis of application usage fur­<br />

nished intormatjon on interface compatibility. Most<br />

fi le system operations can be provided through a fi le<br />

system emulation layer. Where necessary, new inter­<br />

faces were provided for applications to replace their<br />

direct knowledge of the Files-1 1 file system.<br />

File system operation tracing and fu nctional analysis<br />

of the Files-1 1 fi le system proved to be the most usefld<br />

techniques to establish interface compatibility. Appli­<br />

cation compatibility far exceeds the level expected<br />

when the project was started. A majority of people use<br />

the Spiralog ti le system volumes without noticing any<br />

change in their application's behavior.<br />

Careful write policies rely on the order of updates<br />

to the disk. Since write-back caches reorder write<br />

requests, applications using careful writing have been<br />

unable to take advantage of the significant improve­<br />

ments in write performance given by write-back<br />

caching. The Spiralog tile system solves this problem<br />

by providing ordered write-back caching, known as<br />

write-behind. The write-behind cache allows applica­<br />

tions to control the order of writes to the disk through<br />

a primitive called barrier.<br />

Using barriers, applications can build careful write<br />

policies on top of a write-behind cache, gaining all the<br />

performance of write-back caching without risking<br />

<strong>Digital</strong> <strong>Technical</strong> )ournall Vol. 8 No. 2 <strong>1996</strong> 55


56<br />

d:u integritY. A \\Ti re - behind cache also 1 llows the rile<br />

system itself ro gain write-back ped(>nllr their help through·<br />

out the project, and Karen Howell, Morag Currie, and<br />

all those who helped with this paper. fi ll


Extending OpenVMS<br />

for 64-bit Addressable<br />

Virtual Memory<br />

The OpenVMS operating system recently<br />

extended its 32-bit virtual address space to<br />

exploit the Alpha processor's 64-bit virtual<br />

addressing capacity while ensuring binary<br />

compatibility for 32-bit non privileged pro­<br />

grams. This 64-bit technology is now available<br />

both to Open VMS users and to the operating<br />

system itself. Extending the virtual address<br />

space is a fundamental evolutionary step for<br />

the OpenVMS operating system, which has<br />

existed within the bounds of a 32-bit address<br />

space for nearly 20 years. We chose an asym­<br />

metric division of virtual address extension that<br />

allocates the majority of the address space to<br />

applications by minimizing the address space<br />

devoted to the kernel. Significant scaling issues<br />

arose with respect to the kernel that dictated<br />

a different approach to page table residency<br />

within the OpenVMS address space. The paper<br />

discusses key scaling issues, their solutions,<br />

and the resulting layout of the 64-bit virtual<br />

address space.<br />

I chaei S. llarvey<br />

Leonard S. Szubowicz<br />

The OpcnVMS Alpha operating system initially supported<br />

a 32-bit virtual address space that maximized<br />

compatibility for Open VMS VAX users as they ported<br />

their applications from the VAX platform to the Alpha<br />

plattonn. Providing access to the 64-bit virtual memory<br />

capability detlned by the Alpha architecture was<br />

always a goal tor the Open VMS operating system. An<br />

early consideration was the eventual usc of this techno<br />

logy to enable a transition from a purely 32-bitoriented<br />

context to a purely 64-bit-oricnted native<br />

context. OpenVMS designers recognized that such<br />

a fu ndamental transition tor the operating system,<br />

along with a 32-bit VAX compatibility mode support<br />

environment, would take a long time to implement<br />

and could seriously jeopardize the migration of applications<br />

from the VAX platform to the Al pha platform.<br />

A phased approach was called tor, by which the operating<br />

system could evolve over time, allowing tor quicker<br />

time-to-market for significant features and better, more<br />

timely support for binary compatibility.<br />

In 1989, a strategy emerged that defined two fundamental<br />

phases of Open VMS Alpha development. Phase<br />

1 would deliver the Open VMS Alpha operating system<br />

initiaUy with a virtual address space that fa ithfully replicated<br />

address space as it was defined by the VAX architecture.<br />

This familiar 32-bit environment would case<br />

the migration of applications from the VAX platform<br />

to the Alpha platform and wou ld case the port of the<br />

operating system itself. Phase l, the OpenVMS Alpha<br />

version 1.0 product, was delivered in 1992.1<br />

For Phase 2, the Open VMS operating system would<br />

successfully exploit the 64-bit virtual address capacity<br />

of the Alpha architecture, laying the groundwork<br />

tor fu rther evolution of the OpenVMS system. In<br />

1989, strategists predicted that Phase 2 could be delivered<br />

approximately three years after Phase 1. As<br />

planned, Phase 2 culminated in 1995 with the delivery<br />

of Open VMS Alpha version 7.0, the first version of<br />

the OpenVMS operating system to support 64-bit<br />

virtual addressing.<br />

This paper discusses how the OpenVMS Alpha<br />

Operating System Development group extended the<br />

OpenVMS virtual address space to 64 bits. Topics<br />

covered include compatibility fo r existing applications,<br />

the options for extending the address space, the<br />

<strong>Digital</strong> Tt"chnical Journal Vol . 8 No. 2 19Y6 57


58<br />

strategy tor page table residency, and the fi nal layout of<br />

the Open VMS 64-bit virtual address space. In imple­<br />

menting support tor 64-bit virtual addresses, design ­<br />

ers maximized privileged code compatibility; the paper<br />

presents some key measures taken to this end r a given process, process­<br />

permanent code, and various process-specific<br />

control cells.<br />

• The higher-addressed half ( 2 GB) of virtual address<br />

space is detlned to be shared by all processes. This<br />

shared space is where tbe Open VMS operating sys­<br />

tem kernel resides. Although the VAX architecture<br />

divides this space into a pair of separately named<br />

1-GB regions (SO space and Sl space), the Open VMS<br />

Alpha operating system makes no material distinc­<br />

tion between the two regions and reters to them<br />

collectively


PROCESS<br />

PRIVATE<br />

(2 GB)<br />

SHARED<br />

SPACE<br />

(2 GB)<br />

00000000.00000000<br />

00000000. 7FFFFFFF<br />

•JOOOilOOO HOOOOOOO<br />

I<br />

I<br />

I<br />

/<br />

/ //1<br />

PO SPACE<br />

P1 SPACE<br />

UNREACHABLE WITH<br />

32-BIT POINTERS<br />

12 2 't BYTES<br />

Ff r f 7FFF"FFF : --------------<br />

FFFFFFFF .80000000<br />

Figure 1<br />

Open VMS Alpha 32-bit Virtual Address Space<br />

SO/S1 SPACE<br />

FFFFFFFF.FFFFFFFF L--- -- -- -- -- -1<br />

privileged code compatibility and ensuring Elster timeto-market.<br />

However, analysis of this option showed<br />

that the1-c were enough significant portions of the kernel<br />

requiring change that, in practice, very little additional<br />

privileged code compatibility, such as tor<br />

drivers, would be achievable. Also, this option did not<br />

address certain important problems that are specific to<br />

shared space, such as limitations on the kernel's capacity<br />

to manage ever-larger, very large memory (VLM)<br />

systems in the future.<br />

We decided to pursue the option of a fl at, superset<br />

64-bit virtual address space that provided extensions<br />

f(x both the shared and the process-private portions of<br />

the sp:1ce that a given process could reference. The<br />

new, extended process-private sp:1ce, named P2 space,<br />

is adjacent to Pl space and extends toward higher<br />

virtual addresses."5 The new, extended shared space,<br />

named 52 space, is adjacent to 50/51 space and<br />

extends toward lower virtual addresses. P2 and 52<br />

spaces grow toward each other.<br />

A remaining design problem was to decide where<br />

P2 and 52 would meet in the address space layout.<br />

A simple approach would split the 64-bit address<br />

space exactly in halt symmetrically scaling up the<br />

design of the 32-bit address space already in place.<br />

(The address space is split in this way by the <strong>Digital</strong><br />

UNIX operating system 3) This solution is easy to<br />

explain because, on the one hand, it extends the 32-bit<br />

convention that the most significant address bit can be<br />

tre


60<br />

Virtual Address-to-Physical Address Tra nslation<br />

The Alpha architecture allows an implementation to<br />

choose one of the fo l lowing tour page sizes: 8 kilobytes<br />

(KB), 16 KB, 32 KB, or 64 KB.' The architecture<br />

also ddines a multilevel, hierarchical page table structure<br />

for virtual address-to-physical address (VA-to­<br />

PA ) translations. Al l OpenVMS Alpha platforms have<br />

implemented a page size of 8 KB and three levels<br />

in this page table structure. Although throughout<br />

this paper we assume a page size of 8 KB and three<br />

levels in the page table hierarchy, no loss of generality<br />

is incurred by this assumption.<br />

Figure 2 illustrates the VA-to-PA translation<br />

sequence using the multilevel page table structure.<br />

l. The page table base register (PTBR) is a per-process<br />

pointer to the highest level (Ll) of that process'<br />

page table structure. At the highest level is one<br />

8-KB page (LlPT) that contains 1,024 page table<br />

entries (PTEs) of 8 bytes each. Each PTE at the<br />

highest page table level (that is, each L l PTE) maps<br />

a page table page at the next lower level in the tr;.Jnslation<br />

hierarchy (the L2PTs ).<br />

2. The Segment 1 bit field of a given virtual address<br />

is an index into the Ll PT that selects a particular<br />

Lll'TE, hence selecting a specific L2PT tor the next<br />

stage of the translation.<br />

3. The Segment 2 bit field of the virtual address<br />

then indexes into that L2PT to select an L2PTE,<br />

VIRTUAL<br />

ADDRESS<br />

63<br />

SIGN EXTENSION<br />

OF SEGMENT 1<br />

PAGE TABLE<br />

BASE REGISTER<br />

Figure 2<br />

Virtual Address-to-Physical Address Translation<br />

I<br />

42<br />

SEGMENT 1<br />

L1 PT<br />

<strong>Digital</strong> Tec hnical )oumal Vo l. 8 No. 2 <strong>1996</strong><br />

hence selecting ;1 specific L3PT tor the next stage<br />

ofthe translation.<br />

4. The Segment 3 bit tield of the virtual address then<br />

indexes into that l..3PT to select an L3PTE, hence<br />

selecting a specific 8-KB code or data page.<br />

5. The byte-within-page bit f-ield of the virtual address<br />

then selects a specific byte address in that page.<br />

An Alpha implementation may increase the page<br />

size and/or number of levels in the page t:ble hierarchy,<br />

thus mapping greater amounts of virtual space up<br />

to the fu ll 64-bit amount. The assumed combin:tion<br />

of8-KB pJge size and three levels of page table allows<br />

the system to map up to 8 teralwtes (TB) (i.e., 1,024<br />

X 1,024 X l ,024 X 8 KR = 8 TB) of virtual memorv<br />

fo r a single process.<br />

To map the entire 8-TB Jddress space available to a<br />

single process requires up to 8GB ofPTEs (i.e., 1,024<br />

X l ,024 X I ,024 X 8 bytes = 8 GB ). This bet alone<br />

presents a serious sizing issue f(>r the Open VMS operating<br />

system. The 32-bit page table residency model<br />

that the Open VMS operating system ported ti·om the<br />

VA,'{ plattorm to the Alpha platform does not have<br />

the capacity to support such large page tables.<br />

Page Tables: 32-bit Residency Model<br />

We stated earlier thJt materializing a 32-bit virtual<br />

address space


system tl·om the V fv'


62<br />

The effect of this self-mapping on the VA-to-PA<br />

translation sequence (shown in Figure 2) is subtle but<br />

important.<br />

• For those virtual addresses with a Segment l bit<br />

field value that selects the self-mapper Ll PTE, step<br />

2 of the VA-to-PA translation sequence reselects<br />

the Ll PT as the effective L2PT ( L2PT') tor the<br />

next stage of the translation.<br />

• Step 3 indexes into L2PT' (the Ll PT) using the<br />

Segment 2 bit field value to select an L3 PT '.<br />

• Step 4 indexes into L3PT' (an L2PT) usi ng the<br />

Segment 3 bit field value to select a specific data<br />

page.<br />

• Step 5 indexes into that data page (an L3 PT) using<br />

the byte-within-page bit field of the virtual address<br />

to select a specific byte address within that page.<br />

When step 5 of the VA-to-PA translation sequence<br />

is finished, the final page being accessed is itself one of<br />

the level 3 page table pages, not a page that is mapped<br />

PTBR<br />

L 1 PT'S PFN · <br />

KEY:<br />

PTBR<br />

PFN<br />

PTE<br />

..<br />

Figure 4<br />

Eftect of Page Table Selfmap<br />

L1PT<br />

_,,.r--- -- -- -r<br />

PTE #1022 L1PT'S PFN<br />

PAGE TABLE BASE REGISTER<br />

PAGE FRAME NUMBER<br />

PAGE TABLE ENTRY<br />

L--- -- -- -1. . .<br />

Digiral <strong>Technical</strong> Journal Vol. R No. 2 <strong>1996</strong><br />

by a level 3 page table p::ge. The self-map operation<br />

places the entire 8-GB page table structure at the end<br />

of the VA-to-PA translation sequence fo r a specific<br />

8-GB portion of the process' address space. This virtual<br />

space that contains all of a process' potential page<br />

tables is called page table space (PT space)."<br />

Figure 4 depicts the effect of self-mapping the page<br />

tables. On the left is the highest-level page table<br />

page containing a fixed number of PTEs. On the right<br />

is the virtual address space that is mapped by that page<br />

table page. The mapped address space consists of a collection<br />

of identically sized, contiguous address range<br />

sections, each one mapped by a PTE in the corresponding<br />

position in the highest-level page table page.<br />

(For clarity, lower levels of the page table structure arc<br />

omitted from the llgure.)<br />

Notice that LlPTE # 1022 in Figure 4 was chosen to<br />

map the high-level p::ge table page that contains that<br />

PTE. (The reason f()r this particular choice will<br />

be explained in the next section. Theoretically, any one<br />

64-BIT ADDRESSABLE<br />

VIRTUAL ADDRESS SPACE<br />

PO/P1<br />

PT SPACE f<br />

00000000.00000000<br />

8-GB #0<br />

1,020 X 8 GB<br />

8-GB OW22<br />

-_-_ \ :,:::::3,,mm<br />

t-_-__ -_-_s-o;-_s_1 _


of the Ll PTEs could have been chosen as the selfmapper.)<br />

The section of virtual memory mapped by<br />

the chosen Ll PTE contains the entire set of page<br />

tables needed to map the available address space of<br />

a given process. This section of virtual memory is PT<br />

space, which is depicted on the right side of Figure 4<br />

in the 1 ,022d 8-GB section in the materialized virtual<br />

address space.<br />

The base address fo r this PT space incorporates the<br />

i ndex of the chosen self mapper LlPTE ( 1,022 =<br />

3FE( l6)) as fo llows (see Figure 2):<br />

Segment 1 bit field = 3FE<br />

Segment 2 bit field = 0<br />

Segment 3 bit field = 0<br />

Byte within page = 0,<br />

which result in<br />

VA = FFFFFFFC.OOOOOOOO<br />

(also known as PT_Base).<br />

Figure 5 illustrates the exact contents of PT space<br />

for a given process. One can observe the positional<br />

effect of choosing a particular high-level PTE to self<br />

map the page tables even within PT space. In Figure 4,<br />

the choice of PTE for selfmapping not only places PT<br />

space as a whole in the l ,022d 8-GB section in virtual<br />

memory but also means that<br />

Figure 5<br />

Page Table Space<br />

PAGE TABLE<br />

SPACE (8 GB)<br />

PT_BASE:<br />

L2_BASE:<br />

L1_BASE:<br />

NEXT-LOWER 8 GB<br />

L2PT<br />

----------------------<br />

1.02 1 L2PTs<br />

----------------------<br />

L1PT<br />

---- -- -- -- --- ---------<br />

L2PT<br />

NEXT-HIGHER 8GB<br />

• The 1 ,022d grouping of the lowest-level page<br />

tables (L3PTs ) within PT space is actually the collection<br />

of next-higher-level PTs ( L2PTs) that map<br />

the other groupings ofL3PTs, beginning at<br />

Segment l bit field = 3FE<br />

Segment 2 bit field = 3FE<br />

Segment 3 bit field = 0<br />

Byte within page = 0,<br />

which result in<br />

VA = FFFFFFFD.FFOOOOOO<br />

(also known as L2_Base).<br />

• Within that block of L2PTs , the 1 ,022d L2 PT is<br />

actually the next-higher-level page table that maps<br />

the L2PTs, namely, the Ll PT. The Ll PT begins at<br />

Segment I bit field = 3FE<br />

Segment 2 bit field = 3FE<br />

Segment 3 bit field = 3FE<br />

Byte within page = 0,<br />

which result in<br />

VA = FFFFFFF D.FF7FCOOO<br />

(also known as Ll_Base).<br />

• Within that Ll PT, the l,022d PTE is the one used<br />

fo r self-mapping these page tables. The address of<br />

the self-mapper Ll PTE is<br />

l<br />

1 ,024 L3PTs<br />

1,021 x (1 ,024 L3PTs)<br />

1,024 L2PTs<br />

l<br />

1,024 L3PTs<br />

Digit:JI T


64<br />

Stgmem l bit rield = 3FE<br />

Segmellt 2 bit tleld = 3FE<br />

Segment 3 bit rield = 3rE<br />

Bvte within page = 3FE X 8<br />

,,·hich result in<br />

VA = FrHHFD.FF7FDHO .<br />

This positional correspondence within PT space is preserved<br />

should a difrcrenr high-b·el PTE be chosen r(>r<br />

self-mapping the p:ge tables.<br />

The properties inherent in this sclfmapped page<br />

tabk Jrc compelling.<br />

• The amount of virtual memory reserved is ex:ctly<br />

the amount required fo r mapping the page ta bles,<br />

regard less of page size or page t:blc depth.<br />

Consider the stgmem-numbered bit fi elds of :1<br />

given virtual address ri·om Figure 2. Concatenated ,<br />

these bit ticlds constitute the \'irtual page number<br />

(VPN) portion of a gi,·en virtual address.<br />

The tor: I size of the PT space needed to map n·en·<br />

Vl'N is the number of possible VPNs times 8 l)\'tes,<br />

the size of :1 PTE. The total size of the address<br />

space mapped bv that PT space is the number of<br />

possible VPNs times the page size. Factoring<br />

out the VPN multiplier, the diftc rence between<br />

these is the page size divided bv 8, which is cxactlv<br />

the size of the Segment 1 bit fi eld in the \'irtual<br />

:ddress. Hence, Jli the space mapped Lw 1<br />

single PTE Jt the highest level of p:1ge table is<br />

exacrly the size required for mapping all the I'TEs<br />

that could ever be needed to map the process'<br />

1ddress sp:ce.<br />

• The mapping of PT space involves simplv choosing<br />

one ohhe highest-b·el PTEs and forcing it to<br />

selr:map.<br />

• No addition:! system tuning or coding is required<br />

to :-ccommodatt a mort \\'idely implemented<br />

\irtLd address width in PT space. Bv ddinition of<br />

rhe sclfm:1p cftecr, rhe cxacr amount of ,·irtu1l<br />

address space required will be a,·ailable, no more<br />

and no less.<br />

• lr is easy to locate a given PTE. The add rcss of<br />

:1 PTE becomes an efficient function of the address<br />

that rhc PTE m:1ps. The fu nction first clc1rs<br />

the byte-within-page bit field of the subject virttd<br />

address and then shifts the remaining virru11<br />

J.ddress bits such that rhe Segments 1, 2, and 3 bit<br />

ricld values (Figure 2) now reside in the cmTcsponding<br />

next-lower bit ricld positions. The fu nction<br />

then writes (and sign extends if necessary)<br />

the ,.1c1ted Segment l ticld with the index of<br />

the selfmapper PTE. The result is the address<br />

ot· the r'TE that maps the original \'irtual address.<br />

Note that this algorithm also ,,·orks ror addresses<br />

Dig:ir.tl '1\:dmicJI )uunlJ! Vo l. 8 No. 2 I YY6<br />

\\'ithin PT space, including that of the selfmappn<br />

PTE itsclt'.<br />

• Process page L1blc residcnc1· in ,·irtu1l 111emon· is<br />

:chie,'Cd ,,·ithout imposing on the c:1pacin· ot'<br />

sh:rcd space. Th:-r is, there is no longer need ro<br />

1111p the process page tables into shared space. Such<br />

a 1111pping \\'Ould be red undant and \\'lsteful.<br />

Open VMS 64-bit Virtual Address Space<br />

With this page table residency strategy in h:nd, it<br />

became possible ro ti nalize a 64-bit virtu: I address layour<br />

r(>r the Open VtviS operating system. A self mapper<br />

I'TE had to be chosen. Consider ag:in the highest b·el<br />

of page table in a given process' page table structure<br />

(Figure 4 ). The first PTE in rh:u page table 1mps a section<br />

ofvirru1l memorv that includes PO and PI spaces.<br />

TIJis PTE \\'JS rherdore una,·aibblc tc>r usc as a sclt<br />

mapptr. The last PTE in that page t:-blc n11ps a section<br />

ohirtual memorv that includes SO/S I space. This I'TI-'.<br />

was :!so una,·aillbk tor se lt:mapping purposes.<br />

All the intu,·cning high-Jn·el PTEs ,,·ere potential<br />

choices tr selt:mapping the page rabies. To mnimizc<br />

the size of pmccss-pri\ ate space, the COITCCt choice<br />

is the ncxt-lmn::r PTE than the one rh:t maps the lo\\·est<br />

:ddress in sh;H·ed sp:ce.<br />

This choice is implemented 1s a boor-rime algorithm.<br />

Bootstrap code Erst dercnnines the size<br />

required for OpcnVMS shared space, calcubring the<br />

corresponding number of high-level PTLs. A sufticient<br />

number of PTEs to map rh:r shared sp1cc arc<br />

1 llocated Lner ti·om the high-order end of a gi,·cll<br />

process' highest-ln·d page table page. Then tl1c nextlower<br />

PTE is allocated tor sclfn11pping tiLlt process'<br />

page r:bles. All remaining lr mapping [Jrocess- pri,·uc spKe . In prlctice,<br />

ncarh· all the l'TEs are a\·:ilJblc, ,,·hich mc1ns riLlt<br />

on roda\·'s svstcms, almost 8 TB of process pri\ 1rc ,·irrull<br />

mcmon· is a\·1ilablc to a gi,·en Open V1'viS pmccss.<br />

Figure 6 presents the ri.nal 64 -bir OI)CllVI'vLS ,·irtual<br />

address space la\·out. The portion \\'ith the lo\\·cr<br />

addresses is cmirelv proccss-pt·i,·Jte . The highn­<br />

lddrcsscd portion is sbared b· 1 11 process ,1ddrcss<br />

splCcS. PT space is a region oh·irtual memor\' thlt lies<br />

lxrwecn the P2 and S2 spaces rC)[ am· gi\'Cll pmccss<br />

1 11d at the s1111e virtual Jddress t())' all processes.<br />

Note rhH I''T' sp1ce irselt'consisrs of a proccss·pri,·1rc<br />

and J shared f)Ortion. Again, consider figure 5. ' The<br />

highest-b·cl p1ge tJblc page, Lll'T, is proccss-pri''ltc.<br />

Iris poimcd to lw the PTBR. (When a process' comcxt<br />

is lotkd, or 1111dc acti\T, the process' PT BR \:lluc is<br />

loaded ti·om rhc process' hard,,·art- pri,·ilegcd comnt<br />

block into rhe PT B 1\ register, tllcrcl)\' J111king currellt<br />

the p:ge r1blc structure poimcd ro lw rh1t l'TBR ,md<br />

the pmcess-pri,·lte 1ddress space th


1 /<br />

_,<br />

/<br />

/<br />

/<br />

/<br />

/ ____,_:_ /<br />

/<br />

/ /<br />

/<br />

00000000.00000000<br />

PO SPACE /<br />

00000000. 7FFFFFF F<br />

00000000.80000000<br />

P1 SPACE<br />

P2 SPACE :::b: :;;: :<br />

1--- -- --:<br />

/<br />

/ /<br />

---- --:;<br />

/<br />

- -<br />

;; PAGE TABLE SPACE ;; /<br />

/0<br />

/<br />

0 / (;0<br />

«-<br />

PROCESS-PRIVATE 0<br />

- - - - - - - - - - - - - - - - - -« ---<br />

SHARED SPACE<br />

FFFFFFFF. 7FFFFFF F<br />

FFFFFFFF.80000000<br />

FFFFFFFF.FFFFFFF F<br />

::::;: :<br />

Note that this drawing is not to scale.<br />

Figure 6<br />

Open VMS Alpha 64 -bit Virtual Address Space<br />

All higher-addressed p :gc tables in PT space are<br />

used to map sh:.1red space and are themselves shared.<br />

They arc also adjacent to the shared space that they<br />

map. All page tables in PT space that reside at<br />

addresses lower than that of the Ll PT :re used to map<br />

process-private sp:.1ce. These p:.1ge tables arc processprivate<br />

and Jre adjacent w the process-private space<br />

that they map. Hence, the end of the Ll PT marks<br />

a universal boundary between the process-private<br />

portion and the shared portion of the emire virtual<br />

address space, serving to separate even the PTEs that<br />

map those portions. In Figure 6, the line passing<br />

through PT space illustrates this bou ndary.<br />

A direct co nsequence of this desi gn is that the<br />

process page tables have been privatized . That is,<br />

the portion of PT space that is process-private is curremly<br />

active in virtual memory only when the owning<br />

process itself is currentl y Jctive on the processor.<br />

S2 SPACE :;;:<br />

SO/S1 SPACE<br />

/<br />

/<br />

Fortunately, the majority of page table references<br />

occur while executing in the conrext of the owning<br />

process. Such rderenccs JctuaiJy arc en hanced by<br />

the privatization of the process page t


66<br />

addresses, whether they 1re process-private or shared.<br />

The shared portion of" PT space could serve now as the<br />

sole location tcx shared-space PTEs. Being redundant,<br />

the original SPT is eminently discardable; however,<br />

discarding the SI'T would cn.:are a nussive compati bility<br />

problem r()r device drivers with their many 32-bir<br />

SPT rdcrenccs. This area is one in which an opportunity<br />

exists to preserve a significant degree of"pri,·ileged<br />

code cornpati bilit\'.<br />

Key Measures Ta ken to Maximize<br />

Privileged Code Compatibility<br />

To implement 64-bit virtual address space support, we<br />

altered central sections of the Open VMS Al pha kernel<br />

and many ofits key data structures. We expected that<br />

such changes would require compensating or corresponding<br />

source changes in surrounding privileged<br />

components within the kernel, in device drivers, and<br />

in privileged layered products.<br />

For example, the previous discussion seems to indicare<br />

that any privileged component that reads or writes<br />

PTEs would now need to usc 64-bit-widc pointers<br />

instead of 32-bit l1oimcrs. Similarly, all system f"ork<br />

threads and interrupt service routines could no longer<br />

count on direct :tccess to proccss-priv:tte PTEs without<br />

regard to which process happens to be current<br />

at the momun.<br />

A number of bctors exacerbated the impact of such<br />

changes. Since the OpcnVMS AJplLl operating syste<br />

m origin:ttcd h·om the OpenVMS Vt\ X operating<br />

system, signiric:mr portions of the Open VMS Alpha<br />

operating system and its dC\·ice dri,·ers are still written<br />

in MAC:R0-32 code, a compiled bnguage on the<br />

Alpha p!Jrr(mn .' Because MACR0-32 is an ;lssemblyle,-cl<br />

srvle of progr:unming Llnguagc, we could not<br />

sirnpl\' change the ddinirions 1 11d declarations ofl'ari­<br />

OLIS t\-pes and rch· on recompil:nion to handle the<br />

mm-c ti·om 32-bir to 64-bir pointers. Fin:-dly, there are<br />

well over 3,000 rckrcnces to PTEs h·om MAC R0-32<br />

code modules in the OpenVJVI S Alpha source pool.<br />

We were thus LKed with the prospect or" visiti ng and<br />

potentially altering c1cb of these 3,000 rdcrences.<br />

Moreover, II'C would need to r( >llow the register lifetimes<br />

that resulted h·om each of these rctcrences to<br />

ensure that all address oleularions and memory re krenccs<br />

were done using 64- bir operations. \Ve expected<br />

that this process wo uld be time-consuming m d error<br />

prone and that it would h:we a signiticant negative<br />

impact on our completion dJte.<br />

Once OpcnViVIS Alpha version 7.0 was wailable<br />

to users, those with device dri\'ers and privileged code<br />

of their own would need to go through a similar<br />

efti.)rt. This would further delav wide usc of the<br />

release. For all these reasons, we were we ll motivated<br />

Vo l. 8 \:o. 2 <strong>1996</strong><br />

to minimize the impact on privileged code. The next<br />

fo ur sections discuss techniques th:Jt we used to 0\·ercome<br />

these obstacles.<br />

Resolving the SPT Problem<br />

A signific:mt number of the PTE rc krcnces in pri­<br />

"ileged code arc to PT Es 11·irhin the SPT. Dc1·icc<br />

drivers often double-map the user's 1/0 hufkr into<br />

SO/S l space lw allocating and appmpriately initializing<br />

svstem p:.1ge ta ble entries (SI'Tr:s). Another situation<br />

in which :1 dri\'er manipulates SI'TEs is in the<br />

substitution of a svstcm bufkr ri.>r 1 poorly aligned or<br />

noncontiguous user 1/0 bufkr that pre,·ems rhe<br />

buffer from being directh- used wi th a p:1rricular<br />

dc,·ice . Such code relics ho,'ilv on the svsrem dat:1 cell<br />

MMG$GL_S PTBASE, which points to the SPT.<br />

The new page ublc design comple relv olwi1rcs the<br />

need ro r a separ:nc SPT. Given an 8-KB p1ge size 1 nd<br />

8 bytes per PTE, the entire 2-GB SO/SI virtual H.idrcss<br />

space range can be mapped by 2 MB ofPTEs within PT<br />

space. Because S0/51 resides at the highest addressable<br />

end ofrhe 64-bir virtual address sp1ec, ir is mapped bv<br />

the highest 2 M B of PT space. The :uu on the left in<br />

Figure 7 illustrate this mapping. The PTI-:s in PT space<br />

that map SO/S l arc fu lly shared l1\' :1ll processes, bur<br />

they must be rdcrmccd with 64-bir addresses.<br />

This incompnibi liry is completelv hidden lw the<br />

creation of :1 2-MB "SI'T window" m-cr rhe 2 iv\B in<br />

I'T space (lel'el 3 PTEs ) that m:tps SO/S I space. The<br />

SPT window is positioned at the highest 1dd rcss1blc<br />

end ot"SO/S l space. Therdi. >re, an access through the<br />

SPT window onlv requires a 32-bir SO/S I 1ddrcss and<br />

can obtain anv of the I'TEs in PT sp1ce that m:1p<br />

SO/Sl space The l i"CS on the ri ght in hgurc 7 illus<br />

tratc this access path.<br />

The SPT 11·i ndow is set up at svstcm ini' i1lizarion<br />

rime and consumes onll' the 2 KB of 1'TI-:s tlLlt<br />

are needed to map 2 MB. The S\'stcm d1t.1 cell<br />

MMG$GL_SPTBASE now points to the b:1se of rhe<br />

SPT window, and all existing rekrmccs to that d:1ta ce ll<br />

continue to fu nction corrcctlv without ch:1ngc. -<br />

Providing Cross -process PTE Access for Direct 1/0<br />

The selfm:1pping ofthe page t:1bles is Jn elegant solution<br />

to the page ta ble residency problem imposed by<br />

the preceding design. However, the selr:map1xd page<br />

ta bles present significant challenges or" their own to the<br />

I/0 subsystem and to many device drivers.<br />

Typically, OpenVJ\115 device drivers ri.>r 1111SS storlgc,<br />

net\vork, and other high-perr(mnlncc dev1ccs pcrJ-( mn<br />

direct memory access (DMA) 1 nd what Open VMS cal ls<br />

"direct 1/0." These dn·icc drivers lock down IIHO<br />

physical mcmorv the virtual pages rlut conLl in the<br />

requester's l/0 bufrc r. The 1/0 rr:1nsrcr is fX rr(mned<br />

direcrlv to those pages, aiLLT which the butkr pages are<br />

unlocked, hence the term "direct 1/0 ."


Figure 7<br />

System Page Table Window<br />

: "1<br />

PAGE TABLE SPACE<br />

(8 GB)<br />

- - ---- -----------------------<br />

PTEs THAT MAP SO/S1 (2 MB)<br />

S2 (6GB)<br />

SO/S1 (2 GB)<br />

------------- ------ ----------<br />

SPT WINDOW (2 MB)<br />

The virtual address of the buffer is not adequate for<br />

device drivers because much of the driver code runs in<br />

system context and nor in the process context of the<br />

requester. Similarly, a process-specific virtual address is<br />

meaningless to most DMA devices, which typically can<br />

deal only with the physical addresses of the virtual<br />

pages spanned by the buffer.<br />

For these reasons, when the 1/0 buffe r is locked<br />

into memory, the Open VMS I/0 subsystem converts<br />

the virtual address of the requester's buffer into<br />

( 1) the address of the PTE that maps the start of<br />

the buffe r and (2) the byte oftset within that page to<br />

the first byte of the buffe r.<br />

Once the virtual address of the 1/0 bufkr is convened<br />

to a PTE address, all references to that buffer<br />

are made using the PTE address. This remains the case<br />

even if this 1/0 request and 1/0 bufkr are handed otT<br />

from one driver to another. For example, the IjO<br />

request may be passed fi·om the shadowing virtual disk<br />

driver to the small computer systems intertace (SCSI)<br />

disk class driver to a port driver for a specific SCSI host<br />

adapter. Each of these drivers will rely solely on the<br />

PTE address and the byte offset and not on the virtual<br />

address of the 1/0 butler.<br />

Therefore, the number of virtual address bits the<br />

requester originally used to specif)r rh address of<br />

·FFFFFFFFFFFFFF<br />

the l/0 buffer is irrelevant. What really matters is<br />

the number of address bits that the driver must use<br />

to reference a PTE.<br />

These PTE addresses were always within the page<br />

tables within the balance set slots in shared SO /S 1<br />

space. With the introduction of tbe self mapped page<br />

tables, a 64-bit address is req uired t()r accessing any<br />

l)TE in PT space. Furthermore, the desired PTE is not<br />

accessible using this 64-bit address when the driver is<br />

no longer executing in the context of the original<br />

requester process. This is called a cross-process PTE<br />

access problem.<br />

In most cases, this access problem is solved tor<br />

direct 1/0 by copying the PTEs that map the 1/0<br />

buftc r when the 1/0 buffer is locked into physical<br />

memory. The PTEs in PT space are accessible at that<br />

point because the requester process context is required<br />

in order to lock the buffer. The PTEs arc copied into<br />

the kernel's heap storage and the 64-bit PT space<br />

address is replaced by the address of the PTE copies.<br />

Because the kernel's heap storage remains in SO/S l<br />

space, the replacement address is a 32-bit address that<br />

is shared by all processes on the system.<br />

This copy approach works because drivers do not<br />

need to modit)r the actual PTEs. Typically, this<br />

arrangement works well because the associated PTEs<br />

Vol. 8 No. 2 <strong>1996</strong> 67


em ti t into dcdic:n:d space \\'ithin the I/0 request<br />

pKkct d:t: structure used bv the OncnVMS operatino-<br />

- t b<br />

S\'Stc m, m d there is 110 mcasu r:tble increase in CPU<br />

Ol'crhcad to copy those PT Es .<br />

If the 1/0 buFkr is so L1rgc th:t its :1ssoci;1ted PTEs<br />

c1nnot ti t \\'ithin the 1/0 req uest pKkct, a separate<br />

kernel hc:-tf1 stm:gc packet is 1 lloutcd ro hold the<br />

PTEs. lf the l/0 bufkr is so large rhH the cost of<br />

copl'ing all the l'TFs is Jloticcable, a direct access path<br />

is crc:.1tcd as t( ll lo\\'s:<br />

• The L::l PTEs rhH m:.1p the I/0 bufh: r arc locked<br />

into ph1·sic1l men1or\·.<br />

• Add ress sp1ce 11·ithin SO/S l sp:.1ce is allocned<br />

and nup pcd OI'Cr the l.3 PT Es th at \\'Cre just<br />

locked do\\'Jl.<br />

This esr;1blishcs ' 32-bir address1blc slLlred-space<br />

\\'indo\\' m·cr the 1.3 l'TEs rhar 1mp the 1/0 bu ffe r.<br />

The essc mial point is that one of these methods is<br />

selected md employed umil the 1/0 is completed :md<br />

the bufh:r is unlockl:d . Each method prol'idcs a 32-bit<br />

l'TE address that the rest ofrhc l/0 su bsystem can use<br />

transp::rcnrly, 1s it has been accustomed ro doing, with­<br />

our req uiring numerous, complex sou rce changes.<br />

Use of Self-identifying Structures<br />

To 1 Ccommod:.1tc 64 -bir user l'irtull addresses, 1 num­<br />

ber of kemel dar:1 structures had to be cxp1nded and<br />

ch:1ngcd. for example, as,·nchronous system trap<br />

(AST ) control blocks, bu ffered l/0 pKk


do its job. This is an instance of the cross-process PTE<br />

access problem mentioned earlier.<br />

The swappcr is unable to directly access the page<br />

tables of the process being swapped because the swapper's<br />

own page tables are currently active in virtual<br />

memory. We solved this access problem by revising the<br />

swapper to temporarily "adopt" the page tables of<br />

the process being swapped. The swapper accomplishes<br />

this by temporarily changing its PTBR contents to<br />

point to the page table structure tor the process being<br />

swapped instead of to the swapper's own page table<br />

structure. This change torces the PT space of the<br />

process being swapped to become active in virtual<br />

memory and therd(xe available to the swapper as it<br />

prepares the process to be swapped. Note that the<br />

swapper can make this temporary change because<br />

the swapper resides in shared space. The swapper does<br />

not vanish ti-om virtual memory as the PTBR value is<br />

changed. Once the process has been prepared for<br />

swapping, the swappcr restores its own PTBR value,<br />

thus relinquishing access to the target process' PT<br />

space contents.<br />

Thus, it can be seen how privileged code with<br />

intimate knowledge of OpenVMS memory management<br />

mechanisms can be affected by the changes<br />

to support 64-bit virtual memory. Also evident is that<br />

the alterations needed to accommodate the 64-bit<br />

changes arc relatively straighttorward . Although the<br />

swappcr has a higher-than- normal awareness of memory<br />

managemenr internal workings, extending the<br />

swapper to accommodate the 64-bit changes was<br />

not particularly difficult.<br />

Immediate Use of 64-bit Addressing by the<br />

OpenVMS Kernel<br />

Page table residency was certainly the most pressing<br />

issue we taced with regard to the Open VMS kernel as<br />

it evolved ti·om a 32-bit to a 64-bit-capable operating<br />

system . Once implemented, 64-bit virtual addressing<br />

could be harnessed as an enabling technology tor solving<br />

a number of other problems :s well. This section<br />

briefly discusses some prominent examples that serve<br />

to illustrate how immediately useful 64-bit addressing<br />

became to the OpcnVJ\IIS kernel.<br />

Page Frame <strong>Number</strong> Database and<br />

Very Large Memory<br />

The OpenVMS Alpha operating system maintains a<br />

database t(>r managing individual, physical page ti·ames<br />

of memory, i.e., page ti·ame numbers. This database is<br />

stored in SO/S l space. The size of this database grows<br />

linearly as the size of the physical memory grows.<br />

Future Alpha systems may include larger memory<br />

configurations as memory technology continues to<br />

evolve. The corresponding growth of the page ti-ame<br />

number database tor such systems could consume<br />

an unacceptably large portion of SO /S 1 space, which<br />

has a maximum size of 2 GB. This design effectively<br />

restricts the maxirnum amount of physical memory<br />

that the OpenVMS operating system would be able<br />

to support in the fu wre.<br />

We chose to remove this potential restriction by<br />

relocating the page trame llU!llber database ti·om<br />

SO/S l to 64-bit addressable 52 space. There it can<br />

grow to support any physical memory size being considered<br />

tor years to come.<br />

Global Page Ta ble<br />

The OpenVMS operating system maintains a data<br />

structure in SO /S 1 space called the global page table<br />

( GPT). This pseudo-page table maps memory objects<br />

called global sections. Multiple processes may map<br />

portions of their respective process -private address<br />

spaces to these global sections to achieve protected<br />

shared memory access tor whatever applications they<br />

may be running.<br />

With the advent ofP2 space, one can easily anrjcipate<br />

a need tor orders-of-magnitude-greater global section<br />

usage. This usage directly increases the size of the<br />

GPT, potentially reaching the point where the GPT<br />

consumes an unacceptably large portion of SO/Sl<br />

space. We chose to forestall this problem by relocating<br />

the GPT fi·om SO/S l to S2 space. This move allows the<br />

configuration of a GPT that is much larger than any<br />

that could ever be configured in SO/Sl space.<br />

Summary<br />

AJthough providing 64-bit support was a significant<br />

amount of work, the design of the Open VMS operating<br />

system was readily scalable such that it could<br />

be achieved practically. First, we established a goal of<br />

strict binary compatibility tor nonprivileged applications.<br />

We then designed a superset virtual address<br />

space that extended both process-private and shared<br />

spaces while preserving the 32-bit visible address space<br />

to ensure compatibility. To maximize the available<br />

space for process-private use, we chose an asymmetric<br />

style of address space layout. We privatized the process<br />

page tables, thereby eliminating their residency<br />

in shared space. The bv page table accesses that<br />

occurred fl·om outside the context of the owning<br />

process, which no longer worked after the privatization<br />

of the page tables, were addressed in various ways.<br />

A variety of ripple effects stemming ti-om this design<br />

were readily solved within the kernel.<br />

Solutions to other scaling problems related ro the<br />

kernel were immediately possible with the advent of<br />

64-bit virtual address space. AJrcady mentioned was<br />

the complete removal of the process page tables ti-om<br />

shared space. vVe also removed the global page table<br />

Digirl Tc hnic.ll journ;ll Vo l. S No. 2 l996 69


70<br />

and the page fi-ame number database from 32-bit<br />

addressable to 64-bit addressable shared space. The<br />

immediate net dkct of these changes was significantly<br />

more room in SO/S l space tor configuring more<br />

kernel heap storage, more balance slots to be assigned<br />

to greater numbers of memory resident processes, etc.<br />

We further anticipate use of64-bit addressable shared<br />

space to realize additional bendits of VLM, such as<br />

tor caching massive amounts of file system data.<br />

Providing 64-bit addressing capacity was a logical,<br />

evolutionary step fo r the Open VMS operating system.<br />

Growing numbers of customers are demanding the<br />

additional virtual memory to help solve their problems<br />

in new ways and to achieve higher performance. This<br />

has been especial ly fr uitful for database applications,<br />

with substantial performance improvements already<br />

proved possible by the use of64-bit addressing on the<br />

<strong>Digital</strong> UNIX operating system . Similar results are<br />

expected on the OpenVMS system. With terabytes<br />

of virtual memory and many gigabytes of physical<br />

memory available, entire databases may be loaded into<br />

memory at once. Much of the I/0 that otherwise<br />

would be necessary to access the database can be elimi­<br />

nated, thus allowing an application to improve perfor­<br />

mance by orders of magnitude, for example, to reduce<br />

query time from eight hours to five minutes. Such<br />

performance gains were difficult to achieve while<br />

the OpenVMS operating system was constrained to a<br />

32-bit environment. With the advent of 64-bit address­<br />

ing, OpenVMS users now have a powerful enabling<br />

tec hnology available to solve their problems.<br />

Acknowledgments<br />

The work described in this paper was done by mem­<br />

bers of the Open VMS Alpha Operating System Devel­<br />

opment group. Numerous contributors put in many<br />

long hours to ensure a well-considered design and<br />

a high-quality implementation. The authors particu­<br />

larly wish to acknowledge the fo llowing major con­<br />

tributors to this effort: Tom Benson, Richard Bishop ,<br />

Wa lter Blaschuk, Nitin Karkhanis, Andy Kuehnel,<br />

Karen Noel, Phil Norwich, Margie Sherlock, Dave<br />

\Vall, and Elinor ·wood s. Thanks also to members<br />

of the Alpha languages community who provided<br />

extended programming support for a 64-bit environ­<br />

ment; to Wayne Cardoza, who helped shape the earli­<br />

est notions of what could be accomplished; to Beverly<br />

Schultz, who provided strong, early encouragement<br />

to r pursuing this project; and to Ron Higgins and<br />

Steve Noyes, tor their spirited and unflagging support<br />

to the very end .<br />

The tdlowing reviewers also deserve thanks tor<br />

the invaluable comments they provided in helping to<br />

prepare this paper: Tom Benson, Cathy Foley, Clair<br />

Grant, Russ Green, Mark Howel l, Karen Noel, Margie<br />

Sherlock, and Rod vViddowson.<br />

<strong>Digital</strong> Tec hnical Journal Vo l. 8 No. 2 <strong>1996</strong><br />

References and Notes

1. N. Kronenberg, T. Benson, W. Cardoza, R. Jagannathan, and B. Thomas, "Porting OpenVMS from VAX to Alpha AXP," Digital Technical Journal, vol. 4, no. 4 (1992): 111-120.

2. T. Leonard, ed., VAX Architecture Reference Manual (Bedford, Mass.: Digital Press, 1987).

3. R. Sites and R. Witek, Alpha AXP Architecture Reference Manual, 2d ed. (Newton, Mass.: Digital Press, 1995).

4. Although an OpenVMS process may refer to P0 or P1 space using either 32-bit or 64-bit pointers, references to P2 space require 64-bit pointers. Applications may very well execute with mixed pointer sizes. (See reference 8 and D. Smith, "Adding 64-bit Pointer Support to a 32-bit Run-time Library," Digital Technical Journal, vol. 8, no. 2 [1996, this issue]: 83-95.) There is no notion of an application executing in either a 32-bit mode or a 64-bit mode.

5. Superset system services and language support were added to facilitate the manipulation of 64-bit addressable P2 space.8

6. This mechanism has been in place since OpenVMS Alpha version 1.0 to support virtual PTE fetches by the translation buffer miss handler in PALcode. (PALcode is the operating system-specific privileged architecture library that provides control over interrupts, exceptions, context switching, etc.3) In effect, this means that the OpenVMS page tables already existed in two virtual locations, namely, S0/S1 space and PT space.

7. The SPT window is more precisely only an S0/S1 PTE window. The PTEs that map S2 space


Biographies

Michael S. Harvey
Michael Harvey joined Digital in 1978 after receiving his B.S.C.S. from the University of Vermont. In 1984, as a member of the OpenVMS Engineering group, he participated in new processor support for VAX multiprocessor systems and helped develop OpenVMS symmetric multiprocessing (SMP) support for these systems. He received a patent for this work. Mike was an original member of the RISCy-VAX task force, which conceived and developed the Alpha architecture. Mike led the project that ported the OpenVMS Executive from the VAX to the Alpha platform and subsequently led the project that designed [...] the OpenVMS I/O engineering team, he joined Digital [...] the OpenVMS high-level language device driver project, contributed to the port of the OpenVMS operating system to the Alpha [...]



The OpenVMS Mixed<br />

Pointer Size Environment<br />

A central goal in the implementation of 64-bit addressing on the OpenVMS operating system was to provide upward-compatible support for applications that use the existing 32-bit address space. Another guiding principle was that mixed pointer sizes are likely to be the rule rather than the exception for applications that use 64-bit address space. These factors drove several key design decisions in the OpenVMS Calling Standard and programming interfaces, the DEC C language support, and the system services support. For example, self-identifying 64-bit descriptors were designed to ease development when mixed pointer sizes are used. DEC C support makes it easy to mix pointer sizes and to recompile for uniform 32- or 64-bit pointer sizes. OpenVMS system services remain fully upward compatible, with new services defined only where required or to enhance the usability of the huge 64-bit address space. This paper describes the approaches taken to support the mixed pointer size environment in these areas. The issues and rationale behind these OpenVMS and DEC C solutions are presented to encourage others who provide library interfaces to use a consistent programming interface approach.


Thomas R. Benson<br />

Karen L. Noel<br />

Richard E. Peterson<br />

Support for 64-bit virtual addressing on the OpenVMS Alpha operating system, version 7.0, has vastly increased the amount of virtual address space available [...]

[...] The following sections explore the reasons for mixing pointer sizes.


OpenVMS and Language Support

Implementation choices that Digital made for this first release of the OpenVMS operating system that supports 64-bit virtual addressing will probably encourage mixed pointer size programming. These choices were driven largely by the need for absolute upward compatibility for existing programs and the goal of supporting large, dynamic data sets as the primary application for 64-bit addressing.

Dynamic Data Only OpenVMS services support dynamic allocation of 64-bit address space. This mechanism most closely resembles the malloc and free functions for allocating and deallocating dynamic storage in the C programming language. Allocation of this type differs from static and stack storage in that explicit source statements are required to manage it. For static and stack storage, the system is allocating the memory on behalf of the application at image activation time. (Of course, the allocation may be extended during execution in the case of stack storage.) This allocation continues to be from 32-bit addressable space.

Two special cases of static allocation are worth mentioning. Linkage sections, which are program sections that contain routine linkage information, and code sections, which contain the executable instructions, do not differ substantially from preinitialized static storage. As a result, these sections also reside only in 32-bit addressable memory.

Upward-compatibility Constraints The OpenVMS Alpha operating system is cautious to avoid using 64-bit memory freely where it may prevent upward compatibility for 32-bit applications. For example, the linkage section might seem to be a natural candidate for the OpenVMS system to allocate automatically in 64-bit memory. This allocation would essentially free more 32-bit addressable memory for application use; however, even if this were done only for applications relinked for new versions of the OpenVMS operating system, there is no guarantee that all object code treats linkage section addresses as 64 bits in width. A simple example is storing the address of a routine in a structure. Since a routine's address is the address of its procedure descriptor in the linkage section, moving the linkage section to 64-bit memory would cause code that stores this address in a 32-bit cell to fail.

Allocating the user stack in 64-bit space also appears to be a good opportunity to easily increase the amount of memory available to an application. Stack addresses are often more visible to application code than linkage section addresses are. For instance, a routine can easily allocate a local variable using temporary storage on the stack and pass the address of the variable to another routine. If the stack is moved to 64-bit space, this address quietly becomes a 64-bit address. If the called routine is not 64-bit capable, attempts to use the address will fail.
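To make the hazard concrete, the following fragment is a hedged sketch (the structure and routine names are invented for illustration, not taken from OpenVMS code) of the kind of hidden 32-bit assumption described above:

static void do_work(void) { /* placeholder action */ }

typedef struct {
    unsigned int action_rtn;   /* 32-bit cell assumed wide enough for a code address */
    unsigned int context;
} callback_block;

void register_callback(callback_block *cb)
{
    /* Works only while procedure descriptors live in 32-bit addressable
    ** memory; if the linkage section were moved to 64-bit space, the
    ** high half of the routine address would be silently discarded here.
    */
    cb->action_rtn = (unsigned int) &do_work;
    cb->context    = 0;
}

Because such code compiles and runs cleanly today, the operating system cannot simply relocate linkage sections or the user stack without risking exactly this kind of silent truncation.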

Focus on Services Required for Large Data Sets Not all system services could be changed to support 64-bit addresses (i.e., promoted) in time for the first version of the OpenVMS operating system to support 64-bit addressing. With the mixed-pointer model in mind, we focused on those services that were likely to be required for large data sets. For example, to allow I/O directly to and from high memory, it was essential that the I/O queuing service, SYS$QIO, accept a 64-bit buffer address. Conversely, the SYS$TRNLNM service for translating a logical name did not need to be modified to accept 64-bit addresses. Its arguments include a logical name, a table name, and a vector that contains requests for information about the name. These are small data elements that are unlikely to require 64-bit addressing on their own. Of course, they may be part of some larger structure that resides in 64-bit space. In this case, they can easily be copied to or from 32-bit addressable memory.

System services are discussed further in the section OpenVMS System Services. The 32-bit address restriction on certain system services again emphasizes the importance of being able to logically separate large data set support from the rest of an application.

Limited Language Support Another interface point that requires care when using 64-bit addressing is at calls between modules written in different programming languages. The OpenVMS Calling Standard traditionally makes it easy to mix languages in an application, but DEC C is the only high-level language to fully support 64-bit addresses in the first 64-bit-capable version of the OpenVMS operating system.2 The use of 64-bit addresses in mixed-language applications is possible, and data that contains 64-bit addresses may even be shared; however, references that actually use the data pointed to by these addresses need to be limited to DEC C code or assembly language. Mixed high-level language applications are certain to be mixed pointer size applications in this version of the operating system.

Support for 32-bit Libraries
Many applications rely on library packages to provide some aspect of their functionality. Typical examples include user interface packages, graphics libraries, and database utilities. Third-party libraries may or may not support 64-bit addresses. Applications that use these libraries will probably mix 32-bit and 64-bit pointer sizes and will therefore require an operating system that supports mixed pointer sizes.


Implications of Full 64-bit Conversion
For some applications, it may be desirable to mix pointer sizes to avoid the side effects of universal 64-bit address conversion. The approach of recompiling everything with 64-bit address widths is sometimes called "throwing the switch." An obvious implication of throwing the switch is that all pointer data doubles in size. For complex linked data structures, this can be a significant overall increase in size. Increasing the pointer size may also reveal hidden dependencies on pointer size being the same as integer size. If code accesses a cell as both a 32-bit integer and a 32-bit pointer, the code will no longer work if the pointer is enlarged. Thus, universal
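A hedged sketch of that integer/pointer aliasing hazard (the names are invented, not drawn from any particular application) shows why a blanket recompilation can break code that has worked for years:

#include <stdio.h>

int main(void)
{
    static char item[] = "payload";

    /* The original author "knew" a pointer fits in an int and stored it
    ** in a 32-bit cell.  With 32-bit pointers this round-trip is lossless;
    ** with 64-bit pointers the high half of the address is discarded.
    */
    unsigned int cell = (unsigned int) item;
    char *back        = (char *) cell;

    printf("%s\n", back);   /* unreliable once the address no longer fits */
    return 0;
}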


structure-specific information. Typically, descriptors are used only for character strings, arrays, and complex data types such as packed decimal.

Descriptor types are by definition self-identifying by virtue of the type and class fields they contain. An obvious choice, therefore, for extending descriptors to handle 64-bit addresses would be to add new type constants for 64-bit data elements and extend the structure beyond the type fields to accommodate larger addresses and sizes. In practice, however, the address and length fields from descriptors are frequently used without accessing the type fields, particularly when a character string descriptor is expected.

As a result, a solution was sought that would yield a predictable failure, rather than incorrect results or data corruption, when a 64-bit descriptor is received by a routine that expects only the 32-bit form. The final design includes a separate 64-bit descriptor layout that contains two special fields at the same offsets as the length and address fields in the 32-bit descriptor. These fields are called MBO (must be one) and MBMO (must be minus one), respectively. The simplest versions of the 32-bit and 64-bit descriptors are illustrated in Figure 1.

[...] ensuring robust interfaces in the mixed pointer size environment.
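Since Figure 1 itself is not reproduced here, the following is a minimal sketch of the two layouts as the text describes them; the member names and exact field types are illustrative, not the definitions from the OpenVMS descriptor header:

#include <stdint.h>

typedef struct {            /* classic 32-bit descriptor                     */
    uint16_t length;        /* length of the data                            */
    uint8_t  dtype;         /* data type code                                */
    uint8_t  dclass;        /* descriptor class code                         */
    uint32_t pointer;       /* 32-bit address of the data                    */
} descriptor32;

typedef struct {            /* self-identifying 64-bit descriptor            */
    uint16_t mbo;           /* "must be one": overlays the old length field  */
    uint8_t  dtype;
    uint8_t  dclass;
    int32_t  mbmo;          /* "must be minus one": overlays the old address */
    uint64_t length;        /* 64-bit length                                 */
    uint64_t pointer;       /* 64-bit address of the data                    */
} descriptor64;

A routine that blindly treats a 64-bit descriptor as the 32-bit form therefore sees a length of 1 and an address of -1, which fails quickly and predictably instead of corrupting data.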

DEC C Language Support for Mixed Pointer Sizes

To support application programming in the mixed pointer size environment, some design work was required in the DEC C compiler. This section describes the rationale behind the final design.

It was clear that the compiler would have to provide a way for 32-bit and 64-bit pointers to coexist in the same regions of code. At the same time, customers and internal users initially favored a simple command line switch when polled on potential compiler support for 64-bit address space. (At least one C compiler that supports 64-bit addressing, MIPSpro C, does so only through command line switches for setting pointer sizes.3) The motivation for using switches was to limit the source changes needed to take advantage of the additional address space, especially when portability to other platforms is desired. For cases in which mixing pointer sizes was unavoidable, something more flexible than a switch was needed.

Why Not _near and _far?
The most common suggestion for controlling individual pointer declarations was to adopt the _near and _far type qualifier syntax used in the PC environment in its transition from 16-bit to 32-bit addressing.4 While this idea has merit in that it has already been used elsewhere in C compilers and is familiar to PC software developers, we rejected this approach for the following reasons:

• The syntax is not standard.

• The syntax requires source code edits at each declaration to be affected.

• The syntax has become largely obsolete even in the PC domain with the acceptance of the flat 32-bit address space model offered by modern 386-minimum PC compilers and the Win32 programming interface.

• Because of the vast difference in scale in choosing between 16-bit or 32-bit pointers on a PC as compared to choosing between 32-bit or 64-bit pointers on an Alpha system, there would be no porting benefit in using the same keywords. No existing source code base would be able to port to the OpenVMS mixed pointer size environment more easily because of the presence of _near and _far qualifiers.

Pragma Support
The Digital UNIX C compiler had previously defined pragma preprocessing directives to control pointer sizes for slightly different reasons than those described for the OpenVMS system. By default, the Digital UNIX operating system offers a pure 64-bit addressing model. In some circumstances, however, it is desirable to be able to represent pointers in 32 bits to match externally imposed data layouts or, more rarely, to reduce the amount of memory used in representing pointer values. The Digital UNIX pointer_size pragmas work in conjunction with command line options and linker/loader features that limit memory use and map memory such that pointer values accessible to the C program can always be represented in 32 bits.

Since compatibility with the Digital UNIX compiler would have greater value if it met the needs of the OpenVMS platform, we evaluated the pragma-based approach and decided to adopt it, propagating any necessary changes back to the UNIX platform to maintain compatibility. The decision to use pragmas to control pointer size addressed the major deficiencies of the _near and _far approach. In particular, the pragma directive is specified by ISO/ANSI C in such a way that using it does not compromise portability as the use of additional keywords can, because unrecognized pragmas are ignored. Furthermore, pragmas can easily be specified to apply to a range of source code rather than to an individual declaration. A number of DEC C pragmas, including the pointer size controls implemented on the UNIX system, provide the ability to save and restore the state of the pragma. This makes them convenient and safe to use to modify the pointer size within a particular region of code without disturbing the surrounding region. The state may easily be saved before changing it at the beginning of the region and then restored at the end.
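A minimal sketch of that save/restore discipline (the typedef name is invented for illustration; the pragma spellings follow those used in the figures elsewhere in this issue) looks like this:

/* Widen pointers for one declaration without disturbing whatever
** pointer size is in effect in the surrounding code.
*/
#pragma __pointer_size __save       /* remember the current setting  */
#pragma __pointer_size 64

typedef char *wide_text_ptr;        /* a 64-bit character pointer    */

#pragma __pointer_size __restore    /* back to the surrounding size  */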

Command Line Interaction
Pragmas fit in with the initial desire of prospective users to have a simple command line switch to indicate 64 bits. As with several other pragmas, we defined a command line qualifier (/pointer_size) to specify the initial state of the pragma before any instances are encountered in the text. Unlike other pragmas, though, we also use the same command line qualifier to enable or disable the action of the pragmas altogether. In this way, a default compilation of source code modified for 64-bit support behaves the same way that it would on a system that did not offer 64-bit support. That is, the pragmas are effectively ignored, with only an informational message produced.

This behavior was adopted for consistency with the Digital UNIX behavior and also to aid in the process of adding optional 64-bit support to existing portable 32-bit source code that might be compiled for an older system or with an older compiler. In this model, a compilation of new source code using an old command line produces behavior that is equivalent to the behavior produced using an older compiler or a compiler on another platform. With one notable exception, building an application that actually uses 64-bit addressing requires changing the command line.

The exception to the rule that existing 32-bit build procedures do not create 64-bit dependencies is a second form of the pragma, named required_pointer_size. This form contrasts with the form pointer_size in that it is always active regardless of command line qualifiers; otherwise, required_pointer_size and pointer_size are identical. The intent of this second pragma is to support writing source code that specifies or interfaces to services or libraries that can only work correctly with 64-bit pointers. An example of this code might be a header file that contains declarations for both 64-bit and 32-bit memory management services; the services must always be defined to accept and return the appropriate pointer size, regardless of the command line qualifier used in the compilation.
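A hedged sketch of such a header follows; the routine names are placeholders invented for illustration, not the actual OpenVMS service declarations:

/* These declarations must keep their pointer sizes no matter how the
** including module is compiled, so they use required_pointer_size
** rather than pointer_size.
*/
#pragma __required_pointer_size __save

#pragma __required_pointer_size 32
void *get_32bit_space(unsigned int nbytes);        /* always returns a 32-bit pointer */

#pragma __required_pointer_size 64
void *get_64bit_space(unsigned __int64 nbytes);    /* always returns a 64-bit pointer */

#pragma __required_pointer_size __restore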

Pragma Usage
The use of pragmas to control pointer sizes within a range of source code fits well with the model of starting with a working 32-bit application and extending it to exploit 64-bit addressing with minimal source code edits. Programming interface and data structure declarations are typically packaged together in header files, and the primary manipulators of those data structures are often implemented together in modules.

One good approach for extending a 32-bit application would be to start with an initial analysis of memory usage measurements. The purpose of this analysis would be to produce a rough partitioning of routines and data structures into two categories: "32-bit sufficient" and "64-bit desirable." Next, 64-bit pointer pragmas could be used to enclose just the header files and source modules that correspond to the routines and data



Allocation Function Mapping The command line qualifier setting the default pointer size has an additional effect that simplifies the use of 64-bit address space. If an explicit pointer size is specified on the command line, the malloc function is mapped to a routine specific to the address space for that size. For example, _malloc64 is used for malloc when the default pointer size is 64 bits. This allows allocation of 64-bit address space without additional source changes. The source code may also call the size-specific versions of run-time routines explicitly, when compiled for mixed pointer sizes. These size-specific functions are available, however, only when the /pointer_size command line qualifier is used. See "Adding 64-bit Pointer Support to a 32-bit Run-time Library" in this issue for a discussion of other effects of 64-bit addressing on the C run-time library.
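The effect can be pictured with a short, hedged sketch (error handling kept minimal; it assumes a module compiled with /POINTER_SIZE=64 so that the size-specific allocators are declared):

#include <stdlib.h>
#include <string.h>

void build_buffers(void)
{
    /* With a 64-bit default pointer size, plain malloc maps to the
    ** 64-bit allocator, while _malloc32 remains available when a
    ** 32-bit addressable buffer is explicitly required.
    */
    char *big   = malloc(1024);       /* 64-bit address space             */
    char *small = _malloc32(1024);    /* 32-bit addressable space         */

    if (big != 0 && small != 0)
        memcpy(small, big, 1024);     /* both widths usable in one module */

    free(big);
    free(small);
}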

Header File Semantics The treatment of pointer_size pragmas in and around header files (i.e., any source included by the #include preprocessing directive) deserves special mention. Programs typically include both private definition files and public or system-specific header files. In the latter case, it may not be desirable for definitions within the header files to be affected by the pointer_size pragmas or command line currently in effect. To prevent these definitions from being affected, the DEC C compiler searches for special prologue and epilogue header files when a #include directive is processed. These files may be used to establish a particular state for environmental pragmas, such as pointer_size, for all header files in the directory. This eliminates the need to modify either the individual header files or the source code that includes them.

The compiler creates a predefined macro called __INITIAL_POINTER_SIZE to indicate the initial pointer size as specified on the command line. This may be of particular use in header files to determine what pointer size should be used, if mixed pointer size support is desirable. Conditional compilation based on this macro's definition state can be used to set or override pointer size or to detect compilation by an older compiler lacking pointer-size support. If its value is zero, no /pointer_size qualifier was specified, which means that pointer_size pragmas do not take effect. If its value is 32 or 64, pointer_size pragmas do take effect, so it can be assumed that mixed pointer sizes are in use.

Code Example
In the simple code example shown in Figure 3, suppose that the routine proc1 is part of a library that has been only partially promoted to use 64-bit addresses. This function may receive either a 32-bit address or a 64-bit address in the argument_ptr parameter. To demonstrate the use of the new DEC C features, proc1 has been modified to copy this character string parameter from 64-bit space to 32-bit space when necessary, so that routines that proc1 subsequently calls need to deal with only 32-bit addresses.

The __INITIAL_POINTER_SIZE macro is used to determine if pointer_size pragmas will be effective and, hence, whether argument_ptr might be 64 bits in width. If it might be a 64-bit pointer, whose actual width is unknown in this example, the pointer's value is copied to a 32-bit-wide pointer. The pointer_size pragma is used to change the current pointer size to 32 bits to declare the temporary pointer. Next, the two pointer values are compared to determine if the original pointer fits in 32 bits. If the pointer does not fit, temporary storage in 32-bit addressable space is allocated, and the argument is copied there. Note that the example uses _malloc32 rather than malloc, because malloc would allocate 64-bit address space if the initial pointer size was 64 bits. At the end of the routine, the temporary space is freed, if necessary.

A type cast is used in the assignment from argument_ptr to temp_short_ptr, even though both variables are of type char *. Without this type cast, if argument_ptr is a 64-bit-wide pointer, the DEC C compiler would report a warning message because of the potential data loss when assigning from a 64-bit to a 32-bit pointer.

For other examples of pointer_size pragmas and the use of the __INITIAL_POINTER_SIZE macro, see Duane Smith's paper on 64-bit pointer support in run-time libraries.

OpenVMS System Services
The OpenVMS operating system provides a suite of services that perform a variety of basic operating system functions.7 Design work was required to maximize the utility of these routines in the new mixed pointer size environment. Issues that needed to be addressed included the following, which are discussed in subsequent sections:

• Several services pass pointers by reference and, hence, required an interface change.

• Because of resource constraints, not all system services could be promoted to handle 64-bit addresses in the first version of the 64-bit-capable OpenVMS operating system.

• Since the services provide mixed levels of support, it is important to indicate those that support 64-bit addresses and those that do not.

• Certain new services seemed desirable to improve the usability of 64-bit address space.

Services That Are 64-bit Friendly
Services that can be promoted to support 64-bit addresses without any interface change are called 64-bit friendly. If a service receives an address by reference, the service is typically not 64-bit friendly, and a separate


Figure 3

void proc1 (char *argument_ptr)
{
#if __INITIAL_POINTER_SIZE != 0
#pragma __pointer_size __save
#pragma __pointer_size 32

    char *temp_short_ptr;

    temp_short_ptr = (char *) argument_ptr;
    if (temp_short_ptr != argument_ptr) {
        temp_short_ptr = _malloc32 (strlen (argument_ptr) + 1);
        strcpy (temp_short_ptr, argument_ptr);
        argument_ptr = temp_short_ptr;
    }
    else {
        temp_short_ptr = 0;
    }

#pragma __pointer_size __restore
#endif

    /*
    ** The actual body of proc1 is omitted.  Assume that it calls
    ** routines that operate on the data pointed to by argument_ptr
    ** and that the routines are not yet prepared to handle 64-bit
    ** addresses.
    */

#if __INITIAL_POINTER_SIZE != 0
    if (temp_short_ptr != 0)
        free (temp_short_ptr);
#endif
}

Code Example of Pointer_size Pragmas and the __INITIAL_POINTER_SIZE Macro

entry point is required to support 64-bit addresses. A single routine cannot distinguish whether the address at the specified location is 32 bits or 64 bits in width.

If a service does not receive or return an address by reference, the service is usually 64-bit friendly. Even descriptor arguments present no problem, because the 32- and 64-bit versions can be distinguished at run time. The majority of services fall into this category.

The services that are not 64-bit friendly include the entire suite of memory management system services, since they access address ranges passed by reference. Other such services include those that receive a 32-bit vector as an argument, which may include the address of a pointer as an element. A good example from this group is SYS$FAOL, which accepts a 32-bit vector argument for formatted output. For all these services, new interfaces were designed to accommodate 64-bit callers.
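The by-reference problem can be pictured with a small, hedged fragment (the routine is a stand-in, not an actual system service):

#include <stdio.h>

/* A routine handed "the address of a cell that holds an address" cannot
** tell how wide the stored pointer is; both calls below look identical
** from its point of view.
*/
static void takes_address_by_reference(void *cell)
{
    printf("received cell at %p\n", cell);
}

int main(void)
{
    unsigned int       va32 = 0x00010000u;             /* 32-bit address value */
    unsigned long long va64 = 0x0000000080000000ull;   /* 64-bit address value */

    takes_address_by_reference(&va32);
    takes_address_by_reference(&va64);
    return 0;
}

This is why such services needed distinct 64-bit entry points rather than transparent widening.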

Promotion of Services
The OpenVMS project team explored the idea of promoting all system services to support 64-bit addresses. Since the majority of OpenVMS system service routines are written in the MACRO-32 assembly language or the BLISS-32 programming language, the internals of the routines could not be promoted to handle 64-bit addresses without modifications. We could not take advantage of the throw-the-switch approach, and we did not want to because many pointers used internally in the OpenVMS operating system remain at 32 bits.

We considered using 64-bit jacket routines to copy 64-bit arguments to the stack in 32-bit space, which would then call the 32-bit internal routine to perform the work. This approach was rejected for several reasons. First, the jackets would need to be custom written to ensure correct parameter semantics. There could not be a "common jacket" that could have saved time and lowered risk. Second, there would be an undesirable performance impact for 64-bit callers. Third, we decided that having a complete 64-bit system service suite was not essential for usable 64-bit support. We could define a subset that would meet the needs of 64-bit address space users, while lowering our risk and implementation costs.

The services selected for 64-bit support fall into four categories.

1. Memory management services.

2. Performance-critical services. This group includes services that are typically sensitive to the addition of even a few cycles of execution time. Requiring that a 64-bit address user do any additional work, such as copying data to 32-bit space, is undesirable. An example of this type of service is SYS$ENQ, which is used for queuing lock requests.

3. Design center services. The primary design center for 64-bit support was database applications. Database architects and consultants were polled to determine which services were most needed by their products. Many of these services, for example SYS$QIO for queuing I/O requests, were also in the performance-critical set.

4. Other useful basic services. This set includes services to ease the transition to 64 bits with minimal change to program structure. For example, the SYS$CMKRNL service accepts a routine address and a vector of 32-bit arguments


interfaces than the 32-bit-only SYS$CRMPSC. The original SYS$CRMPSC is still present so that existing code may function without change.

Some new feature requests were considered as part of the 64-bit effort, but, to maintain the focus of the release, these features were not implemented. The 64-bit memory management services were designed to more easily accommodate new features in the future. For example, the new services check the argument count for both too many and too few supplied arguments. In this way, new optional arguments can be added later to the end of the list without jeopardizing backward compatibility.

Virtual Regions
One new feature that was added to the suite of 64-bit memory management services is support for new entities called virtual regions. A virtual region is an address range that is reserved by a program for future dynamic allocation requests. The region is similar in concept to the program region (P0) and the control region (P1), which have long existed on the OpenVMS operating system. A virtual region differs from the program and control regions in that it may be defined by the user by calling a system service and may exist within P0, P1, or the new 64-bit addressable process-private space, P2. When a virtual region is created, a handle is returned that is subsequently used to identify the region in memory management requests.

Address space within virtual regions is allocated in the same manner as in the default P0, P1, and P2 regions, with allocation defined to expand space toward either ascending or descending addresses. As in the default regions, allocation is in multiples of pages. The OpenVMS operating system keeps track of the first free virtual address within the region. A region can be created such that address space is created automatically when a virtual reference is made within the region, just as the control region in P1 space expands automatically to accommodate user stack expansion. When a virtual region is created within P0, P1, or P2, the remainder of that containing region decreases in size so that it does not overlap with the virtual region.

Virtual regions were added to the OpenVMS Alpha operating system along with the 64-bit addressing capability so that the huge expanse of 64-bit address space could be more easily managed. If a subsystem requires a large portion of virtually contiguous address space, the space can be reserved within P2 with little overhead. Other subsystems within the application cannot inadvertently interfere with the contiguity of this address space. They may create their own regions or create address space within one of the default regions.

Another advantage of using virtual regions is that they are the most efficient way to manage sparse address space within the 64-bit P2 space. Furthermore, no quotas are charged for the creation of a virtual region. The internal storage for the description of the region comes from process address space, which is the only resource used.

Summary
This paper presents the reasons behind the new OpenVMS mixed pointer size environment and the support added to allow programming within this environment. The discussion touches on some of the new support designed to simplify the use of the 64-bit address space.

The approaches discussed yielded full upward compatibility for 32-bit applications, while allowing other applications access to the huge 64-bit address space for data sets that require it. Promotion of all pointers to 64-bit width is not required to use 64-bit space; the mixed pointer size environment was considered paramount in all design decisions. A case study of adding 64-bit support to the C run-time library also appears in this issue of the Journal.

Acknowledgments
The authors wish to thank the other members of the 64-bit Alpha-L Team who helped shape many of the ideas presented in this paper: Mark Arsenault, Gary Barton, Barbara Benton, Ron Brender, Ken Cowan, Mark Davis, Mike Harvey, Lon Hilde, Duane Smith, Cheryl Stocks, Lenny Szubowicz, and Ed Vogel.

References

1. M. Harvey and L. Szubowicz, "Extending OpenVMS for 64-bit Addressable Virtual Memory," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 57-71.

2. OpenVMS Calling Standard (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBBA-TE, 1995).

3. MIPSpro 64-Bit Porting and Transition Guide, Document No. 007-2391-002 (Mountain View, Calif.: Silicon Graphics, Inc., 1996).

4. C Language

7. OpenVMS System Services Reference Manual: A-GETMSG (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBMA-TE, 1995) and OpenVMS System Services Reference Manual: GETQUI-Z (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBN-TE, 1995).

8. OpenVMS Alpha Guide to 64-Bit Addressing (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSIICA-TE, 1995).

9. T. Leonard, ed., VAX Architecture Reference Manual (Bedford, Mass.: Digital Press, 1987).

Biographies

Thomas R. Benson
A consulting engineer in the OpenVMS Engineering Group, Tom Benson is one of the developers of 64-bit addressing support. Tom joined Digital's VAX Basic project in 1979 after receiving B.S. and M.S. degrees in computer science from Syracuse University. After working on an optimizing compiler shell used by several VAX compilers, he joined the VMS Group where he led the VMS DECwindows FileView and Session Manager projects, and brought the Xlib graphics library to the VMS operating system. Tom holds three patents on the design of the VAX MACRO-32 compiler for Alpha and recently applied for two patents related to 64-bit addressing work.

Karen L. Noel
A principal engineer in the OpenVMS Engineering Group, Karen Noel is one of the developers of 64-bit


Adding 64-bit Pointer<br />

Support to a 32-bit<br />

Run-time Library<br />

A key component of delivering 64-bit addressing on the OpenVMS Alpha operating system, version 7.0, is an enhanced C run-time library that allows application programmers to allocate and utilize 64-bit virtual memory from their C programs. This C run-time library includes modified programming interfaces and additional new interfaces yet ensures upward compatibility for existing applications. The same run-time library supports applications that use only 32-bit addresses, only 64-bit addresses, or a combination of both. Source code changes are not required to utilize 64-bit addresses, although recompilation is necessary. The new techniques used to analyze and modify the interfaces are not specific to the C run-time library and can serve as a guide for engineers who are enhancing their programming interfaces to support 64-bit pointers.


Duane A. Smith<br />

The OpenVMS Alpha operating system, version 7.0, has extended the address space accessible to applications beyond the traditional 32-bit address space. This new address space is referred to as 64-bit virtual memory and requires a 64-bit pointer for memory access.1 The operating system has an additional set of new memory allocation routines that allows programs to allocate and release 64-bit memory. In OpenVMS Alpha version 7.0, this set of routines is the only mechanism available to acquire 64-bit memory.

For application programs to take advantage of these new OpenVMS programming interfaces, high-level programming languages such as C had to support 64-bit pointers. Both the C compiler and the C run-time library required changes to provide this support. The compiler needed to understand both 32-bit and 64-bit pointers, and the run-time library needed to accept and return such pointers.

The compiler has a new qualifier called /pointer_size, which sets the default pointer size for the compilation to either 32 bits or 64 bits. Also added to the compiler are pragmas (directives) that can be used within the source code to change the active pointer size. An application program is not required to compile each module using the same /pointer_size qualifier; some modules may use 32-bit pointers while others use 64-bit pointers. Benson, Noel, and Peterson describe these compiler enhancements.2 The DEC C User's Guide for OpenVMS Systems documents the qualifier and the pragmas.3

The C run-time library added 64-bit pointer support either by modifying the existing interface to a function or by adding a second interface to the same function. Public header files define the C run-time library interfaces. These header files contain the publicly accessible function prototypes and structure definitions. The library documentation and header files are shipped with the C compiler; the C run-time library ships with the operating system.

This paper discusses all phases of the enhancements to the C run-time library, from project concepts through the analysis, the design, and finally the implementation. The DEC C Run-Time Library Reference Manual for OpenVMS Systems contains user documentation regarding the changes.4


Starting the Project
We devoted the initial two months of the project to understanding the overall OpenVMS presentation of 64-bit addresses and deciding how to present 64-bit enhancements to customers. Representatives from OpenVMS engineering, the compiler team, the run-time library team, and the OpenVMS Calling Standard team met weekly with the goal of converging on the deliverables for OpenVMS Alpha version 7.0.

The project team was committed to preserving both source code compatibility and the upward compatibility aspects of shareable images on the OpenVMS operating system. Early discussions with application developers reinforced our belief that the OpenVMS system must allow applications to use 32-bit and 64-bit pointers within the same application. The team also agreed that for a mixed-pointer application to work effectively, a single run-time library would need to support both 32-bit and 64-bit pointers; however, there was no known precedent for designing such a library.

One implication of the decision to design a run-time library that supported 32-bit and 64-bit pointers was that the library could never return an unsolicited 64-bit pointer. Returning a 64-bit pointer to an application that was expecting a 32-bit pointer would result in the loss of one half of an address. Although typically this error would cause a hardware exception, the resulting address could be a valid address. Storing to or reading from such an address could result in incorrect behavior that would be difficult to detect.

The OpenVMS Calling Standard specifies that arguments passed to a function be 64-bit values. If a 32-bit [...]

[...] four categories (listed in order of added complexity to library users):

1. Not affected by the size of a pointer

2. Enhanced to accept both pointer sizes

3. Duplicated to have a 64-bit-specific interface

4. Restricted from using 64-bit pointers

One last point to come from the meetings was that many of the C run-time library interfaces are implemented by calling other OpenVMS images. For example, the Curses Screen Management interfaces make calls to the OpenVMS Screen Management (SMG) facility. It is important that the C run-time library defines the interfaces to support 64-bit addresses without looking at the implementation of the function. Consistency and completeness of the interface are more important than the complexity of the implementation. In the SMG example, if the C run-time library needs to make a copy of a string prior to passing the string to the SMG facility, this is what will be implemented.

Analyzing the Interfaces
The process of analyzing the interfaces began by creating a document that listed all the header files [...]

[...] in structures. If a FILE pointer is always a 32-bit pointer, structures that contain only FILE pointers are not affected by the choice of pointer size. We use this information when we look at the layout of structures and examine function prototypes that accept pointers to structures.

In this paper, structures that are always



An example of a parameter being restricted to a low memory address is the buffer being passed as the parameter to the function setbuf. The parameter defines this buffer to be used for I/O operations. The user expects to see this buffer change as I/O operations are performed on the file. If the run-time library made a copy of this buffer, the changes would


structure that is bound to low memory, such as a FILE or WINDOW pointer, the return value does not force separate entry points.
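As a hedged illustration of this "bound to low memory" case (no claim is made here about the actual fopen implementation; the file name is invented):

#include <stdio.h>

/* FILE structures are allocated by the run-time library itself in 32-bit
** addressable memory, so a single fopen entry point suffices: even a
** module compiled with 64-bit default pointers receives a value that is
** also a valid 32-bit address.
*/
void log_line(const char *msg)
{
    FILE *fp = fopen("app.log", "a");

    if (fp != NULL) {
        fprintf(fp, "%s\n", msg);
        fclose(fp);
    }
}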

Certain functions, such as malloc, allocate memory on behalf of the calling program and return the address of that memory as the value of the function. These functions have two implementations: the 32-bit interface always allocates 32-bit virtual memory, and the 64-bit interface always allocates 64-bit virtual memory.

Many string and memory functions have return values that are relative to a parameter passed to the same routine. These addresses may be returned as high memory addresses if and only if the parameter is a high memory address.

The following is the function prototype for strcat, which is found in the header file <string.h>:

char *strcat(char *s1, const char *s2);

The strcat function appends the string pointed to by s2 to the string pointed to by s1. The return value is the address of the resulting string s1.

In this case, the size of the pointer in the return value is always the same as the size of the pointer passed as the first parameter. The C programming language has no way to reflect this. Since the function may return a 64-bit pointer, the strcat function must have two entry points.

As discussed earlier, the pointer size used for parameter s2 is not related to the returned pointer size. The C run-time library made this s2 argument 64-bit friendly by declaring it a 64-bit pointer. This declaration allows the application programmer to concatenate a string in high memory to one in low memory without altering the source code. The following strcat function statement shows this declaration:

char *strcat(char *s1, __char_ptr64 s2);

The data type __char_ptr64 is a 64-bit character pointer whose definition and use will be explained later in this paper.
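A short, hedged usage sketch (variable names invented; it assumes a module compiled with the /pointer_size qualifier so that _malloc32 and _malloc64 are declared):

#include <string.h>
#include <stdlib.h>

void append_note(void)
{
    char *dest = _malloc32(64);   /* destination in 32-bit addressable space */
    char *note = _malloc64(8);    /* source string in 64-bit space            */

    if (dest == 0 || note == 0)
        return;

    strcpy(dest, "status: ");
    note[0] = 'o'; note[1] = 'k'; note[2] = '\0';

    /* s2 is declared 64-bit friendly, so the high-memory source is accepted
    ** unchanged; the returned address equals dest and fits in 32 bits.
    */
    strcat(dest, note);

    free(note);
    free(dest);
}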

High-level Design
The /pointer_size qualifier is available in those versions of the C compiler that support 64-bit pointers. The compiler has a predefined macro named __INITIAL_POINTER_SIZE whose value is based on the use of the /pointer_size qualifier. The macro accepts the following values:

• 0, which indicates that the /pointer_size qualifier is not used or is not available

• 32, which indicates that the /pointer_size qualifier is used and has a value of 32

• 64, which indicates that the /pointer_size qualifier is used and has a value of 64

The C run-time library header files conditionally compile based on the value of this predefined macro. A zero value indicates to the header files that the computing environment is purely 32-bit. The pointer-size-specific function prototypes are not defined. The user must use the /pointer_size qualifier to access 64-bit functionality. The choice of 32 or 64 determines the default pointer size.

The header files define two distinct types of declarations: those that have a single implementation and those that have pointer-size-specific implementations. The addresses passed or returned from functions that have a single implementation are either bound to low memory, restricted to low memory, or widened to accept a 64-bit pointer.

Those functions that have pointer-size-specific entry points have three function prototypes defined. Using malloc as an example, prototypes are created for the functions malloc, _malloc32, and _malloc64. The latter two prototypes are the pointer-size-specific prototypes and are defined only when the /pointer_size qualifier is used. The malloc prototype defaults to calling _malloc32 when the default pointer size is 32 bits. The malloc prototype defaults to calling _malloc64 when the default pointer size is 64 bits. Application programmers who mix pointer types use the /pointer_size qualifier to establish the default pointer size but can then use _malloc32 and _malloc64 explicitly to achieve nondefault behavior.

In addition to being enhanced to support 64-bit pointers, the C compiler has the added capability of detecting incorrect mixed-pointer usage. It is the function prototype found in the header files that tells the compiler exactly whether the function is the 32-bit-specific entry point. If it is, the compiler forms the prefixed name by adding "decc$" to the beginning of the name but also removes the "_" and the "32." Consequently, the function name _malloc32 becomes decc$malloc, but the function name _wz32 does not change.

Implementation
To illustrate changes that needed to be made to the header files, we invented a single header file called <header.h>. This file, which is shown in Figure 1, illustrates the classes of problems faced by a developer who is adding support for 64-bit pointers. The functions defined in this header file are actual C run-time library functions.

Preparing the Header File
The first pass through <header.h> resulted in a number of changes in terms of formatting, commenting, and 64-bit support. Realizing that many modifications would be made to the header files, we considered readability a major goal for this release of these files.

The initial header files assumed a uniform pointer size of 32 bits for the OpenVMS operating system. During the first pass through <header.h>, we added pointer-size pragmas to ensure that the file saved the user's pointer size, set the pointer size to 32 bits, and then restored the user's pointer size at the end of the header.

Next we formatted <header.h> to show the various categories that the structures and functions fall into. The categories and the result of the first pass through <header.h> can be seen in Figure 2. For example, the function rand had no pointers in the function prototype and was immediately moved to the section "Functions that support 64-bit pointers."

Figure 1
Original Header File <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

#ifndef SIZE_T
#   define SIZE_T 1
    typedef unsigned int size_t;
#endif

int     execv(const char *, char *[]);
void    free(void *);
void    *malloc(size_t);
int     rand(void);
char    *strcat(char *, const char *);
char    *strerror(int);
size_t  strlen(const char *);

#endif  /* HEADER_LOADED */

Organizing <header.h> in this way gave us an accurate reading of how many more functions needed 64-bit support. If any of the sections became empty, we did not remove


Figure 2
First Pass through <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

/*
** Ensure that we begin with 32-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
#   if (__VMS_VER < 70000000)
#      error "Pointer size added in OpenVMS V7.0 for Alpha"
#   endif
#   pragma __pointer_size __save
#   pragma __pointer_size 32
#endif

/*
** STRUCTURES NOT AFFECTED BY POINTERS
*/
#ifndef SIZE_T
#   define SIZE_T 1
    typedef unsigned int size_t;
#endif

/*
** FUNCTIONS THAT NEED 64-BIT SUPPORT
*/
int     execv(const char *, char *[]);
void    free(void *);
void    *malloc(size_t);
char    *strcat(char *, const char *);
char    *strerror(int);
size_t  strlen(const char *);

/*
** Create 32-bit header file typedefs.
*/

/*
** Create 64-bit header file typedefs.
*/

/*
** FUNCTIONS RESTRICTED FROM 64 BITS
*/

/*
** Change default to 64-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
#   pragma __pointer_size 64
#endif

/*
** FUNCTIONS THAT SUPPORT 64-BIT POINTERS
*/
int     rand(void);

/*
** Restore the user's pointer context.
*/
#if __INITIAL_POINTER_SIZE
#   pragma __pointer_size __restore
#endif

#endif  /* HEADER_LOADED */


We created a C run-time library private header file. This header file has the appropriate pragmas to define 64-bit pointer types used within the C run-time library, as shown in Figure 3. It also contains the definitions of macros used in the implementations of the functions; Figure 4 shows those macros.

Once a module includes this private header file, the compilation of that module changes to add the qualifier /pointer_size=32. This change improves the readability of the code because "char *" is read as a 32-bit character pointer, whereas 64-bit pointers use typedefs whose names begin with "__wide." The name of the new typedef is __wide_char_ptr, which is read as a 64-bit character pointer.

The C run-time library design also requires that the implementation of a function include all header files that define the function. This ensures that the implementation matches the header files as they are modified to support 64-bit pointers. For functions defined in multiple header files, this ensures that header files do not contradict each other.

/*
** This include file defines all 32-bit and 64-bit data types used in
** the implementation of 64-bit addresses in the C run-time library.
**
** Those modules that are compiled with a 64-bit-capable compiler
** are required to enable pointer size with /POINTER_SIZE=32.
*/
#ifdef __INITIAL_POINTER_SIZE
#   if (__INITIAL_POINTER_SIZE != 32)
#      error "This module must be compiled /pointer_size=32"
#   endif
#endif

/*
** ALL interfaces that require 64-bit pointers must use one of
** the following definitions.  When this header file is used on
** platforms not supporting 64-bit pointers, these definitions
** will define 32-bit pointers.
*/
#ifdef __INITIAL_POINTER_SIZE
#   pragma __pointer_size __save
#   pragma __pointer_size 64
#endif

typedef char            * __wide_char_ptr;
typedef const char      * __wide_const_char_ptr;
typedef int             * __wide_int_ptr;
typedef const int       * __wide_const_int_ptr;
typedef char           ** __wide_char_ptr_ptr;
typedef const char     ** __wide_const_char_ptr_ptr;
typedef void            * __wide_void_ptr;
typedef const void      * __wide_const_void_ptr;

#include <curses.h>
typedef WINDOW          * __wide_WINDOW_ptr;

#include <stddef.h>
typedef size_t          * __wide_size_t_ptr;

/*
** Restore pointer size.
*/
#ifdef __INITIAL_POINTER_SIZE
#   pragma __pointer_size __restore
#endif

Figure 3
Typedefs from the Private Types Header File

<strong>Digital</strong> <strong>Technical</strong> Joumal Vol. 8 No. 2 <strong>1996</strong>


I*<br />

** Def ine macros that are used to determine pointer size and<br />

** macros that wi l l copy from high memo ry onto the stack.<br />

*I<br />

#ifdef INITIAL POINTER SIZE<br />

# i n clude <br />

# def ine C$$IS SHORT ADDR(addr)<br />

( ( ( ( int64 )(addr)32 ) = = (unsigned int64)addr)<br />

# define C$$SHORT ADDR OF STRING(add r) \<br />

(C$$ IS SHORT A DDR(;d d ) ? (char *) (addr) \<br />

: ( c har - *) st cpy( __ A LLO CA(strlen(addr) + 1 ) , (ad dr)) )<br />

# defi n e C$$SHORT ADDR OF STRUCT(add r) \<br />

(C$$ IS SHORT AD DR(;d d ) ? (void *) (addr) \<br />

: ( vo id - *) me; cpy( __ A LLOCA(sizeof (* add r)), Caddr), sizeof (*addr ) ) )<br />

# defi n e C$$SHORT_A DDR_O F_MEMORYCaddr, len) \<br />

(C$$I S SHORT ADDRCadd r) ? (voi d *) (addr) \<br />

: ( voi d *) me; cpy( __ A LLOCA ( L en), (ad dr), Len ) )<br />

#else<br />

# define C$$IS_SHORT_A DDR(addr) ( 1 )<br />

# define C$$SHORT_A DDR_O F_STRING Caddr ) (addr)<br />

# define C$$SHORT_A DDR_O F_STRUCT(addr) (addr)<br />

# define C$$SHORT_A DDR_O F_MEMORY(addr, len) (addr)<br />

#endif<br />

Figure 4<br />

iVL lnos ti·om <br />

Implementing the strerror Return Pointer<br />

The function strcrror always returns a 32-bit pointer.<br />

The memory is allocated by the C ru n-time library for<br />

both 32-bit and 64-bit calling programs. As shown<br />

in Figure 5, we moved the fu nction strcrror into the<br />

section "Functions that support 64-bit pointers" of<br />

to show that there arc no rcsrrictions on<br />

the usc of this fi.mcrion.<br />

The "Create 32-bit header file rypcdcts" section of<br />

is in the 32-bit pointer section, where the<br />

bound-to-low-memory data structures :.1re declared.<br />

The fu nction retu rns a pointer to a character string.<br />

We, therefore, added typedefs fo r _c har_ptr32 and<br />

_const_char_ptr32 while in a 32-bit poinrer context.<br />

These declarations :re protected wirh rhe ddinition of<br />

_CHAR_PTR32 ro allow multiple header ti les to usc<br />

the same naming convention. Declarations of the<br />

consr fo rm of the typcdef arc always made in the same<br />

conditional code since they usually :Jrc needed and<br />

using the same condition removes the need to r a dif<br />

krcnt protecting n:Jmc.<br />

The strerror fi.111ction could have been implemented<br />

in by placing the fi.mction in the 32-bit sec ­<br />

tion, but that wo uld have implied that the 32-bit<br />

pointer was a restriction that could be removed later.<br />

The pointer is not a restriction, and the strerror fu nc­<br />

tion fu lly supports 64 -bit pointers.<br />

Tile private header tile typedefs are always declared<br />

starti ng with rwo underscores and ending in either<br />

"_ptr32" or "_ptr64." These typedets arc created only<br />

when the header tile needs to be in a particular<br />

pointer-size mode while referring to a pointer of the<br />

other size. The return v:.1 lue of strerror is modified to<br />

usc the typedef_char_ptr32.<br />

Including the header tile, which declares strerror,<br />

allows the compiler to vcrif)' that the arguments,<br />

return values, and pointer sizes are correct.<br />

Widening the strlen Argument<br />

The fu nction strlen accepts a constant character<br />

pointer and re turns an unsigned integer (size_t) .<br />

Implementing fu ll 64-bit support in strlcn means<br />

changing the parameter to a 64-bit constant character<br />

pointer. If an application passes a 32-bit pointer to<br />

the strlcn function, the compiler-generated code sign<br />

extends the pointer. The required header tile mod­<br />

ific:tion is to simply move strlen fr om the sec­<br />

tion "Functions that need 64-bit support" to the<br />

section "Functions that support 64 -bit pointers."<br />

The steps necessary to r the source code to support<br />

64 -bit addressing arc as tallows:<br />

l. Ensure that the module includes header files that<br />

declare strlen.<br />

Digjral Tl·dmical journal Vol. 8 No. 2 <strong>1996</strong> 91


92<br />

#ifndef<br />

#define<br />

I*<br />

Figure 5<br />

Final Forrn of <br />

HEADER_LOADED<br />

HEADER LOADED<br />

** Ensure that we begin with 32-bi t poi nters .<br />

*I<br />

#if INITIAL POINTER SIZE<br />

# i y- ( VMS VER < 70000000 )<br />

# error "Poi nter size added in OpenVMS V7 .0 for Alpha"<br />

# endif<br />

# pragma __ po inter_s i z e save<br />

# pragma __ po inter_ size 32<br />

#endi f<br />

I*<br />

** STRUCTURES NOT AFFECTED BY POINTERS<br />

*I<br />

#ifndef SIZE_T<br />

# define SIZE T 1<br />

typedef unsi gned int size_t ;<br />

#endif<br />

I*<br />

** FUNCTIONS THAT NEED 64-BIT SUPPORT<br />

*I<br />

I*<br />

** Create 32-bit header fi le typedefs.<br />

*I<br />

#ifndef CHAR_P TR32<br />

# def ine<br />

typedef<br />

typedef<br />

#end i f<br />

I*<br />

CHAR PTR32<br />

char * __ char_ptr32; const char * __ const_char_ptr32; ** Create 64-b it header fi le typedefs.<br />

*I<br />

#ifndef CHAR_ PTR64<br />

# define CHAR PTR64<br />

# pragma __ p ointer_ size 64<br />

typedef char * __ c har_p tr64;<br />

typedef con st char * __ c onst_c har_p tr64;<br />

# pragma __ p o inter_s i z e 32<br />

#endif<br />

I*<br />

** FUNCTIONS RESTRICTED FROM 64 BITS<br />

*I<br />

int execv( __ c onst_c har_ptr64, char *[]);<br />

I*<br />

** Change default to 64-bit po inters .<br />

*I<br />

#i f INITIAL PO INTER SIZE<br />

# pragma __ p ointer_sT ze 64<br />

#endif<br />

I*<br />

** The fol lowi ng functions have interfaces of XXX, _X XX32,<br />

** and XXX64 .<br />

**<br />

** The function strcat has two interfaces because the return<br />

** argument i s a po inter that is relative to the first arguments.<br />

**<br />

** The ma l l oc funct ion returns ei ther a 32-bit or a 64-bi t<br />

** memo ry address .<br />

*I<br />

#if INITIAL PO INTER SIZE<br />

# pragma __ p oi nter_ size 32<br />

#end i f<br />

Digiral Tcc hnicl )oun1


Figure 5<br />

Continued<br />

vo id *malloc(si ze_t __ s ize);<br />

char *strcat(char * __ s 1 , __ c onst_ char_p tr64 __ s 2) ;<br />

# i f INITIAL _P O INTER_S IZE = = 32<br />

# pragma __ p ointer_ size 64<br />

#endif<br />

#i f INITIAL POINTER SIZE && VMS VER >= 70000000<br />

# pragma __ p ointer_s i z e 32<br />

voi d *_m a l l oc32(size_t ) ;<br />

char *_s t rcat32(char * __ s 1 , __ c onst_ char_ptr64 __ s 2 );<br />

# pragma __ p ointer_ size 64<br />

void *_ma l l oc64(si ze_t );<br />

char * strcat64(char * __ s 1 , const char * __ s 2 >;<br />

#endif<br />

I*<br />

** FUNCTIONS THAT SUPPORT 64-BIT POINTERS<br />

*I<br />

voi d free(voi d * __ p t r);<br />

int rand(void);<br />

size t strlen(const char * __ s );<br />

__ c har_pt r32 strerror(int __ e rrnum);<br />

I*<br />

** Restore the user ' s poi nter context .<br />

*I<br />

#i f INITIAL POINTER SIZE<br />

# prag ma __ po inter_si ze __ r estore<br />

#end i f<br />

#endif I* HEADER LO ADED *I<br />

2. Add the kJJiowing line of code to the top of the<br />

module: II i n c 1 u de < w i de_ types . s r c >.<br />

3. Ch:mge the declaration of th.e fu nction to accept<br />

a _wide_const_char_ptr parameter insteJd of the<br />

previous eonst char * parameter.<br />

4. Visually tcJI Iow this argument through the code,<br />

looking for assignment statements. This particular<br />

tl11Ktion wou ld be a simple loop. If local variables<br />

store this pointer, thev must also be declared as<br />

_ wide_const_chJr_ptr.<br />

5. Compile the source code using the directive<br />

/wJrn=enable=maylosedata to have the compiler<br />

help detect pointer truncation.<br />

6. Add ::1 new test to the test system ro exercise 64 -bit<br />

poimers.<br />

Restricting execv from High Memory<br />

Examination of the execv fi.m ction prototype showed<br />

thJt this fu nction receives two argumenrs. The ri rst<br />

::rgumem is J pointer to the name of the file to start.<br />

The second Mgu ment represents the ::rg" array that is<br />

to be passed to the child process. This JJTJY of pointers<br />

to null terminated strings ends with a NULL pointer.<br />

Initially, the exeev fu nction was to have had two<br />

implementations. The parameters passed to the execv<br />

fu nction are used as the parameters to the main func­<br />

tion of the child process being started . Because no<br />

assumptions could be made about that child process<br />

(in terms of support tor 64 -bit pointers), these para­<br />

meters are restricted to low memory addresses.<br />

To illustrate that the art,"' passing was a restriction,<br />

\\'e place that prorotvpe into the section "Functions<br />

restri cted from 64 bits" of . The first argu­<br />

ment, the name of the tile, did not need to have this<br />

restriction. The section "Create 64 -bit header file<br />

typcdds" was enhanced to add the definition of<br />

_const_c har_ptr64, which allows the prototypes to<br />

define a 64-bit pointer to constant characters while in<br />

either 32-bit or 64 -bit context.<br />

Returning a Relative Pointer in strcat<br />

The strcat function returns a pointer relative to its tirst<br />

argument. We looked at this function and determined<br />

that it required two entry points. In addition, we<br />

widened the second parameter, which is the address of<br />

the string to concatenate to the second, to allow the<br />

application to concatenate a 64-bit string to a 32-bit<br />

string without source code changes.<br />

<strong>Digital</strong> Tcc hnictl )ourn:tl Vo l. 8 No. 2 <strong>1996</strong> 93


94<br />

Figure 5 shows the changes made to support fu nctions<br />

that have pointer-size-specific entry points. The<br />

prototypes of functions X.:'\.X., _XXX32, and _XXX64<br />

begin in 64-bit pointer-size mode. Since the unmod ified<br />

fi.mction name (stt-cat, XXX) is to be in the pointer<br />

size specified by the /pointer_size qualifier, the<br />

pointer size is changed fr om 64 bits to 32 bits if and<br />

only if the user has specified /pointer_size=32. At this<br />

point, we are not certain of the pointer size in effect.<br />

We know only that the size is the same as the size of<br />

the qualifier. The second argu ment to strcat uses the<br />

_const_char_ptr64 typedef in case we are in 32-bit<br />

pointer mode. Notice the declaration of _strcat64<br />

does not use this typedef because we are guaranteed<br />

to be in 64-bit pointer context. Figure 6 shows the<br />

implementation of both the 32-bit and the 64-bit<br />

strcat fi.mctions.<br />

Th e 64-bit mal/oc Function<br />

The implementation of multiple entry points was discussed<br />

and demonstrated in the strcat implementation.<br />

Al though multiple entry points are typically added to<br />

woid tru ncating pointers, fi.mctions such as memory<br />

allocation routines have newly defined behavior.<br />

The functions decc$malloc and decc$_malloc64<br />

use new support provided by the OpenVMS Alpha<br />

operating system fo r allocating, extending, and ri·ecing<br />

64-bit virtual memory. The C run-time library utilizes<br />

this new fu nctionality through the LIBRTL entry<br />

points. The LIBRTL group added new entry points fo r<br />

each of the existing memory management fu nctions.<br />

The LI BRTL includes an additional second t:ntry<br />

point fo r the fr ee fu nction. Since our implementation<br />

of the free function simply widens the pointer, wt: end<br />

up with a single, C run-time library function that must<br />

choose which LIBRTL function to call.<br />

#include <br />

#include <br />

I*<br />

** STRCATI STRCAT64<br />

**<br />

int free( __ w i de_void_p tr ptr) {<br />

if ( ! ( C $$I S_S HORT_A DDR(ptr)))<br />

}<br />

return ( c$$_f ree64(ptr));<br />

else return(c$$ free32((void *) ptr);<br />

Concluding Remarks<br />

** The 'strcat ' function concatenates 's2 ' , including the<br />

** terminating nul l character, to the end of ' s 1 ' .<br />

*I<br />

The project took approximatdy seven person-months<br />

to complete. The work involved two months to determine<br />

what we wanted to do, one month to tlgurc out<br />

how we were. going to do it, and fo ur person -months<br />

to modi ', document, and test the software.<br />

During the initial two months, the technical leaders<br />

met on a weekly basis and discussed the overall<br />

approach to adding 64-bit pointers to the OpenVi'vlS<br />

environment. Since I was the technical lead for the C<br />

run-time library project, this initial phase occupied<br />

between 25 and 50 percent of my time.<br />

The ont: month of detailed analysis and design consumed<br />

more than 90 percent of my time and resulted<br />

in a detailed document of approximately 100 pages.<br />

The document covered each of the 50 headt:r fi les and<br />

500 function interfaces. The fu nctions were grouped<br />

by type, based on the amount of work required to<br />

support 64 -bit pointers.<br />

The first month of implementation occupied Iwarl'<br />

all of my time, as I made several false starts. Once I<br />

worked out the tl nal implementation technique, I<br />

completed at least two of each type of work. As coding<br />

deadlines approached, I taught nvo other engineers on<br />

my team how to add 64-bit pointer support, pointing<br />

out those fu nctions already completed tor reference.<br />

They came up to speed within one week. Togcthn, we<br />

completed the work during the fi nal month of the<br />

project.<br />

__ w i de_ char_ptr _s trcat64 < __ w i de_ char_ptr s1 , __ w i de_ const_ char_ptr s2)<br />

{<br />

}<br />

(void) _m emcpy64 ( ( s1 + strlen(s1 ) ) , s2, (strlen(s2) + 1 ) );<br />

return(s1 ) ;<br />

char * strcat32


Acknowledgments<br />

The author would like to acknowledge the others who<br />

contributed to the success of the C run-time library<br />

project. The engineers who helped with various<br />

aspects of the analysis, design, and implementation<br />

were Sandra Whitman, Brian McCarthy, Greg Tarsa,<br />

Marc Noel, Boris Gubenko, and Ken Cowan. Our<br />

writer, John Paolillo, worked countless hours documenting<br />

the changes we made to the library.<br />

References<br />

l. M. Harvey and L. Szubowiez, "Extending OpcnVl'viS<br />

for 64-bit Addressable Virtual Memory," <strong>Digital</strong><br />

<strong>Technical</strong> journal, val. 8, no. 2 (<strong>1996</strong>, this issue):<br />

57-7 1.<br />

2. T. Benson, K. Noel, and R. Peterson, "The OpenVMS<br />

Mixed Pointer Size Environment," Dip,ital <strong>Technical</strong><br />

.foumal. vol. 8, no. 2 ( <strong>1996</strong>, this issue): 72-82.<br />

3. nH: C User \ Guide fo r Open Vt\1/S Systems (Maynard,<br />

MJss.: <strong>Digital</strong> Equipment Corporation, Order No.<br />

AA- Plct\ZE TK, 1995).<br />

4. f)h'C C Ru ntime Library Reference t\lla nual fo r<br />

OpenVMSSystems (Maynard, Mass.: <strong>Digital</strong> Equipment<br />

CorporJtion, Order No. AA-PU N EE-TK, 1995 ) .<br />

5. Open VMS Ca lling Standard (Maynard , MJss.: <strong>Digital</strong><br />

Equipment Cmporation, Order No. AA-QSB RA-TE,<br />

1995)<br />

Biography<br />

Duane A. Smith<br />

As a consul ring soti:wan: engineer, Duane Smith is currently<br />

.<br />

an:hitect and project leader of the C run-rime libr:ry for<br />

rhe Open VMS VAX and Alpha platforms. He joined <strong>Digital</strong><br />

in 1981 and h•lS worked on a variety of projects, including<br />

the A-to-1: Datab


Building a Hig h-performance<br />

Message-passing System for<br />

MEMORY CHANNEL Clusters<br />

The new MEMORY CHANNEL for PCI cluster<br />

interconnect technology developed by <strong>Digital</strong><br />

{based on technology from Encore Computer<br />

Corporation) dramatically reduces the over­<br />

head involved in intermachine commun ica-<br />

tion. <strong>Digital</strong> has designed a software system,<br />

the TruCiuster MEMORY CHANNEL Software ver­<br />

sion 1.4 product, that provides fast user-level<br />

access to the MEMORY CHANNEL network and<br />

can be used to implement a form of distributed<br />

shared memory. Using this product, <strong>Digital</strong> has<br />

built a low-level message-passing system that<br />

reduces the communications latency in a MEMORY<br />

CHANNEL cluster to less than 10 microseconds.<br />

This system can, in turn, be used to easily build<br />

the communications libraries that programmers<br />

use to parallelize scientific codes. <strong>Digital</strong> has<br />

demonstrated the successful use of this message­<br />

passing system by developing implementations<br />

of two of the most popular of these libraries,<br />

Parallel Virtual Machine (PVM) and Message<br />

Passing Interface {MPI).<br />

Vo l. 8 No. 2 1':!96<br />

I<br />

James V. Lav.rton<br />

John ]. Brosnan<br />

Morgan P. Doyle<br />

Seosamh D. 6 Riordain<br />

Timothy G. Reddin<br />

During rlw l:st [(:,,. vc1rs, signiric:nr rcsc:1rch ;l ml<br />

dn·clopment has been undertaken in both academia<br />

and industry in an dlc>rt to reduce the cost of high­<br />

pertimnanu: computing ( H PC ). The method most<br />

ti·cqueml\' used was to build p;lrallel S\'Stems out of<br />

clusters of commodity ,,·orkstations or sen"Crs tlut<br />

could be used as a ,·i rtu;ll supercomputcr. ' The nwti­<br />

\'Jtion t(>r this '' ork ,,·as the tremendous g:1ins that<br />

have lx:en ;K hievcd in reduced instruction set com­<br />

puter (!USC:) microprocessor perf()l'mJnce during the<br />

last decade. Indeed , processor pertormance in rod:1v's<br />

workst:Jtions and sen·ers often cxceeds that ofrhc indi­<br />

,·idu;ll processors in a tigiHI\· coupled supercomputer.<br />

Hm,·e,·er, trad itional Joc1l 1rc:1 net\\'ork (LAN) per·<br />

to rm1nce has not kept pace ll'ith microprocessor<br />

pert(mn:lJKe. LANs, such as rl bcr distributed d:na<br />

intcrbce (FDDI), ofkr rosonable bandwidth, since<br />

communication is geneLlllv GlJTied our Lw mems of<br />

rradirion1l prorocol stacks suclJ as the usn d:t:gr;1lll<br />

protoco!jimernct protocol ( U D P /1 !' ) or the tr:msmission<br />

conrroJ protocol/internet protowl (TCI)/I P),<br />

bur softw:t re O\'erhead is 1 major factor in mess:gc­<br />

rranskr time 2 This sofi-,,.,lre o\'erhead is not red uced<br />

by building hster LAN ncrwork hardware. 1\:\rhcr, a<br />

ne\\' appro:tch is needed-one that bvpasses the pro­<br />

tocol stack ll'hile prescn·ing sequencing, error detec­<br />

tion , :tnd p rotection.<br />

Much current research is dcnncd to reducing this<br />

communicarions O\'erhe::Jd usi ng specialized lurdw1re<br />

:1 nd sof-tware. To this


Data may then be transkrred between the machines<br />

using simple memory read and write operations, with<br />

no software overhead, essentially utilizing the fu ll per­<br />

formance of the hardware. This approach is similar to<br />

the one used in the Princeton SHRJMP project, where<br />

this process is described as Virtual Memory-Mapped<br />

Com munication (VJ'v!MC). 7-'"<br />

Figure 1 shows the re lationship between the various<br />

components of our message-passing system. The tirst<br />

phase of our work involved designing a program­<br />

ming library and associated kernel components to pro­<br />

vide protected , unprivileged access to the MEMORY<br />

CHANNEL network. Our objective in creating this<br />

library was to provide a fa cility much like the standard<br />

System V interproccss communication (J PC ) shared<br />

memory fi.mctions available in UNIX implementations.<br />

Programmers could usc the library to set up operations<br />

over the M EMORY CHANNEL interconnect, but they<br />

would not need to use the library ti.mctions tor data<br />

transfer. In this way, pcrtonnance cou ld be maximized.<br />

This product, the Tr uCiustcr MEMORY CHANNEL<br />

Software, provides programmers with a simple, high­<br />

pert(mllancc mechanism kH- building parallel systems.<br />

TruCiustcr MEMOLY CHANNEL Software delivers<br />

the pert(xm:mce available ti·om the MEMORY<br />

CHANNEL network directly to user applications but<br />

rcq uircs a programming style that is different from<br />

that req uired for shared memory. This different pro­<br />

gramming style is necessary because of the diffe rent<br />

access characteristics between local memory and mem­<br />

orv on a remote node connected through a MEMORY<br />

CIANNEL network. To make programming with the<br />

MEMOKY CHANNEL technology re latively simple<br />

while continuing to deliver the hardware performance,<br />

we built a library of primitive communications fu nc­<br />

tions. This system, called Un iversal Message Passing<br />

(UMP), hides the details of MEMORY CH.AJ"lNEL<br />

operations from the programmer and operates seam­<br />

lcssly over two transports (initially): shared memory<br />

and the Mb\ll ORY CHANNEL interconnect. This<br />

allows seamless growth h·om a symmetric multipro­<br />

cessor (SMP) to a hi ll MEMORY CHANNEL cluster.<br />

Development can be done on a workstation, while<br />

production work is done on the cl uster. The UMP<br />

Figure 1<br />

SHARED<br />

MEMORY<br />

PARALLEL APPLICATION<br />

PVM I MPI<br />

UMP<br />

TRUCLUSTER<br />

MEMORY CHANNEL<br />

SOFTWARE<br />

Message-passi ng Systc111 Archirccrurc<br />

OTHER<br />

TRANSPORT<br />

layer was designed fr om the beginning with pertor­<br />

mance considerations in mind, particuLlrly \Vith<br />

respect to minimizing the overhead il1\·oJvcd in se nd­<br />

ing small messages.<br />

Two distributed memory models arc predominamly<br />

used in high- perfo rmance computing today:<br />

1. Data parallel, which is used in High Performance<br />

Fortran (HPF)." With this model, tl1e programmer<br />

uses parallel language constructs to indicate ro the<br />

compiler how to distribute data and what opera­<br />

tions should be pertormed on it. The problem is<br />

assumed to be regular so that the compiler can use<br />

one of a number of data distribu tion algorithms.<br />

2. Message passing, which is used in Parallel Virtual<br />

Machine ( PV M) and Message Passing Inrcrt:K e<br />

( M PI ) u 15 In this approach, all messaging is per­<br />

fo rmed explicitly, so the application programmer<br />

determines the data distribution algorithm, making<br />

this approach more suitable fo r irrcgubr probkms.<br />

It is not yet. clear whether one of these approaches<br />

will predominate in rhe ti.tturc or if both will continue<br />

to coexist. <strong>Digital</strong> has been worki ng to provide com­<br />

petitive solutions fo r both approaches usi ng MEMORY<br />

CHANNEL cl usters. <strong>Digital</strong>'s H PF \\'Ork has been<br />

described in a previous issue of the Jo urnal'"·'' This<br />

paper is primarily concerned with message passing.<br />

Building on the UMP layer, we constructed imple­<br />

mentations of two common message-passing systems.<br />

The tlrst, PVM, is a de facto st


depend ing on the bandwidth of the I/0 subsystem<br />

into which the adapter is plugged. Although the cur­<br />

rent /vi EMORY CHANNEL network is a shared bus, the<br />

plan ti:.Jr the next generation is to utilize a switched<br />

technologv that will increase the aggregate bandll'idth<br />

of the llCt\Hlrk be\·ond thn of currentlv a\·aiLlblc<br />

. .<br />

s\\·itchcd LAN technologies. The latency (time to send<br />

a minimum-length message one way bct\vccn t\vo<br />

processes) is less than S mic roseconds (1-ls). The<br />

MEMORY CHANNELnet\vork provides a commLIIlic:t­<br />

tions medium with a low bit-error rate, on the order of<br />

10-u,_ The proba bil it\' of undetected errors occurring<br />

is so sm::Lil (on the order of the undetected error rate of<br />

CPUs and memory Sll bsystcms) that it is esscnri1llv<br />

negligi ble. A MEMORY C H ANNE L cluste r consists of<br />

one or more PCI MEMORY CH ANNE L adapters on<br />

each node and a hub connecting up to eight nodes.<br />

The Mt-: MORY CH ANNEL cluster suppmts l<br />

512-MB glob:! address space into \\'hich e::ch adapter,<br />

under opcr:ti ng S\'Stcrn contml, em map regions of<br />

local \'irtual 1ddress space." figmc 2 illustrates the<br />

JVJ F.MORY CHANNEL opc r;1tion. Figure 2a shows<br />

transmission, and Figure 2h shows reception . A p1ge<br />

table cntr· (PTE) is an cntrv in the svstcm \'irtull­<br />

to-plwsicll map th::n translates the ,·irtual :ddrcss of<br />

a p:ge to the corresponding phvsical address. The<br />

MEMOKY CHANNEL adapter comains a page comrol<br />

table (PC:T) that i ndi cates tr reception<br />

Vo l. S :--Jo 2 I')')(,<br />

Thus, \\'hen a MEMO 1\Y CH ANNEL network packet<br />

is received that corresponds to the page tb:Lt is mapped<br />

t( Jr reception , the data is transkrred directly to the<br />

appropriate page of physic1l memory by the system 's<br />

DMA engine. In addition, an\' cache lines th:Lt rckr to<br />

the updned page are in\-llidatcd.<br />

Subscq ucmlv, am· \\'rites to the mapped page oh·ir­<br />

tual mcmmy on the r! rst node result in correspond i ng<br />

writes to physical memory on the second node. This<br />

means th


TRANSMIT<br />

REGION<br />

VIRTUAL<br />

ADDRESS<br />

RECEIVE<br />

REGION<br />

VIRTUAL<br />

ADDRESS<br />

Figure 2<br />

/\'I EMORY CHANNEL OperJrion<br />

TRUCLUSTER MEMORY<br />

CHANNEL API LIBRARY<br />

TRUCLUSTER MEMORY CHANNEL<br />

KERNEL SUBSYSTEM<br />

LOW-LEVEL KERNEL<br />

MEMORY CHANNEL FUNCTIONS<br />

EXECUTES STORE INSTRUCTION<br />

TO VIRTUAL ADDRESS IN<br />

TRANSMIT REGION<br />

VIRTUAL-TO­<br />

PHYSICAL<br />

ADDRESS<br />

TRANSLATION<br />

PAGE TABLE ENTRY<br />

PHYSICAL ADDRESS<br />

IN PCI I/0 SPACE<br />

(a) Transmission<br />

MEMORY CHANNEL<br />

ADAPTER<br />

PAGE CONTROL<br />

TABLE<br />

MEMORY<br />

CHANNEL<br />

ADDRESS<br />

DATA<br />

-+ --------- - --------- - --- , DATA RETURNED FROM<br />

' , PHYSICAL MEMORY<br />

EXECUTES LOAD INSTRUCTION<br />

FROM VIRTUAL ADDRESS IN<br />

RECEIVE REGION<br />

VIRTUAL-TO­<br />

PHYSICAL<br />

ADDRESS<br />

TRANSLATION<br />

PAGE TABLE ENTRY<br />

USER SPACE<br />

KERNEL SPACE<br />

CACHE<br />

INVALIDATE<br />

Figure 3<br />

TruCiusrer ,vi El'v!ORY CHANNEL Sofrware Architecture<br />

a set of simple building blocks that arc analogous to the<br />

System V !PC bcility in most UNIX implementations.<br />

The advantage is that while a very simple interface is<br />

provided initially, the intertiKe can later be extended as<br />

(b) Reception<br />

'<br />

' '<br />

I<br />

I<br />

I<br />

I<br />

PAGE CONTROL<br />

TABLE<br />

MEMORY<br />

CHANNEL<br />

ADDRESS<br />

DATA<br />

required, \Vithout losing compatibility with applications<br />

based on the initial implementation. Table l details the<br />

MEMORY CHAl'\!NEL API library timctions that the<br />

product provides. An important feature to note is that<br />

when a MEMORY CHANNEL region is allocated using<br />

TruCJuster MEMORY CHANNEL Software, a key is<br />

specified that uniquely identifies this region in the clus­<br />

ter. Otber processes anywhere in the cluster can attach<br />

to the same region using the same key; the collection of<br />

keys provides a clusterwide namespace.<br />

The MEMORY CHAN EL API library communi­<br />

cates with the kernel subsystem using kmodcall, a sim­<br />

ple generic system call used to manage kernel<br />

subsystems. The library fu nction constructs a com­<br />

mand block containing the type of command (i.e.,<br />

<strong>Digital</strong> Tech nical Journal Vol. R No. 2 <strong>1996</strong> 99


Ta ble 1<br />

TruCiuster MEMORY CHANNEL API Library Functions<br />

Function<br />

Name<br />

imc_asalloc<br />

lmc_asattach<br />

lmc_asdetach<br />

imc_asdealloc<br />

imc_l kalloc<br />

imc_l kacquire<br />

imc_lkrelease<br />

imc_l kdealloc<br />

imc_rderrcnt<br />

imc ckerrcnt<br />

imc kill<br />

imc_gethosts<br />

Description<br />

Allocates a region of MEMORY CHANNEL address space of a specified size and permissions and<br />

with a user-supplied key; the ability to specify a key allows other cluster processes to rendezvous<br />

at the sa me region. The function returns to the user a clusterwide ID for this region.<br />

Attaches an allocated MEMORY CHANNEL region to a process virtual address space. A region<br />

can be attached for transmission or reception, and in shared or exclusive mode. The user can also<br />

request that the page be attached in loopback mode, i.e., any writes will be reflected back to the<br />

current node so that if an appropriate reception mapping is in effect, the result of the writes can<br />

be seen local ly. The virtual address of the mapped region is assigned by the kernel and returned<br />

to the user.<br />

Detaches an allocated MEMORY CHANNEL region from a process virtual address space.<br />

Deallocates a region of MEMORY CHANNEL address space with a specified 10.<br />

Allocates a set of clusterwide spinlocks. The user can specify a key and the required permissions.<br />

Normally, if a spin lock set exists, then this function just returns the ID of that lock set; otherwise<br />

it creates the set. If the user specifies that creation is to be exclusive, then fa i I u re wi II resu It if the<br />

spinlock set exists already. In addition, by specifying the IMC_CREATOR flag, the first spin lock in<br />

the set will be acquired. These two features prevent the occurrence of races in the allocation of<br />

spin lock sets across the cluster.<br />

Acquires (locks) a spin lock in a specified spin lock set.<br />

Releases (unlocks) a spin lock in a specified spin lock set.<br />

Deallocates a set of spin locks.<br />

Reads the clusterwide lVI EMORY CHANNEL error count and returns the value to the user. This<br />

va lue is not guaranteed to be up-to-date for all nodes in the cluster. It can be used to construct<br />

an application-specific error-detection scheme.<br />

Checks for outstanding MEMORY CHANNEL errors, i.e., errors that have not yet been reflected in<br />

the clusterwide MEMORY CHANNEL error count returned by imc_rderrcnt. This function checks<br />

each node in the cluster for any outsta nding errors and updates the global error count accordingly.<br />

Sends a UNIX signal to a specified process on another node in the cluster.<br />

Returns the number of nodes currently in the cluster and their host names.<br />

1\'hich libran· fu nction has been called ) rmation, such<br />

as virtual addresses.<br />

Similarly, ro r each set of spin locks a I looted on the<br />

cluster there is a cluster lock dcscriprm ( CLD) that<br />

contains information describing the spinlock set,<br />

including its clusterwidc lock I D, the numhu of spin­<br />

locks in the set, the key, permissions, crc::ttion time,<br />

and the UID and GlD of the creating f> roccss. Fm an<br />

individual CLD, there is a host lock dcsniptor (HLD)<br />

tor each node that is using the spin lock set. The HLD<br />

contains the cluster ID oF the node and other nodc­<br />

spccifie int(>rmation : bout the spin]ock set. for a spe­<br />

cific HLD, there is a process lock descriptor ( I'LD) r( >r<br />

each pmccss on th:t node that is using the Sf)inlock


KEY:<br />

CLD 0<br />

CLD 1<br />

<br />

CLD CLUSTER LOCK DESCRIPTOR<br />

CRD CLUSTER REGION DESCRIPTOR<br />

HLD HOST LOCK DESCRIPTOR<br />

HRD HOST REGION DESCRIPTOR<br />

PLD PROCESS LOCK DESCRIPTOR<br />

PRO PROCESS REGION DESCRIPTOR<br />

HRD 0: HOST 4<br />

Figure 4<br />

Truc:luster ME/IIORY C:HA:-..: \:EL Kernel Data Structures<br />

HRD 1: HOST 6<br />

HRD 0: HOST 6<br />

HRD 1· HOST 1<br />

( J) Regions<br />

HLD 0: HOST 2 PLD 0: PID 3346<br />

HLD 1 HOST 0 I<br />

J L PLD<br />

- HLD 0 HOST 4 ...<br />

HLD 1: HOST 6<br />

...<br />

HLD 2: HOST 3<br />

set. The PLD contains the l' ID of the process that cre­<br />

ated the spinlock set .md any process-specific informa­<br />

tion about the spinlock set.<br />

All these cluster datur regions of Mr)v!ORY CHAN:-JEL space have been<br />

allocued on the cluster. The ti rst region, with descrip­<br />

tor CRD 0, is mapped on three nodes: host 4, host 6,<br />

and host 3. The diJgr:tm


Figure 5<br />

HOST A<br />

-------------------<br />

I<br />

INVOKE<br />

Command Relav Opcr


Gi\'cn the usefulness of coherent allocations, it may<br />

seem unusu:1l that we nude this feature an option<br />

rJ thcr th:m the dcbult. There are severa l t-eJsons t(x<br />

this. vV ith coherent Jllocations, the associated physical<br />

memory becomes nonpagcable on aU nodes within the<br />

cluster,


Performance<br />

The performance ofTruCtuster MEMORY CHI\N::--J EL<br />

Soft\\-11-c on a pair of AlphaScner 4100 5/300<br />

n1 1chi nes is prese nted in Tabl e 2. These measurements<br />

\\'CIT made using 1·crsion 1.5 1\Jl EMOIY C:HA0l:\EL<br />

atbptcrs. The lmKill'idth ( 64 M H/s )


extern Long asm(const char *<br />

#pragma intrinsic (asm)<br />

#def ine mb ( ) asm("mb" )<br />

#inc lude <br />

#include <br />

. . . ) ;<br />

rna in < )<br />

{<br />

int status, i , Locks=4, temp, errors;<br />

imc_asid_t region_i d;<br />

imc_L kid t Lock_i d;<br />

typedef struct {<br />

vo lati l e int processes;<br />

vol ati le int patte rn[2047J;<br />

shared_region;<br />

sha red_region *regi on_ read, *region_w ri te;<br />

caddr_t read_p tr = 0, wri te_p tr = 0;<br />

}<br />

I* MC region ID *I<br />

I* MC spi nlock set ID *I<br />

I* Shared data structure *I<br />

I* Al locate a region of coherent MC address space and attach to *I<br />

I* process VA *I<br />

imc_a sa l l oc (123, 8192, IMC_U RW, IMC_COHERENT, &regi on_ id);<br />

imc _a sattach(region_i d, IMC_T RANSMIT, IMC_SHARED, IMC_LOOPBACK, &write_pt r);<br />

imc_asattach(regi on_i d, IMC_ RECEIVE, IMC_SHARED, 0, &read_p tr);<br />

region_read = (shared_region *)wri te_p tr;<br />

region_w ri te = (shared_region *)re ad_ ptr;<br />

I* Allocate a set of spinlocks and atomi ca l l y acq uire the first Lock *I<br />

status = imc_ Lkalloc(456, &locks, IMC_LKU, IMC_C REATOR, &lock_id); errors = imc rderrcnt();<br />

i f (status =<br />

do {<br />

IMC SUCCESS) {<br />

region wri te->processes = 0;<br />

I * Ini tialize the globa l region *I<br />

for (iO ; ipa ttern[i ] =<br />

i--;<br />

mb ( ) ;<br />

i ;<br />

} wh i l e (imc ckerrcnt (&errors) I I region_read->pattern[i J != i )<br />

}<br />

imc_l kre leaseClock_i d, 0);<br />

else if (status == IMC EXISTS) {<br />

imc_lkallocprocesses + 1 ; I* Increment the process counter *I<br />

errors = imc_r d errcnt();<br />

do {<br />

region_w ri te->processes =<br />

mb processes - 1; I* Decrement the process counter *I<br />

errors = imc_r d errcnt( ) ;<br />

do {<br />

region_w ri te->processes = temp;<br />

mb ( ) ;<br />

} whi le (imc ckerrcntprocesses != temp )<br />

imc_L krelease( lock_i d, 0);<br />

imc_ Lkdea l l oc( lock_ id);<br />

imc_a sdetach( regi on_id);<br />

imc_asdea l l oc ( reg i on_ id);<br />

Figure 6<br />

Programming wirh TruCiusrer 1'vl EMORY CHANNEL Software<br />

I* Dea l locate spi n lock set *I<br />

I* Detach shared region *I<br />

I* Dea l locate MC address space *I<br />

Digit,ll <strong>Technical</strong> Journal Vol. 8 No. 2 <strong>1996</strong> 105


These gu1ls placed some important constraints on<br />

the architecture of UMP, parricularl' with regard to<br />

pcrt(>rmance. This meant that design decisions had<br />

to be constantly evaluated in terms of their pcrtrmance<br />

impact. The initial design decision was to use a dedi­<br />

cated point-to-point circular butter between ever)' pair<br />

of processes. These buf1ers use prod ucer and consumer<br />

indexes to control the reading and writing of bufte r<br />

contents. The indexes can be moditied only by the<br />

consumer and producer tasks and allow fll lly lockless<br />

operation of the buffe rs. Removing lock requirements<br />

eliminates not only the software costs associated with<br />

lock manipulation (in the initial implementation of<br />

TruCiuster Jv! EMORY CHANNEL Software, acquiring<br />

and releasing an uncontested spinlock takes approxi­<br />

mately 130 f.LS and 120 f.LS, respecti,-elv) but also the<br />

impact on processor pertormance associated with<br />

Load-locked/Store-conditional instruction sequences.<br />

Although this butlering style eliminates lock manip­<br />

ulation costs, it results in an exponential demand tor<br />

storage and can limit scalability. If there arc N processes<br />

communicating using this method, that implies N2<br />

bufkrs are req uired for filii mesh communication.<br />

MEMORY CHANNE L address space is a relatively<br />

scarce resource that needs to be carefullv husbanded.<br />

To manage the demand on cluster resources as t3 irly as<br />

possibk, we decided to do the t(>llowing:<br />

Figure 7<br />

Clusrcr Conlnlunicarion Using UMP<br />

• Allocate buftc rs sparsely, i.e., as required up to<br />

some dcbult limit. Full N2 al location would still be<br />

possible if the user increased the n um bcr of bu fk rs.<br />

• Make the size of the butlers contigurJble.<br />

• Usc lock-controlled single-writer, multiple-reader<br />

buffers to hand lc both the overflow trom the JV2<br />

bufkr and bst multicast. One of these bufkrs,<br />

cal led outbufs, would be assigned to each process<br />

using UMP upon initialization.<br />

Note that while the channel buffers are logically<br />

point-to-point, they may be implemented physically as<br />

either point-to-point or broadcast. For example, in the<br />

first \'ersion of UMP, we used broadcast MEMORY<br />

CHANNEL mappings fc>r the sake of simplicity. We


with task 2 using UMP channel buffers in shared mem­<br />

ory, shown as l->2 and 2--> l. Task l and task 3 communicate<br />

using UMP channel buffers in ME1viORY<br />

CHANNEL space, shown as l--> 3 and 3--> l. Task 3 is<br />

reading a message from task 1 using an outbuf. The<br />

outbuf can be written only by task 1 but is mapped fo r<br />

transmission to all other cluster members. On node 2,<br />

the same region is mapped for reception. Access to<br />

each outbuf is controlled by a unique cluster spinlock.<br />

Our rationale fo r taking this approach is that a short<br />

software path is more appropriate for small messages<br />

because overhead dominates message transfer time,<br />

whereas the overhead of lock manipulation is a small<br />

component of message transfer time for large mes­<br />

sages. We fe lt that this approach helped to control the<br />

use of cluster resources and maintained the lowest pos­<br />

sible latency tor short messages yet still accommodated<br />

large messages. Note that outbuts are still ti.xed-size<br />

buffers but are generally configured to be much larger<br />

than the N2 buffers.<br />

This approach worked for PVM because its message<br />

transfer semantics make it acceptable to fa il a mes­<br />

sage send request due to buffe r space restrictions<br />

(e.g., if both the N2 buffer and tbe outbuf are fu ll).<br />

When we analyzed the requirements tor MPI, how­<br />

ever, we fo und that this approach was not possible . For<br />

this reason, we changed the design to use only the N2<br />

buffers. Instead of writing the message as a single<br />

operation, the message is streamed through the buffer<br />

in a series of fragments. Not only does this approach<br />

support arbitrarily large messages, but it also improves<br />

message bandwidth by al lowing (and, fo r messages<br />

exceeding the available bufter capacity, requiring) the<br />

overlapped writing and reading of the message.<br />

Deadlock is avoided by using a background thread<br />

to write the message. Since overflow is now handled<br />

using the streaming N2 buffers, outbufs were not nec­<br />

essary to achieve the required level of performance fo r<br />

large messages and were not implemented. Outbufs<br />

are retained in the design to provide fa st multicast<br />

messaging, even though in the current implementa­<br />

tion they are not yet supported .<br />

Achieving the performance goals set tor UMP was<br />

not easv. In addition to the butler architecture<br />

described earlier, several oth er techniqu es were used.<br />

• No syscalls were allowed anywhere in the UMP<br />

messaging fu nctions, so U M P runs completely in<br />

user space.<br />

• Calls to library routines and any expensive arith­<br />

metic operations were minimized .<br />

• Global state was cached in .local memory wherever<br />

possi ble.<br />

• Careful attention was paid to data alignment issues,<br />

and all transfers are multiples of 32-bit data.<br />

At the programmer's level, UMP operation is based<br />

on duplex point-to-point links called channels, which<br />

correspond to the N2 buffe rs already described .<br />

A channel is a pair of unidirectional buffers used to<br />

provide two-way communication between a pair of<br />

process endpoints anywhere in the cluster. UMP pro­<br />

vides functions to open a channel between a pair of<br />

tasks. While the resources are allocated by the first task<br />

to open the channel, the connection is not complete<br />

until the second task also opens the same channel.<br />

Once a channel has been opened by both sides, UMP<br />

functions can be used to send and receive messages on<br />

that channel. It is possible to direct UMP to use shared<br />

memory or MEMORY CHANNEL address space for<br />

the channel buffers, depending on the relative location<br />

ofthe associated processes. In addition, UMP provides<br />

a fu nction to wait on any event (e.g., arrival of a mes­<br />

sage, creation or deletion of a channel). In total, UMP<br />

provides a dozen fu nctions, which are listed in Table 3.<br />

Most of the functions relate to initial ization, shut­<br />

down, and miscellaneous operations. Three fu nctions<br />

establish the channel connection, and three functions<br />

pertorm all message communications.<br />

UMP channels provide guaranteed error detection<br />

but not recovery. Through the use of TruC!uster<br />

MEMORY CHANNEL Software error-checking rou­<br />

tines, we were able to provide efficient error detection<br />

in UMP. We decided to let the higher layers implement<br />

error recovery. As a result, designers of higher layers can<br />

control the performance penalty they incur by specifY­<br />

ing their ovvn error recovery mechanisms, or, since<br />

rel iability is high, can adapt a fai l-on-error strategy.<br />

Performance<br />

UMP avoids any calls to the kernel and any copying of<br />

data across the kernel boundary. Messages are written<br />

directly into the reception buffer of the destination<br />

channel. Data is copied once from the user's buffe r<br />

to physical memory on the destination node by the<br />

sending process. The receiving process then copies the<br />

data from local physical memory to the destination<br />

user's butler. By comparison, the number of copies<br />

involved in a similar operation over a LAN using sock­<br />

ets is greater. In this case, the data has to be copied<br />

into the kernel, where the network driver uses DMA to<br />

copy it again into the memory of the network adapter.<br />

At this point the data is transmitted onto the LAN.<br />

The first version of UMP used one large shared<br />

region of MEMORY CHANNEL space to contain its<br />

channel buffers and a broadcast mapping to transmit<br />

this simultaneously to all nodes in the cluster. This<br />

version ofUMP also used loopback to reflect transmis­<br />

sions back to the corresponding receive region on the<br />

sending node, which resulted in a Joss of available<br />

bandwidth . Using our AI phaSe rver 2100 4/190<br />

development machines, we measured<br />

Digir:1l Te chnical journal Vol. 8 No. 2 <strong>1996</strong> 107


Ta ble 3<br />

UMP API Functions<br />

Function<br />

Name Description<br />

ump_init<br />

ump_exit<br />

ump_open<br />

ump_close<br />

ump_l isten<br />

Initializes UMP and allocates the necessary resources.<br />

Shuts down UMP and deal locates any resources used by the calling process.<br />

Opens a duplex channel between two endpoints over a given transport (shared memory or<br />

MEMORY CHANNEL). Channel endpoints are identified by user-supplied, 64-bit integer handles.<br />

Closes a specified UMP channel, deallocating all resources assigned to that channel as necessary.<br />

Registers an endpoint for a channel over a specified transport. This can be used by a server process<br />

to wait on connections from clients with unknown handles. This function returns immed iately,<br />

but the channel is created only when another task opens the channel. This can be detected using<br />

ump_wa it.<br />

ump_wa it Wa its for a UMP event to occur, either on one specified channel to this task or on all channels<br />

to this task.<br />

ump_read<br />

Reads a message from a specified channel.<br />

ump_write Writes a message to a specified channel. This function is blocking, i.e., it does not return until<br />

the complete message has been written to the channel.<br />

ump_n bread Starts reading a message from a channel, i.e., it returns as soon as a specified amount of the<br />

message has been received, but not necessarily all the message.<br />

ump_nbwrite Starts writing a message to a specified channel, i.e., it returns as soon as the write has started.<br />

A background thread will continue writing the message until it is completely transmitted.<br />

ump_mcast<br />

ump_info<br />

Writes a message to a specified list of channels.<br />

Returns UMP configuration and status information.<br />

• Lncncv : 11 f.LS (MEi'v!ORY CHANNEL), 4 JJ.S<br />

( sh;1rcd rncmorv)<br />

• Bandwidt h : 16 MB/s (MEMORY CHANNEL),<br />

30 ;VI B/s ( sha red m emory )<br />

To i ncrease bandwidth, we mod i icd U1\'I P to usc<br />

transmit-only regions fur its challnel butlers, th us<br />

el i mi m.ti ng Joopback. The per(>rmancc mosun.:d fo r<br />

the revised UMP usin g the same machines was<br />

• Ll tcncy: 9 f.LS (MEMORY CHANNEL), 3 f.LS<br />

(shared memory)<br />

• Bandwidth: 23 MB/s (MEMORY CHANNEL),<br />

32 MB/s (shared mernon·)<br />

figu re 8 sho\\'s the message tr:111scr time ;1nd Figu re<br />

9 sho\\'S the bandwidth (x various mcss1gc sizes (lr the<br />

ITI'iscd 1·crsion of UMP usi ng both blocki ng ;l lld non­<br />

bl ocking writes m·cr shared memorv and the MEMORY<br />

CHANNEL network. Usi ng newer AlphaSc nn 4100<br />

5/300 mach i nes, which have a Elster [/0 subsystem<br />

th111 the older machines, and version 1.5 /vi E!VlORY<br />

CHANNEL ad apters, the mc1su rcd latency is 5.8 f.LS<br />

(MEMORY CHANNEL), 2 f.LS ( shared memory) . The<br />

pc;lk bandwidth achieved is 61 M B/s (MEMORY<br />

CHANNEL), 75 MB/s ( shared memory ) . In the non­<br />

blocking cases, the buffe r size used 11·:-s 256 kilobvtcs<br />

(KB) (lr sh1 rcd memon· and 32 KB (lr MEMORY<br />

C:HANN EL. Further work is u nder 11·a1' to im prm·c the<br />

pcr(Jml;lllce using shared memorv 1 s the transport.<br />

This work is


100,000<br />

10,000<br />

i=<br />

a:<br />

UJ<br />

t!icn 1.ooo<br />

zo<br />

z<br />

a: a<br />

t-u<br />

100<br />

UJCI:<br />

UJU<br />

10<br />

KEY:<br />

Figure 8<br />

/./<br />

, ". "<br />

/ /<br />

·<br />

,. ;; ."'"' '<br />

, -.-- ·<br />

· -<br />

- .:.. :::: . - --- . .... .<br />

.'/<br />

!/<br />

'<br />

;<br />

y<br />

/<br />

<br />

. /.<br />

. . · "<br />

.,.,<br />

./.<br />

. /<br />

10 100 1 ,000 1 0.000 100,000 1 ,000,000<br />

MESSAGE SIZE (BYTES)<br />

UMP BLOCKING (SHARED MEMORY)<br />

UMP BLOCKING (MEMORY CHANNEL)<br />

UMP NONBLOCKING (SHARED MEMORY)<br />

UMP NONBLOCKING (MEMORY CHANNEL)<br />

UMI' Com municuions PcrfmmanC<br />

CD<br />

<br />

C.!l<br />

UJ<br />

30<br />

I<br />

1- 20<br />

0<br />

<br />

0 10<br />

z<br />

<br />

CD<br />

0<br />

KEY:<br />

50<br />

40<br />

- -----<br />

200,000 400,000 600,000 800,000 1 .000,000<br />

MESSAGE SIZE (BYTES)<br />

UMP BLOCKING (SHARED MEMORY)<br />

UMP BLOCKING (MEMORY CHANNEL)<br />

UMP NONBLOCKING (SHARED MEMORY)<br />

UMP NONBLOCKING (MEMORY CHANNEL)<br />

Figure 9<br />

U 1YI P C:ommun ic.ll ions Pcrtormnu.:: Band "'idrh<br />

PVM when using net\\'Orks like Ethernet or fDDJ.<br />

The high cost of communications f()r these systems<br />

means that only the more co :rse-grai ned paral lel appli­<br />

cations h:ve de monstrJted pert(mnancc im pro,·cments<br />

as a result of paralklization using PVM . Using the<br />

MEMORY CHANNEL cluster technology described<br />

earlier, we have impl e mented :m optimized PVM that<br />

otters lo\1' latencv :nd high -h:.llldwidth comnHi l1ic:­<br />

rions. The PVM librarv and daemon use UM I' ro pro­<br />

vide se:mless communications over the MEMORY<br />

CHANNEL cluster.<br />

\iVhen we began to devel op PVM ror MEMORY<br />

CHAN1 EL c lusters, \\'e had one overrid ing goal: to usc<br />

the hardw;u-c pe rtormance the MEMORY C:HANNFL<br />

interconnect ofters to prm·ide a PVM with industry­<br />

leading communicuions pertormancc, speciticallv with<br />

regard to btcncy. I nitiallv, we set a target btency t(n<br />

PVM of less than 15 f..lS using shared mcmorv md less<br />

than 30 f..lS using the MEMORY C H AN NEL tr:msport.<br />

Our ti rst task was to bui ld ;\ prototype using the<br />

public-domain PVM implementation. We used an<br />

earlv prototYpe of the MEMORY CHANNEL svstcm<br />

jointlv dc\'eloped by <strong>Digital</strong> and Encore. The prototype<br />

had a hardware latency of4 f..lS. We moditied the<br />

shared-memory version of PVM to usc the prototype<br />

hard\\'are and achieved a PVM l:ltencv of 60 f..lS.<br />

Protiling and straighttcx\\'ard code Jn:ll\'sis I-c\·ealed<br />

that most of the overhead was caused by<br />

• PVM's support t


Figure 10<br />

MEMORY CHANNEL CLUSTER<br />

r - - - - - - - - - - - - - - - - - - - - - - - - - - --,<br />

HOST 1 HOST 2<br />

- - - - - - - - - - - - - - - - - - - - - - - - - - - ,<br />

I<br />

DAEMON 1 PROCESS 1 PROCESS 2 DAEMON 2 PROCESS 3<br />

I I<br />

PVM DAEMON PVM APPLICATION<br />

I<br />

PVM APPLICATION I PVM DAEMON<br />

I<br />

PVM APPLICATION<br />

I<br />

I I<br />

PVM API LIBRARY PVM API LIBRARY PVM API LIBRARY I I PVM API LIBRARY PVM API LIBRARY<br />

I<br />

1 UNIX UMP<br />

UNIX I : '--- --'- -- --'<br />

I<br />

UMP UNIX I I<br />

I I<br />

UMP I I UNIX UMP UNIX UMP<br />

I<br />

I I I I<br />

I<br />

t B<br />

I<br />

t I I<br />

'-- ------------ ----A ---t -- ---------------- L - - - - - - - - ----- - - - - - - - - - - - ---<br />

D<br />

HOST 3<br />

r-------------------------------------------------------<br />

F<br />

I<br />

I<br />

I<br />

I<br />

DAEMON 3 PROCESS 4 GATEWAY<br />

I<br />

I<br />

PVM DAEMON PVM APPLICATION PVM GATEWAY<br />

I<br />

I PVM API LIBRARY PVM API LIBRARY PVM API LIBRARY<br />

PVM3 DAEMON<br />

I<br />

I<br />

INTERFACE<br />

UNIX UMP UNIX UMP<br />

I<br />

UNIX UMP<br />

I I I<br />

l _ _ _ !---------------------------------------- ----<br />

c<br />

---_j<br />

L - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - <br />

KEY:<br />

A A PVM application on host t performs local control functions using UNIX signals.<br />

B A PVM application on host 1 commun1cates w1th another PVM task on the same host using<br />

UMP (via shared memory).<br />

C A PVM application on host t communicates with another PVM task on a different host in the<br />

cluster (host 2) using UMP (via MEMORY CHANNEL).<br />

D A PVM application on host 1 requires a control function (e.g., a signal) to be executed on<br />

another host in the cluster (host 3 ); it sends a request to a PVM daemon on host 3.<br />

E The PVM daemon on host 3 executes the control function.<br />

F A PVM application on host t sends a message to a PVM task on a host outside the MEMORY CHANNEL<br />

cluster; the message is routed to the PVM gateway task on host 3.<br />

G The PVM gateway translates the cluster message into a form compatible with the external PVM<br />

implementation and forwards the message to the external task via IP sockets.<br />

<strong>Digital</strong> PVM Architecture<br />

gateway daemon. The PVM API library is a complete<br />

rewrite of the standard PVM ,·ersion 3.3 API, \\'ith<br />

which full fu nctional compatibility is maintained.<br />

Emphasis has been placed on optimizing the pcd()r·<br />

mancc of the most frequently used code paths. In<br />

addition, all data structures and dat: transters h;wc<br />

been optimized for the Alpha architecture. As stated<br />

earlier, the amount of message passing between tasks<br />

and the local daemon has been minimi;.ed by pcrt(mn·<br />

ing most operations in-line and communicating \\'ith<br />

the daemon only \\'hen absolutely necessary. !mer·<br />

mediate bufkrs are used fo r copying data between the<br />

user bufrcrs. This is necessary because ofthe semantics<br />

of PVM, which allow operations on buffer contents<br />

bd()re and at-i:cr a message has beetJ sent. The one<br />

exception to this is pvm_pscnd; in this case, data is<br />

copied directly since the user is not allowed to modi'<br />

the send buffer.<br />

The purpose of om PVM daemon is difkrem ri·om<br />

that of the daemon in tbc standard PVM package. Our<br />

daemon is designed to relay commands between diftl:rent<br />

nodes in the PVM cluster. It exists solely ro<br />

110 Digiral Technid Joumal Vnl. R No. 2 l9


comri burions of each of the group members, etc. The<br />

group fu nctions arc implemented separately ti-om the<br />

other PVM messaging fu nctions. They use a sep:.rate<br />

global structure (the group ta ble) to manage l'VJ\11<br />

group data. Access to the group table is controlled<br />

by locks. Unlike other l'VM implementations, there is<br />

no PVM group server, since all group operations c:tn<br />

manipu l:ne the group t:ble direcrlv.<br />

Performance<br />

Table 4 compares the communications latency achieved<br />

by various PVM implementations. A the ta ble indicates,<br />

the Lnency between two mKh incs with <strong>Digital</strong><br />

PVM over :1 MEMORY CHANNEL. tr:1nsport is much<br />

less th:n the l:1rcncy of the public-domain PVM<br />

implemelltation over shared memorr, which va lidates<br />

our approach of removing support r(Jr heterogeneity<br />

ti-orn the critical pertr l'VM l atency over the<br />

MEMORY CHANNEL transport given in Reterence 6<br />

were obtained us1ng an carlin- vers1on of<br />

Digir:li PVM. Since those results were measured,<br />

latency has been hah-ed, mostly due to improvements<br />

in UMI' pertcm11


Figure 14
Digital MPI Architecture
(a) Initial prototype: the MPI portable API library sits on the MPICH abstract device, the MPICH channel interface, and UMP, which runs over shared memory and MEMORY CHANNEL. (b) Version 1.0 implementation: the MPI portable API library sits on an MPICH abstract device front end and UMP, again over shared memory and MEMORY CHANNEL. In both panels, the dashed outline marks the MPICH abstract device interface.

[...] using MEMORY CHANNEL. By comparison, the unmodified MPICH achieves a peak bandwidth of 24 MB/s using shared memory and 5.5 MB/s using TCP/IP over an FDDI LAN.
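Peak bandwidth is typically measured with a ping-pong over large messages. The fragment below is an assumed harness using only standard MPI calls; it is not the measurement code behind the figures quoted here.

    #include <stdio.h>
    #include <mpi.h>

    #define NBYTES (1 << 20)     /* 1-MB messages for the bandwidth test */
    #define REPS   50

    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        MPI_Status status;
        int rank, i;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* Each iteration moves NBYTES in each direction. */
        if (rank == 0)
            printf("bandwidth: %.1f MB/s\n",
                   (2.0 * REPS * NBYTES) / ((t1 - t0) * 1.0e6));

        MPI_Finalize();
        return 0;
    }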

Figure 17 shows the speedup demonstrated by an MPI application. The application is the Accelerated Strategic Computing Initiative (ASCI) benchmark SPPM, which solves a three-dimensional gas dynamics problem on a uniform Cartesian mesh. The same code was run using both Digital MPI and MPICH using TCP/IP. The hardware configuration was a two-node MEMORY CHANNEL cluster of AlphaServer 8400 5/350 machines, each with six CPUs. Digital MPI used shared memory and MEMORY CHANNEL transports, whereas MPICH used the Ethernet LAN connecting the machines. The maximum speedup

Table 5
MPI Latency Comparison

MPI Implementation     Transport              Platform                  Latency
MPICH 1.0.10           Sockets (FDDI)         DEC 3000/800              350 µs
MPICH 1.0.10           Shared Memory          AlphaServer 2100 4/233    30 µs
Digital MPI V1.0       MEMORY CHANNEL 1.0     AlphaServer 2100 4/233    16 µs
Digital MPI V1.0       MEMORY CHANNEL 1.5     AlphaServer 4100 5/300    6.9 µs
Digital MPI V1.0       Shared Memory          AlphaServer 2100 4/233    9.5 µs
Digital MPI V1.0       Shared Memory          AlphaServer 4100 5/300    5.2 µs

obtained using Digital MPI was approximately 7, whereas for MPICH the maximum speedup was approximately 1.6.

Future Work

We intend to continue refining the components described in this paper. The major change envisioned regarding the TruCluster MEMORY CHANNEL Software product is the addition of user-space spin locks, which should significantly reduce [...] the use of an intermediate buffer [...]

[...] for scientific applications that utilizes a new network technology to bypass the software overhead that limits the applicability of traditional networks. The performance of this system has been proven to be on a par with that of current supercomputer technology and has been achieved using commodity technology developed for Digital's commercial cluster products. The paper demonstrates the suitability of the MEMORY CHANNEL technology as a communications medium for scalable system development.






References

7. M. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proceedings of the Twenty-first Annual International Symposium on Computer Architecture (April 1994): 142-153.

8. M. Blumrich et al., "Two Virtual Memory Mapped Network Interface Designs," Proceedings of the Hot Interconnects II Symposium, Palo Alto, Calif. (August 1994): 134-142.

9. L. Iftode et al., "Improving Release-Consistent Shared Virtual [...]

[...] High Performance Fortran for Distributed-memory Systems," Digital Technical Journal, vol. 7, no. 3 (1995): 5-23.

17. E. Benson et al., "Design of Digital's Parallel Software Environment," Digital Technical Journal, vol. 7, no. 3 (1995): 24-38.

18. In the first implementations, the PCI MEMORY CHANNEL network adapter places a limit of 128 MB on the amount of MEMORY CHANNEL space that can be allocated.

19. TruCluster MEMORY CHANNEL Software Programmer's Manual (Maynard, Mass.: Digital Equipment Corporation).

Biographies

James V. Lawton
[...] implementing UMP and adding support for collective operations to Digital PVM.

Before that, he worked on the characterization and optimization of customer scientific/technical benchmark codes and on various hardware and software design projects. Prior to coming to Digital, Jim contributed to the design of analog and digital motion control systems and sensors at the Inland Motor Division of Kollmorgen Corporation. Jim received a B.E. in electrical engineering (1982) and an M.Eng.Sc. (1985) from University College Cork, Ireland, where he wrote his thesis on the design of an electronic control system for variable reluctance motors. In addition to receiving the Hewlett-Packard (Ireland) Award for Innovation (1982), Jim holds one patent and has published several papers. He is a member of IEEE and ACM.



John J. Brosnan
John Brosnan is currently a principal engineer in the Technical Computing Group where he is project leader for Digital PVM. In prior positions, he worked on the High Performance Fortran test suite and was a significant contributor to a variety of publishing technology products. John joined Digital after receiving his B.Eng. in electronic engineering in 1986 from the University of Limerick, Ireland. He received his M.Eng. in computer systems in 1994, also from the University of Limerick.

Morgan P. Doyle
In 1994, Morgan Doyle came to Digital to work on the High Performance Fortran test suite. Presently, he is an engineer in the Technical Computing Group. Early on, he contributed significantly to the design and development of the TruCluster MEMORY CHANNEL Software, and he is now responsible for its development. Morgan received his B.A.I. and B.A. in electronic engineering (1991) and his M.Sc. (1993) from Trinity College Dublin, Ireland.

Seosamh D. Ó Riordáin
Seosamh Ó Riordáin is an engineer in the Technical Computing Group where he is currently working on Digital MPI and on enhancements to the UMP library. Previously, he contributed to the design and implementation of the TruCluster MEMORY CHANNEL Software. Seosamh joined Digital after receiving his B.Sc. (1991) and M.Sc. (1993) in computer science from the University of Limerick, Ireland.

Timothy G. Reddin
[...] for which he holds two patents, and the design of an I/O and communications controller. Tim also worked at Raytheon on the data communications subsystem for the NEXRAD distributed real-time Doppler radar programme. Tim is a member of the British Computer Society and is a Chartered Engineer.


The Design of User Interfaces for Digital Speech Recognition Software

Digital Speech Recognition Software (DSRS) adds a new mode of interaction between people and computers-speech. DSRS is a command and control application integrated with the UNIX desktop environment. It accepts user commands spoken into a microphone and converts them into keystrokes. The project goal for DSRS was to provide an easy-to-learn and easy-to-use computer-user interface that would be a powerful productivity tool. Making DSRS simple and natural to use was a challenging engineering problem in user interface design. Also challenging was the development of the part of the interface that communicates with the desktop and applications. DSRS designers had to solve timing-induced problems associated with entering keystrokes into applications at a rate much higher than that at which people type. The DSRS project clarifies the need to continue the development of improved speech integration with applications as speech recognition and text-to-speech technologies become a standard part of the modern desktop computer.

Bernard A. Rozmovits

In the 1960s and early 1970s, people controlled computers using toggle switches, punched cards, and punched paper tape. In the 1970s, the common control mechanism was the keyboard on teletypes and on video terminals. In the 1980s, with the advent of graphical user interfaces, people found that a new mode of interaction with the computer was useful. The concept of a pointer-the mouse-evolved. Its popularity grew such that the mouse is now a standard component of every modern computer. In the 1990s, the time is right to add yet another mode of interaction with the computer. As compute power grows each year, the boundary of the man-machine interface can move from interaction that is native to the computer toward communication that is [...]

[...] performs some actions. The action might be to launch an application, for example, in response to the command "bring up calendar"; or to type something, for example, in response to "edit to-do list," to invoke emacs \files\projectA\todo.txt. DSRS not only houses the speech macro capability but also provides a user interface, a speech recognition engine, and interfaces to the X Window System.

Following is a high-level description of how the software functions. Commands are spoken into a microphone, and the audio is captured and digitized. The first step in the processing is the speech analysis system, which provides a spectral represent[ation ...] for each match. These [...]


match is acceptable, DSRS retrieves keystrokes associated with each utterance, and the keystrokes are then sent into the system's keyboard buffer or to the appropriate application. For instances of continuous speech recognition, a sentence is recognized and keystrokes are concatenated to represent the utterance. For example, for the utterance "five two times seven three four equals," the keys "52 * 734 =" would be delivered to the calculator application.
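The translation step can be pictured as a simple table lookup over the active vocabulary. The fragment below uses a hypothetical word-to-keystroke table purely for illustration; it is not the DSRS implementation.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical keystroke table for a calculator vocabulary. */
    static const struct { const char *word; const char *keys; } vocab[] = {
        { "zero", "0" }, { "one", "1" },  { "two", "2" },   { "three", "3" },
        { "four", "4" }, { "five", "5" }, { "six", "6" },   { "seven", "7" },
        { "eight", "8" }, { "nine", "9" }, { "plus", "+" }, { "minus", "-" },
        { "times", "*" }, { "equals", "=" },
    };

    /* Return the keystrokes for one recognized word; unknown words map to "". */
    static const char *keys_for(const char *word)
    {
        size_t i;
        for (i = 0; i < sizeof(vocab) / sizeof(vocab[0]); i++)
            if (strcmp(word, vocab[i].word) == 0)
                return vocab[i].keys;
        return "";
    }

    int main(void)
    {
        /* "five two times seven three four equals" -> "52*734=" */
        const char *utterance[] = { "five", "two", "times",
                                    "seven", "three", "four", "equals" };
        char out[64] = "";
        size_t i;

        for (i = 0; i < sizeof(utterance) / sizeof(utterance[0]); i++)
            strcat(out, keys_for(utterance[i]));
        printf("%s\n", out);
        return 0;
    }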

Although this concept seems simple, its implementation raised significant system integration issues and directly affected the user interface design, which was critical to the product's success. This paper specifically addresses the user interface and integration issues and concludes with a discussion of future directions for speech recognition products.

Project Objective

The objective of the DSRS project was to provide a useful but limited tool to users of Digital's Alpha workstations running the UNIX operating system. DSRS would be designed as a low-cost, speech recognition application and would be provided at no cost to workstation users for a finite period of time.

When the project began in 1994, a number of command and control speech recognition products for PCs already existed. These programs were aimed at end users and performed useful tasks "out of the box," that is, immediately upon start-up. They all came with built-in vocabulary for common applications and gave users the ability to add their own vocabulary.

On UNIX systems, however, speech recognition products existed only in the form of programmable recognizers, such as BBN Hark software. Our objective was to build a speech recognition product for the UNIX workstation that had the characteristics of the PC recognizers, that is, one that would be functional immediately upon start-up and would allow the nonprogrammer end user to customize the product's vocabulary.

We studied several speech recognition products, including Talk-To from Dragon Systems, Inc., VoiceAssist from Creative Labs, Voice Pilot from Microsoft, and Listen from Verbex. We decided to provide users with the following features as the most desirable in a command and control speech recognition product:

• Intuitive, easy-to-use interface
• Speaker-independent models that would eliminate the need for extensive training
• Speaker-[...]


Figure 1
DSRS Architectural Block Diagram
The diagram shows the user interface, the Training Manager, the microphone and audio card, the front-end processor (which converts digitized audio into feature packets), and the speech models. An asterisk denotes a component licensed from Dragon Systems, Inc.

[...] words can create models using the Vocabulary Manager user interface [...] and acts as the agent to control focus in response to users' speech commands.

The Vocabulary Manager user-interface window displays the current hierarchy of the finite state grammar file. The Vocabulary Manager allows addition, deletion, and modification of words or macros. Also in this window, the command-utterance to keystroke translations are displayed, created, or modified.

In the Training Manager user interface, the user may train newly created words or phrases in the user vocabulary files and retrain, or adapt, the product-supplied, independent vocabulary.

The DSRS Implementation

As the design team gained experience with the DSRS prototypes, we refined user procedures and interfaces. This section describes the key functions the team developed to ensure the user-friendliness of the product, including the first-time setup, the Speech Manager, the Training Manager, the Vocabulary Manager, [...]

[...] the following critical controls:

• Microphone on/off switch.
• A VU (volume units) meter that gives real-time feedback on the audio signal being heard. A VU meter is a visual feedback device commonly used on devices such as tape decks. Users are generally very comfortable using them.



Figure 2
DSRS Speech Manager Window

[...] The Active vocabulary words are specific to the application in focus and change dynamically as the current application changes. The vocabularies are designed in this way so that a user can speak commands both within [...] focus. This context also determines the current state of the Active word list.

• The history frame displays the word, phrase, or sentence last heard by the recognizer. The history frame is set up as a button. When pressed, it drops down to reveal the last 20 recognized utterances.
• A menu that provides access to the management of user files, the Vocabulary Ma[nager ...]

[...] a good training user interface:

• Strong, clear indications that utterances are processed. We added a series of boxes that [...] across the screen at each utterance.
• A glimpse of upcoming words. A list of words is displayed on the user interface and moves [...] for example, Word 4 of 21.
• Option to pause, resume, and restart training.
• Large, bold font display of the word to be spoken and a small prompt, "Please continue," displayed when the system is waiting for input.
• Automatic addition of repeated utterances that [...]


Figure 3
Training Manager Window

[...] bold font to differentiate it from the other elements in the window. To train the words, the user repeats an utterance from one to six times. The user must speak at the proper times to make training a smooth and efficient process. DSRS manages the process by prompting the speaker with visual cues. Right below the word is a set of boxes that represent the repetitions. The boxes are checked off as utterances are processed, providing positive visual feedback to the speaker. When one word is complete, the next word to be trained is displayed and the process is repeated. When all the words in the list are trained, the user saves the files, and DSRS returns to the Speech Manager and its active mode with the microphone turned off.

Vocabulary Manager

The Vocabulary Manager, an example of which is shown in Figure 4, enables users to modify speech [...]


Figure 4
Vocabulary Manager Window
The window has File and Vocabulary menus. It lists the available vocabularies (Always Active Vocabulary, C Programming Mode, Calculator, Calendar, Calendar & Diary, DtTerminal, Emacs, Emacs extensions, Personal Emacs extensions, File Manager, Mail, Netscape, Sentences for Switching, Speech Manager) alongside the words and macros in the selected vocabulary (abort, backward, backward char, backwards a page, beginning of buffer, beginning of line, beginning of word, center on line, control g, copy region, delete backward char, delete character, delete line, delete previous word, delete region, delete word, edit declare project summary, Emacs extensions, end of buffer, and so on). A status line reads "Number of active words in selected Vocabulary: 64."

The simple logic of this special function enhances user productivity. Often workstation and PC screens are littered with windows or application icons and icon boxes through which the user must search. Speech control eliminates the steps between the user thinking "I want the calculator" and the application being presented in focus, ready to be used. The DSRS team created a function called FocusOrLaunch, which implements the behavior described above. The function is encoded into the FSG continuous-switching-mode sentences in the Always Active vocabulary associated with the spoken commands "switch to <application name>," "go to <application name>," and just plain "<application name>."
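The paper does not show how FocusOrLaunch is implemented; the following is only a plausible sketch of the behavior it describes, using plain Xlib calls and fork/exec, and the window-matching strategy is an assumption.

    #include <string.h>
    #include <unistd.h>
    #include <X11/Xlib.h>

    /* Return the first window whose name contains `name`, searching the tree
     * below `top`, or None if no match is found. */
    static Window find_window(Display *dpy, Window top, const char *name)
    {
        Window ret = None, root, parent, *kids = NULL;
        unsigned int nkids, i;
        char *wname;

        if (!XQueryTree(dpy, top, &root, &parent, &kids, &nkids))
            return None;
        for (i = 0; i < nkids && ret == None; i++) {
            if (XFetchName(dpy, kids[i], &wname) && wname) {
                if (strstr(wname, name))
                    ret = kids[i];
                XFree(wname);
            }
            if (ret == None)
                ret = find_window(dpy, kids[i], name);   /* search subtree */
        }
        if (kids)
            XFree(kids);
        return ret;
    }

    /* Switch focus to a running instance of the application, or start one. */
    void focus_or_launch(Display *dpy, const char *title, char *const argv[])
    {
        Window w = find_window(dpy, DefaultRootWindow(dpy), title);

        if (w != None) {
            XMapRaised(dpy, w);                       /* de-iconify and raise  */
            XSetInputFocus(dpy, w, RevertToParent, CurrentTime);
            XFlush(dpy);
        } else if (fork() == 0) {
            execvp(argv[0], argv);                    /* launch a new instance */
            _exit(1);
        }
    }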

Applications like calculator and calendar are not likely to be needed in multiple instances. However, applications such as terminal emulator windows are. DSRS defines the specific phrase "bring up <application name>" to explicitly launch a new instance of the application; that is, the phrase "bring up <application name>" is tied to a function named Launch. The phrases "next <application name>" and "previous <application name>" were chosen for navigating between instances of the same application. DSRS remembers the previous state of the application. For


instance, if the calendar application is minimized when the user says "switch to calendar," the calendar window is restored. [...]


[...] user to launch an audio-based application that will record, such as a teleconferencing session. Version 1.1 also includes a function that allows the user to play a "wave," or digitized audio clip. Audio cues may thus be played as part of speech macros. The "say" command invokes DECtalk Text-to-Speech functionality so that audio events can be spoken.

Since speech recognition is a st[...]


In the implementation of DSRS, we encountered one unexpected problem. When a nested menu mnemonic was invoked, the second character was lost. The sequence of events was as follows:

• A spoken word was recognized, and keystrokes were sent to the keyboard buffer.
• The first character, ALT + <key>, acted normally and caused a pop-up menu to display.
• The menu remained on display, and the last key was lost.

We determined that the second keystroke was being delivered to the application before the pop-up menu was displayed. Therefore, at the time the key was pressed, it did not yet have meaning to the application. It is apparent that such applications are written for a human reaction-based paradigm. DSRS, on the other hand, is typing on behalf of the user at computer speeds and is not waiting for the pop-up menu to display before entering the next key.

To overcome this problem, we developed a synchronizing function. Normally the Vocabulary Manager notation to send an ALT + f followed by an x would be ALT + fx. This new synchronizing function was designated as sALT + fx. The synchronizing function sends the ALT + f and then monitors events for a map-notify message indicating that the pop-up menu has been written to the screen. The character following ALT + f is then sent, in this case, the x. The synchronizing function also has a watchdog timer to prevent a hang in the ev[ent ...]
to prcvellt a hang in the ev


[...] in today's applications. A user agent or secretary program that could process common requests delivered entirely by speech is not out of reach even with the technology available today, for example:

User: What time is it?
Computer: It is now 1:30 p.m.
User: Do I have any meetings today?
Computer: Staff meeting is ten o'clock to twelve o'clock in the corner conference room.
Computer: Mike Jones is e[...]


References

1. L. Rabiner and B. Juang, Fundamentals of Speech Recognition (Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1993): 45-46.

2. C. Schmandt, Voice Communication with Computers: Conversational Systems (New York, N.Y.: Van Nostrand Reinhold, 1994): 144-145.

3. K.F. Lee, Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System (Pittsburgh, Pa.: Carnegie-Mellon University Computer Science Department, April 1988).

4. W. Hallahan, "DECtalk Software: Text-to-Speech Technology and Implementation," Digital Technical Journal, vol. 7, no. 4 (1995): 5-19.

5. G. Lowney, The Microsoft Windows Guidelines for Accessible Software Design (Redmond, Wash.: Microsoft Development Library, 1995): 3-4.

6. N. Yankelovich, G. Levow, and M. Marx, "Designing SpeechActs: Issues in Speech User Interfaces," Proceedings of ACM Conference on Computer-Human Interaction (CHI) '95: Human Factors in Computing Systems: Mosaic of Creativity, Denver, Colo. (May 1995): 369-376.

Biography

Bernard A. Rozmovits
During his tenure at Digital, Bernie Rozmovits has worked on both sides of computer engineering-hardware and software. Currently he manages Speech Services in Digital's Light and Sound Software Group, which developed the user interfaces for Digital's Speech Recognition Software and also developed the DECtalk software product. Prior to joining this software effort, he focused on hardware engineering in the Computer Special Systems Group and was the architect for voice-processing platforms in the Image, Voice and Video Group. Bernie received a Diplome D'Etude Collegiale (DEC) from Dawson College, Montreal, Quebec, Canada, in 1974. He holds a patent entitled "Data Format for Packets of Information," U.S. Patent No. 5,317,719.



Further Readings

The Digital Technical Journal is a refereed, quarterly publication of papers that explore the foundations of Digital's products and technologies. Journal content is selected by the Journal Advisory Board, and papers are written by Digital's engineers and engineering partners. Engineers who would like to contribute a paper to the Journal should contact the Managing Editor, Jane Blake, at Jane.Blake@ljo.dec.com.

Topics covered in previous issues of the Digital Technical Journal are as follows:

Digital UNIX Clusters/Object Modification Tools/eXcursion for Windows Operating Systems/Network Directory Services
Vol. 8, No. 1, 1996, EY-U025E-TJ

Audio and Video Technologies/UNIX Available Servers/Real-time Debugging Tools
Vol. 7, No. 4, 1995, EY-U002E-TJ

High Performance Fortran in Parallel Environments/Sequoia 2000 Research
Vol. 7, No. 3, 1995, EY-T838E-TJ
(Available only on the Internet)

Graphical Software Development/Systems Engineering
Vol. 7, No. 2, 1995, EY-U001E-TJ

Database Integration/Alpha Servers & Workstations/Alpha 21164 CPU
Vol. 7, No. 1, 1995, EY-T135E-TJ
(Available only on the Internet)

RAID Array Controllers/Workflow Models/PC LAN and System Management Tools
Vol. 6, No. 4, Fall 1994, EY-T118E-TJ

AlphaServer Multiprocessing Systems/DEC OSF/1 Symmetric Multiprocessing/Scientific Computing Optimization for Alpha
Vol. 6, No. 3, Summer 1994, EY-S799E-TJ

Alpha AXP Partners - Cray, Raytheon, Kubota/DECchip 21071/21072 PCI Chip Sets/DLT2000 Tape Drive
Vol. 6, No. 2, Spring 1994, EY-F947E-TJ

High-performance Networking/OpenVMS AXP System Software/Alpha AXP PC Hardware
Vol. 6, No. 1, Winter 1994, EY-Q011E-TJ

Software Process and Quality
Vol. 5, No. 4, Fall 1993, EY-P920E-DP

Product Internationalization
Vol. 5, No. 3, Summer 1993, EY-P986E-DP

Multimedia/Application Control
Vol. 5, No. 2, Spring 1993, EY-P963E-DP

DECnet Open Networking
Vol. 5, No. 1, Winter 1993, EY-M770E-DP

Alpha AXP Architecture and Systems
Vol. 4, No. 4, Special Issue 1992, EY-J886E-DP

NVAX-microprocessor VAX Systems
Vol. 4, No. 3, Summer 1992, EY-J884E-DP

Semiconductor Technologies
Vol. 4, No. 2, Spring 1992, EY-L521E-DP

PATHWORKS: PC Integration Software
Vol. 4, No. 1, Winter 1992, EY-J825E-DP

Image Processing, Video Terminals, and Printer Technologies
Vol. 3, No. 4, Fall 1991, EY-H889E-DP

Availability in VAXcluster Systems/Network Performance and Adapters
Vol. 3, No. 3, Summer 1991, EY-H890E-DP

Fiber Distributed Data Interface
Vol. 3, No. 2, Spring 1991, EY-H876E-DP

Transaction Processing, Databases, and Fault-tolerant Systems
Vol. 3, No. 1, Winter 1991, EY-F588E-DP

VAX 9000 Series
Vol. 2, No. 4, Fall 1990, EY-E762E-DP

DECwindows Program
Vol. 2, No. 3, Summer 1990, EY-E756E-DP

VAX 6000 Model 400 System
Vol. 2, No. 2, Spring 1990, EY-C197E-DP

Compound Document Architecture
Vol. 2, No. 1, Winter 1990, EY-C196E-DP



ISSN 0898-901X
Printed in U.S.A. EC-N6992-18/96 9 14 20.0 Copyright © Digital Equipment Corporation
