DTJ Volume 8 Number 2 1996 - Digital Technical Journals
<strong>Digital</strong><br />
<strong>Technical</strong><br />
Journal<br />
SPIRALOG LOG-STRUCTURED FILE SYSTEM<br />
OPENVMS FOR 64-BIT ADDRESSABLE<br />
VIRTUAL MEMORY<br />
HIGH-PERFORMANCE MESSAGE PASSING<br />
FOR CLUSTERS<br />
SPEECH RECOGNITION SOFTWARE<br />
<strong>Volume</strong> 8 <strong>Number</strong> 2<br />
<strong>1996</strong>
Editorial<br />
Jane C. Blake, Managing Editor<br />
Kathleen M. Stetson, Editor<br />
Helen L. Patterson, Editor<br />
Circulation<br />
Catherine M. Phillips, Administrator<br />
Dorothea B. Cassady, Secretary<br />
Production<br />
Terri Autieri, Production Editor<br />
Anne S. Katzeff, Typographer<br />
Peter R. Woodbury, Illustrator<br />
Advisory Board<br />
Samuel H. Fuller, Chairman<br />
Richard W. Beane<br />
Donald Z. Harbert<br />
William R. Hawe<br />
Richard J. Hollingsworth<br />
William A. Laing<br />
Richard F. Lary<br />
Alan G. Nemeth<br />
Pauline A. Nist<br />
Robert M. Supnik<br />
Cover Design<br />
<strong>Digital</strong>'s new Spiralog file system, a featured topic in this issue, supports full 64-bit system capability and fast backup and is integrated with the OpenVMS 64-bit version 7.0 operating system. The cover graphic captures the inspired character of the Spiralog design effort and illustrates a concept taken from University of California research in which the whole disk is treated as a single, sequential log and all file system modifications are appended to the tail of the log.<br />
The cover was designed by Lucinda O'Neill of <strong>Digital</strong>'s Design Group using images from PhotoDisc, Inc., copyright <strong>1996</strong>.<br />
The <strong>Digital</strong> Technical Journal is a refereed journal published quarterly by <strong>Digital</strong> Equipment Corporation, 30 Porter Road LJO2/D10, Littleton, MA 01460.<br />
Subscriptions can be ordered by sending a check in U.S. funds (made payable to <strong>Digital</strong> Equipment Corporation) to the published-by address. General subscription rates are $40.00 (non-U.S. $60) for four issues and $75.00 (non-U.S. $115) for eight issues. University and college professors and Ph.D. students in the electrical engineering and computer science fields receive complimentary subscriptions upon request. <strong>Digital</strong>'s customers may qualify for gift subscriptions and are encouraged to contact their account representatives. Single copies and back issues are available for $16.00 (non-U.S. $18) each and can be ordered by sending the requested issue's volume and number and a check to the published-by address. See the Further Readings section in the back of this issue for a complete listing. Recent issues are also available on the Internet at http://www.digital.com/info/dtj.<br />
<strong>Digital</strong> employees may order subscriptions through Readers Choice at URL http://webrc.das.dec.com or by entering VTX PROFILE at the system prompt.<br />
Inquiries, address changes, and complimentary subscription orders can be sent to the <strong>Digital</strong> Technical Journal at the published-by address or the electronic mail address, dtj@digital.com. Inquiries can also be made by calling the Journal office at 508-486-2538.<br />
Comments on the content of any paper are welcomed and may be sent to the managing editor at the published-by or electronic mail address.<br />
Copyright © <strong>1996</strong> <strong>Digital</strong> Equipment Corporation. Copying without fee is permitted provided that such copies are made for use in educational institutions by faculty members and are not distributed for commercial advantage. Abstracting with credit of <strong>Digital</strong> Equipment Corporation's authorship is permitted.<br />
The information in the Journal is subject to change without notice and should not be construed as a commitment by <strong>Digital</strong> Equipment Corporation or by the companies herein represented. <strong>Digital</strong> Equipment Corporation assumes no responsibility for any errors that may appear in the Journal.<br />
ISSN 0898-901X<br />
Documentation <strong>Number</strong> EC-N6992-18<br />
Book production was done by Quantic Communications, Inc.<br />
The following are trademarks of <strong>Digital</strong> Equipment Corporation: AlphaServer, DEC, DECtalk, <strong>Digital</strong>, the DIGITAL logo, HSC, OpenVMS, PATHWORKS, POLYCENTER, RZ, TruCluster, VAX, and VAXcluster.<br />
BBN Hark is a trademark of Bolt Beranek and Newman Inc.<br />
Encore is a registered trademark and MEMORY CHANNEL is a trademark of Encore Computer Corporation.<br />
FAServer is a trademark of Network Appliance Corporation.<br />
Listen for Windows is a trademark of Verbex Voice Systems, Inc.<br />
Microsoft and Win32 are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation.<br />
MIPSpro is a trademark of MIPS Technologies, Inc., a wholly owned subsidiary of Silicon Graphics, Inc.<br />
Netscape Navigator is a trademark of Netscape Communications Corporation.<br />
PAL is a registered trademark of Advanced Micro Devices, Inc.<br />
UNIX is a registered trademark in the United States and in other countries, licensed exclusively through X/Open Company Ltd.<br />
VoiceAssist is a trademark of Creative Labs, Inc.<br />
X Window System is a trademark of the Massachusetts Institute of Technology.
Contents<br />
Foreword<br />
SPIRALOG LOG-STRUCTURED FILE SYSTEM<br />
Overview of the Spiralog File System<br />
Design of the Server for the Spiralog File System<br />
Designing a Fast, On-line Backup System for<br />
a Log-structured File System<br />
Integrating the Spiralog File System into the<br />
OpenVMS Operating System<br />
OPENVMS FOR 64-BIT ADDRESSABLE VIRTUAL MEMORY<br />
Extending OpenVMS for 64-bit Addressable<br />
Virtual Memory<br />
The OpenVMS Mixed Pointer Size Environment<br />
Rich Marcello<br />
James E. Johnson and William A. Laing<br />
Christopher Whitaker, J. Stu
Editor's<br />
Introduction<br />
This past spring when we surveyed Journal subscribers, readers took the time to comment on the particular value of the issues featuring <strong>Digital</strong>'s 64-bit Alpha technology. The engineering described in those two issues continues, with ever higher levels of performance in Alpha microprocessors, servers, clusters, and systems software. This issue presents recent developments: a log-structured file system, called Spiralog; the OpenVMS operating system extended to take full advantage of 64-bit addressing; high-performance message passing for Alpha clusters; and speech recognition software for Alpha workstations.<br />
Spiralog is a wholly new clusterwide file system integrated with the new 64-bit OpenVMS version 7.0 operating system and is designed for high data availability and high performance. The first of four papers about Spiralog is written by Jim Johnson and Bill Laing, who introduce log-structured file system (LFS) concepts, the university research behind the design, and design innovations.<br />
Of interest to scientific users, the parallel-programming tool described next does not run on the OpenVMS Alpha system but instead on UNIX clusters connected with MEMORY CHANNEL technology. Jim Lawton, John Brosnan, Morgan Doyle, Seosamh Ó Riordáin, and Tim Reddin review the challenges in designing the TruCluster MEMORY CHANNEL Software product, which is a message-passing system intended for builders of parallel software libraries and implementers of parallel compilers. The product reduces communications latency to less than 10 microseconds in shared memory systems.<br />
Finally, Bernie Rozmovits presents the design of user interfaces for the <strong>Digital</strong> Speech Recognition Software (DSRS) product. Although DSRS is targeted for <strong>Digital</strong>'s Alpha workstations running UNIX, the implementers' work to ensure the product's ease of use can be generally applied.
Foreword<br />
Rich Marcello<br />
Vice President, OpenVMS Systems<br />
Software Group<br />
The papers you will read in this issue of the Journal describe how we in the OpenVMS engineering community set out to bring the OpenVMS operating system and our loyal customer base into the twenty-first century. The papers present both the development issues and the technical challenges faced by the engineers who delivered the OpenVMS operating system version 7.0 and the Spiralog file system, a new log-structured file system for OpenVMS.<br />
We are extremely proud of the results of these efforts. In December 1995 at U.S. Fall DECUS (<strong>Digital</strong> Equipment Computer Users Society), <strong>Digital</strong> announced OpenVMS version 7.0 and the Spiralog file system as part of a first wave of product deliveries for the OpenVMS Windows NT Affinity Program. OpenVMS version 7.0 provides the "unlimited high end" on which our customers can build their distributed computing environments and move toward the next millennium.<br />
The release of OpenVMS version 7.0 in January of this year represents the most significant engineering enhancement to the OpenVMS operating system since <strong>Digital</strong> released the VAXcluster system in 1983. OpenVMS version 7.0 extends the 32-bit architecture of OpenVMS to a 64-bit architecture.
In fact, Digital has two 64-bit operating systems with this power: the OpenVMS and the Digital UNIX operating systems.<br />
As noted above, Digital introduced the OpenVMS operating system with support for full 64-bit virtual addressing at the same time it introduced the Spiralog file system, in December 1995. The Spiralog design is based on the Sprite log-structured file system from the University of California, Berkeley. With its use of this log-structured approach, Spiralog offers major new performance features, including fast, application-consistent, on-line backup. Further, it is fully compatible with customers' existing Files-11 file systems, and applications that run on Files-11 will run on Spiralog with no modification. To deliver all of the features we felt were essential to meet the needs of our loyal customer base, the Spiralog team examined and resolved a number of technical issues. The papers in this issue describe some of the challenges they faced, including the decision to design a Files-11 file system emulation.<br />
The delivery of the OpenVMS version 7.0 operating system and the Spiralog file system are part of Digital's continued commitment to the OpenVMS customer base. These products represent the work of dedicated, talented engineering teams that have deployed state-of-the-art technology in products that will help our customers remain competitive for years to come.<br />
In the OpenVMS group as elsewhere in Digital, we are committed to excellence in the development and delivery of business computing solutions. We will continue to maintain and enhance a product portfolio that meets our customers' need for true 24-hour by 365-day access to their data, full integration with Microsoft Windows NT environments, and the full complement of network solutions and application software for today and well into the next millennium.<br />
Overview of the Spiralog<br />
File System<br />
The OpenVMS Alpha environment requires a file system that supports its full 64-bit capabilities. The Spiralog file system was developed to increase the capabilities of <strong>Digital</strong>'s Files-11 file system for OpenVMS. It incorporates ideas from a log-structured file system and an ordered write-back model. The Spiralog file system provides improvements in data availability, scaling of the amount of storage easily managed, support for very large volume sizes, support for applications that are either write-operation or file-system-operation intensive, and support for heterogeneous file system client types. The Spiralog technology, which matches or exceeds the reliability and device independence of the Files-11 system, was then integrated into the OpenVMS operating system.<br />
James E. Johnson<br />
William A. Laing<br />
<strong>Digital</strong>'s Spiralog product is a log-structured, clusterwide file system with integrated, on-line backup and restore capability and support for multiple file system personalities. It incorporates a number of recent ideas from the research community, including the log-structured file system (LFS) from the Sprite file system and the ordered write back from the Echo file system.1,2<br />
The Spiralog file system is fully integrated into the OpenVMS operating system, providing compatibility with the current OpenVMS file system, Files-11. It supports a coherent, clusterwide write-behind cache and provides high-performance, on-line backup and per-file and per-volume restore functions.<br />
In this paper, we first discuss the evolution of file systems and the requirements for many of the basic designs in the Spiralog file system. Next we describe the overall architecture of the Spiralog file system, identifying its major components and outlining their designs. Then we discuss the project's results: what worked well and what did not work so well. Finally, we present some conclusions and ideas for future work.<br />
Some of the major components, i.e., the backup and restore facility, the LFS server, and OpenVMS integration, are described in greater detail in companion papers in this issue.3-5<br />
The Evolution of File Systems<br />
File systems have existed throughout much of the history of computing. The need for libraries or services that help to manage the collection of data on long-term storage devices was recognized many years ago. The early support libraries have evolved into the file systems of today. During their evolution, they have responded to the industry's improved hardware capabilities and to users' increased expectations. Hardware has continued to decrease in price and improve in its price/performance ratio. Consequently, ever larger amounts of data are stored and manipulated by users in ever more sophisticated ways. As more and more data are stored on-line, the need to access that data 24 hours a day, 365 days a year has also escalated.<br />
Significant improvements to file systems have been made in the following areas:<br />
• Directory structures to ease locating data<br />
• Device independence of data access through the file system<br />
• Accessibility of the data to users on other systems<br />
• Availability of the data, despite either planned or unplanned service outages<br />
• Reliability of the stored data and the performance of the data access<br />
Requirements of the OpenVMS File System<br />
Since 1977, the OpenVMS operating system has offered a stable, robust file system known as Files-11. This file system is considered to be very successful in the areas of reliability and device independence. Recent customer feedback, however, indicated that the areas of data availability, scaling of the amount of storage easily managed, support for very large volume sizes, and support for heterogeneous file system client types were in need of improvement.<br />
The Spiralog project was initiated in response to customers' needs. We designed the Spiralog file system to match or somewhat exceed the Files-11 system in its reliability and device independence. The focus of the Spiralog project was on those areas that were due for improvement, notably:<br />
• Data availability, especially during planned operations, such as backup. If the storage device needs to be taken offline to perform a backup, even at a very high backup rate of 20 megabytes per second (MB/s), almost 14 hours are needed to back up 1 terabyte. This length of service outage is clearly unacceptable. More typical backup rates of 1 to 2 MB/s can take several days, which, of course, is not acceptable.<br />
• Greatly increased scaling in total amount of on-line storage, without greatly increasing the cost to manage that storage. For example, 1 terabyte of disk storage currently costs approximately $250,000, which is well within the budget of many large computing centers. However, the cost in staff and time to manage such amounts of storage can be many times that of the storage.6 The cost of storage continues to fall, while the cost of managing it continues to rise.<br />
• Effective scaling as more processing and storage resources become available. For example, OpenVMS Cluster systems allow processing power and storage capacity to be added incrementally. It is crucial that the software supporting the file system scale as the processing power, bandwidth to storage, and storage capacity increase.<br />
• Improved performance for applications that are either write-operation or file-system-operation intensive. As file system caches in main memory have increased in capacity, data reads and file system read operations have become satisfied more and more from the cache. At the same time, many applications write large amounts of data or create and manipulate large numbers of files. The use of redundant arrays of inexpensive disks (RAID) storage has increased the available bandwidth for data writes and file system writes. Most file system operations, on the other hand, are small writes and are spread across the disk at random, often negating the benefits of RAID storage.<br />
• Improved ability to transparently access the stored data across several dissimilar client types. Computing environments have become increasingly heterogeneous. Different client systems, such as the Windows or the UNIX operating system, store their files on and share their files with server systems such as the OpenVMS server. It has become necessary to support the syntax and semantics of several different file system personalities on a common file server.<br />
These needs were central to many design decisions we made for the Spiralog file system.<br />
The members of the Spiralog project evaluated much of the ongoing work in file systems, databases, and storage architectures. RAID storage makes high bandwidth available to disk storage, but it requires large writes to be effective. Databases have exploited logs and the grouping of writes together to minimize the number of disk I/Os and disk seeks required. Databases and transaction systems have also exploited the technique of copying the tail of the log to effect backups or data replication. The Sprite project at Berkeley had brought together a log-structured file system and RAID storage to good effect.1<br />
By drawing from the above ideas, particularly the insight of how a log structure could support on-line, high-performance backup, we began our development effort. We designed and built a distributed file system that made extensive use of the processor and memory near the application and used log-structured storage in the server.<br />
Spiralog File System Design<br />
The main execution stack of the Spiralog file system consists of three distinct layers. Figure 1 shows the overall structure. At the top, nearest the user, is the file system client layer. It consists of a number of file system personalities and the underlying personality-independent services, which we call the VPI.<br />
Figure 1<br />
Spiralog Structure Overview<br />
Two file system personalities dominate the Spiralog design. The F64 personality is an emulation of the Files-11 file system. The file system library (FSLIB) personality is an implementation of Microsoft's New Technology Advanced Server (NTAS) file services for use by the PATHWORKS for OpenVMS file server.<br />
The next layer, present on all systems, is the clerk<br />
layer. It supports a distributed cache and ordered write<br />
back to the LFS server, giving single-system semantics<br />
in a cluster configuration.<br />
The LFS server, the third layer, is present on all designated<br />
server systems. This component is responsible<br />
for maintaining the on-disk log structure; it includes<br />
the cleaner, and it is accessed by multiple clerks. Disks<br />
can be connected to more than one LFS server, but they are served only by one LFS server at a time. Transparent failover, from the point of view of the file system client layer, is achieved by cooperation between the clerks and the surviving LFS servers.<br />
The backup engine is present on a system with an<br />
active LFS server. It uses the LFS server to access the<br />
on-disk data, and it interfaces to the clerk to ensure<br />
that the backup or restore operations are consistent<br />
with the clerk's cache.<br />
Figure 2 shows a typical Spiralog cluster configuration. In this cluster, the clerks on nodes A and B are accessing the Spiralog volumes. Normally, they use the LFS server on node C to access their data. If node C should fail, the LFS server on node D would immediately provide access to the volumes. The clerks on nodes A and B would use the LFS server on node D, retrying all their outstanding operations. Neither user application would detect any failure. Once node C had recovered, it would become the standby LFS server.<br />
Figure 2<br />
Spiralog Cluster Configuration<br />
File System Client Design<br />
The file system client is responsible for the traditional file system functions. This layer provides files, directories, access arbitration, and file naming rules. It also provides the services that the user calls to access the file system.<br />
VPI Services Layer  The VPI layer provides an underlying primitive file system interface, based on the UNIX VFS switch. The VPI layer has two overall goals:<br />
1. To support multiple file system personalities<br />
2. To effectively scale to very large volumes of data and very large numbers of files<br />
To meet the first goal, the VPI layer provides<br />
• File names of 256 Unicode characters, with no reserved characters<br />
• No restriction on directory depth<br />
• Up to 255 sparse data streams per file, each with 64-bit addressing<br />
• Attributes with 255 Unicode character names, containing values of up to 1,024 bytes<br />
• Files and directories that are freely shared among file system personality modules<br />
To meet the second goal, the VPI layer provides<br />
• File identifiers stored as 64-bit integers<br />
• Directories through a B-tree, rather than a simple linear structure, for log(n) file name lookup time<br />
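The log(n) lookup claim can be illustrated with a small sketch. This is not Spiralog's actual on-disk B-tree; it is a hypothetical in-memory stand-in that keeps directory entries sorted so that a name probe costs O(log n) comparisons, which is the same asymptotic behavior the VPI layer gets from its B-tree directories (a B-tree additionally spreads the sorted keys over fixed-size on-disk nodes).

```python
import bisect

class Directory:
    """Toy directory: sorted file names with a parallel list of
    64-bit file identifiers. Lookup is a binary search, i.e.
    O(log n) probes, versus O(n) for a linear scan of entries."""

    def __init__(self):
        self._names = []   # sorted file names
        self._ids = []     # parallel list of file identifiers

    def insert(self, name, file_id):
        i = bisect.bisect_left(self._names, name)
        if i < len(self._names) and self._names[i] == name:
            raise FileExistsError(name)
        self._names.insert(i, name)
        self._ids.insert(i, file_id)

    def lookup(self, name):
        i = bisect.bisect_left(self._names, name)  # binary search
        if i < len(self._names) and self._names[i] == name:
            return self._ids[i]
        raise FileNotFoundError(name)
```

With a million entries, a lookup touches about 20 keys instead of scanning all of them; the on-disk B-tree achieves the same with a handful of node reads.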
The VPI layer is only a base for file system personalities. Therefore it requires that such personalities are trusted components of the operating system. Moreover, it requires them to implement file access security (although there is a convention for storing access control list information) and to perform all necessary cleanup when a process or image terminates.<br />
F64 File System Personality  As previously stated, the Spiralog product includes two file system personalities, F64 and FSLIB. The F64 personality provides a service that emulates the Files-11 file system. Its functions, services, available file attributes, and execution behaviors are similar to those in the Files-11 file system. Minor differences are isolated into areas that receive little use from most applications.<br />
For instance, the Spiralog file system supports the various Files-11 queued I/O ($QIO) parameters for returning file attribute information, because they are used implicitly or explicitly by most user applications.<br />
On the other hand, the hlcs-1 1 method of read i ng<br />
the ri le hear the ti le name, by<br />
- Re ading some dirccrorv entries (VPI)<br />
- Searching the dirccrorv buftl.:r r( )r the hie name<br />
- Exiting the l oop, ifrhc 1mtch is tcllll1d<br />
• Access the target ti le (VPl)<br />
• Read rhc ri le's attributes (VI'!)<br />
• Audit the hie open Htempt<br />
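The steps above can be sketched as a small loop over hypothetical VPI primitives. The names `read_dir_entries`, `access_file`, and `read_attributes` are stand-ins invented for illustration; the paper does not specify the actual VPI call signatures.

```python
def audit_open_attempt(name, success):
    """Stand-in for the personality's security audit step."""
    pass

def open_by_name(vpi, dir_id, name):
    """Sketch of an F64-style open-by-name composed from VPI calls.

    `vpi.read_dir_entries` is assumed to return a batch of
    (entry_name, file_id) pairs plus an opaque resume marker,
    or None for the marker when the directory is exhausted."""
    marker = None
    while True:
        entries, marker = vpi.read_dir_entries(dir_id, marker)
        for entry_name, file_id in entries:
            if entry_name == name:            # match found: exit the scan
                handle = vpi.access_file(file_id)
                attrs = vpi.read_attributes(handle)
                audit_open_attempt(name, success=True)
                return handle, attrs
        if marker is None:                    # no more directory entries
            audit_open_attempt(name, success=False)
            raise FileNotFoundError(name)
```

The point of the decomposition is that the personality, not the VPI layer, owns the name-matching policy and the audit, while the VPI layer supplies only the primitive directory and file operations.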
FSLIB File System Personality  The FSLIB file system personality is a specialized file system to support the PATHWORKS for OpenVMS file server. Its two major goals are to support the file names, attributes, and behaviors found in Microsoft's NTAS file access protocols, and to provide low run-time cost for processing NTAS file system requests.<br />
The PATHWORKS server implements a file service for personal computer (PC) clients layered on top of the Files-11 file system services. When NTAS service behaviors or attributes do not match those of Files-11, the PATHWORKS server has to emulate them. This can lead to checking security access permissions twice, mapping file names, and emulating file attributes. Many of these problems can be avoided if the VPI interface is used directly. For instance, because the FSLIB personality does not layer on top of a Files-11 personality, security access checks do not need to be performed twice. Furthermore, in a straightforward design, there is no need to map across different file naming conventions.<br />
Clerk Design<br />
The clerks are responsible for managing the caches, determining the order of writes out of the cache to the LFS server, and maintaining cache coherency within a cluster. The caches are write behind in a manner that preserves the order of dependent operations.<br />
The clerk-server protocol controls the transfer of data to and from stable storage. Data can be sent as a multiblock atomic write, and operations that change multiple data items, such as a file rename, can be made atomically. If a server fails during a request, the clerk treats the request as if it were lost and retries the request.<br />
The clerk-server protocol is idempotent. Idempotent operations can be applied repeatedly with no effects other than the desired one. Thus, after any number of server failures or server failovers, it is always safe to reissue an operation. Clerk-to-server write operations always leave the file system state consistent.<br />
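Why idempotence makes blind retry safe can be shown with a toy model. The server and its failure behavior below are invented for illustration; this is not the Spiralog clerk-server protocol, only the general retry-on-idempotent-write pattern it relies on.

```python
import random

class ToyServer:
    """Toy server whose write is idempotent: applying the same
    (key, value) write twice leaves the same state as applying it once.
    A failure may strike before or after the write takes effect, so a
    lost reply is indistinguishable from a lost request."""

    def __init__(self):
        self.state = {}

    def write(self, key, value):
        if random.random() < 0.5:          # simulated server failure
            if random.random() < 0.5:
                self.state[key] = value    # applied, but the reply is lost
            raise ConnectionError("server failed during request")
        self.state[key] = value
        return "ok"

def clerk_write(server, key, value, max_tries=100):
    """Clerk-side retry: treat any failure as a lost request and reissue.
    Idempotence guarantees the duplicate application is harmless."""
    for _ in range(max_tries):
        try:
            return server.write(key, value)
        except ConnectionError:
            continue                       # reissue the same operation
    raise RuntimeError("server unavailable")
```

Even when the first attempt actually took effect before the failure, the retry simply rewrites the same value, so the final state is the same as a single successful write.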
The clerk-clerk protocol protects the user data and file system metadata cached by the clerks. Cache coherency information, rather than the cached data itself, is exchanged between the clerks. The write-behind cache has two purposes:<br />
• To avoid writing data that is quickly deleted, such as temporary files<br />
• To combine multiple small writes into larger transfers<br />
The clerks order dependent writes from the cache to the server; consequently, other clerks never see "impossible" states, and related writes never overtake each other. For instance, the deletion of a file cannot happen before a rename that was previously issued to the same file. Related data writes are caused by a partial overwrite, or an explicit linking of operations passed into the clerk by the VPI layer, or an implicit linking due to the clerk-clerk coherency protocol.<br />
The ordering between writes is kept as a directed graph. As the clerks traverse these graphs, they issue the writes in order or collapse the graph when writes can be safely combined or eliminated.<br />
LFS Server Design<br />
The Spiralog file system uses a log-structured, on-disk format for storing data within a volume, yet presents a traditional, update-in-place file system to its users.<br />
Figure 3<br />
Spiralog Address Mapping<br />
Recently, log-structured file systems, such as Sprite, have been an area of active research.1<br />
Within the LFS server, support is provided for the log-structured, on-disk format and for mapping that format to an update-in-place model. Specifically, this component is responsible for<br />
• Mapping the incoming read and write operations from their simple address space to positions in an open-ended log<br />
• Mapping the open-ended log onto a finite amount of disk space<br />
• Reclaiming disk space by cleaning (garbage collecting) the obsolete (overwritten) sections of the log<br />
Figure 3 shows the various mapping layers in the Spiralog file system, including those handled by the LFS server.<br />
Incoming read and write operations are based on a single, large address space. Initially, the LFS server transforms the address ranges in the incoming operations into equivalent address ranges in an open-ended log. This log supports a very large, write-once address space.<br />
Digital Technical Journal Vol. 8 No. 2 1996
A read operation looks up its location in the open-ended log and proceeds. On the other hand, a write operation makes obsolete its current address range and appends its new value to the tail of the log.
In turn, locations in the open-ended log are then mapped into locations on the (finite-sized) disk. This additional mapping allows disk blocks to be reused once their original contents have become obsolete. Physically, the log is divided into log segments, each of which is 256 kilobytes (KB) in length. The log segment is used as the transfer unit for the backup engine. It is also used by the cleaner for reclaiming obsolete log space.
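The append/obsolete behavior and the segment accounting described above can be modeled in a few lines of Python. The 256-KB segment size comes from the text; the class and method names, and the cleanable-fraction threshold, are assumptions of this sketch.

```python
SEGMENT_SIZE = 256 * 1024  # 256-KB log segments, as in the text

class OpenEndedLog:
    """Toy model of an append-only log divided into fixed-size segments."""

    def __init__(self):
        self.tail = 0        # next free log address
        self.live = {}       # log address -> length of live extent
        self.obsolete = {}   # segment number -> obsolete byte count

    def append(self, length):
        """Append a new extent at the tail; returns its log address."""
        addr = self.tail
        self.tail += length
        self.live[addr] = length
        return addr

    def supersede(self, addr):
        """An overwrite makes the old extent at 'addr' obsolete."""
        length = self.live.pop(addr)
        seg = addr // SEGMENT_SIZE
        self.obsolete[seg] = self.obsolete.get(seg, 0) + length

    def cleanable_segments(self, threshold=0.5):
        """Segments whose obsolete fraction makes them worth cleaning."""
        full_segs = range(self.tail // SEGMENT_SIZE)
        return [s for s in full_segs
                if self.obsolete.get(s, 0) / SEGMENT_SIZE >= threshold]
```

Overwriting an extent never touches the old blocks in place; it only marks them obsolete, and the cleaner later reclaims segments that are mostly obsolete.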
We met our customers' need for high-performance, on-line backup save operations. We also met their needs for per-file and per-volume restores and an ongoing need for simplicity and reduction in operating costs.
To provide per-file restore capabilities, the backup utility and the LFS server ensure that the appropriate file header information is stored during the save operation. The saved file system data, including file headers, log mapping information, and user data, are stored in a file known as a saveset. Each saveset, regardless of the number of tapes it requires, represents a single save operation.
To reduce the complexity of file restore operations, the Spiralog file system provides an on-line saveset merge feature. This allows the system manager to merge several savesets, either full or incremental, to form a new, single saveset. With this feature, system managers can have a workable backup save plan that never calls for an on-line full backup, thus further reducing the load on their production systems. Also, this feature can be used to ensure that file restore operations can be accomplished with a small, bounded set of savesets.
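The semantics of merging full and incremental savesets can be sketched as a fold where the newest copy of each file wins. This toy model captures only that override behavior; real savesets also carry file headers and log mapping information, and deletions are not modeled here.

```python
def merge_savesets(savesets):
    """Merge full and incremental savesets into one equivalent saveset.

    'savesets' is ordered oldest to newest; each is a dict mapping a
    file identifier to its saved contents (an incremental saveset holds
    only the files changed since the previous save).  The newest copy of
    each file wins, so restoring from the merged saveset is equivalent
    to replaying the whole sequence.
    """
    merged = {}
    for saveset in savesets:
        merged.update(saveset)
    return merged

# A full save followed by two incrementals collapses to one saveset.
full = {"a.txt": "v1", "b.txt": "v1"}
inc1 = {"a.txt": "v2"}
inc2 = {"c.txt": "v1"}
print(merge_savesets([full, inc1, inc2]))
# → {'a.txt': 'v2', 'b.txt': 'v1', 'c.txt': 'v1'}
```

Because the merged saveset is itself a single saveset, a restore never needs more than a small, bounded set of inputs.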
The Spiralog backup facility is described in detail in this issue.3
Project Results<br />
The Spiralog file system contains a number of innovations in the areas of on-line backup, log-structured storage, clusterwide ordered write-behind caching, and multiple-file-system client support.
The use of log structuring as an on-disk format is very effective in supporting high-performance, on-line backup. The Spiralog file system retains the previously documented benefits of LFS, such as fast write performance that scales with the disk size and throughput that increases as large read caches are used to offset disk reads.1
It should also be noted that the Files-11 file system sets a high standard for data reliability and robustness. The Spiralog technology met this challenge very well: as a result of the idempotent protocol, the cluster failover design, and the recovery capability of the log, we encountered few data reliability problems during development.
In any large, complex project, many technical decisions are necessary to convert research technology into a product. In this section, we discuss why certain decisions were made during the development of the Spiralog subsystems.
VPI Results<br />
The VPI file system was generally successful in providing the underlying support necessary for different file system personalities. We found that it was possible to construct a set of primitive operations that could be used to build complex, user-level, file system operations.
By using these primitives, the Spiralog project members were able to successfully design two distinctly different personality modules. Neither was a functional superset of the other, and neither was layered on top of the other. However, there was an important second-order problem.
The FSLIB file system personality did not have a full mapping to the Files-11 file system. As a consequence, file management was rather difficult, because all the data management tools on the OpenVMS operating system assumed compliance with a Files-11, rather than a VPI, file system.
This problem led to the decision not to proceed with the original design for the FSLIB personality in version 1.0 of Spiralog. Instead, we developed an FSLIB file system personality that was fully compatible with the F64 personality, even when that compatibility forced us to accept an additional execution cost.
We also found an execution cost to the primitive VPI operations. Generally, there was little overhead for
are eligible for cleaning but have not yet been processed. These obsolete areas are also saved by the backup engine. Although they have no effect on the logical state of the log, they do require the backup engine to move more data to backup storage than might otherwise be necessary.
Second, the design initially prohibited the cleaner from running against a log with snapshots. Consequently, the cleaner was disabled during a save operation, which had the following effects: (1) The amount of available free space in the log was artificially depressed during a backup. (2) Once the backup was finished, the activated cleaner would discover that a great number of log segments were now eligible for cleaning. As a result, the cleaner underwent a sudden surge in cleaning activity soon after the backup had completed.
We addressed this problem by reducing the area of the log that was off limits to the cleaner to only the part that the backup engine would read. This limited snapshot window allowed more segments to remain eligible for cleaning, thus greatly alleviating the shortage of cleanable space during the backup and eliminating the postbackup cleaning surge. For an 8-gigabyte time-sharing volume, this change typically reduced the period of high cleaner activity from 40 seconds to less than one-half of a second.
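The limited snapshot window amounts to a simple eligibility filter: only segments the backup engine has yet to read are off limits. A minimal sketch, with function and parameter names assumed for illustration:

```python
def eligible_for_cleaning(candidates, backup_window):
    """Filter cleanable segments against a limited snapshot window.

    'candidates' are segment numbers the cleaner would like to process;
    'backup_window' is the (first, last) inclusive range of segments the
    backup engine has yet to read, or None when no backup is running.
    Only segments inside that window are off limits to the cleaner.
    """
    if backup_window is None:
        return list(candidates)
    first, last = backup_window
    return [seg for seg in candidates if seg < first or seg > last]
```

With the original design the window was effectively the whole log and nothing was cleanable during a save; shrinking it to the unread portion leaves most candidate segments eligible throughout the backup.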
We have not yet experimented with different cleaner algorithms. More work needs to be done in this area to see if the cleaning efficiency, cost, and interactions with backup can be improved.
The current mapping transformation from the incoming operation address space to locations in the open-ended log is more expensive in CPU time than we would like. More work is needed to optimize the code path.
Finally, the LFS server is generally successful at providing the appearance of a traditional, update-in-place file system. However, as the unused space in a volume nears zero, behaving with semantics that meet users' expectations in a log-structured file system proved more difficult than we had anticipated and required significant effort to correct.
The LFS server is described in much more detail in this issue.4
Table 2
Performance Comparison of the Backup Save Operation

File System        Elapsed Time
Spiralog
Files-11

Backup Performance Results
We took a new approach to the backup design in the Spiralog system, resulting in a very fast and very low impact backup that can be used to create consistent copies of the file system while applications are actively modifying data.
Table 3
Performance Comparison of the Individual File Restore Operation

File System        Elapsed Time (Minutes:Seconds)
Spiralog           01:06
Files-11           03:35
length is mitigated by continuing improvements in processor performance.
We learned a great deal during the Spiralog project and made the following observations:
• Volume full semantics and fine-tuning the cleaner were more complex than we anticipated and will require future refinement.
• A heavily layered architecture extends the CPU path length and the fan-out of procedure calls. We focused too much attention on reducing I/Os and not enough attention on reducing the resource usage of some critical code paths.
• Although elegant, the memory abstraction for the interface to the cache was not as good a fit to file system operations as we had expected. Furthermore, a block abstraction for the data space would have been more suitable.
In summary, the project team delivered a new file system for the OpenVMS operating system. The Spiralog file system offers single-system semantics in a cluster, is compatible with the current OpenVMS file system, and supports on-line backup.
Future Work<br />
During the Spiralog version 1.0 project, we pursued a number of new technologies and found four areas that warrant future work:
• Support is needed from storage and file-management tools for multiple, dissimilar file system personalities.
• The cleaner represents another area of ongoing innovation and complex dynamics. We believe a better understanding of these dynamics is needed, and design alternatives should be studied.
• The on-line backup engine, coupled with the log-structured file system technology, offers many areas for potential development. For instance, one area for investigation is continuous backup operation, either to a local backup device or to a remote replica.
• Finally, we do not believe the higher-than-expected code path length is intrinsic to the basic file system design. We expect to be working on this resource usage in the near future.
Acknowledgments<br />
We would like to take this opportunity to thank the many individuals who contributed to the Spiralog project. Don Harbert and Rich Marcello, OpenVMS vice presidents, supported this work over the lifetime of the project. Dan Doherty and Jack Fallon, the OpenVMS managers in Livingston, Scotland, had day-to-day management responsibility. Cathy Foley kept the project moving toward the goal of shipping. Janis Horn and Clare Wells, the product managers who helped us understand our customers' needs, were eloquent in explaining our project and goal to others. Near the end of the project, Yehia Beyh and Paul Mosteika gave us valuable testing support, without which the product would certainly be less stable than it is today. Finally, and not least, we would like to acknowledge the members of the development team: Alasdair Baird, Stuart Bayley, Rob Burke, Ian Compton, Chris Davies, Stuart Deans, Alan Dewar, Campbell Fraser, Russ Green, Peter Hancock, Steve Hirst, Jim Hogg, Mark Howell, Mike Johnson, Robert Landau, Douglas McLaggan, Rudi Martin, Conor Morrison, Julian Palmer, Judy Parsons, Ian Pattison, Alan Paxton, Nancy Phan, Kevin Porter, Alan Potter, Russell Robles, Chris Whitaker, and Rod Widdowson.
References<br />
1. M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 10, no. 1 (February 1992): 26-52.
2. T. Mann, A. Birrell, A. Hisgen, C. Jerian, and G. Swart, "A Coherent Distributed File Cache with Directory Write-behind," Digital Systems Research Center, Research Report 103 (June 1993).
3. R. Green, A. Baird, and J. Davies, "Designing a Fast, On-line Backup System for a Log-structured File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 32-45.
4. C. Whitaker, J. Bayley, and R. Widdowson, "Design of the Server for the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 15-31.
5. M. Howell and J. Palmer, "Integrating the Spiralog File System into the OpenVMS Operating System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 46-56.
6. R. Wrenn, "Why the Real Cost of Storage Is More Than $1/MB," presented at U.S. DECUS Symposium, St. Louis, Mo., June 3-6, 1996.
Biographies

James E. Johnson
Jim Johnson, a consulting software engineer, has been working for Digital since 1984. He was a member of the OpenVMS Engineering Group, where he contributed in several areas, including RMS, transaction processing services, the port of OpenVMS to the Alpha architecture, file systems, and system management. Jim recently joined the Internet Software Business Unit and is working on the application of X.500 directory services. Jim holds two patents on transaction commit protocol optimizations and maintains a keen interest in this area.
William A. Laing
Bill Laing, a corporate consulting engineer, is the technical director of the Internet Software Business Unit. Bill joined Digital in 1981; he worked in the United States for five years before transferring to Europe. During his career at Digital, Bill has worked on VMS systems performance analysis, VAXcluster design and development, operating systems development, and transaction processing. He was the technical director of OpenVMS engineering, the technical director for engineering in Europe, and most recently was focusing on software in the Technology and Architecture Group of the Computer Systems Division. Prior to joining Digital, Bill held research and teaching posts in operating systems at the University of Edinburgh, where he worked on the EMAS operating system. He was also part of the start-up of European Silicon Structures (ES2), an ambitious pan-European company. He holds undergraduate and postgraduate degrees in computer science from the University of Edinburgh.
Design of the Server for
the Spiralog File System
The Spiralog file system uses a log-structured, on-disk format inspired by the Sprite log-structured file system (LFS) from the University of California, Berkeley. Log-structured file systems promise a number of performance and functional benefits over conventional, update-in-place file systems, such as the Files-11 file system developed for the OpenVMS operating system or the FFS file system on the UNIX operating system. The Spiralog server combines log-structured technology with more traditional B-tree technology to provide a general server abstraction. The B-tree mapping mechanism uses write-ahead logging to give stability and recoverability guarantees. By combining write-ahead logging with a log-structured, on-disk format, the Spiralog server merges file system data and recovery log records into a single, sequential write stream.
Christopher Whitaker<br />
J. Stuart Bayley<br />
Rod D. W. Widdowson<br />
The goal of the Spiralog file system project team was to produce a high-performance, highly available, and robust file system with a high-performance, on-line backup capability for the OpenVMS Alpha operating system. The server component of the Spiralog file system is responsible for reading data from and writing data to persistent storage. It must provide fast write performance, scalability, and rapid recovery from system failures. In addition, the server must allow an on-line backup utility to copy a consistent snapshot of the file system to another location, while allowing normal file system operations to continue in parallel.
In this paper, we describe the log-structured file system (LFS) technology and its particular implementation in the Spiralog file system. We also describe the novel way in which the Spiralog server maps the log to provide a rich address space in which files and directories are constructed. Finally, we review some of the opportunities and challenges presented by the design we chose.
Background
All file systems must trade off performance against availability in different ways to provide the throughput required during normal operations and to protect data from corruption during system failures. Traditionally, file systems fall into two categories: careful write and check on recovery.
• Careful writing policies are designed to provide a fail-safe mechanism for the file system structures in the event of a system failure; however, they suffer from the need to serialize several I/Os during file system operations.
• Some file systems forego the need to serialize file system updates. After a system failure, however, they require a complete disk scan to reconstruct a consistent file system. This requirement becomes a problem as disk sizes increase.
Modern file systems such as Cedar, Episode, Microsoft's New Technology File System (NTFS), and Digital's POLYCENTER Advanced File System use logging to overcome the problems inherent in these two approaches.1,2 Logging file system metadata removes the need to serialize I/Os and allows a simple
and bounded mechanism for reconstructing the file system after a failure. Researchers at the University of California, Berkeley, took this process one stage further and treated the whole disk as a single, sequential log where all file system modifications are appended to the tail of the log.3
Log-structured file system technology is particularly appropriate to the Spiralog file system, because it is designed as a clusterwide file system. The server must support a large number of file system clerks, each of which may be reading and writing data to the disk. The clerks use large write-back caches to reduce the need to read data from the server. The caches also allow the clerks to buffer write requests destined for the server.
A log-structured design allows multiple concurrent writes to be grouped together into large, sequential I/Os to the disk. This I/O pattern reduces disk head movement during writing and allows the size of the writes to be matched to characteristics of the underlying disk. This is particularly beneficial for storage devices with redundant arrays of inexpensive disks (RAID).4
The use of a log-structured, on-disk format greatly simplifies the implementation of an on-line backup capability. Here, the challenge is to provide a consistent snapshot of the file system that can be copied to the backup media while normal operations continue to modify the file system. Because an LFS appends all data to the tail of a log, all data writes within the log are temporally ordered. A complete snapshot of the file system corresponds to the contents of the sequential log up to that point in time. The Spiralog design differs from the Sprite implementation in several ways, most notably in the mapping mechanism and the cleaner. In subsequent sections of this paper, we compare these differences with existing implementations where appropriate.
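The snapshot-as-log-prefix property can be demonstrated directly: because every write is appended, the state captured by a snapshot is just the replay of the log up to the snapshot point, and later appends cannot disturb it. A minimal sketch (function and variable names are illustrative):

```python
def snapshot_state(log_entries, snap_point):
    """Reconstruct the file state captured by a snapshot of the log.

    Because an LFS appends every write to the tail, the entries are
    temporally ordered; a consistent snapshot is simply the replay of
    the log up to 'snap_point'.  Each entry is (address, data); the
    newest write to an address wins.
    """
    state = {}
    for address, data in log_entries[:snap_point]:
        state[address] = data
    return state

log = [("a", 1), ("b", 2), ("a", 3)]
snap = snapshot_state(log, 2)      # snapshot taken after two writes
log.append(("b", 9))               # normal operation continues
assert snap == {"a": 1, "b": 2}    # the snapshot is unaffected
```

This is why an on-line backup can copy a consistent image while applications keep writing: the writes land beyond the snapshot point.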
Spiralog File System Server Architecture<br />
The Spiralog file system employs a client-server architecture. Each node in the cluster that mounts a Spiralog volume runs a file system clerk. The term clerk is used in this paper to distinguish the client component of the file system from clients of the file system as a whole. Clerks implement all the file functions associated with maintaining the file system state, with the exception of persistent storage of file system and user data. This latter responsibility falls on the Spiralog server. There is exactly one server for each volume, which must run on a node that has a direct connection to the disk containing the volume. This distribution of function, where the majority of file system processing takes place on the clerk, is similar to that of the Echo file system. The reasons for choosing this architecture are described in more detail in the paper "Overview of the Spiralog File System," elsewhere in this issue.
Spiralog clerks build files and directories in a structured address space called the file address space. This address space is internal to the file system and is only loosely related to that perceived by clients of the file system. The server provides an interface that allows the clerks to persistently map data to file space addresses. Internally, the server uses a logically infinite log structure, built on top of a physical disk, to store the file system data and the structures necessary to locate the data. Figure 1 shows the relationship between the clerks and the server and the relationships among the major components within the server.
Figure 1<br />
Server Architecture
The mapping layer is responsible for maintaining the mapping between the file address space used by the clerks and the address space of the log. The server directly supports the file address space so that it can make use of information about the relative performance sensitivity of parts of the address space that is implicit within its structure. Although this makes the mapping layer relatively complex, it reduces the complexity of the clerks and aids performance. The mapping layer is the primary point of contact with the server. Here, read and write requests from clerks are received and translated into operations on the log address space.
The log driver (LD) creates the illusion of an infinite log on top of the physical disk. The LD transforms read and write requests from the mapping layer that are cast in terms of a location in the log address space into read and write requests to physical addresses on the underlying disk. Hiding the implementation of the log from the mapping layer allows the organization of the log to be altered transparently to the mapping layer. For example, parts of the log can be migrated to other physical devices without involving the mapping layer.
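One way to picture the LD's translation is a per-segment placement table: the log address space grows without bound, while each segment is placed (or migrated) anywhere on the finite disks beneath it. The table-based indirection, the class and method names, and the 256-KB segment size used here are assumptions of this sketch, not a description of the actual LD internals.

```python
SEGMENT_SIZE = 256 * 1024  # assumed bytes per log segment

class LogDriver:
    """Toy log driver: maps an infinite log address space onto disks."""

    def __init__(self):
        self.placement = {}  # segment number -> (device, physical offset)

    def place_segment(self, segment, device, phys_offset):
        self.placement[segment] = (device, phys_offset)

    def translate(self, log_address):
        """Turn a log address into a (device, physical address) pair."""
        segment, offset = divmod(log_address, SEGMENT_SIZE)
        device, phys = self.placement[segment]
        return device, phys + offset

ld = LogDriver()
ld.place_segment(0, "disk0", 4096)
ld.place_segment(1, "disk0", 1_048_576)
assert ld.translate(10) == ("disk0", 4106)
# Migrating a segment to another device changes only the table;
# the mapping layer's log addresses are untouched.
ld.place_segment(1, "disk1", 0)
assert ld.translate(SEGMENT_SIZE + 10) == ("disk1", 10)
```

The point of the indirection is exactly the transparency the text describes: callers above the LD never see where a log address physically lives.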
Figure 2
Address Translation
Although the log exported by the LD layer is conceptually<br />
infinite, disks have a finite size. The cleaner<br />
is responsible for garbage collecting or coalescing free<br />
space within the log.<br />
Figure 2 shows the relationship between the various<br />
address spaces making up the Spiralog file system. In<br />
the next three sections, we examine each of the components<br />
of the server.<br />
Mapping Layer
The mapping layer implements the mapping between the file address space used by the file system clerks and the log address space maintained by the LD. It exports an interface to the clerks that they use to read data from locations in the file address space, to write new data to the file address space, and to specify which previously written data is no longer required. The interface also allows clerks to group sets of dependent writes into units that succeed or fail as if they were a single write. In this section, we introduce the file address space and describe the data structure used to map it. Then we explain the method used to handle clerk requests to modify the address space.
File Address Space
The file address space is a structured address space. At its highest level, it is divided into objects, each of which has a numeric object identifier (OID). An object may have any number of named cells associated with it and up to 2^64 - 1 streams. A named cell may contain a variable amount of data, but it is read and written as a single unit. A stream is a sequence of bytes that are addressed by their offset from the start of the stream, up to a maximum of 2^64 - 1. Fundamentally, there are two forms of addresses defined by the file address space: named addresses of the form <OID, cell-name> specify an individual named cell within an object, and numeric addresses of the form <OID, stream-id, start-offset, length> specify a sequence of length contiguous bytes in an individual stream belonging to an object.
The clerks use named cells and streams to build files and directories. In the Spiralog file system version 1.0, a file is represented by an object, a named cell containing its attributes, and a single stream that is used to store the file's data. A directory is represented by an object that contains a number of named cells. Each named cell represents a link in that directory and contains what a traditional file system refers to as a directory entry. Figure 3 shows how data files and directories are built from named cells and streams.
The mapping layer provides three principal operations for manipulating the file address space: read, write, and clear. The read operation allows a clerk to read the contents of a named cell, a contiguous range of bytes from a stream, or all the named cells for a particular object that fall into a specified search range. The write operation allows a clerk to write to a contiguous range of bytes in a stream or an individual named cell.
Figure 3
Data Files and Directories Built from Named Cells and Streams (a data file: an object with an attributes cell and a byte stream; a directory: an object with an attributes cell and named cells FRED.TXT, FRED.C, JIM.H, CHRIS.TXT, STU.C)
The clear operation allows a clerk to remove a named cell or a number of bytes from an object.
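The object/named-cell/stream model described above can be sketched as a small data structure. This is a toy in-memory model only; the class and method names are assumptions of the sketch, and persistence, OID allocation, and the search-range read are omitted.

```python
class FSObject:
    """Toy model of a file address space object: named cells + streams."""

    def __init__(self, oid):
        self.oid = oid
        self.cells = {}    # cell name -> contents (read/written whole)
        self.streams = {}  # stream id -> bytearray of byte-addressed data

    def write_cell(self, name, value):
        self.cells[name] = value

    def write_stream(self, stream, offset, data):
        buf = self.streams.setdefault(stream, bytearray())
        buf.extend(b"\0" * (offset + len(data) - len(buf)))  # grow if needed
        buf[offset:offset + len(data)] = data

    def read_stream(self, stream, offset, length):
        buf = self.streams.get(stream, bytearray())
        return bytes(buf[offset:offset + length])

    def clear_cell(self, name):
        del self.cells[name]

# A file: one object, an attributes cell, and a single data stream.
f = FSObject(oid=42)
f.write_cell("attributes", {"size": 5})
f.write_stream(1, 0, b"hello")
assert f.read_stream(1, 0, 5) == b"hello"

# A directory: an object whose named cells are its directory entries.
d = FSObject(oid=35)
d.write_cell("FRED.TXT", {"link": 42})
```

Note how the two file-system personalities need nothing beyond these primitives: a directory entry is just a named cell, and file data is just a byte range in a stream.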
Mapping the File Address Space
We looked at a variety of indexing structures for mapping the file address space onto the log address space. We chose a derivative of the B-tree for the following reasons. For a uniform address space, B-trees provide predictable worst-case access times because the tree is balanced across all the keys it maps. A B-tree scales well as the number of keys mapped increases. In other words, as more keys are added, the B-tree grows in width and in depth. Deep B-trees carry an obvious performance penalty, however, for a single tree that must map all the files stored in the log.
To solve this problem, we limited the keys for an object to a single B-tree leaf node. With this restriction, several small files can be accommodated in a single leaf node. Files with a large number of extents (or large directories) are supported by allowing individual streams to be spawned into subtrees. The subtrees are balanced across the keys within the subtree. An object can never span more than a single leaf node of the main B-tree; therefore, nonleaf nodes of the main B-tree only need to contain OIDs. This allows the main B-tree to be very compact. Figure 4 shows the relationship between the main B-tree and its subtrees.
Figure 4<br />
Mapping B-tree Structure
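The two-level scheme can be sketched as follows: an object's keys live inline in its main-tree leaf record until they outgrow a capacity limit, at which point they are spawned into a per-object subtree, while nonleaf levels of the main tree need only OIDs. The dictionaries here stand in for balanced tree nodes, and the capacity, names, and spawn rule are assumptions of this sketch rather than the server's actual parameters.

```python
LEAF_CAPACITY = 4  # assumed small capacity for illustration

class ObjectMap:
    """Main-tree leaf record for one object: either a handful of keys
    stored inline, or a reference to a spawned subtree."""

    def __init__(self):
        self.inline = {}     # key -> value while the object is small
        self.subtree = None  # spawned key map once the object grows

    def insert(self, key, value):
        if self.subtree is not None:
            self.subtree[key] = value
        else:
            self.inline[key] = value
            if len(self.inline) > LEAF_CAPACITY:
                # Spawn: move the object's keys into their own subtree,
                # balanced (in the real server) across that object's keys.
                self.subtree, self.inline = self.inline, {}

    def lookup(self, key):
        src = self.subtree if self.subtree is not None else self.inline
        return src[key]

class MainTree:
    """Nonleaf levels need only OIDs, keeping the main tree compact."""

    def __init__(self):
        self.objects = {}  # OID -> ObjectMap (one leaf record per object)

    def insert(self, oid, key, value):
        self.objects.setdefault(oid, ObjectMap()).insert(key, value)

    def lookup(self, oid, key):
        return self.objects[oid].lookup(key)
```

Several small objects therefore share a leaf, while a large file or directory pays for depth only within its own subtree.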
To reduce the time required to open a file, data for small extents and small named cells are stored directly in the leaf node that maps them. For larger extents (greater than one disk block in size in the current implementation), the data item is written into the log and a pointer to it is stored in the node. This pointer is an address in the log address space. Figure 5 illustrates how the B-tree maps a small file and a file with several large extents.
Processing Read Requests
The clerks submit read requests that may be for a sequence of bytes from a stream (reading data from a file), a single named cell (reading a file's attributes), or a set of named cells (reading directory contents). To fulfill a given read request, the server must consult the B-tree to translate from the address in the file address space supplied by the clerk to the position in the log address space where the data is stored. The extents making up a stream are created when the file data is written. If an application writes 8 kilobytes (KB) of data in 1-KB chunks, the B-tree would contain 8 extents, one for each 1-KB write. The server may need to collect data from several different parts of the log address space to fulfill a single read request.
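The extent-per-write behavior in the example above can be made concrete. The record layout and function names below are illustrative only; the point is that eight 1-KB writes yield eight extents, and a read must gather every extent overlapping its range.

```python
def record_write(extents, stream, offset, length, log_address):
    """Each clerk write creates one extent mapping a byte range of the
    stream to the log position where the data landed."""
    extents.append({"stream": stream, "offset": offset,
                    "length": length, "log_address": log_address})

# An application writing 8 KB in 1-KB chunks produces 8 extents,
# one per write, each pointing at a different place in the log.
extents = []
for i in range(8):
    record_write(extents, stream=1, offset=i * 1024, length=1024,
                 log_address=1_000_000 + i * 4096)

def extents_for_read(extents, offset, length):
    """A single read may have to gather data from several extents."""
    end = offset + length
    return [e for e in extents
            if e["offset"] < end and e["offset"] + e["length"] > offset]

assert len(extents_for_read(extents, 0, 8192)) == 8
assert len(extents_for_read(extents, 1500, 1000)) == 2
```

A read of the full 8 KB must therefore visit eight different log locations, which is why large read caches matter so much in this design.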
Read requests share access to the B-tree in much the same way as processes share access to the CPU of a multiprocessing computer system. Read requests
Figure 5
Mapping B-tree Detail (key: 42.1.0.50 denotes OID 42, stream 1, start offset 0, length 50)
arriving from clerks are placed in a first in, first out (FIFO) work queue and are started in order of their arrival. All operations on the B-tree are performed by a single worker thread in each volume. This avoids the need for heavyweight locking on individual nodes in the B-tree, which significantly reduces the complexity of the tree manipulation algorithms and removes the potential for deadlocks on tree nodes. This reduction in complexity comes at the cost of the design not scaling with the number of processors in a symmetric multiprocessing (SMP) system. So far, we have no evidence to show that this design decision represents a major performance limitation on the server.
The worker thread takes a request from the head<br />
of the work queue and traverses the B-tree until it<br />
reaches a leaf node that maps the address range of<br />
the read request. Upon reaching a leaf node, it may<br />
discover that the node contains<br />
• Records that map part or all of the address of the
read request to locations in the log, and/or<br />
• Records that map part or all of the address of the<br />
read request to data stored directly in the node,<br />
and/or<br />
• No records mapping part or all of the address of the<br />
read request<br />
Digital Technical Journal Vol. 8 No. 2 1996
Data that is stored in the node is simply copied
to the output buffer. When data is stored in the log,
the worker thread issues requests to the LD to read the
data into the output buffer. Once all the reads have
been issued, the original request is placed on a pending
queue until they complete; then the results are
returned to the clerk. When no data is stored for all or
part of the read request, the server zero-fills the corresponding
part of the output buffer.

The process described above is complicated by the
fact that the B-tree is itself stored in the log. The mapping
layer contains a node cache that ensures that commonly
referenced nodes are normally found in memory.
When the worker thread needs to traverse through a
tree node that is not in memory, it must arrange for the
node to be read into the cache. The address of the node
in the log is the value of the pointer to it from its parent
node. The worker thread uses this to issue a request to
the LD to read the node into a cache buffer. While the
node read request is in progress, the original clerk operation
is placed on a pending queue and the worker
thread proceeds to the next request on the work queue.
When the node is resident in memory, the pending read
request is placed back on the work queue to be
restarted. In this way, multiple read requests can be in
progress at any given time.
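The single worker thread with its FIFO work queue and pending queue can be sketched as a simple event loop. This is an illustrative model only; the structures are assumptions, and node-read completions are simulated synchronously:

```python
from collections import deque

# Sketch of the single-worker-thread scheme: a request needing a node that
# is not cached is parked on a pending queue while the node is "read" from
# the log, then requeued; the worker meanwhile serves later requests.

def run_server(requests, cache, log):
    work = deque(requests)      # FIFO work queue
    pending = []                # requests waiting on node reads
    results = []
    while work or pending:
        if not work:
            # Simulate node-read completions: pending requests restart.
            for req, node in pending:
                cache[node] = log[node]   # node now resident in the cache
                work.append(req)
            pending.clear()
            continue
        req = work.popleft()
        node = req["node"]
        if node in cache:
            results.append((req["id"], cache[node]))
        else:
            pending.append((req, node))   # park; worker moves on

    return results

log = {"A": "data-a", "B": "data-b"}
cache = {"A": "data-a"}               # node B is not yet resident
requests = [{"id": 1, "node": "A"},
            {"id": 2, "node": "B"},
            {"id": 3, "node": "A"}]
results = run_server(requests, cache, log)
```

Request 2 misses the cache and is parked, so request 3 completes before it, even though requests are always started in FIFO order.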
Processing Write Requests<br />
Write requests received by the server arrive in groups
consisting of a number of data items corresponding to
updates to noncontiguous addresses in the file address
space. Each group must be written as a single failure-atomic
unit, which means that all the parts of the write
request must be made stable or none of them must
become stable. Such groups of writes are called wunners
and are used by the clerk to encapsulate complex
file system operations.

Before the server can complete a wunner, that
is, before an acknowledgment can be sent back to
the clerk indicating that the wunner was successful,
the server must make two guarantees:

1. All parts of the wunner are stably stored in the log
so that the entire wunner is persistent in the event
of a system failure.

2. All data items described by the wunner are visible to
subsequent read requests.

The wunner is made persistent by writing each data
item to the log. Each data item is tagged with a log
record that identifies its corresponding file space
address. This allows the data to be recovered in the
event of a system failure. All individual writes are made
as part of a single compound atomic operation (CAO).
This method is provided by the LD layer to bracket
a set of writes that must be recovered as an atomic
unit. Once all the writes for the wunner have been
issued to the log, the mapping layer instructs the LD
layer to end (or commit) the CAO.<br />
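A minimal sketch of CAO bracketing, assuming an invented LD-layer-style interface (begin_cao, write, and commit_cao are our names, not the real API), shows why only committed groups survive a failure:

```python
# Sketch of a compound atomic operation (CAO): writes are tagged with a
# CAO id, and recovery keeps only records from CAOs that committed.

class Log:
    def __init__(self):
        self.records = []      # (cao_id, file_address, data)
        self.committed = set()
        self._next = 0

    def begin_cao(self):
        self._next += 1
        return self._next

    def write(self, cao, address, data):
        self.records.append((cao, address, data))

    def commit_cao(self, cao):
        self.committed.add(cao)

    def recover(self):
        """After a failure, only records from committed CAOs survive."""
        return [(a, d) for c, a, d in self.records if c in self.committed]

log = Log()
cao = log.begin_cao()
log.write(cao, ("file-42", 0), b"head")
log.write(cao, ("file-42", 512), b"tail")
log.commit_cao(cao)

# A second wunner whose CAO never commits (crash before the commit):
orphan = log.begin_cao()
log.write(orphan, ("file-7", 0), b"lost")
```

The orphaned write is present in the log but invisible after recovery, which is exactly the all-or-nothing guarantee the wunner requires.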
The wunner can be made visible to subsequent read
operations by updating the B-tree to reflect the location
of the new data. Unfortunately, this would cause
writes to incur a significant latency since updating the
B-tree involves traversing the B-tree and potentially
reading B-tree nodes into memory from the log.
Instead, the server completes a write operation before
the B-tree is updated. By doing this, however, it must
take additional steps to ensure that the data is visible to
subsequent read requests.
Before completing the wunner, the mapping layer
queues the B-tree updates resulting from writing the
wunner to the same FIFO work queue as read requests.
All items are queued atomically, that is, no other read
or write operation can be interleaved with the individual
wunner updates. In this way, the ordering between
the writes making up the wunner and subsequent read
or write operations is maintained. Work cannot begin
on a subsequent read request until work has started on
the B-tree updates ahead of it in the queue.

Once the B-tree updates have been queued to the
server work queue and the mapping layer has been
notified that the CAO for the writes has committed,
both of the guarantees that the server gives on write
completion hold. The data is persistent, and the writes
are visible to subsequent operations; therefore, the
server can send an acknowledgment back to the clerk.
Updating the B-tree
The worker thread processes a B-tree update request
in much the same way as a read request. The update
request traverses the B-tree until either it reaches the
node that maps the appropriate part of the file address
space, or it fails to find a node in memory.

Once the leaf node is reached, it is updated to point at
the location of the data in the log (if the data is to be
stored directly in the node, the data is copied into the
node). The node is now dirty in memory and must
be written to the log at some point. Rather than writing
the node immediately, the mapping layer writes a log
record describing the change, locks the node into the
cache, and places a flush operation for the node on
the mapping layer's flush queue. The flush operation
describes the location of the node in the tree and
records the need to write it to the log at some point
in the future.
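The deferred-write scheme can be sketched as follows; the structures and names are illustrative assumptions, not the Spiralog implementation:

```python
from collections import deque

# Sketch of deferring node writes: a change record goes to the log at once,
# while the dirty node is pinned in the cache and a flush operation is
# queued so the node itself is written later.

class Cache:
    def __init__(self):
        self.locked = set()

    def lock(self, name):
        self.locked.add(name)   # pin a dirty node in memory

def apply_update(node, key, log_address, log, cache, flush_queue):
    node["entries"][key] = log_address   # point the leaf at the new data
    log.append(("change-record", node["name"], key, log_address))
    cache.lock(node["name"])             # node is dirty: keep it resident
    flush_queue.append(node["name"])     # write it to the log eventually

log, cache, flush_queue = [], Cache(), deque()
leaf = {"name": "leaf-1", "entries": {}}
apply_update(leaf, ("file-42", 0), 9000, log, cache, flush_queue)
```

The update is durable as soon as the change record reaches the log; the node write itself can be batched and deferred.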
If, on its way to the leaf node, the write operation
reaches a node that is not in memory, the worker
thread arranges for it to be read from the log and the
write operation is placed on a pending queue as with a
read operation. Because the write has been acknowledged
to the clerk, the new data must remain visible
while the node is being read. If a read operation reaches the node with
records of stalled updates, it must check whether any<br />
of these records contains data that should be returned.<br />
The record contains either a pointer to the data in the
log or the actual data itself. If a read operation finds
a record that can satisfy all or part of the request, the
read request uses the information in the record to
fetch the data. This preserves the guarantee that the
clerk must see all data for which the write request has<br />
been acknowledged.<br />
Once the node is read in from the log, the stalled
updates are restarted. Each update removes its log
record from the node and recommences traversing the
B-tree from that point.<br />
Writing B-tree Nodes to the Log<br />
Writing nodes consumes bandwidth to the disk that<br />
might otherwise be used for writing or reading user<br />
data, so the server tries to avoid doing so until<br />
absolutely necessary. Two conditions make it necessary<br />
to begin writing nodes:<br />
1. There are a large number of dirty nodes in the<br />
cache.<br />
2. A checkpoint is in progress.<br />
In the first condition, most of the memory available
to the server has been given over to nodes that are
locked in memory and waiting to be written to the
log. Read and update operations begin to back up,
waiting for available memory to store nodes. In the
second condition, the LD has requested a checkpoint
in order to bound recovery time (see the section
Checkpointing later in this paper).
When either of these conditions occurs, the mapping
layer switches into flush mode, during which it only
writes nodes, until the condition is changed. In flush
mode, the worker thread processes flush operations
from the mapping layer's flush queue in depth order,
that is, starting with the nodes farthest from the root
of the B-tree. For each flush operation, it traverses the
B-tree until it finds the target node and its parent. The
target node is identified by the keys it maps and its
level. The level of a node is its distance from the leaf of
the B-tree (or subtree). Unlike its depth, which is its
distance from the root of the B-tree, a node's level does
not change as the B-tree grows and shrinks.
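The level/depth distinction can be illustrated with a toy tree; the dictionary encoding is an assumption made for the sketch:

```python
# Sketch of level vs. depth: a node's level (distance from the leaves)
# survives tree growth, while its depth (distance from the root) does not.

def level(node):
    # Leaves are level 0; an index node is one above its tallest child.
    if not node.get("children"):
        return 0
    return 1 + max(level(c) for c in node["children"])

def depth(node, target, d=0):
    # Distance from `node` (the root) down to `target`, or None.
    if node is target:
        return d
    for c in node.get("children", []):
        found = depth(c, target, d + 1)
        if found is not None:
            return found
    return None

leaf = {"children": []}
root = {"children": [leaf]}
l0, d0 = level(leaf), depth(root, leaf)

# The tree grows a new root; the leaf's depth changes, its level does not.
new_root = {"children": [root]}
l1, d1 = level(leaf), depth(new_root, leaf)
```

This is why a flush operation identifies its target by keys and level: those remain valid even if the tree has grown or shrunk since the operation was queued.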
Once it has reached its destination, the flush operation
writes out the target node and updates the parent
with the new log address. The modifications made to
the parent
Figure 7
Subcomponents of the LD Layer
The Segment Writer<br />
The segment writer is responsible for all I/Os to the
log. It groups together writes it receives from the mapping
layer into large, sequential I/Os where possible.
This increases write throughput, but at the potential
cost of increasing the latency of individual operations
when the disk is lightly loaded.
As shown in Figure 8, the segment writer is responsible
for the internal organization of segments written
to the disk. Segments are divided into two sections, a
data area and a much smaller commit record area.
Writing a piece of data requires two operations to the
segment at the tail of the log. First the data item is
written to the data area of the segment. Once this I/O
has completed successfully, a record describing that
data is written to the commit record area. Only when
the write to the commit record area is complete can
the original request be considered stable.
Figure 8
Organization of a Segment
The need for two writes to disk (potentially, with a
rotational delay between) to commit a single data
write is clearly a disadvantage. Normally, however, the
segment writer receives a set of related writes from
the mapping layer which are tagged as part of a single
CAO. Since the mapping layer is interested in the completion
of the whole CAO and not the writes within it,
the segment writer is able to buffer additions to the
commit record area in memory and then write them
with a single I/O. Under a normal write load, this
reduces the number of I/Os for a single data write to
very close to one.
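A back-of-envelope model of this amortization (our own, not from the paper) shows how the per-item cost approaches one I/O as the CAO grows:

```python
# Sketch of commit-record buffering: each data item costs one data-area
# I/O, but the commit records for a whole CAO are buffered and written
# together, so the per-item I/O count approaches one.

def ios_per_item(items_in_cao):
    data_ios = items_in_cao    # one data-area write per item
    commit_ios = 1             # one buffered commit-record write per CAO
    return (data_ios + commit_ios) / items_in_cao

single = ios_per_item(1)      # worst case: two I/Os for one write
batched = ios_per_item(16)    # a larger CAO: close to one I/O per write
```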
The boundary between the commit record area and
the data area is fixed. Inevitably, this wastes space in
either the commit record area or data area when the
other fills. Choosing a size for the commit record area
that minimizes this waste requires some care. After
analysis of segments that had been subjected to a typical
OpenVMS load, we chose 24 KB as the value for
the commit record area.
This segment organization permits the segment
writer to have complete control over the contents of
the commit record area, which allows the segment
writer to accomplish two important recovery tasks:
• Detect the end of the log<br />
• Detect multiblock write failure<br />
When physical segments are reused to extend the
log, they are not scrubbed and their commit record
areas contain stale (but comprehensible) records. The
recovery manager distinguishes stale records by sequence
number: the sequence number of a segment is an
attribute of the physical slot that is
assigned to it. The sequence number for a physical slot
is incremented each time the slot is reused, allowing<br />
the recovery manager to detect blocks that do not<br />
belong to the segment stored in the physical slot.<br />
The cost of resubstituting the stolen bytes is incurred<br />
only during recovery and cleaning, because this is<br />
the only time that the commit record area is read.<br />
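The sequence-number check can be sketched as follows; the record layout is an illustrative assumption:

```python
# Sketch of stale-record detection: every commit record carries the
# sequence number of the segment it was written for. Records whose number
# does not match the slot's current sequence number are stale leftovers
# from an earlier use of the slot.

def live_records(slot_sequence, commit_area):
    """Keep only commit records written for the current use of the slot."""
    return [rec for seq, rec in commit_area if seq == slot_sequence]

# A reused physical slot: sequence 3 now, but two records from its
# previous use (sequence 2) were never overwritten.
commit_area = [(3, "write-a"), (3, "write-b"), (2, "stale-x"), (2, "stale-y")]
recovered = live_records(3, commit_area)
```

The same check lets recovery find the end of the log: the first record that belongs to an older use of the slot marks where valid data stops.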
In hindsight, the partitioning of segments into data<br />
and commit areas was probably a mistake. A layout<br />
that intermingles the data and commit records and that<br />
allows them to be written in one I/O would offer better
latency at low throughput. If combined with careful<br />
writing, command tag queuing, and other optimizations<br />
becoming more prevalent in disk hardware and<br />
controllers, such an on-disk structure could offer significant<br />
improvements in latency and throughput.<br />
Cleaner<br />
The cleaner's job is to turn free space in segments in<br />
the log into empty, unassigned physical slots that can<br />
be used to extend the log. Areas of free space appear in
segments when the corresponding data decays; that is,<br />
it is either deleted or replaced.<br />
The cleaner rewrites the live data contained in partially
full segments. Essentially, the cleaner forces the
segments to decay completely. If the rate at which data
is written to the log matches the rate at which it is
deleted, segments eventually become empty of their
own accord. When the log is full (fullness depends on
the distribution of file longevity), it is necessary to
proactively clean segments. As the cleaner continues
to consume more of the disk bandwidth, performance
can be expected to decline. Our design goal was that
performance should be maintained up to a point at
which the log is 85 percent full. Beyond this, it was
acceptable for performance to degrade significantly.
Bytes Die Young
Recently written data is more likely to decay than old
data. Segments that were written a short time ago
are likely to decay further, after which the cost of
cleaning them will be less. In our design, the cleaner
selects candidate segments that were written some
time ago and are more likely to have undergone this
initial decay.
Mixing data cleaned from older segments with data
from the current stream of new writes is likely to produce
a segment that will need to be cleaned again once
the new data has undergone its initial decay. To avoid
mixing cleaned data and data from the current write
stream, the cleaner builds its output segments separately
and then passes them to the LD to be threaded in
at the tail of the log. This has two important benefits:
• The recovery information in the output segment is
minimal, consisting only of the self-describing tags
on the data. As a result, the cleaner is unlikely to
waste space in the data area by virtue of having filled
the commit record area.

• By constructing the output segment offline, the
cleaner has as much time as it needs to look for data
chunks that best fill the segment.
Remapping the Output Segment<br />
The data items contained in the cleaner's output segment
receive new addresses. The cleaner informs the
mapping layer of the change of location by submitting
a B-tree update operation for each piece of data it
copied. The mapping layer handles this update operation
in much the same way as it would a normal overwrite.
This update does have one special property:
the cleaner writes are conditional. In other words, the
mapping layer will update the B-tree to point to
the copy created by the cleaner as long as no change
has been made to the data since the cleaner took its
copy. This allows the cleaner to work asynchronously
to file system activity and avoids any locking protocol
between the cleaner and any other part of the Spiralog
file system.
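A minimal sketch of the conditional update, using a dictionary in place of the B-tree (an illustrative assumption, not the Spiralog structure):

```python
# Sketch of the cleaner's conditional remap: the B-tree is repointed at
# the cleaned copy only if the data has not moved since the cleaner read
# it; otherwise the stale cleaned copy is simply discarded.

def conditional_update(btree, key, old_address, new_address):
    """Apply the cleaner's remap unless an overwrite got there first."""
    if btree.get(key) == old_address:
        btree[key] = new_address
        return True
    return False   # data changed under the cleaner; drop the stale copy

btree = {("file-42", 0): 7000}

# No concurrent overwrite: the remap to the cleaned copy succeeds.
moved = conditional_update(btree, ("file-42", 0), 7000, 9500)

# An application overwrite moved this item to 8800 after the cleaner read
# it from 7100; the cleaner's late remap is rejected.
btree[("file-42", 512)] = 8800
applied = conditional_update(btree, ("file-42", 512), 7100, 9600)
```

Because a losing remap is harmless, no lock is needed between the cleaner and the rest of the file system.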
To avoid modifying the mapping layer directly, the
cleaner does not copy B-tree nodes to its output segment.<br />
Instead, it requests the mapping layer to flush<br />
the nodes that occur in its input segments (i.e., rewrite<br />
them to the tail of the log). This also avoids wasting<br />
space in the cleaner output segment on nodes that<br />
map data in the cleaner's input segments. These nodes<br />
are guaranteed to decay as soon as the cleaner's B-tree<br />
updates are processed.<br />
Figure 9 shows how the cleaner constructs an output
segment from a number of input segments. The cleaner
keeps selecting input segments until either the output
segment is full, or there are no more input segments.
Figure 9 also shows the set of operations that are generated
by the cleaner. In this example, the output segment
is filled with the contents of two full segments and part
of a third segment. This will cause the third input segment
to decay still further, and the remaining data and
B-tree nodes will be cleaned when that segment is
selected to create another output segment.
Cleaner Policies<br />
A set of heuristics governs the cleaner's operation.<br />
One of our fundamental design decisions was to separate
the cleaner policies from the mechanisms that
implement them.
When to clean?<br />
Our design explicitly avoids cleaning until it is<br />
required. This design appears to be a good match for
Figure 9
Cleaner Operation
a workload on the OpenVMS system. On our time-sharing
system, the cleaner was entirely inactive for the
first three months of 1996; although segments were
used and reused repeatedly, they always decayed
entirely to empty of their own accord. The trade-off
in avoiding cleaning is that although performance is
improved (no cleaner activity), the size of the full
savesnaps created by backup is increased. This is
because backup copies whole segments, regardless of
how much live data they contain.

When the cleaner is not running, the live data in the
volume tends to be distributed across a large number of
partially full segments. To avoid this problem, we have
added a control to allow the system manager to manually
start and stop the cleaner. Forcing the cleaner to
run before performing a full backup compacts the live
data in the log and reduces the size of the savesnap.
In normal operation, the cleaner will start cleaning
when the number of free segments available to extend
the log falls below a fixed threshold (300 in the current
implementation). In making this calculation, the
cleaner takes into account the amount of space in
the log that will be consumed by writing data currently
held in the clerks' write-behind caches. Thus, accepting
data into the cache causes the cleaner to "clear the way"
for the subsequent write request from the clerk.

When the cleaner starts, it is possible that the
amount of live data in the log is approaching
the capacity of the underlying disk, so the cleaner may
find nothing to do. It is more likely, however, that
there will be free space it can reclaim. Because the
cleaner works by forcing the data in its input segments
to decay by rewriting, it is important that it begins
work while free segments are available. Delaying the
decision to start cleaning could result in the cleaner
being unable to proceed.

A fixed number was chosen for the cleaning threshold
rather than one based on the size of the disk. The
size of the disk does not affect the urgency of cleaning
at any particular point in time. A more effective indicator
of urgency is the time taken for the disk to fill at the
maximum rate of writing. Writing to the log at 10 MB
per second will use 300 segments in about 8 seconds.
With hindsight, we realize that a threshold based on a
measurement of the speed of the disk might have been
a more appropriate choice.
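The 8-second figure is consistent with a segment size of roughly 256 KB; the following check is our own arithmetic, inferred from the numbers above rather than stated in the paper:

```python
# Back-of-envelope check of the cleaning threshold: at 10 MB/s, 300
# segments of about 256 KB each hold 8 seconds' worth of writes.

write_rate = 10 * 1024 * 1024                       # bytes per second
seconds_to_fill = (300 * 256 * 1024) / write_rate   # threshold / rate
```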
Input Segment Selection
The cleaner divides segments into four distinct groups:

1. Empty. These segments contain no live data and are
available to the LD to extend the log.

2. Noncleanable. These segments are not candidates
for cleaning for one of two reasons:

• The segment contains information that would
be required by the recovery manager in the event
of a system failure. Segments in this group are
always close to the tail of the log and therefore
likely to undergo further decay, making them
poor candidates for cleaning.

• The segment is part of a snapshot. The snapshot
represents a reference to the segment, so it cannot
be reused even though it may no longer contain
live data.
3. Preferred noncleanable. These segments have
recently experienced some natural decay. The supposition
is that they may decay further in the near
future and so are not good candidates for cleaning.

4. Cleanable. These segments have not decayed for
some time. Their stability makes them good candidates
for cleaning.

The transitions between the groups are illustrated in
Figure 10. It should be noted that the cleaner itself
does not have to execute to transfer segments into the
empty state.
The cleaner's job is to fill output segments, not to
empty input segments. Once it has been started, the
cleaner works to entirely fill one segment. When that
segment has been filled, it is threaded into the log;
if appropriate, the cleaner will then repeat the process
with a new output segment and a new set of input
segments. The cleaner will commit a partially full
output segment only under circumstances of extreme
resource depletion.

The cleaner fills the output segment by copying
chunks of data forward from segments taken from the
cleanable group. The members of this group are held
on a list sorted in order of emptiness. Thus, the first
cleaner cycle will always cause the greatest number of
segments to decay. As the output segment fills, the
smallest chunk of data in the segment at the head of
the cleanable list may be larger than the space left in
the output segment. In this case, the cleaner performs
a limited search down the cleanable list for segments
containing a suitable chunk. The required information
is kept in memory, so this is a reasonably cheap operation.
As each input segment is processed, the cleaner
Figure 10<br />
Segment States<br />
temporarily removes it from the cleanable list. This
allows the mapping layer to process the operations the
cleaner submitted to it and thereby cause decay
to occur before the cleaner again considers the segment
as a candidate for cleaning. As the volume fills,
the ratio between the number of segments in the
cleanable and preferred noncleanable groups is
adjusted so that the size of the preferred noncleanable
group is reduced and segments are inserted into the
cleanable list. If appropriate, a segment in the cleanable
list that experiences decay will be moved to the
preferred noncleanable list. The preferred noncleanable
list is kept in order of least recently decayed.
Hence, as it is emptied, the segments that are least
likely to experience further decay are moved to the
cleanable group.
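The filling behavior can be sketched with a simplified greedy model; the bounded search, data structures, and chunk sizes here are illustrative assumptions, not the exact Spiralog algorithm:

```python
# Sketch of filling an output segment: take chunks from the emptiest
# cleanable segments first, and when a chunk does not fit, keep searching
# a bounded number of segments down the list for one that does.

def fill_output(cleanable, capacity, search_limit=3):
    """cleanable: segments (lists of chunk sizes) sorted emptiest-first.
    Returns the chunks copied forward and the space left over."""
    space = capacity
    copied = []
    for segment in cleanable[:search_limit]:   # limited search down the list
        for chunk in list(segment):
            if chunk <= space:
                segment.remove(chunk)          # chunk moves to the output
                copied.append(chunk)
                space -= chunk
    return copied, space

segments = [[60, 30], [50], [25, 10, 5]]       # emptiest-first cleanable list
copied, left = fill_output(segments, capacity=100)
```

Here the 50-unit chunk at the head of the remaining list is too big, so the search continues and a 10-unit chunk from the third segment completes the output segment exactly.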
Recovery<br />
The goal of recovery of any file system is to rebuild the
file system state after a system failure. This section
describes how the server reconstructs state, both in
memory and in the log. It then describes checkpointing,
the mechanism by which the server bounds the
amount of time it takes to recover the file system state.
Recovery Process<br />
In normal operation, a single update to the server can<br />
be viewed as several stages:<br />
1. The user data is written to the log. It is tagged with
a self-identifying record that describes its position in
the file address space. A B-tree update operation is
generated that drives stage 2 of the update process.
2. The leaf nodes of the B-tree are modified in memory,
and corresponding change records are written
to the log to reflect the position of the new data.
A flush operation is generated and queued and then
starts stage 3.

3. The B-tree is written out level by level until the root
node has been rewritten. As one node is written to
the log, the parent of that node must be modified,
and a corresponding change record is written to the
log. As a parent node is changed, a further flush
operation is generated for the parent node and so
on up to the root node.
Stage 2 of this process, logging changes to the leaf
nodes of the B-tree, is actually redundant. The self-identifying
tags that are written with the user data are
sufficient to act as change records for the leaf nodes of
the B-tree. When we started to design the server, we
chose a simple implementation based on physiological
Figure 11
Stages of a Write Request
<strong>Digital</strong> TcchnicJI Joum.1l Vol. 8 No. 2 <strong>1996</strong><br />
write-ahead logging. As time progressed, we moved
more toward operational logging. The records written
in stage 2 are a holdover from the earlier implementation,
which we may remove in a future release of
the Spiralog file system.

At each stage of the process, a change record is written
to the log and an in-memory operation is generated
to drive the update through the next stage. In effect,
the change record describes the set of changes made
to an in-memory copy of a node and an in-memory
operation associated with that change.

Figure 11 shows the log and the in-memory work
queue at each stage of a write request.
After a system failure, the server's goal is to reconstruct
the file system state to the point of the last write
that was written to the log at the time of the crash.
This recovery process involves rebuilding, in memory,
those B-tree nodes that were dirty and generating any
operations that were outstanding when the system
failed. The outstanding operations can be scheduled in
the normal way to make the changes that they represent
permanent, thus avoiding the need to recover
them in the event of a future system failure. The recovery
process itself does not write to the log.

The mapping layer work queue and the flush lists
are rebuilt, and the nodes are fetched into memory by
reading the sequential log from the recovery start
position (see the section Checkpointing) to the end of
the log in a single pass.
The B-tree update operations are regenerated using
the self-identifying tag that was written with each
piece of data. When the recovery process finds a node,
a copy of the node is stored in memory. As log records
for node changes are read, they are attached to the
nodes in memory and a flush operation is generated
for the node. If a log record is read for a node that has
not yet been seen, the log record is attached to a placeholder
node that is marked as not-yet-seen. The recovery
process does not perform reads to fetch in nodes
that are not part of the recovery scan. Changes to
B-tree nodes are a consequence of operations that
happened earlier in the log; therefore, a B-tree node
Figure 12
Recovering a Log
log record has the effect of committing a prior modification.
Recovery uses this fact to throw away update
operations that have been committed; they no longer
need to be applied.

Figure 12 shows a log with change records and
B-tree nodes along with the in-memory state of the
B-tree node cache and the operations that are regenerated.
In this example, change record 1 for node A is
superseded or committed by the new version of node A
(node A'). The new copy of node C (node C') supersedes
change records 3 and 5. This example also shows
the effect of finding a log record without seeing a copy
of the node during recovery. The log record for node B
is attached to an in-memory version of the node that is
marked as not-yet-seen. The data record with self-identifying
tag Data 1 generates a B-tree update record that
is placed on the work queue for processing. As a final
pass, the recovery process generates the set of flush
operations that was outstanding when the system failed.
The set of flush requests is defined as the set of nodes in
the B-tree node cache that has log records attached
when the recovery scan is complete. In this case, flush
operations for nodes A' and B are generated.
The server guarantees that a node is never written to the log with uncommitted changes, which means that we only need to log redo records.9,16 In addition, when we see a node during the recovery scan, any log records that are attached to the previous version of the node in memory can be discarded.<br />
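The single-pass recovery scan described above can be sketched as follows. This is an illustrative model only; the record layout, node identifiers, and queue structure are invented for the sketch and are not Spiralog's actual data structures.<br />

```python
def recover(log_records):
    """Scan log records oldest-to-newest, rebuilding in-memory state.

    Each record is a (kind, node_id, payload) tuple, where kind is
    "node" (a full node copy), "change" (a redo record), or "data"
    (a self-identifying data record).
    """
    cache = {}       # node id -> {"seen": bool, "pending": [redo records]}
    work_queue = []  # B-tree update operations regenerated from data records

    for kind, node_id, payload in log_records:
        if kind == "node":
            # A newer copy of a node commits (supersedes) every change
            # record attached to the previous in-memory version.
            cache[node_id] = {"seen": True, "pending": []}
        elif kind == "change":
            # Attach the redo record; if no copy of the node has been
            # seen yet, create a placeholder marked not-yet-seen.
            entry = cache.setdefault(node_id, {"seen": False, "pending": []})
            entry["pending"].append(payload)
        elif kind == "data":
            # Self-identifying data generates a B-tree update operation.
            work_queue.append(("update", node_id, payload))

    # Final pass: any node that still has log records attached needs
    # a flush operation regenerated for it.
    flushes = [nid for nid, e in cache.items() if e["pending"]]
    return work_queue, flushes
```

Feeding this sketch the sequence from Figure 12 (change records 1-3, node A', change records 4 and 5, node C', then Data 1) yields flush requests for nodes A and B only, matching the example in the text.<br />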
<strong>Digital</strong> <strong>Technical</strong> Journal Vol. 8 No. 2 <strong>1996</strong> 27
Operations generated during recovery are posted to the work queues as they would be in normal running. Normal operation is not allowed to begin until the recovery pass has completed; however, when recovery reaches the end of the log, the server is able to service operations from clerks. Thus new requests from the clerk can be serviced, potentially in parallel with the operations that were generated by the recovery process.<br />
Log records are not applied to nodes during recovery for a number of reasons:<br />
• Less processing time is needed to scan the log, and therefore the server can start servicing new user requests sooner.<br />
• Recovery will not have seen copies of all nodes for which it has log records. To apply the log records, the B-tree node must be read from the log. This would result in random read requests during the sequential scan of the log, and again would result in a longer period before user requests could be serviced.<br />
• There may be a copy of the node later in the recovery scan. This would make the additional I/O operation redundant.<br />
Checkpointing<br />
As we have shown, recovering an LFS log is implemented by a single-pass sequential scan of all records in the log from the recovery start position to the tail of the log. This section defines a recovery start position and describes how it can be moved forward to reduce the amount of log that has to be scanned to recover the file system state.<br />
To reconstruct the in-memory state when a system crashed, recovery must see something in the log that represents each operation or change of state that was represented in memory but not yet made stable. This means that at time t, the recovery start position is defined as a point in the log after which all operations that are not stably stored have a log record associated with them. Operations obtain the association by scanning the log sequentially from the beginning to the end. The recovery position then becomes the start of the log, which has two important problems:<br />
1. In the worst case, it would be necessary to sequentially scan the entire log to perform recovery. For large disks, a sequential read of the entire log consumes a great deal of time.<br />
2. Recovery must process every log record written between the recovery start position and the end of the log. As a consequence, segments between the start of recovery and the end of the log cannot be cleaned and reused.<br />
To restrict the amount of time to recover the log and to allow segments to be released by cleaning, the recovery position must be moved forward from time to time, so that it is always close to the tail of the log.<br />
For logs that predominantly service read requests, the recovery time tends toward the limit. In these cases, it may be more appropriate to add timer-based checkpoints.<br />
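A minimal model of how a recovery start position could be tracked, assuming each unstable operation remembers the position of its earliest log record; the class and method names are invented for illustration and are not Spiralog's interfaces.<br />

```python
class CheckpointTracker:
    """Track the oldest log position that recovery must scan from.

    Every operation that is represented in memory but not yet stable
    records the position of its first log record; the recovery start
    position must sit at or before the oldest of these positions.
    """

    def __init__(self):
        self.unstable = {}  # op id -> log position of its first record

    def logged(self, op_id, position):
        # Remember only the earliest record for this operation.
        self.unstable.setdefault(op_id, position)

    def made_stable(self, op_id):
        # Once the operation's effects are stably stored, its log
        # records no longer constrain the recovery start position.
        self.unstable.pop(op_id, None)

    def recovery_start(self, tail):
        # With nothing outstanding, recovery can start at the tail,
        # so the scan (and the segments that must be retained) is
        # kept to a minimum.
        return min(self.unstable.values(), default=tail)
```

As operations become stable, the recovery start position advances toward the tail, shrinking both the recovery scan and the range of segments that the cleaner cannot yet reuse.<br />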
Managing Free Space<br />
A traditional, update-in-place file system overwrites superseded data by writing to the same physical location on disk. If, for example, … parts of the B-tree that are needed to make the file system structures consistent. This means that we can never have dirty B-tree nodes in memory that cannot be flushed to the log.<br />
The server must carefully manage the amount of free space in the log. It must provide two guarantees:<br />
1. A write will be accepted by the server only if there is sufficient free space in the log to hold the data and rewrite the mapping B-tree to describe it. This guarantee must hold regardless of how much space the cleaner may subsequently reclaim.<br />
2. At the higher levels of the file system, if an I/O operation is accepted, even if that operation is stored in the write-behind cache, the data will be written to the log. This guarantee holds except in the event of a system failure.<br />
The server provides these guarantees using the same mechanism. As shown in Figure 13, the free space and the reserved space in the log are modeled using an escrow function.17 The total number of blocks that contain live, valid data is maintained<br />
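The escrow-style space accounting might be modeled as below. This is a sketch under stated assumptions, not Spiralog's implementation: the overhead ratio and all names are invented, and it only illustrates guarantee 1, that a write is rejected unless free space covers the data plus a worst-case mapping B-tree rewrite.<br />

```python
import math


class SpaceEscrow:
    """Toy free-space accounting for a log-structured store."""

    def __init__(self, total_blocks, btree_overhead=0.1):
        self.total = total_blocks
        self.reserved = 0   # blocks promised to accepted writes
        self.live = 0       # blocks holding live, valid data
        self.btree_overhead = btree_overhead  # assumed worst-case ratio

    def free(self):
        return self.total - self.live - self.reserved

    def _cost(self, data_blocks):
        # Worst case: the data itself plus the mapping B-tree rewrite
        # needed to describe it.
        return data_blocks + math.ceil(data_blocks * self.btree_overhead)

    def try_reserve(self, data_blocks):
        need = self._cost(data_blocks)
        if need > self.free():
            return False    # reject: accepting would break guarantee 1
        self.reserved += need
        return True

    def committed(self, data_blocks, blocks_written):
        # When the write reaches the log, convert the reservation into
        # live data (the actual cost may be less than the worst case).
        self.reserved -= self._cost(data_blocks)
        self.live += blocks_written
```

The point of the escrow model is that reservations are debited pessimistically up front, so a write is never accepted that could later find the log full.<br />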
Compression<br />
Adding file compression to an update-in-place file<br />
system presents a particular problem: what to do when<br />
a data item is overwritten with a new version that does<br />
not compress to the same size. Since all updates take<br />
place at the tail of the log, an LFS avoids this problem<br />
entirely. In addition, the amount of space consumed by a data item is determined by its size and is not influenced by the cluster size of the disk. For this reason, an LFS does not need to employ file compaction to make efficient use of large disks or RAID sets.19<br />
Future Improvements<br />
The existing implementation can be improved in a<br />
number of areas, many of which involve resource consumption.<br />
The B-tree mapping mechanism, although<br />
general and flexible, has high CPU overheads and<br />
requires complex recovery algorithms. The segment<br />
layout needs to be revisited to remove the need for serialized<br />
I/Os when committing write operations and thus further reduce the write latency.<br />
For the Spiralog file system version 1.0, we chose to<br />
keep complete information about live data and data that<br />
was no longer valid for every segment in the log. This<br />
mechanism allows us to reduce the overhead of the<br />
cleaner; however, it does so at the expense of memory<br />
and disk space and consequently does not scale well to<br />
multi-terabyte disks.<br />
A Final Word<br />
Log structuring is a relatively new and exciting technology.<br />
Building <strong>Digital</strong>'s first product using this<br />
technology has been both a considerable challenge and<br />
a great deal of fun. Our experience during the construction<br />
of the Spiralog product has led us to believe<br />
that LFS technology has an important role to play in<br />
the future of file systems and storage management.<br />
Acknowledgments<br />
We would like to take this opportunity to acknowledge<br />
the contributions of the many individuals who<br />
helped during the design of the Spiralog server. Alan<br />
Paxton was responsible for initial investigations into<br />
LFS technology and laid the foundation for our understanding.<br />
Mike Johnson made a significant contribution<br />
to the cleaner design and was a key member of the<br />
team that built the final server. We are very grateful to<br />
colleagues who reviewed the design at various stages,<br />
in particular, Bill Laing, Dave Thiel, Andy Goldstein,<br />
and Dave Lomet. Finally, we would like to thank Jim Johnson and Cathy Foley for their continued loyalty,<br />
enthusiasm, and direction during what has been a long<br />
and sometimes hard journey.<br />
References<br />
1. D. Gifford, R. Needham, and M. Schroeder, "The Cedar File System," Communications of the ACM, vol. 31, no. 3 (March 1988).<br />
2. S. Chutani, O. Anderson, M. Kazar, and B. Leverett, "The Episode File System," Proceedings of the Winter 1992 USENIX Technical Conference (January 1992).<br />
3. M. Rosenblum, "The Design and Implementation of a Log-Structured File System," Report No. UCB/CSD 92/696, University of California, Berkeley (June 1992).<br />
4. J. Ousterhout and F. Douglis, "Beating the I/O Bottleneck: The Case for Log-Structured File Systems," Operating Systems Review (January 1989).<br />
5. R. Green, A. Baird, and J. Davies, "Designing a Fast, On-line Backup System for a Log-structured File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 32-45.<br />
6. J. Ousterhout et al., "A Comparison of Logging and Clustering," Computer Science Department, University of California, Berkeley (March 1994).<br />
7. M. Seltzer, K. Bostic, M. McKusick, and C. Staelin, "An Implementation of a Log-Structured File System for UNIX," Proceedings of the Winter 1993 USENIX Technical Conference (January 1993).<br />
8. W. de Jonge, F. Kaashoek, and W.-C. Hsieh, "The Logical Disk: A New Approach to Improving File Systems," ACM SIGOPS '93 (December 1993).<br />
9. J. Gray and A. Reuter, Transaction Processing: Concepts and Techniques (San Mateo, Calif.: Morgan Kaufmann Publishers, 1993), ISBN 1-55860-190-2.<br />
10. A. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart, "The Echo Distributed File System," Digital Systems Research Center, Research Report 111 (September 1993).<br />
11. J. Johnson and W. Laing, "Overview of the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 5-14.<br />
12. A. Sweeney et al., "Scalability in the XFS File System," Proceedings of the Winter 1996 USENIX Technical Conference (January 1996).<br />
13. J. Kohl, "Highlight: Using a Log-structured File System for Tertiary Storage Management," USENIX Association Conference Proceedings (January 1993).<br />
14. M. Baker et al., "Measurements of a Distributed File System," Symposium on Operating System Principles (SOSP) 13 (October 1991).<br />
15. J. Ousterhout et al., "A Trace-driven Analysis of the UNIX 4.2 BSD File System," Symposium on Operating System Principles (SOSP) 10 (December 1985).<br />
16. D. Lomet and B. Salzberg, "Concurrency and Recovery for Index Trees," Digital Cambridge Research Laboratory, Technical Report (August 1991).<br />
17. P. O'Neil, "The Escrow Transactional Model," ACM Transactions on Database Systems, vol. 11 (December 1986).<br />
18. Volume Shadowing for OpenVMS AXP Version 6.1 (Maynard, Mass.: Digital Equipment Corp., 1994).<br />
19. M. Burrows et al., "On-line Data Compression in a Log-structured File System," Digital Systems Research Center, Research Report 85 (April 1992).<br />
Biographies<br />
Christopher Whitaker<br />
Chris Whitaker joined Digital in 1988 after receiving a B.Sc. Eng. (honours, 1st class) in computer science from the Imperial College of Science and Technology, University of London. He is a principal software engineer with the OpenVMS File System Development Group located near Edinburgh, Scotland. Chris was the team leader for the LFS server component of the Spiralog file system. Prior to this, Chris worked on the distributed transaction management services (DECdtm) for OpenVMS and the port of the OpenVMS record management services (RMS and RMS journaling) to Alpha.<br />
J. Stuart Bayley<br />
Stuart Bayley is a member of the OpenVMS File System Development Group, located near Edinburgh, Scotland. He joined Digital in 1990 and, prior to becoming a member of the Spiralog LFS server team, worked on OpenVMS DECdtm services and the OpenVMS XQP file system. Stuart graduated from King's College, University of London, with a B.Sc. (honours) in physics in 1986.<br />
Rod D. W. Widdowson<br />
Rod Widdowson received a B.Sc. (1984) and a Ph.D. (1987) in computer science from Edinburgh University. He joined Digital in 1990 and is a principal software engineer with the OpenVMS File System Development Group located near Edinburgh, Scotland. Rod worked on the implementation of LFS and cluster distribution components of the Spiralog file system. Prior to this, Rod worked on the port of the OpenVMS XQP file system to Alpha. Rod is a charter member of the British Computer Society.<br />
Designing a Fast,<br />
On-line Backup System<br />
for a Log-structured<br />
File System<br />
The Spiralog file system for the OpenVMS operating system incorporates a new technical approach to backing up data. The fast, low-impact backup can be used to create consistent copies of the file system while applications are actively modifying data. The Spiralog backup uses the log-structured file system to solve the backup problem. The physical on-disk structure allows data to be saved at near-maximum device throughput with little processing of data. The backup system achieves this level of performance without compromising functionality such as incremental backup or fast, selective restore.<br />
Russell J. Green<br />
Alasdair C. Baird<br />
J. Christopher Davies<br />
Most computer users want to be able to recover data lost through user error, software or media failure, or site disaster but are unwilling to devote system resources or downtime to make backup copies of the data. Furthermore, with the rapid growth in the use of data storage and the tendency to move systems toward complete utilization (i.e., 24-hour by 7-day operation), the practice of taking the system off line to back up data is no longer feasible.<br />
The Spiralog file system, an optional component of the OpenVMS Alpha operating system, incorporates a new approach to the backup process (called simply backup), resulting in a number of substantial customer benefits. By exploiting the features of log-structured storage, the backup system combines the advantages of two different traditional approaches to performing backup: the flexibility of file-based backup and the high performance of physically oriented backup.<br />
The design goal for the Spiralog backup system was to provide customers with a fast, application-consistent, on-line backup. In this paper, we explain the features of the Spiralog file system that helped achieve this goal and outline the design of the major backup functions, namely volume save, volume restore, file restore, and incremental management. We then present some performance results arrived at using Spiralog version 1.1. The paper concludes with a discussion of other design approaches and areas for future work.<br />
Background<br />
File system data may be lost for many reasons, including<br />
• User error-A user may mistakenly delete data.<br />
• Software failure-An application may execute incorrectly.<br />
• Media failure-The computing equipment may malfunction because of poor design, old age, etc.<br />
• Site disaster-Computing facilities may<br />
The ability to save backup copies of all or part of a file system's information in a form that allows it to be restored is essential to most customers who use computing resources. To understand the backup capability needed in the Spiralog file system, we spoke to a number of customers-five directly and several hundred through public forums. Each ran a different type of system in a distinct environment, ranging from research and development to finance on OpenVMS and other systems. Our survey revealed the following set of customer requirements for the Spiralog backup system:<br />
1. Backup copies of data must be consistent with respect to the applications that use the data.<br />
2. Data must be continuously available to applications. Downtime for the purpose of backup is unacceptable. A backup must copy all data of interest as it exists at an instant in time; however, the application should also be allowed to modify the data during the copying process. Performing backup in such a way as to satisfy these constraints is often called hot backup or on-line backup. Figure 1 illustrates how data inconsistency can occur during an on-line backup.<br />
3. The backup operations, particularly the save operation, must be fast. That is, copying data from the system or restoring data to the system must be accomplished in the time available.<br />
4. The backup system must allow an incremental backup operation, i.e., an operation that captures only the changes made to data since the last backup.<br />
The Spiralog backup team set out to design and implement a backup system that would meet the four customer requirements. The following section discusses the features of the implementation of a log-structured file system (LFS) that allowed us to use a new approach to performing backup. Note that throughout this paper we use disk to describe the<br />
Figure 1<br />
Example of an On-line Backup That Results in Inconsistent Data<br />
[Figure: The initial file contains two blocks. Backup starts and copies the first block. The application then rewrites the file. Backup proceeds and copies the second block. The resulting backup copy is corrupt because the first block is inconsistent with the latest rewritten file.]<br />
physical media used to store data and volume to describe the abstraction of the disk as presented by the Spiralog file system.<br />
Spiralog Features<br />
The Spiralog file system is an implementation of a log-structured file system. An LFS is characterized by the use of disk storage as a sequential, never-ending repository of data. We generally refer to this organization of data as a log. Johnson and Laing describe in detail the design of the Spiralog implementation of an LFS and how files are maintained in this implementation.1<br />
Some features unique to a log-structured file system are of particular interest in the design of a backup system.2-4 These features are<br />
• Segments, where a segment is the fundamental unit of storage<br />
• The no-overwrite nature of the system<br />
• The temporal ordering of on-disk data structures<br />
• The means by which files are constructed<br />
This section of the paper discusses the relevance of these features; a later section explains how these features are exploited in the backup design.<br />
Segments<br />
In this paper, the term segment refers to a logical entity that is uniquely identified and never overwritten. This definition is distinct from the physical storage of a segment. The only physical feature of interest to backup with regard to segments is that they are efficient to read in their entirety.<br />
Using log-structured storage in a file system allows efficient writing irrespective of the write patterns or load to the file system. All write operations are grouped in segment-sized chunks. The segment size is chosen to be sufficiently large that the time required to read or write the segment is significantly greater than the time required to access the segment, i.e., the time required for a head seek and rotational delay on a magnetic disk. All data (except the LFS homeblock and checkpoint information used to locate the end of the data log) is stored in segments, and all segments are known to the file system. From a backup point of view, this means that the entire contents of a volume can be copied by reading the segments. The segments are large enough to allow efficient reading, resulting in a near-maximum transfer rate of the device.<br />
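The amortization argument above can be made concrete with a back-of-envelope calculation. The seek time and transfer rate below are invented round numbers, not measurements of any particular device; the point is only that a larger segment spreads the fixed access cost over more bytes.<br />

```python
def effective_rate(segment_bytes, access_s=0.015, transfer_bps=5e6):
    """Achieved bytes/second, charging one fixed access per segment.

    access_s models head seek plus rotational delay; transfer_bps is
    the raw sequential transfer rate. Both are illustrative values.
    """
    return segment_bytes / (access_s + segment_bytes / transfer_bps)


small = effective_rate(8 * 1024)     # a small transfer is seek-dominated
large = effective_rate(256 * 1024)   # a large segment amortizes the seek
```

With these assumed numbers, the small transfer achieves only a fraction of the raw rate, while the large segment approaches it, which is why segment-sized I/O can run near the device maximum.<br />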
No Overwrite<br />
In a log-structured file system, in which the segments are never overwritten, all data is written to new, empty segments. Each new segment is given a segment identifier (segid) allocated in a monotonically increasing manner. At any point in time, the entire contents and state of a volume can be described in terms of a (checkpoint position, segment list) pair. At the physical level, a volume consists of a list of segments and a position within a segment that defines the end of the log.<br />
Rosenblum describes the concept of time travel, where an old state of the file system can be revisited by creating and maintaining a snapshot of the file system for future access.3 Allowing time travel in this way requires maintaining an old checkpoint and disabling the reuse of disk space by the cleaner. The cleaner is a mechanism used to reclaim disk space occupied by obsolete data in a log, i.e., disk space no longer referenced in the file system. The contents of a snapshot are independent of operations undertaken on the live version of the file system. Modifying or deleting a file affects only the live version of the file system (see Figure 2). Because of the no-overwrite nature of the LFS, previously written data remains unchanged.<br />
Other mechanisms specific to a particular backup algorithm have been developed to achieve on-line consistency. The snapshot model as described above allows a more general solution with respect to multiple concurrent backups and the choice of the save algorithm.<br />
A read-only version of the file system at an instant in time is precisely what is required for application consistency in on-line backup. This snapshot approach to attaining consistency in on-line backup has been used in other systems. As explained in the following sections, the Spiralog file system combines the snapshot technique with features of log-structured storage to obtain both on-line backup consistency and performance benefits for backup.<br />
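The snapshot idea can be shown with a toy model, assuming segments are a simple segid-to-data map: a snapshot freezes the current segment list, and later modifications only append new segments with higher segids, so the data a snapshot refers to is never disturbed. All structure here is invented for the sketch.<br />

```python
class LogVolume:
    """Toy no-overwrite log: segments are written once, never reused."""

    def __init__(self):
        self.segments = {}   # segid -> segment data
        self.next_segid = 0

    def write_segment(self, data):
        segid = self.next_segid   # segids increase monotonically
        self.next_segid += 1
        self.segments[segid] = data
        return segid

    def snapshot(self):
        # A snapshot is just the current list of segment ids; while it
        # exists, the cleaner must not reclaim these segments.
        return sorted(self.segments)


vol = LogVolume()
vol.write_segment("file v1")
snap = vol.snapshot()
vol.write_segment("file v2")  # "modifying" the file appends a new segment
```

After the second write, the snapshot still names only segment 0, whose contents are untouched; the live file system simply sees the newer segment as well.<br />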
Temporal Ordering<br />
As mentioned earlier, all data, i.e., user data and file system metadata (data that describes the user data in the file system), is stored in segments and there is no overwrite of segments. All on-disk data structures that refer to physical placement of data use pointers, namely (segid, offset) pairs, to describe the location of the data. Each (segid, offset) pair specifies the segment and where within that segment the data is stored. Together, these imply the following two properties of data structures, which are key features of an LFS:<br />
Figure 2<br />
Data Accessible to the Snapshot and to the Live File System<br />
[Figure: Older data is visible only to the snapshot, some data is shared by the snapshot and the live file system, and new live data written since the snapshot was taken belongs only to the live file system.]<br />
1. On-disk structure pointers, namely (segid, offset) pairs, are relatively time ordered. Specifically, data stored at (s2, o2) was written more recently than data stored at (s1, o1) if and only if s2 is greater than s1, or s2 equals s1 and o2 is greater than o1. Thus, new data would appear to the right in the data structure depicted in Figure 3.<br />
2. Any data structure that uses on-disk pointers stored within the segments (the mapping data structure implementing the LFS index) must be time ordered; that is, all pointers must refer to data written prior to the pointer. Referring again to Figure 3, only data structures that point to the left are valid.<br />
These properties of on-disk data structures are of interest when designing backup systems. Such data structures can be traversed so that segments are read in reverse time order. To understand this concept, consider the root of some on-disk data structure. This root must have been written after any of the data to which it refers (property 2). A data item that the root references must have been written before the root and so must have been stored in a segment with a segid less than or equal to that of the segment in which the root is stored (property 1). A similar inductive argument can be used to show that<br />
data contained in a file defined by a file identifier can be recovered, independent of how the file was created, without any knowledge of the file system structure. Consequently, together with the temporal ordering of data in an LFS, files can be recovered using an ordered linear scan of the segments of a volume, provided the on-disk data structures are traversed correctly. This mechanism allows efficient file restore from a sequence of segments. In particular, a set of files can be restored in a single pass of a saved volume stored on tape.<br />
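The two temporal-ordering properties can be expressed directly in code. The tuple representation is an assumption made for the sketch; the functions simply encode the comparison rule stated in property 1 and the pointer-validity rule of property 2.<br />

```python
def written_after(a, b):
    """Property 1: True if location a = (segid, offset) was written
    more recently than location b."""
    (s1, o1), (s2, o2) = b, a
    return s2 > s1 or (s2 == s1 and o2 > o1)


def pointer_is_valid(pointer_location, target_location):
    """Property 2: a valid on-disk pointer may only refer to data
    written before the pointer itself."""
    return written_after(pointer_location, target_location)
```

Because every valid pointer refers backward in the log, a traversal that starts from a root and follows pointers visits segments in non-increasing segid order, which is exactly the reverse-time-order scan the text describes.<br />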
Existing Approaches to Backup<br />
The design of the Spiralog backup attempts to combine the advantages of file-based backup tools such as Files-11 backup, UNIX tar, and Windows NT backup, and physical backup tools such as UNIX dd, Files-11 backup/PHYSICAL, and HSC backup (a controller-based backup for OpenVMS volumes).<br />
File-based Backup<br />
A file-based backup system has two main advantages: (1) the system can explicitly name files to be saved, and (2) the system can restore individual files. In this paper, the file or structure that contains the output data of a backup save operation is called a saveset. Individual file restore is achieved by scanning a saveset for the file and then recreating the file using the saved contents. Incremental file-based backup usually entails keeping a record of when the last backup was made (either on a per-file basis or on a per-volume basis) and copying only those files and directories that have been created or modified since a previous backup time.<br />
The penalty associated with these features of a file-based backup system is that of save performance. In effect, the backup system performs a considerable amount of work to lay out data in the saveset to allow simple restore. All files are segregated to a much greater extent than they are in the file system on-disk structure. The limiting factor in the performance of a file-based save operation is the rate at which data can be read from the source disk. Although there are some ways to improve performance, in the case of a volume that has a large number of files, read performance is always costly. Figure 4 illustrates the layouts of three different types of savesets.<br />
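The incremental selection rule for a file-based backup amounts to a timestamp filter; the file representation below is invented for illustration and stands in for whatever per-file or per-volume record a real tool keeps.<br />

```python
def incremental_candidates(files, last_backup_time):
    """Select files created or modified since the previous backup.

    files: iterable of (path, modified_time) pairs; last_backup_time
    is the recorded time of the previous backup.
    """
    return [path for path, mtime in files if mtime > last_backup_time]
```

A full backup is just the same selection with last_backup_time set before every file's modification time, so everything qualifies.<br />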
Physical Backup<br />
In contrast to the file-based approach to backup, a physical backup system copies the actual blocks of data on the source disk to a saveset. The backup system is able to read the disk optimally, which allows an implementation to achieve data throughput near the disk's maximum transfer rate. Physical backups typically allow neither individual file restore nor incremental<br />
Figure 4<br />
Layouts of Three Different Types of Saveset<br />
[Figure: In a physical backup saveset, blocks are laid out contiguously on tape; file restore is not possible without random access. In a file backup saveset, files are laid out contiguously on tape; to create this sort of saveset, files need to be read individually from disk, which generally means suboptimal disk access. In a Spiralog backup saveset, a directory (DIR) and segment table (SEGT) allow file restore from segments; segments are large enough to allow near-optimal disk access.]<br />
backup. The overhead required to include sufficient information for these features usually erodes the performance benefits offered by the physical copy. In addition, a physical backup usually requires that the entire volume be saved regardless of how much of the volume is used to store data.<br />
How Spiralog Backup Exploits the LFS<br />
Spiralog backup uses the snapshot mechanism to achieve on-line consistency for backup. This section describes how Spiralog attains high-performance backup with respect to the various save and restore operations.<br />
<strong>Volume</strong> Save Operation<br />
The save operation of Spiralog creates a snapshot and then physically copies it to a tape or disk structure called a savesnap. (This term is chosen to be different from saveset to emphasize that it holds a consistent snapshot of the data.) This physical copy operation allows high-performance data transfer with minimal processing.10 In addition, the temporal ordering of data stored by Spiralog means that this physical copy operation can also be an incremental operation.<br />
The savesnap is a file that contains, among other information, a list of segments exactly as they exist in the log. The structure of the savesnap allows the efficient implementation of volume restore and file restore (see Figure 5 and Figure 6).<br />
The steps of a full save operation are as follows:<br />
1. Create a snapshot and mount it. This mounted snapshot looks like a separate, read-only file system. Read information about the snapshot.<br />
Figure 5<br />
Savesnap Structure<br />
[Figure: A savesnap holds directory information, a physical savesnap record (fixed size for the entire savesnap), metadata segments in decreasing segid order, and zero padding, written in the direction the log is written.]<br />
Figure 6<br />
Correspondence between Segments on Disk and in the Savesnap<br />
[Figure: Used segments from the log (e.g., segids 105, 104, 102, 101) are written to the savesnap in the direction the tape is written; unused segments are skipped.]<br />
2. Write the header to the savesnap, including snapshot information<br />
disk from which the savesnap was made, providing the target container is large enough to hold the restored segments.<br />
4. Backup declares the volume restore as complete (no more segments will be written to the volume). Backup tells the file system how to mount the volume by supplying the snapshot checkpoint location.<br />
A Spiralog restore operation treats an incremental savesnap and all the preceding savesnaps upon which it depends as a single savesnap. For savesnaps other than the most recent savesnap (the base savesnap), the snapshot information and directory information are ignored. The sole purpose of these savesnaps is to provide segments to the base savesnap.<br />
To restore a volume from a set of incremental savesnaps, the Spiralog backup system performs steps 1 and 2 using the base savesnap. In step 3, the restore copies all the segments in the snapshot defined by the base savesnap to the target disk. (Note that there is a one-to-one correspondence between snapshots and savesnaps.) The savesnaps are processed in reverse chronological order. The contents of the segment table in the base savesnap define the list of segments in the snapshot to be restored. Although the volume restore operation copies all the segments in the base savesnap, not all segments in the savesnaps are copied. Volume restore provides high-performance recovery from volume failure or site disaster.<br />
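This incremental restore procedure can be sketched as a minimal model. The Savesnap structure, its field names, and the helper function below are illustrative assumptions, not Spiralog's actual on-disk formats:<br />

```python
from dataclasses import dataclass, field

# Hypothetical model of an incremental volume restore. The base (most
# recent) savesnap carries the segment table; older savesnaps only
# supply segments.

@dataclass
class Savesnap:
    segment_table: list                           # segids in the snapshot
    segments: dict = field(default_factory=dict)  # segid -> segment bytes

def restore_volume(base, older_savesnaps):
    """Restore the snapshot defined by the base savesnap.

    Savesnaps are processed in reverse chronological order; each one
    contributes only those listed segments not already restored, so
    segments present in an older savesnap but absent from the base
    segment table are never copied.
    """
    needed = set(base.segment_table)
    restored = {}
    for snap in [base] + older_savesnaps:  # newest first
        for segid, data in snap.segments.items():
            if segid in needed and segid not in restored:
                restored[segid] = data
    missing = needed - restored.keys()
    if missing:
        raise ValueError(f"segments missing from savesnap set: {missing}")
    return restored
```

In this model a segment rewritten after an older savesnap was taken is restored from the newer savesnap that holds it, which mirrors the reverse-chronological processing described above.<br />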
File Restore Operation<br />
The purpose of a file restore operation is to provide a fast and efficient way to retrieve a small number of files from a savesnap without performing a full volume restore. Typically, file restore is used to recover files that have been inadvertently deleted. To achieve high-performance file restore, we imposed the following requirements on the design:<br />
• A file restore session must process as few savesnaps as possible; it should skip savesnaps that do not contain data needed by the session.<br />
• When processing a savesnap, the file restore must scan the savesnap linearly, in a single pass.<br />
The file restore interface returns the mapping information relevant to a particular file identifier that is held within<br />
a given segment. The mapping information returned<br />
through this interface describes either mapping information held elsewhere or real file data. One characteristic of the log is that anything to which such mapping information points must occur earlier in the log, that is, in a subsequent saved segment. Recall property 2 of the LFS on-disk data structures. Consequently, the file<br />
restore session will progress through the savesnaps in<br />
the desired linear fashion provided that requests are<br />
presented to the interface in the correct order. The<br />
correct order is determined by the allocation of segids.<br />
Since segids increase monotonically over time, it is<br />
necessary only to ensure that requests are presented in<br />
a decreasing segid order.<br />
The file restore interface operates on an object called a context. The context is a tuple that contains a location in the log, namely (segid, offset), and a type field. When supplied with a file identifier and a context, the core function of the interface inspects the segment determined by the context and returns the set of contexts that enumerate all available mapping information for the file identifier held at the location given by the initial context.<br />
The type of context returned indicates one of the<br />
following situations:<br />
Figure 8<br />
File Restore Session in Progress<br />
• The location contains real file data.<br />
• The location given by the context holds more mapping information. In this case, the core function can be applied repeatedly to determine the precise location of the file's data.<br />
A work list of contexts in decreasing segid order drives the file restore process. The procedure for retrieving the data for a single file identifier is as follows. At the outset of the file restore operation, the work list holds a single context that identifies the root of the map (the tail of the log). As items are taken from the head of the list, the file restore must perform one of two actions. If the context is a pointer to real file data, then the file restore reads the data at that location. If the context holds the location of mapping information, then the core function must be applied to enumerate all possible further mapping information held there. The file restore operation places all returned contexts in the work list in the correct order prior to picking the next work item. This simple procedure, which is illustrated in Figure 8, continues until the work list is empty and all the file's data has been read.<br />
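The work-list procedure can be sketched with a priority queue keyed on decreasing segid. This is an illustrative model only; the context tuple (segid, offset, type) and the two callback functions are assumptions based on the description in the text, not the actual Spiralog interface:<br />

```python
import heapq

DATA, MAP = "data", "map"  # assumed context types

def restore_file(root_context, enumerate_mappings, read_data):
    """Restore one file by processing contexts in decreasing segid order.

    enumerate_mappings(ctx) returns the further contexts held at a map
    location; read_data(ctx) returns the file data stored at a data
    location. Both stand in for the savesnap reading interface.
    """
    # heapq is a min-heap, so key each entry on the negated segid
    work = [(-root_context[0], root_context)]
    pieces = []
    while work:
        _, ctx = heapq.heappop(work)
        if ctx[2] == DATA:
            pieces.append(read_data(ctx))        # real file data: read it
        else:
            for nxt in enumerate_mappings(ctx):  # mapping info: expand it
                heapq.heappush(work, (-nxt[0], nxt))
    return pieces
```

Because contexts always point earlier in the log (lower segids), every context pushed onto the heap sorts after the item that produced it, so the traversal visits segments in a single decreasing-segid pass.<br />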
To cope with more than one file, the file restore operation extends this procedure by converting the work list so that it associates a particular file identifier<br />
The shaded areas represent the file data to be restored and the file system metadata that needs to be accessed to retrieve that data. The restore session has thus far processed segment 478. Part A of the file has been recovered into the target file system. Parts B and C are still to come. After processing segment 478, the file restore visits the next known parts of the log, segments 69 and 59. Items that describe metadata in segment 69 and data in segment 59 will be on the work list. The next segment that the file restore will read is segment 69, so the session can skip the intervening segment (segment 195).<br />
with each context. File restore initializes the work list to hold a pointer to the root of the map (the tail of the log) for each file identifier to be restored. The effect is to interleave requests to read more than one file while maintaining the correct segid ordering.<br />
A further subtlety occurs when the context at the head of the work list is found to refer to a segment outside the current savesnap. The ordering imposed on the work list implies that all subsequent items of work must also be outside the current savesnap. This follows from the temporal ordering properties of LFS on-disk structures and the way in which incremental savesnaps are defined. When this situation occurs, the work list is saved. When the next savesnap is ready for processing, the file restore session can be restarted using the saved work list as the starting point.<br />
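The savesnap-boundary handling might look like the following sketch, where each work item also carries its file identifier. The entry layout and the expansion helper are assumptions for illustration:<br />

```python
import heapq

def process_savesnap(work, segids_in_savesnap, expand):
    """Process work items whose segid lies inside the current savesnap.

    work is a heap of (-segid, file_id, context) entries. Once the head
    of the list refers to a segment outside this savesnap, the temporal
    ordering guarantees that every later item does too, so the remaining
    heap is returned unchanged, to be resumed against the next savesnap.
    """
    processed = []
    while work:
        neg_segid, file_id, ctx = work[0]
        if -neg_segid not in segids_in_savesnap:
            break                            # save the work list for later
        heapq.heappop(work)
        for nxt in expand(file_id, ctx):     # may queue items for any file
            heapq.heappush(work, (-nxt[0], file_id, nxt))
        processed.append((file_id, ctx))
    return processed, work
```

Interleaving several files falls out naturally: items for all files share one heap, and the decreasing-segid key keeps the single linear pass over each savesnap.<br />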
During this step, the file restore writes the pieces of files to the target volume as they are read from the savesnap. Since the file restore process allocates file identifiers on a per-volume basis, restore must allocate new file identifiers in the target volume.<br />
Figure 9<br />
Merging Savesnaps<br />
BACKUPS: Monday, full; Wednesday, incremental; Friday, incremental<br />
Table 1<br />
Performance Comparison of the Spiralog and Files-11 Backup Save Operations<br />
Backup System: Elapsed Time (Minutes:seconds)<br />
Spiralog save: 05:20<br />
Files-11 backup: 10:14<br />
phases of the save operation. During the directory<br />
scan phase (typically up to 20 percent of the total<br />
elapsed save time), the only tape output is a compact<br />
representation of the volume directory graph. In comparison,<br />
the segment writing phase is usually bound by<br />
the tape throughput rate. In this configuration, the<br />
tape is the throughput bottleneck; its maximum raw<br />
data throughput is 1.25 MB/s (uncompressed).11<br />
Overall, the Spiralog volume save operation is nearly twice as fast as the Files-11 backup volume save operation in this type of computing environment. Note that the Spiralog savesnap is larger than the corresponding Files-11 saveset. The Spiralog savesnap is less efficient at holding user data than the packed per-file representation of the Files-11 saveset. In many cases, though, the higher performance of the Spiralog save operation more than outweighs this inefficiency, particularly when it is taken into account that the Spiralog save operation can be performed on-line.<br />
File Restore Performance<br />
To determine file restore performance, we measured how long it took to restore a single file from the savesets created in the save benchmark tests. The hardware and software configurations were identical to those used for the save measurements. We deleted a single 3-kilobyte (KB) file from the source volume and then restored the file. We repeated this operation nine times, each time measuring the time it took to restore the file. Table 2 shows the results.<br />
Merge three savesets to produce one<br />
new savesnap equivalent to a full<br />
savesnap taken on Friday.<br />
Table 1 (continued): Savesnap or Saveset Size (Megabytes); Throughput (Megabytes/second)<br />
Spiralog save: 339 MB; 1.05 MB/s<br />
Files-11 backup: 297 MB; 0.48 MB/s<br />
Table 2<br />
Performance Comparison of the Spiralog and Files-11 Individual File Restore Operations<br />
Backup System: Elapsed Time (Minutes:seconds)<br />
Spiralog file restore: 01:06<br />
Files-11 backup: 03:35<br />
The Spiralog backup system achieves such good performance for file restore by using its knowledge of the way the segments are laid out on tape. The file restore process needs to read only those segments required to restore the file; the restore skips the intervening segments using tape skip commands. In the example presented in Figure 8, the restore can skip segments 555 and 195. In contrast, a file-based backup such as Files-11 usually does not have accurate indexing information to minimize tape I/O. Spiralog's tape-skipping benefit is particularly noticeable when restoring small numbers of files from very large savesnaps; however, as shown in Table 2, even with small savesets, individual file restore using Spiralog backup is three times as fast as using Files-11.<br />
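The tape-skipping decision can be sketched as follows, using the segment numbers from Figure 8. The command representation is a simplification for illustration; real tape positioning commands differ:<br />

```python
def plan_tape_io(tape_segids, needed):
    """Return (action, value) commands for one linear pass over the tape,
    reading needed segments and merging runs of unneeded segments into
    single skip commands."""
    plan, to_skip = [], 0
    for segid in tape_segids:
        if segid in needed:
            if to_skip:
                plan.append(("skip", to_skip))
                to_skip = 0
            plan.append(("read", segid))
        else:
            to_skip += 1  # trailing unneeded segments never get a command
    return plan
```

Because the restore session knows in advance which segids it needs and in what order they sit on the tape, the plan can be computed before the pass begins.<br />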
Table 3 presents a comparison of the save performance and features of the Spiralog, file-based, and physical backup systems.<br />
Table 3<br />
Comparison of Spiralog, File-based, and Physical Backup Systems<br />
Save performance (the number of I/Os required to save the source volume):<br />
• Spiralog backup system: the number of I/Os is O(number of segments that contain live data) plus O(number of directories)<br />
• File-based backup system: the number of I/Os is O(number of files) I/Os to read the file data plus O(number of directories) I/Os<br />
• Physical backup system: the number of I/Os is O(size of the disk)<br />
File restore:<br />
• Spiralog backup system: yes<br />
• File-based backup system: yes<br />
• Physical backup system: no<br />
Volume restore:<br />
• Spiralog backup system: yes, fast<br />
• File-based backup system: yes<br />
• Physical backup system: yes, fast but limited to disks of the same size<br />
Incremental save:<br />
• Spiralog backup system: yes, physical<br />
• File-based backup system: yes, entire files that have changed<br />
• Physical backup system: no<br />
Note that this table uses "big oh" notation to bound a value. O(n), which is pronounced "order of n," means that the value represented is no greater than Cn for some constant C, regardless of the value of n. Informally, this means that O(n) can be thought of as some constant multiple of n.<br />
Other Approaches and Future Work<br />
This section outlines some other design options we considered for the Spiralog backup system. Our approach offers further possibilities in a number of areas. We describe some of the opportunities available.<br />
Backup and the Cleaner<br />
The benefits of the write performance gains in an LFS are attained at the cost of having to clean segments. An opportunity appears to exist in combining the cleaner and backup functions to reduce the amount of work done by either or both of these components; however, the aims of backup and the cleaner are quite different. Backup needs to read all segments written since a specific time (in the case of a full backup, since the birth of the volume). The cleaner needs to defragment the free space on the volume. This is done most efficiently by relocating data held in certain segments. These segments are those that are sufficiently empty to be worth scavenging for free space. The data in these segments should also be stable in the sense that the data is unlikely to be deleted or outdated immediately after relocation.<br />
The only real benefit that can be exacted by looking at these functions together is to clean some segments while performing backup. For example, once a segment has been read to copy to a savesnap, it can be cleaned. This approach is probably not a good one because it reduces system performance in the following ways: additional processing required in cleaning removes CPU and memory resources available to applications, and the cleaner generates write operations that reduce the backup read rate.<br />
There are two other areas in which backup and the cleaner mechanism interact that warrant further investigation.<br />
1. The save operation copies segments in their entirety. That is, the operation copies both "stale" (old) data and live data to a savesnap. The cost of extra storage media for this extraneous data is traded off against the performance penalty in trying to copy only live data. It appears that the file system should run the cleaner vigorously prior to a backup to minimize the stale data copied.<br />
2. Incremental savesnaps contain cleaned data. This means that an incremental savesnap contains a copy of data that already exists in one of the savesnaps on which it depends. This is an apparent waste of effort and storage space.<br />
It is best to undertake a full backup after a thorough cleaning of the volume. A single strategy for incremental backups is less easy to define. On one hand, the size of an incremental backup is increased if much cleaning is performed before the backup. On the other hand, restore operations from a large incremental backup (particularly selective file restores) are likely to be more efficient. The larger the incremental backup, the more data it contains. Consequently, the chance of restoring a single file from just the base savesnap increases with the size of the incremental backup. Studying the interactions between the backup and the cleaner may offer some insight into how to improve either or both of these components.<br />
A continuous backup system can take copies of segments from disk using policies similar to the cleaner. This is explored in Kohl's paper.12<br />
Separating the Backup Save Operation into a<br />
Snapshot and a Copy<br />
The design of the save operation involves the creation of a snapshot followed by the fast copy of the snapshot to some separate storage. The Spiralog version 1.1 implementation of the save operation combines these steps. A snapshot can exist only during a backup save operation.<br />
System administrators and applications have significantly more flexibility if the split in these two functions of backup is visible. The ability to create snapshots that can be mounted to look like read-only versions of a file system may eliminate the need for the large number of backups performed today. Indeed, some file systems offer this feature.6,7 The additional advantage that Spiralog offers is to allow the very efficient copying of individual snapshots to offline media.<br />
Improving the Consistency and Availability<br />
of On-line Backup<br />
There are a number of ways to improve application consistency<br />
data indexing. The Spiralog backup implementation provides an on-line backup save operation with significantly improved performance over existing offerings. Performance of individual file restore is also improved.<br />
Acknowledgments<br />
We would like to thank the following people whose<br />
efforts were vital in bringing the Spiralog backup system<br />
to fruition: Nancy Phan, who helped us develop<br />
the product and worked relentlessly to get it right;<br />
Judy Parsons, who helped us clarify, describe, and document our work; Clare Wells, who helped us focus on the real customer problems; Alan Paxton, who was involved in the early design ideas and later specification of some of the implementation; and, finally,<br />
Cathy Foley, our engineering manager, who supported<br />
us throughout the project.<br />
References<br />
1. J. Johnson and W. Laing, "Overview of the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 5-14.<br />
2. M. Rosenblum and J. Ousterhout, "The Design and Implementation of a Log-Structured File System," ACM Transactions on Computer Systems, vol. 10, no. 1 (February 1992): 26-52.<br />
3. M. Rosenblum, "The Design and Implementation of a Log-Structured File System," Report No. UCB/CSD 92/696 (Berkeley, Calif.: University of California, Berkeley, 1992).<br />
4. M. Seltzer, K. Bostic, M. McKusick, and C. Staelin, "An Implementation of a Log-Structured File System for UNIX," Proceedings of the USENIX Winter 1993 Technical Conference, San Diego, Calif. (January 1993).<br />
5. K. Walls, "File Backup System for Producing a Backup Copy of a File Which May Be Updated during Backup," U.S. Patent No. 5,163,148.<br />
6. D. Hitz, J. Lau, and M. Malcolm, "File System Design for an NFS File Server Appliance," Proceedings of the USENIX Winter 1994 Technical Conference, San Francisco, Calif. (January 1994).<br />
7. S. Chutani, O. Anderson, M. Kazar, and B. Leverett, "The Episode File System," Proceedings of the USENIX Winter 1992 Technical Conference, San Francisco, Calif. (January 1992).<br />
8. C. Whitaker, J. Bayley, and R. Widdowson, "Design of the Server for the Spiralog File System," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 15-31.<br />
9. OpenVMS System Management Utilities Reference Manual A-L, Order No. M-PVSPC-TK (Maynard, Mass.: Digital Equipment Corporation, 1995).<br />
Alasdair C. Baird<br />
Alasdair Baird joined Digital in 1988 to work for the ULTRIX Engineering group in Reading, U.K. He is a senior software engineer and has been a member of Digital's OpenVMS Engineering group since 1991. He worked on the design of the Spiralog file system and then contributed to the Spiralog backup system, particularly the file restore component. Currently, he is involved in Spiralog<br />
Integrating the Spiralog<br />
File System into the<br />
OpenVMS Operating<br />
System<br />
Digital's Spiralog file system is a log-structured file system that makes extensive use of write-back caching. Its technology is substantially different from that of the traditional OpenVMS file system, known as Files-11. The integration of the Spiralog file system into the OpenVMS environment had to ensure that existing applications ran unchanged and at the same time had to expose the benefits of the new file system. Application compatibility was attained through an emulation of the existing Files-11 file system interface. The Spiralog file system provides an ordered write-behind cache that allows applications to control write order through the barrier primitive. This form of caching gives the benefits of write-back caching and protects data integrity.<br />
Mark A. Howell<br />
Julian M. Palmer<br />
The Spiralog file system is based on a log-structuring method that offers fast writes and a fast, on-line backup capability. The integration of the Spiralog file system into the OpenVMS operating system presented many challenges. Its programming interface and its extensive use of write-back caching were substantially different from those of the existing OpenVMS file system, known as Files-11.<br />
To encourage use of the Spiralog file system, we had to ensure that existing applications ran unchanged in the OpenVMS environment. A file system emulation layer provided the necessary compatibility by mapping the Files-11 file system interface onto the Spiralog file system. Before we could build the emulation layer, we needed to understand how these applications used the file system interface. The approach taken to understanding application requirements led to a file system emulation layer that exceeded the original compatibility expectations.<br />
The first part of this paper deals with the approach to integrating a new file system into the OpenVMS environment and preserving application compatibility. It describes the various levels at which the file system could have been integrated and the decision to emulate the low-level file system interface. Techniques such as tracing, source code scanning, and functional analysis of the Files-11 file system helped determine which features should be supported by the emulation.<br />
The Spiralog file system uses extensive write-back caching to gain performance over the write-through cache on the Files-11 file system. Applications have relied on the ordering of writes implied by write-through caching to maintain on-disk consistency in the event of system failures. The lack of ordering guarantees prevented the implementation of such careful write policies in write-back environments. The Spiralog file system uses a write-behind cache (introduced in the Echo file system) to allow applications to take advantage of write-back caching performance while preserving careful write policies. This feature is unique in a commercial file system. The second part of this paper describes the difficulties of integrating write-back caching into a write-through environment and how a write-behind cache addressed these problems.<br />
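The ordered write-behind idea can be illustrated with a toy model. The class and method names below are assumptions for the sketch; the real cache lives inside the file system and flushes asynchronously:<br />

```python
class WriteBehindCache:
    """Toy model of an ordered write-behind cache with a barrier primitive.

    Writes are buffered and may be made stable lazily, but a write issued
    after a barrier is never made stable before the writes issued ahead
    of that barrier.
    """
    def __init__(self, disk):
        self.disk = disk          # dict: block -> data (stands in for stable storage)
        self.groups = [[]]        # pending writes, partitioned by barriers

    def write(self, block, data):
        self.groups[-1].append((block, data))

    def barrier(self):
        # Start a new ordering group: later writes cannot be flushed
        # until every earlier group has reached the disk.
        if self.groups[-1]:
            self.groups.append([])

    def flush(self, n_groups=None):
        # Flush whole barrier-delimited groups, oldest first; a partial
        # flush may cover only a prefix of the pending groups.
        n = len(self.groups) if n_groups is None else n_groups
        for group in self.groups[:n]:
            for block, data in group:
                self.disk[block] = data
        del self.groups[:n]
        if not self.groups:
            self.groups = [[]]
```

Crash consistency follows because flushes always cover whole barrier-delimited groups in order, so the disk never holds a later write without the earlier writes it depends on.<br />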
Providing a Compatible File System Interface<br />
Application compatibility can be described in two ways: compatibility at the file system interface and compatibility of the on-disk structure. Since only specialized applications use knowledge of the on-disk structure and maintaining compatibility at the interface level is a feature of the OpenVMS system, the Spiralog file system preserves compatibility at the file system interface level only. In the section Files-11 and the Spiralog File System On-disk Structures, we give an overview of the major on-disk differences between the two file systems.<br />
The level of interface compatibility would have a large impact on how well users adopted the Spiralog file system. If data and applications could be moved to a Spiralog volume and run unchanged, the file system would be better accepted. The goal for the Spiralog file system was to achieve 100 percent interface compatibility for the majority of existing applications. The implementation of a log-structured file system, however, meant that certain features and operations of the Files-11 file system could not be supported.<br />
The OpenVMS operating system provides a number of file system interfaces that are called by applications. This section describes how we chose the most compatible file system interface. The OpenVMS operating system directly supports a system-level call interface (QIO) to the file system, which is an extremely complex interface. The QIO interface is very specific to the OpenVMS system and is difficult to map directly onto a modern file system interface. This interface is used infrequently by applications but is used extensively by OpenVMS utilities.<br />
OpenVMS File System Environment<br />
This section gives an overview of the general<br />
OpenVMS file system environment, and the existing<br />
Figure 1<br />
OpenVMS and the new Spiralog file system interfaces. To emulate the Files-11 file system, it was important to understand the way it is used by applications in the OpenVMS environment. A brief description of the Files-11 and the Spiralog file system interfaces gives an indication of the problems in mapping one interface onto the other. These problems are discussed later in the section Compatibility Problems.<br />
In the OpenVMS environment, applications interact with the file system through various interfaces, ranging from high-level language interfaces to direct file system calls. Figure 1 shows the organization of interfaces within the OpenVMS environment, including both the Spiralog and the Files-11 file systems. The following briefly describes the levels of interface to the file system.<br />
• High-level language (HLL) libraries. HLL libraries provide file system functions for high-level languages such as the Standard C library and FORTRAN I/O functions.<br />
• OpenVMS language-specific libraries. These libraries offer OpenVMS-specific file system functions at a high level. For example, lib$create_dir() creates a new directory with specific OpenVMS security attributes such as ownership.<br />
• Record Management Services. The OpenVMS Record Management Services (RMS) are a set of complex routines that form part of the OpenVMS kernel. These routines<br />
• Files-11 file system interface. The OpenVMS operating system has traditionally provided the Files-11 file system for applications. It provides a low-level file system interface so that applications can request file system operations from the kernel. Each file system call can be composed of multiple subcalls. These subcalls can be combined in numerous permutations to form a complex file system operation. The number of permutations of calls and subcalls makes the file system interface extremely difficult to understand and use.<br />
• File system emulation layer. This layer provides a compatible interface between the Spiralog file system and existing applications. Calls to export the new features available in the Spiralog file system are also included in this layer. An important new feature, the write-behind cache, is described in the section Overview of Caching.<br />
• The Spiralog file system interface. The Spiralog file system provides a generic file system interface. This interface was designed to provide a superset of the features that are typically available in file systems used in the UNIX operating system. File system emulation layers, such as the one written for Files-11, could also be written for many different file systems. Features that could not be provided generically, for example, the implementation of security policies, are implemented in the file system emulation layer.<br />
The Spiralog file system's interface is based on the Virtual File System (VFS), which provides a file system interface similar to those found on UNIX systems.7 Functions available are at a higher level than the Files-11 file system interface. For example, an atomic rename function is provided.<br />
Files-11 and the Spiralog File System<br />
On-disk Structures<br />
A major difference between the Files-11 and the Spiralog file systems is the way data is laid out on the disk. The Files-11 system is a conventional, update-in-place file system. Here, space is reserved for file data, and updates to that data are written back to the same location on the disk. Given this knowledge, applications could place data on Files-11 volumes to take advantage of the disk's geometry. For example, the Files-11 file system allows applications to place files on cylinder boundaries to reduce seek times.<br />
The Spiralog file system is a log-structured file system (LFS). The entire volume is treated as a continuous log with updates to files being appended to the tail of the log. In effect, files do not have a fixed home location on a volume. Updates to files, or cleaner activity, will change the location of data on a volume. Applications do not have to be concerned where their data is placed on the disk; LFS provides this mapping.<br />
Some applications call the Files-11 interface directly to access, for example, security information. Although few in number, those applications that do call the Files-11 file system directly are significant ones. If the only interface supported was RMS, then these utilities, such as SET FILE and OpenVMS Backup, would need significant modification. This class of utilities represents a large number of the OpenVMS utilities that maintain the file system.<br />
To provide support for the widest range of applications, we selected the low-level Files-11 interface for use by the file system emulation layer. By selecting this interface, we decreased the amount of work needed for its emulation. However, this gain was offset by the increased complexity in the interface emulation. Problems caused by this interface selection are described in the next section.<br />
Interface Compatibility<br />
Once the file system interface was selected, choices had to be made about the level of support provided by the emulation layer. Due to the nature of the log-structured file system, described in the section Files-11 and the Spiralog File System On-disk Structures, full compatibility of all features in the emulation layer was not possible. This section discusses some of the decisions made concerning interface compatibility.<br />
An initial decision w::s made to support documented<br />
low-level Fi les-l l calls through the emulation<br />
layer as ofi:en as possible. This would enable all<br />
well-behaved applications to run unchanged on the<br />
Spiralog tile system. Examples of well-behaved applications<br />
are those that make use of HLL library calls.<br />
The tollowing categories of access to the fi le system<br />
would not be supported:<br />
• Those directly accessing the disk without going<br />
through the file system<br />
• Those making usc of specitlc on-disk structure<br />
information<br />
• Those making use of undocumented ti le system<br />
katures<br />
A very small number of applications fe ll into these<br />
categories. Exampl es of applications that make use of<br />
on-disk structu re knowledge are the Open VMS boot<br />
code, disk structure analyzers, and disk dcti·agmenters.<br />
The majority of Open VMS applications make tile<br />
system calls through the 1'-LV\S interface. Using file system<br />
call-tracing techniques, described in the section<br />
Investigation Te chniques, a fu ll set of tile system calls<br />
made by RMS could be constructed. Afi:er analysis of<br />
this trace data, it was cle::r that IUv\ S used a small set<br />
of well-structu red calls to the low-level ti le system<br />
intertace. Further, detailed analysis of these calls<br />
showed that all RMS operations could be fu lly emulated<br />
on the Spiralog fi le system.<br />
The su pport of Open VMS ti le system utilities that<br />
made direct calls to the low-level Filcs-11 intertace was<br />
important if we were to minimize the amount of code<br />
change required in the Open VMS code b::se. Analysis<br />
of these utilities showed that the majority of them<br />
could be supported through tJ1e emulation layer.<br />
Ve ry rcw applications made use of katures of the<br />
Filcs-1 1 ti le system that could not be emulated . This<br />
enabled a high number of applications to run<br />
unchanged on the Spiralog file system.<br />
Table 1<br />
Categorization of File System Features<br />
<br />
Supported. The operation requested was completed, and a success status was returned.<br />
Examples: Requests to create a file or open a file.<br />
Notes: Most calls made by applications belong in the supported category.<br />
<br />
Ignored. The operation requested was ignored, and a success status was returned.<br />
Examples: A request to place a file in a specific position on the disk to improve performance.<br />
Notes: This type of feature is incompatible with a log-structured file system. It is very infrequently used and not available through HLL libraries. It could be safely ignored.<br />
<br />
Not supported. The operation requested was ignored, and a failure status was returned.<br />
Examples: A request to directly read the on-disk structure.<br />
Notes: This type of request is specific to the Files-11 file system and could be allowed to fail because the application would not work on the Spiralog file system. It is used only by a few specialized applications.<br />
<br />
Compatibility Problems<br />
This section describes some of the compatibility problems that we encountered in developing the emulation layer and how we resolved them.<br />
When considering the compatibility of the Spiralog file system with the Files-11 file system, we placed the features of the file system into three categories: supported, ignored, and not supported. Table 1 gives examples and descriptions of these categories. A feature was recategorized only if it could be supported but was not used, or if it could not be easily supported but was used by a wide range of applications.<br />
The majority of OpenVMS applications make supported file system calls. These applications will run as intended on the Spiralog file system. Few applications make calls that could be safely ignored. These applications would run successfully but could not make use of these features. Very few applications made calls that were not supported. Unfortunately, some of these applications were very important to the success of the Spiralog file system, for example, system management utilities that were optimized for the Files-11 system.<br />
Analysis of applications that made unsupported calls showed the following categories of use:<br />
• Those that accessed the file header, a structure used to store a file's attributes. This method was used to return multiple file attributes in one call. The supported mechanism involved an individual call for each attribute.<br />
This was solved by returning an emulated file header that contained the majority of information interesting to applications.<br />
• Those reading directory files. This method was used to perform fast directory scans. The supported mechanism involved a file system call for each name.<br />
This was solved by providing a bulk directory reading interface call. This call was similar to the getdirentries() call on the UNIX system and was straightforward to replace in applications that directly read directories.<br />
The OpenVMS Backup utility was an example of a system management utility that directly read directory files. The Backup utility was changed to use the directory reading call on Spiralog volumes.<br />
• Those accessing reserved files. The existing file system stores all its metadata in normal files that can be read by applications. These files are called reserved files and are created when a volume is initialized. No reserved files are created on a Spiralog volume, with the exception of the master file directory (MFD). Applications that read reserved files make specific use of on-disk structure information and are not supported with the Spiralog file system. The MFD is used as the root directory and for directory traversals. This file was virtually emulated. It appears in directory listings of a Spiralog volume and can be used to start a directory traversal, but it does not exist on the volume as a real file.<br />
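The difference between the per-name interface and the bulk directory reading call (analogous to getdirentries() on UNIX) can be sketched as follows; the function names are hypothetical, not the actual OpenVMS interfaces:<br />

```python
# Hypothetical sketch: per-name directory scanning versus a bulk read call.
DIRECTORY = ["login.com", "mail.mai", "notes.txt"]

def lookup_next(prev):
    """Per-name interface: one file system call per directory entry."""
    i = 0 if prev is None else DIRECTORY.index(prev) + 1
    return DIRECTORY[i] if i < len(DIRECTORY) else None

def read_directory_bulk():
    """Bulk interface: return all entries in a single call."""
    return list(DIRECTORY)

# Per-name scan: N+1 file system calls for N entries.
names, entry = [], lookup_next(None)
while entry is not None:
    names.append(entry)
    entry = lookup_next(entry)

# The bulk call returns the same result in one call.
assert names == read_directory_bulk()
```

Replacing the per-name loop with the single bulk call is the kind of localized change made to the Backup utility for Spiralog volumes.<br />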
Investigation Techniques<br />
This section describes the approach taken to investigate the interface and compatibility problems described earlier. It focused on understanding how applications called the file system and the semantics of the calls. A number of techniques were used in lieu of design documentation:<br />
• Surveys [...] asking the other maintainers about the impact of the Spiralog file system. More than 2,000 surveys were sent out, but fewer than 25 useful results were returned. Sadly, most application maintainers were not aware of how their product used the file system.<br />
• Automated application source code searching. Automated source code searching quickly checks a large amount of source code. This technique was most useful when analyzing file system calls made by the OpenVMS operating system or utilities. However, this does not work well when applications make dynamic calls to the file system.<br />
Table 2<br />
Verification of Compatibility<br />
<br />
Test Suite: Number of Tests<br />
RMS regression tests: ~500<br />
OpenVMS regression tests: ~100<br />
Files-11 compatibility tests: ~100<br />
C2 security test suite: ~50 discrete tests<br />
C language tests: ~2,000<br />
FORTRAN language tests: ~100<br />
Write-back caching provides very different semantics from the model of write-through caching used on the Files-11 file system. The goal of the Spiralog project members was to provide write-back caching in a way that was compatible with existing OpenVMS applications.<br />
This section compares write-through and write-back caching and shows how some important OpenVMS applications rely on write-through semantics to protect data from system failure. It describes the ordered write-back cache as introduced in the Echo file system and explains how this model of caching (known as write-behind caching) is particularly suited to the environment of OpenVMS Cluster systems and the Spiralog log-structured file system.<br />
Overview of Caching<br />
During the last few years, CPU performance improvements have continued to outpace performance improvements for disks. As a result, the I/O bottleneck has worsened rather than improved. One of the most successful techniques used to alleviate this problem is caching. Caching means holding a copy of data that has been recently read from, or written to, the disk in memory, giving applications access to that data at memory speeds rather than at disk speeds.<br />
Write-through and write-back caching are two different models frequently used in file systems.<br />
• Write-through caching. In a write-through cache, data read from the disk is stored in the in-memory cache. When data is written, a copy is placed in the cache, but the write request does not return until the data is on the disk. Write-through caches improve the performance of read requests but not write requests.<br />
• Write-back caching. A write-back cache improves the performance of write requests as well: the write request returns as soon as the data is in the cache, and the data is written to the disk some time later.<br />
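The contrast between the two models can be sketched schematically (an illustrative sketch only; real caches operate on disk blocks and fixed-size buffers):<br />

```python
# Schematic contrast of write-through and write-back caching.
class Disk:
    def __init__(self):
        self.blocks = {}
        self.writes = 0          # count of physical disk writes
    def put(self, key, value):
        self.blocks[key] = value
        self.writes += 1

class WriteThroughCache:
    def __init__(self, disk):
        self.disk, self.cache = disk, {}
    def write(self, key, value):
        self.cache[key] = value
        self.disk.put(key, value)   # request completes only after disk I/O

class WriteBackCache:
    def __init__(self, disk):
        self.disk, self.dirty = disk, {}
    def write(self, key, value):
        self.dirty[key] = value     # returns at memory speed; I/O deferred
    def flush(self):
        for key, value in self.dirty.items():
            self.disk.put(key, value)
        self.dirty.clear()

disk_wt, disk_wb = Disk(), Disk()
wt, wb = WriteThroughCache(disk_wt), WriteBackCache(disk_wb)
for i in range(100):                # 100 updates to the same block
    wt.write("block0", i)
    wb.write("block0", i)
wb.flush()
```

A hundred updates to one block cost a hundred disk writes under write-through but a single disk write under write-back, which is the performance difference the section goes on to quantify.<br />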
[Figure 2: Caching Models — compares the time to complete a write with no cache and with a write-back cache, on a scale of microseconds.]<br />
Caching is more important to the Spiralog file system than it is to conventional file systems. Log-structured file systems have inherently worse read performance than conventional, update-in-place file systems, due to the need to locate the data in the log. As described in another paper in this Journal, locating data in the log requires more disk I/Os than in an update-in-place file system.2 The Spiralog file system uses large read caches to offset this extra read cost.<br />
Careful Writing<br />
The Files-11 file system provides write-through semantics. Key OpenVMS applications such as transaction processing and the OpenVMS Record Management Services (RMS) have come to rely on the implicit ordering of write-through. They use a technique known as careful writing to prevent data corruption following a system failure.<br />
Careful writing allows an application to ensure that the data on the disk is never in an inconsistent or invalid state. This guarantee is achieved by controlling the order in which individual updates reach the disk.<br />
[Figure 3: Example of a Careful Write]<br />
Write-behind Caching<br />
A careful write policy relies totally on being able to control the order of writes to the disk. This cannot be achieved on a write-back cache because the write-back method does not preserve the order of write requests. Reordering writes in a write-back cache would risk corrupting the data that applications using careful writing were seeking to protect. This is unfortunate because the performance benefits of deferring the write to the disk are compatible with a careful write policy. Careful writing does not need to know when the data is written to the disk, only the order in which it is written.<br />
To allow these applications to gain the performance of the write-back cache but still protect their data on disk, the Spiralog file system uses a variation on write-back caching known as write-behind caching. Introduced in the Echo file system, write-behind caching is essentially write-back caching with ordering guarantees.3 The cache allows the application to specify which writes must be ordered and the order in which they must be written to the disk.<br />
This is achieved by providing the barrier primitive to applications. Barrier defines an order or dependency between write operations. For example, consider the diagram in Figure 4: here, writes are represented as a time-ordered queue, with later writes being added behind earlier ones. A barrier guarantees that all writes issued before it reach the disk before any write issued after it.<br />
[Figure 4]<br />
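A write-behind queue with a barrier primitive might be sketched like this (a hypothetical sketch; the actual Spiralog cache interfaces are not described at this level of detail):<br />

```python
# Sketch of write-behind: writes may be reordered for efficiency,
# but never across a barrier.
class WriteBehindQueue:
    def __init__(self):
        self.segments = [[]]          # groups of writes between barriers

    def write(self, block, data):
        self.segments[-1].append((block, data))

    def barrier(self):
        # All writes before the barrier must reach the disk before
        # any write issued after it.
        self.segments.append([])

    def flush(self):
        issued = []
        for segment in self.segments:
            # Within a segment the cache is free to reorder (here, by
            # block number, to batch adjacent blocks); across segments
            # the barrier ordering is preserved.
            issued.extend(sorted(segment))
        self.segments = [[]]
        return issued

q = WriteBehindQueue()
q.write(9, "new data A'")     # careful write: the data goes first...
q.barrier()
q.write(1, "pointer to A'")   # ...then the pointer that makes it live
order = q.flush()
```

Without the barrier, sorting by block number would issue block 1 before block 9 and break the careful write; with it, the data reaches the disk before the pointer that references it.<br />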
[Figure 5]<br />
closed. The semantics of full write-behind caching are best suited to applications that can easily regenerate lost data at any time. Temporary files from a compiler are a good example. Should the system fail, the compilation can be restarted without any loss of data.<br />
3. Flush-on-close caching policy. The flush-on-close policy provides a restricted level of write-behind caching for applications. Here, all updates to the file are treated as write-behind, but when the file is closed, all changes are forced to the disk. This gives the performance of write-behind but, in addition, provides a known point when the data is on the disk. This form of caching is particularly suitable for applications that can easily re-create data in the event of a system crash but need to know that data is on the disk at a specific time. For example, a mail store-and-forward system receiving an incoming message must know the data is on the disk when it acknowledges receipt of the message to the forwarder. Once the acknowledgment is sent, the message has been formally passed on, and the forwarder may delete its copy. In this example, the data need not be on the disk until that acknowledgment is sent, because that is the point at which the message receipt is committed. Should the system fail before the acknowledgment is sent, all dirty data in the cache would be lost. In that event, the sender can easily re-create the data by sending the message again.<br />
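The flush-on-close contract can be sketched as follows (an illustrative sketch with invented names, not the OpenVMS file interface):<br />

```python
# Sketch of the flush-on-close policy: updates are write-behind while the
# file is open, but close() forces every dirty block to the disk.
class Disk:
    def __init__(self):
        self.blocks = {}

class FlushOnCloseFile:
    def __init__(self, disk):
        self.disk, self.dirty, self.open = disk, {}, True

    def write(self, block, data):
        assert self.open
        self.dirty[block] = data             # deferred, memory-speed write

    def close(self):
        self.disk.blocks.update(self.dirty)  # known point: data is on disk
        self.dirty.clear()
        self.open = False

disk = Disk()
f = FlushOnCloseFile(disk)
f.write(0, "incoming mail message")
assert 0 not in disk.blocks   # not yet on disk: unsafe to acknowledge
f.close()                     # after close, receipt can be acknowledged
```

The mail example maps directly onto this: the store-and-forward system closes the file before sending its acknowledgment, so the committed message is guaranteed to be on the disk.<br />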
Figure 6 shows the results of a performance comparison of the three caching policies. The test was run on a dual-CPU DEC 7000 Alpha system with 384 megabytes of memory on a RAID-5 disk. The test repeated the following sequence for the different file sizes:<br />
1. Create and open a file of the required size and set its caching policy.<br />
2. Write data to the whole file in 1,024-byte I/Os.<br />
3. Close the file.<br />
4. Delete the file.<br />
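The measurement loop above can be sketched as a small benchmark harness (a hypothetical sketch against a toy in-memory file system; the original test used OpenVMS file services):<br />

```python
import time

class ToyFile:
    def __init__(self, fs, name):
        self.fs, self.name, self.data = fs, name, {}
    def write(self, offset, buf):
        self.data[offset] = buf
    def close(self):
        self.fs.files[self.name] = self.data

class ToyFS:
    def __init__(self):
        self.files = {}
    def create(self, name, policy):      # policy is accepted but unused here
        return ToyFile(self, name)
    def delete(self, name):
        self.files.pop(name, None)

def run_test(fs, sizes):
    """Time per operation for each file size, as in the Figure 6 test."""
    results = {}
    for size in sizes:
        start = time.perf_counter()
        f = fs.create("testfile", policy="write-behind")  # 1. create, set policy
        for offset in range(0, size, 1024):               # 2. 1,024-byte I/Os
            f.write(offset, b"\0" * 1024)
        f.close()                                         # 3. close
        fs.delete("testfile")                             # 4. delete
        ops = 3 + size // 1024            # file operations plus data I/Os
        results[size] = (time.perf_counter() - start) / ops
    return results

results = run_test(ToyFS(), [1024, 32768])
```

Dividing elapsed time by the operation count is what lets the leftmost points of the graph reflect file operations and the rightmost points reflect data I/Os.<br />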
With small files, the number of file operations (create, close, delete) dominates. The leftmost side of the graph therefore shows the time per operation for file operations. As the files increase in size, the data I/Os become prevalent. Hence, the rightmost side of Figure 6 is displaying the time per operation for data I/Os.<br />
Figure 6 clearly shows that an ordered write-behind cache provides the highest performance of the three caching models. For file operations, the write-behind cache is almost 30 percent faster than the write-through cache. Data operations are approximately three times faster than the corresponding operation with write-through caching.<br />
[Figure 6: Performance Comparison of Caching Policies — time per operation (seconds, 0.000 to 0.156) versus file size (bytes, 1,024 to 32,768) for the write-behind, flush-on-close, and write-through caches.]<br />
Summary and Conclusions<br />
The task of integrating a log-structured file system into the OpenVMS environment was a significant challenge for the Spiralog project members. Our approach of carefully determining the interface to emulate and the level of compatibility was important to ensure that the majority of applications worked unchanged.<br />
We have shown that an existing update-in-place file system can be replaced by a log-structured file system. Initial effort in the analysis of application usage furnished information on interface compatibility. Most file system operations can be provided through a file system emulation layer. Where necessary, new interfaces were provided for applications to replace their direct knowledge of the Files-11 file system.<br />
File system operation tracing and functional analysis of the Files-11 file system proved to be the most useful techniques to establish interface compatibility. Application compatibility far exceeds the level expected when the project was started. A majority of people use Spiralog file system volumes without noticing any change in their application's behavior.<br />
Careful write policies rely on the order of updates to the disk. Since write-back caches reorder write requests, applications using careful writing have been unable to take advantage of the significant improvements in write performance given by write-back caching. The Spiralog file system solves this problem by providing ordered write-back caching, known as write-behind. The write-behind cache allows applications to control the order of writes to the disk through a primitive called barrier.<br />
Using barriers, applications can build careful write policies on top of a write-behind cache, gaining all the performance of write-back caching without risking<br />
data integrity. A write-behind cache also allows the file system itself to gain write-back performance.<br />
Acknowledgments<br />
[...] for their help throughout the project, and Karen Howell, Morag Currie, and all those who helped with this paper.<br />
Extending OpenVMS<br />
for 64-bit Addressable<br />
Virtual Memory<br />
The OpenVMS operating system recently<br />
extended its 32-bit virtual address space to<br />
exploit the Alpha processor's 64-bit virtual<br />
addressing capacity while ensuring binary<br />
compatibility for 32-bit non privileged pro<br />
grams. This 64-bit technology is now available<br />
both to Open VMS users and to the operating<br />
system itself. Extending the virtual address<br />
space is a fundamental evolutionary step for<br />
the OpenVMS operating system, which has<br />
existed within the bounds of a 32-bit address<br />
space for nearly 20 years. We chose an asymmetric division of virtual address extension that<br />
allocates the majority of the address space to<br />
applications by minimizing the address space<br />
devoted to the kernel. Significant scaling issues<br />
arose with respect to the kernel that dictated<br />
a different approach to page table residency<br />
within the OpenVMS address space. The paper<br />
discusses key scaling issues, their solutions,<br />
and the resulting layout of the 64-bit virtual<br />
address space.<br />
Michael S. Harvey<br />
Leonard S. Szubowicz<br />
The OpenVMS Alpha operating system initially supported a 32-bit virtual address space that maximized compatibility for OpenVMS VAX users as they ported their applications from the VAX platform to the Alpha platform. Providing access to the 64-bit virtual memory capability defined by the Alpha architecture was always a goal for the OpenVMS operating system. An early consideration was the eventual use of this technology to enable a transition from a purely 32-bit-oriented context to a purely 64-bit-oriented native context. OpenVMS designers recognized that such a fundamental transition for the operating system, along with a 32-bit VAX compatibility mode support environment, would take a long time to implement and could seriously jeopardize the migration of applications from the VAX platform to the Alpha platform. A phased approach was called for, by which the operating system could evolve over time, allowing for quicker time-to-market for significant features and better, more timely support for binary compatibility.<br />
In 1989, a strategy emerged that defined two fundamental phases of OpenVMS Alpha development. Phase 1 would deliver the OpenVMS Alpha operating system initially with a virtual address space that faithfully replicated address space as it was defined by the VAX architecture. This familiar 32-bit environment would ease the migration of applications from the VAX platform to the Alpha platform and would ease the port of the operating system itself. Phase 1, the OpenVMS Alpha version 1.0 product, was delivered in 1992.1<br />
For Phase 2, the OpenVMS operating system would successfully exploit the 64-bit virtual address capacity of the Alpha architecture, laying the groundwork for further evolution of the OpenVMS system. In 1989, strategists predicted that Phase 2 could be delivered approximately three years after Phase 1. As planned, Phase 2 culminated in 1995 with the delivery of OpenVMS Alpha version 7.0, the first version of the OpenVMS operating system to support 64-bit virtual addressing.<br />
This paper discusses how the OpenVMS Alpha Operating System Development group extended the OpenVMS virtual address space to 64 bits. Topics covered include compatibility for existing applications, the options for extending the address space, the strategy for page table residency, and the final layout of the OpenVMS 64-bit virtual address space. In implementing support for 64-bit virtual addresses, designers maximized privileged code compatibility; the paper presents some key measures taken to this end.<br />
• The lower-addressed half (2 GB) of virtual address space is process private. This space contains [...] for a given process, process-permanent code, and various process-specific control cells.<br />
• The higher-addressed half (2 GB) of virtual address space is defined to be shared by all processes. This shared space is where the OpenVMS operating system kernel resides. Although the VAX architecture divides this space into a pair of separately named 1-GB regions (S0 space and S1 space), the OpenVMS Alpha operating system makes no material distinction between the two regions and refers to them collectively as S0/S1 space.<br />
[Figure 1: OpenVMS Alpha 32-bit Virtual Address Space — P0 and P1 space (process private, 2 GB) occupy addresses 00000000.00000000 through 00000000.7FFFFFFF; S0/S1 space (shared, 2 GB) occupies FFFFFFFF.80000000 through FFFFFFFF.FFFFFFFF; the addresses between them are unreachable with 32-bit pointers.]<br />
privileged code compatibility and ensuring faster time-to-market. However, analysis of this option showed that there were enough significant portions of the kernel requiring change that, in practice, very little additional privileged code compatibility, such as for drivers, would be achievable. Also, this option did not address certain important problems that are specific to shared space, such as limitations on the kernel's capacity to manage ever-larger, very large memory (VLM) systems in the future.<br />
We decided to pursue the option of a flat, superset 64-bit virtual address space that provided extensions for both the shared and the process-private portions of the space that a given process could reference. The new, extended process-private space, named P2 space, is adjacent to P1 space and extends toward higher virtual addresses.5 The new, extended shared space, named S2 space, is adjacent to S0/S1 space and extends toward lower virtual addresses. P2 and S2 spaces grow toward each other.<br />
A remaining design problem was to decide where P2 and S2 would meet in the address space layout. A simple approach would split the 64-bit address space exactly in half, symmetrically scaling up the design of the 32-bit address space already in place. (The address space is split in this way by the Digital UNIX operating system.3) This solution is easy to explain because, on the one hand, it extends the 32-bit convention that the most significant address bit can be treated [...]<br />
Virtual Address-to-Physical Address Translation<br />
The Alpha architecture allows an implementation to choose one of the following four page sizes: 8 kilobytes (KB), 16 KB, 32 KB, or 64 KB. The architecture also defines a multilevel, hierarchical page table structure for virtual address-to-physical address (VA-to-PA) translations. All OpenVMS Alpha platforms have implemented a page size of 8 KB and three levels in this page table structure. Although throughout this paper we assume a page size of 8 KB and three levels in the page table hierarchy, no loss of generality is incurred by this assumption.<br />
Figure 2 illustrates the VA-to-PA translation sequence using the multilevel page table structure.<br />
[Figure 2: Virtual Address-to-Physical Address Translation — a 64-bit virtual address comprises the sign extension of Segment 1 (bits 63 down to 43), the Segment 1, Segment 2, and Segment 3 bit fields, and a byte-within-page field; translation starts at the page table base register and walks the L1PT, L2PT, and L3PT levels.]<br />
1. The page table base register (PTBR) is a per-process pointer to the highest level (L1) of that process' page table structure. At the highest level is one 8-KB page (L1PT) that contains 1,024 page table entries (PTEs) of 8 bytes each. Each PTE at the highest page table level (that is, each L1PTE) maps a page table page at the next lower level in the translation hierarchy (the L2PTs).<br />
2. The Segment 1 bit field of a given virtual address is an index into the L1PT that selects a particular L1PTE, hence selecting a specific L2PT for the next stage of the translation.<br />
3. The Segment 2 bit field of the virtual address then indexes into that L2PT to select an L2PTE, hence selecting a specific L3PT for the next stage of the translation.<br />
4. The Segment 3 bit field of the virtual address then indexes into that L3PT to select an L3PTE, hence selecting a specific 8-KB code or data page.<br />
5. The byte-within-page bit field of the virtual address then selects a specific byte address in that page.<br />
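With an 8-KB page (13 byte-within-page bits) and 1,024 PTEs per page-table page (10 bits per segment field), the field extraction can be sketched directly; the field positions follow from the page-size arithmetic above:<br />

```python
# Decompose a 64-bit VA for 8-KB pages and a three-level page table:
# bits <12:0> byte-within-page, <22:13> Segment 3, <32:23> Segment 2,
# <42:33> Segment 1; bits <63:43> are the sign extension of Segment 1.
def va_fields(va):
    return {
        "seg1": (va >> 33) & 0x3FF,
        "seg2": (va >> 23) & 0x3FF,
        "seg3": (va >> 13) & 0x3FF,
        "byte": va & 0x1FFF,
    }

# The PT_Base address given later in the paper decomposes back into the
# self-map index 3FE (1,022) in Segment 1 with all other fields zero.
fields = va_fields(0xFFFFFFFC00000000)
```

Each 10-bit segment field indexes one of the 1,024 PTEs in a page-table page, mirroring steps 2 through 4 of the translation sequence.<br />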
An Alpha implementation may increase the page size and/or the number of levels in the page table hierarchy, thus mapping greater amounts of virtual space up to the full 64-bit amount. The assumed combination of 8-KB page size and three levels of page table allows the system to map up to 8 terabytes (TB) (i.e., 1,024 × 1,024 × 1,024 × 8 KB = 8 TB) of virtual memory for a single process.<br />
To map the entire 8-TB address space available to a single process requires up to 8 GB of PTEs (i.e., 1,024 × 1,024 × 1,024 × 8 bytes = 8 GB). This fact alone presents a serious sizing issue for the OpenVMS operating system. The 32-bit page table residency model that the OpenVMS operating system ported from the VAX platform to the Alpha platform does not have the capacity to support such large page tables.<br />
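The two sizing figures follow directly from the configuration constants and are easy to check:<br />

```python
# 8-KB pages, 1,024 PTEs per page-table page, three levels of page table.
PAGE = 8 * 1024
PTES_PER_PAGE = 1024
PTE_SIZE = 8

# Total virtual space mapped by one process' three-level page table:
mapped = PTES_PER_PAGE ** 3 * PAGE          # 1,024^3 x 8 KB = 8 TB

# PTE storage needed at the lowest level to map all of it:
pte_bytes = PTES_PER_PAGE ** 3 * PTE_SIZE   # 1,024^3 x 8 bytes = 8 GB
```

The 8 GB of potential PTEs is what breaks the 32-bit residency model discussed next: it exceeds the entire 32-bit address space.<br />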
Page Tables: 32-bit Residency Model<br />
We stated earlier that materializing a 32-bit virtual address space [...] system from the VAX [...]<br />
The effect of this self-mapping on the VA-to-PA translation sequence (shown in Figure 2) is subtle but important.<br />
• For those virtual addresses with a Segment 1 bit field value that selects the self-mapper L1PTE, step 2 of the VA-to-PA translation sequence reselects the L1PT as the effective L2PT (L2PT') for the next stage of the translation.<br />
• Step 3 indexes into L2PT' (the L1PT) using the Segment 2 bit field value to select an L3PT'.<br />
• Step 4 indexes into L3PT' (an L2PT) using the Segment 3 bit field value to select a specific data page.<br />
• Step 5 indexes into that data page (an L3PT) using the byte-within-page bit field of the virtual address to select a specific byte address within that page.<br />
When step 5 of the VA-to-PA translation sequence is finished, the final page being accessed is itself one of the level 3 page table pages, not a page that is mapped<br />
[Figure 4: Effect of Page Table Self-Map — the L1PT, reached through the PTBR, contains PTE #1022 holding the L1PT's own PFN; the L1PT maps a 64-bit addressable virtual address space of 1,024 8-GB sections, one of which is PT space. Key: PTBR = page table base register, PFN = page frame number, PTE = page table entry.]<br />
by a level 3 page table page. The self-map operation places the entire 8-GB page table structure at the end of the VA-to-PA translation sequence for a specific 8-GB portion of the process' address space. This virtual space that contains all of a process' potential page tables is called page table space (PT space).<br />
Figure 4 depicts the effect of self-mapping the page tables. On the left is the highest-level page table page containing a fixed number of PTEs. On the right is the virtual address space that is mapped by that page table page. The mapped address space consists of a collection of identically sized, contiguous address range sections, each one mapped by a PTE in the corresponding position in the highest-level page table page. (For clarity, lower levels of the page table structure are omitted from the figure.)<br />
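The self-map effect can be simulated with a toy three-level translator (the frame numbers are arbitrary; index 1,022 is the self-mapper, as in the text):<br />

```python
# Toy physical memory: frame number -> page, modeled as a dict mapping a
# segment-field index to the frame number of the next-level page.
SELF = 1022
L1_FRAME, L2_FRAME, L3_FRAME, DATA_FRAME = 100, 200, 300, 400

frames = {
    L1_FRAME: {SELF: L1_FRAME, 5: L2_FRAME},  # L1PTE #1022 maps the L1PT itself
    L2_FRAME: {7: L3_FRAME},
    L3_FRAME: {9: DATA_FRAME},
}

def translate(seg1, seg2, seg3, ptbr=L1_FRAME):
    frame = ptbr
    for index in (seg1, seg2, seg3):   # steps 2-4 of the translation sequence
        frame = frames[frame][index]
    return frame                       # frame of the final page accessed

# A normal address reaches a data page...
assert translate(5, 7, 9) == DATA_FRAME
# ...but an address whose Segment 1 field selects the self-mapper ends the
# translation one level early, on a level 3 page table page.
assert translate(SELF, 5, 7) == L3_FRAME
```

Iterating the trick, translate(SELF, SELF, 5) lands on an L2PT and translate(SELF, SELF, SELF) on the L1PT itself, which is exactly the positional effect the paper derives for L2_Base and L1_Base.<br />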
Notice that LlPTE # 1022 in Figure 4 was chosen to<br />
map the high-level p::ge table page that contains that<br />
PTE. (The reason f()r this particular choice will<br />
be explained in the next section. Theoretically, any one<br />
64-BIT ADDRESSABLE<br />
VIRTUAL ADDRESS SPACE<br />
PO/P1<br />
PT SPACE f<br />
00000000.00000000<br />
8-GB #0<br />
1,020 X 8 GB<br />
8-GB OW22<br />
-_-_ \ :,:::::3,,mm<br />
t-_-__ -_-_s-o;-_s_1 _
of the L1PTEs could have been chosen as the self-mapper.) The section of virtual memory mapped by the chosen L1PTE contains the entire set of page tables needed to map the available address space of a given process. This section of virtual memory is PT space, which is depicted on the right side of Figure 4 in the 1,022d 8-GB section in the materialized virtual address space.
The base address for this PT space incorporates the index of the chosen self-mapper L1PTE (1,022 = 3FE in hexadecimal) as follows (see Figure 2):
Segment 1 bit field = 3FE
Segment 2 bit field = 0
Segment 3 bit field = 0
Byte within page = 0,
which result in
VA = FFFFFFFC.00000000
(also known as PT_Base).
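This field arithmetic can be checked mechanically. The sketch below is ours, not from the paper; it assumes the Figure 2 layout described in the text: three 10-bit segment fields (1,024 8-byte PTEs per 8-KB page table page), a 13-bit byte-within-page field, and sign extension of the resulting 43-bit virtual address from its top bit.

```python
SEG_BITS = 10    # 8-KB page / 8-byte PTE = 1,024 entries per page table page
BYTE_BITS = 13   # byte-within-page field for an 8-KB page
VA_BITS = 3 * SEG_BITS + BYTE_BITS   # 43 implemented virtual address bits

def compose_va(seg1, seg2, seg3, byte):
    """Assemble a virtual address from its Figure 2 bit fields."""
    va = (seg1 << (2 * SEG_BITS + BYTE_BITS)) \
         | (seg2 << (SEG_BITS + BYTE_BITS)) \
         | (seg3 << BYTE_BITS) | byte
    if va & (1 << (VA_BITS - 1)):        # sign-extend to 64 bits
        va |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return va

pt_base = compose_va(0x3FE, 0, 0, 0)
print(f"PT_Base = {pt_base >> 32:08X}.{pt_base & 0xFFFFFFFF:08X}")
# PT_Base = FFFFFFFC.00000000
```

Setting only the Segment 1 field to 3FE and sign-extending reproduces the PT_Base value quoted in the text.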
Figure 5 illustrates the exact contents of PT space for a given process. One can observe the positional effect of choosing a particular high-level PTE to self-map the page tables even within PT space. In Figure 4, the choice of PTE for self-mapping not only places PT space as a whole in the 1,022d 8-GB section in virtual memory but also means that
Figure 5
Page Table Space
(Labels: PT_Base, L2_Base, L1_Base; 1,021 L2PTs and the L1PT within the 8-GB page table space, between the next-lower and next-higher 8-GB sections)
• The 1,022d grouping of the lowest-level page tables (L3PTs) within PT space is actually the collection of next-higher-level PTs (L2PTs) that map the other groupings of L3PTs, beginning at
Segment 1 bit field = 3FE
Segment 2 bit field = 3FE
Segment 3 bit field = 0
Byte within page = 0,
which result in
VA = FFFFFFFD.FF000000
(also known as L2_Base).
• Within that block of L2PTs, the 1,022d L2PT is actually the next-higher-level page table that maps the L2PTs, namely, the L1PT. The L1PT begins at
Segment 1 bit field = 3FE
Segment 2 bit field = 3FE
Segment 3 bit field = 3FE
Byte within page = 0,
which result in
VA = FFFFFFFD.FF7FC000
(also known as L1_Base).
• Within that L1PT, the 1,022d PTE is the one used for self-mapping these page tables. The address of the self-mapper L1PTE is
Segment 1 bit field = 3FE
Segment 2 bit field = 3FE
Segment 3 bit field = 3FE
Byte within page = 3FE × 8,
which result in
VA = FFFFFFFD.FF7FDFF0.
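All three derived addresses fall out of the same field arithmetic. The sketch below is ours, under the same assumptions as before (10-bit segment fields, 13-bit byte-within-page field, sign extension from bit 42); the helper name is not from the paper.

```python
SEG_BITS, BYTE_BITS = 10, 13
VA_BITS = 3 * SEG_BITS + BYTE_BITS   # 43 implemented virtual address bits

def compose_va(seg1, seg2, seg3, byte):
    """Assemble a virtual address from its Figure 2 bit fields."""
    va = (seg1 << (2 * SEG_BITS + BYTE_BITS)) \
         | (seg2 << (SEG_BITS + BYTE_BITS)) \
         | (seg3 << BYTE_BITS) | byte
    if va & (1 << (VA_BITS - 1)):        # sign-extend to 64 bits
        va |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return va

assert compose_va(0x3FE, 0x3FE, 0, 0) == 0xFFFFFFFDFF000000          # L2_Base
assert compose_va(0x3FE, 0x3FE, 0x3FE, 0) == 0xFFFFFFFDFF7FC000      # L1_Base
assert compose_va(0x3FE, 0x3FE, 0x3FE, 0x3FE * 8) == 0xFFFFFFFDFF7FDFF0
```

Each additional 3FE field drops the self-map index one level deeper, ending at the self-mapper L1PTE itself.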
This positional correspondence within PT space is preserved should a different high-level PTE be chosen for self-mapping the page tables.
The properties inherent in this self-mapped page table are compelling.
• The amount of virtual memory reserved is exactly the amount required for mapping the page tables, regardless of page size or page table depth. Consider the segment-numbered bit fields of a given virtual address from Figure 2. Concatenated, these bit fields constitute the virtual page number (VPN) portion of a given virtual address. The total size of the PT space needed to map every VPN is the number of possible VPNs times 8 bytes, the size of a PTE. The total size of the address space mapped by that PT space is the number of possible VPNs times the page size. Factoring out the VPN multiplier, the ratio between these is the page size divided by 8, which is exactly the size of the Segment 1 bit field in the virtual address. Hence, all the space mapped by a single PTE at the highest level of page table is exactly the size required for mapping all the PTEs that could ever be needed to map the process' address space.
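That sizing argument can be spot-checked with the dimensions given elsewhere in the paper (8-KB pages, 8-byte PTEs, three translation levels); the arithmetic below is ours, not a figure from the text.

```python
page_size = 8 * 1024   # bytes
pte_size = 8           # bytes
entries = page_size // pte_size            # 1,024 PTEs per page table page

vpns = entries ** 3                        # every possible virtual page number
mapped_space = vpns * page_size            # VA mapped by all those VPNs: 8 TB
pt_space = vpns * pte_size                 # PTEs needed to map them all: 8 GB
one_l1pte_maps = entries ** 2 * page_size  # VA covered by a single L1PTE: 8 GB

assert mapped_space // pt_space == page_size // pte_size  # ratio = page size / 8
assert pt_space == one_l1pte_maps == 8 * 2**30
```

The final assertion is the bullet's conclusion: one highest-level PTE covers exactly the 8 GB that the full set of page tables occupies.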
• The mapping of PT space involves simply choosing one of the highest-level PTEs and forcing it to self-map.
• No additional system tuning or coding is required to accommodate a more widely implemented virtual address width in PT space. By definition of the self-map effect, the exact amount of virtual address space required will be available, no more and no less.
• It is easy to locate a given PTE. The address of a PTE becomes an efficient function of the address that the PTE maps. The function first clears the byte-within-page bit field of the subject virtual address and then shifts the remaining virtual address bits such that the Segment 1, 2, and 3 bit field values (Figure 2) now reside in the corresponding next-lower bit field positions. The function then writes (and sign extends if necessary) the vacated Segment 1 field with the index of the self-mapper PTE. The result is the address of the PTE that maps the original virtual address. Note that this algorithm also works for addresses within PT space, including that of the self-mapper PTE itself.
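The lookup function just described can be sketched directly, again under the Figure 2 assumptions (10-bit segments, 13-bit byte field, self-mapper index 3FE); the function name is ours.

```python
SELF_INDEX = 0x3FE   # index of the chosen self-mapper L1PTE
VA_BITS = 43         # implemented virtual address bits

def pte_address(va):
    """Return the address of the PTE that maps virtual address va."""
    va &= (1 << VA_BITS) - 1          # keep the implemented address bits
    vpn = va >> 13                    # clear byte-within-page; segments shift down
    addr = (SELF_INDEX << 33) | (vpn << 3)   # vacated Segment 1 := self-map index
    if addr & (1 << (VA_BITS - 1)):          # sign-extend to 64 bits
        addr |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return addr

assert pte_address(0) == 0xFFFFFFFC00000000                   # PT_Base maps VA 0
assert pte_address(0xFFFFFFFC00000000) == 0xFFFFFFFDFF000000  # works on PT space
assert pte_address(0xFFFFFFFDFF7FC000) == 0xFFFFFFFDFF7FDFF0  # finds the self-mapper
```

Applying the function to PT-space addresses simply walks one level higher, which is why it terminates at the self-mapper PTE.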
• Process page table residency in virtual memory is achieved without imposing on the capacity of shared space. That is, there is no longer a need to map the process page tables into shared space. Such a mapping would be redundant and wasteful.
OpenVMS 64-bit Virtual Address Space
With this page table residency strategy in hand, it became possible to finalize a 64-bit virtual address layout for the OpenVMS operating system. A self-mapper PTE had to be chosen. Consider again the highest level of page table in a given process' page table structure (Figure 4). The first PTE in that page table maps a section of virtual memory that includes P0 and P1 spaces. This PTE was therefore unavailable for use as a self-mapper. The last PTE in that page table maps a section of virtual memory that includes S0/S1 space. This PTE was also unavailable for self-mapping purposes.
All the intervening high-level PTEs were potential choices for self-mapping the page tables. To maximize the size of process-private space, the correct choice is the next-lower PTE than the one that maps the lowest address in shared space.
This choice is implemented as a boot-time algorithm. Bootstrap code first determines the size required for OpenVMS shared space, calculating the corresponding number of high-level PTEs. A sufficient number of PTEs to map that shared space are allocated from the high-order end of a given process' highest-level page table page. Then the next-lower PTE is allocated for self-mapping that process' page tables. All remaining lower PTEs are available for mapping process-private space. In practice, nearly all the PTEs are available, which means that on today's systems, almost 8 TB of process-private virtual memory is available to a given OpenVMS process.
Figure 6 presents the final 64-bit OpenVMS virtual address space layout. The portion with the lower addresses is entirely process-private. The higher-addressed portion is shared by all process address spaces. PT space is a region of virtual memory that lies between the P2 and S2 spaces for any given process and at the same virtual address for all processes.
Note that PT space itself consists of a process-private and a shared portion. Again, consider Figure 5. The highest-level page table page, L1PT, is process-private. It is pointed to by the PTBR. (When a process' context is loaded, or made active, the process' PTBR value is loaded from the process' hardware privileged context block into the PTBR register, thereby making current the page table structure pointed to by that PTBR and the process-private address space that it maps.)
Figure 6
OpenVMS Alpha 64-bit Virtual Address Space
(Layout, low to high: P0 space, 00000000.00000000 through 00000000.7FFFFFFF; P1 space from 00000000.80000000; P2 space; page table space; S2 space; S0/S1 space, FFFFFFFF.80000000 through FFFFFFFF.FFFFFFFF. The process-private/shared boundary passes through page table space. Note that this drawing is not to scale.)
All higher-addressed page tables in PT space are used to map shared space and are themselves shared. They are also adjacent to the shared space that they map. All page tables in PT space that reside at addresses lower than that of the L1PT are used to map process-private space. These page tables are process-private and are adjacent to the process-private space that they map. Hence, the end of the L1PT marks a universal boundary between the process-private portion and the shared portion of the entire virtual address space, serving to separate even the PTEs that map those portions. In Figure 6, the line passing through PT space illustrates this boundary.
A direct consequence of this design is that the process page tables have been privatized. That is, the portion of PT space that is process-private is currently active in virtual memory only when the owning process itself is currently active on the processor.
Fortunately, the majority of page table references occur while executing in the context of the owning process. Such references actually are enhanced by the privatization of the process page tables . . . addresses, whether they are process-private or shared.
The shared portion of PT space could serve now as the sole location for shared-space PTEs. Being redundant, the original SPT is eminently discardable; however, discarding the SPT would create a massive compatibility problem for device drivers with their many 32-bit SPT references. This area is one in which an opportunity exists to preserve a significant degree of privileged code compatibility.
Key Measures Taken to Maximize Privileged Code Compatibility
To implement 64-bit virtual address space support, we altered central sections of the OpenVMS Alpha kernel and many of its key data structures. We expected that such changes would require compensating or corresponding source changes in surrounding privileged components within the kernel, in device drivers, and in privileged layered products.
For example, the previous discussion seems to indicate that any privileged component that reads or writes PTEs would now need to use 64-bit-wide pointers instead of 32-bit pointers. Similarly, all system fork threads and interrupt service routines could no longer count on direct access to process-private PTEs without regard to which process happens to be current at the moment.
A number of factors exacerbated the impact of such changes. Since the OpenVMS Alpha operating system originated from the OpenVMS VAX operating system, significant portions of the OpenVMS Alpha operating system and its device drivers are still written in MACRO-32 code, a compiled language on the Alpha platform.1 Because MACRO-32 is an assembly-level style of programming language, we could not simply change the definitions and declarations of various types and rely on recompilation to handle the move from 32-bit to 64-bit pointers. Finally, there are well over 3,000 references to PTEs from MACRO-32 code modules in the OpenVMS Alpha source pool. We were thus faced with the prospect of visiting and potentially altering each of these 3,000 references. Moreover, we would need to follow the register lifetimes that resulted from each of these references to ensure that all address calculations and memory references were done using 64-bit operations. We expected that this process would be time-consuming and error prone and that it would have a significant negative impact on our completion date.
Once OpenVMS Alpha version 7.0 was available to users, those with device drivers and privileged code of their own would need to go through a similar effort. This would further delay wide use of the release. For all these reasons, we were well motivated to minimize the impact on privileged code. The next four sections discuss techniques that we used to overcome these obstacles.
Resolving the SPT Problem
A significant number of the PTE references in privileged code are to PTEs within the SPT. Device drivers often double-map the user's I/O buffer into S0/S1 space by allocating and appropriately initializing system page table entries (SPTEs). Another situation in which a driver manipulates SPTEs is in the substitution of a system buffer for a poorly aligned or noncontiguous user I/O buffer that prevents the buffer from being directly used with a particular device. Such code relies heavily on the system data cell MMG$GL_SPTBASE, which points to the SPT.
The new page table design completely obviates the need for a separate SPT. Given an 8-KB page size and 8 bytes per PTE, the entire 2-GB S0/S1 virtual address space range can be mapped by 2 MB of PTEs within PT space. Because S0/S1 resides at the highest addressable end of the 64-bit virtual address space, it is mapped by the highest 2 MB of PT space. The arcs on the left in Figure 7 illustrate this mapping. The PTEs in PT space that map S0/S1 are fully shared by all processes, but they must be referenced with 64-bit addresses.
This incompatibility is completely hidden by the creation of a 2-MB "SPT window" over the 2 MB in PT space (level 3 PTEs) that maps S0/S1 space. The SPT window is positioned at the highest addressable end of S0/S1 space. Therefore, an access through the SPT window only requires a 32-bit S0/S1 address and can obtain any of the PTEs in PT space that map S0/S1 space. The arcs on the right in Figure 7 illustrate this access path.
The SPT window is set up at system initialization time and consumes only the 2 KB of PTEs that are needed to map 2 MB. The system data cell MMG$GL_SPTBASE now points to the base of the SPT window, and all existing references to that data cell continue to function correctly without change.7
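The window's sizes follow directly from the stated page and PTE sizes; the arithmetic below is our check, not a figure from the paper.

```python
KB, MB, GB = 2**10, 2**20, 2**30
page_size, pte_size = 8 * KB, 8

spt_bytes = (2 * GB // page_size) * pte_size            # PTEs mapping all of S0/S1
window_pte_bytes = (spt_bytes // page_size) * pte_size  # PTEs mapping those 2 MB

assert spt_bytes == 2 * MB         # 2 GB of S0/S1 -> 2 MB of level 3 PTEs
assert window_pte_bytes == 2 * KB  # the SPT window itself costs only 2 KB of PTEs
```

Each division by the page size and multiplication by the PTE size drops three orders of binary magnitude, which is why the window overhead is negligible.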
Providing Cross-process PTE Access for Direct I/O
The self-mapping of the page tables is an elegant solution to the page table residency problem imposed by the preceding design. However, the self-mapped page tables present significant challenges of their own to the I/O subsystem and to many device drivers.
Typically, OpenVMS device drivers for mass storage, network, and other high-performance devices perform direct memory access (DMA) and what OpenVMS calls "direct I/O." These device drivers lock down into physical memory the virtual pages that contain the requester's I/O buffer. The I/O transfer is performed directly to those pages, after which the buffer pages are unlocked, hence the term "direct I/O."
Figure 7
System Page Table Window
(Labels: PAGE TABLE SPACE (8 GB); PTEs THAT MAP S0/S1 (2 MB); S2 (6 GB); S0/S1 (2 GB); SPT WINDOW (2 MB))
The virtual address of the buffer is not adequate for device drivers because much of the driver code runs in system context and not in the process context of the requester. Similarly, a process-specific virtual address is meaningless to most DMA devices, which typically can deal only with the physical addresses of the virtual pages spanned by the buffer.
For these reasons, when the I/O buffer is locked into memory, the OpenVMS I/O subsystem converts the virtual address of the requester's buffer into (1) the address of the PTE that maps the start of the buffer and (2) the byte offset within that page to the first byte of the buffer.
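That conversion is a mask and a shift. The sketch below combines it with the self-map PTE lookup described earlier; the constants and names are ours, under the same 8-KB page and 3FE self-mapper assumptions.

```python
SELF_INDEX = 0x3FE
PAGE_BITS, VA_BITS = 13, 43

def buffer_descriptor(buf_va):
    """Return (address of the PTE mapping the buffer's first page, byte offset)."""
    offset = buf_va & ((1 << PAGE_BITS) - 1)            # byte within the first page
    vpn = (buf_va & ((1 << VA_BITS) - 1)) >> PAGE_BITS
    pte = (SELF_INDEX << 33) | (vpn << 3)               # PTE address in PT space
    if pte & (1 << (VA_BITS - 1)):                      # sign-extend to 64 bits
        pte |= ((1 << 64) - 1) & ~((1 << VA_BITS) - 1)
    return pte, offset

pte, offset = buffer_descriptor(0x2A30)   # a hypothetical P0-space buffer address
assert offset == 0xA30
assert pte == 0xFFFFFFFC00000008          # PTE of the second 8-KB page of P0 space
```

Once this pair is computed, the buffer's own virtual address, and therefore its pointer width, no longer matters to the drivers downstream.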
Once the virtual address of the I/O buffer is converted to a PTE address, all references to that buffer are made using the PTE address. This remains the case even if this I/O request and I/O buffer are handed off from one driver to another. For example, the I/O request may be passed from the shadowing virtual disk driver to the small computer systems interface (SCSI) disk class driver to a port driver for a specific SCSI host adapter. Each of these drivers will rely solely on the PTE address and the byte offset and not on the virtual address of the I/O buffer.
Therefore, the number of virtual address bits the requester originally used to specify the address of the I/O buffer is irrelevant. What really matters is the number of address bits that the driver must use to reference a PTE.
These PTE addresses were always within the page tables within the balance set slots in shared S0/S1 space. With the introduction of the self-mapped page tables, a 64-bit address is required for accessing any PTE in PT space. Furthermore, the desired PTE is not accessible using this 64-bit address when the driver is no longer executing in the context of the original requester process. This is called a cross-process PTE access problem.
In most cases, this access problem is solved for direct I/O by copying the PTEs that map the I/O buffer when the I/O buffer is locked into physical memory. The PTEs in PT space are accessible at that point because the requester process context is required in order to lock the buffer. The PTEs are copied into the kernel's heap storage, and the 64-bit PT space address is replaced by the address of the PTE copies. Because the kernel's heap storage remains in S0/S1 space, the replacement address is a 32-bit address that is shared by all processes on the system.
This copy approach works because drivers do not need to modify the actual PTEs. Typically, this arrangement works well because the associated PTEs can fit into dedicated space within the I/O request packet data structure used by the OpenVMS operating system, and there is no measurable increase in CPU overhead to copy those PTEs.
If the I/O buffer is so large that its associated PTEs cannot fit within the I/O request packet, a separate kernel heap storage packet is allocated to hold the PTEs. If the I/O buffer is so large that the cost of copying all the PTEs is noticeable, a direct access path is created as follows:
• The L3PTEs that map the I/O buffer are locked into physical memory.
• Address space within S0/S1 space is allocated and mapped over the L3PTEs that were just locked down.
This establishes a 32-bit addressable shared-space window over the L3PTEs that map the I/O buffer.
The essential point is that one of these methods is selected and employed until the I/O is completed and the buffer is unlocked. Each method provides a 32-bit PTE address that the rest of the I/O subsystem can use transparently, as it has been accustomed to doing, without requiring numerous, complex source changes.
Use of Self-identifying Structures
To accommodate 64-bit user virtual addresses, a number of kernel data structures had to be expanded and changed. For example, asynchronous system trap (AST) control blocks, buffered I/O packets
do its job. This is an instance of the cross-process PTE access problem mentioned earlier.
The swapper is unable to directly access the page tables of the process being swapped because the swapper's own page tables are currently active in virtual memory. We solved this access problem by revising the swapper to temporarily "adopt" the page tables of the process being swapped. The swapper accomplishes this by temporarily changing its PTBR contents to point to the page table structure for the process being swapped instead of to the swapper's own page table structure. This change forces the PT space of the process being swapped to become active in virtual memory and therefore available to the swapper as it prepares the process to be swapped. Note that the swapper can make this temporary change because the swapper resides in shared space. The swapper does not vanish from virtual memory as the PTBR value is changed. Once the process has been prepared for swapping, the swapper restores its own PTBR value, thus relinquishing access to the target process' PT space contents.
Thus, it can be seen how privileged code with intimate knowledge of OpenVMS memory management mechanisms can be affected by the changes to support 64-bit virtual memory. Also evident is that the alterations needed to accommodate the 64-bit changes are relatively straightforward. Although the swapper has a higher-than-normal awareness of memory management internal workings, extending the swapper to accommodate the 64-bit changes was not particularly difficult.
Immediate Use of 64-bit Addressing by the OpenVMS Kernel
Page table residency was certainly the most pressing issue we faced with regard to the OpenVMS kernel as it evolved from a 32-bit to a 64-bit-capable operating system. Once implemented, 64-bit virtual addressing could be harnessed as an enabling technology for solving a number of other problems as well. This section briefly discusses some prominent examples that serve to illustrate how immediately useful 64-bit addressing became to the OpenVMS kernel.
Page Frame Number Database and Very Large Memory
The OpenVMS Alpha operating system maintains a database for managing individual, physical page frames of memory, i.e., page frame numbers. This database is stored in S0/S1 space. The size of this database grows linearly as the size of the physical memory grows. Future Alpha systems may include larger memory configurations as memory technology continues to evolve. The corresponding growth of the page frame number database for such systems could consume an unacceptably large portion of S0/S1 space, which has a maximum size of 2 GB. This design effectively restricts the maximum amount of physical memory that the OpenVMS operating system would be able to support in the future.
We chose to remove this potential restriction by relocating the page frame number database from S0/S1 to 64-bit addressable S2 space. There it can grow to support any physical memory size being considered for years to come.
Global Page Table
The OpenVMS operating system maintains a data structure in S0/S1 space called the global page table (GPT). This pseudo-page table maps memory objects called global sections. Multiple processes may map portions of their respective process-private address spaces to these global sections to achieve protected shared memory access for whatever applications they may be running.
With the advent of P2 space, one can easily anticipate a need for orders-of-magnitude-greater global section usage. This usage directly increases the size of the GPT, potentially reaching the point where the GPT consumes an unacceptably large portion of S0/S1 space. We chose to forestall this problem by relocating the GPT from S0/S1 to S2 space. This move allows the configuration of a GPT that is much larger than any that could ever be configured in S0/S1 space.
Summary
Although providing 64-bit support was a significant amount of work, the design of the OpenVMS operating system was readily scalable such that it could be achieved practically. First, we established a goal of strict binary compatibility for nonprivileged applications. We then designed a superset virtual address space that extended both process-private and shared spaces while preserving the 32-bit visible address space to ensure compatibility. To maximize the available space for process-private use, we chose an asymmetric style of address space layout. We privatized the process page tables, thereby eliminating their residency in shared space. The few page table accesses that occurred from outside the context of the owning process, which no longer worked after the privatization of the page tables, were addressed in various ways. A variety of ripple effects stemming from this design were readily solved within the kernel.
Solutions to other scaling problems related to the kernel were immediately possible with the advent of 64-bit virtual address space. Already mentioned was the complete removal of the process page tables from shared space. We also removed the global page table and the page frame number database from 32-bit addressable to 64-bit addressable shared space. The immediate net effect of these changes was significantly more room in S0/S1 space for configuring more kernel heap storage, more balance slots to be assigned to greater numbers of memory resident processes, etc. We further anticipate use of 64-bit addressable shared space to realize additional benefits of VLM, such as for caching massive amounts of file system data.
Providing 64-bit addressing capacity was a logical, evolutionary step for the OpenVMS operating system. Growing numbers of customers are demanding the additional virtual memory to help solve their problems in new ways and to achieve higher performance. This has been especially fruitful for database applications, with substantial performance improvements already proved possible by the use of 64-bit addressing on the Digital UNIX operating system. Similar results are expected on the OpenVMS system. With terabytes of virtual memory and many gigabytes of physical memory available, entire databases may be loaded into memory at once. Much of the I/O that otherwise would be necessary to access the database can be eliminated, thus allowing an application to improve performance by orders of magnitude, for example, to reduce query time from eight hours to five minutes. Such performance gains were difficult to achieve while the OpenVMS operating system was constrained to a 32-bit environment. With the advent of 64-bit addressing, OpenVMS users now have a powerful enabling technology available to solve their problems.
Acknowledgments
The work described in this paper was done by members of the OpenVMS Alpha Operating System Development group. Numerous contributors put in many long hours to ensure a well-considered design and a high-quality implementation. The authors particularly wish to acknowledge the following major contributors to this effort: Tom Benson, Richard Bishop, Walter Blaschuk, Nitin Karkhanis, Andy Kuehnel, Karen Noel, Phil Norwich, Margie Sherlock, Dave Wall, and Elinor Woods. Thanks also to members of the Alpha languages community who provided extended programming support for a 64-bit environment; to Wayne Cardoza, who helped shape the earliest notions of what could be accomplished; to Beverly Schultz, who provided strong, early encouragement for pursuing this project; and to Ron Higgins and Steve Noyes, for their spirited and unflagging support to the very end.
The following reviewers also deserve thanks for the invaluable comments they provided in helping to prepare this paper: Tom Benson, Cathy Foley, Clair Grant, Russ Green, Mark Howell, Karen Noel, Margie Sherlock, and Rod Widdowson.
References and Notes
1. N. Kronenberg, T. Benson, W. Cardoza, R. Jagannathan, and B. Thomas, "Porting OpenVMS from VAX to Alpha AXP," Digital Technical Journal, vol. 4, no. 4 (1992): 111-120.
2. T. Leonard, ed., VAX Architecture Reference Manual (Bedford, Mass.: Digital Press, 1987).
3. R. Sites and R. Witek, Alpha AXP Architecture Reference Manual, 2d ed. (Newton, Mass.: Digital Press, 1995).
4. Although an OpenVMS process may refer to P0 or P1 space using either 32-bit or 64-bit pointers, references to P2 space require 64-bit pointers. Applications may very well execute with mixed pointer sizes. (See reference 8 and D. Smith, "Adding 64-bit Pointer Support to a 32-bit Run-time Library," Digital Technical Journal, vol. 8, no. 2 [1996, this issue]: 83-95.) There is no notion of an application executing in either a 32-bit mode or a 64-bit mode.
5. Superset system services and language support were added to facilitate the manipulation of 64-bit addressable P2 space.8
6. This mechanism has been in place since OpenVMS Alpha version 1.0 to support virtual PTE fetches by the translation buffer miss handler in PALcode. (PALcode is the operating system-specific privileged architecture library that provides control over interrupts, exceptions, context switching, etc.3) In effect, this means that the OpenVMS page tables already existed in two virtual locations, namely, S0/S1 space and PT space.
7. The SPT window is more precisely only an S0/S1 PTE window. The PTEs that map S2 space
Biographies

Michael S. Harvey
Michael Harvey joined Digital in 1978 after receiving his B.S.C.S. from the University of Vermont. In 1984, as a member of the OpenVMS Engineering group, he participated in new processor support for VAX multiprocessor systems and helped develop OpenVMS symmetric multiprocessing (SMP) support for these systems. He received a patent for this work. Mike was an original member of the RISCy-VAX task force, which conceived and developed the Alpha architecture. Mike led the project that ported the OpenVMS Executive from the VAX to the Alpha platform and subsequently led the project that designed . . .
. . . the OpenVMS I/O engineering team, he joined Digital . . . the OpenVMS high-level language device driver project, contributed to the port of the OpenVMS operating system to the Alpha
The OpenVMS Mixed Pointer Size Environment

A central goal in the implementation of 64-bit addressing on the OpenVMS operating system was to provide upward-compatible support for applications that use the existing 32-bit address space. Another guiding principle was that mixed pointer sizes are likely to be the rule rather than the exception for applications that use 64-bit address space. These factors drove several key design decisions in the OpenVMS Calling Standard and programming interfaces, the DEC C language support, and the system services support. For example, self-identifying 64-bit descriptors were designed to ease development when mixed pointer sizes are used. DEC C support makes it easy to mix pointer sizes and to recompile for uniform 32- or 64-bit pointer sizes. OpenVMS system services remain fully upward compatible, with new services defined only where required or to enhance the usability of the huge 64-bit address space. This paper describes the approaches taken to support the mixed pointer size environment in these areas. The issues and rationale behind these OpenVMS and DEC C solutions are presented to encourage others who provide library interfaces to use a consistent programming interface approach.
Digital Technical Journal Vol. 8 No. 2 1996
Thomas R. Benson<br />
Karen L. Noel<br />
Richard E. Peterson<br />
Support for 64-bit virtual addressing on the OpenVMS Alpha operating system, version 7.0, has vastly increased the amount of virtual address space available to applications. The following sections explore the reasons for mixing pointer sizes.
OpenVMS and Language Support
Implementation choices that Digital made for this first release of the OpenVMS operating system that supports 64-bit virtual addressing will probably encourage mixed pointer size programming. These choices were driven largely by the need for absolute upward compatibility for existing programs and the goal of supporting large, dynamic data sets as the primary application for 64-bit addressing.
Dynamic Data  Only OpenVMS services support dynamic allocation of 64-bit address space. This mechanism most closely resembles the malloc and free functions for allocating and deallocating dynamic storage in the C programming language. Allocation of this type differs from static and stack storage in that explicit source statements are required to manage it. For static and stack storage, the system is allocating the memory on behalf of the application at image activation time. (Of course, the allocation may be extended during execution in the case of stack storage.) This allocation continues to be from 32-bit addressable space.
Two special cases of static allocation are worth mentioning. Linkage sections, which are program sections that contain routine linkage information, and code sections, which contain the executable instructions, do not differ substantially from preinitialized static storage. As a result, these sections also reside only in 32-bit addressable memory.
Upward-compatibility Constraints  The OpenVMS Alpha operating system is cautious to avoid using 64-bit memory freely where it may prevent upward compatibility for 32-bit applications. For example, the linkage section might seem to be a natural candidate for the OpenVMS system to allocate automatically in 64-bit memory. This allocation would essentially free more 32-bit addressable memory for application use; however, even if this were done only for applications relinked for new versions of the OpenVMS operating system, there is no guarantee that all object code treats linkage section addresses as 64 bits in width. A simple example is storing the address of a routine in a structure. Since a routine's address is the address of its procedure descriptor in the linkage section, moving the linkage section to 64-bit memory would cause code that stores this address in a 32-bit cell to fail.
Allocating the user stack in 64-bit space also appears to be a good opportunity to easily increase the amount of memory available to an application. Stack addresses are often more visible to application code than linkage section addresses are. For instance, a routine can easily allocate a local variable using temporary storage on the stack and pass the address of the variable to another routine. If the stack is moved to 64-bit space, this address quietly becomes a 64-bit address. If the called routine is not 64-bit capable, attempts to use the address will fail.
Focus on Services Required for Large Data Sets  Not all system services could be changed to support 64-bit addresses (i.e., promoted) in time for the first version of the OpenVMS operating system to support 64-bit addressing. With the mixed-pointer model in mind, we focused on those services that were likely to be required for large data sets. For example, to allow I/O directly to and from high memory, it was essential that the I/O queuing service, SYS$QIO, accept a 64-bit buffer address. Conversely, the SYS$TRNLNM service for translating a logical name did not need to be modified to accept 64-bit addresses. Its arguments include a logical name, a table name, and a vector that contains requests for information about the name. These are small data elements that are unlikely to require 64-bit addressing on their own. Of course, they may be part of some larger structure that resides in 64-bit space. In this case, they can easily be copied to or from 32-bit addressable memory.
System services are discussed further in the section OpenVMS System Services. The 32-bit address restriction on certain system services again emphasizes the importance of being able to logically separate large data set support from the rest of an application.
Limited Language Support  Another interface point that requires care when using 64-bit addressing is at calls between modules written in different programming languages. The OpenVMS Calling Standard traditionally makes it easy to mix languages in an application, but DEC C is the only high-level language to fully support 64-bit addresses in the first 64-bit-capable version of the OpenVMS operating system.2 The use of 64-bit addresses in mixed-language applications is possible, and data that contains 64-bit addresses may even be shared; however, references that actually use the data pointed to by these addresses need to be limited to DEC C code or assembly language. Mixed high-level language applications are certain to be mixed pointer size applications in this version of the operating system.
Support for 32-bit Libraries
Many applications rely on library packages to provide some aspect of their functionality. Typical examples include user interface packages, graphics libraries, and database utilities. Third-party libraries may or may not support 64-bit addresses. Applications that use these libraries will probably mix 32-bit and 64-bit pointer sizes and will therefore require an operating system that supports mixed pointer sizes.
Implications of Full 64-bit Conversion
For some applications, it may be desirable to mix pointer sizes to avoid the side effects of universal 64-bit address conversion. The approach of recompiling everything with 64-bit address widths is sometimes called "throwing the switch." An obvious implication of throwing the switch is that all pointer data doubles in size. For complex linked data structures, this can be a significant overall increase in size. Increasing the pointer size may also reveal hidden dependencies on pointer size being the same as integer size. If code accesses a cell as both a 32-bit integer and a 32-bit pointer, the code will no longer work if the pointer is enlarged. Thus, universal conversion may not be appropriate for every application.
structure-specific information. Typically, descriptors are used only for character strings, arrays, and complex data types such as packed decimal.
Descriptor types are by definition self-identifying by virtue of the type and class fields they contain. An obvious choice, therefore, for extending descriptors to handle 64-bit addresses would be to add new type constants for 64-bit data elements and extend the structure beyond the type fields to accommodate larger addresses and sizes. In practice, however, the address and length fields of descriptors are frequently used without accessing the type fields, particularly when a character string descriptor is expected.
As a result, a solution was sought that would yield a predictable failure, rather than incorrect results or data corruption, when a 64-bit descriptor is received by a routine that expects only the 32-bit form. The final design includes a separate 64-bit descriptor layout that contains two special fields at the same offsets as the length and address fields in the 32-bit descriptor. These fields are called MBO (must be one) and MBMO (must be minus one), respectively. The simplest versions of the 32-bit and 64-bit descriptors are illustrated in Figure 1. These self-identifying fields are key to ensuring robust interfaces in the mixed pointer size environment.
DEC C Language Support for Mixed Pointer Sizes
To support application programming in the mixed pointer size environment, some design work was required in the DEC C compiler. This section describes the rationale behind the final design.
It was clear that the compiler would have to provide a way for 32-bit and 64-bit pointers to coexist in the same regions of code. At the same time, customers and
internal users initially favored a simple command line switch when polled on potential compiler support for 64-bit address space. (At least one C compiler that supports 64-bit addressing, MIPSpro C, does so only through command line switches for setting pointer sizes.3) The motivation for using switches was to limit the source changes needed to take advantage of the additional address space, especially when portability to other platforms is desired. For cases in which mixing pointer sizes was unavoidable, something more flexible than a switch was needed.
Why Not _near and _far?
The most common suggestion for controlling individual pointer declarations was to adopt the _near and _far type qualifier syntax used in the PC environment in its transition from 16-bit to 32-bit addressing.4 While this idea has merit in that it has already been used elsewhere in C compilers and is familiar to PC software developers, we rejected this approach for the following reasons:
• The syntax is not standard.
• The syntax requires source code edits at each declaration to be affected.
• The syntax has become largely obsolete even in the PC domain with the acceptance of the flat 32-bit address space model offered by modern 386-minimum PC compilers and the Win32 programming interface.
• Because of the vast difference in scale in choosing between 16-bit or 32-bit pointers on a PC as compared to choosing between 32-bit or 64-bit pointers on an Alpha system, there would be no porting benefit in using the same keywords. No existing source code base would be able to port to the OpenVMS mixed pointer size environment more easily because of the presence of _near and _far qualifiers.
Pragma Support
The Digital UNIX C compiler had previously defined pragma preprocessing directives to control pointer sizes for slightly different reasons than those described for the OpenVMS system.5 By default, the Digital UNIX operating system offers a pure 64-bit addressing model. In some circumstances, however, it is desirable to be able to represent pointers in 32 bits to match externally imposed data layouts or, more rarely, to reduce the amount of memory used in representing pointer values. The Digital UNIX pointer_size pragmas work in conjunction with command line options and linker/loader features that limit memory use and map memory such that pointer values accessible to the C program can always be represented in 32 bits.
Since compatibility with the Digital UNIX compiler would have greater value if it met the needs of the OpenVMS platform, we evaluated the pragma-based
approach and decided to adopt it, propagating any necessary changes back to the UNIX platform to maintain compatibility. The decision to use pragmas to control pointer size addressed the major deficiencies of the _near and _far approach. In particular, the pragma directive is specified by ISO/ANSI C in such a way that using it does not compromise portability as the use of additional keywords can, because unrecognized pragmas are ignored. Furthermore, pragmas can easily be specified to apply to a range of source code rather than to an individual declaration. A number of DEC C pragmas, including the pointer size controls implemented on the UNIX system, provide the ability to save and restore the state of the pragma. This makes them convenient and safe to use to modify the pointer size within a particular region of code without disturbing the surrounding region. The state may easily be saved before changing it at the beginning of the region and then restored at the end.
Command Line Interaction
Pragmas fit in with the initial desire of prospective users to have a simple command line switch to indicate 64 bits. As with several other pragmas, we defined a command line qualifier (/pointer_size) to specify the initial state of the pragma before any instances are encountered in the text. Unlike other pragmas, though, we also use the same command line qualifier to enable or disable the action of the pragmas altogether. In this way, a default compilation of source code modified for 64-bit support behaves the same way that it would on a system that did not offer 64-bit support. That is, the pragmas are effectively ignored, with only an informational message produced.
This behavior was adopted for consistency with the Digital UNIX behavior and also to aid in the process of adding optional 64-bit support to existing portable 32-bit source code that might be compiled for an older system or with an older compiler. In this model, a compilation of new source code using an old command line produces behavior that is equivalent to the behavior produced using an older compiler or a compiler on another platform. With one notable exception, building an application that actually uses 64-bit addressing requires changing the command line.
The exception to the rule that existing 32-bit build procedures do not create 64-bit dependencies is a second form of the pragma, named required_pointer_size. This form contrasts with the form pointer_size in that it is always active regardless of command line qualifiers; otherwise, required_pointer_size and pointer_size are identical. The intent of this second pragma is to support writing source code that specifies or interfaces to services or libraries that can only work correctly with 64-bit pointers. An example of this code might be a header file that contains declarations for both 64-bit and 32-bit memory management services; the services must always be defined to accept and return the appropriate pointer size, regardless of the command line qualifier used in the compilation.
Pragma Usage
The use of pragmas to control pointer sizes within a range of source code fits well with the model of starting with a working 32-bit application and extending it to exploit 64-bit addressing with minimal source code edits. Programming interface and data structure declarations are typically packaged together in header files, and the primary manipulators of those data structures are often implemented together in modules.
One good approach for extending a 32-bit application would be to start with an initial analysis of memory usage measurements. The purpose of this analysis would be to produce a rough partitioning of routines and data structures into two categories: "32-bit sufficient" and "64-bit desirable." Next, 64-bit pointer pragmas could be used to enclose just the header files and source modules that correspond to the routines and data structures in the "64-bit desirable" category.
Allocation Function Mapping  The command line qualifier setting the default pointer size has an additional effect that simplifies the use of 64-bit address space. If an explicit pointer size is specified on the command line, the malloc function is mapped to a routine specific to the address space for that size. For example, _malloc64 is used for malloc when the default pointer size is 64 bits. This allows allocation of 64-bit address space without additional source changes. The source code may also call the size-specific versions of run-time routines explicitly, when compiled for mixed pointer sizes. These size-specific functions are available, however, only when the /pointer_size command line qualifier is used. See "Adding 64-bit Pointer Support to a 32-bit Run-time Library" in this issue for a discussion of other effects of 64-bit addressing on the C run-time library.
Header File Semantics  The treatment of pointer_size pragmas in and around header files (i.e., any source included by the #include preprocessing directive) deserves special mention. Programs typically include both private definition files and public or system-specific header files. In the latter case, it may not be desirable for definitions within the header files to be affected by the pointer_size pragmas or command line currently in effect. To prevent these definitions from being affected, the DEC C compiler searches for special prologue and epilogue header files when a #include directive is processed. These files may be used to establish a particular state for environmental pragmas, such as pointer_size, for all header files in the directory. This eliminates the need to modify either the individual header files or the source code that includes them.
The compiler creates a predefined macro called __INITIAL_POINTER_SIZE to indicate the initial pointer size as specified on the command line. This may be of particular use in header files to determine what pointer size should be used, if mixed pointer size support is desirable. Conditional compilation based on this macro's definition state can be used to set or override pointer size or to detect compilation by an older compiler lacking pointer-size support. If its value is zero, no /pointer_size qualifier was specified, which means that pointer_size pragmas do not take effect. If its value is 32 or 64, pointer_size pragmas do take effect, so it can be assumed that mixed pointer sizes are in use.
Code Example
In the simple code example shown in Figure 3, suppose that the routine proc1 is part of a library that has been only partially promoted to use 64-bit addresses. This function may receive either a 32-bit address or a 64-bit address in the argument_ptr parameter. To demonstrate the use of the new DEC C features, proc1 has been modified to copy this character string parameter from 64-bit space to 32-bit space when necessary, so that routines that proc1 subsequently calls need to deal with only 32-bit addresses.
The __INITIAL_POINTER_SIZE macro is used to determine if pointer_size pragmas will be effective and, hence, whether argument_ptr might be 64 bits in width. If it might be a 64-bit pointer, whose actual width is unknown in this example, the pointer's value is copied to a 32-bit-wide pointer. The pointer_size pragma is used to change the current pointer size to 32 bits to declare the temporary pointer. Next, the two pointer values are compared to determine if the original pointer fits in 32 bits. If the pointer does not fit, temporary storage in 32-bit addressable space is allocated, and the argument is copied there. Note that the example uses _malloc32 rather than malloc, because malloc would allocate 64-bit address space if the initial pointer size was 64 bits. At the end of the routine, the temporary space is freed, if necessary.
A type cast is used in the assignment from argument_ptr to temp_short_ptr, even though both variables are of type char *. Without this type cast, if argument_ptr is a 64-bit-wide pointer, the DEC C compiler would report a warning message because of the potential data loss when assigning from a 64-bit to a 32-bit pointer.
For other examples of pointer_size pragmas and the use of the __INITIAL_POINTER_SIZE macro, see Duane Smith's paper on 64-bit pointer support in run-time libraries.
OpenVMS System Services
The OpenVMS operating system provides a suite of services that perform a variety of basic operating system functions.7 Design work was required to maximize the utility of these routines in the new mixed pointer size environment. Issues that needed to be addressed included the following, which are discussed in subsequent sections:
• Several services pass pointers by reference and, hence, required an interface change.
• Because of resource constraints, not all system services could be promoted to handle 64-bit addresses in the first version of the 64-bit-capable OpenVMS operating system.
• Since the services provide mixed levels of support, it is important to indicate those that support 64-bit addresses and those that do not.
• Certain new services seemed desirable to improve the usability of 64-bit address space.
Services That Are 64-bit Friendly
Services that can be promoted to support 64-bit addresses without any interface change are called 64-bit friendly. If a service receives an address by reference, the service is typically not 64-bit friendly, and a separate
Figure 3
Code Example of Pointer_size Pragmas and the __INITIAL_POINTER_SIZE Macro

#include <stdlib.h>
#include <string.h>

void proc1(char *argument_ptr)
{
#if __INITIAL_POINTER_SIZE != 0
#pragma pointer_size save
#pragma pointer_size 32
    char *temp_short_ptr;
#pragma pointer_size restore

    temp_short_ptr = (char *)argument_ptr;
    if (temp_short_ptr != argument_ptr) {
        temp_short_ptr = _malloc32(strlen(argument_ptr) + 1);
        strcpy(temp_short_ptr, argument_ptr);
        argument_ptr = temp_short_ptr;
    }
    else {
        temp_short_ptr = 0;
    }
#endif

/*
    The actual body of proc1 is omitted.  Assume that it calls
    routines that operate on the data pointed to by argument_ptr
    and that the routines are not yet prepared to handle 64-bit
    addresses.
*/

#if __INITIAL_POINTER_SIZE != 0
    if (temp_short_ptr != 0)
        free(temp_short_ptr);
#endif
}
entry point is required to support 64-bit addresses. A single routine cannot distinguish whether the address at the specified location is 32 bits or 64 bits in width.
If a service does not receive or return an address by reference, the service is usually 64-bit friendly. Even descriptor arguments present no problem, because the 32- and 64-bit versions can be distinguished at run time. The majority of services fall into this category.
The services that are not 64-bit friendly include the entire suite of memory management system services, since they access address ranges passed by reference. Other such services include those that receive a 32-bit vector as an argument, which may include the address of a pointer as an element. A good example from this group is SYS$FAOL, which accepts a 32-bit vector argument for formatted output. For all these services, new interfaces were designed to accommodate 64-bit callers.
Promotion of Services<br />
The Open VMS project team explored the idea of pro·<br />
moting all system services to support 64 -bit addresses.<br />
Since the majority of OpcnVtvlS system service<br />
ro utines an: 1\'rittcn in the l'viAC R0-32 assembly language<br />
or the Bliss-32 programming language, the<br />
internals of the routines could not be promoted to<br />
handle 64-bit addresses without moditications. We<br />
could not take advantage of the throw-the-swirch<br />
approach, and we did not want to because many<br />
pointers used internally in the OpcnVMS operating<br />
system remain at 32 bits.<br />
We considered using 64-bit jacket routines to copy 64-bit arguments to the stack in 32-bit space, which would then call the 32-bit internal routine to perform the work. We rejected this approach for several reasons. First, the jackets would need to be custom written to ensure correct parameter semantics. There could not be a "common jacket" that could have saved time and lowered risk. Second, there would be an undesirable performance impact for 64-bit callers. Third, we decided that having a complete 64-bit system service suite was not essential for usable 64-bit support. We could define a subset that would meet the needs of 64-bit address space users, while lowering our risk and implementation costs.
The services selected for 64-bit support fall into four categories.
1. Memory management services.
2. Performance-critical services. This group includes services that are typically sensitive to the addition of even a few cycles of execution time. Requiring that a 64-bit address user do any additional work, such as copying data to 32-bit space, is undesirable. An example of this type of service is SYS$ENQ, which is used for queuing lock requests.
3. Design center services. The primary design center for 64-bit support was database applications. Database architects and consultants were polled to determine which services were most needed by their products. Many of these services, for example SYS$QIO for queuing I/O requests, were also in the performance-critical set.
4. Other useful basic services. This set includes services to ease the transition to 64 bits with minimal change to program structure. For example, the SYS$CMKRNL service accepts a routine address and a vector of 32-bit arguments
interfaces than the 32-bit-only SYS$CRMPSC. The original SYS$CRMPSC is still present so that existing code may function without change.
Some new feature requests were considered as part of the 64-bit effort, but, to maintain the focus of the release, these features were not implemented. The 64-bit memory management services were designed to more easily accommodate new features in the future. For example, the new services check the argument count for both too many and too few supplied arguments. In this way, new optional arguments can be added later to the end of the list without jeopardizing backward compatibility.
Virtual Regions
One new feature that was added to the suite of 64-bit memory management services is support for new entities called virtual regions. A virtual region is an address range that is reserved by a program for future dynamic allocation requests. The region is similar in concept to the program region (P0) and the control region (P1), which have long existed on the OpenVMS operating system. A virtual region differs from the program and control regions in that it may be defined by the user by calling a system service and may exist within P0, P1, or the new 64-bit addressable process-private space, P2. When a virtual region is created, a handle is returned that is subsequently used to identify the region in memory management requests.
Address space within virtual regions is allocated in the same manner as in the default P0, P1, and P2 regions, with allocation defined to expand space toward either ascending or descending addresses. As in the default regions, allocation is in multiples of pages. The OpenVMS operating system keeps track of the first free virtual address within the region. A region can be created such that address space is created automatically when a virtual reference is made within the region, just as the control region in P1 space expands automatically to accommodate user stack expansion.
When a virtual region is created within P0, P1, or P2, the remainder of that containing region decreases in size so that it does not overlap with the virtual region.
Virtual regions were added to the OpenVMS Alpha operating system along with the 64-bit addressing capability so that the huge expanse of 64-bit address space could be more easily managed. If a subsystem requires a large portion of virtually contiguous address space, the space can be reserved within P2 with little overhead. Other subsystems within the application cannot inadvertently interfere with the contiguity of this address space. They may create their own regions or create address space within one of the default regions.
Another advantage of using virtual regions is that they are the most efficient way to manage sparse address space within the 64-bit P2 space. Furthermore, no quotas are charged for the creation of a virtual region. The internal storage for the description of the region comes from process address space, which is the only resource used.
Summary
This paper presents the reasons behind the new OpenVMS mixed pointer size environment and the support added to allow programming within this environment. The discussion touches on some of the new support designed to simplify the use of the 64-bit address space.
The approaches discussed yielded full upward compatibility for 32-bit applications, while allowing other applications access to the huge 64-bit address space for data sets that require it. Promotion of all pointers to 64-bit width is not required to use 64-bit space; the mixed pointer size environment was considered paramount in all design decisions. A case study of adding 64-bit support to the C run-time library also appears in this issue of the Journal.
Acknowledgments
The authors wish to thank the other members of the 64-bit Alpha-L team who helped shape many of the ideas presented in this paper: Mark Arsenault, Gary Barton, Barbara Benton, Ron Brender, Ken Cowan, Mark Davis, Mike Harvey, Lon Hilde, Duane Smith, Cheryl Stocks, Lenny Szubowicz, and Ed Vogel.
References
1. M. Harvey and L. Szubowicz, "Extending OpenVMS for 64-bit Addressable Virtual Memory," Digital Technical Journal, vol. 8, no. 2 (1996, this issue): 57-71.
2. OpenVMS Calling Standard (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBBA-TE, 1995).
3. MIPSpro 64-Bit Porting and Transition Guide, Document No. 007-2391-002 (Mountain View, Calif.: Silicon Graphics, Inc., 1996).
4. C lclllf:!JWi
7. OpenVMS System Services Reference Manual: A-GETMSG (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBMA-TE, 1995) and OpenVMS System Services Reference Manual: GETQUI-Z (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSBN-TE, 1995).
8. OpenVMS Alpha Guide to 64-Bit Addressing (Maynard, Mass.: Digital Equipment Corporation, Order No. AA-QSIICA-TE, 1995).
9. T. Leonard, ed., VAX Architecture Reference Manual (Bedford, Mass.: Digital Press, 1987).
Biographies

Thomas R. Benson
A consulting engineer in the OpenVMS Engineering Group, Tom Benson is one of the developers of 64-bit addressing support. Tom joined Digital's VAX Basic project in 1979 after receiving B.S. and M.S. degrees in computer science from Syracuse University. After working on an optimizing compiler shell used by several VAX compilers, he joined the VMS Group where he led the VMS DECwindows FileView and Session Manager projects, and brought the Xlib graphics library to the VMS operating system. Tom holds three patents on the design of the VAX MACRO-32 compiler for Alpha and recently applied for two patents related to 64-bit addressing work.

Karen L. Noel
A principal engineer in the OpenVMS Engineering Group, Karen Noel is one of the developers of 64-bit
Adding 64-bit Pointer Support to a 32-bit Run-time Library

A key component of delivering 64-bit addressing on the OpenVMS Alpha operating system, version 7.0, is an enhanced C run-time library that allows application programmers to allocate and utilize 64-bit virtual memory from their C programs. This C run-time library includes modified programming interfaces and additional new interfaces yet ensures upward compatibility for existing applications. The same run-time library supports applications that use only 32-bit addresses, only 64-bit addresses, or a combination of both. Source code changes are not required to utilize 64-bit addresses, although recompilation is necessary. The new techniques used to analyze and modify the interfaces are not specific to the C run-time library and can serve as a guide for engineers who are enhancing their programming interfaces to support 64-bit pointers.
Duane A. Smith
The OpenVMS Alpha operating system, version 7.0, has extended the address space accessible to applications beyond the traditional 32-bit address space. This new address space is referred to as 64-bit virtual memory and requires a 64-bit pointer for memory access.1 The operating system has an additional set of new memory allocation routines that allows programs to allocate and release 64-bit memory. In OpenVMS Alpha version 7.0, this set of routines is the only mechanism available to acquire 64-bit memory.
For application programs to take advantage of these new OpenVMS programming interfaces, high-level programming languages such as C had to support 64-bit pointers. Both the C compiler and the C run-time library required changes to provide this support. The compiler needed to understand both 32-bit and 64-bit pointers, and the run-time library needed to accept and return such pointers.
The compiler has a new qualifier called /pointer_size, which sets the default pointer size for the compilation to either 32 bits or 64 bits. Also added to the compiler are pragmas (directives) that can be used within the source code to change the active pointer size. An application program is not required to compile each module using the same /pointer_size qualifier; some modules may use 32-bit pointers while others use 64-bit pointers. Benson, Noel, and Peterson describe these compiler enhancements.2 The DEC C User's Guide for OpenVMS Systems documents the qualifier and the pragmas.3
The C run-time library added 64-bit pointer support either by modifying the existing interface to a function or by adding a second interface to the same function. Public header files define the C run-time library interfaces. These header files contain the publicly accessible function prototypes and structure definitions. The library documentation and header files are shipped with the C compiler; the C run-time library ships with the operating system.
This paper discusses all phases of the enhancements to the C run-time library, from project concepts through the analysis, the design, and finally the implementation. The DEC C Run-Time Library Reference Manual for OpenVMS Systems contains user documentation regarding the changes.4
Digital Technical Journal Vol. 8 No. 2 1996
Starting the Project
We devoted the initial two months of the project to understanding the overall OpenVMS presentation of 64-bit addresses and deciding how to present 64-bit enhancements to customers. Representatives from OpenVMS engineering, the compiler team, the run-time library team, and the OpenVMS Calling Standard team met weekly with the goal of converging on the deliverables for OpenVMS Alpha version 7.0.
The project team was committed to preserving both source code compatibility and the upward compatibility aspects of shareable images on the OpenVMS operating system. Early discussions with application developers reinforced our belief that the OpenVMS system must allow applications to use 32-bit and 64-bit pointers within the same application. The team also agreed that for a mixed-pointer application to work effectively, a single run-time library would need to support both 32-bit and 64-bit pointers; however, there was no known precedent for designing such a library.
One implication of the decision to design a run-time library that supported 32-bit and 64-bit pointers was that the library could never return an unsolicited 64-bit pointer. Returning a 64-bit pointer to an application that was expecting a 32-bit pointer would result in the loss of one half of an address. Although typically this error would cause a hardware exception, the resulting address could be a valid address. Storing to or reading from such an address could result in incorrect behavior that would be difficult to detect.
The OpenVMS Calling Standard specifies that arguments passed to a function be 64-bit values. If a 32-bit address is passed to a function, the compiler-generated code sign extends the value into a 64-bit argument. After examining the library's interfaces, the team placed the functions into four categories (listed in order of added complexity to library users):
1. Not affected by the size of a pointer
2. Enhanced to accept both pointer sizes
3. Duplicated to have a 64-bit-specific interface
4. Restricted from using 64-bit pointers
One last point to come from the meetings was that many of the C run-time library interfaces are implemented by calling other OpenVMS images. For example, the Curses Screen Management interfaces make calls to the OpenVMS Screen Management (SMG) facility. It is important that the C run-time library defines the interfaces to support 64-bit addresses without looking at the implementation of the function. Consistency and completeness of the interface are more important than the complexity of the implementation. In the SMG example, if the C run-time library needs to make a copy of a string prior to passing the string to the SMG facility, this is what will be implemented.
Analyzing the Interfaces
The process of analyzing the interfaces began by creating a document that listed all the header files
in structures. If a FILE pointer is always a 32-bit pointer, structures that contain only FILE pointers are not affected by the choice of pointer size. We use this information when we look at the layout of structures and examine function prototypes that accept pointers to structures.
In this paper, structures that are always
An example of a parameter being restricted to a low memory address is the buffer being passed as the parameter to the function setbuf. The parameter defines this buffer to be used for I/O operations. The user expects to see this buffer change as I/O operations are performed on the file. If the run-time library made a copy of this buffer, the changes would
structure that is bound to low memory, such as a FILE or WINDOW pointer, the return value does not force separate entry points.
Certain functions, such as malloc, allocate memory on behalf of the calling program and return the address of that memory as the value of the function. These functions have two implementations: the 32-bit interface always allocates 32-bit virtual memory, and the 64-bit interface always allocates 64-bit virtual memory.
Many string and memory functions have return values that are relative to a parameter passed to the same routine. These addresses may be returned as high memory addresses if and only if the parameter is a high memory address.
The following is the function prototype for strcat, which is found in the header file <string.h>:

char *strcat(char *s1, const char *s2);

The strcat function appends the string pointed to by s2 to the string pointed to by s1. The return value is the address of the resulting string s1.
In this case, the size of the pointer in the return value is always the same as the size of the pointer passed as the first parameter. The C programming language has no way to reflect this. Since the function may return a 64-bit pointer, the strcat function must have two entry points.
As discussed earlier, the pointer size used for parameter s2 is not related to the returned pointer size. The C run-time library made this s2 argument 64-bit friendly by declaring it a 64-bit pointer. This declaration allows the application programmer to concatenate a string in high memory to one in low memory without altering the source code. The following strcat function statement shows this declaration:

char *strcat(char *s1, __char_ptr64 s2);

The data type __char_ptr64 is a 64-bit character pointer whose definition and use will be explained later in this paper.
High-level Design
The /pointer_size qualifier is available in those versions of the C compiler that support 64-bit pointers. The compiler has a predefined macro named __INITIAL_POINTER_SIZE whose value is based on the use of the /pointer_size qualifier. The macro accepts the following values:
• 0, which indicates that the /pointer_size qualifier is not used or is not available
• 32, which indicates that the /pointer_size qualifier is used and has a value of 32
• 64, which indicates that the /pointer_size qualifier is used and has a value of 64
The C run-time library header files conditionally compile based on the value of this predefined macro. A zero value indicates to the header files that the computing environment is purely 32-bit. The pointer-size-specific function prototypes are not defined. The user must use the /pointer_size qualifier to access 64-bit functionality. The choice of 32 or 64 determines the default pointer size.
The header files define two distinct types of declarations: those that have a single implementation and those that have pointer-size-specific implementations. The addresses passed or returned from functions that have a single implementation are either bound to low memory, restricted to low memory, or widened to accept a 64-bit pointer.
Those functions that have pointer-size-specific entry points have three function prototypes defined. Using malloc as an example, prototypes are created for the functions malloc, _malloc32, and _malloc64. The latter two prototypes are the pointer-size-specific prototypes and are defined only when the /pointer_size qualifier is used. The malloc prototype defaults to calling _malloc32 when the default pointer size is 32 bits. The malloc prototype defaults to calling _malloc64 when the default pointer size is 64 bits. Application programmers who mix pointer types use the /pointer_size qualifier to establish the default pointer size but can then use _malloc32 and _malloc64 explicitly to achieve nondefault behavior.
In addition to being enhanced to support 64-bit pointers, the C compiler has the added capability of detecting incorrect mixed-pointer usage. It is the function prototype found in the header files that tells the compiler exactly what to expect. When forming the external name for a function, the compiler checks whether the function is the 32-bit-specific entry point. If it is, the compiler forms the prefixed name by adding "decc$" to the beginning of the name but also removes the "_" and the "32." Consequently, the function name _malloc32 becomes decc$malloc, but the function name _wz32 does not change.
Implementation
To illustrate changes that needed to be made to the header files, we invented a single header file called <header.h>. This file, which is shown in Figure 1, illustrates the classes of problems faced by a developer who is adding support for 64-bit pointers. The functions defined in this header file are actual C run-time library functions.
Preparing the Header File
The first pass through <header.h> resulted in a number of changes in terms of formatting, commenting, and 64-bit support. Realizing that many modifications would be made to the header files, we considered readability a major goal for this release of these files.
The initial header files assumed a uniform pointer size of 32 bits for the OpenVMS operating system. During the first pass through <header.h>, we added pointer-size pragmas to ensure that the file saved the user's pointer size, set the pointer size to 32 bits, and then restored the user's pointer size at the end of the header.
Next we formatted <header.h> to show the various categories that the structures and functions fall into. The categories and the result of the first pass through <header.h> can be seen in Figure 2. For example, the function rand had no pointers in the function prototype and was immediately moved to the section "Functions that support 64-bit pointers."
Organizing <header.h> in this way gave us an accurate reading of how many more functions needed 64-bit support. If any of the sections became empty, we did not remove them.

Figure 1
Original Header File <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

#ifndef SIZE_T
# define SIZE_T 1
typedef unsigned int size_t;
#endif

int    execv(const char *, char *[]);
void   free(void *);
void   *malloc(size_t);
int    rand(void);
char   *strcat(char *, const char *);
char   *strerror(int);
size_t strlen(const char *);

#endif /* HEADER_LOADED */
Figure 2
First Pass through <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

/*
** Ensure that we begin with 32-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# if (__VMS_VER < 70000000)
#  error "Pointer size added in OpenVMS V7.0 for Alpha"
# endif
# pragma __pointer_size __save
# pragma __pointer_size 32
#endif

/*
** STRUCTURES NOT AFFECTED BY POINTERS
*/
#ifndef SIZE_T
# define SIZE_T 1
typedef unsigned int size_t;
#endif

/*
** FUNCTIONS THAT NEED 64-BIT SUPPORT
*/
int    execv(const char *, char *[]);
void   free(void *);
void   *malloc(size_t);
char   *strcat(char *, const char *);
char   *strerror(int);
size_t strlen(const char *);

/*
** Create 32-bit header file typedefs.
*/

/*
** Create 64-bit header file typedefs.
*/

/*
** FUNCTIONS RESTRICTED FROM 64 BITS
*/

/*
** Change default to 64-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size 64
#endif

/*
** FUNCTIONS THAT SUPPORT 64-BIT POINTERS
*/
int rand(void);

/*
** Restore the user's pointer context.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size __restore
#endif

#endif /* HEADER_LOADED */
We created a C run-time library private header file called <wide_types.src>. This header file has the appropriate pragmas to define 64-bit pointer types used within the C run-time library, as shown in Figure 3. This header file also contains the definitions of macros used in the implementations of the functions. Figure 4 shows the macros declared in <wide_types.src>.
Once a module includes the file <wide_types.src>, the compilation of that module changes to add the qualifier /pointer_size=32. This change improves the readability of the code because "char *" is read as a 32-bit character pointer, whereas 64-bit pointers use typedefs whose names begin with "__wide". The name of the new typedef is __wide_char_ptr, which is read as a 64-bit character pointer.
The C run-time library design also requires that the implementation of a function include all header files that define the function. This ensures that the implementation matches the header files as they are modified to support 64-bit pointers. For functions defined in multiple header files, this ensures that header files do not contradict each other.
Figure 3
Typedefs from <wide_types.src>

/*
** This include file defines all 32-bit and 64-bit data types used in
** the implementation of 64-bit addresses in the C run-time library.
**
** Those modules that are compiled with a 64-bit-capable compiler
** are required to enable pointer size with /POINTER_SIZE=32.
*/
#ifdef __INITIAL_POINTER_SIZE
# if (__INITIAL_POINTER_SIZE != 32)
#  error "This module must be compiled /pointer_size=32"
# endif
#endif

/*
** All interfaces that require 64-bit pointers must use one of
** the following definitions. When this header file is used on
** platforms not supporting 64-bit pointers, these definitions
** will define 32-bit pointers.
*/
#ifdef __INITIAL_POINTER_SIZE
# pragma __pointer_size __save
# pragma __pointer_size 64
#endif

typedef char        * __wide_char_ptr;
typedef const char  * __wide_const_char_ptr;
typedef int         * __wide_int_ptr;
typedef const int   * __wide_const_int_ptr;
typedef char       ** __wide_char_ptr_ptr;
typedef const char ** __wide_const_char_ptr_ptr;
typedef void        * __wide_void_ptr;
typedef const void  * __wide_const_void_ptr;

#include <curses.h>
typedef WINDOW * __wide_WINDOW_ptr;

#include <stddef.h>
typedef size_t * __wide_size_t_ptr;

/*
** Restore pointer size.
*/
#ifdef __INITIAL_POINTER_SIZE
# pragma __pointer_size __restore
#endif
Figure 4
Macros from <wide_types.src>

/*
** Define macros that are used to determine pointer size and
** macros that will copy from high memory onto the stack.
*/
#ifdef __INITIAL_POINTER_SIZE
# include <builtins.h>
# define C$$IS_SHORT_ADDR(addr) \
    ((((__int64)(addr) << 32) >> 32) == (unsigned __int64)(addr))
# define C$$SHORT_ADDR_OF_STRING(addr) \
    (C$$IS_SHORT_ADDR(addr) ? (char *)(addr) \
      : (char *)strcpy(__ALLOCA(strlen(addr) + 1), (addr)))
# define C$$SHORT_ADDR_OF_STRUCT(addr) \
    (C$$IS_SHORT_ADDR(addr) ? (void *)(addr) \
      : (void *)memcpy(__ALLOCA(sizeof (*addr)), (addr), sizeof (*addr)))
# define C$$SHORT_ADDR_OF_MEMORY(addr, len) \
    (C$$IS_SHORT_ADDR(addr) ? (void *)(addr) \
      : (void *)memcpy(__ALLOCA(len), (addr), len))
#else
# define C$$IS_SHORT_ADDR(addr)             (1)
# define C$$SHORT_ADDR_OF_STRING(addr)      (addr)
# define C$$SHORT_ADDR_OF_STRUCT(addr)      (addr)
# define C$$SHORT_ADDR_OF_MEMORY(addr, len) (addr)
#endif
Implementing the strerror Return Pointer
The function strerror always returns a 32-bit pointer. The memory is allocated by the C run-time library for both 32-bit and 64-bit calling programs. As shown in Figure 5, we moved the function strerror into the section "Functions that support 64-bit pointers" of <header.h> to show that there are no restrictions on the use of this function.
The "Create 32-bit header file typedefs" section of <header.h> is in the 32-bit pointer section, where the bound-to-low-memory data structures are declared. The function returns a pointer to a character string. We, therefore, added typedefs for __char_ptr32 and __const_char_ptr32 while in a 32-bit pointer context. These declarations are protected with the definition of __CHAR_PTR32 to allow multiple header files to use the same naming convention. Declarations of the const form of the typedef are always made in the same conditional code since they usually are needed and using the same condition removes the need for a different protecting name.
The strerror function could have been implemented in <header.h> by placing the function in the 32-bit section, but that would have implied that the 32-bit pointer was a restriction that could be removed later. The pointer is not a restriction, and the strerror function fully supports 64-bit pointers.
The private header file typedefs are always declared starting with two underscores and ending in either "_ptr32" or "_ptr64." These typedefs are created only when the header file needs to be in a particular pointer-size mode while referring to a pointer of the other size. The return value of strerror is modified to use the typedef __char_ptr32.
Including the header file, which declares strerror, allows the compiler to verify that the arguments, return values, and pointer sizes are correct.
Widening the strlen Argument
The function strlen accepts a constant character pointer and returns an unsigned integer (size_t). Implementing full 64-bit support in strlen means changing the parameter to a 64-bit constant character pointer. If an application passes a 32-bit pointer to the strlen function, the compiler-generated code sign extends the pointer. The required header file modification is to simply move strlen from the section "Functions that need 64-bit support" to the section "Functions that support 64-bit pointers."
The steps necessary for the source code to support 64-bit addressing are as follows:
1. Ensure that the module includes header files that declare strlen.
Figure 5
Final Form of <header.h>

#ifndef HEADER_LOADED
#define HEADER_LOADED

/*
** Ensure that we begin with 32-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# if (__VMS_VER < 70000000)
#  error "Pointer size added in OpenVMS V7.0 for Alpha"
# endif
# pragma __pointer_size __save
# pragma __pointer_size 32
#endif

/*
** STRUCTURES NOT AFFECTED BY POINTERS
*/
#ifndef SIZE_T
# define SIZE_T 1
typedef unsigned int size_t;
#endif

/*
** FUNCTIONS THAT NEED 64-BIT SUPPORT
*/

/*
** Create 32-bit header file typedefs.
*/
#ifndef __CHAR_PTR32
# define __CHAR_PTR32 1
typedef char * __char_ptr32;
typedef const char * __const_char_ptr32;
#endif

/*
** Create 64-bit header file typedefs.
*/
#ifndef __CHAR_PTR64
# define __CHAR_PTR64 1
# pragma __pointer_size 64
typedef char * __char_ptr64;
typedef const char * __const_char_ptr64;
# pragma __pointer_size 32
#endif

/*
** FUNCTIONS RESTRICTED FROM 64 BITS
*/
int execv(__const_char_ptr64, char *[]);

/*
** Change default to 64-bit pointers.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size 64
#endif

/*
** The following functions have interfaces of XXX, _XXX32,
** and _XXX64.
**
** The function strcat has two interfaces because the return
** argument is a pointer that is relative to the first argument.
**
** The malloc function returns either a 32-bit or a 64-bit
** memory address.
*/
#if __INITIAL_POINTER_SIZE == 32
# pragma __pointer_size 32
#endif

void *malloc(size_t __size);
char *strcat(char * __s1, __const_char_ptr64 __s2);

#if __INITIAL_POINTER_SIZE == 32
# pragma __pointer_size 64
#endif

#if __INITIAL_POINTER_SIZE && __VMS_VER >= 70000000
# pragma __pointer_size 32
void *_malloc32(size_t);
char *_strcat32(char * __s1, __const_char_ptr64 __s2);
# pragma __pointer_size 64
void *_malloc64(size_t);
char *_strcat64(char * __s1, const char * __s2);
#endif

/*
** FUNCTIONS THAT SUPPORT 64-BIT POINTERS
*/
void free(void * __ptr);
int rand(void);
size_t strlen(const char * __s);
__char_ptr32 strerror(int __errnum);

/*
** Restore the user's pointer context.
*/
#if __INITIAL_POINTER_SIZE
# pragma __pointer_size __restore
#endif

#endif /* HEADER_LOADED */
2. Add the following line of code to the top of the module: #include <wide_types.src>.
3. Change the declaration of the function to accept a __wide_const_char_ptr parameter instead of the previous const char * parameter.
4. Visually follow this argument through the code, looking for assignment statements. This particular function would be a simple loop. If local variables store this pointer, they must also be declared as __wide_const_char_ptr.
5. Compile the source code using the directive /warn=enable=maylosedata to have the compiler help detect pointer truncation.
6. Add a new test to the test system to exercise 64-bit pointers.
Restricting execv from High Memory
Examination of the execv function prototype showed that this function receives two arguments. The first argument is a pointer to the name of the file to start. The second argument represents the argv array that is to be passed to the child process. This array of pointers to null-terminated strings ends with a NULL pointer.
Initially, the execv function was to have had two implementations. The parameters passed to the execv function are used as the parameters to the main function of the child process being started. Because no assumptions could be made about that child process (in terms of support for 64-bit pointers), these parameters are restricted to low memory addresses.
To illustrate that the argv passing was a restriction, we place that prototype into the section "Functions restricted from 64 bits" of <header.h>. The first argument, the name of the file, did not need to have this restriction. The section "Create 64-bit header file typedefs" was enhanced to add the definition of __const_char_ptr64, which allows the prototypes to define a 64-bit pointer to constant characters while in either 32-bit or 64-bit context.
Returning a Relative Pointer in strcat<br />
The strcat function returns a pointer relative to its first<br />
argument. We looked at this function and determined<br />
that it required two entry points. In addition, we<br />
widened the second parameter, which is the address of<br />
the string to concatenate to the first, to allow the<br />
application to concatenate a 64-bit string to a 32-bit<br />
string without source code changes.<br />
Digital Technical Journal Vol. 8 No. 2 1996 93
Figure 5 shows the changes made to support functions<br />
that have pointer-size-specific entry points. The<br />
prototypes of functions XXX, _XXX32, and _XXX64<br />
begin in 64-bit pointer-size mode. Since the unmodified<br />
function name (strcat, XXX) is to be in the pointer<br />
size specified by the /pointer_size qualifier, the<br />
pointer size is changed from 64 bits to 32 bits if and<br />
only if the user has specified /pointer_size=32. At this<br />
point, we are not certain of the pointer size in effect.<br />
We know only that the size is the same as the size of<br />
the qualifier. The second argument to strcat uses the<br />
__const_char_ptr64 typedef in case we are in 32-bit<br />
pointer mode. Notice the declaration of _strcat64<br />
does not use this typedef because we are guaranteed<br />
to be in 64-bit pointer context. Figure 6 shows the<br />
implementation of both the 32-bit and the 64-bit<br />
strcat functions.<br />
The 64-bit malloc Function<br />
The implementation of multiple entry points was discussed<br />
and demonstrated in the strcat implementation.<br />
Although multiple entry points are typically added to<br />
avoid truncating pointers, functions such as memory<br />
allocation routines have newly defined behavior.<br />
The functions decc$malloc and decc$_malloc64<br />
use new support provided by the OpenVMS Alpha<br />
operating system for allocating, extending, and freeing<br />
64-bit virtual memory. The C run-time library utilizes<br />
this new functionality through the LIBRTL entry<br />
points. The LIBRTL group added new entry points for<br />
each of the existing memory management functions.<br />
The LIBRTL includes an additional second entry<br />
point for the free function. Since our implementation<br />
of the free function simply widens the pointer, we end<br />
up with a single C run-time library function that must<br />
choose which LIBRTL function to call.<br />
/*<br />
** STRCAT / _STRCAT64<br />
**<br />
** The 'strcat' function concatenates 's2', including the<br />
** terminating null character, to the end of 's1'.<br />
*/<br />
<br />
int free(__wide_void_ptr ptr) {<br />
    if (!(C$$IS_SHORT_ADDR(ptr)))<br />
        return (c$$_free64(ptr));<br />
    else<br />
        return (c$$_free32((void *) ptr));<br />
}<br />
<br />
Concluding Remarks<br />
The project took approximately seven person-months<br />
to complete. The work involved two months to determine<br />
what we wanted to do, one month to figure out<br />
how we were going to do it, and four person-months<br />
to modify, document, and test the software.<br />
During the initial two months, the technical leaders<br />
met on a weekly basis and discussed the overall<br />
approach to adding 64-bit pointers to the OpenVMS<br />
environment. Since I was the technical lead for the C<br />
run-time library project, this initial phase occupied<br />
between 25 and 50 percent of my time.<br />
The one month of detailed analysis and design consumed<br />
more than 90 percent of my time and resulted<br />
in a detailed document of approximately 100 pages.<br />
The document covered each of the 50 header files and<br />
500 function interfaces. The functions were grouped<br />
by type, based on the amount of work required to<br />
support 64-bit pointers.<br />
The first month of implementation occupied nearly<br />
all of my time, as I made several false starts. Once I<br />
worked out the final implementation technique, I<br />
completed at least two of each type of work. As coding<br />
deadlines approached, I taught two other engineers on<br />
my team how to add 64-bit pointer support, pointing<br />
out those functions already completed for reference.<br />
They came up to speed within one week. Together, we<br />
completed the work during the final month of the<br />
project.<br />
__wide_char_ptr _strcat64 (__wide_char_ptr s1, __wide_const_char_ptr s2)<br />
{<br />
    (void) _memcpy64((s1 + strlen(s1)), s2, (strlen(s2) + 1));<br />
    return (s1);<br />
}<br />
<br />
char * strcat32
Acknowledgments<br />
The author would like to acknowledge the others who<br />
contributed to the success of the C run-time library<br />
project. The engineers who helped with various<br />
aspects of the analysis, design, and implementation<br />
were Sandra Whitman, Brian McCarthy, Greg Tarsa,<br />
Marc Noel, Boris Gubenko, and Ken Cowan. Our<br />
writer, John Paolillo, worked countless hours documenting<br />
the changes we made to the library.<br />
References<br />
1. M. Harvey and L. Szubowicz, "Extending OpenVMS<br />
for 64-bit Addressable Virtual Memory," Digital<br />
Technical Journal, vol. 8, no. 2 (1996, this issue):<br />
57-71.<br />
2. T. Benson, K. Noel, and R. Peterson, "The OpenVMS<br />
Mixed Pointer Size Environment," Digital Technical<br />
Journal, vol. 8, no. 2 (1996, this issue): 72-82.<br />
3. DEC C User's Guide for OpenVMS Systems (Maynard,<br />
Mass.: Digital Equipment Corporation, 1995).<br />
4. DEC C Runtime Library Reference Manual for<br />
OpenVMS Systems (Maynard, Mass.: Digital Equipment<br />
Corporation, Order No. AA-PUNEE-TK, 1995).<br />
5. OpenVMS Calling Standard (Maynard, Mass.: Digital<br />
Equipment Corporation, Order No. AA-QSBRA-TE,<br />
1995).<br />
Biography<br />
Duane A. Smith<br />
As a consulting software engineer, Duane Smith is currently<br />
architect and project leader of the C run-time library for<br />
the OpenVMS VAX and Alpha platforms. He joined Digital<br />
in 1981 and has worked on a variety of projects, including<br />
the A-to-Z Datab
Building a High-performance<br />
Message-passing System for<br />
MEMORY CHANNEL Clusters<br />
The new MEMORY CHANNEL for PCI cluster<br />
interconnect technology developed by Digital<br />
(based on technology from Encore Computer<br />
Corporation) dramatically reduces the overhead<br />
involved in intermachine communication.<br />
Digital has designed a software system,<br />
the TruCluster MEMORY CHANNEL Software version<br />
1.4 product, that provides fast user-level<br />
access to the MEMORY CHANNEL network and<br />
can be used to implement a form of distributed<br />
shared memory. Using this product, Digital has<br />
built a low-level message-passing system that<br />
reduces the communications latency in a MEMORY<br />
CHANNEL cluster to less than 10 microseconds.<br />
This system can, in turn, be used to easily build<br />
the communications libraries that programmers<br />
use to parallelize scientific codes. Digital has<br />
demonstrated the successful use of this message-passing<br />
system by developing implementations<br />
of two of the most popular of these libraries,<br />
Parallel Virtual Machine (PVM) and Message<br />
Passing Interface (MPI).<br />
James V. Lawton<br />
John J. Brosnan<br />
Morgan P. Doyle<br />
Seosamh D. Ó Riordáin<br />
Timothy G. Reddin<br />
During the last few years, significant research and<br />
development has been undertaken in both academia<br />
and industry in an effort to reduce the cost of high-performance<br />
computing (HPC). The method most<br />
frequently used was to build parallel systems out of<br />
clusters of commodity workstations or servers that<br />
could be used as a virtual supercomputer.1 The motivation<br />
for this work was the tremendous gains that<br />
have been achieved in reduced instruction set computer<br />
(RISC) microprocessor performance during the<br />
last decade. Indeed, processor performance in today's<br />
workstations and servers often exceeds that of the individual<br />
processors in a tightly coupled supercomputer.<br />
However, traditional local area network (LAN) performance<br />
has not kept pace with microprocessor<br />
performance. LANs, such as fiber distributed data<br />
interface (FDDI), offer reasonable bandwidth, but since<br />
communication is generally carried out by means of<br />
traditional protocol stacks such as the user datagram<br />
protocol/internet protocol (UDP/IP) or the transmission<br />
control protocol/internet protocol (TCP/IP),<br />
software overhead is a major factor in message<br />
transfer time.2 This software overhead is not reduced<br />
by building faster LAN network hardware. Rather, a<br />
new approach is needed, one that bypasses the protocol<br />
stack while preserving sequencing, error detection,<br />
and protection.<br />
Much current research is devoted to reducing this<br />
communications overhead using specialized hardware<br />
and software. To this
Data may then be transferred between the machines<br />
using simple memory read and write operations, with<br />
no software overhead, essentially utilizing the full performance<br />
of the hardware. This approach is similar to<br />
the one used in the Princeton SHRIMP project, where<br />
this process is described as Virtual Memory-Mapped<br />
Communication (VMMC).7-10<br />
Figure 1 shows the relationship between the various<br />
components of our message-passing system. The first<br />
phase of our work involved designing a programming<br />
library and associated kernel components to provide<br />
protected, unprivileged access to the MEMORY<br />
CHANNEL network. Our objective in creating this<br />
library was to provide a facility much like the standard<br />
System V interprocess communication (IPC) shared<br />
memory functions available in UNIX implementations.<br />
Programmers could use the library to set up operations<br />
over the MEMORY CHANNEL interconnect, but they<br />
would not need to use the library functions for data<br />
transfer. In this way, performance could be maximized.<br />
This product, the TruCluster MEMORY CHANNEL<br />
Software, provides programmers with a simple, high-performance<br />
mechanism for building parallel systems.<br />
TruCluster MEMORY CHANNEL Software delivers<br />
the performance available from the MEMORY<br />
CHANNEL network directly to user applications but<br />
requires a programming style that is different from<br />
that required for shared memory. This different programming<br />
style is necessary because of the different<br />
access characteristics between local memory and memory<br />
on a remote node connected through a MEMORY<br />
CHANNEL network. To make programming with the<br />
MEMORY CHANNEL technology relatively simple<br />
while continuing to deliver the hardware performance,<br />
we built a library of primitive communications functions.<br />
This system, called Universal Message Passing<br />
(UMP), hides the details of MEMORY CHANNEL<br />
operations from the programmer and operates seamlessly<br />
over two transports (initially): shared memory<br />
and the MEMORY CHANNEL interconnect. This<br />
allows seamless growth from a symmetric multiprocessor<br />
(SMP) to a full MEMORY CHANNEL cluster.<br />
Development can be done on a workstation, while<br />
production work is done on the cluster. The UMP<br />
Figure 1<br />
Message-passing System Architecture<br />
[Figure: the PARALLEL APPLICATION sits on PVM or MPI, which sit on UMP;<br />
UMP runs over SHARED MEMORY, TRUCLUSTER MEMORY CHANNEL SOFTWARE,<br />
or OTHER TRANSPORT.]<br />
layer was designed from the beginning with performance<br />
considerations in mind, particularly with<br />
respect to minimizing the overhead involved in sending<br />
small messages.<br />
Two distributed memory models are predominantly<br />
used in high-performance computing today:<br />
1. Data parallel, which is used in High Performance<br />
Fortran (HPF).11 With this model, the programmer<br />
uses parallel language constructs to indicate to the<br />
compiler how to distribute data and what operations<br />
should be performed on it. The problem is<br />
assumed to be regular so that the compiler can use<br />
one of a number of data distribution algorithms.<br />
2. Message passing, which is used in Parallel Virtual<br />
Machine (PVM) and Message Passing Interface<br />
(MPI).14,15 In this approach, all messaging is performed<br />
explicitly, so the application programmer<br />
determines the data distribution algorithm, making<br />
this approach more suitable for irregular problems.<br />
It is not yet clear whether one of these approaches<br />
will predominate in the future or if both will continue<br />
to coexist. Digital has been working to provide competitive<br />
solutions for both approaches using MEMORY<br />
CHANNEL clusters. Digital's HPF work has been<br />
described in a previous issue of the Journal.16,17 This<br />
paper is primarily concerned with message passing.<br />
Building on the UMP layer, we constructed implementations<br />
of two common message-passing systems.<br />
The first, PVM, is a de facto standard
depending on the bandwidth of the I/O subsystem<br />
into which the adapter is plugged. Although the current<br />
MEMORY CHANNEL network is a shared bus, the<br />
plan for the next generation is to utilize a switched<br />
technology that will increase the aggregate bandwidth<br />
of the network beyond that of currently available<br />
switched LAN technologies. The latency (time to send<br />
a minimum-length message one way between two<br />
processes) is less than 5 microseconds (µs). The<br />
MEMORY CHANNEL network provides a communications<br />
medium with a low bit-error rate, on the order of<br />
10^-15. The probability of undetected errors occurring<br />
is so small (on the order of the undetected error rate of<br />
CPUs and memory subsystems) that it is essentially<br />
negligible. A MEMORY CHANNEL cluster consists of<br />
one or more PCI MEMORY CHANNEL adapters on<br />
each node and a hub connecting up to eight nodes.<br />
The MEMORY CHANNEL cluster supports a<br />
512-MB global address space into which each adapter,<br />
under operating system control, can map regions of<br />
local virtual address space. Figure 2 illustrates the<br />
MEMORY CHANNEL operation. Figure 2a shows<br />
transmission, and Figure 2b shows reception. A page<br />
table entry (PTE) is an entry in the system virtual-to-physical<br />
map that translates the virtual address of<br />
a page to the corresponding physical address. The<br />
MEMORY CHANNEL adapter contains a page control<br />
table (PCT) that indicates the pages mapped for reception.<br />
Thus, when a MEMORY CHANNEL network packet<br />
is received that corresponds to a page that is mapped<br />
for reception, the data is transferred directly to the<br />
appropriate page of physical memory by the system's<br />
DMA engine. In addition, any cache lines that refer to<br />
the updated page are invalidated.<br />
Subsequently, any writes to the mapped page of virtual<br />
memory on the first node result in corresponding<br />
writes to physical memory on the second node. This<br />
means th
Figure 2<br />
MEMORY CHANNEL Operation<br />
[Figure: (a) Transmission: a process executes a store instruction to a virtual<br />
address in the transmit region; the page table entry performs virtual-to-physical<br />
address translation to a physical address in PCI I/O space, and the MEMORY<br />
CHANNEL adapter's page control table forwards the data to a MEMORY CHANNEL<br />
address. (b) Reception: a process executes a load instruction from a virtual<br />
address in the receive region; incoming data is written through the page control<br />
table to physical memory, affected cache lines are invalidated, and the data is<br />
returned from physical memory.]<br />
<br />
Figure 3<br />
TruCluster MEMORY CHANNEL Software Architecture<br />
[Figure: in user space, applications call the TruCluster MEMORY CHANNEL API<br />
library; in kernel space, the library communicates with the TruCluster MEMORY<br />
CHANNEL kernel subsystem, which uses the low-level kernel MEMORY CHANNEL<br />
functions.]<br />
a set of simple building blocks that are analogous to the<br />
System V IPC facility in most UNIX implementations.<br />
The advantage is that while a very simple interface is<br />
provided initially, the interface can later be extended as<br />
required, without losing compatibility with applications<br />
based on the initial implementation. Table 1 details the<br />
MEMORY CHANNEL API library functions that the<br />
product provides. An important feature to note is that<br />
when a MEMORY CHANNEL region is allocated using<br />
TruCluster MEMORY CHANNEL Software, a key is<br />
specified that uniquely identifies this region in the cluster.<br />
Other processes anywhere in the cluster can attach<br />
to the same region using the same key; the collection of<br />
keys provides a clusterwide namespace.<br />
The MEMORY CHANNEL API library communicates<br />
with the kernel subsystem using kmodcall, a simple<br />
generic system call used to manage kernel<br />
subsystems. The library function constructs a command<br />
block containing the type of command (i.e.,<br />
Table 1<br />
TruCluster MEMORY CHANNEL API Library Functions<br />
<br />
imc_asalloc: Allocates a region of MEMORY CHANNEL address space of a specified size and permissions and<br />
with a user-supplied key; the ability to specify a key allows other cluster processes to rendezvous<br />
at the same region. The function returns to the user a clusterwide ID for this region.<br />
imc_asattach: Attaches an allocated MEMORY CHANNEL region to a process virtual address space. A region<br />
can be attached for transmission or reception, and in shared or exclusive mode. The user can also<br />
request that the page be attached in loopback mode, i.e., any writes will be reflected back to the<br />
current node so that if an appropriate reception mapping is in effect, the result of the writes can<br />
be seen locally. The virtual address of the mapped region is assigned by the kernel and returned<br />
to the user.<br />
imc_asdetach: Detaches an allocated MEMORY CHANNEL region from a process virtual address space.<br />
imc_asdealloc: Deallocates a region of MEMORY CHANNEL address space with a specified ID.<br />
imc_lkalloc: Allocates a set of clusterwide spinlocks. The user can specify a key and the required permissions.<br />
Normally, if a spinlock set exists, then this function just returns the ID of that lock set; otherwise<br />
it creates the set. If the user specifies that creation is to be exclusive, then failure will result if the<br />
spinlock set exists already. In addition, by specifying the IMC_CREATOR flag, the first spinlock in<br />
the set will be acquired. These two features prevent the occurrence of races in the allocation of<br />
spinlock sets across the cluster.<br />
imc_lkacquire: Acquires (locks) a spinlock in a specified spinlock set.<br />
imc_lkrelease: Releases (unlocks) a spinlock in a specified spinlock set.<br />
imc_lkdealloc: Deallocates a set of spinlocks.<br />
imc_rderrcnt: Reads the clusterwide MEMORY CHANNEL error count and returns the value to the user. This<br />
value is not guaranteed to be up-to-date for all nodes in the cluster. It can be used to construct<br />
an application-specific error-detection scheme.<br />
imc_ckerrcnt: Checks for outstanding MEMORY CHANNEL errors, i.e., errors that have not yet been reflected in<br />
the clusterwide MEMORY CHANNEL error count returned by imc_rderrcnt. This function checks<br />
each node in the cluster for any outstanding errors and updates the global error count accordingly.<br />
imc_kill: Sends a UNIX signal to a specified process on another node in the cluster.<br />
imc_gethosts: Returns the number of nodes currently in the cluster and their host names.<br />
which library function has been called) and associated information, such<br />
as virtual addresses.<br />
Similarly, for each set of spinlocks allocated on the<br />
cluster there is a cluster lock descriptor (CLD) that<br />
contains information describing the spinlock set,<br />
including its clusterwide lock ID, the number of spinlocks<br />
in the set, the key, permissions, creation time,<br />
and the UID and GID of the creating process. For an<br />
individual CLD, there is a host lock descriptor (HLD)<br />
for each node that is using the spinlock set. The HLD<br />
contains the cluster ID of the node and other node-specific<br />
information about the spinlock set. For a specific<br />
HLD, there is a process lock descriptor (PLD) for<br />
each process on that node that is using the spinlock<br />
Figure 4<br />
TruCluster MEMORY CHANNEL Kernel Data Structures<br />
[Figure: (a) Regions: cluster region descriptors (CRD 0, CRD 1) each point to a<br />
list of host region descriptors (HRDs) for the hosts mapping that region. (b)<br />
Locks: cluster lock descriptors (CLDs) point to host lock descriptors (HLDs),<br />
which in turn point to process lock descriptors (PLDs), e.g., PLD 0: PID 3346.<br />
Key: CLD = cluster lock descriptor; CRD = cluster region descriptor; HLD = host<br />
lock descriptor; HRD = host region descriptor; PLD = process lock descriptor;<br />
PRD = process region descriptor.]<br />
set. The PLD contains the PID of the process that created<br />
the spinlock set and any process-specific information<br />
about the spinlock set.<br />
All these cluster data structures are illustrated in<br />
Figure 4, in which two regions of MEMORY CHANNEL space have been<br />
allocated on the cluster. The first region, with descriptor<br />
CRD 0, is mapped on three nodes: host 4, host 6,<br />
and host 3. The diagram
Figure 5<br />
Command Relay Operation<br />
[Figure: a command invoked on host A is relayed to the other hosts in the<br />
cluster.]<br />
Given the usefulness of coherent allocations, it may<br />
seem unusual that we made this feature an option<br />
rather than the default. There are several reasons for<br />
this. With coherent allocations, the associated physical<br />
memory becomes nonpageable on all nodes within the<br />
cluster,
Performance<br />
The performance of TruCluster MEMORY CHANNEL<br />
Software on a pair of AlphaServer 4100 5/300<br />
machines is presented in Table 2. These measurements<br />
were made using version 1.5 MEMORY CHANNEL<br />
adapters. The bandwidth (64 MB/s)
extern long asm(const char *, ...);<br />
#pragma intrinsic (asm)<br />
#define mb() asm("mb")<br />
<br />
main()<br />
{<br />
    int status, i, locks = 4, temp, errors;<br />
    imc_asid_t region_id;                /* MC region ID */<br />
    imc_lkid_t lock_id;                  /* MC spinlock set ID */<br />
    typedef struct {                     /* Shared data structure */<br />
        volatile int processes;<br />
        volatile int pattern[2047];<br />
    } shared_region;<br />
    shared_region *region_read, *region_write;<br />
    caddr_t read_ptr = 0, write_ptr = 0;<br />
<br />
    /* Allocate a region of coherent MC address space and attach to */<br />
    /* process VA */<br />
    imc_asalloc(123, 8192, IMC_URW, IMC_COHERENT, &region_id);<br />
    imc_asattach(region_id, IMC_TRANSMIT, IMC_SHARED, IMC_LOOPBACK, &write_ptr);<br />
    imc_asattach(region_id, IMC_RECEIVE, IMC_SHARED, 0, &read_ptr);<br />
    region_write = (shared_region *)write_ptr;<br />
    region_read = (shared_region *)read_ptr;<br />
<br />
    /* Allocate a set of spinlocks and atomically acquire the first lock */<br />
    status = imc_lkalloc(456, &locks, IMC_LKU, IMC_CREATOR, &lock_id);<br />
    if (status == IMC_SUCCESS) {<br />
        region_write->processes = 0;     /* Initialize the global region */<br />
        errors = imc_rderrcnt();<br />
        do {<br />
            for (i = 0; i < 2047; i++)<br />
                region_write->pattern[i] = i;<br />
            i--;<br />
            mb();<br />
        } while (imc_ckerrcnt(&errors) || region_read->pattern[i] != i);<br />
        imc_lkrelease(lock_id, 0);<br />
    }<br />
    else if (status == IMC_EXISTS) {<br />
        imc_lkalloc(456, &locks, IMC_LKU, 0, &lock_id);<br />
        imc_lkacquire(lock_id, 0);<br />
        temp = region_read->processes + 1;  /* Increment the process counter */<br />
        errors = imc_rderrcnt();<br />
        do {<br />
            region_write->processes = temp;<br />
            mb();<br />
        } while (imc_ckerrcnt(&errors) || region_read->processes != temp);<br />
        imc_lkrelease(lock_id, 0);<br />
    }<br />
<br />
    temp = region_read->processes - 1;   /* Decrement the process counter */<br />
    errors = imc_rderrcnt();<br />
    do {<br />
        region_write->processes = temp;<br />
        mb();<br />
    } while (imc_ckerrcnt(&errors) || region_read->processes != temp);<br />
<br />
    imc_lkdealloc(lock_id);              /* Deallocate spinlock set */<br />
    imc_asdetach(region_id);             /* Detach shared region */<br />
    imc_asdealloc(region_id);            /* Deallocate MC address space */<br />
}<br />
<br />
Figure 6<br />
Programming with TruCluster MEMORY CHANNEL Software<br />
These goals placed some important constraints on<br />
the architecture of UMP, particularly with regard to<br />
performance. This meant that design decisions had<br />
to be constantly evaluated in terms of their performance<br />
impact. The initial design decision was to use a dedicated<br />
point-to-point circular buffer between every pair<br />
of processes. These buffers use producer and consumer<br />
indexes to control the reading and writing of buffer<br />
contents. The indexes can be modified only by the<br />
consumer and producer tasks and allow fully lockless<br />
operation of the buffers. Removing lock requirements<br />
eliminates not only the software costs associated with<br />
lock manipulation (in the initial implementation of<br />
TruCluster MEMORY CHANNEL Software, acquiring<br />
and releasing an uncontested spinlock takes approximately<br />
130 µs and 120 µs, respectively) but also the<br />
impact on processor performance associated with<br />
Load-locked/Store-conditional instruction sequences.<br />
Although this buffering style eliminates lock manipulation<br />
costs, it results in a quadratic demand for<br />
storage and can limit scalability. If there are N processes<br />
communicating using this method, N2<br />
buffers are required for full mesh communication.<br />
MEMORY CHANNEL address space is a relatively<br />
scarce resource that needs to be carefully husbanded.<br />
To manage the demand on cluster resources as fairly as<br />
possible, we decided to do the following:<br />
Figure 7<br />
Cluster Communication Using UMP<br />
• Allocate buffers sparsely, i.e., as required up to<br />
some default limit. Full N2 allocation would still be<br />
possible if the user increased the number of buffers.<br />
• Make the size of the buffers configurable.<br />
• Use lock-controlled single-writer, multiple-reader<br />
buffers to handle both the overflow from the N2<br />
buffer and fast multicast. One of these buffers,<br />
called outbufs, would be assigned to each process<br />
using UMP upon initialization.<br />
Note that while the channel buffers are logically<br />
point-to-point, they may be implemented physically as<br />
either point-to-point or broadcast. For example, in the<br />
first version of UMP, we used broadcast MEMORY<br />
CHANNEL mappings for the sake of simplicity. We<br />
with task 2 using UMP channel buffers in shared memory,<br />
shown as 1->2 and 2->1. Task 1 and task 3 communicate<br />
using UMP channel buffers in MEMORY<br />
CHANNEL space, shown as 1->3 and 3->1. Task 3 is<br />
reading a message from task 1 using an outbuf. The<br />
outbuf can be written only by task 1 but is mapped for<br />
transmission to all other cluster members. On node 2,<br />
the same region is mapped for reception. Access to<br />
each outbuf is controlled by a unique cluster spinlock.<br />
Our rationale for taking this approach is that a short<br />
software path is more appropriate for small messages<br />
because overhead dominates message transfer time,<br />
whereas the overhead of lock manipulation is a small<br />
component of message transfer time for large messages.<br />
We felt that this approach helped to control the<br />
use of cluster resources and maintained the lowest possible<br />
latency for short messages yet still accommodated<br />
large messages. Note that outbufs are still fixed-size<br />
buffers but are generally configured to be much larger<br />
than the N2 buffers.<br />
This approach worked for PVM because its message<br />
transfer semantics make it acceptable to fail a message<br />
send request due to buffer space restrictions<br />
(e.g., if both the N2 buffer and the outbuf are full).<br />
When we analyzed the requirements for MPI, however,<br />
we found that this approach was not possible. For<br />
this reason, we changed the design to use only the N2<br />
buffers. Instead of writing the message as a single<br />
operation, the message is streamed through the buffer<br />
in a series of fragments. Not only does this approach<br />
support arbitrarily large messages, but it also improves<br />
message bandwidth by allowing (and, for messages<br />
exceeding the available buffer capacity, requiring) the<br />
overlapped writing and reading of the message.<br />
Deadlock is avoided by using a background thread<br />
to write the message. Since overflow is now handled<br />
using the streaming N2 buffers, outbufs were not necessary<br />
to achieve the required level of performance for<br />
large messages and were not implemented. Outbufs<br />
are retained in the design to provide fast multicast<br />
messaging, even though in the current implementation<br />
they are not yet supported.<br />
Achieving the performance goals set for UMP was<br />
not easy. In addition to the buffer architecture<br />
described earlier, several other techniques were used.<br />
• No syscalls were allowed anywhere in the UMP<br />
messaging functions, so UMP runs completely in<br />
user space.<br />
• Calls to library routines and any expensive arithmetic<br />
operations were minimized.<br />
• Global state was cached in local memory wherever<br />
possible.<br />
• Careful attention was paid to data alignment issues,<br />
and all transfers are multiples of 32-bit data.<br />
At the programmer's level, UMP operation is based on duplex point-to-point links called<br />
channels, which correspond to the N2 buffers already described. A channel is a pair of<br />
unidirectional buffers used to provide two-way communication between a pair of process<br />
endpoints anywhere in the cluster. UMP provides functions to open a channel between a pair<br />
of tasks. While the resources are allocated by the first task to open the channel, the<br />
connection is not complete until the second task also opens the same channel. Once a channel<br />
has been opened by both sides, UMP functions can be used to send and receive messages on<br />
that channel. It is possible to direct UMP to use shared memory or MEMORY CHANNEL address<br />
space for the channel buffers, depending on the relative location of the associated processes.<br />
In addition, UMP provides a function to wait on any event (e.g., arrival of a message,<br />
creation or deletion of a channel). In total, UMP provides a dozen functions, which are<br />
listed in Table 3. Most of the functions relate to initialization, shutdown, and<br />
miscellaneous operations. Three functions establish the channel connection, and three<br />
functions perform all message communications.<br />
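The channel abstraction, a pair of unidirectional buffers giving duplex communication, can be sketched as follows. This is a toy model, not UMP: each one-way buffer holds a single message (the real buffers are larger and support streaming), and the two endpoints are numbered 0 and 1, each sending on its own direction and receiving on the other.

```c
#include <assert.h>
#include <string.h>

#define SLOT 128

/* One-way buffer: a single message slot keeps the sketch short. */
struct oneway { char data[SLOT]; int len; };

/* A channel is a pair of one-way buffers providing two-way
 * communication between endpoints 0 and 1. */
struct channel { struct oneway dir[2]; };

/* Endpoint `ep` sends on direction `ep`. */
static void chan_send(struct channel *c, int ep, const char *msg, int len) {
    memcpy(c->dir[ep].data, msg, len);
    c->dir[ep].len = len;
}

/* Endpoint `ep` receives from the opposite direction. */
static int chan_recv(struct channel *c, int ep, char *out) {
    struct oneway *b = &c->dir[1 - ep];
    memcpy(out, b->data, b->len);
    return b->len;
}
```

Because each direction has its own buffer, the two endpoints never contend for the same memory when sending simultaneously, which is the property the paper's N2 buffers rely on.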
UMP channels provide guaranteed error detection but not recovery. Through the use of<br />
TruCluster MEMORY CHANNEL Software error-checking routines, we were able to provide efficient<br />
error detection in UMP. We decided to let the higher layers implement error recovery. As a<br />
result, designers of higher layers can control the performance penalty they incur by<br />
specifying their own error recovery mechanisms, or, since reliability is high, can adopt<br />
a fail-on-error strategy.<br />
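The two higher-layer policies can be contrasted with a small sketch. Everything here is hypothetical (the function names and error-injection state are invented for illustration): the transport reports an error code but never retries, and the layer above chooses either to retry or to fail fast.

```c
#include <assert.h>

/* Stand-in for an error-detecting send: fails while the countdown
 * is positive, then succeeds. (Injected errors, for illustration.) */
typedef int (*send_fn)(int *fail_countdown);

static int flaky_send(int *fail_countdown) {
    if (*fail_countdown > 0) { (*fail_countdown)--; return -1; } /* error detected */
    return 0;                                                    /* delivered */
}

/* Policy 1: the higher layer supplies its own recovery by retrying. */
static int send_with_retry(send_fn f, int *state, int max_tries) {
    for (int i = 0; i < max_tries; i++)
        if (f(state) == 0) return 0;
    return -1;
}

/* Policy 2: fail-on-error; since errors are rare, just propagate. */
static int send_fail_fast(send_fn f, int *state) {
    return f(state);
}
```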
Performance<br />
UMP avoids any calls to the kernel and any copying of data across the kernel boundary.<br />
Messages are written directly into the reception buffer of the destination channel. Data is<br />
copied once from the user's buffer to physical memory on the destination node by the sending<br />
process. The receiving process then copies the data from local physical memory to the<br />
destination user's buffer. By comparison, the number of copies involved in a similar<br />
operation over a LAN using sockets is greater. In this case, the data has to be copied into<br />
the kernel, where the network driver uses DMA to copy it again into the memory of the network<br />
adapter. At this point the data is transmitted onto the LAN.<br />
The first version of UMP used one large shared region of MEMORY CHANNEL space to contain its<br />
channel buffers and a broadcast mapping to transmit this simultaneously to all nodes in the<br />
cluster. This version of UMP also used loopback to reflect transmissions back to the<br />
corresponding receive region on the sending node, which resulted in a loss of available<br />
bandwidth. Using our AlphaServer 2100 4/190 development machines, we measured<br />
Digital Technical Journal Vol. 8 No. 2 1996 107<br />
Table 3<br />
UMP API Functions<br />
ump_init: Initializes UMP and allocates the necessary resources.<br />
ump_exit: Shuts down UMP and deallocates any resources used by the calling process.<br />
ump_open: Opens a duplex channel between two endpoints over a given transport (shared memory or MEMORY CHANNEL). Channel endpoints are identified by user-supplied, 64-bit integer handles.<br />
ump_close: Closes a specified UMP channel, deallocating all resources assigned to that channel as necessary.<br />
ump_listen: Registers an endpoint for a channel over a specified transport. This can be used by a server process to wait on connections from clients with unknown handles. This function returns immediately, but the channel is created only when another task opens the channel. This can be detected using ump_wait.<br />
ump_wait: Waits for a UMP event to occur, either on one specified channel to this task or on all channels to this task.<br />
ump_read: Reads a message from a specified channel.<br />
ump_write: Writes a message to a specified channel. This function is blocking, i.e., it does not return until the complete message has been written to the channel.<br />
ump_nbread: Starts reading a message from a channel, i.e., it returns as soon as a specified amount of the message has been received, but not necessarily all the message.<br />
ump_nbwrite: Starts writing a message to a specified channel, i.e., it returns as soon as the write has started. A background thread will continue writing the message until it is completely transmitted.<br />
ump_mcast: Writes a message to a specified list of channels.<br />
ump_info: Returns UMP configuration and status information.<br />
• Latency: 11 μs (MEMORY CHANNEL), 4 μs (shared memory)<br />
• Bandwidth: 16 MB/s (MEMORY CHANNEL), 30 MB/s (shared memory)<br />
To increase bandwidth, we modified UMP to use transmit-only regions for its channel buffers,<br />
thus eliminating loopback. The performance measured for the revised UMP using the same<br />
machines was<br />
• Latency: 9 μs (MEMORY CHANNEL), 3 μs (shared memory)<br />
• Bandwidth: 23 MB/s (MEMORY CHANNEL), 32 MB/s (shared memory)<br />
Figure 8 shows the message transfer time and Figure 9 shows the bandwidth for various message<br />
sizes for the revised version of UMP using both blocking and nonblocking writes over shared<br />
memory and the MEMORY CHANNEL network. Using newer AlphaServer 4100 5/300 machines, which<br />
have a faster I/O subsystem than the older machines, and version 1.5 MEMORY CHANNEL adapters,<br />
the measured latency is 5.8 μs (MEMORY CHANNEL), 2 μs (shared memory). The peak bandwidth<br />
achieved is 61 MB/s (MEMORY CHANNEL), 75 MB/s (shared memory). In the nonblocking cases, the<br />
buffer size used was 256 kilobytes (KB) for shared memory and 32 KB for MEMORY CHANNEL.<br />
Further work is under way to improve the performance using shared memory as the transport.<br />
This work is ...<br />
[Figure 8 is a log-log plot of message transfer time in microseconds against message size<br />
from 10 to 1,000,000 bytes, for the four cases below.]<br />
KEY:<br />
UMP BLOCKING (SHARED MEMORY)<br />
UMP BLOCKING (MEMORY CHANNEL)<br />
UMP NONBLOCKING (SHARED MEMORY)<br />
UMP NONBLOCKING (MEMORY CHANNEL)<br />
Figure 8<br />
UMP Communications Performance: Transfer Time<br />
[Figure 9 plots bandwidth in MB/s (0 to 50) against message size from 200,000 to 1,000,000<br />
bytes, for the four cases below.]<br />
KEY:<br />
UMP BLOCKING (SHARED MEMORY)<br />
UMP BLOCKING (MEMORY CHANNEL)<br />
UMP NONBLOCKING (SHARED MEMORY)<br />
UMP NONBLOCKING (MEMORY CHANNEL)<br />
Figure 9<br />
UMP Communications Performance: Bandwidth<br />
PVM when using networks like Ethernet or FDDI. The high cost of communications for these<br />
systems means that only the more coarse-grained parallel applications have demonstrated<br />
performance improvements as a result of parallelization using PVM. Using the MEMORY CHANNEL<br />
cluster technology described earlier, we have implemented an optimized PVM that offers low<br />
latency and high-bandwidth communications. The PVM library and daemon use UMP to provide<br />
seamless communications over the MEMORY CHANNEL cluster.<br />
When we began to develop PVM for MEMORY CHANNEL clusters, we had one overriding goal: to use<br />
the hardware performance the MEMORY CHANNEL interconnect offers to provide a PVM with<br />
industry-leading communications performance, specifically with regard to latency. Initially,<br />
we set a target latency for PVM of less than 15 μs using shared memory and less than 30 μs<br />
using the MEMORY CHANNEL transport.<br />
Our first task was to build a prototype using the public-domain PVM implementation. We used<br />
an early prototype of the MEMORY CHANNEL system jointly developed by Digital and Encore. The<br />
prototype had a hardware latency of 4 μs. We modified the shared-memory version of PVM to use<br />
the prototype hardware and achieved a PVM latency of 60 μs. Profiling and straightforward<br />
code analysis revealed that most of the overhead was caused by<br />
• PVM's support for heterogeneity ...<br />
Figure 10<br />
[The figure shows a MEMORY CHANNEL cluster of three hosts. Hosts 1 and 2 each run a PVM<br />
daemon and PVM application processes; host 3 runs a PVM daemon, a PVM application, and a PVM<br />
gateway with a PVM3 daemon interface. Every process is built on the PVM API library over<br />
UNIX and UMP. Labeled paths A through G illustrate the interactions listed below.]<br />
KEY:<br />
A: A PVM application on host 1 performs local control functions using UNIX signals.<br />
B: A PVM application on host 1 communicates with another PVM task on the same host using UMP (via shared memory).<br />
C: A PVM application on host 1 communicates with another PVM task on a different host in the cluster (host 2) using UMP (via MEMORY CHANNEL).<br />
D: A PVM application on host 1 requires a control function (e.g., a signal) to be executed on another host in the cluster (host 3); it sends a request to a PVM daemon on host 3.<br />
E: The PVM daemon on host 3 executes the control function.<br />
F: A PVM application on host 1 sends a message to a PVM task on a host outside the MEMORY CHANNEL cluster; the message is routed to the PVM gateway task on host 3.<br />
G: The PVM gateway translates the cluster message into a form compatible with the external PVM implementation and forwards the message to the external task via IP sockets.<br />
Digital PVM Architecture<br />
gateway daemon. The PVM API library is a complete rewrite of the standard PVM version 3.3<br />
API, with which full functional compatibility is maintained. Emphasis has been placed on<br />
optimizing the performance of the most frequently used code paths. In addition, all data<br />
structures and data transfers have been optimized for the Alpha architecture. As stated<br />
earlier, the amount of message passing between tasks and the local daemon has been minimized<br />
by performing most operations in-line and communicating with the daemon only when absolutely<br />
necessary. Intermediate buffers are used for copying data between the user buffers. This is<br />
necessary because of the semantics of PVM, which allow operations on buffer contents before<br />
and after a message has been sent. The one exception to this is pvm_psend; in this case,<br />
data is copied directly since the user is not allowed to modify the send buffer.<br />
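The reason the intermediate copy is needed can be shown with a small simulation. This is not the PVM library itself; the two functions below merely mimic the contrasting semantics: a pvm_send-style path must snapshot the user buffer, because PVM lets the caller modify it immediately after the call, while a pvm_psend-style path may read the buffer in place, because the caller promises not to touch it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simulated "message in flight". */
static char *msg_data;
static const char *inplace_ptr;
static int msg_len;

/* pvm_send-style: the library copies the user buffer into an
 * intermediate buffer, so later changes to the user buffer cannot
 * corrupt the message. (Simulation, not the PVM library.) */
static void buffered_send(const char *buf, int len) {
    msg_data = malloc(len);
    memcpy(msg_data, buf, len);
    msg_len = len;
}

/* pvm_psend-style: no intermediate copy; the library may read the
 * user buffer in place because the caller must not modify it. */
static void psend(const char *buf, int len) {
    inplace_ptr = buf;
    msg_len = len;
}
```

After buffered_send, overwriting the user buffer leaves the in-flight message intact; after psend, the in-flight message aliases the user buffer, which is exactly why PVM forbids modifying it.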
The purpose of our PVM daemon is different from that of the daemon in the standard PVM<br />
package. Our daemon is designed to relay commands between different nodes in the PVM<br />
cluster. It exists solely to<br />
contributions of each of the group members, etc. The group functions are implemented<br />
separately from the other PVM messaging functions. They use a separate global structure<br />
(the group table) to manage PVM group data. Access to the group table is controlled by<br />
locks. Unlike other PVM implementations, there is no PVM group server, since all group<br />
operations can manipulate the group table directly.<br />
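A lock-protected group table of this kind can be sketched as follows. The structure and names are illustrative, not Digital PVM's actual internals: every task calls group_join directly, a mutex serializes access, and the first member of a group creates its entry, so no central group-server process is needed.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_GROUPS 8
#define NAME_LEN 32

/* One entry per dynamic group. */
struct group { char name[NAME_LEN]; int nmembers; };

static struct group table[MAX_GROUPS];
static int ngroups = 0;
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;

/* Join a group, creating it on first use; returns the caller's
 * instance number within the group (0, 1, 2, ...). */
int group_join(const char *name) {
    pthread_mutex_lock(&table_lock);
    struct group *g = NULL;
    for (int i = 0; i < ngroups; i++)
        if (strcmp(table[i].name, name) == 0) { g = &table[i]; break; }
    if (!g) {                          /* first member creates the entry */
        g = &table[ngroups++];
        strncpy(g->name, name, NAME_LEN - 1);
        g->nmembers = 0;
    }
    int inst = g->nmembers++;
    pthread_mutex_unlock(&table_lock);
    return inst;
}
```

Because the table lives in memory every task can reach, a join is one locked table update rather than a round trip to a server process.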
Performance<br />
Table 4 compares the communications latency achieved by various PVM implementations. As the<br />
table indicates, the latency between two machines with Digital PVM over a MEMORY CHANNEL<br />
transport is much less than the latency of the public-domain PVM implementation over shared<br />
memory, which validates our approach of removing support for heterogeneity from the critical<br />
performance path. ... The figures for Digital PVM latency over the MEMORY CHANNEL transport<br />
given in Reference 6 were obtained using an earlier version of Digital PVM. Since those<br />
results were measured, latency has been halved, mostly due to improvements in UMP<br />
performance ...<br />
Figure 14<br />
[(a) Initial Prototype: the MPI PORTABLE API LIBRARY sits on the MPICH ABSTRACT DEVICE,<br />
which sits on the MPICH CHANNEL INTERFACE, which sits on UMP over the SHARED MEMORY and<br />
MEMORY CHANNEL transports; the MPICH abstract device interface separates the portable<br />
library from the layers below.]<br />
[(b) Version 1.0 Implementation: the MPI PORTABLE API LIBRARY sits on the MPICH ABSTRACT<br />
DEVICE FRONT END, which sits directly on UMP over the SHARED MEMORY and MEMORY CHANNEL<br />
transports.]<br />
Digital MPI Architecture<br />
using MEMORY CHANNEL. By comparison, the unmodified MPICH achieves a peak bandwidth of<br />
24 MB/s using shared memory and 5.5 MB/s using TCP/IP over an FDDI LAN.<br />
Figure 17 shows the speedup demonstrated by an MPI application. The application is the<br />
Accelerated Strategic Computing Initiative (ASCI) benchmark SPPM, which solves a<br />
three-dimensional gas dynamics problem on a uniform Cartesian mesh. The same code was run<br />
using both Digital MPI and MPICH using TCP/IP. The hardware configuration was a two-node<br />
MEMORY CHANNEL cluster of AlphaServer 8400 5/350 machines, each with six CPUs. Digital MPI<br />
used shared memory and MEMORY CHANNEL transports, whereas MPICH used the Ethernet LAN<br />
connecting the machines. The maximum speedup<br />
obtained using Digital MPI was approximately 7, whereas for MPICH the maximum speedup was<br />
approximately 1.6.<br />
Table 5<br />
MPI Latency Comparison<br />
MPICH 1.0.10: Sockets FDDI, DEC 3000/800, 350 μs<br />
MPICH 1.0.10: Shared Memory, AlphaServer 2100 4/233, 30 μs<br />
Digital MPI V1.0: MEMORY CHANNEL 1.0, AlphaServer 2100 4/233, 16 μs<br />
Digital MPI V1.0: MEMORY CHANNEL 1.5, AlphaServer 4100 5/300, 6.9 μs<br />
Digital MPI V1.0: Shared Memory, AlphaServer 2100 4/233, 9.5 μs<br />
Digital MPI V1.0: Shared Memory, AlphaServer 4100 5/300, 5.2 μs<br />
Future Work<br />
We intend to continue refining the components described in this paper. The major change<br />
envisioned regarding the TruCluster MEMORY CHANNEL Software product is the addition of<br />
user-space spin locks, which should significantly reduce ... the use of an intermediate<br />
buffer ...<br />
... scientific applications that utilizes a new network technology to bypass the software<br />
overhead that limits the applicability of traditional networks. The performance of this<br />
system has been proven to be on a par with that of current supercomputer technology and has<br />
been achieved using commodity technology developed for Digital's commercial cluster<br />
products. The paper demonstrates the suitability of the MEMORY CHANNEL technology as a<br />
communications medium for scalable system development.<br />
7. M. Blumrich et al., "Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer," Proceedings of the Twenty-first Annual International Symposium on Computer Architecture (April 1994): 142-153.<br />
8. M. Blumrich et al., "Two Virtual Memory Mapped Network Interface Designs," Proceedings of the Hot Interconnects II Symposium, Palo Alto, Calif. (August 1994): 134-142.<br />
9. L. Iftode et al., "Improving Release-Consistent Shared Virtual ..."<br />
... "High Performance Fortran for Distributed-memory Systems," Digital Technical Journal, vol. 7, no. 3 (1995): 5-23.<br />
17. E. Benson et al., "Design of Digital's Parallel Software Environment," Digital Technical Journal, vol. 7, no. 3 (1995): 24-38.<br />
18. In the first implementations, the PCI MEMORY CHANNEL network adapter places a limit of 128 MB on the amount of MEMORY CHANNEL space that can be allocated.<br />
19. TruCluster MEMORY CHANNEL Software Programmer's Manual (Maynard, Mass.: ...)<br />
James V. Lawton<br />
... implementing UMP and adding support for collective operations to Digital PVM. Before<br />
that, he worked on the characterization and optimization of customer scientific/technical<br />
benchmark codes and on various hardware and software design projects. Prior to coming to<br />
Digital, Jim contributed to the design of analog and digital motion control systems and<br />
sensors at the Inland Motor Division of Kollmorgen Corporation. Jim received a B.E. in<br />
electrical engineering (1982) and an M.Eng.Sc. (1985) from University College Cork, Ireland,<br />
where he wrote his thesis on the design of an electronic control system for variable<br />
reluctance motors. In addition to receiving the Hewlett-Packard (Ireland) Award for<br />
Innovation (1982), Jim holds one patent and has published several papers. He is a member<br />
of IEEE and ACM.<br />
John J. Brosnan<br />
John Brosnan is currently a principal engineer in the Technical Computing Group, where he is<br />
project leader for Digital PVM. In prior positions, he worked on the High Performance<br />
Fortran test suite and was a significant contributor to a variety of publishing technology<br />
products. John joined Digital after receiving his B.Eng. in electronic engineering in 1986<br />
from the University of Limerick, Ireland. He received his M.Eng. in computer systems in<br />
1994, also from the University of Limerick.<br />
Morgan P. Doyle<br />
In 1994, Morgan Doyle came to Digital to work on the High Performance Fortran test suite.<br />
Presently, he is an engineer in the Technical Computing Group. Early on, he contributed<br />
significantly to the design and development of the TruCluster MEMORY CHANNEL Software, and<br />
he is now responsible for its development. Morgan received his B.A.I. and B.A. in electronic<br />
engineering (1991) and his M.Sc. (1993) from Trinity College Dublin, Ireland.<br />
Seosamh D. Ó Riordáin<br />
Seosamh Ó Riordáin is an engineer in the Technical Computing Group, where he is currently<br />
working on Digital MPI and on enhancements to the UMP library. Previously, he contributed<br />
to the design and implementation of the TruCluster MEMORY CHANNEL Software. Seosamh joined<br />
Digital after receiving his B.Sc. (1991) and M.Sc. (1993) in computer science from the<br />
University of Limerick, Ireland.<br />
Timothy G. Reddin<br />
... for which he holds two patents, and the design of an I/O and communications controller.<br />
Tim also worked at Raytheon on the data communications subsystem for the NEXRAD distributed<br />
real-time Doppler weather radar system ... Tim is a member of the British Computer Society<br />
and is a Chartered Engineer.<br />
The Design of User<br />
Interfaces for Digital<br />
Speech Recognition<br />
Software<br />
Digital Speech Recognition Software (DSRS) adds a new mode of interaction between people and<br />
computers: speech. DSRS is a command and control application integrated with the UNIX<br />
desktop environment. It accepts user commands spoken into a microphone and converts them<br />
into keystrokes. The project goal for DSRS was to provide an easy-to-learn and easy-to-use<br />
computer-user interface that would be a powerful productivity tool. Making DSRS simple and<br />
natural to use was a challenging engineering problem in user interface design. Also<br />
challenging was the development of the part of the interface that communicates with the<br />
desktop and applications. DSRS designers had to solve timing-induced problems associated<br />
with entering keystrokes into applications at a rate much higher than that at which people<br />
type. The DSRS project clarifies the need to continue the development of improved speech<br />
integration with applications as speech recognition and text-to-speech technologies become<br />
a standard part of the modern desktop computer.<br />
Bernard A. Rozmovits<br />
In the 1960s and early 1970s, people controlled computers using toggle switches, punched<br />
cards, and punched paper tape. In the 1970s, the common control mechanism was the keyboard<br />
on teletypes and on video terminals. In the 1980s, with the advent of graphical user<br />
interfaces, people found that a new mode of interaction with the computer was useful. The<br />
concept of a pointer, the mouse, evolved. Its popularity grew such that the mouse is now a<br />
standard component of every modern computer. In the 1990s, the time is right to add yet<br />
another mode of interaction with the computer. As compute power grows each year, the<br />
boundary of the man-machine interface can move from interaction that is native to the<br />
computer toward communication that is ... performs some actions. The action might be to<br />
launch an application, for example, in response to the command "bring up calendar"; or to<br />
type something, for example, in response to "edit to-do list," to invoke emacs<br />
\files\projectA\todo.txt. DSRS not only houses the speech macro capability but also provides<br />
a user interface, a speech recognition engine, and interfaces to the X Window System.<br />
Following is a high-level description of how the software functions. Commands are spoken<br />
into a microphone, and the audio is captured and digitized. The first step in the processing<br />
is the speech analysis system, which provides a spectral representation ... for each match.<br />
... the match is acceptable, DSRS retrieves keystrokes associated with each utterance, and<br />
the keystrokes are then sent into the system's keyboard buffer or to the appropriate<br />
application. For instances of continuous speech recognition, a sentence is recognized and<br />
keystrokes are concatenated to represent the utterance. For example, for the utterance "five<br />
two times seven three four equals," the keys "52 * 734 =" would be delivered to the<br />
calculator application.<br />
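The word-to-keystroke concatenation step can be sketched directly. The table below is illustrative; DSRS actually reads these translations from its vocabulary files, and the names here are invented for the example. Each recognized word is looked up and its keystrokes appended, producing the string delivered to the application in focus.

```c
#include <assert.h>
#include <string.h>

/* Map each recognized word to the keystrokes it produces.
 * (Illustrative table; DSRS stores such translations in its
 * vocabulary files.) */
struct entry { const char *word; const char *keys; };

static const struct entry vocab[] = {
    { "five", "5" }, { "two", "2" }, { "times", "*" },
    { "seven", "7" }, { "three", "3" }, { "four", "4" },
    { "equals", "=" },
};

/* Translate a recognized utterance (an array of words) into one
 * keystroke string, concatenating the per-word translations. */
void utterance_to_keys(const char **words, int n, char *out) {
    out[0] = '\0';
    for (int i = 0; i < n; i++)
        for (size_t j = 0; j < sizeof vocab / sizeof vocab[0]; j++)
            if (strcmp(words[i], vocab[j].word) == 0) {
                strcat(out, vocab[j].keys);
                break;
            }
}
```

For the paper's example utterance, the concatenation yields the key sequence 5 2 * 7 3 4 = for the calculator.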
Although this concept seems simple, its implementation raised significant system integration<br />
issues and directly affected the user interface design, which was critical to the product's<br />
success. This paper specifically addresses the user interface and integration issues and<br />
concludes with a discussion of future directions for speech recognition products.<br />
Project Objective<br />
The objective of the DSRS project was to provide a useful but limited tool to users of<br />
Digital's Alpha workstations running the UNIX operating system. DSRS would be designed as a<br />
low-cost speech recognition application and would be provided at no cost to workstation<br />
users for a finite period of time.<br />
When the project began in 1994, a number of command and control speech recognition products<br />
for PCs already existed. These programs were aimed at end users and performed useful tasks<br />
"out of the box," that is, immediately upon start-up. They all came with built-in vocabulary<br />
for common applications and gave users the ability to add their own vocabulary.<br />
On UNIX systems, however, speech recognition products existed only in the form of<br />
programmable recognizers, such as BBN Hark software. Our objective was to build a speech<br />
recognition product for the UNIX workstation that had the characteristics of the PC<br />
recognizers, that is, one that would be functional immediately upon start-up and would allow<br />
the nonprogrammer end user to customize the product's vocabulary.<br />
We studied several speech recognition products, including Talk->To from Dragon Systems,<br />
Inc., VoiceAssist from Creative Labs, Voice Pilot from Microsoft, and Listen from Verbex.<br />
We decided to provide users with the following features as the most desirable in a command<br />
and control speech recognition product:<br />
• Intuitive, easy-to-use interface<br />
• Speaker-independent models that would eliminate the need for extensive training<br />
• Speaker ...<br />
[Figure 1 is a block diagram: a microphone feeds the audio card; digitized audio passes to<br />
the front-end processor*, which produces feature packets for recognition against the speech<br />
models*; the Training Manager is reached through the user interface.]<br />
* Denotes a component licensed from Dragon Systems, Inc.<br />
Figure 1<br />
DSRS Architectural Block Diagram<br />
words can create models using the Vocabulary Manager user interface ... and acts as the<br />
agent to control focus in response to users' speech commands.<br />
The Vocabulary Manager user-interface window displays the current hierarchy of the finite<br />
state grammar file. The Vocabulary Manager permits the addition, deletion, and modification<br />
of words or macros. Also in this window, the command-utterance to keystroke translations<br />
are displayed, created, or modified.<br />
In the Training Manager user interface, the user may train newly created words or phrases<br />
in the user vocabulary files and retrain, or adapt, the product-supplied, independent<br />
vocabulary.<br />
The DSRS Implementation<br />
As the design team gained experience with the DSRS prototypes, we refined user procedures<br />
and interfaces. This section describes the key functions the team developed to ensure the<br />
user-friendliness of the product, including the first-time setup, the Speech Manager, the<br />
Training Manager, and the Vocabulary Manager. ... the following critical controls:<br />
• Microphone on/off switch.<br />
• A VU (volume units) meter that gives real-time feedback to the audio signal being heard.<br />
A VU meter is a visual feedback device commonly used on devices such as tape decks. Users<br />
are generally very comfortable using them.<br />
Figure 2<br />
DSRS Speech Manager Window<br />
... The Active vocabulary words are specific to the application in focus and change<br />
dynamically as the current application changes. The vocabularies are designed in this way<br />
so that a user can speak commands both within ... in focus. This context also determines<br />
the current state of the Active word list.<br />
• The history frame displays the word, phrase, or sentence last heard by the recognizer.<br />
The history frame is set up as a button. When pressed, it drops down to reveal the last 20<br />
recognized utterances.<br />
• A menu that provides access to the management of user files, the Vocabulary Manager ...<br />
a good training user interface.<br />
• Strong, clear indications that utterances are processed. We added a series of boxes that<br />
march across the screen at each utterance.<br />
• A glimpse of upcoming words. A list of words is displayed on the user interface and<br />
moves ... for example, Word 4 of 21.<br />
• Option to pause, resume, and restart training.<br />
• Large, bold font display of the word to be spoken and a small prompt, "Please continue,"<br />
displayed when the system is waiting for input.<br />
• Automatic addition of repeated utterances that<br />
Figure 3<br />
Training Manager Window<br />
bold font to differentiate it from the other elements in the window. To train the words,<br />
the user repeats an utterance from one to six times. The user must speak at the proper<br />
times to make training a smooth and efficient process. DSRS manages the process by<br />
prompting the speaker with visual cues. Right below the word is a set of boxes that<br />
represent the repetitions. The boxes are checked off as utterances are processed, providing<br />
positive visual feedback to the speaker. When one word is complete, the next word to be<br />
trained is displayed and the process is repeated. When all the words in the list are<br />
trained, the user saves the files, and DSRS returns to the Speech Manager and its active<br />
mode with the microphone turned off.<br />
Vocabulary Manager<br />
The Vocabulary Manager, an example of which is shown in Figure 4, enables users to modify<br />
speech ...<br />
[Figure 4 shows the Vocabulary Manager window with File, Vocabulary, and Help menus. The<br />
left pane lists the vocabularies: Always Active Vocab..., C Programming Mode, Calculator...,<br />
Calendar..., Calendar & Diary..., DtTerminal..., Emacs..., Emacs extensions..., Personal<br />
Emacs ex..., File Manager..., Mail..., Netscape..., Sentences for Switch..., Speech<br />
Manager.... The right pane lists the words in the selected vocabulary, for example: abort,<br />
backward, backward char, backwards a page, beginning of buffer, beginning of line, beginning<br />
of word, center on line, control g, copy region, delete backward char, delete character,<br />
delete line, delete previous word, delete region, delete word, edit declare project summary,<br />
Emacs extensions..., end of buffer. The status line reads "Number of active words in<br />
selected Vocabulary: 64."]<br />
Figure 4<br />
Vocabulary Manager Window<br />
The simple logic of this special function enhances<br />
user productivity. Often workstation and PC screens<br />
are littered with windows or application icons and<br />
icon boxes through which the user must search.<br />
Speech control eliminates the steps between the user<br />
thinking "I want the calculator" and the application<br />
being presented in focus, ready to be used. The DSRS<br />
team created a function called FocusOrLaunch, which<br />
implements the behavior described above. The function<br />
is encoded into the FSG continuous-switching<br />
mode sentences in the Always Active vocabulary<br />
associated with the spoken commands "switch to<br />
&lt;application&gt;," "go to &lt;application&gt;,"<br />
and just plain "&lt;application&gt;."<br />
Applications like calculator and calendar are not<br />
likely to be needed in multiple instances. However,<br />
applications such as terminal emulator windows are.<br />
DSRS defines the specific phrase "bring up &lt;application&gt;"<br />
to explicitly launch a new instance of the application;<br />
that is, the phrase "bring up &lt;application&gt;" is tied<br />
to a function named Launch. The phrases "next<br />
&lt;application&gt;" and "previous &lt;application&gt;"<br />
were chosen for navigating between instances of the<br />
same application. DSRS remembers the previous<br />
state of the application. For<br />
122 Digital Technical Journal Vol. 8 No. 2 1996<br />
instance, if the calendar application is minimized when<br />
the user says "switch to calendar," the calendar<br />
window is restored.
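The focus-or-launch behavior described above can be sketched as follows. The window-system helpers here (`launch`, the `windows` table, and the minimized/focused flags) are hypothetical stand-ins for the real window-management interface, not the actual DSRS implementation.

```python
# Sketch of a FocusOrLaunch-style function: if an instance of the
# application already exists, restore it (if minimized) and give it
# focus; otherwise launch a new instance. The window table below is
# an illustrative stand-in for the real window system.

windows = {}   # app name -> {"minimized": bool, "focused": bool}

def launch(app):
    """Start a new instance of the application (the 'bring up' case)."""
    windows[app] = {"minimized": False, "focused": True}

def focus_or_launch(app):
    win = windows.get(app)
    if win is None:
        launch(app)                  # no instance exists: launch one
        return "launched"
    if win["minimized"]:
        win["minimized"] = False     # restore a minimized window first
    win["focused"] = True            # then bring it into focus
    return "focused"

# "switch to calendar" while the calendar is minimized restores
# and focuses the existing window rather than launching a new one.
windows["calendar"] = {"minimized": True, "focused": False}
result = focus_or_launch("calendar")
```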
Version 1.1 allows the<br />
user to launch an audio-based application that will<br />
record, such as a teleconferencing session. Version 1.1<br />
also includes a function that allows the user to play<br />
a "wave," or digitized audio clip. Audio cues may thus<br />
be played as part of speech macros. The "say" command<br />
invokes DECtalk text-to-speech functionality<br />
so that audio events can be spoken!<br />
Since speech recognition is a st
In the implementation of DSRS, we encountered<br />
one unexpected problem. When a nested menu<br />
mnemonic was invoked, the second character was lost.<br />
The sequence of events was as follows:<br />
• A spoken word was recognized, and keystrokes<br />
were sent to the keyboard buffer.<br />
• The first character, ALT + &lt;key&gt;, acted normally<br />
and caused a pop-up menu to display.<br />
• The menu remained on display, and the last key was<br />
lost.<br />
We determined that the second keystroke was being<br />
delivered to the application before the pop-up menu<br />
was displayed. Therefore, at the time the key was<br />
pressed, it did not yet have meaning to the application.<br />
It is apparent that such applications are written for<br />
a human reaction-based paradigm. DSRS, on the<br />
other hand, is typing on behalf of the user at computer<br />
speeds and is not waiting for the pop-up menu to<br />
display before entering the next key.<br />
To overcome this problem, we developed a synchronizing<br />
function. Normally the Vocabulary Manager<br />
notation to send an ALT + f followed by an<br />
x would be ALT + fx. This new synchronizing function<br />
was designated as sALT + fx. The synchronizing<br />
function sends the ALT + f and then monitors events<br />
for a map-notify message indicating that the pop-up<br />
menu has been written to the screen. The character<br />
following ALT + f is then sent, in this case, the x.<br />
The synchronizing function also has a watchdog timer<br />
to prevent a hang in the event that the menu is never displayed.<br />
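The synchronizing function's send-wait-send pattern can be sketched as an event-driven wait. The event queue and `send_key` callback below are illustrative assumptions, not the actual DSRS interface; the watchdog is modeled with a blocking-get timeout.

```python
# Sketch of an sALT+fx-style synchronizing function: send the first
# keystroke, block until a map-notify event reports the pop-up menu is
# on screen, then send the remaining characters. The watchdog timeout
# prevents a hang if the menu never appears. The event queue and
# send_key callback are hypothetical stand-ins for the window system.
import queue

def send_keys_synchronized(send_key, events, first, rest, timeout=2.0):
    send_key(first)                           # e.g. "ALT+f" opens the menu
    try:
        event = events.get(timeout=timeout)   # watchdog timer on the wait
    except queue.Empty:
        return False                          # menu never displayed; give up
    if event == "map-notify":                 # menu has been written to screen
        for ch in rest:
            send_key(ch)                      # now safe to send the 'x'
        return True
    return False

# Simulated use: the window system posts "map-notify" once the menu maps.
sent = []
events = queue.Queue()
events.put("map-notify")
ok = send_keys_synchronized(sent.append, events, "ALT+f", "x")
```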
… in today's applications. A user agent or secretary program<br />
that could process common requests delivered<br />
entirely by speech is not out of reach even with the<br />
technology available today, for example:<br />
User: What time is it?<br />
Computer: It is now 1:30 p.m.<br />
User: Do I have any meetings today?<br />
Computer: Staff meeting is ten o'clock to twelve<br />
o'clock in the corner conference room.<br />
Computer: Mike Jones is e
References<br />
1. L. Rabiner and B. Juang, Fundamentals of Speech<br />
Recognition (Englewood Cliffs, N.J.: Prentice-Hall,<br />
Inc., 1993): 45-46.<br />
2. C. Schmandt, Voice Communication with Computers:<br />
Conversational Systems (New York, N.Y.: Van Nostrand<br />
Reinhold, 1994): 144-145.<br />
3. K.F. Lee, Large-Vocabulary Speaker-Independent<br />
Continuous Speech Recognition: The SPHINX System<br />
(Pittsburgh, Pa.: Carnegie-Mellon University Computer<br />
Science Department, April 1988).<br />
4. W. Hallahan, "DECtalk Software: Text-to-Speech Technology<br />
and Implementation," Digital Technical<br />
Journal, vol. 7, no. 4 (1995): 5-19.<br />
5. G. Lowney, The Microsoft Windows Guidelines for<br />
Accessible Software Design (Redmond, Wash.:<br />
Microsoft Development Library, 1995): 3-4.<br />
6. N. Yankelovich, G. Levow, and M. Marx, "Designing<br />
SpeechActs: Issues in Speech User Interfaces," Proceedings<br />
of ACM Conference on Computer-Human<br />
Interaction (CHI) '95: Human Factors in Computing<br />
Systems: Mosaic of Creativity, Denver, Colo. (May<br />
1995): 369-376.<br />
Biography<br />
Bernard A. Rozmovits<br />
During his tenure at Digital, Bernie Rozmovits has worked<br />
on both sides of computer engineering: hardware and software.<br />
Currently he manages Speech Services in Digital's<br />
Light and Sound Software Group, which developed the<br />
user interfaces for Digital's Speech Recognition Software<br />
and also developed the DECtalk software product. Prior<br />
to joining this software effort, he focused on hardware<br />
engineering in the Computer Special Systems Group<br />
and was the architect for voice-processing platforms in<br />
the Image, Voice and Video Group. Bernie received a<br />
Diplome D'Etude Collegiale (DEC) from Dawson<br />
College, Montreal, Quebec, Canada, in 1974. He<br />
holds a patent entitled "Data Format for Packets of<br />
Information," U.S. Patent No. 5,317,719.<br />
Further Readings<br />
The Digital Technical Journal is a refereed, quarterly<br />
publication of papers that explore the foundations of<br />
Digital's products and technologies. Journal content<br />
is selected by the Journal Advisory Board, and papers<br />
are written by Digital's engineers and engineering<br />
partners. Engineers who would like to contribute a<br />
paper to the Journal should contact the Managing<br />
Editor, Jane Blake, at Jane.Blake@ljo.dec.com.<br />
Topics covered in previous issues of the<br />
Digital Technical Journal are as follows:<br />
Digital UNIX Clusters/Object Modification Tools/<br />
eXcursion for Windows Operating Systems/<br />
Network Directory Services<br />
Vol. 8, No. 1, 1996, EY-U025E-TJ<br />
Audio and Video Technologies/UNIX Available Servers/<br />
Real-time Debugging Tools<br />
Vol. 7, No. 4, 1995, EY-U002E-TJ<br />
High Performance Fortran in Parallel Environments/<br />
Sequoia 2000 Research<br />
Vol. 7, No. 3, 1995, EY-T838E-TJ<br />
(Available only on the Internet)<br />
Graphical Software Development/Systems Engineering<br />
Vol. 7, No. 2, 1995, EY-U001E-TJ<br />
Database Integration/Alpha Servers & Workstations/<br />
Alpha 21164 CPU<br />
Vol. 7, No. 1, 1995, EY-T135E-TJ<br />
(Available only on the Internet)<br />
RAID Array Controllers/Workflow Models/<br />
PC LAN and System Management Tools<br />
Vol. 6, No. 4, Fall 1994, EY-T118E-TJ<br />
AlphaServer Multiprocessing Systems/<br />
DEC OSF/1 Symmetric Multiprocessing/<br />
Scientific Computing Optimization for Alpha<br />
Vol. 6, No. 3, Summer 1994, EY-S799E-TJ<br />
Alpha AXP Partners - Cray, Raytheon, Kubota/<br />
DECchip 21071/21072 PCI Chip Sets/<br />
DLT2000 Tape Drive<br />
Vol. 6, No. 2, Spring 1994, EY-F947E-TJ<br />
High-performance Networking/<br />
OpenVMS AXP System Software/<br />
Alpha AXP PC Hardware<br />
Vol. 6, No. 1, Winter 1994, EY-Q011E-TJ<br />
Software Process and Quality<br />
Vol. 5, No. 4, Fall 1993, EY-P920E-DP<br />
Product Internationalization<br />
Vol. 5, No. 3, Summer 1993, EY-P986E-DP<br />
Multimedia/Application Control<br />
Vol. 5, No. 2, Spring 1993, EY-P963E-DP<br />
DECnet Open Networking<br />
Vol. 5, No. 1, Winter 1993, EY-M770E-DP<br />
Alpha AXP Architecture and Systems<br />
Vol. 4, No. 4, Special Issue 1992, EY-J886E-DP<br />
NVAX-microprocessor VAX Systems<br />
Vol. 4, No. 3, Summer 1992, EY-J884E-DP<br />
Semiconductor Technologies<br />
Vol. 4, No. 2, Spring 1992, EY-L521E-DP<br />
PATHWORKS: PC Integration Software<br />
Vol. 4, No. 1, Winter 1992, EY-J825E-DP<br />
Image Processing, Video Terminals, and<br />
Printer Technologies<br />
Vol. 3, No. 4, Fall 1991, EY-H889E-DP<br />
Availability in VAXcluster Systems/<br />
Network Performance and Adapters<br />
Vol. 3, No. 3, Summer 1991, EY-H890E-DP<br />
Fiber Distributed Data Interface<br />
Vol. 3, No. 2, Spring 1991, EY-H876E-DP<br />
Transaction Processing, Databases, and<br />
Fault-tolerant Systems<br />
Vol. 3, No. 1, Winter 1991, EY-F588E-DP<br />
VAX 9000 Series<br />
Vol. 2, No. 4, Fall 1990, EY-E762E-DP<br />
DECwindows Program<br />
Vol. 2, No. 3, Summer 1990, EY-E756E-DP<br />
VAX 6000 Model 400 System<br />
Vol. 2, No. 2, Spring 1990, EY-C197E-DP<br />
Compound Document Architecture<br />
Vol. 2, No. 1, Winter 1990, EY-C196E-DP<br />
ISSN 0898-901X<br />
Printed in U.S.A. EC-N6992-18/96 9 14 20.0 Copyright © <strong>Digital</strong> Equipment Corporation