12.07.2015 Views

Understanding The Linux Kernel.pdf - whb.es



Understanding the Linux Kernel
Daniel P. Bovet, Marco Cesati
Publisher: O'Reilly
First Edition October 2000
ISBN: 0-596-00002-2, 702 pages

Understanding the Linux Kernel helps readers understand how Linux performs best and how it meets the challenge of different environments. The authors introduce each topic by explaining its importance, and show how kernel operations relate to the utilities that are familiar to Unix programmers and users.


Table of Contents

Preface
    The Audience for This Book
    Organization of the Material
    Overview of the Book
    Background Information
    Conventions in This Book
    How to Contact Us
    Acknowledgments

1. Introduction
    1.1 Linux Versus Other Unix-Like Kernels
    1.2 Hardware Dependency
    1.3 Linux Versions
    1.4 Basic Operating System Concepts
    1.5 An Overview of the Unix Filesystem
    1.6 An Overview of Unix Kernels

2. Memory Addressing
    2.1 Memory Addresses
    2.2 Segmentation in Hardware
    2.3 Segmentation in Linux
    2.4 Paging in Hardware
    2.5 Paging in Linux
    2.6 Anticipating Linux 2.4

3. Processes
    3.1 Process Descriptor
    3.2 Process Switching
    3.3 Creating Processes
    3.4 Destroying Processes
    3.5 Anticipating Linux 2.4

4. Interrupts and Exceptions
    4.1 The Role of Interrupt Signals
    4.2 Interrupts and Exceptions
    4.3 Nested Execution of Exception and Interrupt Handlers
    4.4 Initializing the Interrupt Descriptor Table
    4.5 Exception Handling
    4.6 Interrupt Handling
    4.7 Returning from Interrupts and Exceptions
    4.8 Anticipating Linux 2.4

5. Timing Measurements
    5.1 Hardware Clocks
    5.2 The Timer Interrupt Handler
    5.3 PIT's Interrupt Service Routine
    5.4 The TIMER_BH Bottom Half Functions
    5.5 System Calls Related to Timing Measurements
    5.6 Anticipating Linux 2.4

6. Memory Management
    6.1 Page Frame Management
    6.2 Memory Area Management
    6.3 Noncontiguous Memory Area Management
    6.4 Anticipating Linux 2.4

7. Process Address Space
    7.1 The Process's Address Space
    7.2 The Memory Descriptor
    7.3 Memory Regions
    7.4 Page Fault Exception Handler
    7.5 Creating and Deleting a Process Address Space
    7.6 Managing the Heap
    7.7 Anticipating Linux 2.4

8. System Calls
    8.1 POSIX APIs and System Calls
    8.2 System Call Handler and Service Routines
    8.3 Wrapper Routines
    8.4 Anticipating Linux 2.4

9. Signals
    9.1 The Role of Signals
    9.2 Sending a Signal
    9.3 Receiving a Signal
    9.4 Real-Time Signals
    9.5 System Calls Related to Signal Handling
    9.6 Anticipating Linux 2.4

10. Process Scheduling
    10.1 Scheduling Policy
    10.2 The Scheduling Algorithm
    10.3 System Calls Related to Scheduling
    10.4 Anticipating Linux 2.4

11. Kernel Synchronization
    11.1 Kernel Control Paths
    11.2 Synchronization Techniques
    11.3 The SMP Architecture
    11.4 The Linux/SMP Kernel
    11.5 Anticipating Linux 2.4

12. The Virtual Filesystem
    12.1 The Role of the VFS
    12.2 VFS Data Structures
    12.3 Filesystem Mounting
    12.4 Pathname Lookup
    12.5 Implementations of VFS System Calls
    12.6 File Locking
    12.7 Anticipating Linux 2.4

13. Managing I/O Devices
    13.1 I/O Architecture
    13.2 Associating Files with I/O Devices
    13.3 Device Drivers
    13.4 Character Device Handling
    13.5 Block Device Handling
    13.6 Page I/O Operations
    13.7 Anticipating Linux 2.4

14. Disk Caches
    14.1 The Buffer Cache
    14.2 The Page Cache
    14.3 Anticipating Linux 2.4

15. Accessing Regular Files
    15.1 Reading and Writing a Regular File
    15.2 Memory Mapping
    15.3 Anticipating Linux 2.4

16. Swapping: Methods for Freeing Memory
    16.1 What Is Swapping?
    16.2 Swap Area
    16.3 The Swap Cache
    16.4 Transferring Swap Pages
    16.5 Page Swap-Out
    16.6 Page Swap-In
    16.7 Freeing Page Frames
    16.8 Anticipating Linux 2.4

17. The Ext2 Filesystem
    17.1 General Characteristics
    17.2 Disk Data Structures
    17.3 Memory Data Structures
    17.4 Creating the Filesystem
    17.5 Ext2 Methods
    17.6 Managing Disk Space
    17.7 Reading and Writing an Ext2 Regular File
    17.8 Anticipating Linux 2.4

18. Process Communication
    18.1 Pipes
    18.2 FIFOs
    18.3 System V IPC
    18.4 Anticipating Linux 2.4

19. Program Execution
    19.1 Executable Files
    19.2 Executable Formats
    19.3 Execution Domains
    19.4 The exec-like Functions
    19.5 Anticipating Linux 2.4

A. System Startup
    A.1 Prehistoric Age: The BIOS
    A.2 Ancient Age: The Boot Loader
    A.3 Middle Ages: The setup( ) Function
    A.4 Renaissance: The startup_32( ) Functions
    A.5 Modern Age: The start_kernel( ) Function

B. Modules
    B.1 To Be (a Module) or Not to Be?
    B.2 Module Implementation
    B.3 Linking and Unlinking Modules
    B.4 Linking Modules on Demand

C. Source Code Structure

Colophon


quite a good job in performing a specific task, we must first tell what kind of support comes from the hardware.

• Even if a large portion of a Unix kernel source code is processor-independent and coded in C language, a small and critical part is coded in assembly language. A thorough knowledge of the kernel thus requires the study of a few assembly language fragments that interact with the hardware.

When covering hardware features, our strategy will be quite simple: just sketch the features that are totally hardware-driven while detailing those that need some software support. In fact, we are interested in kernel design rather than in computer architecture.

The next step consisted of selecting the computer system to be described: although Linux is now running on several kinds of personal computers and workstations, we decided to concentrate on the very popular and cheap IBM-compatible personal computers, and thus on the Intel 80x86 microprocessors and on some support chips included in these personal computers. The term Intel 80x86 microprocessor will be used in the forthcoming chapters to denote the Intel 80386, 80486, Pentium, Pentium Pro, Pentium II, and Pentium III microprocessors or compatible models. In a few cases, explicit references will be made to specific models.

One more choice was the order followed in studying Linux components. We tried to follow a bottom-up approach: start with topics that are hardware-dependent and end with those that are totally hardware-independent. In fact, we'll make many references to the Intel 80x86 microprocessors in the first part of the book, while the rest of it is relatively hardware independent. Two significant exceptions are made in Chapter 11 and Chapter 13. In practice, following a bottom-up approach is not as simple as it looks, since the areas of memory management, process management, and filesystem are intertwined; a few forward references (that is, references to topics yet to be explained) are unavoidable.

Each chapter starts with a theoretical overview of the topics covered. The material is then presented according to the bottom-up approach. We start with the data structures needed to support the functionalities described in the chapter. Then we usually move from the lowest level of functions to higher levels, often ending by showing how system calls issued by user applications are supported.

Level of Description

Linux source code for all supported architectures is contained in about 4500 C and Assembly files stored in about 270 subdirectories; it consists of about 2 million lines of code, which occupy more than 58 megabytes of disk space. Of course, this book can cover a very small portion of that code. Just to figure out how big the Linux source is, consider that the whole source code of the book you are reading occupies less than 2 megabytes of disk space. Therefore, in order to list all code, without commenting on it, we would need more than 25 books like this! [1]

[1] Nevertheless, Linux is a tiny operating system when compared with other commercial giants. Microsoft Windows 2000, for example, reportedly has more than 30 million lines of code. Linux is also small when compared to some popular applications; Netscape Communicator 5 browser, for example, has about 17 million lines of code.

So we had to make some choices about the parts to be described. This is a rough assessment of our decisions:


trapdoors to reach hardware devices; Chapter 13 offers insights on these special files and on the corresponding hardware device drivers. Another issue to be considered is disk access time; Chapter 14 shows how a clever use of RAM reduces disk accesses and thus improves system performance significantly. Building on the material covered in these last chapters, we can now explain in Chapter 15 how user applications access normal files. Chapter 16 completes our discussion of Linux memory management and explains the techniques used by Linux to ensure that enough memory is always available. The last chapter dealing with files is Chapter 17, which illustrates the most-used Linux filesystem, namely Ext2.

The last two chapters end our detailed tour of the Linux kernel: Chapter 18 introduces communication mechanisms other than signals available to User Mode processes; Chapter 19 explains how user applications are started.

Last but not least are the appendixes: Appendix A sketches out how Linux is booted, while Appendix B describes how to dynamically reconfigure the running kernel, adding and removing functionalities as needed. Appendix C is just a list of the directories that contain the Linux source code. The Source Code Index includes all the Linux symbols referenced in the book; you will find here the name of the Linux file defining each symbol and the book's page number where it is explained. We think you'll find it quite handy.

Background Information

No prerequisites are required, except some skill in C programming language and perhaps some knowledge of Assembly language.

Conventions in This Book

The following is a list of typographical conventions used in this book:

Constant Width
    Is used to show the contents of code files or the output from commands, and to indicate source code keywords that appear in code.

Italic
    Is used for file and directory names, program and command names, command-line options, URLs, and for emphasizing new terms.

How to Contact Us

We have tested and verified all the information in this book to the best of our abilities, but you may find that features have changed or that we have let errors slip through the production of the book. Please let us know of any errors that you find, as well as suggestions for future editions, by writing to:

O'Reilly & Associates, Inc.
101 Morris St.
Sebastopol, CA 95472
(800) 998-9938 (in the U.S. or Canada)
(707) 829-0515 (international/local)
(707) 829-0104 (fax)


You can also send messages electronically. To be put on our mailing list or to request a catalog, send email to:

info@oreilly.com

To ask technical questions or to comment on the book, send email to:

bookquestions@oreilly.com

We have a web site for the book, where we'll list reader reviews, errata, and any plans for future editions. You can access this page at:

http://www.oreilly.com/catalog/linuxkernel/

We also have an additional web site where you will find material written by the authors about the new features of Linux 2.4. Hopefully, this material will be used for a future edition of this book. You can access this page at:

http://www.oreilly.com/catalog/linuxkernel/updates/

For more information about this book and others, see the O'Reilly web site:

http://www.oreilly.com/

Acknowledgments

This book would not have been written without the precious help of the many students of the school of engineering at the University of Rome "Tor Vergata" who took our course and tried to decipher the lecture notes about the Linux kernel. Their strenuous efforts to grasp the meaning of the source code led us to improve our presentation and to correct many mistakes.

Andy Oram, our wonderful editor at O'Reilly & Associates, deserves a lot of credit. He was the first at O'Reilly to believe in this project, and he spent a lot of time and energy deciphering our preliminary drafts. He also suggested many ways to make the book more readable, and he wrote several excellent introductory paragraphs.

Many thanks also to the O'Reilly staff, especially Rob Romano, the technical illustrator, and Lenny Muellner, for tools support.

We had some prestigious reviewers who read our text quite carefully (in alphabetical order by first name): Alan Cox, Michael Kerrisk, Paul Kinzelman, Raph Levien, and Rik van Riel. Their comments helped us to remove several errors and inaccuracies and have made this book stronger.

Daniel P. Bovet, Marco Cesati
September 2000


address space, common physical memory pages, common opened files, and so on. Linux defines its own version of lightweight processes, which is different from the types used on other systems such as SVR4 and Solaris. While all the commercial Unix variants of LWP are based on kernel threads, Linux regards lightweight processes as the basic execution context and handles them via the nonstandard clone( ) system call.

• Linux is a nonpreemptive kernel. This means that Linux cannot arbitrarily interleave execution flows while they are in privileged mode. Several sections of kernel code assume they can run and modify data structures without fear of being interrupted and having another thread alter those data structures. Usually, fully preemptive kernels are associated with special real-time operating systems. Currently, among conventional, general-purpose Unix systems, only Solaris 2.x and Mach 3.0 are fully preemptive kernels. SVR4.2/MP introduces some fixed preemption points as a method to get limited preemption capability.

• Multiprocessor support. Several Unix kernel variants take advantage of multiprocessor systems. Linux 2.2 offers an evolving kind of support for symmetric multiprocessing (SMP), which means not only that the system can use multiple processors but also that any processor can handle any task; there is no discrimination among them. However, Linux 2.2 does not make optimal use of SMP. Several kernel activities that could be executed concurrently (like filesystem handling and networking) must now be executed sequentially.

• Filesystem. Linux's standard filesystem lacks some advanced features, such as journaling. However, more advanced filesystems for Linux are available, although not included in the Linux source code; among them, IBM AIX's Journaling File System (JFS), and Silicon Graphics Irix's XFS filesystem. Thanks to a powerful object-oriented Virtual File System technology (inspired by Solaris and SVR4), porting a foreign filesystem to Linux is a relatively easy task.

• STREAMS. Linux has no analog to the STREAMS I/O subsystem introduced in SVR4, although it is included nowadays in most Unix kernels and it has become the preferred interface for writing device drivers, terminal drivers, and network protocols.

This somewhat disappointing assessment does not depict, however, the whole truth. Several features make Linux a wonderfully unique operating system. Commercial Unix kernels often introduce new features in order to gain a larger slice of the market, but these features are not necessarily useful, stable, or productive. As a matter of fact, modern Unix kernels tend to be quite bloated. By contrast, Linux doesn't suffer from the restrictions and the conditioning imposed by the market, hence it can freely evolve according to the ideas of its designers (mainly Linus Torvalds). Specifically, Linux offers the following advantages over its commercial competitors:

Linux is free.
    You can install a complete Unix system at no expense other than the hardware (of course).


Linux is fully customizable in all its components.
    Thanks to the General Public License (GPL), you are allowed to freely read and modify the source code of the kernel and of all system programs. [3]

    [3] Several commercial companies have started to support their products under Linux, most of which aren't distributed under a GNU Public License. Therefore, you may not be allowed to read or modify their source code.

Linux runs on low-end, cheap hardware platforms.
    You can even build a network server using an old Intel 80386 system with 4 MB of RAM.

Linux is powerful.
    Linux systems are very fast, since they fully exploit the features of the hardware components. The main Linux target is efficiency, and indeed many design choices of commercial variants, like the STREAMS I/O subsystem, have been rejected by Linus because of their implied performance penalty.

Linux has a high standard for source code quality.
    Linux systems are usually very stable; they have a very low failure rate and system maintenance time.

The Linux kernel can be very small and compact.
    Indeed, it is possible to fit both a kernel image and full root filesystem, including all fundamental system programs, on just one 1.4 MB floppy disk! As far as we know, none of the commercial Unix variants is able to boot from a single floppy disk.

Linux is highly compatible with many common operating systems.
    It lets you directly mount filesystems for all versions of MS-DOS and MS Windows, SVR4, OS/2, Mac OS, Solaris, SunOS, NeXTSTEP, many BSD variants, and so on. Linux is also able to operate with many network layers like Ethernet, Fiber Distributed Data Interface (FDDI), High Performance Parallel Interface (HIPPI), IBM's Token Ring, AT&T WaveLAN, DEC RoamAbout DS, and so forth. By using suitable libraries, Linux systems are even able to directly run programs written for other operating systems. For example, Linux is able to execute applications written for MS-DOS, MS Windows, SVR3 and R4, 4.4BSD, SCO Unix, XENIX, and others on the Intel 80x86 platform.

Linux is well supported.
    Believe it or not, it may be a lot easier to get patches and updates for Linux than for any proprietary operating system! The answer to a problem often comes back within a few hours after sending a message to some newsgroup or mailing list. Moreover, drivers for Linux are usually available a few weeks after new hardware products have been introduced on the market. By contrast, hardware manufacturers release device drivers for only a few commercial operating systems, usually the Microsoft ones.


Therefore, all commercial Unix variants run on a restricted subset of hardware components.

With an estimated installed base of more than 12 million and growing, people who are used to certain creature features that are standard under other operating systems are starting to expect the same from Linux. As such, the demand on Linux developers is also increasing. Luckily, though, Linux has evolved under the close direction of Linus over the years, to accommodate the needs of the masses.

1.2 Hardware Dependency

Linux tries to maintain a neat distinction between hardware-dependent and hardware-independent source code. To that end, both the arch and the include directories include nine subdirectories corresponding to the nine hardware platforms supported. The standard names of the platforms are:

arm
    Acorn personal computers
alpha
    Compaq Alpha workstations
i386
    IBM-compatible personal computers based on Intel 80x86 or Intel 80x86-compatible microprocessors
m68k
    Personal computers based on Motorola MC680x0 microprocessors
mips
    Workstations based on Silicon Graphics MIPS microprocessors
ppc
    Workstations based on Motorola-IBM PowerPC microprocessors
sparc
    Workstations based on Sun Microsystems SPARC microprocessors
sparc64
    Workstations based on Sun Microsystems 64-bit Ultra SPARC microprocessors
s390
    IBM System/390 mainframes

1.3 Linux Versions

Linux distinguishes stable kernels from development kernels through a simple numbering scheme. Each version is characterized by three numbers, separated by periods. The first two numbers are used to identify the version; the third number identifies the release.

As shown in Figure 1-1, if the second number is even, it denotes a stable kernel; otherwise, it denotes a development kernel. At the time of this writing, the current stable version of the Linux kernel is 2.2.14, and the current development version is 2.3.51. The 2.2 kernel, which is the basis for this book, was first released in January 1999, and it differs considerably from the 2.0 kernel, particularly with respect to memory management. Work on the 2.3 development version started in May 1999.

Figure 1-1. Numbering Linux versions

New releases of a stable version come out mostly to fix bugs reported by users. The main algorithms and data structures used to implement the kernel are left unchanged.

Development versions, on the other hand, may differ quite significantly from one another; kernel developers are free to experiment with different solutions that occasionally lead to drastic kernel changes. Users who rely on development versions for running applications may experience unpleasant surprises when upgrading their kernel to a newer release. This book concentrates on the most recent stable kernel that we had available because, among all the new features being tried in experimental kernels, there's no way of telling which will ultimately be accepted and what they'll look like in their final form.

At the time of this writing, Linux 2.4 has not officially come out. We tried to anticipate the forthcoming features and the main kernel changes with respect to the 2.2 version by looking at the Linux 2.3.99-pre8 prerelease. Linux 2.4 inherits a good deal from Linux 2.2: many concepts, design choices, algorithms, and data structures remain the same. For that reason, we conclude each chapter by sketching how Linux 2.4 differs from Linux 2.2 with respect to the topics just discussed. As you'll notice, the new Linux is gleaming and shining; it should appear more appealing to large corporations and, more generally, to the whole business community.
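The numbering rule of Section 1.3 can be made concrete with a short C sketch. The helper name linux_version_is_stable is hypothetical (it is not part of the kernel or of any library); it only applies the even/odd convention described above to a version string, assuming the three-number format used by the kernels discussed in this book.

```c
#include <stdio.h>

/* Hypothetical helper, not a kernel API: applies the even/odd rule from
 * the text to a "major.minor.release" string.
 * Returns 1 for a stable kernel, 0 for a development kernel, and -1 if
 * the string does not start with three numeric fields. */
int linux_version_is_stable(const char *version)
{
    int major, minor, release;

    if (sscanf(version, "%d.%d.%d", &major, &minor, &release) != 3)
        return -1;
    if (major < 0 || minor < 0 || release < 0)
        return -1;

    /* An even second number denotes a stable kernel. */
    return (minor % 2 == 0) ? 1 : 0;
}
```

Under these assumptions, "2.2.14" is classified as stable and "2.3.51" as a development kernel; a prerelease suffix such as the one in "2.3.99-pre8" is simply ignored by the parser.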


1.4 Basic Operating System Concepts

Any computer system includes a basic set of programs called the operating system. The most important program in the set is called the kernel. It is loaded into RAM when the system boots and contains many critical procedures that are needed for the system to operate. The other programs are less crucial utilities; they can provide a wide variety of interactive experiences for the user (as well as doing all the jobs the user bought the computer for), but the essential shape and capabilities of the system are determined by the kernel. The kernel, then, is where we fix our attention in this book. Hence, we'll often use the term "operating system" as a synonym for "kernel."

The operating system must fulfill two main objectives:

• Interact with the hardware components, servicing all low-level programmable elements included in the hardware platform.
• Provide an execution environment to the applications that run on the computer system (the so-called user programs).

Some operating systems allow all user programs to directly play with the hardware components (a typical example is MS-DOS). In contrast, a Unix-like operating system hides all low-level details concerning the physical organization of the computer from applications run by the user. When a program wants to make use of a hardware resource, it must issue a request to the operating system. The kernel evaluates the request and, if it chooses to grant the resource, interacts with the relevant hardware components on behalf of the user program.

In order to enforce this mechanism, modern operating systems rely on the availability of specific hardware features that forbid user programs to directly interact with low-level hardware components or to access arbitrary memory locations. In particular, the hardware introduces at least two different execution modes for the CPU: a nonprivileged mode for user programs and a privileged mode for the kernel. Unix calls these User Mode and Kernel Mode, respectively.

In the rest of this chapter, we introduce the basic concepts that have motivated the design of Unix over the past two decades, as well as Linux and other operating systems. While the concepts are probably familiar to you as a Linux user, these sections try to delve into them a bit more deeply than usual to explain the requirements they place on an operating system kernel. These broad considerations refer to Unix-like systems, thus also to Linux. The other chapters of this book will hopefully help you to understand the Linux kernel internals.

1.4.1 Multiuser Systems

A multiuser system is a computer that is able to concurrently and independently execute several applications belonging to two or more users. "Concurrently" means that applications can be active at the same time and contend for the various resources such as CPU, memory, hard disks, and so on. "Independently" means that each application can perform its task with no concern for what the applications of the other users are doing. Switching from one application to another, of course, slows down each of them and affects the response time seen by the users. Many of the complexities of modern operating system kernels, which we will examine in this book, are present to minimize the delays imposed on each program and to provide the user with responses that are as fast as possible.


Multiuser operating systems must include several features:

• An authentication mechanism for verifying the user identity
• A protection mechanism against buggy user programs that could block other applications running in the system
• A protection mechanism against malicious user programs that could interfere with, or spy on, the activity of other users
• An accounting mechanism that limits the amount of resource units assigned to each user

In order to ensure safe protection mechanisms, operating systems must make use of the hardware protection associated with the CPU privileged mode. Otherwise, a user program would be able to directly access the system circuitry and overcome the imposed bounds. Unix is a multiuser system that enforces the hardware protection of system resources.

1.4.2 Users and Groups

In a multiuser system, each user has a private space on the machine: typically, he owns some quota of the disk space to store files, receives private mail messages, and so on. The operating system must ensure that the private portion of a user space is visible only to its owner. In particular, it must ensure that no user can exploit a system application for the purpose of violating the private space of another user.

All users are identified by a unique number called the User ID, or UID. Usually only a restricted number of persons are allowed to make use of a computer system. When one of these users starts a working session, the operating system asks for a login name and a password. If the user does not input a valid pair, the system denies access. Since the password is assumed to be secret, the user's privacy is ensured.

In order to selectively share material with other users, each user is a member of one or more groups, which are identified by a unique number called a Group ID, or GID. Each file is also associated with exactly one group. For example, access could be set so that the user owning the file has read and write privileges, the group has read-only privileges, and other users on the system are denied access to the file.

Any Unix-like operating system has a special user called root, superuser, or supervisor. The system administrator must log in as root in order to handle user accounts, perform maintenance tasks like system backups and program upgrades, and so on. The root user can do almost everything, since the operating system does not apply the usual protection mechanisms to her. In particular, the root user can access every file on the system and can interfere with the activity of every running user program.

1.4.3 Processes

All operating systems make use of one fundamental abstraction: the process. A process can be defined either as "an instance of a program in execution," or as the "execution context" of a running program. In traditional operating systems, a process executes a single sequence of instructions in an address space; the address space is the set of memory addresses that the process is allowed to reference. Modern operating systems allow processes with multiple execution flows, that is, multiple sequences of instructions executed in the same address space.

Multiuser systems must enforce an execution environment in which several processes can be active concurrently and contend for system resources, mainly the CPU. Systems that allow concurrent active processes are said to be multiprogramming or multiprocessing. [4] It is important to distinguish programs from processes: several processes can execute the same program concurrently, while the same process can execute several programs sequentially.

[4] Some multiprocessing operating systems are not multiuser; an example is Microsoft's Windows 98.

On uniprocessor systems, just one process can hold the CPU, and hence just one execution flow can progress at a time. In general, the number of CPUs is always restricted, and therefore only a few processes can progress at the same time. The choice of the process that can progress is left to an operating system component called the scheduler. Some operating systems allow only nonpreemptive processes, which means that the scheduler is invoked only when a process voluntarily relinquishes the CPU. But processes of a multiuser system must be preemptive; the operating system tracks how long each process holds the CPU and periodically activates the scheduler.

Unix is a multiprocessing operating system with preemptive processes. Indeed, the process abstraction is really fundamental in all Unix systems. Even when no user is logged in and no application is running, several system processes monitor the peripheral devices. In particular, several processes listen at the system terminals waiting for user logins. When a user inputs a login name, the listening process runs a program that validates the user password. If the user identity is acknowledged, the process creates another process that runs a shell into which commands are entered. When a graphical display is activated, one process runs the window manager, and each window on the display is usually run by a separate process. When a user creates a graphics shell, one process runs the graphics windows, and a second process runs the shell into which the user can enter the commands. For each user command, the shell process creates another process that executes the corresponding program.

Unix-like operating systems adopt a process/kernel model. Each process has the illusion that it's the only process on the machine and it has exclusive access to the operating system services. Whenever a process makes a system call (i.e., a request to the kernel), the hardware changes the privilege mode from User Mode to Kernel Mode, and the process starts the execution of a kernel procedure with a strictly limited purpose. In this way, the operating system acts within the execution context of the process in order to satisfy its request. Whenever the request is fully satisfied, the kernel procedure forces the hardware to return to User Mode and the process continues its execution from the instruction following the system call.

1.4.4 Kernel Architecture

As stated before, most Unix kernels are monolithic: each kernel layer is integrated into the whole kernel program and runs in Kernel Mode on behalf of the current process. In contrast, microkernel operating systems demand a very small set of functions from the kernel, generally including a few synchronization primitives, a simple scheduler, and an interprocess communication mechanism. Several system processes that run on top of the microkernel implement other operating system-layer functions, like memory allocators, device drivers, system call handlers, and so on.


Although academic research on operating systems is oriented toward microkernels, such operating systems are generally slower than monolithic ones, since the explicit message passing between the different layers of the operating system has a cost. However, microkernel operating systems might have some theoretical advantages over monolithic ones. Microkernels force the system programmers to adopt a modularized approach, since any operating system layer is a relatively independent program that must interact with the other layers through well-defined and clean software interfaces. Moreover, an existing microkernel operating system can be fairly easily ported to other architectures, since all hardware-dependent components are generally encapsulated in the microkernel code. Finally, microkernel operating systems tend to make better use of random access memory (RAM) than monolithic ones, since system processes that aren't implementing needed functionalities might be swapped out or destroyed.

Modules are a kernel feature that effectively achieves many of the theoretical advantages of microkernels without introducing performance penalties. A module is an object file whose code can be linked to (and unlinked from) the kernel at runtime. The object code usually consists of a set of functions that implements a filesystem, a device driver, or other features at the kernel's upper layer. The module, unlike the external layers of microkernel operating systems, does not run as a specific process. Instead, it is executed in Kernel Mode on behalf of the current process, like any other statically linked kernel function.

The main advantages of using modules include:

Modularized approach
    Since any module can be linked and unlinked at runtime, system programmers must introduce well-defined software interfaces to access the data structures handled by modules. This makes it easy to develop new modules.

Platform independence
    Even if it may rely on some specific hardware features, a module doesn't depend on a fixed hardware platform. For example, a disk driver module that relies on the SCSI standard works as well on an IBM-compatible PC as it does on Compaq's Alpha.

Frugal main memory usage
    A module can be linked to the running kernel when its functionality is required and unlinked when it is no longer useful. This mechanism also can be made transparent to the user, since linking and unlinking can be performed automatically by the kernel.

No performance penalty
    Once linked in, the object code of a module is equivalent to the object code of the statically linked kernel. Therefore, no explicit message passing is required when the functions of the module are invoked. [5]

[5] A small performance penalty occurs when the module is linked and when it is unlinked. However, this penalty can be compared to the penalty caused by the creation and deletion of system processes in microkernel operating systems.


1.5 An Overview of the Unix Filesystem

The Unix operating system design is centered on its filesystem, which has several interesting characteristics. We'll review the most significant ones, since they will be mentioned quite often in forthcoming chapters.

1.5.1 Files

A Unix file is an information container structured as a sequence of bytes; the kernel does not interpret the contents of a file. Many programming libraries implement higher-level abstractions, such as records structured into fields and record addressing based on keys. However, the programs in these libraries must rely on system calls offered by the kernel. From the user's point of view, files are organized in a tree-structured name space as shown in Figure 1-2.

Figure 1-2. An example of a directory tree

All the nodes of the tree, except the leaves, denote directory names. A directory node contains information about the files and directories just beneath it. A file or directory name consists of a sequence of arbitrary ASCII characters, [6] with the exception of / and of the null character \0. Most filesystems place a limit on the length of a filename, typically no more than 255 characters. The directory corresponding to the root of the tree is called the root directory. By convention, its name is a slash (/). Names must be different within the same directory, but the same name may be used in different directories.

[6] Some operating systems allow filenames to be expressed in many different alphabets, based on 16-bit extended coding of graphical characters such as Unicode.

Unix associates a current working directory with each process (see Section 1.6.1 later in this chapter); it belongs to the process execution context, and it identifies the directory currently used by the process. In order to identify a specific file, the process uses a pathname, which consists of slashes alternating with a sequence of directory names that lead to the file. If the first item in the pathname is a slash, the pathname is said to be absolute, since its starting point is the root directory. Otherwise, if the first item is a directory name or filename, the pathname is said to be relative, since its starting point is the process's current directory.

While specifying filenames, the notations "." and ".." are also used. They denote the current working directory and its parent directory, respectively. If the current working directory is the root directory, "." and ".." coincide.
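The pathname rules above can be checked from a user program. The following is a minimal sketch; the function names path_is_absolute and same_file are hypothetical helpers introduced only for illustration. The second helper uses the standard stat() call and compares the device ID and inode number of two pathnames, the pair of values that identifies a file on a POSIX system.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: a pathname is absolute when its first item is a
 * slash, so that resolution starts from the root directory. */
int path_is_absolute(const char *path)
{
    return path[0] == '/';
}

/* Hypothetical helper: returns 1 if both pathnames lead to the same file
 * (same device ID and inode number), 0 if they differ, -1 on error. */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

With these helpers, same_file("/.", "/..") illustrates the last remark: in the root directory, "." and ".." name the same directory.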


1.5.2 Hard and Soft Links

A filename included in a directory is called a file hard link, or more simply a link. The same file may have several links included in the same directory or in different ones, thus several filenames.

The Unix command:

$ ln f1 f2

is used to create a new hard link that has the pathname f2 for a file identified by the pathname f1.

Hard links have two limitations:

• Users are not allowed to create hard links for directories. This might transform the directory tree into a graph with cycles, thus making it impossible to locate a file according to its name.
• Links can be created only among files included in the same filesystem. This is a serious limitation since modern Unix systems may include several filesystems located on different disks and/or partitions, and users may be unaware of the physical divisions between them.

In order to overcome these limitations, soft links (also called symbolic links) have been introduced. Symbolic links are short files that contain an arbitrary pathname of another file. The pathname may refer to any file located in any filesystem; it may even refer to a nonexistent file.

The Unix command:

$ ln -s f1 f2

creates a new soft link with pathname f2 that refers to pathname f1. When this command is executed, the filesystem creates a soft link and writes into it the f1 pathname. It then inserts, in the proper directory, a new entry containing the last name of the f2 pathname. In this way, any reference to f2 can be translated automatically into a reference to f1.

1.5.3 File Types

Unix files may have one of the following types:

• Regular file
• Directory
• Symbolic link
• Block-oriented device file
• Character-oriented device file
• Pipe and named pipe (also called FIFO)
• Socket


The first three file types are constituents of any Unix filesystem. Their implementation will be described in detail in Chapter 17.

Device files are related to I/O devices and device drivers integrated into the kernel. For example, when a program accesses a device file, it acts directly on the I/O device associated with that file (see Chapter 13).

Pipes and sockets are special files used for interprocess communication (see Section 1.6.5 later in this chapter and Chapter 18).

1.5.4 File Descriptor and Inode

Unix makes a clear distinction between a file and a file descriptor. With the exception of device and special files, each file consists of a sequence of characters. The file does not include any control information such as its length, or an End-Of-File (EOF) delimiter.

All information needed by the filesystem to handle a file is included in a data structure called an inode. Each file has its own inode, which the filesystem uses to identify the file.

While filesystems and the kernel functions handling them can vary widely from one Unix system to another, they must always provide at least the following attributes, which are specified in the POSIX standard:

• File type (see previous section)
• Number of hard links associated with the file
• File length in bytes
• Device ID (i.e., an identifier of the device containing the file)
• Inode number that identifies the file within the filesystem
• User ID of the file owner
• Group ID of the file
• Several timestamps that specify the inode status change time, the last access time, and the last modify time
• Access rights and file mode (see next section)

1.5.5 Access Rights and File Mode

The potential users of a file fall into three classes:

• The user who is the owner of the file
• The users who belong to the same group as the file, not including the owner
• All remaining users (others)

There are three types of access rights, Read, Write, and Execute, for each of these three classes. Thus, the set of access rights associated with a file consists of nine different binary flags. Three additional flags, called suid (Set User ID), sgid (Set Group ID), and sticky define the file mode. These flags have the following meanings when applied to executable files:


suid
    A process executing a file normally keeps the User ID (UID) of the process owner. However, if the executable file has the suid flag set, the process gets the UID of the file owner.

sgid
    A process executing a file keeps the Group ID (GID) of the process group. However, if the executable file has the sgid flag set, the process gets the ID of the file group.

sticky
    An executable file with the sticky flag set corresponds to a request to the kernel to keep the program in memory after its execution terminates. [7]

[7] This flag has become obsolete; other approaches based on sharing of code pages are now used (see Chapter 7).

When a file is created by a process, its owner ID is the UID of the process. Its owner group ID can be either the GID of the creator process or the GID of the parent directory, depending on the value of the sgid flag of the parent directory.

1.5.6 File-Handling System Calls

When a user accesses the contents of either a regular file or a directory, he actually accesses some data stored in a hardware block device. In this sense, a filesystem is a user-level view of the physical organization of a hard disk partition. Since a process in User Mode cannot directly interact with the low-level hardware components, each actual file operation must be performed in Kernel Mode.

Therefore, the Unix operating system defines several system calls related to file handling. Whenever a process wants to perform some operation on a specific file, it uses the proper system call and passes the file pathname as a parameter.

All Unix kernels devote great attention to the efficient handling of hardware block devices in order to achieve good overall system performance. In the chapters that follow, we will describe topics related to file handling in Linux and specifically how the kernel reacts to file-related system calls. In order to understand those descriptions, you will need to know how the main file-handling system calls are used; they are described in the next section.

1.5.6.1 Opening a file

Processes can access only "opened" files. In order to open a file, the process invokes the system call:

fd = open(path, flag, mode)

The three parameters have the following meanings:


path
    Denotes the pathname (relative or absolute) of the file to be opened.

flag
    Specifies how the file must be opened (e.g., read, write, read/write, append). It can also specify whether a nonexisting file should be created.

mode
    Specifies the access rights of a newly created file.

This system call creates an "open file" object and returns an identifier called a file descriptor. An open file object contains:

• Some file-handling data structures, like a pointer to the kernel buffer memory area where file data will be copied; an offset field that denotes the current position in the file from which the next operation will take place (the so-called file pointer); and so on.
• Some pointers to kernel functions that the process is enabled to invoke. The set of permitted functions depends on the value of the flag parameter.

We'll discuss open file objects in detail in Chapter 12. Let's limit ourselves here to describing some general properties specified by the POSIX semantics:

• A file descriptor represents an interaction between a process and an opened file, while an open file object contains data related to that interaction. The same open file object may be identified by several file descriptors.
• Several processes may concurrently open the same file. In this case, the filesystem assigns a separate file descriptor to each process, along with a separate open file object. When this occurs, the Unix filesystem does not provide any kind of synchronization among the I/O operations issued by the processes on the same file. However, several system calls such as flock( ) are available to allow processes to synchronize themselves on the entire file or on portions of it (see Chapter 12).

In order to create a new file, the process may also invoke the creat( ) system call, which is handled by the kernel exactly like open( ).

1.5.6.2 Accessing an opened file

Regular Unix files can be addressed either sequentially or randomly, while device files and named pipes are usually accessed sequentially (see Chapter 13). In both kinds of access, the kernel stores the file pointer in the open file object, that is, the current position at which the next read or write operation will take place.

Sequential access is implicitly assumed: the read( ) and write( ) system calls always refer to the position of the current file pointer. In order to modify the value, a program must explicitly invoke the lseek( ) system call. When a file is opened, the kernel sets the file pointer to the position of the first byte in the file (offset 0).
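The following C sketch illustrates two points made above: the roles of the flag and mode parameters of open( ), and the fact that each open( ) on the same file yields a separate open file object with its own file pointer. The function names create_and_write and independent_offsets are hypothetical, chosen for this example; the system calls themselves are the standard POSIX ones, and the single-process setup is an assumption made to keep the example short.

```c
#define _XOPEN_SOURCE 700   /* request POSIX.1-2008 declarations */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) a file and write a string.
 * Returns 0 on success, -1 on failure. */
int create_and_write(const char *path, const char *text)
{
    size_t len = strlen(text);

    /* flag: write-only, create the file if missing, truncate it otherwise.
     * mode: access rights of a newly created file (rw-r--r--), further
     * restricted by the umask of the process. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, text, len);
    if (close(fd) < 0 || written != (ssize_t)len)
        return -1;
    return 0;
}

/* Hypothetical helper: open the same file twice, read three bytes through
 * the first descriptor only, and report both file pointers.
 * Returns 0 on success, -1 on failure. */
int independent_offsets(const char *path, off_t *pos1, off_t *pos2)
{
    char buf[3];
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);
    int ok = fd1 >= 0 && fd2 >= 0 &&
             read(fd1, buf, sizeof buf) == (ssize_t)sizeof buf;

    if (ok) {
        /* lseek() with offset 0 from the current position just reports
         * the file pointer stored in each open file object. */
        *pos1 = lseek(fd1, 0, SEEK_CUR);
        *pos2 = lseek(fd2, 0, SEEK_CUR);
    }
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return ok ? 0 : -1;
}
```

For a file of at least three bytes, the first file pointer ends up at offset 3 while the second is still at offset 0, since reading through one descriptor does not move the file pointer of the other open file object.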


The lseek( ) system call requires the following parameters:

    newoffset = lseek(fd, offset, whence);

which have the following meanings:

fd
    Indicates the file descriptor of the opened file

offset
    Specifies a signed integer value that will be used for computing the new position of the file pointer

whence
    Specifies whether the new position should be computed by adding the offset value to the number 0 (offset from the beginning of the file), the current file pointer, or the position of the last byte (offset from the end of the file)

The read( ) system call requires the following parameters:

    nread = read(fd, buf, count);

which have the following meanings:

fd
    Indicates the file descriptor of the opened file

buf
    Specifies the address of the buffer in the process's address space to which the data will be transferred

count
    Denotes the number of bytes to be read

When handling such a system call, the kernel attempts to read count bytes from the file having the file descriptor fd, starting from the current value of the opened file's offset field. In some cases (end-of-file, empty pipe, and so on) the kernel does not succeed in reading all count bytes. The returned nread value specifies the number of bytes actually read. The file pointer is also updated by adding nread to its previous value. The write( ) parameters are similar.
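The interplay of open( ), write( ), lseek( ), and read( ) described above can be sketched in user-space C. This is only an illustration under stated assumptions: the helper name write_then_read_at and the sample data are hypothetical, and error handling is reduced to returning -1.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes the ten bytes "0123456789" to `path`, moves the
 * file pointer to `offset`, and reads up to `count` bytes into `buf`.
 * Returns the number of bytes read, or -1 on error. */
ssize_t write_then_read_at(const char *path, off_t offset,
                           char *buf, size_t count)
{
    static const char data[] = "0123456789";
    const size_t len = sizeof data - 1;          /* exclude trailing NUL */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() advances the file pointer by the number of bytes written */
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* SEEK_SET computes the new position from the beginning of the file */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        close(fd);
        return -1;
    }

    /* read() may return fewer than `count` bytes, e.g. at end-of-file */
    ssize_t nread = read(fd, buf, count);
    close(fd);
    return nread;
}
```

Calling the helper with offset 8 and count 10 on the ten-byte file returns 2, because only two bytes remain between the file pointer and the end of the file.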


1.5.6.3 Closing a file

When a process does not need to access the contents of a file anymore, it can invoke the system call:

    res = close(fd);

which releases the open file object corresponding to the file descriptor fd. When a process terminates, the kernel closes all its still opened files.

1.5.6.4 Renaming and deleting a file

In order to rename or delete a file, a process does not need to open it. Indeed, such operations do not act on the contents of the affected file, but rather on the contents of one or more directories. For example, the system call:

    res = rename(oldpath, newpath);

changes the name of a file link, while the system call:

    res = unlink(pathname);

decrements the file link count and removes the corresponding directory entry. The file is deleted only when the link count assumes the value 0.

1.6 An Overview of Unix Kernels

Unix kernels provide an execution environment in which applications may run. Therefore, the kernel must implement a set of services and corresponding interfaces. Applications use those interfaces and do not usually interact directly with hardware resources.

1.6.1 The Process/Kernel Model

As already mentioned, a CPU can run either in User Mode or in Kernel Mode. Actually, some CPUs can have more than two execution states.
For instance, the Intel 80x86 microprocessors have four different execution states. But all standard Unix kernels make use of only Kernel Mode and User Mode.

When a program is executed in User Mode, it cannot directly access the kernel data structures or the kernel programs. When an application executes in Kernel Mode, however, these restrictions no longer apply. Each CPU model provides special instructions to switch from User Mode to Kernel Mode and vice versa. A program executes most of the time in User Mode and switches to Kernel Mode only when requesting a service provided by the kernel. When the kernel has satisfied the program's request, it puts the program back in User Mode.

Processes are dynamic entities that usually have a limited life span within the system. The task of creating, eliminating, and synchronizing the existing processes is delegated to a group of routines in the kernel.

The kernel itself is not a process but a process manager. The process/kernel model assumes that processes that require a kernel service make use of specific programming constructs


called system calls. Each system call sets up the group of parameters that identifies the process request and then executes the hardware-dependent CPU instruction to switch from User Mode to Kernel Mode.

Besides user processes, Unix systems include a few privileged processes called kernel threads with the following characteristics:

• They run in Kernel Mode in the kernel address space.
• They do not interact with users, and thus do not require terminal devices.
• They are usually created during system startup and remain alive until the system is shut down.

Notice how the process/kernel model is somewhat orthogonal to the CPU state: on a uniprocessor system, only one process is running at any time and it may run either in User or in Kernel Mode. If it runs in Kernel Mode, the processor is executing some kernel routine.

Figure 1-3 illustrates examples of transitions between User and Kernel Mode. Process 1 in User Mode issues a system call, after which the process switches to Kernel Mode and the system call is serviced. Process 1 then resumes execution in User Mode until a timer interrupt occurs and the scheduler is activated in Kernel Mode.
A process switch takes place, and Process 2 starts its execution in User Mode until a hardware device raises an interrupt. As a consequence of the interrupt, Process 2 switches to Kernel Mode and services the interrupt.

Figure 1-3. Transitions between User and Kernel Mode

Unix kernels do much more than handle system calls; in fact, kernel routines can be activated in several ways:

• A process invokes a system call.
• The CPU executing the process signals an exception, which is some unusual condition such as an invalid instruction. The kernel handles the exception on behalf of the process that caused it.
• A peripheral device issues an interrupt signal to the CPU to notify it of an event such as a request for attention, a status change, or the completion of an I/O operation. Each interrupt signal is dealt with by a kernel program called an interrupt handler. Since peripheral devices operate asynchronously with respect to the CPU, interrupts occur at unpredictable times.
• A kernel thread is executed; since it runs in Kernel Mode, the corresponding program must be considered part of the kernel, albeit encapsulated in a process.
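The first activation path in the list above, a process invoking a system call, can be observed from User Mode. The following is a minimal sketch that assumes a Linux system whose C library provides the generic syscall( ) wrapper and the SYS_getpid constant; the function name is hypothetical. It issues the same kernel request twice, once through the raw entry point and once through the ordinary library wrapper.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* exposes syscall() in strict compiler modes */
#endif
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns nonzero if the raw system call and the
 * library wrapper report the same process ID. */
int raw_getpid_matches_wrapper(void)
{
    /* syscall() loads the system call number and switches to Kernel Mode */
    long raw = syscall(SYS_getpid);

    /* getpid() is the usual C library interface to the same kernel service */
    pid_t wrapped = getpid();

    return raw == (long)wrapped;
}
```

Both calls end up in the same kernel routine; the wrapper only hides the hardware-dependent instruction that performs the mode switch.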


1.6.2 Process Implementation

To let the kernel manage processes, each process is represented by a process descriptor that includes information about the current state of the process.

When the kernel stops the execution of a process, it saves the current contents of several processor registers in the process descriptor. These include:

• The program counter (PC) and stack pointer (SP) registers
• The general-purpose registers
• The floating point registers
• The processor control registers (Processor Status Word) containing information about the CPU state
• The memory management registers used to keep track of the RAM accessed by the process

When the kernel decides to resume executing a process, it uses the proper process descriptor fields to load the CPU registers. Since the stored value of the program counter points to the instruction following the last instruction executed, the process resumes execution from where it was stopped.

When a process is not executing on the CPU, it is waiting for some event.
Unix kernels distinguish many wait states, which are usually implemented by queues of process descriptors; each (possibly empty) queue corresponds to the set of processes waiting for a specific event.

1.6.3 Reentrant Kernels

All Unix kernels are reentrant: this means that several processes may be executing in Kernel Mode at the same time. Of course, on uniprocessor systems only one process can progress, but many of them can be blocked in Kernel Mode waiting for the CPU or the completion of some I/O operation. For instance, after issuing a read to a disk on behalf of some process, the kernel will let the disk controller handle it and will resume executing other processes. An interrupt notifies the kernel when the device has satisfied the read, so the former process can resume the execution.

One way to provide reentrancy is to write functions so that they modify only local variables and do not alter global data structures. Such functions are called reentrant functions. But a reentrant kernel is not limited just to such reentrant functions (although that is how some real-time kernels are implemented). Instead, the kernel can include nonreentrant functions and use locking mechanisms to ensure that only one process can execute a nonreentrant function at a time.
Every process in Kernel Mode acts on its own set of memory locations and cannot interfere with the others.

If a hardware interrupt occurs, a reentrant kernel is able to suspend the current running process even if that process is in Kernel Mode. This capability is very important, since it improves the throughput of the device controllers that issue interrupts. Once a device has issued an interrupt, it waits until the CPU acknowledges it. If the kernel is able to answer quickly, the device controller will be able to perform other tasks while the CPU handles the interrupt.
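The distinction drawn in this section between reentrant and nonreentrant functions can be illustrated outside the kernel. The two functions below are hypothetical user-space examples, not kernel code: the first stores its result in a single static buffer shared by every caller, while the second touches only its arguments and local state.

```c
#include <stddef.h>
#include <stdio.h>

/* Nonreentrant: every caller shares the same static buffer, so a second
 * call overwrites the result that the first caller is still using. */
const char *label_static(int id)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "item-%d", id);
    return buf;
}

/* Reentrant: the caller supplies the storage, so concurrent or interleaved
 * invocations cannot interfere with each other. */
char *label_reentrant(int id, char *buf, size_t len)
{
    snprintf(buf, len, "item-%d", id);
    return buf;
}
```

If two control paths interleave calls to label_static( ), both end up looking at whichever string was written last. A kernel that wants to keep such a function must protect it with one of the locking mechanisms discussed later in this chapter.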


Now let's look at kernel reentrancy and its impact on the organization of the kernel. A kernel control path denotes the sequence of instructions executed by the kernel to handle a system call, an exception, or an interrupt.

In the simplest case, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves the kernel control paths:

• A process executing in User Mode invokes a system call and the corresponding kernel control path verifies that the request cannot be satisfied immediately; it then invokes the scheduler to select a new process to run. As a result, a process switch occurs. The first kernel control path is left unfinished and the CPU resumes the execution of some other kernel control path. In this case, the two control paths are executed on behalf of two different processes.
• The CPU detects an exception (for example, an access to a page not present in RAM) while running a kernel control path. The first control path is suspended, and the CPU starts the execution of a suitable procedure. In our example, this type of procedure could allocate a new page for the process and read its contents from disk. When the procedure terminates, the first control path can be resumed. In this case, the two control paths are executed on behalf of the same process.
• A hardware interrupt occurs while the CPU is running a kernel control path with the interrupts enabled.
The first kernel control path is left unfinished and the CPU starts processing another kernel control path to handle the interrupt. The first kernel control path resumes when the interrupt handler terminates. In this case the two kernel control paths run in the execution context of the same process and the total elapsed system time is accounted to it. However, the interrupt handler doesn't necessarily operate on behalf of the process.

Figure 1-4 illustrates a few examples of noninterleaved and interleaved kernel control paths. Three different CPU states are considered:

• Running a process in User Mode (User)
• Running an exception or a system call handler (Excp)
• Running an interrupt handler (Intr)

Figure 1-4. Interleaving of kernel control paths


1.6.4 Process Address Space

Each process runs in its private address space. A process running in User Mode refers to private stack, data, and code areas. When running in Kernel Mode, the process addresses the kernel data and code area and makes use of another stack.

Since the kernel is reentrant, several kernel control paths, each related to a different process, may be executed in turn. In this case, each kernel control path refers to its own private kernel stack.

While it appears to each process that it has access to a private address space, there are times when part of the address space is shared among processes. In some cases this sharing is explicitly requested by processes; in others it is done automatically by the kernel to reduce memory usage.

If the same program, say an editor, is needed simultaneously by several users, the program will be loaded into memory only once, and its instructions can be shared by all of the users who need it.
Its data, of course, must not be shared, because each user will have separate data. This kind of address space sharing is done automatically by the kernel to save memory.

Processes can also share parts of their address space as a kind of interprocess communication, using the "shared memory" technique introduced in System V and supported by Linux.

Finally, Linux supports the mmap( ) system call, which allows part of a file or the memory residing on a device to be mapped into a part of a process address space. Memory mapping can provide an alternative to normal reads and writes for transferring data. If the same file is shared by several processes, its memory mapping is included in the address space of each of the processes that share it.

1.6.5 Synchronization and Critical Regions

Implementing a reentrant kernel requires the use of synchronization: if a kernel control path is suspended while acting on a kernel data structure, no other kernel control path will be allowed to act on the same data structure unless it has been reset to a consistent state. Otherwise, the interaction of the two control paths could corrupt the stored information.

For example, let's suppose that a global variable V contains the number of available items of some system resource. A first kernel control path A reads the variable and determines that there is just one available item. At this point, another kernel control path B is activated and reads the same variable, which still contains the value 1. Thus, B decrements V and starts using the resource item.
Then A resumes the execution; because it has already read the value of V, it assumes that it can decrement V and take the resource item, which B already uses. As a final result, V contains -1, and two kernel control paths are using the same resource item with potentially disastrous effects.

When the outcome of some computation depends on how two or more processes are scheduled, the code is incorrect: we say that there is a race condition.

In general, safe access to a global variable is ensured by using atomic operations. In the previous example, data corruption would not be possible if the two control paths read and


decrement V with a single, noninterruptible operation. However, kernels contain many data structures that cannot be accessed with a single operation. For example, it usually isn't possible to remove an element from a linked list with a single operation, because the kernel needs to access at least two pointers at once. Any section of code that should be finished by each process that begins it before another process can enter it is called a critical region. [8]

[8] Synchronization problems have been fully described in other works; we refer the interested reader to books on the Unix operating systems (see the bibliography near the end of the book).

These problems occur not only among kernel control paths but also among processes sharing common data. Several synchronization techniques have been adopted. The following section will concentrate on how to synchronize kernel control paths.

1.6.5.1 Nonpreemptive kernels

In search of a drastically simple solution to synchronization problems, most traditional Unix kernels are nonpreemptive: when a process executes in Kernel Mode, it cannot be arbitrarily suspended and substituted with another process. Therefore, on a uniprocessor system all kernel data structures that are not updated by interrupts or exception handlers are safe for the kernel to access.

Of course, a process in Kernel Mode can voluntarily relinquish the CPU, but in this case it must ensure that all data structures are left in a consistent state.
Moreover, when it resumes its execution, it must recheck the value of any previously accessed data structures that could be changed.

Nonpreemptability is ineffective in multiprocessor systems, since two kernel control paths running on different CPUs could concurrently access the same data structure.

1.6.5.2 Interrupt disabling

Another synchronization mechanism for uniprocessor systems consists of disabling all hardware interrupts before entering a critical region and reenabling them right after leaving it. This mechanism, while simple, is far from optimal. If the critical region is large, interrupts can remain disabled for a relatively long time, potentially causing all hardware activities to freeze.

Moreover, on a multiprocessor system this mechanism doesn't work at all. There is no way to ensure that no other CPU can access the same data structures updated in the protected critical region.

1.6.5.3 Semaphores

A widely used mechanism, effective in both uniprocessor and multiprocessor systems, relies on the use of semaphores. A semaphore is simply a counter associated with a data structure; the semaphore is checked by all kernel threads before they try to access the data structure. Each semaphore may be viewed as an object composed of:

• An integer variable
• A list of waiting processes
• Two atomic methods: down( ) and up( )


The down( ) method decrements the value of the semaphore. If the new value is less than 0, the method adds the running process to the semaphore list and then blocks (i.e., invokes the scheduler). The up( ) method increments the value of the semaphore and, if its new value is greater than or equal to 0, reactivates one or more processes in the semaphore list.

Each data structure to be protected has its own semaphore, which is initialized to 1. When a kernel control path wishes to access the data structure, it executes the down( ) method on the proper semaphore. If the new value of the semaphore isn't negative, access to the data structure is granted. Otherwise, the process that is executing the kernel control path is added to the semaphore list and blocked. When another process executes the up( ) method on that semaphore, one of the processes in the semaphore list is allowed to proceed.

1.6.5.4 Spin locks

In multiprocessor systems, semaphores are not always the best solution to the synchronization problems. Some kernel data structures should be protected from being concurrently accessed by kernel control paths that run on different CPUs. In this case, if the time required to update the data structure is short, a semaphore could be very inefficient. To check a semaphore, the kernel must insert a process in the semaphore list and then suspend it.
Since both operations are relatively expensive, in the time it takes to complete them, the other kernel control path could have already released the semaphore.

In these cases, multiprocessor operating systems make use of spin locks. A spin lock is very similar to a semaphore, but it has no process list: when a process finds the lock closed by another process, it "spins" around repeatedly, executing a tight instruction loop until the lock becomes open.

Of course, spin locks are useless in a uniprocessor environment. When a kernel control path tries to access a locked data structure, it starts an endless loop. Therefore, the kernel control path that is updating the protected data structure would not have a chance to continue the execution and release the spin lock. The final result is that the system hangs.

1.6.5.5 Avoiding deadlocks

Processes or kernel control paths that synchronize with other control paths may easily enter a deadlocked state. The simplest case of deadlock occurs when process p1 gains access to data structure a and process p2 gains access to b, but p1 then waits for b and p2 waits for a. Other more complex cyclic waits among groups of processes may also occur. Of course, a deadlock condition causes a complete freeze of the affected processes or kernel control paths.

As far as kernel design is concerned, deadlock becomes an issue when the number of kernel semaphore types used is high.
In this case, it may be quite difficult to ensure that no deadlock state will ever be reached for all possible ways to interleave kernel control paths. Several operating systems, including Linux, avoid this problem by introducing a very limited number of semaphore types and by requesting semaphores in ascending order.
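The idea of requesting locks in a fixed ascending order can be sketched in user space. This is not how the Linux kernel implements it: the example swaps kernel semaphores for POSIX mutexes, and it assumes that "ascending" means ascending memory address, which is only one possible ordering convention. The function names lock_pair and unlock_pair are hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

/* Locks two mutexes in ascending address order, whatever order the caller
 * names them in. Returns 0 on success or a pthread error code. */
int lock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    if ((uintptr_t)x > (uintptr_t)y) {       /* normalize the order */
        pthread_mutex_t *tmp = x;
        x = y;
        y = tmp;
    }

    int err = pthread_mutex_lock(x);
    if (err != 0)
        return err;

    if (y != x) {                             /* same lock named twice */
        err = pthread_mutex_lock(y);
        if (err != 0)
            pthread_mutex_unlock(x);          /* back out on failure */
    }
    return err;
}

/* Releases both mutexes; release order does not affect deadlock freedom. */
void unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    pthread_mutex_unlock(x);
    if (y != x)
        pthread_mutex_unlock(y);
}
```

With this convention, the p1/p2 scenario described above cannot arise: whichever of a and b comes first in the ordering is always requested first, so no process can hold the second lock while waiting for the first.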


1.6.6 Signals and Interprocess Communication

Unix signals provide a mechanism for notifying processes of system events. Each event has its own signal number, which is usually referred to by a symbolic constant such as SIGTERM. There are two kinds of system events:

Asynchronous notifications
    For instance, a user can send the interrupt signal SIGINT to a foreground process by pressing the interrupt keycode (usually, CTRL-C) at the terminal.

Synchronous errors or exceptions
    For instance, the kernel sends the signal SIGSEGV to a process when it accesses a memory location at an illegal address.

The POSIX standard defines about 20 different signals, two of which are user-definable and may be used as a primitive mechanism for communication and synchronization among processes in User Mode. In general, a process may react to receiving a signal in two possible ways:

• Ignore the signal.
• Asynchronously execute a specified procedure (the signal handler).

If the process does not specify one of these alternatives, the kernel performs a default action that depends on the signal number.
The five possible default actions are:

• Terminate the process.
• Write the execution context and the contents of the address space in a file (core dump) and terminate the process.
• Ignore the signal.
• Suspend the process.
• Resume the process's execution, if it was stopped.

Kernel signal handling is rather elaborate since the POSIX semantics allows processes to temporarily block signals. Moreover, a few signals such as SIGKILL cannot be directly handled by the process and cannot be ignored.

AT&T's Unix System V introduced other kinds of interprocess communication among processes in User Mode, which have been adopted by many Unix kernels: semaphores, message queues, and shared memory. They are collectively known as System V IPC.

The kernel implements these constructs as IPC resources: a process acquires a resource by invoking a shmget( ), semget( ), or msgget( ) system call. Just like files, IPC resources are persistent: they must be explicitly deallocated by the creator process, by the current owner, or by a superuser process.

Semaphores are similar to those described in Section 1.6.5 earlier in this chapter, except that they are reserved for processes in User Mode. Message queues allow processes to exchange


messages by making use of the msgsnd( ) and msgrcv( ) system calls, which respectively insert a message into a specific message queue and extract a message from it.

Shared memory provides the fastest way for processes to exchange and share data. A process starts by issuing a shmget( ) system call to create a new shared memory region of the required size. After obtaining the IPC resource identifier, the process invokes the shmat( ) system call, which returns the starting address of the new region within the process address space. When the process wishes to detach the shared memory from its address space, it invokes the shmdt( ) system call. The implementation of shared memory depends on how the kernel implements process address spaces.

1.6.7 Process Management

Unix makes a neat distinction between the process and the program it is executing. To that end, the fork( ) and exit( ) system calls are used respectively to create a new process and to terminate it, while an exec( )-like system call is invoked to load a new program. After such a system call has been executed, the process resumes execution with a brand new address space containing the loaded program.

The process that invokes a fork( ) is the parent while the new process is its child.
Parents and children can find each other because the data structure describing each process includes a pointer to its immediate parent and pointers to all its immediate children.

A naive implementation of fork( ) would require both the parent's data and the parent's code to be duplicated and the copies assigned to the child. This would be quite time-consuming. Current kernels that can rely on hardware paging units follow the Copy-On-Write approach, which defers page duplication until the last moment (i.e., until the parent or the child is required to write into a page). We shall describe how Linux implements this technique in Section 7.4.4 in Chapter 7.

The exit( ) system call terminates a process. The kernel handles this system call by releasing the resources owned by the process and sending the parent process a SIGCHLD signal, which is ignored by default.

1.6.7.1 Zombie processes

How can a parent process inquire about termination of its children? The wait( ) system call allows a process to wait until one of its children terminates; it returns the process ID (PID) of the terminated child.

When executing this system call, the kernel checks whether a child has already terminated. A special zombie process state is introduced to represent terminated processes: a process remains in that state until its parent process executes a wait( ) system call on it.
The system call handler extracts some data about resource usage from the process descriptor fields; the process descriptor may be released once the data has been collected. If no child process has already terminated when the wait( ) system call is executed, the kernel usually puts the process in a wait state until a child terminates.

Many kernels also implement a waitpid( ) system call, which allows a process to wait for a specific child process. Other variants of wait( ) system calls are also quite common.
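
As a minimal sketch of the parent/child relationship described so far, the following C fragment creates a child with fork( ) and collects its termination status with waitpid( ). The function name run_child_and_wait is a hypothetical helper introduced only for this illustration; error handling is reduced to returning -1.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: creates a child that terminates at once with
 * exit_code, then waits for that specific child.
 * Returns the child's exit status, or -1 on error. */
int run_child_and_wait(int exit_code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                  /* fork( ) failed: no child was created */

    if (pid == 0)
        _exit(exit_code);           /* child: terminate immediately */

    /* Parent: until this call returns, the terminated child is kept
     * around by the kernel in the zombie state. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling run_child_and_wait(7) from a test program should return 7: the value travels from the child's _exit( ) call to the parent through the status word filled in by waitpid( ).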


It's a good practice for the kernel to keep around information on a child process until the parent issues its wait( ) call, but suppose the parent process terminates without issuing that call? The information takes up valuable memory slots that could be used to serve living processes. For example, many shells allow the user to start a command in the background and then log out. The process that is running the command shell terminates, but its children continue their execution.

The solution lies in a special system process called init that is created during system initialization. When a process terminates, the kernel changes the appropriate process descriptor pointers of all the existing children of the terminated process to make them become children of init. This process monitors the execution of all its children and routinely issues wait( ) system calls, whose side effect is to get rid of all zombies.

1.6.7.2 Process groups and login sessions

Modern Unix operating systems introduce the notion of process groups to represent a "job" abstraction. For example, in order to execute the command line:

$ ls | sort | more

a shell that supports process groups, such as bash, creates a new group for the three processes corresponding to ls, sort, and more.
In this way, the shell acts on the three processes as if they were a single entity (the job, to be precise). Each process descriptor includes a process group ID field. Each group of processes may have a group leader, which is the process whose PID coincides with the process group ID. A newly created process is initially inserted into the process group of its parent.

Modern Unix kernels also introduce login sessions. Informally, a login session contains all processes that are descendants of the process that has started a working session on a specific terminal (usually, the first command shell process created for the user). All processes in a process group must be in the same login session. A login session may have several process groups active simultaneously; one of these process groups is always in the foreground, which means that it has access to the terminal. The other active process groups are in the background. When a background process tries to access the terminal, it receives a SIGTTIN or SIGTTOU signal. In many command shells the internal commands bg and fg can be used to put a process group in either the background or the foreground.

1.6.8 Memory Management

Memory management is by far the most complex activity in a Unix kernel.
We shall dedicate more than a third of this book just to describing how Linux does it. This section illustrates some of the main issues related to memory management.

1.6.8.1 Virtual memory

All recent Unix systems provide a useful abstraction called virtual memory. Virtual memory acts as a logical layer between the application memory requests and the hardware Memory Management Unit (MMU). Virtual memory has many purposes and advantages:


• Several processes can be executed concurrently.
• It is possible to run applications whose memory needs are larger than the available physical memory.
• Processes can execute a program whose code is only partially loaded in memory.
• Each process is allowed to access a subset of the available physical memory.
• Processes can share a single memory image of a library or program.
• Programs can be relocatable, that is, they can be placed anywhere in physical memory.
• Programmers can write machine-independent code, since they do not need to be concerned about physical memory organization.

The main ingredient of a virtual memory subsystem is the notion of virtual address space. The set of memory references that a process can use is different from physical memory addresses. When a process uses a virtual address, [9] the kernel and the MMU cooperate to locate the actual physical location of the requested memory item.

[9] These addresses have different nomenclatures depending on the computer architecture. As we'll see in Chapter 2, Intel 80x86 manuals refer to them as "logical addresses."

Today's CPUs include hardware circuits that automatically translate the virtual addresses into physical ones.
To that end, the available RAM is partitioned into page frames 4 or 8 KB in length, and a set of page tables is introduced to specify the correspondence between virtual and physical addresses. These circuits make memory allocation simpler, since a request for a block of contiguous virtual addresses can be satisfied by allocating a group of page frames having noncontiguous physical addresses.

1.6.8.2 Random access memory usage

All Unix operating systems clearly distinguish two portions of the random access memory (RAM). A few megabytes are dedicated to storing the kernel image (i.e., the kernel code and the kernel static data structures). The remaining portion of RAM is usually handled by the virtual memory system and is used in three possible ways:

• To satisfy kernel requests for buffers, descriptors, and other dynamic kernel data structures
• To satisfy process requests for generic memory areas and for memory mapping of files
• To get better performance from disks and other buffered devices by means of caches

Each request type is valuable. On the other hand, since the available RAM is limited, some balancing among request types must be done, particularly when little available memory is left. Moreover, when some critical threshold of available memory is reached and a page-frame-reclaiming algorithm is invoked to free additional memory, which are the page frames most suitable for reclaiming?
As we shall see in Chapter 16, there is no simple answer to this question and very little support from theory. The only available solution lies in developing carefully tuned empirical algorithms.

One major problem that must be solved by the virtual memory system is memory fragmentation. Ideally, a memory request should fail only when the number of free page frames is too small. However, the kernel is often forced to use physically contiguous memory areas, hence the memory request could fail even if there is enough memory available but it is not available as one contiguous chunk.
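
The fragmentation problem can be illustrated with a small C sketch. It assumes, purely for illustration, that page frames are tracked by an array of flags (nonzero meaning "in use"); this is not how Linux represents page frames, and the function names count_free_frames and find_contiguous_free are hypothetical.

```c
#include <stddef.h>

/* used[i] != 0 means page frame i is currently allocated. */
size_t count_free_frames(const unsigned char *used, size_t n)
{
    size_t free_frames = 0;
    for (size_t i = 0; i < n; i++)
        if (!used[i])
            free_frames++;
    return free_frames;
}

/* Returns the index of the first run of `want` physically contiguous
 * free frames, or -1 if no such run exists. */
long find_contiguous_free(const unsigned char *used, size_t n, size_t want)
{
    size_t run = 0;                     /* length of the current free run */
    for (size_t i = 0; i < n; i++) {
        run = used[i] ? 0 : run + 1;
        if (want > 0 && run == want)
            return (long)(i + 1 - want);
    }
    return -1;
}
```

With the frame map {1, 0, 1, 0, 1, 0, 1, 0}, four frames are free, yet a request for four contiguous frames fails: the free memory exists but is scattered.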


1.6.8.3 Kernel Memory Allocator

The Kernel Memory Allocator (KMA) is a subsystem that tries to satisfy the requests for memory areas from all parts of the system. Some of these requests will come from other kernel subsystems needing memory for kernel use, and some requests will come via system calls from user programs to increase their processes' address spaces. A good KMA should have the following features:

• It must be fast. Actually, this is the most crucial attribute, since it is invoked by all kernel subsystems (including the interrupt handlers).
• It should minimize the amount of wasted memory.
• It should try to reduce the memory fragmentation problem.
• It should be able to cooperate with the other memory management subsystems in order to borrow and release page frames from them.

Several kinds of KMAs have been proposed, which are based on a variety of different algorithmic techniques, including:

• Resource map allocator
• Power-of-two free lists
• McKusick-Karels allocator
• Buddy system
• Mach's Zone allocator
• Dynix allocator
• Solaris's Slab allocator

As we shall see in Chapter 6, Linux's KMA uses a Slab allocator on top of a Buddy system.

1.6.8.4 Process virtual address space handling

The address space of a process contains all the virtual memory addresses that the process is allowed to reference.
The kernel usually stores a process virtual address space as a list of memory area descriptors. For example, when a process starts the execution of some program via an exec( )-like system call, the kernel assigns to the process a virtual address space that comprises memory areas for:

• The executable code of the program
• The initialized data of the program
• The uninitialized data of the program
• The initial program stack (that is, the User Mode stack)
• The executable code and data of needed shared libraries
• The heap (the memory dynamically requested by the program)

All recent Unix operating systems adopt a memory allocation strategy called demand paging. With demand paging, a process can start program execution with none of its pages in physical memory. As it accesses a nonpresent page, the MMU generates an exception; the exception handler finds the affected memory region, allocates a free page, and initializes it with the appropriate data. In a similar fashion, when the process dynamically requires some memory by using malloc( ) or the brk( ) system call (which is invoked internally by malloc( )), the kernel just updates the size of the heap memory region of the process. A page frame is assigned to the process only when it generates an exception by trying to refer to its virtual memory addresses.

Virtual address spaces also allow other efficient strategies, such as the Copy-On-Write strategy mentioned earlier. For example, when a new process is created, the kernel just assigns the parent's page frames to the child address space, but it marks them read only. An exception is raised as soon as the parent or the child tries to modify the contents of a page. The exception handler assigns a new page frame to the affected process and initializes it with the contents of the original page.

1.6.8.5 Swapping and caching

In order to extend the size of the virtual address space usable by the processes, the Unix operating system makes use of swap areas on disk. The virtual memory system regards the contents of a page frame as the basic unit for swapping. Whenever some process refers to a swapped-out page, the MMU raises an exception. The exception handler then allocates a new page frame and initializes the page frame with its old contents saved on disk.

On the other hand, physical memory is also used as cache for hard disks and other block devices. This is because hard drives are very slow: a disk access requires several milliseconds, which is a very long time compared with the RAM access time. Therefore, disks are often the bottleneck in system performance.
As a general rule, one of the policies already implemented in the earliest Unix system is to defer writing to disk as long as possible by loading into RAM a set of disk buffers corresponding to blocks read from disk. The sync( ) system call forces disk synchronization by writing all of the "dirty" buffers (i.e., all the buffers whose contents differ from that of the corresponding disk blocks) into disk. In order to avoid data loss, all operating systems take care to periodically write dirty buffers back to disk.

1.6.9 Device Drivers

The kernel interacts with I/O devices by means of device drivers. Device drivers are included in the kernel and consist of data structures and functions that control one or more devices, such as hard disks, keyboards, mice, monitors, network interfaces, and devices connected to a SCSI bus. Each driver interacts with the remaining part of the kernel (even with other drivers) through a specific interface. This approach has the following advantages:

• Device-specific code can be encapsulated in a specific module.
• Vendors can add new devices without knowing the kernel source code: only the interface specifications must be known.
• The kernel deals with all devices in a uniform way and accesses them through the same interface.
• It is possible to write a device driver as a module that can be dynamically loaded in the kernel without requiring the system to be rebooted.
It is also possible to dynamically unload a module that is no longer needed, thus minimizing the size of the kernel image stored in RAM.

Figure 1-5 illustrates how device drivers interface with the rest of the kernel and with the processes. Some user programs (P) wish to operate on hardware devices. They make requests to the kernel using the usual file-related system calls and the device files normally found in the /dev directory. Actually, the device files are the user-visible portion of the device driver interface. Each device file refers to a specific device driver, which is invoked by the kernel in order to perform the requested operation on the hardware component.

Figure 1-5. Device driver interface

It is worth mentioning that at the time Unix was introduced graphical terminals were uncommon and expensive, and thus only alphanumeric terminals were handled directly by Unix kernels. When graphical terminals became widespread, ad hoc applications such as the X Window System were introduced that ran as standard processes and accessed the I/O ports of the graphics interface and the RAM video area directly. Some recent Unix kernels, such as Linux 2.2, include limited support for some frame buffer devices, thus allowing a program to access the local memory inside a video card through a device file.
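
The point made in this section, that a program reaches a device through the usual file-related system calls applied to a device file, can be sketched in a few lines of C. The sketch assumes a system that provides the /dev/zero device file (Linux does); read_device is a hypothetical helper name introduced for this illustration.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: reads up to len bytes from the device file at
 * path, using exactly the calls one would use on a regular file.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_device(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY);      /* the kernel selects the driver */
    if (fd < 0)
        return -1;

    ssize_t n = read(fd, buf, len);     /* serviced by the device driver */
    close(fd);
    return n;
}
```

Reading a few bytes from /dev/zero with this helper fills the buffer with zero bytes; nothing in the calling code reveals that a driver, rather than a disk file, produced them.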


Chapter 2. Memory Addressing

This chapter deals with addressing techniques. Luckily, an operating system is not forced to keep track of physical memory all by itself; today's microprocessors include several hardware circuits to make memory management both more efficient and more robust in case of programming errors.

As in the rest of this book, we offer details in this chapter on how Intel 80x86 microprocessors address memory chips and how Linux makes use of the available addressing circuits. You will find, we hope, that when you learn the implementation details on Linux's most popular platform you will better understand both the general theory of paging and how to research the implementation on other platforms.

This is the first of three chapters related to memory management: Chapter 6 discusses how the kernel allocates main memory to itself, while Chapter 7 considers how linear addresses are assigned to processes.

2.1 Memory Addresses

Programmers casually refer to a memory address as the way to access the contents of a memory cell. But when dealing with Intel 80x86 microprocessors, we have to distinguish among three kinds of addresses:

Logical address
Included in the machine language instructions to specify the address of an operand or of an instruction.
This type of address embodies the well-known Intel segmented architecture that forces MS-DOS and Windows programmers to divide their programs into segments. Each logical address consists of a segment and an offset (or displacement) that denotes the distance from the start of the segment to the actual address.

Linear address
A single 32-bit unsigned integer that can be used to address up to 4 GB, that is, up to 4,294,967,296 memory cells. Linear addresses are usually represented in hexadecimal notation; their values range from 0x00000000 to 0xffffffff.

Physical address
Used to address memory cells included in memory chips. Physical addresses correspond to the electrical signals sent along the address pins of the microprocessor to the memory bus. They are represented as 32-bit unsigned integers.

The CPU control unit transforms a logical address into a linear address by means of a hardware circuit called a segmentation unit; subsequently, a second hardware circuit called a paging unit transforms the linear address into a physical address (see Figure 2-1).


Figure 2-1. Logical address translation

2.2 Segmentation in Hardware

Starting with the 80386 model, Intel microprocessors perform address translation in two different ways called real mode and protected mode. Real mode exists mostly to maintain processor compatibility with older models and to allow the operating system to bootstrap (see Appendix A for a short description of real mode). We shall thus focus our attention on protected mode.

2.2.1 Segmentation Registers

A logical address consists of two parts: a segment identifier and an offset that specifies the relative address within the segment. The segment identifier is a 16-bit field called Segment Selector, while the offset is a 32-bit field.

To make it easy to retrieve segment selectors quickly, the processor provides segmentation registers whose only purpose is to hold Segment Selectors; these registers are called cs, ss, ds, es, fs, and gs.
Although there are only six of them, a program can reuse the same segmentation register for different purposes by saving its content in memory and then restoring it later.

Three of the six segmentation registers have specific purposes:

cs
The code segment register, which points to a segment containing program instructions

ss
The stack segment register, which points to a segment containing the current program stack

ds
The data segment register, which points to a segment containing static and external data

The remaining three segmentation registers are general purpose and may refer to arbitrary segments.

The cs register has another important function: it includes a 2-bit field that specifies the Current Privilege Level (CPL) of the CPU. The value 0 denotes the highest privilege level, while the value 3 denotes the lowest one. Linux uses only levels 0 and 3, which are respectively called Kernel Mode and User Mode.
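
As a small sketch of how the CPL can be read out of a cs value, the C fragment below masks the 2-bit field. It relies on the Intel 80x86 convention that this field occupies the two least significant bits of the selector held in cs, which the text above does not spell out; the function names cpl_from_cs and mode_name are hypothetical.

```c
#include <stdint.h>

/* The CPL is the 2-bit field in the two least significant bits of the
 * selector held in cs (Intel 80x86 convention, assumed here). */
unsigned cpl_from_cs(uint16_t cs)
{
    return cs & 0x3u;
}

/* Maps a privilege level to the name Linux gives it. */
const char *mode_name(unsigned cpl)
{
    if (cpl == 0)
        return "Kernel Mode";
    if (cpl == 3)
        return "User Mode";
    return "not used by Linux";         /* levels 1 and 2 */
}
```

For example, with the illustrative selector values 0x10 and 0x23, cpl_from_cs( ) yields 0 and 3 respectively, that is, Kernel Mode and User Mode.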


2.2.2 Segment Descriptors

Each segment is represented by an 8-byte Segment Descriptor (see Figure 2-2) that describes the segment characteristics. Segment Descriptors are stored either in the Global Descriptor Table (GDT) or in the Local Descriptor Table (LDT).

Figure 2-2. Segment Descriptor format

Usually only one GDT is defined, while each process may have its own LDT. The address of the GDT in main memory is contained in the gdtr processor register and the address of the currently used LDT is contained in the ldtr processor register.

Each Segment Descriptor consists of the following fields:

• A 32-bit Base field that contains the linear address of the first byte of the segment.
• A G granularity flag: if it is cleared, the segment size is expressed in bytes; otherwise, it is expressed in multiples of 4096 bytes.
• A 20-bit Limit field that denotes the segment length, in the units selected by the G flag. If G is set to 0, the size of a non-null segment may vary between 1 byte and 1 MB; otherwise, it may vary between 4 KB and 4 GB.
• An S system flag: if it is cleared, the segment is a system segment that stores kernel data structures; otherwise, it is a normal code or data segment.
• A 4-bit Type field that characterizes the segment type and its access rights.
The following Segment Descriptor types are widely used:

Code Segment Descriptor
Indicates that the Segment Descriptor refers to a code segment; it may be included either in the GDT or in the LDT. The descriptor has the S flag set.


Data Segment Descriptor
Indicates that the Segment Descriptor refers to a data segment; it may be included either in the GDT or in the LDT. The descriptor has the S flag set. Stack segments are implemented by means of generic data segments.

Task State Segment Descriptor (TSSD)
Indicates that the Segment Descriptor refers to a Task State Segment (TSS), that is, a segment used to save the contents of the processor registers (see Section 3.2.2 in Chapter 3); it can appear only in the GDT. The corresponding Type field has the value 11 or 9, depending on whether the corresponding process is currently executing on the CPU. The S flag of such descriptors is set to 0.

Local Descriptor Table Descriptor (LDTD)
Indicates that the Segment Descriptor refers to a segment containing an LDT; it can appear only in the GDT. The corresponding Type field has the value 2.
The S flag of such descriptors is set to 0.

• A DPL (Descriptor Privilege Level) 2-bit field used to restrict accesses to the segment. It represents the minimal CPU privilege level required for accessing the segment. Therefore, a segment with its DPL set to 0 is accessible only when the CPL is 0, that is, in Kernel Mode, while a segment with its DPL set to 3 is accessible with every CPL value.
• A Segment-Present flag that is set to 0 if the segment is currently not stored in main memory. Linux always sets this field to 1, since it never swaps out whole segments to disk.
• An additional flag called D or B depending on whether the segment contains code or data. Its meaning is slightly different in the two cases, but it is basically set if the addresses used as segment offsets are 32 bits long and it is cleared if they are 16 bits long (see the Intel manual for further details).
• A reserved bit (bit 53) always set to 0.
• An AVL flag that may be used by the operating system but is ignored in Linux.

2.2.3 Segment Selectors

To speed up the translation of logical addresses into linear addresses, the Intel processor provides an additional nonprogrammable register (that is, a register that cannot be set by a programmer) for each of the six programmable segmentation registers. Each nonprogrammable register contains the 8-byte Segment Descriptor (described in the previous section) specified by the Segment Selector contained in the corresponding segmentation register.
Every time a Segment Selector is loaded in a segmentation register, the corresponding Segment Descriptor is loaded from memory into the matching nonprogrammable CPU register. From then on, translations of logical addresses referring to that segment can be performed without accessing the GDT or LDT stored in main memory; the processor can just refer directly to the CPU register containing the Segment Descriptor. Accesses to the GDT or LDT are necessary only when the contents of the segmentation register change (see Figure 2-3). Each Segment Selector includes the following fields:


• A 13-bit index (described further in the text following this list) that identifies the corresponding Segment Descriptor entry contained in the GDT or in the LDT
• A TI (Table Indicator) flag that specifies whether the Segment Descriptor is included in the GDT (TI = 0) or in the LDT (TI = 1)
• An RPL (Requestor Privilege Level) 2-bit field, which is precisely the Current Privilege Level of the CPU when the corresponding Segment Selector is loaded into the cs register [1]

[1] The RPL field may also be used to selectively weaken the processor privilege level when accessing data segments; see Intel documentation for details.

Figure 2-3. Segment Selector and Segment Descriptor

Since a Segment Descriptor is 8 bytes long, its relative address inside the GDT or the LDT is obtained by multiplying the most significant 13 bits of the Segment Selector by 8. For instance, if the GDT is at 0x00020000 (the value stored in the gdtr register) and the index specified by the Segment Selector is 2, the address of the corresponding Segment Descriptor is 0x00020000 + (2 x 8), or 0x00020010.

The first entry of the GDT is always set to 0: this ensures that logical addresses with a null Segment Selector will be considered invalid, thus causing a processor exception.
The maximum number of Segment Descriptors that can be stored in the GDT is thus 8191, that is, 2^13 - 1.

2.2.4 Segmentation Unit

Figure 2-4 shows in detail how a logical address is translated into a corresponding linear address. The segmentation unit performs the following operations:
• Examines the TI field of the Segment Selector, in order to determine which Descriptor Table stores the Segment Descriptor. This field indicates that the Descriptor is either in the GDT (in which case the segmentation unit gets the base linear address of the GDT from the gdtr register) or in the active LDT (in which case the segmentation unit gets the base linear address of that LDT from the ldtr register).
• Computes the address of the Segment Descriptor from the index field of the Segment Selector. The index field is multiplied by 8 (the size of a Segment Descriptor), and the result is added to the content of the gdtr or ldtr register.
• Adds the offset of the logical address to the Base field of the Segment Descriptor, thus obtaining the linear address.
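
The three operations above can be sketched in a few lines of C. This is only an illustration, not kernel or hardware code: the names selector_fields, make_selector, decode_selector, descriptor_address, and linear_address are hypothetical, the bit positions follow the usual Intel selector layout (index in bits 15..3, TI in bit 2, RPL in bits 1..0), and the limit and privilege checks that real hardware performs are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit Segment Selector, in the order the text lists them. */
struct selector_fields {
    unsigned index; /* bits 15..3: entry number in the GDT or LDT */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: Requestor Privilege Level */
};

/* Hypothetical helper: pack the three fields into a selector value. */
static uint16_t make_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192u && ti <= 1u && rpl <= 3u);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}

/* Split a selector back into its three fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = (unsigned)sel >> 3;
    f.ti = ((unsigned)sel >> 2) & 1u;
    f.rpl = (unsigned)sel & 3u;
    return f;
}

/* Second operation: table base (from gdtr or ldtr) + index * 8. */
static uint32_t descriptor_address(uint32_t table_base, uint16_t sel)
{
    return table_base + (uint32_t)decode_selector(sel).index * 8u;
}

/* Third operation: Base field + offset. Limit and privilege checks
 * done by real hardware are deliberately omitted from this sketch. */
static uint32_t linear_address(uint32_t segment_base, uint32_t offset)
{
    return segment_base + offset;
}
```

With the values used earlier in the text, descriptor_address(0x00020000, make_selector(2, 0, 0)) yields 0x00020010.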


Figure 2-4. Translating a logical address

Notice that, thanks to the nonprogrammable registers associated with the segmentation registers, the first two operations need to be performed only when a segmentation register has been changed.

2.3 Segmentation in Linux

Segmentation has been included in Intel microprocessors to encourage programmers to split their applications into logically related entities, such as subroutines or global and local data areas. However, Linux uses segmentation in a very limited way. In fact, segmentation and paging are somewhat redundant since both can be used to separate the physical address spaces of processes: segmentation can assign a different linear address space to each process while paging can map the same linear address space into different physical address spaces. Linux prefers paging to segmentation for the following reasons:
• Memory management is simpler when all processes use the same segment register values, that is, when they share the same set of linear addresses.
• One of the design objectives of Linux is portability to the most popular architectures; however, several RISC processors support segmentation in a very limited way.

The 2.2 version of Linux uses segmentation only when required by the Intel 80x86 architecture.
In particular, all processes use the same logical addresses, so the total number of segments to be defined is quite limited and it is possible to store all Segment Descriptors in the Global Descriptor Table (GDT). This table is implemented by the array gdt_table referred to by the gdt variable. If you look in the Source Code Index, you can see that these symbols are defined in the file arch/i386/kernel/head.S. Every macro, function, and other symbol in this book is listed in the appendix so you can quickly find it in the source code.

Local Descriptor Tables are not used by the kernel, although a system call exists that allows processes to create their own LDTs. This turns out to be useful to applications such as Wine that execute segment-oriented Microsoft Windows applications.


Here are the segments used by Linux:
• A kernel code segment. The fields of the corresponding Segment Descriptor in the GDT have the following values:
    o Base = 0x00000000
    o Limit = 0xfffff
    o G (granularity flag) = 1, for segment size expressed in pages
    o S (system flag) = 1, for normal code or data segment
    o Type = 0xa, for code segment that can be read and executed
    o DPL (Descriptor Privilege Level) = 0, for Kernel Mode
    o D/B (32-bit address flag) = 1, for 32-bit offset addresses

Thus, the linear addresses associated with that segment start at 0 and reach the addressing limit of 2^32 - 1. The S and Type fields specify that the segment is a code segment that can be read and executed. Its DPL value is 0, thus it can be accessed only in Kernel Mode. The corresponding Segment Selector is defined by the __KERNEL_CS macro: in order to address the segment, the kernel just loads the value yielded by the macro into the cs register.
• A kernel data segment.
The fields of the corresponding Segment Descriptor in the GDT have the following values:
    o Base = 0x00000000
    o Limit = 0xfffff
    o G (granularity flag) = 1, for segment size expressed in pages
    o S (system flag) = 1, for normal code or data segment
    o Type = 2, for data segment that can be read and written
    o DPL (Descriptor Privilege Level) = 0, for Kernel Mode
    o D/B (32-bit address flag) = 1, for 32-bit offset addresses

This segment is identical to the previous one (in fact, they overlap in the linear address space) except for the value of the Type field, which specifies that it is a data segment that can be read and written. The corresponding Segment Selector is defined by the __KERNEL_DS macro.
• A user code segment shared by all processes in User Mode. The fields of the corresponding Segment Descriptor in the GDT have the following values:
    o Base = 0x00000000
    o Limit = 0xfffff
    o G (granularity flag) = 1, for segment size expressed in pages
    o S (system flag) = 1, for normal code or data segment
    o Type = 0xa, for code segment that can be read and executed
    o DPL (Descriptor Privilege Level) = 3, for User Mode
    o D/B (32-bit address flag) = 1, for 32-bit offset addresses

The S and DPL fields specify that the segment is not a system segment and that its privilege level is equal to 3; it can thus be accessed both in Kernel Mode and in User Mode. The corresponding Segment Selector is defined by the __USER_CS macro.


• A user data segment shared by all processes in User Mode. The fields of the corresponding Segment Descriptor in the GDT have the following values:
    o Base = 0x00000000
    o Limit = 0xfffff
    o G (granularity flag) = 1, for segment size expressed in pages
    o S (system flag) = 1, for normal code or data segment
    o Type = 2, for data segment that can be read and written
    o DPL (Descriptor Privilege Level) = 3, for User Mode
    o D/B (32-bit address flag) = 1, for 32-bit offset addresses

This segment overlaps the previous one: they are identical, except for the value of Type. The corresponding Segment Selector is defined by the __USER_DS macro.
• A Task State Segment (TSS) segment for each process. The descriptors of these segments are stored in the GDT. The Base field of the TSS descriptor associated with each process contains the address of the tss field of the corresponding process descriptor. The G flag is cleared, while the Limit field is set to 0xeb, since the TSS segment is 236 bytes long. The Type field is set to 9 or 11 (available 32-bit TSS), and the DPL is set to 0, since processes in User Mode are not allowed to access TSS segments.
• A default LDT segment that is usually shared by all processes. This segment is stored in the default_ldt variable.
The default LDT includes a single entry consisting of a null Segment Descriptor. Each process has its own LDT Segment Descriptor, which usually points to the common default LDT segment. The Base field is set to the address of default_ldt and the Limit field is set to 7. If a process requires a real LDT, a new 4096-byte segment is created (it can include up to 511 Segment Descriptors), and the default LDT Segment Descriptor associated with that process is replaced in the GDT with a new descriptor with specific values for the Base and Limit fields.

For each process, therefore, the GDT contains two different Segment Descriptors: one for the TSS segment and one for the LDT segment. The maximum number of entries allowed in the GDT is 12+2xNR_TASKS, where, in turn, NR_TASKS denotes the maximum number of processes. In the previous list we described the six main Segment Descriptors used by Linux. Four additional Segment Descriptors cover Advanced Power Management (APM) features, and four entries of the GDT are left unused, for a grand total of 14.

As we mentioned before, the GDT can have at most 2^13 = 8192 entries, of which the first is always null. Since 14 are either unused or filled by the system, NR_TASKS cannot be larger than 8180/2 = 4090.

The TSS and LDT descriptors for each process are added to the GDT as the process is created.
As we shall see in Section 3.3.2 in Chapter 3, the kernel itself spawns the first process: process 0 running init_task. During kernel initialization, the trap_init( ) function inserts the TSS descriptor of this first process into the GDT using the statement:

    set_tss_desc(0, &init_task.tss);

The first process creates others, so that every subsequent process is the child of some existing process. The copy_thread( ) function, which is invoked from the clone( ) and fork( )


system calls to create new processes, executes the same function in order to set the TSS of the new process:

    set_tss_desc(nr, &(task[nr]->tss));

Since each TSS descriptor refers to a different process, of course, each Base field has a different value. The copy_thread( ) function also invokes the set_ldt_desc( ) function in order to insert a Segment Descriptor in the GDT relative to the default LDT for the new process.

The kernel data segment includes a process descriptor for each process. Each process descriptor includes its own TSS segment and a pointer to its LDT segment, which is also located inside the kernel data segment.

As stated earlier, the Current Privilege Level of the CPU reflects whether the processor is in User or Kernel Mode and is specified by the RPL field of the Segment Selector stored in the cs register. Whenever the Current Privilege Level is changed, some segmentation registers must be correspondingly updated. For instance, when the CPL is equal to 3 (User Mode), the ds register must contain the Segment Selector of the user data segment, but when the CPL is equal to 0, the ds register must contain the Segment Selector of the kernel data segment.

A similar situation occurs for the ss register: it must refer to a User Mode stack inside the user data segment when the CPL is 3, and it must refer to a Kernel Mode stack inside the kernel data segment when the CPL is 0.
When switching from User Mode to Kernel Mode, Linux always makes sure that the ss register contains the Segment Selector of the kernel data segment.

2.4 Paging in Hardware

The paging unit translates linear addresses into physical ones. It checks the requested access type against the access rights of the linear address. If the memory access is not valid, it generates a page fault exception (see Chapter 4 and Chapter 6).

For the sake of efficiency, linear addresses are grouped in fixed-length intervals called pages; contiguous linear addresses within a page are mapped into contiguous physical addresses. In this way, the kernel can specify the physical address and the access rights of a page instead of those of all the linear addresses included in it. Following the usual convention, we shall use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses.

The paging unit thinks of all RAM as partitioned into fixed-length page frames (they are sometimes referred to as physical pages). Each page frame contains a page, that is, the length of a page frame coincides with that of a page. A page frame is a constituent of main memory, and hence it is a storage area.
It is important to distinguish a page from a page frame: the former is just a block of data, which may be stored in any page frame or on disk.

The data structures that map linear to physical addresses are called page tables; they are stored in main memory and must be properly initialized by the kernel before enabling the paging unit.


In Intel processors, paging is enabled by setting the PG flag of the cr0 register. When PG = 0, linear addresses are interpreted as physical addresses.

2.4.1 Regular Paging

Starting with the i80386, the paging unit of Intel processors handles 4 KB pages. The 32 bits of a linear address are divided into three fields:

Directory
    The most significant 10 bits
Table
    The intermediate 10 bits
Offset
    The least significant 12 bits

The translation of linear addresses is accomplished in two steps, each based on a type of translation table. The first translation table is called Page Directory and the second is called Page Table.

The physical address of the Page Directory in use is stored in the cr3 processor register. The Directory field within the linear address determines the entry in the Page Directory that points to the proper Page Table. The address's Table field, in turn, determines the entry in the Page Table that contains the physical address of the page frame containing the page. The Offset field determines the relative position within the page frame (see Figure 2-5). Since it is 12 bits long, each page consists of 4096 bytes of data.

Figure 2-5. Paging by Intel 80x86 processors
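
The 10/10/12 split described above amounts to two shifts and two masks. The following C fragment is a minimal sketch of that arithmetic only; the names linear_fields, split_linear, and join_linear are hypothetical and do not come from the kernel sources, and no table lookup is modeled.

```c
#include <assert.h>
#include <stdint.h>

/* The three fields of a 32-bit linear address under regular paging. */
struct linear_fields {
    uint32_t directory; /* most significant 10 bits */
    uint32_t table;     /* intermediate 10 bits */
    uint32_t offset;    /* least significant 12 bits */
};

/* Split a linear address into Directory, Table, and Offset. */
static struct linear_fields split_linear(uint32_t addr)
{
    struct linear_fields f;
    f.directory = addr >> 22;
    f.table = (addr >> 12) & 0x3ffu;
    f.offset = addr & 0xfffu;
    return f;
}

/* Inverse operation, useful for checking the split. */
static uint32_t join_linear(struct linear_fields f)
{
    assert(f.directory < 1024u && f.table < 1024u && f.offset < 4096u);
    return (f.directory << 22) | (f.table << 12) | f.offset;
}
```

For instance, the address 0x20021406 splits into Directory 0x80, Table 0x21, and Offset 0x406.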


Both the Directory and the Table fields are 10 bits long, so Page Directories and Page Tables can include up to 1024 entries. It follows that a Page Directory can address up to 1024 x 1024 x 4096 = 2^32 memory cells, as you'd expect in 32-bit addresses.

The entries of Page Directories and Page Tables have the same structure. Each entry includes the following fields:

Present flag
    If it is set, the referred page (or Page Table) is contained in main memory; if the flag is 0, the page is not contained in main memory and the remaining entry bits may be used by the operating system for its own purposes. (We shall see in Chapter 16 how Linux makes use of this field.)

Field containing the 20 most significant bits of a page frame physical address
    Since each page frame has a 4 KB capacity, its physical address must be a multiple of 4096, so the 12 least significant bits of the physical address are always equal to 0. If the field refers to a Page Directory, the page frame contains a Page Table; if it refers to a Page Table, the page frame contains a page of data.

Accessed flag
    Is set each time the paging unit addresses the corresponding page frame. This flag may be used by the operating system when selecting pages to be swapped out. The paging unit never resets this flag; this must be done by the operating system.

Dirty flag
    Applies only to the Page Table entries. It is set each time a write operation is performed on the page frame.
    As in the previous case, this flag may be used by the operating system when selecting pages to be swapped out. The paging unit never resets this flag; this must be done by the operating system.

Read/Write flag
    Contains the access right (Read/Write or Read) of the page or of the Page Table (see Section 2.4.3 later in this chapter).

User/Supervisor flag
    Contains the privilege level required to access the page or Page Table (see Section 2.4.3).

Two flags called PCD and PWT
    Control the way the page or Page Table is handled by the hardware cache (see Section 2.4.6 later in this chapter).


Page Size flag
    Applies only to Page Directory entries. If it is set, the entry refers to a 4 MB long page frame (see the following section).

If the entry of a Page Table or Page Directory needed to perform an address translation has the Present flag cleared, the paging unit stores the linear address in the cr2 processor register and generates the exception 14, that is, the "Page fault" exception.

2.4.2 Extended Paging

Starting with the Pentium model, Intel 80x86 microprocessors introduce extended paging, which allows page frames to be either 4 KB or 4 MB in size (see Figure 2-6).

Figure 2-6. Extended paging

As we have seen in the previous section, extended paging is enabled by setting the Page Size flag of a Page Directory entry. In this case, the paging unit divides the 32 bits of a linear address into two fields:

Directory
    The most significant 10 bits
Offset
    The remaining 22 bits

Page Directory entries for extended paging are the same as for normal paging, except that:
• The Page Size flag must be set.
• Only the first 10 most significant bits of the 20-bit physical address field are significant. This is because each physical address is aligned on a 4 MB boundary, so the 22 least significant bits of the address are 0.
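
As a rough illustration of the two points above, the C sketch below combines a Page Directory entry that maps a 4 MB frame with the 22-bit Offset of a linear address. The text does not give bit positions, so placing the Page Size flag at bit 7 is an assumption taken from Intel's entry layout; the macro and function names are hypothetical, and the Present flag and access-right checks are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the Page Size flag sits at bit 7 of a Page Directory entry. */
#define PDE_PAGE_SIZE   0x00000080u
/* Only the top 10 bits of the address field matter for a 4 MB frame. */
#define FRAME_4MB_MASK  0xffc00000u
/* The remaining 22 bits of the linear address form the Offset. */
#define OFFSET_4MB_MASK 0x003fffffu

/* Nonzero if the entry refers to a 4 MB page frame. */
static int pde_maps_4mb_frame(uint32_t pde)
{
    return (pde & PDE_PAGE_SIZE) != 0;
}

/* Directory field: the most significant 10 bits, as in regular paging. */
static uint32_t extended_directory_index(uint32_t addr)
{
    return addr >> 22;
}

/* Physical address = 4 MB-aligned frame base + 22-bit offset. */
static uint32_t extended_physical(uint32_t pde, uint32_t addr)
{
    assert(pde_maps_4mb_frame(pde));
    return (pde & FRAME_4MB_MASK) | (addr & OFFSET_4MB_MASK);
}
```

Because no Page Table is consulted, a single Page Directory entry is enough to translate any of the 2^22 addresses inside the frame.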


Extended paging coexists with regular paging; it is enabled by setting the PSE flag of the cr4 processor register. Extended paging is used to translate large intervals of contiguous linear addresses into corresponding physical ones; in these cases, the kernel can do without intermediate Page Tables and thus save memory.

2.4.3 Hardware Protection Scheme

The paging unit uses a different protection scheme from the segmentation unit. While Intel processors allow four possible privilege levels to a segment, only two privilege levels are associated with pages and Page Tables, because privileges are controlled by the User/Supervisor flag mentioned in Section 2.4.1. When this flag is 0, the page can be addressed only when the CPL is less than 3 (this means, for Linux, when the processor is in Kernel Mode). When the flag is 1, the page can always be addressed.

Furthermore, instead of the three types of access rights (Read, Write, Execute) associated with segments, only two types of access rights (Read, Write) are associated with pages. If the Read/Write flag of a Page Directory or Page Table entry is equal to 0, the corresponding Page Table or page can only be read; otherwise it can be read and written.

2.4.4 An Example of Paging

A simple example will help in clarifying how paging works.

Let us assume that the kernel has assigned the linear address space between 0x20000000 and 0x2003ffff to a running process.
This space consists of exactly 64 pages. We don't care about the physical addresses of the page frames containing the pages; in fact, some of them might not even be in main memory. We are interested only in the remaining fields of the page table entries.

Let us start with the 10 most significant bits of the linear addresses assigned to the process, which are interpreted as the Directory field by the paging unit. The addresses start with a 2 followed by zeros, so the 10 bits all have the same value, namely 0x080 or 128 decimal. Thus the Directory field in all the addresses refers to the 129th entry of the process Page Directory. The corresponding entry must contain the physical address of the Page Table assigned to the process (see Figure 2-7). If no other linear addresses are assigned to the process, all the remaining 1023 entries of the Page Directory are filled with zeros.

Figure 2-7. An example of paging
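
The two figures quoted above (64 pages, Directory field equal to 128) can be checked with a short calculation. The C helpers below are a hedged sketch: PAGE_SHIFT is defined locally for 4 KB pages, and pages_spanned and directory_field are hypothetical names introduced only for this check.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u /* 4 KB pages: 2^12 bytes each (local definition) */

/* Number of pages touched by the inclusive range [first, last]. */
static uint32_t pages_spanned(uint32_t first, uint32_t last)
{
    assert(first <= last);
    return (last >> PAGE_SHIFT) - (first >> PAGE_SHIFT) + 1u;
}

/* Directory field: the 10 most significant bits of a linear address. */
static uint32_t directory_field(uint32_t addr)
{
    return addr >> 22;
}
```

Both ends of the interval, 0x20000000 and 0x2003ffff, give a Directory field of 128, and the interval spans (0x2003f - 0x20000) + 1 = 64 pages.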


The values assumed by the intermediate 10 bits (that is, the values of the Table field) range from 0 to 0x03f, or from 0 to 63 decimal. Thus, only the first 64 entries of the Page Table are significant. The remaining 960 entries are filled with zeros.

Suppose that the process needs to read the byte at linear address 0x20021406. This address is handled by the paging unit as follows:
1. The Directory field 0x80 is used to select entry 0x80 of the Page Directory, which points to the Page Table associated with the process's pages.
2. The Table field 0x21 is used to select entry 0x21 of the Page Table, which points to the page frame containing the desired page.
3. Finally, the Offset field 0x406 is used to select the byte at offset 0x406 in the desired page frame.

If the Present flag of the 0x21 entry of the Page Table is cleared, the page is not present in main memory; in this case, the paging unit issues a page fault exception while translating the linear address. The same exception is issued whenever the process attempts to access linear addresses outside of the interval delimited by 0x20000000 and 0x2003ffff since the Page Table entries not assigned to the process are filled with zeros; in particular, their Present flags are all cleared.

2.4.5 Three-Level Paging

Two-level paging is used by 32-bit microprocessors.
But in recent years, several microprocessors (such as Compaq's Alpha and Sun's UltraSPARC) have adopted a 64-bit architecture. In this case, two-level paging is no longer suitable and it is necessary to move up to three-level paging. Let us use a thought experiment to see why.

Start by assuming about as large a page size as is reasonable (since you have to account for pages being transferred routinely to and from disk). Let's choose 16 KB for the page size. Since 1 KB covers a range of 2^10 addresses, 16 KB covers 2^14 addresses, so the Offset field would be 14 bits. This leaves 50 bits of the linear address to be distributed between the Table and the Directory fields. If we now decide to reserve 25 bits for each of these two fields, this means that both the Page Directory and the Page Tables of a process would include 2^25 entries, that is, more than 32 million entries.

Even if RAM is getting cheaper and cheaper, we cannot afford to waste so much memory space just for storing the page tables.

The solution chosen for Compaq's Alpha microprocessors is the following:
• Page frames are 8 KB long, so the Offset field is 13 bits long.
• Only the least significant 43 bits of an address are used. (The most significant 21 bits are always set to 0.)
• Three levels of page tables are introduced so that the remaining 30 bits of the address can be split into three 10-bit fields (see Figure 2-9 later in this chapter). So the Page Tables include 2^10 = 1024 entries as in the two-level paging schema examined previously.
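
The Alpha layout listed above (13 + 10 + 10 + 10 = 43 bits) and the table sizes from the thought experiment can be written out as a small C sketch. The field names level1, level2, and level3 are hypothetical, since the text does not name the three levels here, and the assertion on the upper 21 bits simply mirrors the statement above that they are always 0.

```c
#include <assert.h>
#include <stdint.h>

/* A 43-bit address split as described for the Alpha. */
struct alpha_fields {
    unsigned level1; /* top 10-bit index (hypothetical name) */
    unsigned level2; /* middle 10-bit index (hypothetical name) */
    unsigned level3; /* lowest 10-bit index (hypothetical name) */
    unsigned offset; /* 13 bits, for 8 KB page frames */
};

static struct alpha_fields split_alpha(uint64_t addr)
{
    struct alpha_fields f;
    assert((addr >> 43) == 0); /* per the text, the upper 21 bits are 0 */
    f.offset = (unsigned)(addr & 0x1fffu);
    f.level3 = (unsigned)((addr >> 13) & 0x3ffu);
    f.level2 = (unsigned)((addr >> 23) & 0x3ffu);
    f.level1 = (unsigned)((addr >> 33) & 0x3ffu);
    return f;
}

/* Entries in a table indexed by a field of the given width. */
static uint64_t table_entries(unsigned index_bits)
{
    assert(index_bits < 64u);
    return UINT64_C(1) << index_bits;
}
```

Here table_entries(25) gives 33,554,432 entries for each table of the two-level thought experiment, against table_entries(10) = 1024 for each table of the three-level scheme.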


As we shall see in Section 2.5 later in this chapter, Linux's designers decided to implement a paging model inspired by the Alpha architecture.

2.4.6 Hardware Cache

Today's microprocessors have clock rates approaching gigahertz, while dynamic RAM (DRAM) chips have access times in the range of tens of clock cycles. This means that the CPU may be held back considerably while executing instructions that require fetching operands from RAM and/or storing results into RAM.

Hardware cache memories have been introduced to reduce the speed mismatch between CPU and RAM. They are based on the well-known locality principle, which holds both for programs and data structures: because of the cyclic structure of programs and the packing of related data into linear arrays, addresses close to the ones most recently used have a high probability of being used in the near future. It thus makes sense to introduce a smaller and faster memory that contains the most recently used code and data. For this purpose, a new unit called the line has been introduced into the Intel architecture. It consists of a few dozen contiguous bytes that are transferred in burst mode between the slow DRAM and the fast on-chip static RAM (SRAM) used to implement caches.

The cache is subdivided into subsets of lines. At one extreme the cache can be direct mapped, in which case a line in main memory is always stored at the exact same location in the cache. At the other extreme, the cache is fully associative, meaning that any line in memory can be stored at any location in the cache.
But most caches are to some degree N-way associative, where any line of main memory can be stored in any one of N lines of the cache. For instance, a line of memory can be stored in two different lines of a 2-way set associative cache.

As shown in Figure 2-8, the cache unit is inserted between the paging unit and the main memory. It includes both a hardware cache memory and a cache controller. The cache memory stores the actual lines of memory. The cache controller stores an array of entries, one entry for each line of the cache memory. Each entry includes a tag and a few flags that describe the status of the cache line. The tag consists of some bits that allow the cache controller to recognize the memory location currently mapped by the line. The bits of the memory physical address are usually split into three groups: the most significant ones correspond to the tag, the middle ones correspond to the cache controller subset index, the least significant ones to the offset within the line.

Figure 2-8. Processor hardware cache

When accessing a RAM memory cell, the CPU extracts the subset index from the physical address and compares the tags of all lines in the subset with the high-order bits of the physical address. If a line with the same tag as the high-order bits of the address is found, the CPU has a cache hit; otherwise, it has a cache miss.

When a cache hit occurs, the cache controller behaves differently depending on the access type. For a read operation, the controller selects the data from the cache line and transfers it into a CPU register; the RAM is not accessed and the CPU achieves the time saving for which the cache system was invented. For a write operation, the controller may implement one of two basic strategies called write-through and write-back. In a write-through, the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations. In a write-back, which offers more immediate efficiency, only the cache line is updated, and the contents of the RAM are left unchanged. After a write-back, of course, the RAM must eventually be updated. The cache controller writes the cache line back into RAM only when the CPU executes an instruction requiring a flush of cache entries or when a FLUSH hardware signal occurs (usually after a cache miss).

When a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from RAM into the cache entry.

Multiprocessor systems have a separate hardware cache for every processor, and therefore they need additional hardware circuitry to synchronize the cache contents. See Section 11.3.2 in Chapter 11.

Cache technology is rapidly evolving. For example, the first Pentium models included a single on-chip cache called the L1-cache. More recent models also include another larger and slower on-chip cache called the L2-cache.
The consistency between the two cache levels is implemented at the hardware level. Linux ignores these hardware details and assumes there is a single cache.

The CD flag of the cr0 processor register is used to enable or disable the cache circuitry. The NW flag, in the same register, specifies whether the write-through or the write-back strategy is used for the caches.

Another interesting feature of the Pentium cache is that it lets an operating system associate a different cache management policy with each page frame. For that purpose, each Page Directory and each Page Table entry includes two flags: PCD specifies whether the cache must be enabled or disabled while accessing data included in the page frame; PWT specifies whether the write-back or the write-through strategy must be applied while writing data into the page frame. Linux clears the PCD and PWT flags of all Page Directory and Page Table entries: as a result, caching is enabled for all page frames and the write-back strategy is always adopted for writing.

The L1_CACHE_BYTES macro yields the size of a cache line on a Pentium, that is, 32 bytes. In order to optimize the cache hit rate, the kernel adopts the following rules:

• The most frequently used fields of a data structure are placed at the low offset within the data structure so that they can be cached in the same line.
• When allocating a large set of data structures, the kernel tries to store each of them in memory so that all cache lines are uniformly used.
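The first rule can be made concrete with a small sketch. Everything named below is hypothetical and chosen for illustration only: CACHE_LINE_BYTES stands in for the 32-byte Pentium line size, and struct demo_desc is an invented descriptor whose frequently used fields come first. The helpers round a size up to a whole number of lines and check that the hot fields of a line-aligned object land in its first line.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 32u   /* assumed line size, as on the Pentium */

/* Round a size up to the next multiple of the cache line size. */
static size_t cache_line_align(size_t n)
{
    /* The mask trick below only works for a power-of-two line size. */
    assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1)) == 0);
    return (n + CACHE_LINE_BYTES - 1) & ~(size_t)(CACHE_LINE_BYTES - 1);
}

/* Index of the line, counted from the object start, that holds a byte offset. */
static size_t cache_line_of(size_t offset)
{
    return offset / CACHE_LINE_BYTES;
}

/* Hypothetical descriptor: hot fields first, rarely used data afterwards. */
struct demo_desc {
    unsigned long state;      /* hot */
    unsigned long flags;      /* hot */
    unsigned long counter;    /* hot */
    char          name[64];   /* cold */
};

/* 1 if all three hot fields fall in line 0 of a line-aligned object. */
static int hot_fields_share_line(void)
{
    size_t last_hot_byte = offsetof(struct demo_desc, counter)
                         + sizeof(unsigned long) - 1;
    return cache_line_of(offsetof(struct demo_desc, state)) == 0
        && cache_line_of(last_hot_byte) == 0;
}
```

With this layout, touching state, flags, and counter costs at most one line fill, provided the object itself starts on a line boundary.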


2.4.7 Translation Lookaside Buffers (TLB)

Besides general-purpose hardware caches, Intel 80x86 processors include other caches called translation lookaside buffers or TLB to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through slow accesses to the page tables in RAM. The physical address is then stored in a TLB entry, so that further references to the same linear address can be quickly translated.

The invlpg instruction can be used to invalidate (that is, to free) a single entry of a TLB. In order to invalidate all TLB entries, the processor can simply write into the cr3 register that points to the currently used Page Directory.

Since the TLBs serve as caches of page table contents, whenever a Page Table entry is modified, the kernel must invalidate the corresponding TLB entry. To do this, Linux makes use of the flush_tlb_page(addr) function, which invokes __flush_tlb_one( ). The latter function executes the invlpg Assembly instruction:

    movl $addr,%eax
    invlpg (%eax)

Sometimes it is necessary to invalidate all TLB entries, such as during kernel initialization.
In such cases, the kernel invokes the __flush_tlb( ) function, which rewrites the current value of cr3 back into it:

    movl %cr3, %eax
    movl %eax, %cr3

2.5 Paging in Linux

As we explained in Section 2.4.5, Linux adopted a three-level paging model so paging is feasible on 64-bit architectures. Figure 2-9 shows the model, which defines three types of paging tables:

• Page Global Directory
• Page Middle Directory
• Page Table

The Page Global Directory includes the addresses of several Page Middle Directories, which in turn include the addresses of several Page Tables. Each Page Table entry points to a page frame. The linear address is thus split into four parts. Figure 2-9 does not show the bit numbers because the size of each part depends on the computer architecture.
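Since the width of each part depends on the architecture, the four-way split is easiest to see when the field widths are passed in as data. The following sketch is an illustration under our own assumptions: the types and the two layouts (10/0/10/12 bits for a Pentium-like machine, 10/10/10/13 bits for an Alpha-like one) are written for this example and are not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths, in bits, of one hypothetical paging layout. */
struct paging_layout {
    unsigned pgd_bits;     /* Page Global Directory index */
    unsigned pmd_bits;     /* Page Middle Directory index (0 if unused) */
    unsigned pte_bits;     /* Page Table index */
    unsigned offset_bits;  /* offset inside the page frame */
};

/* The four parts of a linear address. */
struct split_addr {
    uint64_t pgd, pmd, pte, offset;
};

/* Extract `bits` bits of `addr`, starting at bit position `shift`. */
static uint64_t field(uint64_t addr, unsigned shift, unsigned bits)
{
    assert(bits < 64 && shift < 64);
    return (addr >> shift) & (((uint64_t)1 << bits) - 1);
}

/* Split a linear address according to the given layout. */
static struct split_addr split_linear(uint64_t addr, struct paging_layout l)
{
    struct split_addr s;
    unsigned pte_shift = l.offset_bits;
    unsigned pmd_shift = pte_shift + l.pte_bits;
    unsigned pgd_shift = pmd_shift + l.pmd_bits;

    s.offset = field(addr, 0, l.offset_bits);
    s.pte    = field(addr, pte_shift, l.pte_bits);
    s.pmd    = field(addr, pmd_shift, l.pmd_bits);   /* always 0 for a 0-bit field */
    s.pgd    = field(addr, pgd_shift, l.pgd_bits);
    return s;
}

/* Two example layouts (assumed values, see the lead-in). */
static const struct paging_layout layout_i386  = { 10, 0, 10, 12 };
static const struct paging_layout layout_alpha = { 10, 10, 10, 13 };
```

For instance, splitting 0xc0101234 with layout_i386 gives Page Global Directory index 768, Page Table index 257, and offset 0x234; the middle field is empty.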


Figure 2-9. The Linux paging model

Linux handling of processes relies heavily on paging. In fact, the automatic translation of linear addresses into physical ones makes the following design objectives feasible:

• Assign a different physical address space to each process, thus ensuring an efficient protection against addressing errors.
• Distinguish pages, that is, groups of data, from page frames, that is, physical addresses in main memory. This allows the same page to be stored in a page frame, then saved to disk, and later reloaded in a different page frame. This is the basic ingredient of the virtual memory mechanism (see Chapter 16).

As we shall see in Chapter 7, each process has its own Page Global Directory and its own set of Page Tables. When a process switch occurs (see Section 3.2 in Chapter 3), Linux saves in a TSS segment the contents of the cr3 control register and loads from another TSS segment a new value into cr3. Thus, when the new process resumes its execution on the CPU, the paging unit refers to the correct set of page tables.

What happens when this three-level paging model is applied to the Pentium, which uses only two types of page tables? Linux essentially eliminates the Page Middle Directory field by saying that it contains zero bits.
However, the position of the Page Middle Directory in the sequence of pointers is kept so that the same code can work on 32-bit and 64-bit architectures. The kernel keeps a position for the Page Middle Directory by setting the number of entries in it to 1 and mapping this single entry into the proper entry of the Page Global Directory.

Mapping logical to linear addresses now becomes a mechanical task, although somewhat complex. The next few sections of this chapter are thus a rather tedious list of functions and macros that retrieve information the kernel needs to find addresses and manage the tables; most of the functions are one or two lines long. You may want to just skim these sections now, but it is useful to know the role of these functions and macros because you'll see them often in discussions in subsequent chapters.
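To see how a zero-bit Page Middle Directory can stay in the sequence of pointers without costing anything, consider the following user-space sketch. It is not kernel code: the demo_ names, the entry type, and the table sizes are assumptions made for this illustration. The middle step simply hands back the Page Global Directory entry it was given, so a three-step walk works unchanged on a two-level machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12
#define DEMO_PGDIR_SHIFT  22
#define DEMO_PTRS_PER_PTE 1024
#define DEMO_PTRS_PER_PGD 1024

/* A directory entry here is just a pointer to a simulated Page Table. */
typedef struct { uint32_t *page_table; } demo_pgd_t;
typedef demo_pgd_t demo_pmd_t;   /* folded level: same entry, same type */

/* Step 1: pick the Page Global Directory entry for the address. */
static demo_pgd_t *demo_pgd_offset(demo_pgd_t *pgd, uint32_t addr)
{
    return pgd + (addr >> DEMO_PGDIR_SHIFT);
}

/* Step 2: the middle directory has one entry, which is the PGD entry itself. */
static demo_pmd_t *demo_pmd_offset(demo_pgd_t *dir, uint32_t addr)
{
    (void)addr;                  /* no address bits are consumed here */
    return dir;
}

/* Step 3: pick the Page Table entry for the address. */
static uint32_t *demo_pte_offset(demo_pmd_t *pmd, uint32_t addr)
{
    assert(pmd->page_table != NULL);
    return pmd->page_table
         + ((addr >> DEMO_PAGE_SHIFT) & (DEMO_PTRS_PER_PTE - 1));
}

/* Walk all three steps; returns the Page Table entry, or 0 if unmapped. */
static uint32_t demo_lookup(demo_pgd_t *pgd, uint32_t addr)
{
    demo_pgd_t *pgd_entry = demo_pgd_offset(pgd, addr);
    demo_pmd_t *pmd_entry = demo_pmd_offset(pgd_entry, addr);

    if (pmd_entry->page_table == NULL)
        return 0;
    return *demo_pte_offset(pmd_entry, addr);
}
```

On an architecture with a real middle level, only demo_pmd_offset( ) would need a different body; the walk in demo_lookup( ) would stay the same.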


2.5.1 The Linear Address Fields

The following macros simplify page table handling:

PAGE_SHIFT
    Specifies the length in bits of the Offset field; when applied to Pentium processors it yields the value 12. Since all the addresses in a page must fit in the Offset field, the size of a page on Intel 80x86 systems is 2^12 or the familiar 4096 bytes; the PAGE_SHIFT of 12 can thus be considered the logarithm base 2 of the total page size. This macro is used by PAGE_SIZE to return the size of the page. Finally, the PAGE_MASK macro is defined as the value 0xfffff000; it is used to mask all the bits of the Offset field.

PMD_SHIFT
    Determines the number of bits in an address that are mapped by the second-level page table. It yields the value 22 (12 from Offset plus 10 from Table). The PMD_SIZE macro computes the size of the area mapped by a single entry of the Page Middle Directory, that is, of a Page Table. Thus, PMD_SIZE yields 2^22 or 4 MB. The PMD_MASK macro yields the value 0xffc00000; it is used to mask all the bits of the Offset and Table fields.

PGDIR_SHIFT
    Determines the logarithm of the size of the area a first-level page table can map. Since the Middle Directory field has length 0, this macro yields the same value yielded by PMD_SHIFT, which is 22. The PGDIR_SIZE macro computes the size of the area mapped by a single entry of the Page Global Directory, that is, of a Page Directory. PGDIR_SIZE therefore yields 4 MB.
    The PGDIR_MASK macro yields the value 0xffc00000, the same as PMD_MASK.

PTRS_PER_PTE, PTRS_PER_PMD, and PTRS_PER_PGD
    Compute the number of entries in the Page Table, Page Middle Directory, and Page Global Directory; they yield the values 1024, 1, and 1024, respectively.

2.5.2 Page Table Handling

pte_t, pmd_t, and pgd_t are 32-bit data types that describe, respectively, a Page Table, a Page Middle Directory, and a Page Global Directory entry. pgprot_t is another 32-bit data type that represents the protection flags associated with a single entry.

Four type-conversion macros (__pte( ), __pmd( ), __pgd( ), and __pgprot( )) cast a 32-bit unsigned integer into the required type. Four other type-conversion macros (pte_val( ), pmd_val( ), pgd_val( ), and pgprot_val( )) perform the reverse casting from one of the four previously mentioned specialized types into a 32-bit unsigned integer.

The kernel also provides several macros and functions to read or modify page table entries:


• The pte_none( ), pmd_none( ), and pgd_none( ) macros yield the value 1 if the corresponding entry has the value 0; otherwise, they yield the value 0.
• The pte_present( ), pmd_present( ), and pgd_present( ) macros yield the value 1 if the Present flag of the corresponding entry is equal to 1, that is, if the corresponding page or Page Table is loaded in main memory.
• The pte_clear( ), pmd_clear( ), and pgd_clear( ) macros clear an entry of the corresponding page table.

The macros pmd_bad( ) and pgd_bad( ) are used by functions to check Page Global Directory and Page Middle Directory entries passed as input parameters. Each macro yields the value 1 if the entry points to a bad page table, that is, if at least one of the following conditions applies:

• The page is not in main memory (Present flag cleared).
• The page allows only Read access (Read/Write flag cleared).
• Either Accessed or Dirty is cleared (Linux always forces these flags to be set for every existing page table).

No pte_bad( ) macro is defined because it is legal for a Page Table entry to refer to a page that is not present in main memory, not writable, or not accessible at all.
Instead, several functions are offered to query the current value of any of the flags included in a Page Table entry:

pte_read( )
    Returns the value of the User/Supervisor flag (indicating whether the page is accessible in User Mode).

pte_write( )
    Returns 1 if both the Present and Read/Write flags are set (indicating whether the page is present and writable).

pte_exec( )
    Returns the value of the User/Supervisor flag (indicating whether the page is accessible in User Mode). Notice that pages on the Intel processor cannot be protected against code execution.

pte_dirty( )
    Returns the value of the Dirty flag (indicating whether or not the page has been modified).

pte_young( )
    Returns the value of the Accessed flag (indicating whether the page has been accessed).

Another group of functions sets the value of the flags in a Page Table entry:


pte_wrprotect( )
    Clears the Read/Write flag.

pte_rdprotect( ) and pte_exprotect( )
    Clear the User/Supervisor flag.

pte_mkwrite( )
    Sets the Read/Write flag.

pte_mkread( ) and pte_mkexec( )
    Set the User/Supervisor flag.

pte_mkdirty( ) and pte_mkclean( )
    Set the Dirty flag to 1 and to 0, respectively, thus marking the page as modified or unmodified.

pte_mkyoung( ) and pte_mkold( )
    Set the Accessed flag to 1 and to 0, respectively, thus marking the page as accessed (young) or nonaccessed (old).

pte_modify(p,v)
    Sets all access rights in a Page Table entry p to a specified value v.

set_pte
    Writes a specified value into a Page Table entry.

Now come the macros that combine a page address and a group of protection flags into a 32-bit page entry or perform the reverse operation of extracting the page address from a page table entry:

mk_pte( )
    Combines a linear address and a group of access rights to create a 32-bit Page Table entry.

mk_pte_phys
    Creates a Page Table entry by combining the physical address and the access rights of the page.


pte_page( ) and pmd_page( )
    Return the linear address of a page from its Page Table entry, and of a Page Table from its Page Middle Directory entry.

pgd_offset(p,a)
    Receives as parameters a memory descriptor p (see Chapter 6) and a linear address a. The macro yields the address of the entry in a Page Global Directory that corresponds to the address a; the Page Global Directory is found through a pointer within the memory descriptor p. The pgd_offset_k(o) macro is similar, except that it refers to the memory descriptor used by kernel threads (see Section 3.3.2 in Chapter 3).

pmd_offset(p,a)
    Receives as parameters a Page Global Directory entry p and a linear address a; it yields the address of the entry corresponding to the address a in the Page Middle Directory referenced by p. The pte_offset(p,a) macro is similar, but p is a Page Middle Directory entry and the macro yields the address of the entry corresponding to a in the Page Table referenced by p.

The last group of functions in this long and rather boring list was introduced to simplify the creation and deletion of page table entries. When two-level paging is used, creating or deleting a Page Middle Directory entry is trivial. As we explained earlier in this section, the Page Middle Directory contains a single entry that points to the subordinate Page Table. Thus, the Page Middle Directory entry is the entry within the Page Global Directory too.
When dealing with Page Tables, however, creating an entry may be more complex, because the Page Table that is supposed to contain it might not exist. In such cases, it is necessary to allocate a new page frame, fill it with zeros and finally add the entry.

Each page table is stored in one page frame; moreover, each process makes use of several page tables. As we shall see in Section 6.1 in Chapter 6, the allocations and deallocations of page frames are expensive operations. Therefore, when the kernel destroys a page table, it adds the corresponding page frame to a software cache. When the kernel must allocate a new page table, it takes a page frame contained in the cache; a new page frame is requested from the memory allocator only when the cache is empty.

The Page Table cache is a simple list of page frames. The pte_quicklist macro points to the head of the list, while the first 4 bytes of each page frame in the list are used as a pointer to the next element. The Page Global Directory cache is similar, but the head of the list is yielded by the pgd_quicklist macro. Of course, on Intel architecture there is no Page Middle Directory cache.

Since there is no limit on the size of the page table caches, the kernel must implement a mechanism for shrinking them. Therefore, the kernel introduces high and low watermarks, which are stored in the pgt_cache_water array; the check_pgt_cache( ) function checks whether the size of each cache is greater than the high watermark and, if so, deallocates page frames until the cache size reaches the low watermark.
The check_pgt_cache( ) function is invoked either when the system is idle or when the kernel releases all page tables of some process.
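The cache just described can be modeled in user space with a short sketch. It is not the kernel's implementation: the demo_ names are invented, calloc( ) and free( ) stand in for the page frame allocator, and the link is stored in the first pointer-sized word of each frame (4 bytes on a 32-bit machine, as in the text). The last function mimics the watermark check.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096

/* A cache of released page frames, linked through their first word. */
struct demo_cache {
    void  *head;   /* first cached page frame, or NULL */
    size_t size;   /* number of cached page frames */
};

/* Put a released page frame on the list. */
static void demo_cache_put(struct demo_cache *c, void *frame)
{
    assert(frame != NULL);
    *(void **)frame = c->head;   /* first word of the frame holds the link */
    c->head = frame;
    c->size++;
}

/* Take a frame from the cache; fall back to the allocator when it is empty. */
static void *demo_cache_get(struct demo_cache *c)
{
    void *frame = c->head;

    if (frame == NULL)
        return calloc(1, DEMO_PAGE_SIZE);   /* "slow" path */
    c->head = *(void **)frame;
    c->size--;
    *(void **)frame = NULL;                 /* erase the stale link */
    return frame;
}

/* If the cache exceeds `high` frames, free frames until `low` remain.
   Returns the number of frames given back to the allocator. */
static size_t demo_check_cache(struct demo_cache *c, size_t low, size_t high)
{
    size_t freed = 0;

    assert(low <= high);
    if (c->size > high) {
        while (c->size > low) {
            free(demo_cache_get(c));
            freed++;
        }
    }
    return freed;
}
```

Using two thresholds instead of one avoids freeing and reallocating frames on every small fluctuation around a single limit.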


Now comes the last round of functions and macros:

pgd_alloc( )
    Allocates a new Page Global Directory by invoking the get_pgd_fast( ) function, which takes a page frame from the Page Global Directory cache; if the cache is empty, the page frame is allocated by invoking the get_pgd_slow( ) function.

pmd_alloc(p,a)
    Defined so three-level paging systems can allocate a new Page Middle Directory for the linear address a. On Intel 80x86 systems, the function simply returns the input parameter p, that is, the address of the entry in the Page Global Directory.

pte_alloc(p,a)
    Receives as parameters the address of a Page Middle Directory entry p and a linear address a, and it returns the address of the Page Table entry corresponding to a. If the Page Middle Directory entry is null, the function must allocate a new Page Table. To accomplish this, it looks for a free page frame in the Page Table cache by invoking the get_pte_fast( ) function. If the cache is empty, the page frame is allocated by invoking get_pte_slow( ). If a new Page Table is allocated, the entry corresponding to a is initialized and the User/Supervisor flag is set. pte_alloc_kernel( ) is similar, except that it invokes the get_pte_kernel_slow( ) function instead of get_pte_slow( ) for allocating a new page frame; the get_pte_kernel_slow( ) function clears the User/Supervisor flag of the new Page Table.

pte_free( ), pte_free_kernel( ), and pgd_free( )
    Release a page table and insert the freed page frame in the proper cache.
    The pmd_free( ) and pmd_free_kernel( ) functions do nothing, since Page Middle Directories do not really exist on Intel 80x86 systems.

free_one_pmd( )
    Invokes pte_free( ) to release a Page Table.

free_one_pgd( )
    Releases all Page Tables of a Page Middle Directory; in the Intel architecture, it just invokes free_one_pmd( ) once. Then it releases the Page Middle Directory by invoking pmd_free( ).

SET_PAGE_DIR
    Sets the Page Global Directory of a process. This is accomplished by placing the physical address of the Page Global Directory in a field of the TSS segment of the process; this address is loaded in the cr3 register every time the process starts or resumes its execution on the CPU. Of course, if the affected process is currently in execution, the macro also directly changes the cr3 register value so that the change takes effect right away.

new_page_tables( )
    Allocates the Page Global Directory and all the Page Tables needed to set up a process address space. It also invokes SET_PAGE_DIR in order to assign the new Page Global Directory to the process. This topic will be covered in Chapter 7.

clear_page_tables( )
    Clears the contents of the page tables of a process by iteratively invoking free_one_pgd( ).

free_page_tables( )
    Is very similar to clear_page_tables( ), but it also releases the Page Global Directory of the process.

2.5.3 Reserved Page Frames

The kernel's code and data structures are stored in a group of reserved page frames. A page contained in one of these page frames can never be dynamically assigned or swapped to disk.

As a general rule, the Linux kernel is installed in RAM starting from physical address 0x00100000, that is, from the second megabyte. The total number of page frames required depends on how the kernel has been configured: a typical configuration yields a kernel that can be loaded in less than 2 MB of RAM.

Why isn't the kernel loaded starting with the first available megabyte of RAM?
Well, the PC architecture has several peculiarities that must be taken into account:

• Page frame 0 is used by BIOS to store the system hardware configuration detected during the Power-On Self-Test (POST).
• Physical addresses ranging from 0x000a0000 to 0x000fffff are reserved to BIOS routines and to map the internal memory of ISA graphics cards (the source of the well-known 640 KB addressing limit in the first MS-DOS systems).
• Additional page frames within the first megabyte may be reserved by specific computer models. For example, the IBM ThinkPad maps the 0xa0 page frame into the 0x9f one.

In order to avoid loading the kernel into groups of noncontiguous page frames, Linux prefers to skip the first megabyte of RAM. Clearly, page frames not reserved by the PC architecture will be used by Linux to store dynamically assigned pages.

Figure 2-10 shows how the first 2 MB of RAM are filled by Linux. We have assumed that the kernel requires less than one megabyte of RAM (this is a bit optimistic).


Figure 2-10. The first 512 page frames (2 MB) in Linux 2.2

The symbol _text, which corresponds to physical address 0x00100000, denotes the address of the first byte of kernel code. The end of the kernel code is similarly identified by the symbol _etext. Kernel data is divided into two groups: initialized and uninitialized. The initialized data starts right after _etext and ends at _edata. The uninitialized data follows and ends up at _end.

The symbols appearing in the figure are not defined in Linux source code; they are produced while compiling the kernel. [2]

[2] You can find the linear address of these symbols in the file System.map, which is created right after the kernel is compiled.

The linear address corresponding to the first physical address reserved to the BIOS or to a hardware device (usually, 0x0009f000) is stored in the i386_endbase variable.
In most cases, this variable is initialized with a value written by the BIOS during the POST phase.

2.5.4 Process Page Tables

The linear address space of a process is divided into two parts:

• Linear addresses from 0x00000000 to PAGE_OFFSET-1 can be addressed when the process is in either User or Kernel Mode.
• Linear addresses from PAGE_OFFSET to 0xffffffff can be addressed only when the process is in Kernel Mode.

Usually, the PAGE_OFFSET macro yields the value 0xc0000000: this means that the fourth gigabyte of linear addresses is reserved for the kernel, while the first three gigabytes are accessible from both the kernel and the user programs. However, the value of PAGE_OFFSET may be customized by the user when the Linux kernel image is compiled. In fact, as we shall see in the next section, the range of linear addresses reserved for the kernel must include a mapping of all physical RAM installed in the system; moreover, as we shall see in Chapter 7, the kernel also makes use of the linear addresses in this range to remap noncontiguous page frames into contiguous linear addresses.
Therefore, if Linux must be installed on a machine having a huge amount of RAM, a different arrangement for the linear addresses might be necessary.

The content of the first entries of the Page Global Directory that map linear addresses lower than PAGE_OFFSET (usually the first 768 entries) depends on the specific process. Conversely, the remaining entries are the same for all processes; they are equal to the corresponding entries of the swapper_pg_dir kernel Page Global Directory (see the following section).

2.5.5 Kernel Page Tables

We now describe how the kernel initializes its own page tables. This is a two-phase activity. In fact, right after the kernel image has been loaded into memory, the CPU is still running in real mode; thus, paging is not enabled.

In the first phase, the kernel creates a limited 4 MB address space, which is enough for it to install itself in RAM.

In the second phase, the kernel takes advantage of all of the existing RAM and sets up the paging tables properly. The next section examines how this plan is executed.

2.5.5.1 Provisional kernel page tables

Both the Page Global Directory and the Page Table are initialized statically during the kernel compilation. We won't bother mentioning the Page Middle Directories any more since they equate to Page Global Directory entries.

The Page Global Directory is contained in the swapper_pg_dir variable, while the Page Table that spans the first 4 MB of RAM is contained in the pg0 variable.

The objective of this first phase of paging is to allow these 4 MB to be easily addressed in both real mode and protected mode.
Therefore, the kernel must create a mapping from both the linear addresses 0x00000000 through 0x003fffff and the linear addresses PAGE_OFFSET through PAGE_OFFSET+0x3fffff into the physical addresses 0x00000000 through 0x003fffff. In other words, the kernel during its first phase of initialization can address the first 4 MB of RAM (0x00000000 through 0x003fffff) either using linear addresses identical to the physical ones or using 4 MB worth of linear addresses starting from PAGE_OFFSET.

Assuming that PAGE_OFFSET yields the value 0xc0000000, the kernel creates the desired mapping by filling all the swapper_pg_dir entries with zeros, except for entries 0 and 0x300 (decimal 768); the latter entry spans all linear addresses between 0xc0000000 and 0xc03fffff. The 0 and 0x300 entries are initialized as follows:

• The address field is set to the address of pg0.
• The Present, Read/Write, and User/Supervisor flags are set.
• The Accessed, Dirty, PCD, PWT, and Page Size flags are cleared.

The single pg0 Page Table is also statically initialized, so that the ith entry addresses the ith page frame.

The paging unit is enabled by the startup_32( ) Assembly-language function. This is achieved by loading in the cr3 control register the address of swapper_pg_dir and by setting the PG flag of the cr0 control register, as shown in the following excerpt:


    movl $0x101000,%eax
    movl %eax,%cr3           /* set the page table pointer.. */
    movl %cr0,%eax
    orl $0x80000000,%eax
    movl %eax,%cr0           /* ..and set paging (PG) bit */

2.5.5.2 Final kernel page table

The final mapping provided by the kernel page tables must transform linear addresses starting from PAGE_OFFSET into physical addresses starting from 0.

The __pa macro is used to convert a linear address starting from PAGE_OFFSET to the corresponding physical address, while the __va macro does the reverse.

The final kernel Page Global Directory is still stored in swapper_pg_dir. It is initialized by the paging_init( ) function. This function acts on two input parameters:

start_mem
    The linear address of the first byte of RAM right after the kernel code and data areas.

end_mem
    The linear address of the end of memory (this address is computed by the BIOS routines during the POST phase).

Linux exploits the extended paging feature of the Pentium processors, enabling 4 MB page frames: it allows a very efficient mapping from PAGE_OFFSET into physical addresses by making kernel Page Tables superfluous. [3]

[3] We'll see in Section 6.3 in Chapter 6 that the kernel may set additional mappings for its own use based on 4 KB pages; when this happens, it makes use of Page Tables.

The swapper_pg_dir Page Global Directory is reinitialized by a cycle equivalent to the following:

    address = PAGE_OFFSET;   /* linear address, as __pa( ) expects */
    pg_dir = swapper_pg_dir;
    pgd_val(pg_dir[0]) = 0;  /* drop the provisional low mapping */
    pg_dir += (PAGE_OFFSET >> PGDIR_SHIFT);
    while (address < end_mem) {
        pgd_val(*pg_dir) = _PAGE_PRESENT+_PAGE_RW+_PAGE_ACCESSED
                           +_PAGE_DIRTY+_PAGE_4M+__pa(address);
        pg_dir++;
        address += 0x400000; /* next 4 MB page frame */
    }

As you can see, the first entry of the Page Global Directory is zeroed out, hence removing the mapping between the first 4 MB of linear and physical addresses. The first Page Table is thus available, so User Mode processes can also use the range of linear addresses between 0 and 4194303.
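
The index arithmetic behind this cycle can be checked with a small user-space sketch. It is not kernel code: the constants mirror the values assumed in the text (a PAGE_OFFSET of 0xc0000000 and 4 MB per Page Global Directory entry, hence a shift of 22 bits), the flag constants carry the 80x86 bit positions for Present, Read/Write, Accessed, Dirty, and Page Size, and the names pgd_index_of, fill_kernel_pgd, and PG_* are hypothetical, introduced only for illustration. The sketch also assumes that the linear address starts at PAGE_OFFSET, so that subtracting PAGE_OFFSET yields physical addresses starting from 0.

```c
#include <assert.h>
#include <stdint.h>

/* Values assumed in the text for a 32-bit 80x86 without PAE. */
#define PAGE_OFFSET  0xc0000000u
#define PGDIR_SHIFT  22              /* each entry spans 2^22 bytes = 4 MB */
#define PTRS_PER_PGD 1024u

/* Illustrative flag names; the values are the 80x86 entry bit positions. */
#define PG_PRESENT   0x001u
#define PG_RW        0x002u
#define PG_ACCESSED  0x020u
#define PG_DIRTY     0x040u
#define PG_4M        0x080u
#define PG_KERNEL_4M (PG_PRESENT | PG_RW | PG_ACCESSED | PG_DIRTY | PG_4M)

/* Index of the Page Global Directory entry covering a linear address. */
static unsigned pgd_index_of(uint32_t linear)
{
    return linear >> PGDIR_SHIFT;
}

/* Mimics the cycle in the text on a plain array: clears entry 0, then
 * maps linear addresses from PAGE_OFFSET up to end_mem onto physical
 * addresses starting from 0, one 4 MB page frame per entry.
 * Returns the number of entries filled. */
static unsigned fill_kernel_pgd(uint32_t pgd[PTRS_PER_PGD], uint32_t end_mem)
{
    uint32_t address = PAGE_OFFSET;
    unsigned idx = pgd_index_of(PAGE_OFFSET);
    unsigned filled = 0;

    assert(end_mem >= PAGE_OFFSET);
    pgd[0] = 0;
    while (address < end_mem && idx < PTRS_PER_PGD) {
        pgd[idx++] = (address - PAGE_OFFSET) | PG_KERNEL_4M;
        filled++;
        address += 0x400000u;
    }
    return filled;
}
```

With 64 MB of RAM, for instance, end_mem is PAGE_OFFSET plus 64 MB and the cycle fills 16 entries, starting from entry 768.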


The User/Supervisor flags in all Page Global Directory entries referencing linear addresses above PAGE_OFFSET are cleared, thus denying processes in User Mode access to the kernel address space.

The pg0 provisional Page Table is no longer used once swapper_pg_dir has been initialized.

2.6 Anticipating Linux 2.4

Linux 2.4 introduces two main changes. The TSS Segment Descriptor associated with all existing processes is no longer stored in the Global Descriptor Table. This change removes the hard-coded limit on the number of existing processes. The limit thus becomes the number of available PIDs. In short, you will no longer find the NR_TASKS macro inside the kernel code, and all data structures whose size depended on it have been replaced or removed.

The other main change is related to physical memory addressing. Recent Intel 80x86 microprocessors include a feature called Physical Address Extension (PAE), which adds four extra bits to the standard 32-bit physical address. Linux 2.4 takes advantage of PAE and supports up to 64 GB of RAM. However, a linear address is still composed of 32 bits, so that only 4 GB of RAM can be "permanently mapped" and accessed at any time. Linux 2.4 thus recognizes three different portions of RAM: the physical memory suitable for ISA Direct Memory Access (DMA), the physical memory not suitable for ISA DMA but permanently mapped by the kernel, and the "high memory," that is, the physical memory that is not permanently mapped by the kernel.
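
The 4 GB and 64 GB figures follow directly from the address widths. As a quick check, the following sketch computes the number of bytes addressable with a given number of address bits; the function name addressable_bytes is hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bytes that can be addressed with the given number of bits. */
static uint64_t addressable_bytes(unsigned bits)
{
    assert(bits < 64);
    return (uint64_t)1 << bits;
}
```

A 32-bit linear address covers 2^32 bytes (4 GB), while a 36-bit PAE physical address covers 2^36 bytes (64 GB), that is, 16 times as much.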


Chapter 3. Processes

The concept of a process is fundamental to any multiprogramming operating system. A process is usually defined as an instance of a program in execution; thus, if 16 users are running vi at once, there are 16 separate processes (although they can share the same executable code). Processes are often called "tasks" in Linux source code.

In this chapter, we will first discuss static properties of processes and then describe how process switching is performed by the kernel. The last two sections investigate dynamic properties of processes, namely, how processes can be created and destroyed. This chapter also describes how Linux supports multithreaded applications: as mentioned in Chapter 1, it relies on so-called lightweight processes (LWP).

3.1 Process Descriptor

In order to manage processes, the kernel must have a clear picture of what each process is doing. It must know, for instance, the process's priority, whether it is running on the CPU or blocked on some event, what address space has been assigned to it, which files it is allowed to address, and so on. This is the role of the process descriptor, that is, of a task_struct type structure whose fields contain all the information related to a single process. As the repository of so much information, the process descriptor is rather complex. Not only does it contain many fields itself, but some contain pointers to other data structures that, in turn, contain pointers to other structures. Figure 3-1 describes the Linux process descriptor schematically.


Figure 3-1. The Linux process descriptor

The five data structures on the right side of the figure refer to specific resources owned by the process. These resources will be covered in future chapters. This chapter will focus on two types of fields that refer to the process state and to process parent/child relationships.

3.1.1 Process State

As its name implies, the state field of the process descriptor describes what is currently happening to the process. It consists of an array of flags, each of which describes a possible process state. In the current Linux version these states are mutually exclusive, and hence exactly one flag of state is set; the remaining flags are cleared. The following are the possible process states:

TASK_RUNNING
    The process is either executing on the CPU or waiting to be executed.

TASK_INTERRUPTIBLE
    The process is suspended (sleeping) until some condition becomes true. Raising a hardware interrupt, releasing a system resource the process is waiting for, or delivering a signal are examples of conditions that might wake up the process, that is, put its state back to TASK_RUNNING.


TASK_UNINTERRUPTIBLE
    Like the previous state, except that delivering a signal to the sleeping process leaves its state unchanged. This process state is seldom used. It is valuable, however, under certain specific conditions in which a process must wait until a given event occurs without being interrupted. For instance, this state may be used when a process opens a device file and the corresponding device driver starts probing for a corresponding hardware device. The device driver must not be interrupted until the probing is complete, or the hardware device could be left in an unpredictable state.

TASK_STOPPED
    Process execution has been stopped: the process enters this state after receiving a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal. When a process is being monitored by another (such as when a debugger executes a ptrace( ) system call to monitor a test program), any signal may put the process in the TASK_STOPPED state.

TASK_ZOMBIE
    Process execution is terminated, but the parent process has not yet issued a wait( )-like system call (wait( ), wait3( ), wait4( ), or waitpid( )) to return information about the dead process. Before the wait( )-like call is issued, the kernel cannot discard the data contained in the dead process descriptor because the parent could need it. (See Section 3.4.2 near the end of this chapter.)

3.1.2 Identifying a Process

Although Linux processes can share a large portion of their kernel data structures (an efficiency measure known as lightweight processes), each process has its own process descriptor. Each execution context that can be independently scheduled must have its own process descriptor.

Lightweight processes should not be confused with user-mode threads, which are different execution flows handled by a user-level library. For instance, older Linux systems implemented POSIX threads entirely in user space by means of the pthread library; therefore, a multithreaded program was executed as a single Linux process. Currently, the pthread library, which has been merged into the standard C library, takes advantage of lightweight processes.

The very strict one-to-one correspondence between the process and process descriptor makes the 32-bit process descriptor address [1] a convenient tool to identify processes. These addresses are referred to as process descriptor pointers. Most of the references to processes that the kernel makes are through process descriptor pointers.

[1] Technically speaking, these 32 bits are only the offset component of a logical address.
However, since Linux makes use of a single kernel data segment, we can consider the offset to be equivalent to a whole logical address. Furthermore, since the base addresses of the code and data segments are set to 0, we can treat the offset as a linear address.

Any Unix-like operating system, on the other hand, allows users to identify processes by means of a number called the Process ID (or PID). The PID is a 32-bit unsigned integer stored in the pid field of the process descriptor. PIDs are numbered sequentially: the PID of a newly created process is normally the PID of the previously created process incremented by one. However, for compatibility with traditional Unix systems developed for 16-bit hardware platforms, the maximum PID number allowed on Linux is 32767. When the kernel creates the 32768th process in the system, it must start recycling the lower unused PIDs.

At the end of this section, we'll show you how it is possible to derive a process descriptor pointer efficiently from its respective PID. Efficiency is important because many system calls like kill( ) use the PID to denote the affected process.

3.1.2.1 The task array

Processes are dynamic entities whose lifetimes in the system range from a few milliseconds to months. Thus, the kernel must be able to handle many processes at the same time. In fact, we know from the previous chapter that Linux is able to handle up to NR_TASKS processes. The kernel reserves a global static array of size NR_TASKS called task in its own address space. The elements in the array are process descriptor pointers; a null pointer indicates that a process descriptor hasn't been associated with the array entry.

3.1.2.2 Storing a process descriptor

The task array contains only pointers to process descriptors, not the sizable descriptors themselves.
Since processes are dynamic entities, process descriptors are stored in dynamic memory rather than in the memory area permanently assigned to the kernel. Linux stores two different data structures for each process in a single 8 KB memory area: the process descriptor and the Kernel Mode process stack.

In Section 2.3 in Chapter 2, we learned that a process in Kernel Mode accesses a stack contained in the kernel data segment, which is different from the stack used by the process in User Mode. Since kernel control paths make little use of the stack (even taking into account the interleaved execution of multiple kernel control paths on behalf of the same process), only a few thousand bytes of kernel stack are required. Therefore, 8 KB is ample space for the stack and the process descriptor.

Figure 3-2 shows how the two data structures are stored in the memory area. The process descriptor starts from the beginning of the memory area and the stack from the end.
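
The layout just described lends itself to simple address arithmetic, sketched below in user-space C. Everything here is illustrative rather than kernel code: struct fake_descriptor is a 1000-byte stand-in for the real process descriptor, the function names are hypothetical, and the sketch assumes that the 8 KB area starts at an address that is a multiple of 8 KB. Under that assumption, clearing the 13 least significant bits of any address inside the area yields the address of the descriptor at its base.

```c
#include <assert.h>
#include <stdint.h>

#define AREA_SIZE 8192u              /* 8 KB = 2^13 bytes */

/* Stand-in for the real process descriptor; only its size matters here. */
struct fake_descriptor {
    unsigned char fields[1000];
};

/* Base of the 8 KB area containing addr, assuming 8 KB alignment. */
static uintptr_t area_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(AREA_SIZE - 1);
}

/* Stack pointer for an empty stack: the byte immediately following
 * the area, since the stack grows toward lower addresses. */
static uintptr_t initial_stack_pointer(uintptr_t base)
{
    assert(base % AREA_SIZE == 0);
    return base + AREA_SIZE;
}

/* Bytes left for the stack once the descriptor occupies the start. */
static unsigned stack_room(void)
{
    return AREA_SIZE - (unsigned)sizeof(struct fake_descriptor);
}
```

For an area based at 0x015fa000, the empty-stack pointer is 0x015fc000, and every stack address below it, down to the base, maps back to 0x015fa000. Note that the empty-stack pointer itself lies just outside the area, so the mask is meaningful only once something has been pushed.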


Figure 3-2. Storing the process descriptor and the process kernel stack

The esp register is the CPU stack pointer, which is used to address the stack's top location. On Intel systems, the stack starts at the end and grows toward the beginning of the memory area. Right after switching from User Mode to Kernel Mode, the kernel stack of a process is always empty, and therefore the esp register points to the byte immediately following the memory area.

The C language allows such a hybrid structure to be conveniently represented by means of the following union construct:

    union task_union {
        struct task_struct task;
        unsigned long stack[2048];
    };

After switching from User Mode to Kernel Mode in Figure 3-2, the esp register contains the address 0x015fc000. The process descriptor is stored starting at address 0x015fa000. The value of the esp is decremented as soon as data is written into the stack. Since the process descriptor is less than 1000 bytes long, the kernel stack can expand up to 7200 bytes.

3.1.2.3 The current macro

The pairing between the process descriptor and the Kernel Mode stack just described offers a key benefit in terms of efficiency: the kernel can easily obtain the process descriptor pointer of the process currently running on the CPU from the value of the esp register. In fact, since the memory area is 8 KB (2^13 bytes) long, all the kernel has to do is mask out the 13 least significant bits of esp to obtain the base address of the process descriptor. This is done by the current macro, which produces some Assembly instructions like the following:

    movl $0xffffe000, %ecx
    andl %esp, %ecx
    movl %ecx, p


After executing these three instructions, the local variable p contains the process descriptor pointer of the process running on the CPU. [2]

[2] One drawback to the shared-storage approach is that, for efficiency reasons, the kernel stores the 8 KB memory area in two consecutive page frames with the first page frame aligned to a multiple of 2^13. This may turn out to be a problem when little dynamic memory is available.

Another advantage of storing the process descriptor with the stack emerges on multiprocessor systems: the correct current process for each hardware processor can be derived just by checking the stack as shown previously. Linux 2.0 did not store the kernel stack and the process descriptor together. Instead, it was forced to introduce a global static variable called current to identify the process descriptor of the running process. On multiprocessor systems, it was necessary to define current as an array, with one element for each available CPU.

The current macro often appears in kernel code as a prefix to fields of the process descriptor. For example, current->pid returns the process ID of the process currently running on the CPU.

A small cache consisting of EXTRA_TASK_STRUCT memory areas (where the macro is usually set to 16) is used to avoid unnecessarily invoking the memory allocator. To understand the purpose of this cache, assume for instance that some process is destroyed and that, right afterward, a new process is created. Without the cache, the kernel would have to release an 8 KB memory area to the memory allocator and then, immediately afterward, request another memory area of the same size. This is an example of a memory cache, a software mechanism introduced to bypass the Kernel Memory Allocator. You will find many other examples of memory caches in the following chapters.

The task_struct_stack array contains the pointers to the process descriptors in the cache. Its name comes from the fact that process descriptor releases and requests are implemented respectively as "push" and "pop" operations on the array:

free_task_struct( )
    This function releases the 8 KB task_union memory areas and places them in the cache unless it is full.

alloc_task_struct( )
    This function allocates 8 KB task_union memory areas. The function takes memory areas from the cache if it is at least half-full or if there isn't a free pair of consecutive page frames available.

3.1.2.4 The process list

To allow an efficient search through processes of a given type (for instance, all processes in a runnable state) the kernel creates several lists of processes. Each list consists of pointers to process descriptors. A list pointer (that is, the field that each process uses to point to the next process) is embedded right in the process descriptor's data structure. When you look at the C-language declaration of the task_struct structure, the descriptors may seem to turn in on themselves in a complicated recursive manner. However, the concept is no more complicated than any list, which is a data structure containing a pointer to the next instance of itself.


A circular doubly linked list (see Figure 3-3) links together all existing process descriptors; we will call it the process list. The prev_task and next_task fields of each process descriptor are used to implement the list. The head of the list is the init_task descriptor referenced by the first element of the task array: it is the ancestor of all processes, and it is called process 0 or swapper (see Section 3.3.2 later in this chapter). The prev_task field of init_task points to the process descriptor inserted last in the list.

Figure 3-3. The process list

The SET_LINKS and REMOVE_LINKS macros are used to insert and to remove a process descriptor in the process list, respectively. These macros also take care of the parenthood relationship of the process (see Section 3.1.3 later in this chapter).

Another useful macro, called for_each_task, scans the whole process list. It is defined as:

    #define for_each_task(p) \
        for (p = &init_task ; (p = p->next_task) != &init_task ; )

The macro is the loop control statement after which the kernel programmer supplies the loop body. Notice how the init_task process descriptor just plays the role of list header. The macro starts by moving past init_task to the next task and continues until it reaches init_task again (thanks to the circularity of the list).

3.1.2.5 The list of TASK_RUNNING processes

When looking for a new process to run on the CPU, the kernel has to consider only the runnable processes (that is, the processes in the TASK_RUNNING state). Since it would be rather inefficient to scan the whole process list, a doubly linked circular list of TASK_RUNNING processes called runqueue has been introduced. The process descriptors include the next_run and prev_run fields to implement the runqueue list. As in the previous case, the init_task process descriptor plays the role of list header. The nr_running variable stores the total number of runnable processes.

The add_to_runqueue( ) function inserts a process descriptor at the beginning of the list, while del_from_runqueue( ) removes a process descriptor from the list. For scheduling purposes, two functions, move_first_runqueue( ) and move_last_runqueue( ), are provided to move a process descriptor to the beginning or the end of the runqueue, respectively.

Finally, the wake_up_process( ) function is used to make a process runnable. It sets the process state to TASK_RUNNING, invokes add_to_runqueue( ) to insert the process in the runqueue list, and increments nr_running. It also forces the invocation of the scheduler when the process is either real-time or has a dynamic priority much larger than that of the current process (see Chapter 10).
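
The list manipulation described in the last two sections can be sketched in a few lines of user-space C. This is an illustration of a circular doubly linked list with a header node, not the kernel's implementation: struct task is a hypothetical, stripped-down descriptor, rq_head stands in for init_task, and the names rq_insert_first, rq_remove, rq_count, and for_each_runnable are invented for the example. The traversal macro follows the same pattern as for_each_task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down descriptor: only the fields needed here. */
struct task {
    int pid;
    struct task *next_run, *prev_run;
};

/* List header; an empty circular list points back to itself. */
static struct task rq_head = { 0, &rq_head, &rq_head };

/* Same shape as for_each_task: skip the header, stop on reaching it again. */
#define for_each_runnable(p) \
    for (p = &rq_head; (p = p->next_run) != &rq_head; )

/* Insert p right after the header, i.e., at the beginning of the list. */
static void rq_insert_first(struct task *p)
{
    assert(p != NULL && p != &rq_head);
    p->next_run = rq_head.next_run;
    p->prev_run = &rq_head;
    rq_head.next_run->prev_run = p;
    rq_head.next_run = p;
}

/* Unlink p; no search is needed because p knows both its neighbors. */
static void rq_remove(struct task *p)
{
    p->prev_run->next_run = p->next_run;
    p->next_run->prev_run = p->prev_run;
    p->next_run = p->prev_run = NULL;
}

/* Count the elements by walking the list once. */
static int rq_count(void)
{
    struct task *p;
    int n = 0;

    for_each_runnable(p)
        n++;
    return n;
}
```

Because the header is always present, insertion and removal need no special cases for an empty list. The kernel avoids the walk performed by rq_count by keeping the count in the nr_running variable, as described above.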


3.1.2.6 The pidhash table and chained lists

In several circumstances, the kernel must be able to derive the process descriptor pointer corresponding to a PID. This occurs, for instance, in servicing the kill( ) system call: when process P1 wishes to send a signal to another process, P2, it invokes the kill( ) system call specifying the PID of P2 as the parameter. The kernel derives the process descriptor pointer from the PID and then extracts the pointer to the data structure that records the pending signals from P2's process descriptor.

Scanning the process list sequentially and checking the pid fields of the process descriptors would be feasible but rather inefficient. In order to speed up the search, a pidhash hash table consisting of PIDHASH_SZ elements has been introduced (PIDHASH_SZ is usually set to NR_TASKS/4). The table entries contain process descriptor pointers. The PID is transformed into a table index using the pid_hashfn macro:

    #define pid_hashfn(x) \
        ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))

As every basic computer science course explains, a hash function does not always ensure a one-to-one correspondence between PIDs and table indexes. Two different PIDs that hash into the same table index are said to be colliding.

Linux uses chaining to handle colliding PIDs: each table entry is a doubly linked list of colliding process descriptors. These lists are implemented by means of the pidhash_next and pidhash_pprev fields in the process descriptor. Figure 3-4 illustrates a pidhash table with two lists: the processes having PIDs 228 and 27535 hash into the 101st element of the table, while the process having PID 27536 hashes into the 124th element of the table.

Figure 3-4. The pidhash table and chained lists

Hashing with chaining is preferable to a linear transformation from PIDs to table indexes, because a PID can assume any value between 0 and 32767. Since NR_TASKS, the maximum number of processes, is usually set to 512, it would be a waste of storage to define a table consisting of 32768 entries.

The hash_pid( ) and unhash_pid( ) functions are invoked to insert and remove a process in the pidhash table, respectively. The find_task_by_pid( ) function searches the hash table and returns the process descriptor pointer of the process with a given PID (or a null pointer if it does not find the process).
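
To make the hashing scheme concrete, here is a minimal user-space sketch. The pid_hashfn macro and the pidhash_next and pidhash_pprev field names are taken from the text; the reduced struct task and the bodies of the three functions are assumptions written for this example, not a copy of the kernel source. NR_TASKS is set to 512, the usual value mentioned above, which gives a table of 128 entries.

```c
#include <assert.h>
#include <stddef.h>

#define NR_TASKS   512               /* usual value, per the text */
#define PIDHASH_SZ (NR_TASKS / 4)    /* 128; must be a power of two */

#define pid_hashfn(x) \
    ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))

/* Reduced descriptor holding only the fields used for hashing. */
struct task {
    int pid;
    struct task *pidhash_next;       /* next descriptor in the chain */
    struct task **pidhash_pprev;     /* address of the pointer to us */
};

static struct task *pidhash[PIDHASH_SZ];

/* Insert p at the head of the chain selected by its PID. */
static void hash_pid(struct task *p)
{
    struct task **slot = &pidhash[pid_hashfn(p->pid)];

    assert(p->pid >= 0);
    p->pidhash_next = *slot;
    if (p->pidhash_next != NULL)
        p->pidhash_next->pidhash_pprev = &p->pidhash_next;
    *slot = p;
    p->pidhash_pprev = slot;
}

/* Remove p from its chain without having to search for it. */
static void unhash_pid(struct task *p)
{
    if (p->pidhash_next != NULL)
        p->pidhash_next->pidhash_pprev = p->pidhash_pprev;
    *p->pidhash_pprev = p->pidhash_next;
}

/* Walk the chain for pid; returns NULL when no descriptor matches. */
static struct task *find_task_by_pid(int pid)
{
    struct task *p = pidhash[pid_hashfn(pid)];

    while (p != NULL && p->pid != pid)
        p = p->pidhash_next;
    return p;
}
```

With these definitions, PIDs 228 and 27535 both hash to index 100 (the 101st element) and PID 27536 hashes to index 123 (the 124th element), matching Figure 3-4. Storing in pidhash_pprev the address of the previous pointer, rather than the previous descriptor, lets unhash_pid( ) treat the table slot and an ordinary chain link in the same way.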


3.1.2.7 The list of task free entries

The task array must be updated every time a process is created or destroyed. As with the other lists shown in previous sections, a list is used here to speed additions and deletions. Adding a new entry into the array is done efficiently: instead of searching the array linearly and looking for the first free entry, the kernel maintains a separate singly linked, noncircular list of free entries. The tarray_freelist variable contains the first element of that list; each free entry in the array points to another free entry, while the last element of the list contains a null pointer. When a process is destroyed, the corresponding element in task is added to the head of the list.

In Figure 3-5, if the first element is counted as 0, the tarray_freelist variable points to element 4 because it is the last freed element. Previously, the processes corresponding to elements 2 and 1 were destroyed, in that order. Element 2 points to another free element of task not shown in the figure.

Figure 3-5. An example of task array with free entries

Deleting an entry from the array is also done efficiently. Each process descriptor p includes a tarray_ptr field that points to the task entry containing the pointer to p.

The get_free_taskslot( ) and add_free_taskslot( ) functions are used to get a free entry and to free an entry, respectively.

3.1.3 Parenthood Relationships Among Processes

Processes created by a program have a parent/child relationship. Since a process can create several children, these have sibling relationships. Several fields must be introduced in a process descriptor to represent these relationships. Processes 0 and 1 are created by the kernel; as we shall see later in the chapter, process 1 (init) is the ancestor of all other processes. The descriptor of a process P includes the following fields:

p_opptr (original parent)
    Points to the process descriptor of the process that created P or to the descriptor of process 1 (init) if the parent process no longer exists. Thus, when a shell user starts a background process and exits the shell, the background process becomes the child of init.


p_pptr (parent)
    Points to the current parent of P; its value usually coincides with that of p_opptr. It may occasionally differ, such as when another process issues a ptrace( ) system call requesting that it be allowed to monitor P (see Section 19.1.5 in Chapter 19).

p_cptr (child)
    Points to the process descriptor of the youngest child of P, that is, of the process created most recently by it.

p_ysptr (younger sibling)
    Points to the process descriptor of the process that has been created immediately after P by P's current parent.

p_osptr (older sibling)
    Points to the process descriptor of the process that has been created immediately before P by P's current parent.

Figure 3-6 illustrates the parenthood relationships of a group of processes. Process P0 successively created P1, P2, and P3. Process P3, in turn, created process P4. Starting with p_cptr and using the p_osptr pointers to siblings, P0 is able to retrieve all its children.

Figure 3-6. Parenthood relationships among five processes

3.1.4 Wait Queues

The runqueue list groups together all processes in a TASK_RUNNING state. When it comes to grouping processes in other states, the various states call for different types of treatment, with Linux opting for one of the following choices:

• Processes in a TASK_STOPPED or in a TASK_ZOMBIE state are not linked in specific lists. There is no need to group them, because either the process PID or the process parenthood relationships may be used by the parent process to retrieve the child process.


• Processes in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state are subdivided into many classes, each of which corresponds to a specific event. In this case, the process state does not provide enough information to retrieve the process quickly, so it is necessary to introduce additional lists of processes. These additional lists are called wait queues.

Wait queues have several uses in the kernel, particularly for interrupt handling, process synchronization, and timing. Because these topics are discussed in later chapters, we'll just say here that a process must often wait for some event to occur, such as for a disk operation to terminate, a system resource to be released, or a fixed interval of time to elapse. Wait queues implement conditional waits on events: a process wishing to wait for a specific event places itself in the proper wait queue and relinquishes control. Therefore, a wait queue represents a set of sleeping processes, which are awakened by the kernel when some condition becomes true.

Wait queues are implemented as cyclical lists whose elements include pointers to process descriptors. Each element of a wait queue list is of type wait_queue:

    struct wait_queue {
        struct task_struct * task;
        struct wait_queue * next;
    };

Each wait queue is identified by a wait queue pointer, which contains either the address of the first element of the list or the null pointer if the list is empty. The next field of the wait_queue data structure points to the next element in the list, except for the last element, whose next field points to a dummy list element. The dummy's next field contains the address of the variable or field that identifies the wait queue minus the size of a pointer (on Intel platforms, the size of the pointer is 4 bytes). Thus, the wait queue list can be considered by kernel functions as a truly circular list, since the last element points to the dummy wait queue structure whose next field coincides with the wait queue pointer (see Figure 3-7).

Figure 3-7. The wait queue data structure
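The dummy element is easier to picture with a small user-space sketch. The code below is an illustration, not kernel source: struct task_struct is reduced to a hypothetical one-field stand-in, and WAIT_QUEUE_HEAD, wait_queue_push( ), and wait_queue_length( ) are names introduced here (the insertion simply mirrors the add_wait_queue( ) fragment shown later in this section). The trick assumes that the next field lies exactly one pointer past the start of struct wait_queue, so that the dummy's next field overlays the wait queue pointer itself; this depends on the compiler's layout and is not strictly portable C.

```c
#include <assert.h>
#include <stddef.h>

struct task_struct { int pid; };        /* hypothetical stand-in (assumption) */

struct wait_queue {
    struct task_struct *task;
    struct wait_queue *next;
};

/* The overlay only works if next sits one pointer past the struct start. */
_Static_assert(offsetof(struct wait_queue, next) == sizeof(struct wait_queue *),
               "next must be the second pointer-sized field");

/* Dummy element of the queue identified by q: its next field is *q itself. */
#define WAIT_QUEUE_HEAD(q) ((struct wait_queue *)((q) - 1))

/* An empty queue points back to its own dummy element. */
static void init_waitqueue(struct wait_queue **q)
{
    *q = WAIT_QUEUE_HEAD(q);
}

/* Insert entry in the first position of the list. */
static void wait_queue_push(struct wait_queue **q, struct wait_queue *entry)
{
    entry->next = (*q != NULL) ? *q : WAIT_QUEUE_HEAD(q);
    *q = entry;
}

/* Follow next pointers until the dummy element is reached again. */
static int wait_queue_length(struct wait_queue **q)
{
    struct wait_queue *head = WAIT_QUEUE_HEAD(q);
    struct wait_queue *p;
    int n = 0;

    if (*q == NULL)
        return 0;
    for (p = *q; p != head; p = p->next)
        n++;
    return n;
}
```

In a test program the wait queue pointer can be kept in the second slot of a two-element array (for instance struct wait_queue *slot[2], with q set to &slot[1]), so that q - 1 still refers to real storage. After two insertions, the next field of the last element equals WAIT_QUEUE_HEAD(q), and WAIT_QUEUE_HEAD(q)->next reads back the first element, which is what makes the list look circular to the code that scans it.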


The init_waitqueue( ) function initializes an empty wait queue; it receives the address q of a wait queue pointer as its parameter and sets that pointer to q - 4. The add_wait_queue(q, entry) function inserts a new element with address entry in the wait queue identified by the wait queue pointer q. Since wait queues are modified by interrupt handlers as well as by major kernel functions, the function executes the following operations with disabled interrupts (see Chapter 4):

    if (*q != NULL)
        entry->next = *q;
    else
        entry->next = (struct wait_queue *)(q-1);
    *q = entry;

Since the wait queue pointer is set to entry, the new element is placed in the first position of the wait queue list. If the wait queue was not empty, the next field of the new element is set to the address of the previous first element. Otherwise, the next field is set to the address of the wait queue pointer minus 4, and thus points to the dummy element.

The remove_wait_queue( ) function removes the element pointed to by entry from a wait queue. Once again, the function must disable interrupts before executing the following operations:

    next = entry->next;
    head = next;
    while ((tmp = head->next) != entry)
        head = tmp;
    head->next = next;

The function scans the circular list to find the element head that precedes entry. It then detaches entry from the list by letting the next field of head point to the element that follows entry. The peculiar format of the wait queue circular list simplifies the code. Moreover, it is very efficient for the following reasons:

• Most wait queues have just one element, which means that the body of the while loop is never executed.
• While scanning the list, there is no need to distinguish the wait queue pointer (the dummy wait queue element) from wait_queue data structures.

A process wishing to wait for a specific condition can invoke any of the following functions:

• The sleep_on( ) function operates on the current process, which we'll call P:

      void sleep_on(struct wait_queue **p)
      {
          struct wait_queue wait;
          current->state = TASK_UNINTERRUPTIBLE;
          wait.task = current;
          add_wait_queue(p, &wait);
          schedule( );
          remove_wait_queue(p, &wait);
      }


  The function sets P's state to TASK_UNINTERRUPTIBLE and inserts P into the wait queue whose pointer was specified as the parameter. Then it invokes the scheduler, which resumes the execution of another process. When P is awakened, the scheduler resumes execution of the sleep_on( ) function, which removes P from the wait queue.

• The interruptible_sleep_on( ) function is identical to sleep_on( ), except that it sets the state of the current process P to TASK_INTERRUPTIBLE instead of TASK_UNINTERRUPTIBLE so that P can also be awakened by receiving a signal.

• The sleep_on_timeout( ) and interruptible_sleep_on_timeout( ) functions are similar to the previous ones, but they also allow the caller to define a time interval after which the process will be woken up by the kernel. In order to do this, they invoke the schedule_timeout( ) function instead of schedule( ) (see Section 5.4.7 in Chapter 5).

Processes inserted in a wait queue enter the TASK_RUNNING state by means of either the wake_up or the wake_up_interruptible macro. Both macros use the __wake_up( ) function, which receives as parameters the address q of the wait queue pointer and a bitmask mode specifying one or more states. Processes in the specified states will be woken up; others will be left unchanged. The function essentially executes the following instructions:

    if (q && (next = *q)) {
        head = (struct wait_queue *)(q-1);
        while (next != head) {
            p = next->task;
            next = next->next;
            if (p->state & mode)
                wake_up_process(p);
        }
    }

The function checks the state p->state of each process against mode to determine whether the caller wants the process woken up. Only those processes whose state is included in the mode bitmask are actually awakened. The wake_up macro specifies both the TASK_INTERRUPTIBLE and the TASK_UNINTERRUPTIBLE flags in mode, so it wakes up all sleeping processes. Conversely, the wake_up_interruptible macro wakes up only the TASK_INTERRUPTIBLE processes by specifying only that flag in mode. Notice that awakened processes are not removed from the wait queue. Being awakened does not necessarily imply that the wait condition has become true, so the processes could suspend themselves again.

3.1.5 Process Usage Limits

Processes are associated with sets of usage limits, which specify the amount of system resources they can use. Specifically, Linux recognizes the following usage limits:

RLIMIT_CPU
    Maximum CPU time for the process. If the process exceeds the limit, the kernel sends it a SIGXCPU signal, and then, if the process doesn't terminate, a SIGKILL signal (see Chapter 9).


RLIMIT_FSIZE
    Maximum file size allowed. If the process tries to enlarge a file to a size greater than this value, the kernel sends it a SIGXFSZ signal.

RLIMIT_DATA
    Maximum heap size. The kernel checks this value before expanding the heap of the process (see Section 7.6 in Chapter 7).

RLIMIT_STACK
    Maximum stack size. The kernel checks this value before expanding the User Mode stack of the process (see Section 7.4 in Chapter 7).

RLIMIT_CORE
    Maximum core dump file size. The kernel checks this value when a process is aborted, before creating a core file in the current directory of the process (see Section 9.1.1 in Chapter 9). If the limit is 0, the kernel won't create the file.

RLIMIT_RSS
    Maximum number of page frames owned by the process. Actually, the kernel never checks this value, so this usage limit is not implemented.

RLIMIT_NPROC
    Maximum number of processes that the user can own (see Section 3.3.1 later in this chapter).

RLIMIT_NOFILE
    Maximum number of open files. The kernel checks this value when opening a new file or duplicating a file descriptor (see Chapter 12).

RLIMIT_MEMLOCK
    Maximum size of nonswappable memory. The kernel checks this value when the process tries to lock a page frame in memory using the mlock( ) or mlockall( ) system calls (see Section 7.3.4 in Chapter 7).

RLIMIT_AS
    Maximum size of process address space. The kernel checks this value when the process uses malloc( ) or a related function to enlarge its address space (see Section 7.1 in Chapter 7).

The usage limits are stored in the rlim field of the process descriptor. The field is an array of elements of type struct rlimit, one for each usage limit:


    struct rlimit {
        long rlim_cur;
        long rlim_max;
    };

The rlim_cur field is the current usage limit for the resource. For example, current->rlim[RLIMIT_CPU].rlim_cur represents the current limit on the CPU time of the running process.

The rlim_max field is the maximum allowed value for the resource limit. By using the getrlimit( ) and setrlimit( ) system calls, a user can always increase the rlim_cur limit of some resource up to rlim_max. However, only the superuser can increase the rlim_max field or set the rlim_cur field to a value greater than the corresponding rlim_max field.

Usually, most usage limits contain the value RLIM_INFINITY (0x7fffffff), which means that no limit is imposed on the corresponding resource. However, the system administrator may choose to impose stronger limits on some resources. Whenever a user logs into the system, the kernel creates a process owned by the superuser, which can invoke setrlimit( ) to decrease the rlim_max and rlim_cur fields for some resource. The same process later executes a login shell and becomes owned by the user. Each new process created by the user inherits the content of the rlim array from its parent, and therefore the user cannot override the limits enforced by the system.

3.2 Process Switching

In order to control the execution of processes, the kernel must be able to suspend the execution of the process running on the CPU and resume the execution of some other process previously suspended. This activity is called process switching, task switching, or context switching. The following sections describe the elements of process switching in Linux:

• Hardware context
• Hardware support
• Linux code
• Saving the floating point registers

3.2.1 Hardware Context

While each process can have its own address space, all processes have to share the CPU registers. So before resuming the execution of a process, the kernel must ensure that each such register is loaded with the value it had when the process was suspended.

The set of data that must be loaded into the registers before the process resumes its execution on the CPU is called the hardware context. The hardware context is a subset of the process execution context, which includes all information needed for the process execution. In Linux, part of the hardware context of a process is stored in the TSS segment, while the remaining part is saved in the Kernel Mode stack. As we learned in Section 2.3 in Chapter 2, the TSS segment coincides with the tss field of the process descriptor.


We will assume the prev local variable refers to the process descriptor of the process being switched out and next refers to the one being switched in to replace it. We can thus define process switching as the activity consisting of saving the hardware context of prev and replacing it with the hardware context of next. Since process switches occur quite often, it is important to minimize the time spent in saving and loading hardware contexts.

Earlier versions of Linux took advantage of the hardware support offered by the Intel architecture and performed process switching through a far jmp instruction [3] to the selector of the Task State Segment Descriptor of the next process. While executing the instruction, the CPU performs a hardware context switch by automatically saving the old hardware context and loading a new one. But for the following reasons, Linux 2.2 uses software to perform process switching:

• Step-by-step switching performed through a sequence of mov instructions allows better control over the validity of the data being loaded. In particular, it is possible to check the values of segmentation registers. This type of checking is not possible when using a single far jmp instruction.
• The amount of time required by the old approach and the new approach is about the same. However, it is not possible to optimize a hardware context switch, while the current switching code could perhaps be enhanced in the future.

[3] far jmp instructions modify both the cs and eip registers, while simple jmp instructions modify only eip.

Process switching occurs only in Kernel Mode.
The contents of all registers used by a process in User Mode have already been saved before performing process switching (see Chapter 4). This includes the contents of the ss and esp pair that specifies the User Mode stack pointer address.

3.2.2 Task State Segment

The Intel 80x86 architecture includes a specific segment type called the Task State Segment (TSS), to store hardware contexts. As we saw in Section 2.3 in Chapter 2, each process includes its own TSS segment with a minimum length of 104 bytes. Additional bytes are needed by the operating system to store registers that are not automatically saved by the hardware and to store the I/O Permission Bit Map. That map is needed because the ioperm( ) and iopl( ) system calls may grant a process in User Mode direct access to specific I/O ports. In particular, if the IOPL field in the eflags register is set to 3, the User Mode process is allowed to access any I/O port; if the field is set to a lower value, the process is allowed to access only the I/O ports whose corresponding bit in the I/O Permission Bit Map is cleared.

The thread_struct structure describes the format of the Linux TSS. An additional area is introduced to store the tr and cr2 registers, the floating point registers, the debug registers, and other miscellaneous information specific to Intel 80x86 processors.

Each TSS has its own 8-byte Task State Segment Descriptor (TSSD). This Descriptor includes a 32-bit Base field that points to the TSS starting address and a 20-bit Limit field whose value cannot be smaller than 0x67 (decimal 103, determined by the minimum TSS segment length mentioned earlier). The S flag of a TSSD is cleared to denote the fact that the corresponding TSS is a System Segment.
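To make the layout of the 8-byte Descriptor more concrete, here is a small sketch that packs the Base, Limit, and Type fields into a 64-bit value and reads them back. It is an illustration under stated assumptions: make_tssd( ) and the tssd_*( ) helpers are hypothetical names introduced here, the bit positions are those of the Intel segment descriptor format rather than anything stated in the text above, and the Present flag is set while the remaining flags (DPL, granularity) are simply left at 0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a TSS descriptor (hypothetical helper, not kernel code).
 * Bit positions follow the Intel segment descriptor format.
 */
static uint64_t make_tssd(uint32_t base, uint32_t limit, unsigned type)
{
    uint64_t d = 0;

    assert(limit <= 0xfffff);                       /* Limit is a 20-bit field */
    d |= (uint64_t)(limit & 0xffff);                /* Limit 15..0  -> bits 0..15  */
    d |= (uint64_t)(base & 0xffffff) << 16;         /* Base 23..0   -> bits 16..39 */
    d |= (uint64_t)(type & 0xf) << 40;              /* Type         -> bits 40..43 */
    /* Bit 44 is the S flag: it stays 0 because a TSS is a System Segment. */
    d |= (uint64_t)1 << 47;                         /* Present flag (assumption)   */
    d |= (uint64_t)((limit >> 16) & 0xf) << 48;     /* Limit 19..16 -> bits 48..51 */
    d |= (uint64_t)((base >> 24) & 0xff) << 56;     /* Base 31..24  -> bits 56..63 */
    return d;
}

/* Reassemble the 32-bit Base field from its two pieces. */
static uint32_t tssd_base(uint64_t d)
{
    return (uint32_t)((d >> 16) & 0xffffff) | ((uint32_t)((d >> 56) & 0xff) << 24);
}

/* Reassemble the 20-bit Limit field from its two pieces. */
static uint32_t tssd_limit(uint64_t d)
{
    return (uint32_t)(d & 0xffff) | ((uint32_t)((d >> 48) & 0xf) << 16);
}

static unsigned tssd_type(uint64_t d)
{
    return (unsigned)((d >> 40) & 0xf);
}

static unsigned tssd_s_flag(uint64_t d)
{
    return (unsigned)((d >> 44) & 1);
}
```

For example, make_tssd(0xc0123456, 0x67, 9) describes a TSS of the minimum length: the helpers return the same Base and Limit that were passed in, and tssd_s_flag( ) returns 0. The sketch also shows why both fields have to be reassembled from two separate pieces of the Descriptor.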


The Type field is set to 11 if the TSSD refers to the TSS of the process currently running on the CPU; otherwise it is set to 9. [4] The second least significant bit of the Type field is called the Busy bit since it discriminates between the values 9 and 11. [5]

[4] Linux does not make use of a hardware feature that uses the Type field in a peculiar way to allow the automatic reexecution of a previously suspended process. Further details may be found in the Pentium manuals.

[5] Since the processor performs a "bus lock" before modifying this bit, a multitasking operating system may test the bit in order to check whether a CPU is trying to switch to a process that's already executing. However, Linux does not make use of this hardware feature (see Chapter 11).

The TSSDs created by Linux are stored in the Global Descriptor Table (GDT), whose base address is stored in the gdtr register. The tr register contains the TSSD Selector of the process currently running on the CPU. It also includes two hidden, nonprogrammable fields: the Base and Limit fields of the TSSD. In this way, the processor can address the TSS directly without having to retrieve the TSS address from the GDT.

As stated earlier, Linux stores part of the hardware context in the tss field of the process descriptor. This means that when the kernel creates a new process, it must also initialize the TSSD so that it refers to the tss field.
Even though the hardware context is saved via software, the TSS segment still plays an important role because it may contain the I/O Permission Bit Map. In fact, when a process executes an in or out I/O instruction in User Mode, the control unit performs the following operations:

1. It checks the IOPL field in the eflags register. If it is set to 3 (User Mode process enabled to access all I/O ports), the instruction is executed; otherwise, the control unit performs the next checks.
2. It accesses the tr register to determine the current TSS, and thus the proper I/O Permission Bit Map.
3. It checks the bit corresponding to the I/O port specified in the I/O instruction. If it is cleared, the instruction is executed; otherwise, the control unit raises a "General protection error" exception.

3.2.3 The switch_to Macro

The switch_to macro performs a process switch. It makes use of two parameters denoted as prev and next: the first is the process descriptor pointer of the process to be suspended, while the second is the process descriptor pointer of the process to be executed on the CPU. The macro is invoked by the schedule( ) function to schedule a new process on the CPU (see Chapter 10).

The switch_to macro is one of the most hardware-dependent routines of the kernel. Here is a description of what it does on an Intel 80x86 microprocessor:

1. Saves the values of prev and next in the eax and edx registers, respectively (these values were previously stored in ebx and ecx):

       movl %ebx, %eax
       movl %ecx, %edx


2. Saves the contents of the esi, edi, and ebp registers in the prev Kernel Mode stack. They must be saved because the compiler assumes that they will stay unchanged until the end of switch_to:

       pushl %esi
       pushl %edi
       pushl %ebp

3. Saves the content of esp in prev->tss.esp so that the field points to the top of the prev Kernel Mode stack:

       movl %esp, 532(%ebx)

4. Loads next->tss.esp in esp. From now on, the kernel operates on the Kernel Mode stack of next, so this instruction performs the actual context switch from prev to next. Since the address of a process descriptor is closely related to that of the Kernel Mode stack (as explained in Section 3.1.2 earlier in this chapter), changing the kernel stack means changing the current process:

       movl 532(%ecx), %esp

5. Saves the address labeled 1 (shown later in this section) in prev->tss.eip. When the process being replaced resumes its execution, the process will execute the instruction labeled as 1:

       movl $1f, 508(%ebx)

6. On the Kernel Mode stack of next, pushes the next->tss.eip value, in most cases the address labeled 1:

       pushl 508(%ecx)

7. Jumps to the __switch_to( ) C function:

       jmp __switch_to

   This function acts on the prev and next parameters that denote the former process and the new process.
   This function call is different from the average function call, though, because __switch_to( ) takes the prev and next parameters from the eax and edx registers, where we saw earlier they were stored, not from the stack like most functions. To force the function to go to the registers for its parameters, the kernel makes use of the __attribute__ and regparm keywords, which are nonstandard extensions of the C language implemented by the gcc compiler. The __switch_to( ) function is declared as follows in the include/asm-i386/system.h header file:

       void __switch_to(struct task_struct *prev,
                        struct task_struct *next)
           __attribute__((regparm(3)));

   The function completes the process switch started by the switch_to( ) macro. It includes extended inline Assembly language code that makes for rather complex reading, because the code refers to registers by means of special symbols. In order to simplify the following discussion, we will describe the Assembly language instructions yielded by the compiler:

   a. Saves the contents of the esi and ebx registers in the Kernel Mode stack of next, then loads ecx and ebx with the parameters prev and next, respectively:

          pushl %esi
          pushl %ebx
          movl %eax, %ecx
          movl %edx, %ebx

   b. Executes the code yielded by the unlazy_fpu( ) macro (see Section 3.2.4 later in this chapter) to optionally save the contents of the mathematical coprocessor registers. As we shall see later, there is no need to load the floating point registers of next while performing the context switch:

          unlazy_fpu(prev);

   c. Clears the Busy bit (see Section 3.2.2 earlier in this chapter) of next and loads its TSS selector in the tr register:

          movl 712(%ebx), %eax
          andb $0xf8, %al
          andl $0xfffffdff, gdt_table+4(%eax)
          ltr 712(%ebx)

      The preceding code is fairly dense. It operates on:

      • The process's TSSD selector, which is copied from next->tss.tr to eax.
      • The 8 least significant bits of the selector, which are stored in al. [6] The 3 least significant bits of al contain the RPL and the TI fields of the selector. The 13 most significant bits of ax specify the TSSD index within the GDT.

      [6] The ax register consists of the 16 least significant bits of eax. Moreover, the al register consists of the 8 least significant bits of ax, while ah consists of the 8 most significant bits of ax. Similar notations apply to the ebx, ecx, and edx registers.

      Clearing the 3 least significant bits of al leaves the TSSD index shifted to the left 3 bits (that is, multiplied by 8).
Since the TSSDs are 8 byt<strong>es</strong> long, theindex value multiplied by 8 yields the relative addr<strong>es</strong>s of the TSSD within theGDT. <strong>The</strong> gdt_table+4(%eax) notation refers to the addr<strong>es</strong>s of the fifth byteof the TSSD. <strong>The</strong> andl instruction clears the Busy bit in the fifth byte, whilethe ltr instruction plac<strong>es</strong> the next->tss.tr selector in the tr register andagain sets the Busy bit. [7][7] <strong>Linux</strong> must clear the Busy bit before loading the value in tr, or the control unit will raise an exception.d. Stor<strong>es</strong> the contents of the fs and gs segmentation registers in prev->tss.fsand prev->tss.gs, r<strong>es</strong>pectively:movl %fs,564(%ecx)movl %gs,568(%ecx)82


e. Loads the ldtr register with the next->tss.ldt value. This needs to be done only if the Local Descriptor Table used by prev differs from the one used by next:

        movl 920(%ebx),%edx
        movl 920(%ecx),%eax
        movl 112(%eax),%eax
        cmpl %eax,112(%edx)
        je 2f
        lldt 572(%ebx)
    2:

   In practice, the check is made by referring to the tss.segments field (at offset 112 in the process descriptor) instead of the tss.ldt field.

f. Loads the cr3 register with the next->tss.cr3 value. This can be avoided if prev and next are lightweight processes that share the same Page Global Directory. Since the PGD address of prev is never changed, it doesn't need to be saved.

        movl 504(%ebx),%eax
        cmpl %eax,504(%ecx)
        je 3f
        movl %eax,%cr3
    3:

g. Loads the fs and gs segment registers with the values contained in next->tss.fs and next->tss.gs, respectively. This step logically complements the actions performed in step 7d.

        movl 564(%ebx),%fs
        movl 568(%ebx),%gs

   The code is actually more intricate, as an exception might be raised by the CPU when it detects an invalid segment register value. The code takes this possibility into account by adopting a "fix-up" approach (see Section 8.2.6 in Chapter 8).

h. Loads the eight debug registers [8] with the next->tss.debugreg[i] values (0 ≤ i ≤ 7). This is done only if next was using the debug registers when it was suspended (that is, field next->tss.debugreg[7] is not 0). As we shall see in Chapter 19, these registers are modified only by writing in the TSS; thus, there is no need to save them:

        cmpl $0,760(%ebx)
        je 4f
        movl 732(%ebx),%esi
        movl %esi,%db0
        movl 736(%ebx),%esi
        movl %esi,%db1
        movl 740(%ebx),%esi
        movl %esi,%db2
        movl 744(%ebx),%esi
        movl %esi,%db3
        movl 756(%ebx),%esi
        movl %esi,%db6
        movl 760(%ebx),%ebx
        movl %ebx,%db7
    4:

   [8] The Intel 80x86 debug registers allow a process to be monitored by the hardware. Up to four breakpoint areas may be defined. Whenever a monitored process issues a linear address included in one of the breakpoints, an exception occurs.

i. The function ends by restoring the original values of the ebx and esi registers, pushed on the stack in step 7a:

        popl %ebx
        popl %esi
        ret

   When the ret instruction is executed, the control unit fetches the value to be loaded in the eip program counter from the stack. This value is usually the address of the instruction shown in the following item and labeled 1, which was stored in the stack by the switch_to macro. If, however, next was never suspended before because it is being executed for the first time, the function will find the starting address of the ret_from_fork( ) function (see Section 3.3.1 later in this chapter).

8. The remaining part of the switch_to macro includes a few instructions that restore the contents of the esi, edi, and ebp registers. The first of these three instructions is labeled 1:

    1:  popl %ebp
        popl %edi
        popl %esi

   Notice how these pop instructions refer to the kernel stack of the prev process. They will be executed when the scheduler selects prev as the new process to be executed on the CPU, thus invoking switch_to with prev as second parameter. Therefore, the esp register points to prev's Kernel Mode stack.

3.2.4 Saving the Floating Point Registers

Starting with the Intel 80486, the arithmetic floating point unit (FPU) has been integrated into the CPU.
The name mathematical coprocessor continues to be used in memory of the days when floating point computations were executed by an expensive special-purpose chip. In order to maintain compatibility with older models, however, floating point arithmetic functions are performed by making use of ESCAPE instructions, which are instructions with some prefix byte ranging between 0xd8 and 0xdf. These instructions act on the set of floating point registers included in the CPU. Clearly, if a process is using ESCAPE instructions, the contents of the floating point registers belong to its hardware context.

Recently, Intel introduced a new set of Assembly instructions into its microprocessors. They are called MMX instructions and are supposed to speed up the execution of multimedia applications. MMX instructions act on the floating point registers of the FPU. The obvious disadvantage of this architectural choice is that programmers cannot mix floating point instructions and MMX instructions. The advantage is that operating system designers can ignore the new instruction set, since the same facility of the task-switching code for saving the state of the floating point unit can also be relied upon to save the MMX state.

The Intel 80x86 microprocessors do not automatically save the floating point registers in the TSS. However, they include some hardware support that enables kernels to save these registers only when needed. The hardware support consists of a TS (Task-Switching) flag in the cr0 register, which obeys the following rules:

• Every time a hardware context switch is performed, the TS flag is set.
• Every time an ESCAPE or an MMX instruction is executed when the TS flag is set, the control unit raises a "Device not available" exception (see Chapter 4).

The TS flag allows the kernel to save and restore the floating point registers only when really needed. To illustrate how it works, let's suppose that a process A is using the mathematical coprocessor. When a context switch occurs, the kernel sets the TS flag and saves the floating point registers into the TSS of process A. If the new process B does not make use of the mathematical coprocessor, the kernel won't need to restore the contents of the floating point registers.
But as soon as B tries to execute an ESCAPE or MMX instruction, the CPU raises a "Device not available" exception, and the corresponding handler loads the floating point registers with the values saved in the TSS of process B.

Let us now describe the data structures introduced to handle selective saving of floating point registers. They are stored in the tss.i387 subfield, whose format is described by the i387_hard_struct structure. The process descriptor also stores the value of two additional flags:

• The PF_USEDFPU flag included in the flags field. It specifies whether the process used the floating point registers when it was last executing on the CPU.
• The used_math field. This flag specifies whether the contents of the tss.i387 subfield are significant. The flag is cleared (not significant) in two cases:
    o When the process starts executing a new program by invoking an execve( ) system call (see Chapter 19). Since control will never return to the former program, the data currently stored in tss.i387 will never be used again.
    o When a process that was executing a program in User Mode starts executing a signal handler procedure (see Chapter 9). Since signal handlers are asynchronous with respect to the program execution flow, the floating point registers could be meaningless to the signal handler. However, the kernel saves the floating point registers in tss.i387 before starting the handler and restores them after the handler terminates.
      Therefore, a signal handler is allowed to make use of the mathematical coprocessor, but it cannot carry on a floating point computation started during the normal program execution flow.

As stated earlier, the __switch_to( ) function executes the unlazy_fpu macro. This macro yields the following code:

    if (prev->flags & PF_USEDFPU) {
        /* save the floating point registers */
        asm("fnsave %0" : "=m" (prev->tss.i387));
        /* wait until all data has been transferred */
        asm("fwait");
        prev->flags &= ~PF_USEDFPU;
        /* set the TS flag of cr0 to 1 */
        stts( );
    }

The stts( ) macro sets the TS flag of cr0. In practice, it yields the following Assembly language instructions:

    movl %cr0, %eax
    orb $8, %al
    movl %eax, %cr0

The contents of the floating point registers are not restored right after a process resumes execution. However, the TS flag of cr0 has been set by unlazy_fpu( ). Thus, the first time the process tries to execute an ESCAPE or MMX instruction, the control unit raises a "Device not available" exception, and the kernel (more precisely, the exception handler invoked by the exception) runs the math_state_restore( ) function:

    void math_state_restore(void) {
        asm("clts"); /* clear the TS flag of cr0 */
        if (current->used_math)
            /* load the floating point registers */
            asm("frstor %0": :"m" (current->tss.i387));
        else {
            /* initialize the floating point unit */
            asm("fninit");
            current->used_math = 1;
        }
        current->flags |= PF_USEDFPU;
    }

Since the process is executing an ESCAPE instruction, this function sets the PF_USEDFPU flag. Moreover, the function clears the TS flag of cr0 so that further ESCAPE or MMX instructions executed by the process won't trigger the "Device not available" exception. If the data stored in the tss.i387 field is valid, the function loads the floating point registers with the proper values. Otherwise, the FPU is reinitialized and all its registers are cleared.

3.3 Creating Processes

Unix operating systems rely heavily on process creation to satisfy user requests.
As an example, the shell process creates a new process that executes another copy of the shell whenever the user enters a command.

Traditional Unix systems treat all processes in the same way: resources owned by the parent process are duplicated, and a copy is granted to the child process. This approach makes process creation very slow and inefficient, since it requires copying the entire address space of the parent process. The child process rarely needs to read or modify all the resources already owned by the parent; in many cases, it issues an immediate execve( ) and wipes out the address space so carefully saved.

Modern Unix kernels solve this problem by introducing three different mechanisms:


• The Copy On Write technique allows both the parent and the child to read the same physical pages. Whenever either one tries to write on a physical page, the kernel copies its contents into a new physical page that is assigned to the writing process. The implementation of this technique in Linux is fully explained in Chapter 7.
• Lightweight processes allow both the parent and the child to share many per-process kernel data structures, like the paging tables (and therefore the entire User Mode address space) and the open file tables.
• The vfork( ) system call creates a process that shares the memory address space of its parent. To prevent the parent from overwriting data needed by the child, the parent's execution is blocked until the child exits or executes a new program. We'll learn more about the vfork( ) system call in the following section.

3.3.1 The clone( ), fork( ), and vfork( ) System Calls

Lightweight processes are created in Linux by using a function named __clone( ), which makes use of four parameters:

fn
    Specifies a function to be executed by the new process; when the function returns, the child terminates. The function returns an integer, which represents the exit code for the child process.

arg
    Pointer to data passed to the fn( ) function.

flags
    Miscellaneous information. The low byte specifies the signal number to be sent to the parent process when the child terminates; the SIGCHLD signal is generally selected. The remaining 3 bytes encode a group of clone flags, which specify the resources shared between the parent and the child process. The flags, when set, have the following meanings:

    CLONE_VM
        The memory descriptor and all page tables (see Chapter 7).

    CLONE_FS
        The table that identifies the root directory and the current working directory.

    CLONE_FILES
        The table that identifies the open files (see Chapter 12).

    CLONE_SIGHAND
        The table that identifies the signal handlers (see Chapter 9).

    CLONE_PID
        The PID. [9]

    CLONE_PTRACE
        If a ptrace( ) system call is causing the parent process to be traced, the child will also be traced.

    CLONE_VFORK
        Used for the vfork( ) system call (see later in this section).

    [9] As we shall see later, the CLONE_PID flag can be used only by a process having a PID of 0; in a uniprocessor system, no two lightweight processes have the same PID.

child_stack
    Specifies the User Mode stack pointer to be assigned to the esp register of the child process. If it is equal to 0, the kernel assigns to the child the current parent stack pointer. Thus, the parent and child temporarily share the same User Mode stack. But thanks to the Copy On Write mechanism, they usually get separate copies of the User Mode stack as soon as one tries to change the stack. However, this parameter must have a non-null value if the child process shares the same address space as the parent.

__clone( ) is actually a wrapper function defined in the C library (see Section 8.1 in Chapter 8), which in turn makes use of a Linux system call hidden to the programmer, named clone( ).
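The layout of the flags parameter described above (signal number in the low byte, clone flags in the remaining three bytes) can be sketched in a few lines of C. This is only an illustration under stated assumptions: the EX_ constants and the ex_ helper functions are hypothetical names introduced here, and the bit values are the ones we believe the Linux headers use for the corresponding CLONE_ flags; check them against your own headers before relying on them.

```c
#include <signal.h>

/* Illustrative copies of the clone flag bits. The EX_ prefix marks them as
 * local to this sketch; the values are assumed to match the Linux headers. */
enum {
    EX_CSIGNAL       = 0x000000ff, /* low byte: signal sent to the parent */
    EX_CLONE_VM      = 0x00000100, /* share memory descriptor and page tables */
    EX_CLONE_FS      = 0x00000200, /* share root and working directory table */
    EX_CLONE_FILES   = 0x00000400, /* share open file table */
    EX_CLONE_SIGHAND = 0x00000800, /* share signal handler table */
    EX_CLONE_VFORK   = 0x00004000  /* vfork( ) semantics */
};

/* Signal number stored in the low byte of flags. */
unsigned long ex_exit_signal(unsigned long flags)
{
    return flags & EX_CSIGNAL;
}

/* Clone flags stored in the remaining three bytes of flags. */
unsigned long ex_clone_bits(unsigned long flags)
{
    return flags & ~(unsigned long)EX_CSIGNAL;
}

/* Nonzero if the resource selected by bit is shared with the parent. */
int ex_shares(unsigned long flags, unsigned long bit)
{
    return (flags & bit) != 0;
}
```

For instance, a caller that wants the child to share the address space and the open files, and wants SIGCHLD delivered on termination, would pass EX_CLONE_VM | EX_CLONE_FILES | SIGCHLD; ex_exit_signal( ) then recovers SIGCHLD and ex_clone_bits( ) recovers the two sharing flags.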
The clone( ) system call receives only the flags and child_stack parameters; the new process always starts its execution from the instruction following the system call invocation. When the system call returns to the __clone( ) function, it determines whether it is in the parent or the child and forces the child to execute the fn( ) function.

The traditional fork( ) system call is implemented by Linux as a clone( ) whose first parameter specifies a SIGCHLD signal and all the clone flags cleared and whose second parameter is 0.

The old vfork( ) system call, described in the previous section, is implemented by Linux as a clone( ) whose first parameter specifies a SIGCHLD signal and the flags CLONE_VM and CLONE_VFORK and whose second parameter is equal to 0.

When either a clone( ), fork( ), or vfork( ) system call is issued, the kernel invokes the do_fork( ) function, which executes the following steps:

1. If the CLONE_PID flag has been specified, the do_fork( ) function checks whether the PID of the parent process is not null; if so, it returns an error code. Only the swapper process is allowed to set CLONE_PID; this is required when initializing a multiprocessor system (see Section 11.4.1 in Chapter 11).


2. The alloc_task_struct( ) function is invoked in order to get a new 8 KB union task_union memory area to store the process descriptor and the Kernel Mode stack of the new process.
3. The function follows the current pointer to obtain the parent process descriptor and copies it into the new process descriptor in the memory area just allocated.
4. A few checks occur to make sure the user has the resources necessary to start a new process. First, the function checks whether current->rlim[RLIMIT_NPROC].rlim_cur is smaller than or equal to the current number of processes owned by the user: if so, an error code is returned. The function gets the current number of processes owned by the user from a per-user data structure named user_struct. This data structure can be found through a pointer in the user field of the process descriptor.
5. The find_empty_process( ) function is invoked. If the owner of the parent process is not the superuser, this function checks whether nr_tasks (the total number of processes in the system) is smaller than NR_TASKS-MIN_TASKS_LEFT_FOR_ROOT. [10] If so, find_empty_process( ) invokes get_free_taskslot( ) to find a free entry in the task array. Otherwise, it returns an error.

   [10] A few processes, usually four, are reserved to the superuser; MIN_TASKS_LEFT_FOR_ROOT refers to this number. Thus, even if a user is allowed to overload the system with a "fork bomb" (a one-line program that forks itself forever), the superuser can log in, kill some processes, and start searching for the guilty user.

6. The function writes the new process descriptor pointer into the previously obtained task entry and sets the tarray_ptr field of the process descriptor to the address of that entry (see Section 3.1.2).
7. If the parent process makes use of some kernel modules, the function increments the corresponding reference counters. Each kernel module has its own reference counter, which indicates how many processes are using it. A module cannot be removed unless its reference counter is null (see Appendix B).
8. The function then updates some of the flags included in the flags field that have been copied from the parent process:
    a. It clears the PF_SUPERPRIV flag, which indicates whether the process has used any of its superuser privileges.
    b. It clears the PF_USEDFPU flag.
    c. It clears the PF_PTRACED flag unless the CLONE_PTRACE parameter flag is set. When set, the CLONE_PTRACE flag means that the parent process is being traced with the ptrace( ) function, so the child should be traced too.
    d. It clears the PF_TRACESYS flag unless, once again, the CLONE_PTRACE parameter flag is set.
    e. It sets the PF_FORKNOEXEC flag, which indicates that the child process has not yet issued an execve( ) system call.
    f. It sets the PF_VFORK flag according to the value of the CLONE_VFORK flag. This specifies that the parent process must be woken up whenever the process (the child) issues an execve( ) system call or terminates.
9. Now the function has taken almost everything that it can use from the parent process; the rest of its activities focus on setting up new resources in the child and letting the kernel know that this new process has been born. First, the function invokes the get_pid( ) function to obtain a new PID, which will be assigned to the child process (unless the CLONE_PID flag is set).


10. The function then updates all the process descriptor fields that cannot be inherited from the parent process, such as the fields that specify the process parenthood relationships.
11. Unless specified differently by the flags parameter, it invokes copy_files( ), copy_fs( ), copy_sighand( ), and copy_mm( ) to create new data structures and copy into them the values of the corresponding parent process data structures.
12. It invokes copy_thread( ) to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone( ) call was issued (these values have been saved in the Kernel Mode stack of the parent, as described in Chapter 8). However, the function forces the value 0 into the field corresponding to the eax register. The tss.esp field of the TSS of the child process is initialized with the base address of the Kernel Mode stack, and the address of an Assembly language function (ret_from_fork( )) is stored in the tss.eip field.
13. It uses the SET_LINKS macro to insert the new process descriptor in the process list.
14. It uses the hash_pid( ) function to insert the new process descriptor in the pidhash hash table.
15. It increments the values of nr_tasks and current->user->count.
16. It sets the state field of the child process descriptor to TASK_RUNNING and then invokes wake_up_process( ) to insert the child in the runqueue list.
17. If the CLONE_VFORK flag has been specified, the function suspends the parent process until the child releases its memory address space (that is, until the child either terminates or executes a new program). In order to do this, the process descriptor includes a kernel semaphore called vfork_sem (see Section 11.2.4 in Chapter 11).
18. It returns the PID of the child, which will eventually be read by the parent process in User Mode.

Now we have a complete child process in the runnable state. But it isn't actually running. It is up to the scheduler to decide when to give the CPU to this child. At some future process switch, the scheduler will bestow this favor on the child process by loading a few CPU registers with the values of the tss field of the child's process descriptor. In particular, esp will be loaded with tss.esp (that is, with the address of the child's Kernel Mode stack), and eip will be loaded with the address of ret_from_fork( ). This Assembly language function, in turn, invokes the ret_from_sys_call( ) function (see Chapter 8), which reloads all other registers with the values stored in the stack and forces the CPU back to User Mode. The new process will then start its execution right at the end of the fork( ), vfork( ), or clone( ) system call.
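Seen from User Mode, the net effect of these steps is that a single fork( ) call resumes in two processes that differ only in the value they find as the result of the call. The following is a minimal user-space sketch of that behavior, not kernel code; the helper name ex_fork_and_wait is hypothetical, while fork( ), _exit( ), and waitpid( ) are the standard POSIX calls.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that terminates with the given exit code. Returns the exit
 * code observed by the parent, or -1 if anything goes wrong. */
int ex_fork_and_wait(int code)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;      /* fork failed: no child was created */

    if (pid == 0)
        _exit(code);    /* child: the call returned 0 */

    /* parent: the call returned the PID of the child */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Both branches execute the same program text; only the test on the returned value separates the child from the parent. The sketch uses _exit( ) in the child so that the child leaves immediately without running the parent's exit-time cleanup a second time.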
The value returned by the system call is contained in eax: the value is 0 for the child and equal to the PID for the child's parent.

The child process will execute the same code as the parent, except that the fork will return a null PID. The developer of the application can exploit this fact, in the manner familiar to Unix programmers, by inserting a conditional statement in the program based on the PID value that forces the child to behave differently from the parent process.

3.3.2 Kernel Threads

Traditional Unix systems delegate some critical tasks to intermittently running processes, including flushing disk caches, swapping out unused page frames, servicing network connections, and so on. Indeed, it is not efficient to perform these tasks in strict linear fashion; both their functions and the end user processes get better response if they are scheduled in the background. Since some of the system processes run only in Kernel Mode, modern operating systems delegate their functions to kernel threads, which are not encumbered with the unnecessary User Mode context. In Linux, kernel threads differ from regular processes in the following ways:

• Each kernel thread executes a single specific kernel function, while regular processes execute kernel functions only through system calls.
• Kernel threads run only in Kernel Mode, while regular processes run alternately in Kernel Mode and in User Mode.
• Since kernel threads run only in Kernel Mode, they use only linear addresses greater than PAGE_OFFSET. Regular processes, on the other hand, use all 4 gigabytes of linear addresses, either in User Mode or in Kernel Mode.

3.3.2.1 Creating a kernel thread

The kernel_thread( ) function creates a new kernel thread and can be executed only by another kernel thread. The function contains mostly inline Assembly language code, but it is roughly equivalent to the following:

    int kernel_thread(int (*fn)(void *), void * arg,
                      unsigned long flags)
    {
        pid_t p;
        p = clone( 0, flags | CLONE_VM );
        if ( p ) /* parent */
            return p;
        else { /* child */
            fn(arg);
            exit( );
        }
    }

3.3.2.2 Process 0

The ancestor of all processes, called process 0 or, for historical reasons, the swapper process, is a kernel thread created from scratch during the initialization phase of Linux by the start_kernel( ) function (see Appendix A).
This ancestor process makes use of the following data structures:

• A process descriptor and a Kernel Mode stack stored in the init_task_union variable. The init_task and init_stack macros yield the addresses of the process descriptor and the stack, respectively.
• The following tables, which the process descriptor points to:
    o init_mm
    o init_mmap
    o init_fs
    o init_files
    o init_signals
  The tables are initialized, respectively, by the following macros:
    o INIT_MM
    o INIT_MMAP
    o INIT_FS
    o INIT_FILES
    o INIT_SIGNALS
• A TSS segment, initialized by the INIT_TSS macro.
• Two Segment Descriptors, namely a TSSD and an LDTD, which are stored in the GDT.
• A Page Global Directory stored in swapper_pg_dir, which may be considered as the kernel Page Global Directory since it is used by all kernel threads.

The start_kernel( ) function initializes all the data structures needed by the kernel, enables interrupts, and creates another kernel thread, named process 1, more commonly referred to as the init process:

    kernel_thread(init, NULL,
                  CLONE_FS | CLONE_FILES | CLONE_SIGHAND);

The newly created kernel thread has PID 1 and shares with process 0 all per-process kernel data structures. Moreover, when selected by the scheduler, the init process starts executing the init( ) function.

After having created the init process, process 0 executes the cpu_idle( ) function, which essentially consists of repeatedly executing the hlt Assembly language instruction with the interrupts enabled (see Chapter 4).
Process 0 is selected by the scheduler only when there are no other processes in the TASK_RUNNING state.

3.3.2.3 Process 1

The kernel thread created by process 0 executes the init( ) function, which in turn invokes the kernel_thread( ) function four times to initiate four kernel threads needed for routine kernel tasks:

kernel_thread(bdflush, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
kernel_thread(kupdate, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
kernel_thread(kpiod, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);
kernel_thread(kswapd, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGHAND);

As a result, four additional kernel threads are created to handle the memory cache and the swapping activity:

kflushd (also bdflush)
    Flushes "dirty" buffers to disk to reclaim memory, as described in Section 14.1.5 in Chapter 14.

kupdate
    Flushes old "dirty" buffers to disk to reduce risks of filesystem inconsistencies, as described in Section 14.1.5 in Chapter 14.

kpiod
    Swaps out pages belonging to shared memory mappings, as described in Section 16.5.2 in Chapter 16.

kswapd
    Performs memory reclaiming, as described in Section 16.7.6 in Chapter 16.

Then init( ) invokes the execve( ) system call to load the executable program init. As a result, the init kernel thread becomes a regular process having its own per-process kernel data structures. The init process never terminates, since it creates and monitors the activity of all the processes that implement the outer layers of the operating system.

3.4 Destroying Processes

Most processes "die" in the sense that they terminate the execution of the code they were supposed to run. When this occurs, the kernel must be notified so that it can release the resources owned by the process; this includes memory, open files, and any other odds and ends that we will encounter in this book, such as semaphores.

The usual way for a process to terminate is to invoke the exit( ) system call. This system call may be inserted by the programmer explicitly. Additionally, the exit( ) system call is always executed when the control flow reaches the last statement of the main procedure (the main( ) function in C programs).

Alternatively, the kernel may force a process to die.
This typically occurs when the process has received a signal that it cannot handle or ignore (see Chapter 9) or when an unrecoverable CPU exception has been raised in Kernel Mode while the kernel was running on behalf of the process (see Chapter 4).

3.4.1 Process Termination

All process terminations are handled by the do_exit( ) function, which removes most references to the terminating process from kernel data structures. The do_exit( ) function executes the following actions:

1. Sets the PF_EXITING flag in the flags field of the process descriptor to denote that the process is being eliminated.
2. Removes, if necessary, the process descriptor from an IPC semaphore queue via the sem_exit( ) function (see Chapter 18) or from a dynamic timer queue via the del_timer( ) function (see Chapter 5).
3. Examines the process's data structures related to paging, filesystem, open file descriptors, and signal handling, respectively, with the __exit_mm( ), __exit_fs( ), __exit_files( ), and __exit_sighand( ) functions. These functions also remove any of these data structures if no other process is sharing it.
4. Sets the state field of the process descriptor to TASK_ZOMBIE. We shall see what happens to zombie processes in the following section.
5. Sets the exit_code field of the process descriptor to the process termination code. This value is either the exit( ) system call parameter (normal termination), or an error code supplied by the kernel (abnormal termination).


6. Invokes the exit_notify( ) function to update the parenthood relationships of both the parent process and the child processes. All child processes created by the terminating process become children of the init process.
7. Invokes the schedule( ) function (see Chapter 10) to select a new process to run. Since a process in a TASK_ZOMBIE state is ignored by the scheduler, the process will stop executing right after the switch_to macro in schedule( ) is invoked.

3.4.2 Process Removal

The Unix operating system allows a process to query the kernel to obtain the PID of its parent process or the execution state of any of its children. A process may, for instance, create a child process to perform a specific task and then invoke a wait( )-like system call to check whether the child has terminated. If the child has terminated, its termination code will tell the parent process if the task has been carried out successfully.

In order to comply with these design choices, Unix kernels are not allowed to discard data included in a process descriptor field right after the process terminates. They are allowed to do so only after the parent process has issued a wait( )-like system call that refers to the terminated process. This is why the TASK_ZOMBIE state has been introduced: although the process is technically dead, its descriptor must be saved until the parent process is notified.

What happens if parent processes terminate before their children? In such a case, the system might be flooded with zombie processes that might end up using all the available task entries. As mentioned earlier, this problem is solved by forcing all orphan processes to become children of the init process. In this way, the init process will destroy the zombies while checking for the termination of one of its legitimate children through a wait( )-like system call.

The release( ) function releases the process descriptor of a zombie process by executing the following steps:

1. Invokes the free_uid( ) function to decrement by 1 the number of processes created up to now by the user who owns the terminated process. This value is stored in the user_struct structure mentioned earlier in the chapter.
2. Invokes add_free_taskslot( ) to free the entry in task that points to the process descriptor to be released.
3. Decrements the value of the nr_tasks variable.
4. Invokes unhash_pid( ) to remove the process descriptor from the pidhash hash table.
5. Uses the REMOVE_LINKS macro to unlink the process descriptor from the process list.
6. Invokes the free_task_struct( ) function to release the 8 KB memory area used to contain the process descriptor and the Kernel Mode stack.

3.5 Anticipating Linux 2.4

The new kernel supports a huge number of users and groups, because it makes use of 32-bit UIDs and GIDs.


In order to raise the hardcoded limit on the number of processes, Linux 2.4 removes the task array, which previously included pointers to all process descriptors.

Moreover, Linux 2.4 no longer includes a Task State Segment for each process. The tss field in the process descriptor has thus been replaced by a pointer to a data structure storing the information that was previously in the TSS, namely the register contents and the I/O bitmap. Linux 2.4 makes use of just one TSS for each CPU in the system. When a context switch occurs, the kernel uses the per-process data structures to save and restore the register contents and to fill the I/O bitmap in the TSS of the executing CPU.

Linux 2.4 enhances wait queues. Sleeping processes are now stored in lists implemented through the efficient list_head data type. Moreover, the kernel is now able to wake up just a single process that is sleeping in a wait queue, thus greatly improving the efficiency of semaphores.

Finally, Linux 2.4 adds a new flag to the clone( ) system call: CLONE_PARENT allows the new lightweight process to have the same parent as the process that invoked the system call.
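Looking back at Section 3.4.2, the parent's side of process removal can be observed from User Mode. The following is a minimal sketch, assuming a POSIX system; the helper name reap_child_exit_code is hypothetical and is not part of the kernel. It creates a child that terminates at once, then issues a wait( )-like call (waitpid( )) to collect the termination code. Until that call returns, the terminated child remains a zombie.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that terminates with exit_code,
 * then reap it. Returns the child's termination code, or -1 on error.
 */
int reap_child_exit_code(int exit_code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;              /* fork failed */

    if (pid == 0)
        _exit(exit_code);       /* child: terminate immediately */

    /* Parent: the dead child stays a zombie until it is waited for. */
    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    /* Only a normal termination carries the exit( ) parameter. */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would treat a return value of 0 as success of the delegated task, which mirrors the use of the termination code described in the text.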


Chapter 4. Interrupts and Exceptions

An interrupt is usually defined as an event that alters the sequence of instructions executed by a processor. Such events correspond to electrical signals generated by hardware circuits both inside and outside of the CPU chip.

Interrupts are often divided into synchronous and asynchronous interrupts:

• Synchronous interrupts are produced by the CPU control unit while executing instructions and are called synchronous because the control unit issues them only after terminating the execution of an instruction.
• Asynchronous interrupts are generated by other hardware devices at arbitrary times with respect to the CPU clock signals.

Intel 80x86 microprocessor manuals designate synchronous and asynchronous interrupts as exceptions and interrupts, respectively. We'll adopt this classification, although we'll occasionally use the term "interrupt signal" to designate both types together (synchronous as well as asynchronous).

Interrupts are issued by interval timers and I/O devices; for instance, the arrival of a keystroke from a user sets off an interrupt. Exceptions, on the other hand, are caused either by programming errors or by anomalous conditions that must be handled by the kernel. In the first case, the kernel handles the exception by delivering to the current process one of the signals familiar to every Unix programmer. In the second case, the kernel performs all the steps needed to recover from the anomalous condition, such as a page fault or a request (via an int instruction) for a kernel service.

We start by describing in Section 4.1 the motivation for introducing such signals. We then show how the well-known IRQs (Interrupt ReQuests) issued by I/O devices give rise to interrupts, and we detail how Intel 80x86 processors handle interrupts and exceptions at the hardware level. Next, we illustrate in Section 4.4 how Linux initializes all the data structures required by the Intel interrupt architecture. The remaining three sections describe how Linux handles interrupt signals at the software level.

One word of caution before moving on: we cover in this chapter only "classic" interrupts common to all PCs; we do not cover the nonstandard interrupts of some architectures. For instance, laptops generate types of interrupts not discussed here. Other types of interrupts specific to multiprocessor architecture will be briefly described in Chapter 11.

4.1 The Role of Interrupt Signals

As the name suggests, interrupt signals provide a way to divert the processor to code outside the normal flow of control. When an interrupt signal arrives, the CPU must stop what it's currently doing and switch to a new activity; it does this by saving the current value of the program counter (i.e., the content of the eip and cs registers) in the Kernel Mode stack and by placing an address related to the interrupt type into the program counter.

There are some things in this chapter that will remind you of the context switch we described in the previous chapter, carried out when a kernel substitutes one process for another.


But there is a key difference between interrupt handling and process switching: the code executed by an interrupt or by an exception handler is not a process. Rather, it is a kernel control path that runs on behalf of the same process that was running when the interrupt occurred (see Section 4.3). As a kernel control path, the interrupt handler is lighter than a process (it has less context and requires less time to set up or tear down).

Interrupt handling is one of the most sensitive tasks performed by the kernel, since it must satisfy the following constraints:

• Interrupts can come at any time, when the kernel may want to finish something else it was trying to do. The kernel's goal is therefore to get the interrupt out of the way as soon as possible and defer as much processing as it can. For instance, suppose a block of data has arrived on a network line. When the hardware interrupts the kernel, it could simply mark the presence of data, give the processor back to whatever was running before, and do the rest of the processing later (like moving the data into a buffer where its recipient process can find it and restarting the process). The activities that the kernel needs to perform in response to an interrupt are thus divided into two parts: a top half that the kernel executes right away and a bottom half that is left for later. The kernel keeps a queue pointing to all the functions that represent bottom halves waiting to be executed and pulls them off the queue to execute them at particular points in processing.
• Since interrupts can come at any time, the kernel might be handling one of them while another one (of a different type) occurs. This should be allowed as much as possible since it keeps the I/O devices busy (see Section 4.3). As a result, the interrupt handlers must be coded so that the corresponding kernel control paths can be executed in a nested manner. When the last kernel control path terminates, the kernel must be able to resume execution of the interrupted process or switch to another process if the interrupt signal has caused a rescheduling activity.
• Although the kernel may accept a new interrupt signal while handling a previous one, some critical regions exist inside the kernel code where interrupts must be disabled. Such critical regions must be limited as much as possible since, according to the previous requirement, the kernel, and in particular the interrupt handlers, should run most of the time with the interrupts enabled.

4.2 Interrupts and Exceptions

The Intel documentation classifies interrupts and exceptions as follows:

• Interrupts:

Maskable interrupts
    Sent to the INTR pin of the microprocessor. They can be disabled by clearing the IF flag of the eflags register. All IRQs issued by I/O devices give rise to maskable interrupts.


Nonmaskable interrupts
    Sent to the NMI (Nonmaskable Interrupt) pin of the microprocessor. They are not disabled by clearing the IF flag. Only a few critical events, such as hardware failures, give rise to nonmaskable interrupts.

• Exceptions:

Processor-detected exceptions
    Generated when the CPU detects an anomalous condition while executing an instruction. These are further divided into three groups, depending on the value of the eip register that is saved on the Kernel Mode stack when the CPU control unit raises the exception:

    Faults
        The saved value of eip is the address of the instruction that caused the fault, and hence that instruction can be resumed when the exception handler terminates. As we shall see in Section 7.4 in Chapter 7, resuming the same instruction is necessary whenever the handler is able to correct the anomalous condition that caused the exception.

    Traps
        The saved value of eip is the address of the instruction that should be executed after the one that caused the trap. A trap is triggered only when there is no need to reexecute the instruction that terminated. The main use of traps is for debugging purposes: the role of the interrupt signal in this case is to notify the debugger that a specific instruction has been executed (for instance, a breakpoint has been reached within a program). Once the user has examined the data provided by the debugger, she may ask that execution of the debugged program resume starting from the next instruction.

    Aborts
        A serious error occurred; the control unit is in trouble, and it may be unable to store a meaningful value in the eip register. Aborts are caused by hardware failures or by invalid values in system tables. The interrupt signal sent by the control unit is an emergency signal used to switch control to the corresponding abort exception handler. This handler has no choice but to force the affected process to terminate.

Programmed exceptions
    Occur at the request of the programmer. They are triggered by int or int3 instructions; the into (check for overflow) and bound (check on address bound) instructions also give rise to a programmed exception when the condition they are checking is not true. Programmed exceptions are handled by the control unit as traps; they are often called software interrupts. Such exceptions have two common uses: to implement system calls, and to notify a debugger of a specific event (see Chapter 8).
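The difference between faults, traps, and aborts comes down to which eip value ends up on the Kernel Mode stack. The sketch below is only a model of that rule, not kernel or processor code; the names exc_class and saved_eip are hypothetical, and it assumes the instruction after a trapping one follows it sequentially in memory (no jump).

```c
#include <stdint.h>

/* Classes of processor-detected exceptions described in the text. */
enum exc_class { EXC_FAULT, EXC_TRAP, EXC_ABORT };

/*
 * Hypothetical model of the eip value saved on the Kernel Mode stack.
 *   insn_addr: address of the instruction that raised the exception
 *   insn_len:  length of that instruction in bytes
 * Returns 1 and stores the value in *saved, or 0 for an abort, where
 * no meaningful value is guaranteed.
 */
int saved_eip(enum exc_class cls, uint32_t insn_addr, uint32_t insn_len,
              uint32_t *saved)
{
    switch (cls) {
    case EXC_FAULT:
        *saved = insn_addr;             /* the same instruction is resumed */
        return 1;
    case EXC_TRAP:
        *saved = insn_addr + insn_len;  /* execution continues after it */
        return 1;
    case EXC_ABORT:
    default:
        return 0;                       /* the process must be terminated */
    }
}
```

For a page fault (a fault), returning to the saved address reexecutes the instruction once the handler has loaded the missing page; for a breakpoint (a trap), the debugged program continues with the following instruction.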


4.2.1 Interrupt and Exception Vectors

Each interrupt or exception is identified by a number ranging from 0 to 255; for some unknown reason, Intel calls this 8-bit unsigned number a vector. The vectors of nonmaskable interrupts and exceptions are fixed, while those of maskable interrupts can be altered by programming the Interrupt Controller (see Section 4.2.2).

Linux uses the following vectors:

• Vectors ranging from 0 to 31 correspond to exceptions and nonmaskable interrupts.
• Vectors ranging from 32 to 47 are assigned to maskable interrupts, that is, to interrupts caused by IRQs.
• The remaining vectors ranging from 48 to 255 may be used to identify software interrupts. Linux uses only one of them, namely the 128 or 0x80 vector, which it uses to implement system calls. When an int 0x80 Assembly instruction is executed by a process in User Mode, the CPU switches into Kernel Mode and starts executing the system_call( ) kernel function (see Chapter 8).

4.2.2 IRQs and Interrupts

Each hardware device controller capable of issuing interrupt requests has an output line designated as an IRQ (Interrupt ReQuest). All existing IRQ lines are connected to the input pins of a hardware circuit called the Interrupt Controller, which performs the following actions:

1. Monitors the IRQ lines, checking for raised signals.
2. If a raised signal occurs on an IRQ line:
   a. Converts the raised signal received into a corresponding vector.
   b. Stores the vector in an Interrupt Controller I/O port, thus allowing the CPU to read it via the data bus.
   c. Sends a raised signal to the processor INTR pin, that is, issues an interrupt.
   d. Waits until the CPU acknowledges the interrupt signal by writing into one of the Programmable Interrupt Controller (PIC) I/O ports; when this occurs, clears the INTR line.
3. Goes back to step 1.

The IRQ lines are sequentially numbered starting from 0; thus, the first IRQ line is usually denoted as IRQ0. Intel's default vector associated with IRQn is n+32; as mentioned before, the mapping between IRQs and vectors can be modified by issuing suitable I/O instructions to the Interrupt Controller ports.

Figure 4-1 illustrates a typical connection "in cascade" of two Intel 8259A PICs that can handle up to 15 different IRQ input lines. Notice that the INT output line of the second PIC is connected to the IRQ2 pin of the first PIC: a signal on that line denotes the fact that an IRQ signal on any one of the lines IRQ8-IRQ15 has occurred. The number of available IRQ lines is thus traditionally limited to 15; however, more recent PIC chips are able to handle many more input lines.
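The arithmetic behind the IRQ-to-vector mapping and the 15-line limit can be sketched in a few lines of C. This is an illustrative model only, assuming the n+32 mapping and the two-PIC cascade described above; the macro and function names are hypothetical and do not come from the kernel sources.

```c
#define FIRST_IRQ_VECTOR 32  /* vector assigned to IRQ0, as stated in the text */
#define NR_PIC_INPUTS    16  /* two 8259A chips with 8 input pins each */
#define CASCADE_IRQ      2   /* pin of the first PIC wired to the second PIC */

/* Vector associated with IRQn (n + 32), or -1 if n is out of range. */
int irq_to_vector(int irq)
{
    if (irq < 0 || irq >= NR_PIC_INPUTS)
        return -1;
    return irq + FIRST_IRQ_VECTOR;
}

/* 0 for the first PIC (IRQ0-IRQ7), 1 for the second (IRQ8-IRQ15), -1 if invalid. */
int irq_to_pic(int irq)
{
    if (irq < 0 || irq >= NR_PIC_INPUTS)
        return -1;
    return irq / 8;
}

/* Number of input lines left for devices once the cascade pin is taken. */
int usable_irq_lines(void)
{
    int n = 0;
    for (int irq = 0; irq < NR_PIC_INPUTS; irq++)
        if (irq != CASCADE_IRQ)
            n++;
    return n;
}
```

With these definitions, IRQ0 maps to vector 32 and IRQ15 to vector 47, which matches the 32 to 47 range reserved for maskable interrupts, and the count of usable lines comes out to 15 because one of the 16 pins carries the cascade signal.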


Figure 4-1. Connecting two 8259A PICs in cascade

Other lines not shown in the figure connect the PICs to the bus: in particular, bidirectional lines D0-D7 connect the I/O port to the data bus, while another input line is connected to the control bus and is used for receiving acknowledgment signals from the CPU.

Since the number of available IRQ lines is limited, it may be necessary to share the same line among several different I/O devices. When this occurs, all the devices connected to the same line will have to be polled sequentially by the software interrupt handler in order to determine which of them has issued an interrupt request. We'll describe in Section 4.6 how Linux handles this kind of hardware limitation.

Each IRQ line can be selectively disabled: the PIC can be programmed to stop issuing interrupts that refer to a given IRQ line, or to resume issuing them. Disabled interrupts are not lost; the PIC sends them to the CPU as soon as they are enabled again. This feature is used by most interrupt handlers, since it allows them to process IRQs of the same type serially.

Selective enabling/disabling of IRQs is not the same as global masking/unmasking of maskable interrupts. When the IF flag of the eflags register is clear, any maskable interrupt issued by the PIC is simply ignored by the CPU. The cli and sti Assembly instructions, respectively, clear and set that flag.

4.2.3 Exceptions

The Intel 80x86 microprocessors issue roughly 20 different exceptions. [1] The kernel must provide a dedicated exception handler for each exception type. For some exceptions, the CPU control unit also generates a hardware error code and pushes it onto the Kernel Mode stack before starting the exception handler.

[1] The exact number depends on the processor model.

The following list gives the vector, the name, the type, and a brief description of the exceptions found in a Pentium model. Additional information may be found in the Intel technical documentation.

0 - "Divide error" (fault)
    Raised when a program tries to divide by 0.


1 - "Debug" (trap or fault)
    Raised when the TF flag of eflags is set (quite useful to implement step-by-step execution of a debugged program) or when the address of an instruction or operand falls within the range of an active debug register (see Section 3.2.1 in Chapter 3).

2 - Not used
    Reserved for nonmaskable interrupts (those that use the NMI pin).

3 - "Breakpoint" (trap)
    Caused by an int3 (breakpoint) instruction (usually inserted by a debugger).

4 - "Overflow" (trap)
    An into (check for overflow) instruction has been executed when the OF (overflow) flag of eflags is set.

5 - "Bounds check" (fault)
    A bound (check on address bound) instruction has been executed with the operand outside of the valid address bounds.

6 - "Invalid opcode" (fault)
    The CPU execution unit has detected an invalid opcode (the part of the machine instruction that determines the operation performed).

7 - "Device not available" (fault)
    An ESCAPE or MMX instruction has been executed with the TS flag of cr0 set (see Section 3.2.4 in Chapter 3).

8 - "Double fault" (abort)
    Normally, when the CPU detects an exception while trying to call the handler for a prior exception, the two exceptions can be handled serially. In a few cases, however, the processor cannot handle them serially, hence it raises this exception.

9 - "Coprocessor segment overrun" (abort)
    Problems with the external mathematical coprocessor (applies only to old 80386 microprocessors).

10 - "Invalid TSS" (fault)
    The CPU has attempted a context switch to a process having an invalid Task State Segment.


11 - "Segment not present" (fault)
    A reference was made to a segment not present in memory (one in which the Segment-Present flag of the Segment Descriptor was cleared).

12 - "Stack segment" (fault)
    The instruction attempted to exceed the stack segment limit, or the segment identified by ss is not present in memory.

13 - "General protection" (fault)
    One of the protection rules in the protected mode of the Intel 80x86 has been violated.

14 - "Page fault" (fault)
    The addressed page is not present in memory, the corresponding page table entry is null, or a violation of the paging protection mechanism has occurred.

15 - Reserved by Intel

16 - "Floating point error" (fault)
    The floating point unit integrated into the CPU chip has signaled an error condition, such as numeric overflow or division by 0.

17 - "Alignment check" (fault)
    The address of an operand is not correctly aligned (for instance, the address of a long integer is not a multiple of 4).

18 to 31
    These values are reserved by Intel for future development.

As illustrated in Table 4-1, each exception is handled by a specific exception handler (see Section 4.5 later in this chapter), which usually sends a Unix signal to the process that caused the exception.
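The correspondence between exception vectors and signals listed in Table 4-1 amounts to a small lookup table. The sketch below restates it in C for illustration only; the array and function names are hypothetical, and the real handlers do considerably more than pick a signal number. The value 0 stands for "no signal", since 0 is not a valid signal number.

```c
#include <signal.h>

/* Signal sent for each exception vector 0-17, following Table 4-1. */
static const int exception_signal[] = {
    SIGFPE,   /*  0 divide error */
    SIGTRAP,  /*  1 debug */
    0,        /*  2 NMI: no signal */
    SIGTRAP,  /*  3 breakpoint */
    SIGSEGV,  /*  4 overflow */
    SIGSEGV,  /*  5 bounds check */
    SIGILL,   /*  6 invalid opcode */
    SIGSEGV,  /*  7 device not available */
    SIGSEGV,  /*  8 double fault */
    SIGFPE,   /*  9 coprocessor segment overrun */
    SIGSEGV,  /* 10 invalid TSS */
    SIGBUS,   /* 11 segment not present */
    SIGBUS,   /* 12 stack exception */
    SIGSEGV,  /* 13 general protection */
    SIGSEGV,  /* 14 page fault */
    0,        /* 15 reserved by Intel: no signal */
    SIGFPE,   /* 16 floating point error */
    SIGSEGV,  /* 17 alignment check */
};

/* Returns the signal for a vector, or 0 if none is listed. */
int signal_for_vector(int vector)
{
    int n = (int)(sizeof(exception_signal) / sizeof(exception_signal[0]));
    if (vector < 0 || vector >= n)
        return 0;
    return exception_signal[vector];
}
```

A User Mode program sees only the signal, so several distinct hardware conditions (for instance, vectors 13 and 14) are indistinguishable to it unless the handler supplies extra information.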


Table 4-1. Signals Sent by the Exception Handlers

#   Exception                       Exception Handler                 Signal
0   "Divide error"                  divide_error( )                   SIGFPE
1   "Debug"                         debug( )                          SIGTRAP
2   NMI                             nmi( )                            None
3   "Breakpoint"                    int3( )                           SIGTRAP
4   "Overflow"                      overflow( )                       SIGSEGV
5   "Bounds check"                  bounds( )                         SIGSEGV
6   "Invalid opcode"                invalid_op( )                     SIGILL
7   "Device not available"          device_not_available( )           SIGSEGV
8   "Double fault"                  double_fault( )                   SIGSEGV
9   "Coprocessor segment overrun"   coprocessor_segment_overrun( )    SIGFPE
10  "Invalid TSS"                   invalid_TSS( )                    SIGSEGV
11  "Segment not present"           segment_not_present( )            SIGBUS
12  "Stack exception"               stack_segment( )                  SIGBUS
13  "General protection"            general_protection( )             SIGSEGV
14  "Page fault"                    page_fault( )                     SIGSEGV
15  Intel reserved                  None                              None
16  "Floating point error"          coprocessor_error( )              SIGFPE
17  "Alignment check"               alignment_check( )                SIGSEGV

4.2.4 Interrupt Descriptor Table

A system table called the Interrupt Descriptor Table (IDT) associates each interrupt or exception vector with the address of the corresponding interrupt or exception handler. The IDT must be properly initialized before the kernel enables interrupts.

The IDT format is similar to that of the GDT and of the LDTs examined in Chapter 2: each entry corresponds to an interrupt or an exception vector and consists of an 8-byte descriptor. Thus, a maximum of 256 x 8 = 2048 bytes are required to store the IDT.

The idtr CPU register allows the IDT to be located anywhere in memory: it specifies both the IDT base physical address and its limit (maximum length). It must be initialized before enabling interrupts by using the lidt assembly language instruction.

The IDT may include three types of descriptors; Figure 4-2 illustrates the meaning of the 64 bits included in each of them. In particular, the value of the Type field encoded in bits 40-43 identifies the descriptor type.
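To make the 64-bit layout concrete, here is a minimal sketch in C of how an interrupt or trap gate might be packed into a descriptor. It is an illustration under stated assumptions, not kernel code: pack_gate( ), gate_type_of( ), and the enum names are hypothetical, and the Type values (0xE and 0xF for 32-bit gates), the Present flag in bit 47, and the DPL in bits 45-46 are taken from Intel's documentation rather than from the text above, which only places the Type field in bits 40-43.

```c
#include <assert.h>
#include <stdint.h>

/* Type field values for 32-bit gates (per the Intel manuals).
 * The enum and its names are hypothetical, for illustration only. */
enum gate_type {
    GATE_INTERRUPT = 0xE,
    GATE_TRAP      = 0xF
};

/* Pack an interrupt or trap gate into its 8-byte descriptor.
 * pack_gate() is a hypothetical helper, not a kernel function. */
uint64_t pack_gate(uint32_t offset, uint16_t selector,
                   enum gate_type type, unsigned dpl)
{
    assert(dpl <= 3);

    /* Bits 0-15: offset 15..0; bits 16-31: Segment Selector. */
    uint32_t low  = ((uint32_t)selector << 16) | (offset & 0xFFFFu);

    uint32_t high = (offset & 0xFFFF0000u)   /* bits 48-63: offset 31..16 */
                  | (1u << 15)               /* bit 47: Present flag      */
                  | ((uint32_t)dpl << 13)    /* bits 45-46: DPL           */
                  | ((uint32_t)type << 8);   /* bits 40-43: Type          */

    return ((uint64_t)high << 32) | low;
}

/* Recover the Type field (bits 40-43) from a packed descriptor. */
unsigned gate_type_of(uint64_t desc)
{
    return (unsigned)((desc >> 40) & 0xF);
}
```

Calling pack_gate(handler_address, kernel_cs, GATE_INTERRUPT, 0) yields one 8-byte entry; 256 such entries give the 2048-byte table size mentioned above.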


Figure 4-2. Gate descriptors' format

The descriptors are:

Task gate
Includes the TSS selector of the process that must replace the current one when an interrupt signal occurs. Linux does not use task gates.

Interrupt gate
Includes the Segment Selector and the offset inside the segment of an interrupt or exception handler. While transferring control to the proper segment, the processor clears the IF flag, thus disabling further maskable interrupts.

Trap gate
Similar to an interrupt gate, except that while transferring control to the proper segment, the processor does not modify the IF flag.

As we shall see in Section 4.4.1, Linux uses interrupt gates to handle interrupts and trap gates to handle exceptions.

4.2.5 Hardware Handling of Interrupts and Exceptions

We now describe how the CPU control unit handles interrupts and exceptions. We assume that the kernel has been initialized and thus the CPU is operating in protected mode.

After executing an instruction, the cs and eip pair of registers contain the logical address of the next instruction to be executed. Before dealing with that instruction, the control unit checks whether an interrupt or an exception has occurred while it executed the previous instruction. If one occurred, the control unit:

1. Determines the vector i (0 ≤ i ≤ 255) associated with the interrupt or the exception.
2. Reads the i-th entry of the IDT referred to by the idtr register (we assume in the following description that the entry contains an interrupt or a trap gate).
3. Gets the base address of the GDT from the gdtr register and looks in the GDT to read the Segment Descriptor identified by the selector in the IDT entry. This descriptor specifies the base address of the segment that includes the interrupt or exception handler.
4. Makes sure the interrupt was issued by an authorized source. First compares the Current Privilege Level (CPL), which is stored in the two least significant bits of the cs register, with the Descriptor Privilege Level (DPL) of the Segment Descriptor included in the GDT. Raises a "General protection" exception if CPL is lower than DPL, because the interrupt handler cannot have a lower privilege than the program that caused the interrupt. For programmed exceptions, makes a further security check: compares the CPL with the DPL of the gate descriptor included in the IDT and raises a "General protection" exception if the DPL is lower than the CPL. This last check makes it possible to prevent access by user applications to specific trap or interrupt gates.
5. Checks whether a change of privilege level is taking place, that is, if CPL is different from the selected Segment Descriptor's DPL. If so, the control unit must start using the stack that is associated with the new privilege level. It does this by performing the following steps:
   a. Reads the tr register to access the TSS segment of the current process.
   b. Loads the ss and esp registers with the proper values for the stack segment and stack pointer relative to the new privilege level. These values are found in the TSS (see Section 3.2.2 in Chapter 3).
   c. In the new stack, saves the previous values of ss and esp, which define the logical address of the stack associated with the old privilege level.
6. If a fault has occurred, loads cs and eip with the logical address of the instruction that caused the exception so that it can be executed again.
7. Saves the contents of eflags, cs, and eip in the stack.
8. If the exception carries a hardware error code, saves it on the stack.
9. Loads cs and eip, respectively, with the Segment Selector and the Offset fields of the Gate Descriptor stored in the i-th entry of the IDT. These values define the logical address of the first instruction of the interrupt or exception handler.

The last step performed by the control unit is equivalent to a jump to the interrupt or exception handler. In other words, the instruction processed by the control unit after dealing with the interrupt signal is the first instruction of the selected handler.

After the interrupt or exception has been processed, the corresponding handler must relinquish control to the interrupted process by issuing the iret instruction, which forces the control unit to:

1. Load the cs, eip, and eflags registers with the values saved on the stack. If a hardware error code has been pushed in the stack on top of the eip contents, it must be popped before executing iret.


2. Check whether the CPL of the handler is equal to the value contained in the two least significant bits of cs (this means the interrupted process was running at the same privilege level as the handler). If so, iret concludes execution; otherwise, go to the next step.
3. Load the ss and esp registers from the stack, and hence return to the stack associated with the old privilege level.
4. Examine the contents of the ds, es, fs, and gs segment registers: if any of them contains a selector that refers to a Segment Descriptor whose DPL value is lower than CPL, clear the corresponding segment register. The control unit does this to forbid User Mode programs that run with a CPL equal to 3 from making use of segment registers previously used by kernel routines (with a DPL equal to 0). If these registers were not cleared, malicious User Mode programs could exploit them to access the kernel address space.

4.3 Nested Execution of Exception and Interrupt Handlers

A kernel control path consists of the sequence of instructions executed in Kernel Mode to handle an interrupt or an exception. When a process issues a system call request, for instance, the first instructions of the corresponding kernel control path are those that save the content of the registers in the Kernel Mode stack, while the last instructions are those that restore the content of the registers and put the CPU back into User Mode.

Assuming that the kernel is bug-free, most exceptions can occur only while the CPU is in User Mode. Indeed, they are either caused by programming errors or triggered by debuggers. However, the "Page fault" exception may occur in Kernel Mode: this happens when the process attempts to address a page that belongs to its address space but is not currently in RAM. While handling such an exception, the kernel may suspend the current process and replace it with another one until the requested page is available. The kernel control path that handles the page fault exception will resume execution as soon as the process gets the processor again.

Since the "Page fault" exception handler never gives rise to further exceptions, at most two kernel control paths associated with exceptions may be stacked, one on top of the other.

In contrast to exceptions, interrupts issued by I/O devices do not refer to data structures specific to the current process, although the kernel control paths that handle them run on behalf of that process. As a matter of fact, it is impossible to predict which process will be currently running when a given interrupt occurs.

The Linux design does not allow process switching while the CPU is executing a kernel control path associated with an interrupt. However, such kernel control paths may be arbitrarily nested: an interrupt handler may be interrupted by another interrupt handler and so on.

An interrupt handler may also defer an exception handler. Conversely, an exception handler never defers an interrupt handler. The only exception that can be triggered in Kernel Mode is the "Page fault" one just described. But interrupt handlers never perform operations that could induce page faults and thus, potentially, process switching.

Linux interleaves kernel control paths for two major reasons:


• To improve the throughput of programmable interrupt controllers and device controllers. Assume that a device controller issues a signal on an IRQ line: the PIC transforms it into an INTR request, and then both the PIC and the device controller remain blocked until the PIC receives an acknowledgment from the CPU. Thanks to kernel control path interleaving, the kernel is able to send the acknowledgment even when it is handling a previous interrupt.
• To implement an interrupt model without priority levels. Since each interrupt handler may be deferred by another one, there is no need to establish predefined priorities among hardware devices. This simplifies the kernel code and improves its portability.

4.4 Initializing the Interrupt Descriptor Table

Now that you understand what the Intel processor does with interrupts and exceptions at the hardware level, we can move on to describe how the Interrupt Descriptor Table is initialized.

Remember that before the kernel enables the interrupts, it must load the initial address of the IDT table into the idtr register and initialize all the entries of that table. This activity is done while initializing the system (see Appendix A).

The int instruction allows a User Mode process to issue an interrupt signal having an arbitrary vector ranging from 0 to 255. The initialization of the IDT must thus be done carefully, in order to block illegal interrupts and exceptions simulated by User Mode processes via int instructions. This can be achieved by setting the DPL field of the Interrupt or Trap Gate Descriptor to 0. If the process attempts to issue one of these interrupt signals, the control unit will check the CPL value against the DPL field and issue a "General protection" exception.

In a few cases, however, a User Mode process must be able to issue a programmed exception. To allow this, it is sufficient to set the DPL field of the corresponding Interrupt or Trap Gate Descriptors to 3; that is, as high as possible.

Let's now see how Linux implements this strategy.

4.4.1 Interrupt, Trap, and System Gates

As mentioned in Section 4.2.4, Intel provides three types of interrupt descriptors: Task, Interrupt, and Trap Gate Descriptors. Task Gate Descriptors are irrelevant to Linux, but its Interrupt Descriptor Table contains several Interrupt and Trap Gate Descriptors. Linux classifies them as follows, using a slightly different breakdown and terminology from Intel:

Interrupt gate
An Intel interrupt gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). All Linux interrupt handlers are activated by means of interrupt gates, and all are restricted to Kernel Mode.

System gate
An Intel trap gate that can be accessed by a User Mode process (the gate's DPL field is equal to 3). The four Linux exception handlers associated with the vectors 3, 4, 5, and 128 are activated by means of system gates, so the four Assembly instructions int3, into, bound, and int 0x80 can be issued in User Mode.

Trap gate
An Intel trap gate that cannot be accessed by a User Mode process (the gate's DPL field is equal to 0). All Linux exception handlers, except the four described in the previous paragraph, are activated by means of trap gates.

The following functions are used to insert gates in the IDT:

set_intr_gate(n,addr)
Inserts an interrupt gate in the n-th IDT entry. The Segment Selector inside the gate is set to the kernel code's Segment Selector. The Offset field is set to addr, which is the address of the interrupt handler. The DPL field is set to 0.

set_system_gate(n,addr)
Inserts a trap gate in the n-th IDT entry. The Segment Selector inside the gate is set to the kernel code's Segment Selector. The Offset field is set to addr, which is the address of the exception handler. The DPL field is set to 3.

set_trap_gate(n,addr)
Similar to the previous function, except that the DPL field is set to 0.

4.4.2 Preliminary Initialization of the IDT

The IDT is initialized and used by the BIOS routines when the computer still operates in Real Mode. Once Linux takes over, however, the IDT is moved to another area of RAM and initialized a second time, since Linux does not make use of any BIOS routines (see Appendix A).

The IDT is stored in the idt_table table, which includes 256 entries. [2] The 6-byte idt_descr variable specifies both the size of the IDT and its address; it is used only when the kernel initializes the idtr register with the lidt Assembly instruction. In all other cases, the kernel refers to the idt variable to get the address of the IDT.

[2] Some Pentium models have the notorious "f00f" bug, which allows a User Mode program to freeze the system. When executing on such CPUs, Linux uses a workaround based on storing the IDT in a write-protected page frame. The workaround for the bug is offered as an option when the user compiles the kernel.

During kernel initialization, the setup_idt( ) assembly language function starts by filling all 256 entries of idt_table with the same interrupt gate, which refers to the ignore_int( ) interrupt handler:

    setup_idt:
        lea ignore_int, %edx
        movl $(__KERNEL_CS
        lea idt_table, %edi
        mov $256, %ecx
    rp_sidt:
        movl %eax, (%edi)
        movl %edx, 4(%edi)
        addl $8, %edi
        dec %ecx
        jne rp_sidt
        ret

The ignore_int( ) interrupt handler, which is in assembly language, may be viewed as a null handler that executes the following actions:

1. Saves the content of some registers in the stack
2. Invokes the printk( ) function to print an "Unknown interrupt" system message
3. Restores the register contents from the stack
4. Executes an iret instruction to restart the interrupted program

The ignore_int( ) handler should never be executed: the occurrence of "Unknown interrupt" messages on the console or in the log files denotes either a hardware problem (an I/O device is issuing unforeseen interrupts) or a kernel problem (an interrupt or exception is not being handled properly).

Following this preliminary initialization, the kernel makes a second pass in the IDT to replace some of the null handlers with meaningful trap and interrupt handlers. Once this is done, the IDT will include a specialized trap or system gate for each different exception issued by the control unit, and a specialized interrupt gate for each IRQ recognized by the Programmable Interrupt Controller.

The next two sections illustrate in detail how this is done, respectively, for exceptions and interrupts.

4.5 Exception Handling

Linux takes advantage of exceptions to achieve two quite different goals:

• To send a signal to a process to notify an anomalous condition
• To handle demand paging

An example of the first use is a process that performs a division by 0. The CPU raises a "Divide error" exception, and the corresponding exception handler sends a SIGFPE signal to the current process, which will then take the necessary steps to recover or (if no signal handler is set for that signal) abort.

Exception handlers have a standard structure consisting of three parts:

1. Save the contents of most registers in the Kernel Mode stack (this part is coded in Assembly language).
2. Handle the exception by means of a high-level C function.
3. Exit from the handler by means of the ret_from_exception( ) function.
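The division-by-0 example above can be observed from the process side. The following user-space sketch, written under stated assumptions, installs a SIGFPE handler and recovers from the signal instead of aborting. The names recover_from_sigfpe( ) and on_sigfpe( ) are hypothetical, and raise(SIGFPE) is used as a portable stand-in for the faulting division, so no "Divide error" exception is actually generated by the CPU here; only the signal delivery and recovery are shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf recover_point;           /* where to resume after the signal  */
static volatile sig_atomic_t last_signal;  /* signal number seen by the handler */

/* Signal handler: record the signal and jump back to the recovery point. */
static void on_sigfpe(int signo)
{
    last_signal = signo;
    siglongjmp(recover_point, 1);
}

/* Returns 1 if SIGFPE was delivered and the process recovered, 0 otherwise. */
int recover_from_sigfpe(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigfpe;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGFPE, &sa, NULL) != 0)
        return 0;

    if (sigsetjmp(recover_point, 1) == 0) {
        raise(SIGFPE);   /* stand-in for the instruction that would fault */
        return 0;        /* not reached when the handler jumps back       */
    }
    return last_signal == SIGFPE;
}
```

Without the sigaction( ) call, the default action for SIGFPE applies and the process is terminated, which corresponds to the "abort" case mentioned in the text.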


In order to take advantage of exceptions, the IDT must be properly initialized with an exception handler function for each recognized exception. It is the job of the trap_init( ) function to insert the final values (that is, the functions that handle the exceptions) into all IDT entries that refer to nonmaskable interrupts and exceptions. This is accomplished through the set_trap_gate and set_system_gate macros:

    set_trap_gate(0,&divide_error);
    set_trap_gate(1,&debug);
    set_trap_gate(2,&nmi);
    set_system_gate(3,&int3);
    set_system_gate(4,&overflow);
    set_system_gate(5,&bounds);
    set_trap_gate(6,&invalid_op);
    set_trap_gate(7,&device_not_available);
    set_trap_gate(8,&double_fault);
    set_trap_gate(9,&coprocessor_segment_overrun);
    set_trap_gate(10,&invalid_TSS);
    set_trap_gate(11,&segment_not_present);
    set_trap_gate(12,&stack_segment);
    set_trap_gate(13,&general_protection);
    set_trap_gate(14,&page_fault);
    set_trap_gate(16,&coprocessor_error);
    set_trap_gate(17,&alignment_check);
    set_system_gate(0x80,&system_call);

Now we will look at what a typical exception handler does once it is invoked.

4.5.1 Saving the Registers for the Exception Handler

Let us denote with handler_name the name of a generic exception handler. (The actual names of all the exception handlers appear on the list of macros in the previous section.) Each exception handler starts with the following Assembly instructions:

    handler_name:
        pushl $0 /* only for some exceptions */
        pushl $do_handler_name
        jmp error_code

If the control unit is not supposed to automatically insert a hardware error code on the stack when the exception occurs, the corresponding Assembly fragment includes a pushl $0 instruction to pad the stack with a null value. Then the address of the high-level C function is pushed on the stack; its name consists of the exception handler name prefixed by do_.

The Assembly fragment labeled as error_code is the same for all exception handlers except the one for the "Device not available" exception (see Section 3.2.4 in Chapter 3). The code performs the following steps:

1. Saves the registers that might be used by the high-level C function on the stack.
2. Issues a cld instruction to clear the direction flag DF of eflags, thus making sure that autoincrements on the edi and esi registers will be used with string instructions.
3. Copies the hardware error code saved in the stack at location esp+36 in eax. Stores in the same stack location the value -1: as we shall see in Section 9.3.4 in Chapter 9, this value is used to separate 0x80 exceptions from other exceptions.


4. Loads ecx with the address of the high-level do_handler_name( ) C function saved in the stack at location esp+32; writes the contents of es in that stack location.
5. Loads the kernel data Segment Selector into the ds and es registers, then sets the ebx register to the address of the current process descriptor (see Section 3.1.2 in Chapter 3).
6. Stores the parameters to be passed to the high-level C function on the stack, namely, the exception hardware error code and the address of the stack location where the contents of the User Mode registers were saved.
7. Invokes the high-level C function whose address is now stored in ecx.

After the last step is executed, the invoked function will find on the top locations of the stack:

• The return address of the instruction to be executed after the C function terminates (see next section)
• The stack address of the saved User Mode registers
• The hardware error code

4.5.2 Returning from the Exception Handler

When the C function that implements the exception handling terminates, control is transferred to the following assembly language fragment:

    addl $8, %esp
    jmp ret_from_exception

The code pops the stack address of the saved User Mode registers and the hardware error code from the stack, then performs a jmp instruction to the ret_from_exception( ) function. This function will be described in Section 4.7.

4.5.3 Invoking the Exception Handler

As already explained, the names of the C functions that implement exception handlers always consist of the prefix do_ followed by the handler name. Most of these functions store the hardware error code and the exception vector in the process descriptor of current, then send a suitable signal to that process. This is done as follows:

    current->tss.error_code = error_code;
    current->tss.trap_no = vector;
    force_sig(sig_number, current);

When the ret_from_exception( ) function is invoked, it checks whether the process has received a signal. If so, the signal will be handled either by the process's own signal handler (if it exists) or by the kernel; in the latter case, the kernel will usually kill the process (see Chapter 9). The signals sent by the exception handlers have already been illustrated in Table 4-1.

Finally, the handler invokes either die_if_kernel( ) or die_if_no_fixup( ):

• The die_if_kernel( ) function checks whether the exception occurred in Kernel Mode; in this case, it invokes the die( ) function, which prints the contents of all CPU registers on the console and terminates the current process by invoking do_exit( ) (see Chapter 19).
• The die_if_no_fixup( ) function is similar, but before invoking die( ) it checks whether the exception was due to an invalid argument of a system call: in the affirmative case, it uses a "fixup" approach, which will be described in Section 8.2.6 in Chapter 8.

Two exceptions are exploited by the kernel to manage hardware resources more efficiently. The corresponding handlers are more complex because the exception does not necessarily denote an error condition:

• "Device not available": as discussed in Section 3.2.4 in Chapter 3, this exception is used to defer loading the floating point registers until the last possible moment.
• "Page fault": as we shall see in Section 7.4 in Chapter 7, this exception is used to defer allocating new page frames to the process until the last possible moment.

4.6 Interrupt Handling

As we explained earlier, most exceptions are handled simply by sending a Unix signal to the process that caused the exception. The action to be taken is thus deferred until the process receives the signal; as a result, the kernel is able to process the exception quickly.

This approach does not hold for interrupts, because they frequently arrive long after the process to which they are related (for instance, a process that requested a data transfer) has been suspended and a completely unrelated process is running. So it would make no sense to send a Unix signal to the current process.

Furthermore, due to hardware limitations, several devices may share the same IRQ line. (Remember that PCs supply only a few IRQs.) This means that the interrupt vector alone does not tell the whole story: as an example, some PC configurations may assign the same vector to the network card and to the graphic card. Therefore, an interrupt handler must be flexible enough to service several devices. In order to do this, several interrupt service routines (ISRs) can be associated with the same interrupt handler; each of them is a function related to a single device sharing the IRQ line. Since it is not possible to know in advance which particular device issued the IRQ, each ISR is executed to verify whether its device needs attention; if so, the ISR performs all the operations that need to be executed when the device raises an interrupt.

Not all actions to be performed when an interrupt occurs have the same urgency. In fact, the interrupt handler itself is not a suitable place for all kinds of actions. Long noncritical operations should be deferred, since while an interrupt handler is running, the signals on the corresponding IRQ line are ignored. Most important, the process on behalf of which an interrupt handler is executed must always stay in the TASK_RUNNING state, or a system freeze could occur. Therefore, interrupt handlers cannot perform any blocking procedure such as I/O disk operations. So Linux divides the actions to be performed following an interrupt into three classes:


Critical
Actions such as acknowledging an interrupt to the PIC, reprogramming the PIC or the device controller, or updating data structures accessed by both the device and the processor. These can be executed quickly and are critical because they must be performed as soon as possible. Critical actions are executed within the interrupt handler immediately, with maskable interrupts disabled.

Noncritical
Actions such as updating data structures that are accessed only by the processor (for instance, reading the scan code after a keyboard key has been pushed). These actions can also finish quickly, so they are executed by the interrupt handler immediately, with the interrupts enabled.

Noncritical deferrable
Actions such as copying a buffer's contents into the address space of some process (for instance, sending the keyboard line buffer to the terminal handler process). These may be delayed for a long time interval without affecting the kernel operations; the interested process will just keep waiting for the data. Noncritical deferrable actions are performed by means of separate functions called "bottom halves." We shall discuss them in Section 4.6.6.

All interrupt handlers perform the same four basic actions:

1. Save the IRQ value and the register contents in the Kernel Mode stack.
2. Send an acknowledgment to the PIC that is servicing the IRQ line, thus allowing it to issue further interrupts.
3. Execute the interrupt service routines (ISRs) associated with all the devices that share the IRQ.
4. Terminate by jumping to the ret_from_intr( ) address.

Several descriptors are needed to represent both the state of the IRQ lines and the functions to be executed when an interrupt occurs. Figure 4-3 represents in a schematic way the hardware circuits and the software functions used to handle an interrupt. These functions will be discussed in the following sections.
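The third of the four actions, running every ISR registered on a shared IRQ line, can be sketched as a walk over a linked list. This is a simplified illustration and not the kernel's own data structure: struct isr_entry, run_shared_isrs( ), struct toy_device, and toy_isr( ) are hypothetical names, and the convention that an ISR returns 1 when its device needed attention is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* One ISR registered on an IRQ line (hypothetical structure). */
struct isr_entry {
    int (*isr)(void *dev);      /* returns 1 if its device raised the IRQ */
    void *dev;                  /* device-specific data passed to the ISR */
    struct isr_entry *next;     /* next ISR sharing the same IRQ line     */
};

/* Run every ISR on the line, since the vector alone does not identify
 * the device; return how many devices actually needed service. */
int run_shared_isrs(struct isr_entry *head)
{
    int serviced = 0;

    for (struct isr_entry *e = head; e != NULL; e = e->next) {
        assert(e->isr != NULL);
        serviced += e->isr(e->dev);
    }
    return serviced;
}

/* Toy device: a flag that says whether an interrupt is pending. */
struct toy_device {
    int pending;
    int handled;
};

/* Toy ISR: checks its own device and services it only if needed. */
int toy_isr(void *dev)
{
    struct toy_device *d = dev;

    if (!d->pending)
        return 0;               /* not this device: let the next ISR check */
    d->pending = 0;
    d->handled++;
    return 1;
}
```

With two toy devices chained on one line and only the first one pending, run_shared_isrs( ) calls both ISRs but only the first performs any work, which mirrors the network card and graphic card example given earlier.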


Figure 4-3. Interrupt handling

4.6.1 Interrupt Vectors

As explained in Section 4.2.2, the 16 physical IRQs are assigned the vectors 32-47. The IBM-compatible PC architecture requires that some devices must be statically connected to specific IRQ lines. In particular:

• The interval timer device must be connected to the IRQ0 line (see Chapter 5).
• The slave 8259A PIC must be connected to the IRQ2 line (see Figure 4-1).
• The external mathematical coprocessor must be connected to the IRQ13 line (although recent Intel 80x86 processors no longer use such a device, Linux continues to support the venerable 80386 model).

For all remaining IRQs, the kernel must establish a correspondence between IRQ number and I/O device before enabling interrupts. Otherwise, how could the kernel handle a signal from (say) a SCSI disk without knowing which vector corresponds to the device?

Modern I/O devices are able to connect themselves to several IRQ lines. The optimal selection depends on how many devices are on the system and whether any are constrained to respond only to certain IRQs. There are two ways to select a line for each device:

• By a utility program executed when installing the device: such a program may ask the user to select an available IRQ number or determine an available number by itself.
• By a hardware protocol executed at system startup. Under this system, peripheral devices declare which interrupt lines they are ready to use; the final values are then negotiated to reduce conflicts as much as possible. Once this is done, each interrupt handler can read the assigned IRQ by using a function that accesses some I/O ports of the device. For instance, drivers for devices that comply with the Peripheral Component Interconnect (PCI) standard make use of a group of functions such as pci_read_config_byte( ) and pci_write_config_byte( ) to access the device configuration space.


In both cases, the kernel can retrieve the selected IRQ line of a device when initializing the corresponding driver. Table 4-2 shows a fairly arbitrary arrangement of devices and IRQs, such as might be found on one particular PC.

Table 4-2. An Example of IRQ Assignment to I/O Devices

IRQ   INT   Hardware Device
0     32    Timer
1     33    Keyboard
2     34    PIC cascading
3     35    Second serial port
4     36    First serial port
6     38    Floppy disk
8     40    System clock
11    43    Network interface
12    44    PS/2 mouse
13    45    Mathematical coprocessor
14    46    EIDE disk controller's first chain
15    47    EIDE disk controller's second chain

4.6.2 IRQ Data Structures

As always when discussing complicated operations involving state transitions, it helps to understand first where key data is stored. Thus, this section explains the data structures that support interrupt handling and how they are laid out in various descriptors. Figure 4-4 illustrates schematically the relationships between the main descriptors that represent the state of the IRQ lines. (The figure does not illustrate the data structures needed to handle bottom halves; they will be discussed later in this chapter.)

Figure 4-4. IRQ descriptors

4.6.2.1 The irq_desc_t descriptor

An irq_desc array includes NR_IRQS irq_desc_t descriptors, which include the following fields:

status
    A set of flags describing the IRQ line status.


    IRQ_INPROGRESS
        A handler for the IRQ is being executed.
    IRQ_DISABLED
        The IRQ line has been deliberately disabled by a device driver.
    IRQ_PENDING
        An IRQ has occurred on the line; its occurrence has been acknowledged to the PIC, but it has not yet been serviced by the kernel.
    IRQ_REPLAY
        The IRQ line has been disabled but the previous IRQ occurrence has not yet been acknowledged to the PIC.
    IRQ_AUTODETECT
        The kernel is using the IRQ line while performing a hardware device probe.
    IRQ_WAITING
        The kernel is using the IRQ line while performing a hardware device probe; moreover, the corresponding interrupt has not been raised.

handler
    Points to the hw_interrupt_type descriptor that identifies the PIC circuit servicing the IRQ line.

action
    Identifies the interrupt service routines to be invoked when the IRQ occurs. The field points to the first element of the list of irqaction descriptors associated with the IRQ. The irqaction descriptor is described briefly later in the chapter.

depth
    Shows 0 if the IRQ line is enabled and a positive value if it has been disabled at least once. Every time the disable_irq( ) function is invoked, it increments this field; if depth was equal to 0, the function disables the IRQ line. Conversely, each invocation of the enable_irq( ) function decrements the field; if depth becomes 0, the function enables the IRQ line.

During system initialization, the init_IRQ( ) function sets the status field of each IRQ main descriptor to IRQ_DISABLED as follows:


    for (i=0; i


flags
    Describes the relationships between IRQ line and I/O device in a set of flags:

    SA_INTERRUPT
        The handler must execute with interrupts disabled.
    SA_SHIRQ
        The device permits its IRQ line to be shared with other devices.
    SA_SAMPLE_RANDOM
        The device may be considered as a source of events occurring randomly; it can thus be used by the kernel random number generator. (Users can access this feature by taking random numbers from the /dev/random and /dev/urandom device files.)
    SA_PROBE
        The kernel is using the IRQ line while performing a hardware device probe.

name
    Name of the I/O device (shown when listing the serviced IRQs by reading the /proc/interrupts file).

dev_id
    The major and minor numbers that identify the I/O device (see Section 13.2.1 in Chapter 13).

next
    Points to the next element of a list of irqaction descriptors. The elements in the list refer to hardware devices that share the same IRQ.

4.6.3 Saving the Registers for the Interrupt Handler

As with other context switches, the need to save registers leaves the kernel developer a somewhat messy coding job: the registers have to be saved and restored using assembly language code, but within those operations the processor is expected to call and return from a C function. In this section we'll describe the assembly language task of handling registers, while in the next we'll show some of the acrobatics required in the C function that is subsequently invoked.

Saving registers is the first task of the interrupt handler. As already mentioned, the interrupt handler for IRQn is named IRQn_interrupt, and its address is included in the interrupt gate stored in the proper IDT entry.


The same BUILD_IRQ macro is duplicated 16 times, once for each IRQ number, in order to yield 16 different interrupt handler entry points. Each BUILD_IRQ expands to the following assembly language fragment:

    IRQn_interrupt:
        pushl $n-256
        jmp common_interrupt

The result is to save on the stack the IRQ number associated with the interrupt minus 256; [3] the same code for all interrupt handlers can then be executed while referring to this number. The common code can be found in the BUILD_COMMON_IRQ macro, which expands to the following assembly language fragment:

    common_interrupt:
        SAVE_ALL
        call do_IRQ
        jmp ret_from_intr

[3] Subtracting 256 from an IRQ number yields a negative number. Positive numbers are reserved to identify system calls (see Chapter 8).

The SAVE_ALL macro, in turn, expands to the following fragment:

    cld
    push %es
    push %ds
    pushl %eax
    pushl %ebp
    pushl %edi
    pushl %esi
    pushl %edx
    pushl %ecx
    pushl %ebx
    movl $__KERNEL_DS,%edx
    mov %dx,%ds
    mov %dx,%es

SAVE_ALL saves all the CPU registers that may be used by the interrupt handler on the stack, except for eflags, cs, eip, ss, and esp, which are already saved automatically by the control unit (see Section 4.2.5). The macro then loads the selector of the kernel data segment into ds and es.

After saving the registers, BUILD_COMMON_IRQ invokes the do_IRQ( ) function and jumps to the ret_from_intr( ) address (see Section 4.7).

4.6.4 The do_IRQ( ) Function

The do_IRQ( ) function is invoked to execute all interrupt service routines associated with an interrupt. When it starts, the kernel stack contains from the top down:

• The do_IRQ( ) return address
• The group of register values pushed on by SAVE_ALL
• The encoding of the IRQ number
• The registers saved automatically by the control unit when it recognized the interrupt
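The IRQ encoding pushed by the IRQn_interrupt stubs can be checked with a short C sketch. The helper names encode_irq and decode_irq are hypothetical and exist only for illustration; the sketch assumes 32-bit two's-complement values, as on the 80x86. It shows that n-256 is always negative for the 16 IRQ numbers, and that the low-order byte of the encoded value still holds n.

```c
#include <stdint.h>

/* Value pushed by the IRQn_interrupt stub: the IRQ number minus 256.
 * For any n in 0..255 the result is negative. */
int32_t encode_irq(unsigned int n)
{
    return (int32_t)n - 256;
}

/* Recover the IRQ number: in two's complement, the low-order byte of
 * (n - 256) is n itself, so masking with 0xff undoes the encoding. */
unsigned int decode_irq(int32_t encoded)
{
    return (uint32_t)encoded & 0xff;
}
```

For IRQ6, for example, the stub would push -250 (0xffffff06 as a 32-bit value), whose low-order byte is 6.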


Since the C compiler places all the parameters on top of the stack, the do_IRQ( ) function is declared as follows:

    void do_IRQ(struct pt_regs regs)

where the pt_regs structure consists of 15 fields:

• The first nine fields correspond to the register values pushed by SAVE_ALL.
• The tenth field, referenced through a field called orig_eax, encodes the IRQ number.
• The remaining fields correspond to the register values pushed on automatically by the control unit. [4]

[4] The ret_from_intr( ) return address is missing from the pt_regs structure because the C compiler expects a return address on top of the stack and takes this into account when generating the instructions to address parameters.

The do_IRQ( ) function can thus read the IRQ passed as a parameter and decode it as follows:

    irq = regs.orig_eax & 0xff;

The function then executes:

    irq_desc[irq].handler->handle(irq, &regs);

The handler field points to the hw_interrupt_type descriptor that refers to the PIC model servicing the IRQ line (see Section 4.6.2). Assuming that the PIC is an 8259A, the handle field points to the do_8259A_IRQ( ) function, which is thus executed.

The do_8259A_IRQ( ) function starts by invoking the mask_and_ack_8259A( ) function, which acknowledges the interrupt to the PIC and disables further interrupts with the same IRQ number.

Then the function checks whether the handler is willing to deal with the interrupt and whether it is already handling it; to that end, it reads the values of the IRQ_DISABLED and IRQ_INPROGRESS flags stored in the status field of the IRQ main descriptor. If both flags are cleared, the function picks up the pointer to the first irqaction descriptor from the action field and sets the IRQ_INPROGRESS flag. It then invokes handle_IRQ_event( ), which executes each interrupt service routine in turn through the following code. As mentioned previously, if the IRQ is shared by several devices, each corresponding interrupt service routine must be invoked because the kernel does not know which device issued the interrupt:

    do {
        action->handler(irq, action->dev_id, regs);
        action = action->next;
    } while (action);

Notice that the kernel cannot break the loop as soon as one ISR has claimed the interrupt because another device on the same IRQ line might need to be serviced.
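The dispatch loop can be tried outside the kernel with a simplified model. The following is only a sketch: struct irqaction_sketch, run_all_isrs, and counting_isr are hypothetical names, the regs argument is omitted, and a while loop replaces the kernel's do/while so that an empty list is tolerated.

```c
#include <stddef.h>

/* Simplified stand-in for the irqaction descriptor (assumption: only
 * the fields needed for dispatching are kept). */
struct irqaction_sketch {
    void (*handler)(int irq, void *dev_id);
    void *dev_id;
    struct irqaction_sketch *next;
};

/* Invoke every ISR registered on the line, without stopping at the
 * first one; returns the number of ISRs that were called. */
int run_all_isrs(int irq, struct irqaction_sketch *action)
{
    int calls = 0;

    while (action != NULL) {
        action->handler(irq, action->dev_id);
        calls++;
        action = action->next;
    }
    return calls;
}

/* Example ISR: dev_id points to a per-device counter it increments. */
void counting_isr(int irq, void *dev_id)
{
    (void)irq;              /* unused in this example */
    ++*(int *)dev_id;
}
```

With two descriptors chained on the same line, one call to run_all_isrs( ) runs both routines, which mirrors the requirement that every device sharing the IRQ gets a chance to service it.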


Finally, the do_8259A_IRQ( ) function cleans things up by clearing the IRQ_INPROGRESS flag just mentioned. Moreover, if the IRQ_DISABLED flag is not set, the function invokes the low-level enable_8259A_irq( ) function to enable interrupts that come from the IRQ line.

The control now returns to do_IRQ( ), which checks whether "bottom halves" tasks are waiting to be executed. (As we shall see, a queue of such bottom halves is maintained by the kernel.) If bottom halves are waiting, the function invokes the do_bottom_half( ) function we'll describe shortly. Finally, do_IRQ( ) terminates and control is transferred to the ret_from_intr address.

4.6.5 Interrupt Service Routines

As mentioned previously, an interrupt service routine implements a device-specific operation. All of them act on the same parameters:

irq
    The IRQ number

dev_id
    The device identifier

regs
    A pointer to the Kernel Mode stack area containing the registers saved right after the interrupt occurred

The first parameter allows a single ISR to handle several IRQ lines, the second one allows a single ISR to take care of several devices of the same type, and the last one allows the ISR to access the execution context of the interrupted kernel control path. In practice, most ISRs do not use these parameters.

The SA_INTERRUPT flag of the irqaction descriptor determines whether interrupts are enabled or disabled when the do_IRQ( ) function invokes an ISR. An ISR that has been invoked with the interrupts in one state is allowed to put them in the opposite state through an assembly language instruction: cli to disable interrupts and sti to enable them.

The structure of an ISR depends on the characteristics of the device handled. We'll give a few examples of ISRs in Chapter 5 and Chapter 13.

4.6.6 Bottom Half

A bottom half is a low-priority function, usually related to interrupt handling, that is waiting for the kernel to find a convenient moment to run it. Bottom halves that are waiting will be executed only when one of the following events occurs:

• The kernel finishes handling a system call.
• The kernel finishes handling an exception.


• The kernel terminates the do_IRQ( ) function; that is, it finishes handling an interrupt.
• The kernel executes the schedule( ) function to select a new process to run on the CPU.

Thus, when an interrupt service routine activates a bottom half, a long time interval can occur before it is executed. [5] But as we have seen, the existence of bottom halves is very important to fulfill the kernel's responsibility to service interrupts from multiple devices quickly. This book doesn't talk too much about the contents of bottom halves (they depend on the particular tasks needed to service devices) but just about how the kernel maintains and invokes the bottom halves. You will find an example of a specific bottom half in Section 5.4 in Chapter 5.

[5] However, the execution of bottom halves will not be deferred forever: the CPU does not switch back to User Mode until there are no bottom halves to be executed; see Section 4.7.

Linux makes use of an array called the bh_base table to group all bottom halves together. It is an array of pointers to bottom halves and can include up to 32 entries, one for each type of bottom half. In practice, Linux uses about half of them; the types are listed in Table 4-3. As you can see from the table, some of the bottom halves are associated with hardware devices that are not necessarily installed in the system or that are specific to platforms besides the IBM PC compatible. But TIMER_BH, CONSOLE_BH, TQUEUE_BH, SERIAL_BH, IMMEDIATE_BH, and KEYBOARD_BH see widespread use.

Table 4-3. The Linux Bottom Halves

Bottom Half     Peripheral Device
AURORA_BH       Aurora multiport card (SPARC)
CM206_BH        CD-ROM Philips/LMS cm206 disk
CONSOLE_BH      Virtual console
CYCLADES_BH     Cyclades Cyclom-Y serial multiport
DIGI_BH         DigiBoard PC/Xe
ESP_BH          Hayes ESP serial card
IMMEDIATE_BH    Immediate task queue
ISICOM_BH       MultiTech's ISI cards
JS_BH           Joystick (PC IBM compatible)
KEYBOARD_BH     Keyboard
MACSERIAL_BH    Power Macintosh's serial port
NET_BH          Network interface
RISCOM8_BH      RISCom/8
SCSI_BH         SCSI interface
SERIAL_BH       Serial port
SPECIALIX_BH    Specialix IO8+
TIMER_BH        Timer
TQUEUE_BH       Periodic task queue

4.6.6.1 Activating and tracking the state of bottom halves

Before a bottom half is invoked for the first time, it must be initialized. This is done by invoking the init_bh(n, routine) function, which inserts the routine address in the nth entry of bh_base. Conversely, remove_bh(n) removes the nth bottom half from the table.
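The table itself is easy to picture as an array of function pointers. The sketch below is a simplified user-space model, not the kernel's code: bh_table, init_bh_sketch, remove_bh_sketch, and sample_bh are hypothetical names, and the assumption that removal stores a null pointer is made here only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_BH 32    /* the table can include up to 32 entries */

/* Stand-in for bh_base: one function pointer per type of bottom half. */
void (*bh_table[NR_BH])(void);

/* Insert the routine address in the nth entry of the table. */
void init_bh_sketch(int n, void (*routine)(void))
{
    assert(n >= 0 && n < NR_BH);
    bh_table[n] = routine;
}

/* Remove the nth bottom half (assumption: an empty slot is NULL). */
void remove_bh_sketch(int n)
{
    assert(n >= 0 && n < NR_BH);
    bh_table[n] = NULL;
}

/* Placeholder bottom half used only to populate a slot. */
void sample_bh(void)
{
}
```

Indexing the table by a small integer is what lets a 32-bit word carry one flag per bottom half.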


Once a bottom half has been initialized, it can be "activated," and thus executed any time one of the previously mentioned events occurs. The mark_bh(n) function is used by interrupt handlers to activate the nth bottom half. To keep track of the state of all these bottom halves, the bh_active variable stores 32 flags that specify which bottom halves are currently activated. When a bottom half concludes its execution, the kernel clears the corresponding bh_active flag; thus, any activation causes exactly one execution.

The do_bottom_half( ) function is used to start executing all currently active unmasked bottom halves; it enables the maskable interrupts and then invokes run_bottom_halves( ). This function makes sure that only one bottom half is ever active at a time by executing the following C code fragment:

    active = bh_mask & bh_active;
    bh_active &= ~active;
    bh = bh_base;
    do {
        if (active & 1)
            (*bh)( );
        bh++;
        active >>= 1;
    } while (active);

The flags in bh_active that refer to the group of bottom halves that must be executed are cleared. This ensures that each bottom half activation causes exactly one execution of the corresponding function.

Each bottom half can be individually "masked"; if this is the case, it won't be executed even if it is activated. The bh_mask variable stores 32 bits that specify which bottom halves are currently masked. The disable_bh(n) and enable_bh(n) functions act on the nth flag of bh_mask; they are used to mask and unmask a bottom half, respectively.

Here's why masking bottom halves is useful. Assume that a kernel function is modifying some kernel data structure when an exception (for instance, a "Page fault") occurs. After the kernel finishes handling the exception, all active nonmasked bottom halves will be executed. If one of the bottom halves accesses the same kernel data structure as the suspended kernel function, both the bottom half and the kernel function will find the data structure in an inconsistent state. In order to avoid this race condition, the kernel function must mask all bottom halves that access the data structure.

Unfortunately, the bh_mask variable does not always ensure that bottom halves remain correctly masked. For instance, let us suppose that some bottom half B is masked by a kernel control path P1, which is then interrupted by another kernel control path P2. P2 once again masks the bottom half B, performs its own operations, and terminates by unmasking B. Now P1 resumes its execution, but B is (incorrectly) unmasked.

It is thus necessary to use counters rather than a simple binary flag to keep track of masking and to add one more table called bh_mask_count whose entries contain the masking level of each bottom half. The disable_bh(n) and enable_bh(n) functions update bh_mask_count[n] before acting on the nth flag of bh_mask.
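The effect of the counters on the P1/P2 scenario can be sketched in a few lines of C. This is a simplified model under stated assumptions: the names ending in _sketch and bh_is_masked are hypothetical, a set bit in the mask is taken to mean "unmasked" (consistent with the bh_mask & bh_active test), and no attempt is made to be safe against real concurrency.

```c
#include <assert.h>
#include <stdint.h>

#define NR_BH 32

/* Assumption: a set bit means the bottom half is NOT masked. */
uint32_t bh_mask_sketch = 0xffffffffu;

/* Masking level of each bottom half (how many disables are pending). */
int bh_mask_count_sketch[NR_BH];

/* Mask the nth bottom half and record one more nesting level. */
void disable_bh_sketch(int n)
{
    assert(n >= 0 && n < NR_BH);
    bh_mask_count_sketch[n]++;
    bh_mask_sketch &= ~((uint32_t)1 << n);
}

/* Drop one nesting level; unmask only when the level reaches zero. */
void enable_bh_sketch(int n)
{
    assert(n >= 0 && n < NR_BH);
    assert(bh_mask_count_sketch[n] > 0);
    if (--bh_mask_count_sketch[n] == 0)
        bh_mask_sketch |= (uint32_t)1 << n;
}

/* Nonzero if the nth bottom half is currently masked. */
int bh_is_masked(int n)
{
    return (bh_mask_sketch & ((uint32_t)1 << n)) == 0;
}
```

If P1 and then P2 each call disable_bh_sketch( ) on the same bottom half, the enable issued by P2 only lowers the level from 2 to 1, so the bottom half stays masked until P1 issues its own enable.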


4.6.6.2 Extending a bottom half

The motivation for introducing bottom halves is to allow a limited number of functions related to interrupt handling to be executed in a deferred manner. This approach has been stretched in two directions:

• To allow a generic kernel function, and not only a function that services an interrupt, to be executed as a bottom half
• To allow several kernel functions, instead of a single one, to be associated with a bottom half

Groups of functions are represented by task queues, which are lists of struct tq_struct elements having the following structure:

    struct tq_struct {
        struct tq_struct *next;     /* linked list of active bh's */
        unsigned long sync;         /* must be initialized to zero */
        void (*routine)(void *);    /* function to call */
        void *data;                 /* argument to function */
    };

As we shall see in Chapter 13, I/O device drivers make intensive use of task queues to request the execution of some functions when a specific interrupt occurs.

The DECLARE_TASK_QUEUE macro is used to allocate a new task queue, while queue_task( ) inserts a new function in a task queue. The run_task_queue( ) function executes all the functions included in a given task queue. It's worth mentioning two particular task queues, each associated with a specific bottom half:

• The tq_immediate task queue, run by the IMMEDIATE_BH bottom half, includes kernel functions to be executed together with the standard bottom halves. The kernel activates the IMMEDIATE_BH bottom half whenever a function is added to the tq_immediate task queue.
• The tq_timer task queue is run by the TQUEUE_BH bottom half, which is activated at every timer interrupt. As we'll see in Chapter 5, that means it runs about every 10 ms.

4.6.7 Dynamic Handling of IRQ Lines

With the exception of IRQ0, IRQ2, and IRQ13, the remaining 13 IRQs are dynamically handled. There is, therefore, a way in which the same interrupt can be used by several hardware devices even if they do not allow IRQ sharing: the trick consists in serializing the activation of the hardware devices so that just one at a time owns the IRQ line.

Before activating a device that is going to make use of an IRQ line, the corresponding driver invokes request_irq( ). This function creates a new irqaction descriptor and initializes it with the parameter values; it then invokes the setup_x86_irq( ) function to insert the descriptor in the proper IRQ list. The device driver aborts the operation if setup_x86_irq( ) returns an error code, which means that the IRQ line is already in use by another device that does not allow interrupt sharing. When the device operation is concluded, the driver invokes the free_irq( ) function to remove the descriptor from the IRQ list and release the memory area.
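The attach and release steps can be modeled with a small linked-list sketch. It is a simplified illustration, not the kernel's implementation: struct action_sketch, attach_action, detach_action, and the SHARE_OK flag (a stand-in for SA_SHIRQ) are hypothetical, memory allocation is left to the caller, and the sketch assumes a line can be shared only when both the existing and the new descriptor allow it.

```c
#include <stddef.h>

#define SHARE_OK 0x1UL   /* hypothetical stand-in for SA_SHIRQ */

struct action_sketch {
    unsigned long flags;
    void *dev_id;
    struct action_sketch *next;
};

/* Append new_action to the list of one IRQ line.
 * Returns 0 on success, -1 if the line is busy and cannot be shared. */
int attach_action(struct action_sketch **list,
                  struct action_sketch *new_action)
{
    struct action_sketch **p = list;

    if (*p != NULL) {
        /* Both the current owner and the newcomer must allow sharing. */
        if (!((*p)->flags & new_action->flags & SHARE_OK))
            return -1;
        while (*p != NULL)          /* walk to the end of the list */
            p = &(*p)->next;
    }
    new_action->next = NULL;
    *p = new_action;
    return 0;
}

/* Unlink the descriptor registered with dev_id.
 * Returns the unlinked descriptor, or NULL if none matches. */
struct action_sketch *detach_action(struct action_sketch **list,
                                    void *dev_id)
{
    struct action_sketch **p;

    for (p = list; *p != NULL; p = &(*p)->next) {
        if ((*p)->dev_id == dev_id) {
            struct action_sketch *found = *p;
            *p = found->next;
            found->next = NULL;
            return found;
        }
    }
    return NULL;
}
```

A driver that does not allow sharing thus owns the line exclusively until it detaches; only then can another driver attach, which is the serialization the text describes.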


Let us see how this scheme works on a simple example. Assume a program wants to address the /dev/fd0 device file, that is, the device file that corresponds to the first floppy disk on the system. [6] The program can do this either by directly accessing /dev/fd0 or by mounting a filesystem on it. Floppy disk controllers are usually assigned IRQ6; given this, the floppy driver will issue the following request:

    request_irq(6, floppy_interrupt,
                SA_INTERRUPT|SA_SAMPLE_RANDOM, "floppy", NULL);

[6] Floppy disks are "old" devices that do not usually allow IRQ sharing.

As can be observed, the floppy_interrupt( ) interrupt service routine must execute with the interrupts disabled (SA_INTERRUPT set) and no sharing of the IRQ (SA_SHIRQ flag cleared). When the operation on the floppy disk is concluded (either the I/O operation on /dev/fd0 terminates or the filesystem is unmounted), the driver releases IRQ6:

    free_irq(6, NULL);

In order to insert an irqaction descriptor in the proper list, the kernel invokes the setup_x86_irq( ) function, passing to it the parameters irq_nr, the IRQ number, and new, the address of a previously allocated irqaction descriptor. This function:

1. Checks whether another device is already using the irq_nr IRQ and, if so, whether the SA_SHIRQ flags in the irqaction descriptors of both devices specify that the IRQ line can be shared. Returns an error code if the IRQ line cannot be used.
2. Adds *new (the new irqaction descriptor) at the end of the list to which irq_desc[irq_nr].action points.
3. If no other device is sharing the same IRQ, clears the IRQ_DISABLED and IRQ_INPROGRESS flags in the status field of the IRQ main descriptor and reprograms the PIC to make sure that IRQ signals are enabled.

Here is an example of how setup_x86_irq( ) is used, drawn from system initialization. The kernel initializes the irq0 descriptor of the interval timer device by executing the following instructions in the time_init( ) function (see Chapter 5):

    struct irqaction irq0 =
        {timer_interrupt, SA_INTERRUPT, 0, "timer", NULL,};
    setup_x86_irq(0, &irq0);

First, the irq0 variable of type irqaction is initialized: the handler field is set to the address of the timer_interrupt( ) function, the flags field is set to SA_INTERRUPT, the name field is set to "timer", and the last field is set to NULL to show that no dev_id value is used. Next, the kernel invokes setup_x86_irq( ) to insert irq0 in the list of irqaction descriptors associated with IRQ0.

Similarly, the kernel initializes the irqaction descriptors associated with IRQ2 and IRQ13 and inserts them in the proper lists of irqaction descriptors by executing the following instructions in the init_IRQ( ) function:


    struct irqaction irq2 =
        {no_action, 0, 0, "cascade", NULL,};
    struct irqaction irq13 =
        {math_error_irq, 0, 0, "fpu", NULL,};
    setup_x86_irq(2, &irq2);
    setup_x86_irq(13, &irq13);

4.7 Returning from Interrupts and Exceptions

We will finish the chapter by examining the termination phase of interrupt and exception handlers. Although the main objective is clear, namely, to resume execution of some program, several issues must be considered before doing it:

• The number of kernel control paths being concurrently executed: if there is just one, the CPU must switch back to User Mode.
• Active bottom halves to be executed: if there are some, they must be executed.
• Pending process switch requests: if there is any request, the kernel must perform process scheduling; otherwise, control is returned to the current process.
• Pending signals: if a signal has been sent to the current process, it must be handled.

The kernel assembly language code that accomplishes all these things is not, technically speaking, a function, since control is never returned to the functions that invoke it. It is a piece of code with three different entry points called ret_from_intr, ret_from_sys_call, and ret_from_exception. Because it makes the description simpler, we shall refer to these three entry points as functions:

ret_from_intr( )
    Terminates interrupt handlers

ret_from_sys_call( )
    Terminates system calls, that is, kernel control paths engendered by 0x80 exceptions

ret_from_exception( )
    Terminates all exceptions except the 0x80 ones

The general flow diagram with the corresponding three entry points is illustrated in Figure 4-5. Besides these three labels, a few other ones have been added to allow you to relate the assembly language code more easily to the flow diagram. Let us now examine in detail how the termination occurs in each case.


Figure 4-5. Returning from interrupts and exceptions

4.7.1 The ret_from_intr( ) Function

When ret_from_intr( ) is invoked, the do_IRQ( ) function has already executed all active bottom halves (see Section 4.6.4). The initial part of the ret_from_intr( ) function is implemented by the following code:

    ret_from_intr:
        movl %esp, %ebx
        andl $0xffffe000, %ebx
        movl 0x30(%esp), %eax
        movb 0x2c(%esp), %al
        testl $(0x00020000 | 3), %eax
        jne ret_with_reschedule
        RESTORE_ALL

The address of the current process descriptor is stored in ebx (see Section 3.1.2 in Chapter 3). Then the values of the cs and eflags registers, which were pushed on the stack when the interrupt occurred, are used by the function to determine whether the interrupted program was


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>running in <strong>Kernel</strong> Mode. If so, a n<strong>es</strong>ting of interrupts has occurred and the interrupted kernelcontrol path is r<strong>es</strong>tarted by executing the following code, yielded by the RESTORE_ALL macro:popl %ebxpopl %ecxpopl %edxpopl %<strong>es</strong>ipopl %edipopl %ebppopl %eaxpopl %dspopl %<strong>es</strong>addl $4,%<strong>es</strong>piretThis macro loads the registers with the valu<strong>es</strong> saved by the SAVE_ALL macro and yieldscontrol to the interrupted program by executing the iret instruction.If, on the other hand, the interrupted program was running in User Mode or if the VM flag ofeflags was set, [7] a jump is made to the ret_with_r<strong>es</strong>chedule addr<strong>es</strong>s:[7] This flag allows programs to be executed in Virtual-8086 Mode; see the Pentium manuals for further details.ret_with_r<strong>es</strong>chedule:cmpl $0,20(%ebx)jne r<strong>es</strong>chedulecmpl $0,8(%ebx)jne signal_returnRESTORE_ALLAs we said previously, the ebx register points to the current proc<strong>es</strong>s d<strong>es</strong>criptor; within thatd<strong>es</strong>criptor, the need_r<strong>es</strong>ched field is at offset 20, which is checked by the first cmplinstruction. <strong>The</strong>refore, if the need_r<strong>es</strong>ched field is 1, the schedule( ) function is invoked toperform a proc<strong>es</strong>s switch.<strong>The</strong> offset of the sigpending field inside the proc<strong>es</strong>s d<strong>es</strong>criptor is 8. If it is null, currentr<strong>es</strong>um<strong>es</strong> execution in User Mode. 
Otherwise, the code jumps to signal_return to proc<strong>es</strong>sthe pending signals of current:signal_return:stit<strong>es</strong>tl $(0x00020000),0x30(%<strong>es</strong>p)movl %<strong>es</strong>p,%eaxjne v86_signal_returnxorl %edx,%edxcall do_signalRESTORE_ALLv86_signal_return:call save_v86_statemovl %eax,%<strong>es</strong>pxorl %edx,%edxcall do_signalRESTORE_ALL128
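The branch decisions made by ret_from_intr and ret_with_reschedule can be restated in C. What follows is only an illustrative sketch: the enum, the task_sketch structure, and the return_path( ) function are hypothetical names that do not appear in the kernel, which implements this logic in assembly language. The two constants mirror the operands of the testl instruction, namely the VM flag of eflags and the two low-order bits of cs that hold the privilege level.

```c
#include <stdint.h>

#define VM_MASK  0x00020000u   /* VM flag of eflags (Virtual-8086 Mode) */
#define CPL_MASK 3u            /* two low-order bits of cs: privilege level */

/* Hypothetical labels for the four possible outcomes. */
enum ret_path { RESTORE_KERNEL, RESCHEDULE, HANDLE_SIGNALS, RESUME_USER };

/* Hypothetical stand-in for the two descriptor fields that are tested. */
struct task_sketch {
    long need_resched;
    long sigpending;
};

static enum ret_path return_path(uint32_t eflags, uint16_t cs,
                                 const struct task_sketch *cur)
{
    /* movl eflags,%eax ; movb cs,%al : the low byte of eflags is
       overwritten by the low byte of cs before the single test. */
    uint32_t mix = (eflags & ~0xffu) | (cs & 0xffu);

    if ((mix & (VM_MASK | CPL_MASK)) == 0)
        return RESTORE_KERNEL;     /* nested interrupt: RESTORE_ALL */
    if (cur->need_resched != 0)
        return RESCHEDULE;         /* jne reschedule */
    if (cur->sigpending != 0)
        return HANDLE_SIGNALS;     /* jne signal_return */
    return RESUME_USER;            /* RESTORE_ALL back to User Mode */
}
```

Merging the two saved values into one register lets a single testl cover both conditions; the sketch keeps that trick so that the low-order bits of eflags cannot be mistaken for a privilege level.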


If the interrupted process was in VM86 mode, the save_v86_state( ) function is invoked. The do_signal( ) function (see Chapter 9) is then invoked to handle the pending signals. Finally, current can resume execution in User Mode.

4.7.2 The ret_from_sys_call( ) Function

The ret_from_sys_call( ) function is equivalent to the following assembly language code:

    ret_from_sys_call:
        movl bh_mask, %eax
        andl bh_active, %eax
        je ret_with_reschedule
    handle_bottom_half:
        call do_bottom_half
        jmp ret_from_intr

First, the bh_mask and bh_active variables are checked to determine whether active unmasked bottom halves exist. If no bottom half must be executed, a jump is made to the ret_with_reschedule address. Otherwise, the do_bottom_half( ) function is invoked; then control is transferred to ret_from_intr.

4.7.3 The ret_from_exception( ) Function

The ret_from_exception( ) function is equivalent to the following assembly language code:

    ret_from_exception:
        movl bh_mask,%eax
        andl bh_active,%eax
        jne handle_bottom_half
        jmp ret_from_intr

First, the bh_mask and bh_active global variables are checked to determine whether active unmasked bottom halves exist. If so, they are executed. In any case, a jump is made to the ret_from_intr address. Therefore exceptions terminate in the same way as interrupts.

4.8 Anticipating Linux 2.4

Linux 2.4 introduces a new mechanism called software interrupt. Software interrupts are similar to Linux 2.2's bottom halves, in that they allow you to defer the execution of a kernel function. However, while bottom halves were strictly serialized (because no two bottom halves can be executed at the same time even on different CPUs), software interrupts are not serialized in any way. It is quite possible that two CPUs run two instances of the same software interrupt at the same time. In this case, of course, the software interrupt must be reentrant. Networking, in particular, greatly benefits from software interrupts: it is much more efficient on multiprocessor systems because it uses two software interrupts in place of the old NET_BH bottom half.

Linux 2.4 introduces another mechanism similar to the bottom half called tasklet. Tasklets are built on top of software interrupts, but they are serialized with respect to themselves: two CPUs can execute two tasklets at the same time, but these tasklets must be different. Tasklets are much easier to write than generic software interrupts, because they need not be reentrant.


Bottom halves continue to exist in Linux 2.4, but they are now built on top of tasklets. As usual, no two bottom halves can execute at the same time, not even on two different CPUs of a multiprocessor system. Device driver developers are expected to update their old drivers and replace bottom halves with tasklets, because bottom halves significantly degrade the performance of multiprocessor systems.

On the hardware side, Linux 2.4 now supports IO-APIC chips even in uniprocessor systems and is able to handle several external IO-APIC chips in multiprocessor systems. (This feature was required for porting Linux to large enterprise systems.)


Chapter 5. Timing Measurements

Countless computerized activities are driven by timing measurements, often behind the user's back. For instance, if the screen is automatically switched off after you have stopped using the computer's console, this is due to a timer that allows the kernel to keep track of how much time has elapsed since you pushed a key or moved the mouse. If you receive a warning from the system asking you to remove a set of unused files, this is the outcome of a program that identifies all user files that have not been accessed for a long time. In order to do these things, programs must be able to retrieve from each file a timestamp identifying its last access time, and therefore such a timestamp must be automatically written by the kernel. More significantly, timing drives process switches along with even more basic kernel activities like checking for time-outs.

We can distinguish two main kinds of timing measurement that must be performed by the Linux kernel:

• Keeping the current time and date, so that they can be returned to user programs through the time( ), ftime( ), and gettimeofday( ) system calls (see Section 5.5.1 later in this chapter) and used by the kernel itself as timestamps for files and network packets
• Maintaining timers, that is, mechanisms that are able to notify the kernel (see Section 5.4.4) or a user program (see Section 5.5.3) that a certain interval of time has elapsed

Timing measurements are performed by several hardware circuits based on fixed-frequency oscillators and counters. This chapter consists of three different parts. The first section describes the hardware devices that underlie timing; the next three sections describe the kernel data structures and functions introduced to measure time; then a section discusses the system calls related to timing measurements and the corresponding service routines.

5.1 Hardware Clocks

The kernel must explicitly interact with three clocks: the Real Time Clock, the Time Stamp Counter, and the Programmable Interval Timer. The first two hardware devices allow the kernel to keep track of the current time of day; the last device is programmed by the kernel so that it issues interrupts at a fixed, predefined frequency. Such periodic interrupts are crucial for implementing the timers used by the kernel and the user programs.

5.1.1 Real Time Clock

All PCs include a clock called Real Time Clock (RTC), which is independent of the CPU and all other chips.

The RTC continues to tick even when the PC is switched off, since it is energized by a small battery or accumulator. The CMOS RAM and RTC are integrated in a single chip, the Motorola 146818 or an equivalent.


The RTC is capable of issuing periodic interrupts on IRQ8 at frequencies ranging between 2 Hz and 8192 Hz. It can also be programmed to activate the IRQ8 line when the RTC reaches a specific value, thus working as an alarm clock.

Linux uses the RTC only to derive the time and date; however, it allows processes to program the RTC by acting on the /dev/rtc device file (see Chapter 13). The kernel accesses the RTC through the 0x70 and 0x71 I/O ports. The system administrator can set up the clock by executing the /sbin/clock system program that acts directly on these two I/O ports.

5.1.2 Time Stamp Counter

All Intel 80x86 microprocessors include a CLK input pin, which receives the clock signal of an external oscillator.

Starting with the Pentium, many recent Intel 80x86 microprocessors include a 64-bit Time Stamp Counter (TSC) register that can be read by means of the rdtsc assembly language instruction. This register is a counter that is incremented at each clock signal: if, for instance, the clock ticks at 400 MHz, the Time Stamp Counter is incremented once every 2.5 nanoseconds.

Linux takes advantage of this register to get much more accurate time measurements than the ones delivered by the Programmable Interval Timer. In order to do this, Linux must determine the clock signal frequency while initializing the system: in fact, since this frequency is not declared when compiling the kernel, the same kernel image may run on CPUs whose clocks may tick at any frequency. The task of figuring out the actual frequency is accomplished during the system's boot by the calibrate_tsc( ) function, which returns the number:

The value of f is computed by counting the number of clock signals that occur in a relatively long time interval, namely 50.00077 milliseconds. This time constant is produced by setting up one of the channels of the Programmable Interval Timer properly (see the next section). The long execution time of calibrate_tsc( ) does not create problems, since the function is invoked only during system initialization.

5.1.3 Programmable Interval Timer

Besides the Real Time Clock and the Time Stamp Counter, IBM-compatible PCs include a third type of time-measuring device called Programmable Interval Timer (PIT). The role of a PIT is similar to the alarm clock of a microwave oven: to make the user aware that the cooking time interval has elapsed. Instead of ringing a bell, this device issues a special interrupt called timer interrupt, which notifies the kernel that one more time interval has elapsed. [1] Another difference from the alarm clock is that the PIT goes on issuing interrupts forever at some fixed frequency established by the kernel. Each IBM-compatible PC includes at least one PIT, which is usually implemented by an 8254 CMOS chip using the 0x40-0x43 I/O ports.

[1] The PIT is also used to drive an audio amplifier connected to the computer's internal speaker.


As we shall see in detail in the next paragraphs, Linux programs the PC's first PIT to issue timer interrupts on IRQ0 at a (roughly) 100-Hz frequency, that is, once every 10 milliseconds. This time interval is called a tick, and its length in microseconds is stored in the tick variable. The ticks beat time for all activities in the system; in some sense, they are like the ticks sounded by a metronome while a musician is rehearsing.

Generally speaking, shorter ticks yield better system responsiveness. This is because system responsiveness largely depends on how fast a running process is preempted by a higher-priority process once it becomes runnable (see Chapter 10); moreover, the kernel usually checks whether the running process should be preempted while handling the timer interrupt. This is a trade-off, however: shorter ticks require the CPU to spend a larger fraction of its time in Kernel Mode, that is, a smaller fraction of time in User Mode. As a consequence, user programs run slower. Therefore, only very powerful machines can adopt very short ticks and afford the consequent overhead. Currently, only Compaq's Alpha port of the Linux kernel issues 1024 timer interrupts per second, corresponding to a tick of roughly 1 millisecond.

A few macros in the Linux code yield some constants that determine the frequency of timer interrupts:

• HZ yields the number of timer interrupts per second, that is, the frequency of timer interrupts. This value is set to 100 for IBM PCs and most other hardware platforms.
• CLOCK_TICK_RATE yields the value 1193180, which is the 8254 chip's internal oscillator frequency.
• LATCH yields the ratio between CLOCK_TICK_RATE and HZ. It is used to program the PIT.

The first PIT is initialized by init_IRQ( ) as follows:

    outb_p(0x34,0x43);
    outb_p(LATCH & 0xff , 0x40);
    outb(LATCH >> 8 , 0x40);

The outb( ) C function is equivalent to the outb assembly language instruction: it copies the first operand into the I/O port specified as the second operand. The outb_p( ) function is similar to outb( ), except that it introduces a pause by executing a no-op instruction. The first outb_p( ) invocation is a command to the PIT to issue interrupts at a new rate. The next two outb_p( ) and outb( ) invocations supply the new interrupt rate to the device. The 16-bit LATCH constant is sent to the 8-bit 0x40 I/O port of the device as 2 consecutive bytes. As a result, the PIT will issue timer interrupts at a (roughly) 100-Hz frequency, that is, once every 10 ms.

Now that we understand what the hardware timers do, the following sections describe all the actions performed by the kernel when it receives a timer interrupt, that is, when a tick has elapsed.

5.2 The Timer Interrupt Handler

Each occurrence of a timer interrupt triggers the following major activities:

• Updates the time elapsed since system startup.


• Updates the time and date.
• Determines how long the current process has been running on the CPU and preempts it if it has exceeded the time allocated to it. The allocation of time slots (also called quanta) is discussed in Chapter 10.
• Updates resource usage statistics.
• Checks whether the interval of time associated with each software timer (see Section 5.4.4) has elapsed; if so, invokes the proper function.

The first activity is considered urgent, so it is performed by the timer interrupt handler itself. The remaining four activities are less urgent; they are performed by the functions invoked by the TIMER_BH and TQUEUE_BH bottom halves (see Section 4.6.6 in Chapter 4).

The kernel uses two basic timekeeping functions: one to keep the current time up to date and another to count the number of microseconds that have elapsed within the current second. There are two different ways to maintain such values: a more precise method that is available if the chip has a Time Stamp Counter (TSC) and a less precise method used in other cases. So the kernel creates two variables to store the functions it uses, pointing the variables to the functions using the TSC if it exists:

• The current time is calculated by do_gettimeofday( ) if the CPU has the TSC register and by do_normal_gettime( ) otherwise. A pointer to the proper function is stored in the variable do_get_fast_time.
• The number of microseconds is calculated by do_fast_gettimeoffset( ) when the TSC register is available and by do_slow_gettimeoffset( ) otherwise. The address of this function is stored in the variable do_gettimeoffset.

The time_init( ) function, which runs during kernel startup, sets the variables to point to the right functions and sets up the interrupt gate corresponding to IRQ0.

5.3 PIT's Interrupt Service Routine

Once the IRQ0 interrupt gate has been initialized, the handler field of IRQ0's irqaction descriptor contains the address of the timer_interrupt( ) function. This function starts running with the interrupts disabled, since the status field of IRQ0's main descriptor has the SA_INTERRUPT flag set. It performs the following steps:

1. If the CPU has a TSC register, it performs the following substeps:
   a. Executes an rdtsc assembly language instruction to store the value of the TSC register in the last_tsc_low variable
   b. Reads the state of the 8254 chip device internal oscillator and computes the delay between the timer interrupt occurrence and the execution of the interrupt service routine [2]
   c. Stores that delay (in microseconds) in the delay_at_last_interrupt variable
2. It invokes do_timer_interrupt( ).

[2] The 8254 oscillator drives a counter that is continuously decremented. When the counter becomes 0, the chip raises an IRQ0. So reading the counter indicates how much time has elapsed since the interrupt occurred.
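Substeps 1b and 1c convert the value read from the 8254 counter into a delay in microseconds. The sketch below shows one way to carry out that arithmetic, together with the LATCH value from Section 5.1.3. It is an illustration under stated assumptions rather than the kernel's own code: the pit_delay_usec( ) function and the TICK_USEC constant are hypothetical names, LATCH is rounded to the nearest integer, and the counter is assumed to run from LATCH - 1 down to 0 within each tick.

```c
#define HZ              100       /* timer interrupts per second on the PC */
#define CLOCK_TICK_RATE 1193180   /* 8254 oscillator frequency, in Hz */

/* Ratio between CLOCK_TICK_RATE and HZ, rounded to the nearest integer. */
#define LATCH ((CLOCK_TICK_RATE + HZ / 2) / HZ)

/* Hypothetical name: length of one tick in microseconds. */
#define TICK_USEC (1000000 / HZ)

/*
 * count is the value read back from the PIT counter; it is assumed to
 * lie between 0 and LATCH - 1. Because the counter is decremented from
 * LATCH - 1 toward 0, the number of oscillator cycles elapsed since the
 * last IRQ0 is (LATCH - 1) - count. The result is scaled to
 * microseconds and rounded to the nearest integer.
 */
static unsigned long pit_delay_usec(unsigned long count)
{
    unsigned long elapsed = (LATCH - 1) - count;

    return (elapsed * TICK_USEC + LATCH / 2) / LATCH;
}
```

With these constants LATCH evaluates to 11932, which fits in the 16 bits sent to the PIT, and the largest intermediate product stays well below the range of a 32-bit unsigned long.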


do_timer_interrupt( ), which may be considered as the interrupt service routine common to all 80x86 models, executes the following operations:

1. It invokes the do_timer( ) function, which is fully explained shortly.
2. If an adjtimex( ) system call has been issued, it invokes the set_rtc_mmss( ) function once every 660 seconds, that is, every 11 minutes, to adjust the Real Time Clock. This feature helps systems on a network synchronize their clocks (see Section 5.5.2).

The do_timer( ) function, which runs with the interrupts disabled, must be executed as quickly as possible. For this reason, it simply updates one fundamental value (the time elapsed from system startup) while delegating all remaining activities to two bottom halves. The function refers to three main variables related to timing measurements; the first is the fundamental uptime just mentioned, while the latter two are needed to store lost ticks that take place before the bottom half functions have a chance to run. Thus, the first is absolute (it just keeps incrementing) while the other two are relative to another variable called xtime that stores the approximate current time. (This variable will be described in Section 5.4.1.)

The three do_timer( ) variables are:

jiffies
    The number of elapsed ticks since the system was started; it is set to 0 during kernel initialization and incremented by 1 when a timer interrupt occurs, that is, on every tick. [3]

lost_ticks
    The number of ticks that have occurred since the last update of xtime.

lost_ticks_system
    The number of ticks that have occurred while the process was running in Kernel Mode since the last update of xtime. The user_mode macro examines the CPL field of the cs register saved in the stack to determine if the process was running in Kernel Mode.

[3] Since jiffies is stored as a 32-bit unsigned integer, it returns to 0 about 497 days after the system has been booted.

The do_timer( ) function is equivalent to:

    void do_timer(struct pt_regs * regs)
    {
        jiffies++;
        lost_ticks++;
        mark_bh(TIMER_BH);
        if (!user_mode(regs))
            lost_ticks_system++;
        if (tq_timer)
            mark_bh(TQUEUE_BH);
    }
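The figure of about 497 days quoted in the footnote on jiffies follows directly from the width of the counter and the tick frequency. The short sketch below reproduces the arithmetic, assuming a 32-bit counter and HZ equal to 100; the function name jiffies_wrap_days( ) is hypothetical and exists only for this illustration.

```c
#include <stdint.h>

#define HZ 100   /* ticks per second, as on the PC */

/* Whole days that elapse before a 32-bit tick counter returns to 0. */
static uint64_t jiffies_wrap_days(void)
{
    uint64_t ticks = (uint64_t)1 << 32;        /* distinct counter values */
    uint64_t seconds = ticks / HZ;             /* about 42.9 million seconds */

    return seconds / (24 * 60 * 60);           /* seconds per day: 86400 */
}
```

On a platform with a shorter tick the same counter wraps proportionally sooner, which is one reason kernel code compares tick values with overflow-aware macros instead of plain relational operators.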


Note that do_timer( ) activates the TQUEUE_BH bottom half only if the tq_timer task queue is not empty (see Section 4.6.6 in Chapter 4).

5.4 The TIMER_BH Bottom Half Functions

The timer_bh( ) function associated with the TIMER_BH bottom half invokes the update_times( ), run_old_timers( ), and run_timer_list( ) auxiliary functions, which are described next.

5.4.1 Updating the Time and Date

The xtime variable of type struct timeval is where user programs get the current time and date. The kernel also occasionally refers to it, for instance, when updating inode timestamps (see Section 1.5.4 in Chapter 1). In particular, xtime.tv_sec stores the number of seconds that have elapsed since midnight of January 1, 1970 [4], while xtime.tv_usec stores the number of microseconds that have elapsed within the last second (its value thus ranges between 0 and 999999).

[4] This date is traditionally used by all Unix systems as the earliest moment in counting time.

During system initialization, the time_init( ) function is invoked to set up the time and date: it reads them from the Real Time Clock by invoking the get_cmos_time( ) function, then it initializes xtime. Once this has been done, the kernel does not need the RTC anymore: it relies instead on the TIMER_BH bottom half, which is activated once every tick.

The update_times( ) function invoked by the TIMER_BH bottom half updates xtime by disabling interrupts and executing the following statement:

    if (lost_ticks)
        update_wall_time(lost_ticks);

The update_wall_time( ) function invokes the update_wall_time_one_tick( ) function lost_ticks consecutive times; each invocation adds 10000 to the xtime.tv_usec field. [5] If xtime.tv_usec has become greater than 999999, the update_wall_time( ) function also updates the tv_sec field of xtime.

[5] In fact, the function is much more complex since it might slightly tune the value 10000. This may be necessary if an adjtimex( ) system call has been issued (see Section 5.5.2 later in this chapter).

5.4.2 Updating Resource Usage Statistics

The value of lost_ticks is also used, together with that of lost_ticks_system, to update resource usage statistics. These statistics are used by various administration utilities such as top. A user who enters the uptime command sees the statistics as the "load average" relative to the last minute, the last 5 minutes, and the last 15 minutes. A value of 0 means that there are no active processes (besides the swapper process 0) to run, while a value of 1 means that the CPU is 100% busy with a single process, and values greater than 1 mean that the CPU is shared among several active processes.

After updating the system clock, update_times( ) reenables the interrupts and performs the following actions:


• Clears lost_ticks after storing its value in ticks
• Clears lost_ticks_system after storing its value in system
• Invokes calc_load(ticks)
• Invokes update_process_times(ticks, system)

The calc_load( ) function counts the number of processes in the TASK_RUNNING or TASK_UNINTERRUPTIBLE state and uses this number to update the CPU usage statistics.

The update_process_times( ) function updates some kernel statistics stored in the kstat variable of type kernel_stat; it then invokes update_one_process( ) to update some fields storing statistics that can be exported to user programs through the times( ) system call. In particular, a distinction is made between CPU time spent in User Mode and in Kernel Mode. The function performs the following actions:

• Updates the per_cpu_utime field of current's process descriptor, which stores the number of ticks during which the process has been running in User Mode.
• Updates the per_cpu_stime field of current's process descriptor, which stores the number of ticks during which the process has been running in Kernel Mode.
• Invokes do_process_times( ), which checks whether the total CPU time limit has been reached; if so, sends SIGXCPU and SIGKILL signals to current. Section 3.1.5 in Chapter 3 describes how the limit is controlled by the rlim[RLIMIT_CPU].rlim_cur field of each process descriptor.
• Invokes the do_it_virt( ) and do_it_prof( ) functions, which are described in Section 5.5.3.

Two additional fields called times.tms_cutime and times.tms_cstime are provided in the process descriptor to count the number of CPU ticks spent by the process children in User Mode and in Kernel Mode, respectively. For reasons of efficiency, these fields are not updated by do_process_times( ) but rather when the parent process queries the state of one of its children (see Section 3.4 in Chapter 3).

5.4.3 CPU's Time Sharing

Timer interrupts are essential for time sharing the CPU among runnable processes (that is, those in the TASK_RUNNING state). As we shall see in Chapter 10, each process is usually allowed a quantum of time of limited duration: if the process is not terminated when its quantum expires, the schedule( ) function selects the new process to run.

The counter field of the process descriptor specifies how many ticks of CPU time are left to the process. The quantum is always a multiple of a tick, that is, a multiple of about 10 ms. The value of counter is updated at every tick by update_process_times( ) as follows:

    if (current->pid) {
        current->counter -= ticks;
        if (current->counter < 0) {
            current->counter = 0;
            current->need_resched = 1;
        }
    }


As stated in Section 3.1.2 in Chapter 3, the process having PID 0 (swapper) must not be time-shared, because it is the process that runs on the CPU when no other TASK_RUNNING processes exist.

Since counter is updated in a deferred manner by a bottom half, the decrement might be larger than a single tick. Thus, the ticks local variable denotes the number of ticks that occurred since the bottom half was activated. When counter becomes smaller than 0, the need_resched field of the process descriptor is set to 1. In that case, the schedule( ) function will be invoked before resuming User Mode execution, and other TASK_RUNNING processes will have a chance to resume execution on the CPU.

5.4.4 The Role of Timers

A timer is a software facility that allows functions to be invoked at some future moment, after a given time interval has elapsed; a time-out denotes a moment at which the time interval associated with a timer has elapsed.

Timers are widely used both by the kernel and by processes. Most device drivers make use of timers to detect anomalous conditions: floppy disk drivers, for instance, use timers to switch off the device motor after the floppy has not been accessed for a while, and parallel printer drivers use them to detect erroneous printer conditions.

Timers are also used quite often by programmers to force the execution of specific functions at some future time (see Section 5.5.3).

Implementing a timer is relatively easy: each timer contains a field that indicates how far in the future the timer should expire. This field is initially calculated by adding the right number of ticks to the current value of jiffies. The field does not change. Every time the kernel checks a timer, it compares the expiration field to the value of jiffies at the current moment, and the timer expires when jiffies is greater than or equal to the stored value. This comparison is made via the time_after, time_before, time_after_eq, and time_before_eq macros, which take care of possible overflows of jiffies.

Linux considers three types of timers called static timers, dynamic timers, and interval timers. The first two types are used by the kernel, while interval timers may be created by processes in User Mode.

One word of caution about Linux timers: since checking for timer functions is always done by bottom halves that may be executed a long time after they have been activated, the kernel cannot ensure that timer functions will start right at their expiration times; it can only ensure that they will be executed either at the proper time or later, with a delay of up to a few hundred milliseconds. For that reason, timers are not appropriate for real-time applications in which expiration times must be strictly enforced.
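The overflow-aware comparison of tick values can be illustrated with a signed difference. The sketch below is not the kernel's macro definition: tick_after( ) and timer_expired( ) are hypothetical names, and the counter is fixed at 32 bits for the sake of the example. The comparison gives the right answer as long as the two values are less than 2^31 ticks apart.

```c
#include <stdint.h>

/*
 * Nonzero if tick value a is later than tick value b, even when the
 * counter has wrapped around between the two. The unsigned subtraction
 * is well defined modulo 2^32; converting the result to int32_t relies
 * on the two's complement behavior of common compilers.
 */
static int tick_after(uint32_t a, uint32_t b)
{
    return (int32_t)(b - a) < 0;
}

/* A timer has expired once the tick counter has reached its expiration value. */
static int timer_expired(uint32_t now, uint32_t expires)
{
    return !tick_after(expires, now);
}
```

A plain test such as now >= expires would fail for a timer armed shortly before the wraparound: with expires equal to 3 and now equal to 0xfffffff0, it would report the timer as already expired, whereas timer_expired( ) correctly reports that 19 ticks remain.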


5.4.5 Static Timers

The first versions of Linux allowed only 32 different timers; [6] these static timers, which rely on statically allocated kernel data structures, still continue to be used. Since they were the first to be introduced, Linux code refers to them as old timers.

[6] This value was chosen so that the corresponding active flags could be stored in a single variable.

Static timers are stored in the timer_table array, which includes 32 entries. Each entry consists of the following timer_struct structure:

    struct timer_struct {
        unsigned long expires;
        void (*fn)(void);
    };

The expires field specifies when the timer expires; the time is expressed as the number of ticks that have elapsed since the system was started up. All timers having an expires value smaller than or equal to the value of jiffies are considered to be expired or decayed. The fn field contains the address of the function to be executed when the timer expires.

Although timer_table includes 32 entries, Linux uses only those listed in Table 5-1.

Table 5-1. Static Timers

    Static Timer        Time-out Effect
    BACKGR_TIMER        Background I/O operation request
    BEEP_TIMER          Loudspeaker tone
    BLANK_TIMER         Switch off the screen
    COMTROL_TIMER       Comtrol serial card
    COPRO_TIMER         i80387 coprocessor
    DIGI_TIMER          Digiboard card
    FLOPPY_TIMER        Floppy disk
    GDTH_TIMER          GDTH SCSI driver
    GSCD_TIMER          Goldstar CD-ROM
    HD_TIMER            Hard disk (old IDE driver)
    MCD_TIMER           Mitsumi CD-ROM
    QIC02_TAPE_TIMER    QIC-02 tape driver
    RS_TIMER            RS-232 serial port
    SWAP_TIMER          kswapd kernel thread activation

The timer_active variable is used to identify the active static timers: each bit of this 32-bit variable is a flag that specifies whether the corresponding timer is activated.

In order to activate a static timer, the kernel must simply:

• Register the function to be executed in the fn field of the timer.
• Compute the expiration time (this is usually done by adding some specified value to the value of jiffies) and store it in the expires field of the timer.
• Set the proper flag in timer_active.


The job of checking for decayed static timers is done by the run_old_timers( ) function, which is invoked by the TIMER_BH bottom half:

    void run_old_timers(void)
    {
        struct timer_struct *tp;
        unsigned long mask;

        /* mask selects one bit of timer_active per timer_table entry */
        for (mask = 1, tp = timer_table; mask; tp++, mask += mask) {
            if (mask > timer_active)
                break;              /* no active timer in the remaining entries */
            if (!(mask & timer_active))
                continue;           /* timer not active */
            if (tp->expires > jiffies)
                continue;           /* timer not yet decayed */
            timer_active &= ~mask;  /* clear the flag before running fn */
            tp->fn();
            sti();
        }
    }

Once a decayed active timer has been identified, the corresponding active flag is cleared before executing the function that the fn field points to, thus ensuring that the timer won't be invoked again at each future execution of run_old_timers( ).

5.4.6 Dynamic Timers

Dynamic timers may be dynamically created and destroyed. No limit is placed on the number of currently active dynamic timers.

A dynamic timer is stored in the following timer_list structure:

    struct timer_list {
        struct timer_list *next;
        struct timer_list *prev;
        unsigned long expires;
        unsigned long data;
        void (*function)(unsigned long);
    };

The function field contains the address of the function to be executed when the timer expires. The data field specifies a parameter to be passed to this timer function. Thanks to the data field, it is possible to define a single general-purpose function that handles the time-outs of several device drivers; the data field could store the device ID or other meaningful data that could be used by the function to differentiate the device.

The meaning of the expires field is the same as the corresponding field for static timers.

The next and prev fields implement links for a doubly linked circular list. In fact, each active dynamic timer is inserted in exactly one of 512 doubly linked circular lists, depending on the value of the expires field. The algorithm that uses these lists is described later in the chapter.

In order to create and activate a dynamic timer, the kernel must:


1. Create a new struct timer_list object, say t. This can be done in several ways by:
   o Defining a static global variable in the code
   o Defining a local variable inside a function: in this case, the object is stored on the Kernel Mode stack
   o Including the object in a dynamically allocated descriptor
2. Initialize the object by invoking the init_timer(&t) function. This simply sets the next and prev fields to NULL.
3. If the dynamic timer is not already inserted in a list, assign a proper value to the expires field. Otherwise, if the dynamic timer is already inserted in a list, update the expires field by invoking the mod_timer( ) function, which also takes care of moving the object into the proper list (discussed shortly).
4. Load the function field with the address of the function to be activated when the timer decays. If required, load the data field with a parameter value to be passed to the function.
5. If the dynamic timer is not already inserted in a list, insert the t element in the proper list by invoking the add_timer(&t) function.

Once the timer has decayed, the kernel automatically removes the t element from its list. Sometimes, however, a process should explicitly remove a timer from its list using the del_timer( ) function. Indeed, a sleeping process may be woken up before the time-out is over, and in this case the process may choose to destroy the timer. Invoking del_timer( ) on a timer already removed from a list does no harm, so calling del_timer( ) from the timer function is considered a good practice.

We saw previously how the run_old_timers( ) function was able to identify the active decayed static timers by executing a single for loop on the 32 timer_table components. This approach is no longer applicable to dynamic timers, since scanning a long list of dynamic timers at every tick would be too costly. On the other hand, maintaining a sorted list would not be much more efficient, since the insertion and deletion operations would also be costly.

The solution adopted is based on a clever data structure that partitions the expires values into blocks of ticks and allows dynamic timers to percolate efficiently from lists with larger expires values to lists with smaller ones.

The main data structure is an array called tvecs, whose elements point to five groups of lists identified by the tv1, tv2, tv3, tv4, and tv5 structures (see Figure 5-1).

The tv1 structure is of type struct timer_vec_root, which includes an index field and a vec array of 256 pointers to timer_list elements, that is, to lists of dynamic timers. It contains all dynamic timers that will decay within the next 255 ticks.

The index field specifies the currently scanned list; it is initialized to 0 and incremented by 1 (modulo 256) at every tick. The list referenced by index contains all dynamic timers that have expired during the current tick; the next list contains all dynamic timers that will expire in the next tick; the (index+k)-th list contains all dynamic timers that will expire in exactly k ticks. When index returns to 0, this means that all the timers in tv1 have been scanned: in this case, the list pointed to by tv2.vec[tv2.index] is used to replenish tv1.


The tv2, tv3, and tv4 structures of type struct timer_vec contain all dynamic timers that will decay within the next 2^14 - 1, 2^20 - 1, and 2^26 - 1 ticks, respectively.

The tv5 structure is identical to the previous ones, except that the last entry of the vec array includes dynamic timers with arbitrarily large expires fields; it never needs to be replenished from another array.

The timer_vec structure is very similar to timer_vec_root: it contains an index field and a vec array of 64 pointers to dynamic timer lists. The index field specifies the currently scanned list; it is incremented by 1 (modulo 64) every 256 × 64^(i-2) ticks, where i, ranging between 2 and 5, is the tvi group number (that is, every 2^8, 2^14, 2^20, and 2^26 ticks for tv2 through tv5). As in the case of tv1, when index returns to 0, the list pointed to by tvj.vec[tvj.index] is used to replenish tvi (i ranges between 2 and 4, j is equal to i+1).

A single entry of tv2 is sufficient to replenish the whole array tv1; similarly, a single entry of tv3 is sufficient to replenish the whole array tv2 and so on.

Figure 5-1 shows how these data structures are connected together.

Figure 5-1. The groups of lists associated with dynamic timers

The timer_bh( ) function associated with the TIMER_BH bottom half invokes the run_timer_list( ) auxiliary function to check for decayed dynamic timers. The function relies on a variable similar to jiffies called timer_jiffies. This new variable is needed because a few timer interrupts might occur before the activated TIMER_BH bottom half has a chance to run; this happens typically when several interrupts of different types are issued in a short interval of time.

The value of timer_jiffies represents the expiration time of the dynamic timer list yet to be checked: if it coincides with the value of jiffies, no backlog of bottom half functions has accumulated; if it is smaller than jiffies, then bottom half functions that refer to previous ticks have to be dealt with. The variable is set to 0 at system startup and is incremented only by run_timer_list( ), which is invoked once every tick. Its value can never be greater than jiffies.


The run_timer_list( ) function includes the following C fragment (assuming a uniprocessor system):

    cli();
    while ((long)(jiffies - timer_jiffies) >= 0) {
        struct timer_list *timer;
        if (!tv1.index) {
            int n = 1;
            /* refill tv1 from tv2, and higher groups as needed */
            do {
                cascade_timers(tvecs[n]);
            } while (tvecs[n]->index == 1 && ++n < 5);
        }
        while ((timer = tv1.vec[tv1.index])) {
            detach_timer(timer);
            timer->next = timer->prev = NULL;
            sti();
            timer->function(timer->data);
            cli();
        }
        ++timer_jiffies;
        tv1.index = (tv1.index + 1) & 0xff;
    }
    sti();

The outermost while loop ends when timer_jiffies becomes greater than the value of jiffies. Since the values of jiffies and timer_jiffies usually coincide, the outermost while cycle will often be executed only once. In general, the outermost loop will be executed jiffies - timer_jiffies + 1 consecutive times. Moreover, if a timer interrupt occurs while run_timer_list( ) is being executed, dynamic timers that decay at this tick occurrence will also be considered, since the jiffies variable is asynchronously incremented by the IRQ0 interrupt handler (see Section 5.3).

During a single execution of the outermost while cycle, the dynamic timer functions included in the tv1.vec[tv1.index] list are executed. Before executing a dynamic timer function, the loop invokes the detach_timer( ) function to remove the dynamic timer from the list.
Once the list is emptied, the value of tv1.index is incremented (modulo 256) and the value of timer_jiffies is incremented.

When tv1.index becomes equal to 0, all the lists of tv1 have been checked; in this case, it is necessary to refill the tv1 structure. This is accomplished by the cascade_timers( ) function, which transfers the dynamic timers included in tv2.vec[tv2.index] into tv1.vec, since they will necessarily decay within the next 256 ticks. If tv2.index is equal to 0, it is necessary to refill the tv2 array of lists with the elements of tv3.vec[tv3.index] and so on.

Notice that run_timer_list( ) disables interrupts just before entering the outermost loop; interrupts are enabled right before invoking each dynamic timer function, and again disabled right after its termination. This ensures that the dynamic timer data structures are not corrupted by interleaved kernel control paths.

To sum up, this rather complex algorithm ensures excellent performance. To see why, assume for the sake of simplicity that the TIMER_BH bottom half is executed right after the corresponding timer interrupt has occurred. Then in 255 timer interrupt occurrences out of 256, that is in 99.6% of the cases, the run_timer_list( ) function just runs the functions of the decayed timers, if any. In order to replenish tv1.vec periodically, it will be sufficient 63 times out of 64 to partition the list pointed to by tv2.vec[tv2.index] into the 256 lists pointed to by tv1.vec. The tv2.vec array, in turn, must be replenished once every 2^14 ticks, that is, in about 0.006% of the cases, or once every 163 seconds with a 10 ms tick. Similarly, tv3 is replenished every 2 hours and 54 minutes, tv4 every 7 days and 18 hours, while tv5 doesn't need to be replenished.

5.4.7 An Application of Dynamic Timers

On some occasions, for instance when it is unable to provide a given service, the kernel may decide to suspend the current process for a fixed amount of time. This is usually done by performing a process time-out.

Let us assume that the kernel has decided to suspend the current process for two seconds. It does this by executing the following code:

    timeout = 2 * HZ;
    current->state = TASK_INTERRUPTIBLE;
    timeout = schedule_timeout(timeout);

The kernel implements process time-outs by using dynamic timers. They appear in the schedule_timeout( ) function, which executes the following statements:

    struct timer_list timer;
    expire = timeout + jiffies;
    init_timer(&timer);
    timer.expires = expire;
    timer.data = (unsigned long) current;
    timer.function = process_timeout;
    add_timer(&timer);
    schedule();     /* process suspended until timer expires */
    del_timer(&timer);
    timeout = expire - jiffies;
    return (timeout < 0 ? 0 : timeout);

When schedule( ) is invoked, another process is selected for execution; when the former process resumes its execution, the function removes the dynamic timer. In the last statement, the function returns either 0 if the time-out is expired or the number of ticks left to the time-out expiration if the process has been awoken for some other reason.

When the time-out expires, the kernel executes the following function:

    void process_timeout(unsigned long data)
    {
        struct task_struct * p = (struct task_struct *) data;
        wake_up_process(p);
    }

The run_timer_list( ) function invokes process_timeout( ), passing as its parameter the process descriptor pointer stored in the data field of the timer object. As a result, the suspended process is woken up.


5.5 System Calls Related to Timing Measurements

Several system calls allow User Mode processes to read and modify the time and date and to create timers. Let us briefly review them and discuss how the kernel handles them.

5.5.1 The time( ), ftime( ), and gettimeofday( ) System Calls

Processes in User Mode can get the current time and date by means of several system calls:

time( )
    Returns the number of elapsed seconds since midnight at the start of January 1, 1970

ftime( )
    Returns, in a data structure of type timeb, the number of elapsed seconds since midnight of January 1, 1970; the number of elapsed milliseconds in the last second; the time zone; and the current status of daylight saving time

gettimeofday( )
    Returns the same information as ftime( ) in two data structures named timeval and timezone

The first two system calls are superseded by gettimeofday( ), but they are still included in Linux for backward compatibility. We don't discuss them further.

The gettimeofday( ) system call is implemented by the sys_gettimeofday( ) function. In order to compute the current date and time of the day, this function invokes do_gettimeofday( ), which executes the following actions:

• Copies the contents of xtime into the user-space buffer specified by the system call parameter tv:

    *tv = xtime;

• Updates the number of microseconds by invoking the function addressed by the do_gettimeoffset variable:

    tv->tv_usec += do_gettimeoffset();

  If the CPU has a Time Stamp Counter, the do_fast_gettimeoffset( ) function is executed. It reads the TSC register by using the rdtsc Assembly instruction; it then subtracts the value stored in last_tsc_low to obtain the number of CPU cycles elapsed since the last timer interrupt was handled. The function converts that number to microseconds and adds in the delay that elapsed before the activation of the timer interrupt handler, which is stored in the delay_at_last_interrupt variable mentioned earlier in Section 5.3.


  If the CPU does not have a TSC register, do_gettimeoffset points to the do_slow_gettimeoffset( ) function. It reads the state of the 8254 chip device internal oscillator and then computes the time length elapsed since the last timer interrupt. Using that value and the contents of jiffies, it can derive the number of microseconds elapsed in the last second.

• Further increases the number of microseconds to take into account all timer interrupts whose bottom halves have not yet been executed:

    if (lost_ticks)
        tv->tv_usec += lost_ticks * (1000000/HZ);

• Finally, checks for an overflow in the microseconds field, adjusting both that field and the second field if necessary:

    while (tv->tv_usec >= 1000000) {
        tv->tv_usec -= 1000000;
        tv->tv_sec++;
    }

Processes in User Mode with root privilege may modify the current date and time by using either the obsolete stime( ) or the settimeofday( ) system call. The sys_settimeofday( ) function invokes do_settimeofday( ), which executes operations complementary to those of do_gettimeofday( ).

Notice that both system calls modify the value of xtime while leaving unchanged the RTC registers. Therefore, the new time will be lost when the system shuts down, unless the user executes the /sbin/clock program to change the RTC value.

5.5.2 The adjtimex( ) System Call

Although clock drift ensures that all systems eventually move away from the correct time, changing the time abruptly is both an administrative nuisance and risky behavior. Imagine, for instance, programmers trying to build a large program and depending on file timestamps to make sure that out-of-date object files are recompiled. A large change in the system's time could confuse the make program and lead to an incorrect build. Keeping the clocks tuned is also important when implementing a distributed filesystem on a network of computers: in this case, it is wise to adjust the clocks of the interconnected PCs so that the timestamp values associated with the inodes of the accessed files are coherent. Thus, systems are often configured to run a time synchronization protocol such as Network Time Protocol (NTP) on a regular basis to change the time gradually at each tick. This utility depends on the adjtimex( ) system call in Linux.

This system call is present in several Unix variants, although it should not be used in programs intended to be portable. It receives as its parameter a pointer to a timex structure, updates kernel parameters from the values in the timex fields, and returns the same structure with current kernel values. Such kernel values are used by update_wall_time_one_tick( ) to slightly adjust the number of microseconds added to xtime.tv_usec at each tick.


5.5.3 The setitimer( ) and alarm( ) System Calls

Linux allows User Mode processes to activate special timers called interval timers. [7] The timers cause Unix signals (see Chapter 9) to be sent periodically to the process. It is also possible to activate an interval timer so that it sends just one signal after a specified delay. Each interval timer is therefore characterized by:

[7] These software constructs have nothing in common with the Programmable Interval Timer chips described earlier in this chapter.

• The frequency at which the signals must be emitted, or a null value if just one signal has to be generated
• The time remaining until the next signal is to be generated

The warning earlier in the chapter about accuracy applies to these timers. They are guaranteed to execute after the requested time has elapsed, but it is impossible to predict exactly when they will be delivered.

Interval timers are activated by means of the POSIX setitimer( ) system call. The first parameter specifies which of the following policies should be adopted:

ITIMER_REAL
    The actual elapsed time; the process receives SIGALRM signals

ITIMER_VIRTUAL
    The time spent by the process in User Mode; the process receives SIGVTALRM signals

ITIMER_PROF
    The time spent by the process both in User and in Kernel Mode; the process receives SIGPROF signals

In order to implement an interval timer for each of the preceding policies, the process descriptor includes three pairs of fields:

• it_real_incr and it_real_value
• it_virt_incr and it_virt_value
• it_prof_incr and it_prof_value

The first field of each pair stores the interval in ticks between two signals; the other field stores the current value of the timer.

The ITIMER_REAL interval timer is implemented by making use of dynamic timers, because the kernel must send signals to the process even when it is not running on the CPU. Therefore, each process descriptor includes a dynamic timer object called real_timer. The setitimer( ) system call initializes the real_timer fields and then invokes add_timer( ) to insert the dynamic timer in the proper list. When the timer expires, the kernel executes the it_real_fn( ) timer function. In turn, the it_real_fn( ) function sends a SIGALRM signal to the process; if it_real_incr is not null, it sets the expires field again, reactivating the timer.

The ITIMER_VIRTUAL and ITIMER_PROF interval timers do not require dynamic timers, since they can be updated while the process is running: the do_it_virt( ) and do_it_prof( ) functions are invoked by update_one_process( ), which runs when the TIMER_BH bottom half is executed. Therefore, the two interval timers are usually updated once every tick, and if they are expired, the proper signal is sent to the current process.

The alarm( ) system call sends a SIGALRM signal to the calling process when a specified time interval has elapsed. It is very similar to setitimer( ) when invoked with the ITIMER_REAL parameter, since it makes use of the real_timer dynamic timer included in the process descriptor. Therefore, alarm( ) and setitimer( ) with parameter ITIMER_REAL cannot be used at the same time.

5.6 Anticipating Linux 2.4

Linux 2.4 introduces no significant change to the time-handling functions of the 2.2 version.


Chapter 6. Memory Management

We saw in Chapter 2 how Linux takes advantage of Intel's segmentation and paging circuits to translate logical addresses into physical ones. In the same chapter, we mentioned that some portion of RAM is permanently assigned to the kernel and used to store both the kernel code and the static kernel data structures.

The remaining part of the RAM is called dynamic memory. It is a valuable resource, needed not only by the processes but also by the kernel itself. In fact, the performance of the entire system depends on how efficiently dynamic memory is managed. Therefore, all current multitasking operating systems try to optimize the use of dynamic memory, assigning it only when it is needed and freeing it as soon as possible.

This chapter, which consists of three main sections, describes how the kernel allocates dynamic memory for its own use. Section 6.1 and Section 6.2 illustrate two different techniques for handling physically contiguous memory areas, while Section 6.3 illustrates a third technique that handles noncontiguous memory areas.

6.1 Page Frame Management

We saw in Section 2.4 in Chapter 2 how the Intel Pentium processor can use two different page frame sizes: 4 KB and 4 MB. Linux adopts the smaller 4 KB page frame size as the standard memory allocation unit. This makes things simpler for two reasons:

• The paging circuitry automatically checks whether the page being addressed is contained in some page frame; furthermore, each page frame is hardware-protected through the flags included in the Page Table entry that points to it. By choosing a 4 KB allocation unit, the kernel can directly determine the memory allocation unit associated with the page where a page fault exception occurs.
• The 4 KB size is a multiple of most disk block sizes, so transfers of data between main memory and disks are more efficient. Yet this smaller size is much more manageable than the 4 MB size.

The kernel must keep track of the current status of each page frame. For instance, it must be able to distinguish the page frames used to contain pages belonging to processes from those that contain kernel code or kernel data structures; similarly, it must be able to determine whether a page frame in dynamic memory is free or not. This sort of state information is kept in an array of descriptors, one for each page frame. The descriptors of type struct page have the following format:

    typedef struct page {
        struct page *next;
        struct page *prev;
        struct inode *inode;
        unsigned long offset;
        struct page *next_hash;
        atomic_t count;
        unsigned long flags;
        struct wait_queue *wait;
        struct page **pprev_hash;
        struct buffer_head * buffers;
    } mem_map_t;


We shall describe only a few fields (the remaining ones will be discussed in later chapters dealing with filesystems, I/O buffers, memory mapping, and so on):

count
    Set to 0 if the corresponding page frame is free; set to a value greater than 0 if the page frame has been assigned to one or more processes or if it is used for some kernel data structures.

prev, next
    Used to insert the descriptor in a doubly linked circular list. The meaning of these fields depends on the current use of the page frame.

flags
    An array of up to 32 flags (see Table 6-1) describing the status of the page frame. For each PG_xyz flag, a corresponding PageXyz macro has been defined to read or set its value.

Some of the flags listed in Table 6-1 are explained in later chapters. The PG_DMA flag exists because of a limitation on Direct Memory Access (DMA) processors for ISA buses: such DMA processors are able to address only the first 16 MB of RAM, hence page frames are divided into two groups depending on whether they can be addressed by the DMA or not. (Section 13.1.4 in Chapter 13 gives further details on DMAs.) In this chapter, the term "DMA" will always refer to DMA for ISA buses.

Table 6-1. Flags Describing the Status of a Page Frame

Flag Name              Meaning
PG_decr_after          See Section 16.4.3 in Chapter 16.
PG_dirty               Not used.
PG_error               An I/O error occurred while transferring the page.
PG_free_after          See Section 15.1.1 in Chapter 15.
PG_DMA                 Usable by ISA DMA (see text).
PG_locked              Page cannot be swapped out.
PG_referenced          Page frame has been accessed through the hash table of the page cache (see Section 14.2 in Chapter 14).
PG_reserved            Page frame reserved to kernel code or unusable.
PG_skip                Used on SPARC/SPARC64 architectures to "skip" some parts of the address space.
PG_Slab                Included in a slab: see Section 6.2 later in this chapter.
PG_swap_cache          Included in the swap cache; see Section 16.3 in Chapter 16.
PG_swap_unlock_after   See Section 16.4.3 in Chapter 16.
PG_uptodate            Set after completing a read operation, unless a disk I/O error happened.

All the page frame descriptors on the system are included in an array called mem_map. Since each descriptor is less than 64 bytes long, mem_map requires about four page frames for each megabyte of RAM. The MAP_NR macro computes the number of the page frame whose address is passed as a parameter, and thus the index of the corresponding descriptor in mem_map:


    #define MAP_NR(addr) (__pa(addr) >> PAGE_SHIFT)

The macro makes use of the __pa macro, which converts a logical address to a physical one.

Dynamic memory, and the values used to refer to it, are illustrated in Figure 6-1. Page frame descriptors are initialized by the free_area_init( ) function, which acts on two parameters: start_mem denotes the first linear address of the dynamic memory immediately after the kernel memory, while end_mem denotes the last linear address of the dynamic memory plus 1 (see Section 2.5.3 and Section 2.5.5 in Chapter 2). The free_area_init( ) function also considers the i386_endbase variable, which stores the initial address of the reserved page frames. The function allocates a suitably sized memory area to mem_map. The function then initializes the area by setting all fields to 0, except for the flags fields, in which it sets the PG_DMA and PG_reserved flags:

    mem_map = (mem_map_t *) start_mem;
    p = mem_map + MAP_NR(end_mem);
    start_mem = ((unsigned long) p + sizeof(long) - 1) & ~(sizeof(long)-1);
    memset(mem_map, 0, start_mem - (unsigned long) mem_map);
    do {
        --p;
        p->count = 0;
        p->flags = (1 << PG_DMA) | (1 << PG_reserved);
    } while (p > mem_map);
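As a rough check of the MAP_NR arithmetic and of the "four page frames per megabyte" estimate, the following user-space sketch may help. It assumes the usual i386 values PAGE_SHIFT = 12 and PAGE_OFFSET = 0xC0000000, and it takes 64 bytes as the descriptor size (the upper bound quoted in the text). All sk_ names are hypothetical and exist only for this illustration.

```c
/* Assumed i386 values; they are not stated in this section. */
#define SK_PAGE_SHIFT  12
#define SK_PAGE_SIZE   (1UL << SK_PAGE_SHIFT)
#define SK_PAGE_OFFSET 0xC0000000UL

/* Logical (kernel linear) address -> physical address, in the spirit of __pa. */
unsigned long sk_pa(unsigned long addr)
{
    return addr - SK_PAGE_OFFSET;
}

/* Index of the page frame descriptor in mem_map, in the spirit of MAP_NR. */
unsigned long sk_map_nr(unsigned long addr)
{
    return sk_pa(addr) >> SK_PAGE_SHIFT;
}

/* Page frames needed to hold the descriptors for ram_mb megabytes of RAM,
 * with desc_size bytes per descriptor (rounded up to whole frames). */
unsigned long sk_mem_map_frames(unsigned long ram_mb, unsigned long desc_size)
{
    unsigned long nr_frames = (ram_mb << 20) >> SK_PAGE_SHIFT;
    unsigned long bytes = nr_frames * desc_size;

    return (bytes + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT;
}
```

With these assumptions, one megabyte of RAM holds 256 page frames, whose descriptors occupy 256 x 64 = 16 KB, that is, four page frames.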


    while (start_mem < end_mem) {
        clear_bit(PG_reserved, &mem_map[MAP_NR(start_mem)].flags);
        start_mem += PAGE_SIZE;
    }
    for (tmp = PAGE_OFFSET ; tmp < end_mem ; tmp += PAGE_SIZE) {
        if (tmp >= PAGE_OFFSET+0x1000000)
            clear_bit(PG_DMA, &mem_map[MAP_NR(tmp)].flags);
        if (PageReserved(mem_map+MAP_NR(tmp))) {
            if (tmp >= (unsigned long) &_text
                && tmp < (unsigned long) &_edata) {
                if (tmp < (unsigned long) &_etext)
                    codepages++;
                else
                    datapages++;
            } else if (tmp >= (unsigned long) &__init_begin
                       && tmp < (unsigned long) &__init_end)
                initpages++;
            else if (tmp >= (unsigned long) &__bss_start
                     && tmp < (unsigned long) start_mem)
                datapages++;
            else
                reservedpages++;
            continue;
        }
        mem_map[MAP_NR(tmp)].count = 1;
        free_page(tmp);
    }

First, the mem_init( ) function determines the value of num_physpages, the total number of page frames present in the system. It then counts the number of page frames of type PG_reserved. Several symbols produced while compiling the kernel (we described some of them in Section 2.5.3 in Chapter 2) enable the function to count the number of page frames reserved for the hardware, kernel code, and kernel data and the number of page frames used during kernel initialization that can be subsequently released.

Finally, mem_init( ) sets the count field of each page frame descriptor associated with the dynamic memory to 1 and calls the free_page( ) function (see Section 6.1.2 later in this chapter).
Since this function increments the value of the variable nr_free_pages, that variable will contain the total number of page frames in the dynamic memory at the end of the loop.

6.1.1 Requesting and Releasing Page Frames

After having seen how the kernel allocates and initializes the data structures for page frame handling, we now look at how page frames are allocated and released. Page frames can be requested by making use of four slightly differing functions and macros:

__get_free_pages(gfp_mask, order)
    Function used to request 2^order contiguous page frames.

__get_dma_pages(gfp_mask, order)
    Macro used to get page frames suitable for DMA; it expands to:


        __get_free_pages(gfp_mask | GFP_DMA, order)

__get_free_page(gfp_mask)
    Macro used to get a single page frame; it expands to:

        __get_free_pages(gfp_mask, 0)

get_free_page(gfp_mask)
    Function that invokes:

        __get_free_page(gfp_mask)

    and then fills the page frame obtained with zeros.

The parameter gfp_mask specifies how to look for free page frames. It consists of the following flags:

__GFP_WAIT
    Set if the kernel is allowed to discard the contents of page frames in order to free memory before satisfying the request.

__GFP_IO
    Set if the kernel is allowed to write pages to disk in order to free the corresponding page frames. (Since swapping can block the process in Kernel Mode, this flag must be cleared when handling interrupts or modifying critical kernel data structures.)

__GFP_DMA
    Set if the requested page frames must be suitable for DMA. (The hardware limitation that gives rise to this flag was explained in Section 6.1.)

__GFP_HIGH, __GFP_MED, __GFP_LOW
    Specify the request priority. __GFP_LOW is usually associated with dynamic memory requests issued by User Mode processes, while the other priorities are associated with kernel requests.

In practice, Linux uses the predefined combinations of flag values shown in Table 6-2; the group name is what you'll encounter in the source code.

Table 6-2. Groups of Flag Values Used to Request Page Frames

Group name    __GFP_WAIT    __GFP_IO    Priority
GFP_ATOMIC    0             0           __GFP_HIGH
GFP_BUFFER    1             0           __GFP_LOW
GFP_KERNEL    1             1           __GFP_MED
GFP_NFS       1             1           __GFP_HIGH
GFP_USER      1             1           __GFP_LOW
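To make the table concrete, here is a minimal sketch of how such groups can be built by OR-ing individual flags and then queried. The bit values below are assumptions chosen for the illustration, not the kernel's actual constants, and every SK_ or sk_ name is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bit assignments, chosen only for this sketch. */
#define SK_GFP_WAIT 0x01u
#define SK_GFP_IO   0x02u
#define SK_GFP_LOW  0x00u
#define SK_GFP_MED  0x04u
#define SK_GFP_HIGH 0x08u
#define SK_GFP_DMA  0x80u

/* Groups laid out as in Table 6-2. */
#define SK_GFP_ATOMIC (SK_GFP_HIGH)
#define SK_GFP_BUFFER (SK_GFP_LOW  | SK_GFP_WAIT)
#define SK_GFP_KERNEL (SK_GFP_MED  | SK_GFP_WAIT | SK_GFP_IO)
#define SK_GFP_NFS    (SK_GFP_HIGH | SK_GFP_WAIT | SK_GFP_IO)
#define SK_GFP_USER   (SK_GFP_LOW  | SK_GFP_WAIT | SK_GFP_IO)

/* May the allocator discard page contents to free memory? */
bool sk_may_discard(unsigned int mask)
{
    return (mask & SK_GFP_WAIT) != 0;
}

/* May the allocator write pages to disk (and therefore block)? */
bool sk_may_swap(unsigned int mask)
{
    return (mask & SK_GFP_IO) != 0;
}

/* Must the page frames be suitable for ISA DMA? */
bool sk_needs_dma(unsigned int mask)
{
    return (mask & SK_GFP_DMA) != 0;
}
```

Under this sketch, a request issued from an interrupt handler would use the atomic group, for which both sk_may_discard and sk_may_swap return false, while OR-ing in the DMA flag restricts any group to DMA-capable page frames.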


Page frames can be released through any of the following three functions and macros:

free_pages(addr, order)
    This function checks the page descriptor of the page frame having physical address addr; if the page frame is not reserved (i.e., if the PG_reserved flag is equal to 0), it decrements the count field of the descriptor. If count becomes 0, it assumes that 2^order contiguous page frames starting from addr are no longer used. In that case, the function invokes free_pages_ok( ) to insert the page frame descriptor of the first free page in the proper list of free page frames (described in the following section).

__free_page(p)
    Similar to the previous function, except that it releases the page frame whose descriptor is pointed to by parameter p.

free_page(addr)
    Macro used to release the page frame having physical address addr; it expands to free_pages(addr,0).

6.1.2 The Buddy System Algorithm

The kernel must establish a robust and efficient strategy for allocating groups of contiguous page frames. In doing so, it must deal with a well-known memory management problem called external fragmentation: frequent requests and releases of groups of contiguous page frames of different sizes may lead to a situation in which several small blocks of free page frames are "scattered" inside blocks of allocated page frames.
As a result, it may become impossible to allocate a large block of contiguous page frames, even if there are enough free pages to satisfy the request.

There are essentially two ways to avoid external fragmentation:

• Make use of the paging circuitry to map groups of noncontiguous free page frames into intervals of contiguous linear addresses.
• Develop a suitable technique to keep track of the existing blocks of free contiguous page frames, avoiding as much as possible the need to split up a large free block in order to satisfy a request for a smaller one.

The second approach is the one preferred by the kernel for two good reasons:

• In some cases, contiguous page frames are really necessary, since contiguous linear addresses are not sufficient to satisfy the request. A typical example is a memory request for buffers to be assigned to a DMA processor (see Chapter 13). Since the DMA ignores the paging circuitry and accesses the address bus directly while transferring several disk sectors in a single I/O operation, the buffers requested must be located in contiguous page frames.
• Even if contiguous page frame allocation is not strictly necessary, it offers the big advantage of leaving the kernel paging tables unchanged. What's wrong with modifying the page tables? As we know from Chapter 2, frequent page table modifications lead to higher average memory access times, since they make the CPU flush the contents of the translation lookaside buffers.

The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 10 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, and 512 contiguous page frames, respectively. The physical address of the first page frame of a block is a multiple of the group size: for example, the initial address of a 16-page-frame block is a multiple of 16 x 2^12. We'll show how the algorithm works through a simple example.

Assume there is a request for a group of 128 contiguous page frames (i.e., a half-megabyte). The algorithm checks first whether a free block in the 128-page-frame list exists. If there is no such block, the algorithm looks for the next larger block, that is, a free block in the 256-page-frame list. If such a block exists, the kernel allocates 128 of the 256 page frames to satisfy the request and inserts the remaining 128 page frames into the list of free 128-page-frame blocks. If there is no free 256-page block, it then looks for the next larger block, that is, a free 512-page-frame block.
If such a block exists, it allocates 128 of the 512 page frames to satisfy the request, inserts the first 256 of the remaining 384 page frames into the list of free 256-page-frame blocks, and inserts the last 128 of the remaining 384 page frames into the list of free 128-page-frame blocks. If the list of 512-page-frame blocks is empty, the algorithm gives up and signals an error condition.

The reverse operation, releasing blocks of page frames, gives rise to the name of this algorithm. The kernel attempts to merge together pairs of free buddy blocks of size b into a single block of size 2b. Two blocks are considered buddies if:

• Both blocks have the same size, say b.
• They are located in contiguous physical addresses.
• The physical address of the first page frame of the first block is a multiple of 2 x b x 2^12.

The algorithm is iterative; if it succeeds in merging released blocks, it doubles b and tries again so as to create even bigger blocks.

6.1.2.1 Data structures

Linux makes use of two different buddy systems: one handles the page frames suitable for ISA DMA, while the other one handles the remaining page frames. Each buddy system relies on the following main data structures:

• The mem_map array introduced previously.
• An array having 10 elements of type free_area_struct, one element for each group size.
  The variable free_area[0] points to the array used by the buddy system for the page frames that are not suitable for ISA DMA, while free_area[1] points to the array used by the buddy system for page frames suitable for ISA DMA.
• Ten binary arrays named bitmaps, one for each group size. Each buddy system has its own set of bitmaps, which it uses to keep track of the blocks it allocates.
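The three buddy conditions listed earlier can be checked with simple arithmetic on page frame numbers. The sketch below is an illustration only: it assumes 4 KB page frames, identifies each block by the number of its first page frame, and uses hypothetical sk_ names. The XOR shortcut in sk_buddy_of assumes that the block start is already aligned to the block size.

```c
#include <stdbool.h>

#define SK_PAGE_SHIFT 12  /* 4 KB page frames, as in the text */

/* Page frame number of the buddy of the block of 2^order frames that
 * starts at page frame number pfn (pfn must be a multiple of 2^order). */
unsigned long sk_buddy_of(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/* Check the conditions listed in the text for two blocks of b = 2^order
 * page frames starting at frame numbers first and second. Both blocks
 * have the same size by construction. */
bool sk_are_buddies(unsigned long first, unsigned long second,
                    unsigned int order)
{
    unsigned long b = 1UL << order;
    unsigned long first_addr = first << SK_PAGE_SHIFT;

    /* The blocks must be contiguous, with "first" at the lower address. */
    if (second != first + b)
        return false;

    /* The first block must start on a multiple of 2 x b x 2^12 bytes. */
    return first_addr % ((2 * b) << SK_PAGE_SHIFT) == 0;
}
```

For example, two 16-frame blocks starting at frames 0 and 16 are buddies, whereas the blocks starting at frames 16 and 32 are not, even though they are adjacent: merging them would produce a 32-frame block that does not start on a 32-frame boundary.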


Each element of the free_area[0] and free_area[1] arrays is a structure of type free_area_struct, which is defined as follows:

    struct free_area_struct {
        struct page *next;
        struct page *prev;
        unsigned int *map;
        unsigned long count;
    };

Notice that the first two fields of this structure match the corresponding fields of a page descriptor; in fact, pointers to free_area_struct structures are sometimes used as pointers to page descriptors.

The kth element of either the free_area[0] or the free_area[1] array is associated with a doubly linked circular list of blocks of size 2^k, implemented through the next and prev fields. Each member of such a list is the descriptor of the first page frame of a block. The count field of each free_area_struct structure stores the number of elements in the corresponding list.

The map field points to a bitmap whose size depends on the number of existing page frames. Each bit of the bitmap of the kth entry of either free_area[0] or free_area[1] describes the status of two buddy blocks of size 2^k page frames. If a bit of the bitmap is equal to 0, either both buddy blocks of the pair are free or both are busy; if it is equal to 1, exactly one of the blocks is busy. When both buddies are free, the kernel treats them as a single free block of size 2^(k+1).

Let us consider, for the sake of illustration, a system with 128 MB of RAM and the bitmaps associated with the non-DMA page frames.
The 128 MB can be divided into 32768 single pages, 16384 groups of 2 pages each, or 8192 groups of 4 pages each, and so on up to 64 groups of 512 pages each. So the bitmap corresponding to free_area[0][0] consists of 16384 bits, one for each pair of the 32768 existing page frames; the bitmap corresponding to free_area[0][1] consists of 8192 bits, one for each pair of blocks of two consecutive page frames; the last bitmap, corresponding to free_area[0][9], consists of 32 bits, one for each pair of blocks of 512 contiguous page frames.

Figure 6-2 illustrates with a simple example the use of the data structures introduced by the buddy system algorithm. The array mem_map contains nine free page frames grouped in one block of one (that is, a single page frame) at the top and two blocks of four further down. The double arrows denote doubly linked circular lists implemented by the next and prev fields. Notice that the bitmaps are not drawn to scale.
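The bitmap sizes in the 128 MB example follow from the fact that each bit covers one pair of buddy blocks. A short sketch of that arithmetic, with hypothetical sk_ helper names, could look like this:

```c
/* Number of bits in the bitmap for blocks of 2^order page frames:
 * one bit for each pair of buddy blocks. */
unsigned long sk_bitmap_bits(unsigned long nr_frames, unsigned int order)
{
    return nr_frames >> (order + 1);
}

/* Position of the bit describing the pair of buddy blocks of 2^order
 * page frames that contains page frame number pfn. */
unsigned long sk_bitmap_index(unsigned long pfn, unsigned int order)
{
    return pfn >> (order + 1);
}
```

With 32768 page frames this gives 16384 bits for order 0 and 32 bits for order 9, matching the figures in the text. The two 512-frame blocks starting at frames 0 and 512 share bit 0 of the order-9 bitmap, since they form one pair of buddies.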


Figure 6-2. Data structures used by the buddy system

6.1.2.2 Allocating a block

The __get_free_pages( ) function implements the buddy system strategy for allocating page frames. This function checks first whether there are enough free pages, that is, if nr_free_pages is greater than freepages.min. If not, it may decide to reclaim page frames (see Section 16.7.4 in Chapter 16). Otherwise, it goes on with the allocation by executing the code included in the RMQUEUE_TYPE macro:

    if (!(gfp_mask & __GFP_DMA))
        RMQUEUE_TYPE(order, 0);
    RMQUEUE_TYPE(order, 1);

The order parameter denotes the logarithm of the size of the requested block of free pages (0 for a one-page block, 1 for a two-page block, and so forth). The second parameter is the index into free_area, which is 0 for non-DMA blocks and 1 for DMA blocks. So the code checks gfp_mask to see whether non-DMA blocks are allowed and, if so, tries to get blocks from that list (index 0), because it would be better to save DMA blocks for requests that really need them. If the page frames are successfully allocated, the code in the RMQUEUE_TYPE macro executes a return statement, thus terminating the __get_free_pages( ) function. Otherwise, the code in the RMQUEUE_TYPE macro is executed again with the second parameter equal to 1, that is, the memory allocation request is satisfied using page frames suitable for DMA.

The code yielded by the RMQUEUE_TYPE macro is equivalent to the following fragments.
First, a few local variables are declared and initialized:

    struct free_area_struct * area = &free_area[type][order];
    unsigned long new_order = order;
    struct page *prev;
    struct page *ret;
    unsigned long map_nr;
    struct page * next;


The type variable represents the second parameter of the macro: it is equal to 0 when the macro operates on the buddy system for non-DMA page frames and to 1 otherwise.

The macro then performs a cyclic search through each list for an available block (denoted by an entry that doesn't point to the entry itself), starting with the list for the requested order and continuing if necessary to larger orders. This cycle is equivalent to the following structure:

    do {
        prev = (struct page *)area;
        ret = prev->next;
        if ((struct page *) area != ret)
            goto block_found;
        new_order++;
        area++;
    } while (new_order < 10);

If the while loop terminates, no suitable free block has been found, so __get_free_pages( ) returns a NULL value. Otherwise, a suitable free block has been found; in this case, the descriptor of its first page frame is removed from the list, the corresponding bitmap is updated, and the value of nr_free_pages is decreased:

    block_found:
        prev->next = ret->next;
        prev->next->prev = prev;
        map_nr = ret-mem_map;
        change_bit(map_nr>>(1+new_order), area->map);
        nr_free_pages -= 1 << order;
        area->count--;

If the block found comes from a list of size new_order greater than the requested size order, a while cycle is executed.
The rationale behind these lines of code is the following: when it becomes necessary to use a block of 2^k page frames to satisfy a request for 2^h page frames (h < k), the program allocates the last 2^h page frames and iteratively reassigns the first 2^k - 2^h page frames to the free_area lists having indexes between h and k.

    size = 1 << new_order;
    while (new_order > order) {
        area--;
        new_order--;
        size >>= 1;
        /* insert *ret as first element in the list
           and update the bitmap */
        next = area->next;
        ret->prev = (struct page *) area;
        ret->next = next;
        next->prev = ret;
        area->next = ret;
        area->count++;
        change_bit(map_nr >> (1+new_order), area->map);
        /* now take care of the second half of
           the free block starting at *ret */
        map_nr += size;
        ret += size;
    }
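The splitting step can be traced without any kernel data structures. The following sketch is a simplified model, not the kernel's code: it records which pieces would go back to the free lists instead of touching list pointers or bitmaps, and the sk_ names and the out[] array are hypothetical.

```c
#include <stddef.h>

struct sk_free_block {
    unsigned long map_nr;  /* first page frame of the block */
    unsigned int order;    /* the block holds 2^order page frames */
};

/* Split a free block of 2^new_order frames starting at map_nr to serve a
 * request for 2^order frames. Pieces handed back to the free lists are
 * recorded in out[] (at most new_order - order entries) and counted in
 * *n_out. Returns the first frame of the block kept for the caller. */
unsigned long sk_expand(unsigned long map_nr, unsigned int new_order,
                        unsigned int order, struct sk_free_block *out,
                        size_t *n_out)
{
    unsigned long size = 1UL << new_order;
    size_t n = 0;

    while (new_order > order) {
        new_order--;
        size >>= 1;
        /* The first half goes back to the list of order new_order. */
        out[n].map_nr = map_nr;
        out[n].order = new_order;
        n++;
        /* Carry on with the second half. */
        map_nr += size;
    }
    *n_out = n;
    return map_nr;
}
```

Serving a 128-frame request (order 7) from a 512-frame block (order 9) starting at frame 0 hands back frames 0 to 255 as an order-8 block and frames 256 to 383 as an order-7 block; the caller receives the last 128 frames, starting at frame 384.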


Finally, RMQUEUE_TYPE updates the count field of the page descriptor associated with the selected block and executes a return instruction:

    ret->count = 1;
    return PAGE_OFFSET + (map_nr << PAGE_SHIFT);

[...]

    ... >> (1 + order);
    unsigned long mask = (~0UL) ...


If the buddy block is not free, the function breaks out of the cycle; if it is free, the function detaches it from the corresponding list of free blocks. The block number of the buddy is derived from map_nr by switching a single bit:

    area->count--;
    next = mem_map[map_nr ^ -mask].next;
    prev = mem_map[map_nr ^ -mask].prev;
    next->prev = prev;
    prev->next = next;

At the end of each iteration, the function updates the mask, area, index, and map_nr variables:

    mask <<= 1;
    area++;
    index >>= 1;
    map_nr &= mask;

The function then continues the next iteration, trying to merge free blocks twice as large as the ones considered in the previous cycle. When the cycle is finished, the free block obtained cannot be further merged with other free blocks. It is then inserted in the proper list:

    next = area->next;
    mem_map[map_nr].prev = (struct page *) area;
    mem_map[map_nr].next = next;
    next->prev = &mem_map[map_nr];
    area->next = &mem_map[map_nr];
    area->count++;

6.2 Memory Area Management

This section deals with memory areas, that is, with sequences of memory cells having contiguous physical addresses and an arbitrary length.

The buddy system algorithm adopts the page frame as the basic memory area. This is fine for dealing with relatively large memory requests, but how are we going to deal with requests for small memory areas, say a few tens or hundreds of bytes?

Clearly, it would be quite wasteful to allocate a full page frame to store a few bytes. The correct approach instead consists of introducing new data structures that describe how small memory areas are allocated within the same page frame.
In doing so, we introduce a new problem called internal fragmentation. It is caused by a mismatch between the size of the memory request and the size of the memory area allocated to satisfy the request.

A classical solution adopted by Linux 2.0 consists of providing memory areas whose sizes are geometrically distributed: in other words, the size depends on a power of 2 rather than on the size of the data to be stored. In this way, no matter what the memory request size is, we can ensure that the internal fragmentation is always smaller than 50%. Following this approach, Linux 2.0 creates 13 geometrically distributed lists of free memory areas whose sizes range from 32 to 131056 bytes. The buddy system is invoked both to obtain additional page frames needed to store new memory areas and conversely to release page frames that no longer contain memory areas. A dynamic list is used to keep track of the free memory areas contained in each page frame.

6.2.1 The Slab Allocator

Running a memory area allocation algorithm on top of the buddy algorithm is not particularly efficient. Linux 2.2 reexamines the memory area allocation from scratch and comes up with some very clever improvements.

The new algorithm is derived from the slab allocator schema developed in 1994 for the Sun Microsystems Solaris 2.4 operating system. It is based on the following premises:

• The type of data to be stored may affect how memory areas are allocated; for instance, when allocating a page frame to a User Mode process, the kernel invokes the get_free_page( ) function, which fills the page with zeros.

  The concept of a slab allocator expands upon this idea and views the memory areas as objects consisting of both a set of data structures and a couple of functions or methods called the constructor and destructor: the former initializes the memory area while the latter deinitializes it.

  In order to avoid initializing objects repeatedly, the slab allocator does not discard the objects that have been allocated and then released but saves them in memory. When a new object is then requested, it can be taken from memory without having to be reinitialized.

  In practice, the memory areas handled by Linux do not need to be initialized or deinitialized.
  For efficiency reasons, Linux does not rely on objects that need constructor or destructor methods; the main motivation for introducing a slab allocator is to reduce the number of calls to the buddy system allocator. Thus, although the kernel fully supports the constructor and destructor methods, the pointers to these methods are NULL.
• The kernel functions tend to request memory areas of the same type repeatedly. For instance, whenever the kernel creates a new process, it allocates memory areas for some fixed-size tables such as the process descriptor, the open file object, and so on (see Chapter 3). When a process terminates, the memory areas used to contain these tables can be reused. Since processes are created and destroyed quite frequently, previous versions of the Linux kernel wasted time allocating and deallocating the page frames containing the same memory areas repeatedly; in Linux 2.2 they are saved in a cache and reused instead.
• Requests for memory areas can be classified according to their frequency. Requests of a particular size that are expected to occur frequently can be handled most efficiently by creating a set of special-purpose objects having the right size, thus avoiding internal fragmentation. Meanwhile, sizes that are rarely encountered can be handled through an allocation scheme based on objects in a series of geometrically distributed sizes (such as the power-of-2 sizes used in Linux 2.0), even if this approach leads to internal fragmentation.


• There is another subtle bonus in introducing objects whose sizes are not geometrically distributed: the initial addresses of the data structures are less prone to be concentrated on physical addresses whose values are a power of 2. This, in turn, leads to better performance by the processor hardware cache.
• Hardware cache performance creates an additional reason for limiting calls to the buddy system allocator as much as possible: every call to a buddy system function "dirties" the hardware cache, thus increasing the average memory access time. [1]

[1] The impact of a kernel function on the hardware cache is denoted as the function footprint; it is defined as the percentage of cache overwritten by the function when it terminates. Clearly, large footprints lead to a slower execution of the code executed right after the kernel function, since the hardware cache is by now filled with useless information.

The slab allocator groups objects into caches. Each cache is a "store" of objects of the same type. For instance, when a file is opened, the memory area needed to store the corresponding "open file" object is taken from a slab allocator cache named filp (for "file pointer"). The slab allocator caches used by Linux may be viewed at runtime by reading the /proc/slabinfo file.

The area of main memory that contains a cache is divided into slabs; each slab consists of one or more contiguous page frames that contain both allocated and free objects (see Figure 6-3).

Figure 6-3. The slab allocator components

The slab allocator never releases the page frames of an empty slab on its own. It would not know when free memory is needed, and there is no benefit to releasing objects when there is still plenty of free memory for new objects. Therefore, releases occur only when the kernel is looking for additional free page frames (see Section 6.2.12 later in this chapter and Section 16.7 in Chapter 16).

6.2.2 Cache Descriptor

Each cache is described by a table of type struct kmem_cache_s (which is equivalent to the type kmem_cache_t). The most significant fields of this table are:

c_name
    Points to the name of the cache.

c_firstp, c_lastp
    Point, respectively, to the first and last slab descriptor of the cache. The slab descriptors of a cache are linked together through a doubly linked, circular, partially ordered list: the first elements of the list include slabs with no free objects, then come the slabs that include used objects along with at least one free object, and finally the slabs that include only free objects.

c_freep
    Points to the s_nextp field of the first slab descriptor that includes at least one free object.

c_num
    Number of objects packed into a single slab. (All slabs of the cache have the same size.)

c_offset
    Size of the objects included in the cache. (This size may be rounded up if the initial addresses of the objects must be memory aligned.)

c_gfporder
    Logarithm of the number of contiguous page frames included in a single slab.

c_ctor, c_dtor
    Point, respectively, to the constructor and destructor methods associated with the cache objects. They are currently set to NULL, as stated earlier.

c_nextp
    Points to the next cache descriptor. All cache descriptors are linked together in a simple list by means of this field.

c_flags
    An array of flags that describes some permanent properties of the cache. There is, for instance, a flag that specifies which of two possible alternatives (see the following section) has been chosen to store the object descriptors in memory.

c_magic
    A magic number selected from a predefined set of values. Used to check both the current state of the cache and its consistency.

6.2.3 Slab Descriptor

Each slab of a cache has its own descriptor of type struct kmem_slab_s (equivalent to the type kmem_slab_t).

Slab descriptors can be stored in two possible places, the choice depending normally on the size of the objects in the slab.
If the object size is smaller than 512 bytes, the slab descriptor is stored at the end of the slab; otherwise, it is stored outside of the slab. The latter option is preferable for large objects whose sizes are a submultiple of the slab size. In some cases, the kernel may violate this rule by setting the c_flags field of the cache descriptor differently.

The most significant fields of a slab descriptor are:

s_inuse
    Number of objects in the slab that are currently allocated.

s_mem
    Points to the first object (either allocated or free) inside the slab.

s_freep
    Points to the first free object (if any) in the slab.

s_nextp, s_prevp
    Point, respectively, to the next and previous slab descriptor. The s_nextp field of the last slab descriptor in the list points to the c_offset field of the corresponding cache descriptor.

s_dma
    Flag set if the objects included in the slab can be used by the DMA processor.

s_magic
    Similar to the c_magic field of the cache descriptor. It contains a magic number selected from a predefined set of values and is used to check both the current state of the slab and its consistency. The values of this field are different from those of the corresponding c_magic field of the cache descriptor. The offset of s_magic within the slab descriptor is equal to the offset of c_magic with respect to c_offset inside the cache descriptor; the checking routine relies on their being the same.

Figure 6-4 illustrates the major relationships between cache and slab descriptors.
Full slabs precede partially full slabs, which in turn precede empty slabs.
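The ordering rule for the slab list can be checked mechanically. The following is a minimal sketch, not kernel code: the struct slab_info type and both function names are hypothetical, and an array stands in for the circular linked list. It only illustrates the "full, then partially full, then empty" invariant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified slab record: only the two counters that matter here. */
struct slab_info {
    unsigned inuse;   /* objects currently allocated (plays the role of s_inuse) */
    unsigned num;     /* objects per slab (plays the role of c_num) */
};

/* Rank a slab: 0 = full, 1 = partially full, 2 = empty. */
static int slab_rank(const struct slab_info *s)
{
    if (s->inuse == s->num)
        return 0;
    if (s->inuse > 0)
        return 1;
    return 2;
}

/* Return true if the slabs appear in the order full, partial, empty. */
bool slab_list_is_ordered(const struct slab_info *slabs, size_t n)
{
    assert(slabs != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (slab_rank(&slabs[i - 1]) > slab_rank(&slabs[i]))
            return false;
    return true;
}
```

The list is only partially ordered: slabs of the same kind may appear in any order among themselves, which is why the check compares ranks rather than exact counts.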


Figure 6-4. Relationships between cache and slab descriptors

6.2.4 General and Specific Caches

Caches are divided into two types: general and specific. General caches are used only by the slab allocator for its own purposes, while specific caches are used by the remaining parts of the kernel.

The general caches are:

• A first cache contains the cache descriptors of the remaining caches used by the kernel. The cache_cache variable contains its descriptor.
• A second cache contains the slab descriptors that are not stored inside the slabs. The cache_slabp variable points to its descriptor.
• Thirteen additional caches contain geometrically distributed memory areas. The table called cache_sizes, whose elements are of type cache_sizes_t, points to the 13 cache descriptors associated with memory areas of size 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, and 131072 bytes, respectively. The table cache_sizes is used to efficiently derive the cache address corresponding to a given size.

The kmem_cache_init( ) and kmem_cache_sizes_init( ) functions are invoked during system initialization to set up the general caches.

Specific caches are created by the kmem_cache_create( ) function.
Depending on the parameters, the function first determines the best way to handle the new cache (for instance, whether to include the slab descriptor inside or outside of the slab); it then creates a new cache descriptor for the new cache and inserts the descriptor in the cache_cache general cache. Note that once a cache has been created, it cannot be destroyed.

The names of all general and specific caches can be obtained at runtime by reading /proc/slabinfo; this file also specifies the number of free objects and the number of allocated objects in each cache.
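To make the role of the 13 geometrically distributed sizes concrete, here is a small sketch of how a request size can be mapped to the smallest general cache able to hold it. The table and the function name general_cache_size_for are illustrative assumptions; the kernel's own table holds cache descriptors, not just sizes.

```c
#include <stddef.h>

/* Sizes of the 13 general caches listed in the text; 0 terminates the table. */
static const size_t general_sizes[] = {
    32, 64, 128, 256, 512, 1024, 2048, 4096,
    8192, 16384, 32768, 65536, 131072, 0
};

/* Return the size of the smallest general cache that can hold `size` bytes,
 * or 0 if the request is larger than the largest general cache. */
size_t general_cache_size_for(size_t size)
{
    for (const size_t *p = general_sizes; *p != 0; p++)
        if (size <= *p)
            return *p;
    return 0;
}
```

Because each size is double the previous one, a request never wastes more than about half of the area it receives, at the cost of a short linear scan over at most 13 entries.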


6.2.5 Interfacing the Slab Allocator with the Buddy System

When the slab allocator creates new slabs, it relies on the buddy system algorithm to obtain a group of free contiguous page frames. To that purpose, it invokes the kmem_getpages( ) function:

    void * kmem_getpages(kmem_cache_t *cachep,
                         unsigned long flags, unsigned int *dma)
    {
        void *addr;
        *dma = flags & SLAB_DMA;
        addr = (void*) __get_free_pages(flags, cachep->c_gfporder);
        if (!*dma && addr) {
            struct page *page = mem_map + MAP_NR(addr);
            *dma = 1 << cachep->c_gfporder;
            [...]

cachep
    [...] c_gfporder field)

flags
    Specifies how the page frame is requested (see Section 6.1.1 earlier in this chapter)

dma
    Points to a variable that is set to 1 by kmem_getpages( ) if the allocated page frames are suitable for ISA DMA

In the reverse operation, page frames assigned to the slab allocator can be released (see Section 6.2.7 later in this chapter) by invoking the kmem_freepages( ) function:

    void kmem_freepages(kmem_cache_t *cachep, void *addr)
    {
        unsigned long i = (1 << cachep->c_gfporder);
        [...]
    }
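The two ideas behind this interface, that a slab spans 2^order contiguous page frames and that the group counts as DMA-capable only if every frame in it is, can be sketched in isolation. This is an assumption-laden illustration rather than the kernel routine: the frame_is_dma array is a hypothetical stand-in for the per-page DMA flag, and both function names are invented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of contiguous page frames in a slab of the given order (2^order). */
size_t frames_in_slab(unsigned order)
{
    return (size_t)1 << order;
}

/* True only if every frame in the group is flagged as DMA-capable.
 * `frame_is_dma` is a hypothetical stand-in for the per-page DMA flag. */
bool group_is_dma(const bool *frame_is_dma, unsigned order)
{
    assert(frame_is_dma != NULL);
    size_t n = frames_in_slab(order);
    for (size_t i = 0; i < n; i++)
        if (!frame_is_dma[i])
            return false;
    return true;
}
```

For example, an order of 2 describes a slab of four page frames, and a single non-DMA frame among them is enough to make the whole group unsuitable.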


The kmem_freepages( ) function releases the page frames, starting from the one having physical address addr, that had been allocated to the slab of the cache identified by cachep.

6.2.6 Allocating a Slab to a Cache

A newly created cache does not contain any slab and therefore no free objects. New slabs are assigned to a cache only when both of the following are true:

• A request has been issued to allocate a new object.
• The cache does not include any free object.

When this occurs, the slab allocator assigns a new slab to the cache by invoking kmem_cache_grow( ). This function calls kmem_getpages( ) to obtain a group of page frames from the buddy system; it then calls kmem_cache_slabmgmt( ) to get a new slab descriptor. Next, it calls kmem_cache_init_objs( ), which applies the constructor method (if defined) to all the objects contained in the new slab.
It then calls kmem_slab_link_end( ), which inserts the slab descriptor at the end of the cache slab list:

    void kmem_slab_link_end(kmem_cache_t *cachep, kmem_slab_t *slabp)
    {
        kmem_slab_t *lastp = cachep->c_lastp;
        slabp->s_nextp = kmem_slab_end(cachep);
        slabp->s_prevp = lastp;
        cachep->c_lastp = slabp;
        lastp->s_nextp = slabp;
    }

The kmem_slab_end macro yields the address of the c_offset field of the corresponding cache descriptor (as stated before, the last element of a slab list points to that field).

After inserting the new slab descriptor into the list, kmem_cache_grow( ) loads the next and prev fields, respectively, of the descriptors of all page frames included in the new slab with the address of the cache descriptor and the address of the slab descriptor. This works correctly because the next and prev fields are used by functions of the buddy system only when the page frame is free, while page frames handled by the slab allocator functions are not free as far as the buddy system is concerned. Therefore, the buddy system will not be confused by this specialized use of the page frame descriptor.

6.2.7 Releasing a Slab from a Cache

As stated previously, the slab allocator never releases the page frames of an empty slab on its own.
In fact, a slab is released only if both the following conditions hold:

• The buddy system is unable to satisfy a new request for a group of page frames.
• The slab is empty, that is, all the objects included in it are free.

When the kernel looks for additional free page frames, it calls try_to_free_pages( ); this function, in turn, may invoke kmem_cache_reap( ), which selects a cache that contains at least one empty slab. The kmem_slab_unlink( ) function then removes the slab from the cache list of slabs:


    void kmem_slab_unlink(kmem_slab_t *slabp)
    {
        kmem_slab_t *prevp = slabp->s_prevp;
        kmem_slab_t *nextp = slabp->s_nextp;
        prevp->s_nextp = nextp;
        nextp->s_prevp = prevp;
    }

Subsequently, the slab, together with the objects in it, is destroyed by invoking kmem_slab_destroy( ):

    void kmem_slab_destroy(kmem_cache_t *cachep, kmem_slab_t *slabp)
    {
        if (cachep->c_dtor) {
            unsigned long num = cachep->c_num;
            void *objp = slabp->s_mem;
            do {
                (cachep->c_dtor)(objp, cachep, 0);
                objp += cachep->c_offset;
                if (!slabp->s_index)
                    objp += sizeof(kmem_bufctl_t);
            } while (--num);
        }
        slabp->s_magic = SLAB_MAGIC_DESTROYED;
        if (slabp->s_index)
            kmem_cache_free(cachep->c_index_cachep, slabp->s_index);
        kmem_freepages(cachep, slabp->s_mem-slabp->s_offset);
        if (SLAB_OFF_SLAB(cachep->c_flags))
            kmem_cache_free(cache_slabp, slabp);
    }

The function checks whether the cache has a destructor method for its objects (the c_dtor field is not NULL), in which case it applies the destructor to all the objects in the slab; the objp local variable keeps track of the currently examined object. Next, it calls kmem_freepages( ), which returns all the contiguous page frames used by the slab to the buddy system. Finally, if the slab descriptor is stored outside of the slab (in this case the s_index and c_index_cachep fields are not NULL, as explained later in this chapter), the function releases it from the cache of the slab descriptors.

Some modules of Linux (see Appendix B) may create caches. In order to avoid wasting memory space, the kernel must destroy all slabs in all caches created by a module before removing it.
[2] The kmem_cache_shrink( ) function destroys all the slabs in a cache by invoking kmem_slab_destroy( ) iteratively. The c_growing field of the cache descriptor is used to prevent kmem_cache_shrink( ) from shrinking a cache while another kernel control path attempts to allocate a new slab for it.

[2] We stated previously that Linux does not destroy caches. Thus, when linking in a new module, the kernel must check whether the new cache descriptors requested by it were already created in a previous installation of that module or another one.

6.2.8 Object Descriptor

Each object has a descriptor of type struct kmem_bufctl_s (equivalent to the type kmem_bufctl_t). Like the slab descriptors themselves, the object descriptors of a slab can be stored in two possible ways, illustrated by Figure 6-5.


Figure 6-5. Relationships between slab and object descriptors

External object descriptors
    Stored outside the slab, in one of the general caches pointed to by cache_sizes. In this case, the first object descriptor in the memory area describes the first object in the slab and so on. The size of the memory area, and thus the particular general cache used to store object descriptors, depends on the number of objects stored in the slab (c_num field of the cache descriptor). The cache containing the objects themselves is tied to the cache containing their descriptors through two fields. First, the c_index_cachep field of the cache containing the slab points to the cache descriptor of the cache containing the object descriptors. Second, the s_index field of the slab descriptor points to the memory area containing the object descriptors.

Internal object descriptors
    Stored inside the slab, right after the objects they describe. In this case, the c_index_cachep field of the cache descriptor and the s_index field of the slab descriptor are both NULL.

The slab allocator chooses the first solution when the size of the objects is a multiple of 512, 1024, 2048, or 4096: in this case, storing control structures inside the slab would result in a high level of internal fragmentation.
If the size of the objects is smaller than 512 bytes or not a multiple of 512, 1024, 2048, or 4096, the slab allocator stores the object descriptors inside the slab.

Object descriptors are simple structures consisting of a single field:


    typedef struct kmem_bufctl_s {
        union {
            struct kmem_bufctl_s * buf_nextp;
            kmem_slab_t *buf_slabp;
            void *buf_objp;
        } u;
    } kmem_bufctl_t;
    #define buf_nextp u.buf_nextp
    #define buf_slabp u.buf_slabp
    #define buf_objp u.buf_objp

This field has the following meaning, depending on the state of the object and the locations of the object descriptors:

buf_nextp
    If the object is free, it points to the next free object in the slab, thus implementing a simple list of free objects inside the slab.

buf_objp
    If the object is allocated and its object descriptor is stored outside of the slab, it points to the object.

buf_slabp
    If the object is allocated and its object descriptor is stored inside the slab, it points to the slab descriptor of the slab in which the object is stored. This holds whether the slab descriptor is stored inside or outside of the slab.

Figure 6-5 illustrates the relationships among slabs, slab descriptors, objects, and object descriptors. Notice that, although the figure suggests that the slab descriptor is stored outside of the slab, the figure remains unchanged if the descriptor is stored inside it.

6.2.9 Aligning Objects in Memory

The objects managed by the slab allocator can be aligned in memory, that is, they can be stored in memory cells whose initial physical addresses are multiples of a given constant, usually a power of 2. This constant is called the alignment factor, and its value is stored in the c_align field of the cache descriptor. The c_offset field, which contains the object size, takes into account the number of padding bytes added to obtain the proper alignment.
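The padding described here amounts to rounding the object size up to the next multiple of the alignment factor. A minimal sketch follows, assuming the factor is either 0 (no alignment) or a power of 2; the function name aligned_object_size is hypothetical and does not appear in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Round `size` up to a multiple of `align`.
 * align == 0 means no alignment is requested; otherwise align is
 * assumed to be a power of 2, which the assert checks. */
size_t aligned_object_size(size_t size, size_t align)
{
    if (align == 0)
        return size;
    assert((align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

With a factor of 4, a 10-byte object would occupy 12 bytes, the extra 2 bytes being the padding that a size field such as c_offset has to account for.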
If the value of c_align is 0, no alignment is required for the objects.

The largest alignment factor allowed by the slab allocator is 4096, that is, the page frame size. This means that objects can be aligned by referring either to their physical addresses or to their linear addresses: in both cases, only the 12 least significant bits of the address may be altered by the alignment.

Usually, microcomputers access memory cells more quickly if their physical addresses are aligned with respect to the word size, that is, to the width of the internal memory bus of the computer. Thus, the kmem_cache_create( ) function attempts to align objects according to the word size specified by the BYTES_PER_WORD macro. For Intel Pentium processors, the macro yields the value 4 because the word is 32 bits long. However, the function does not align objects if this leads to a significant waste of memory.

When creating a new cache, it's possible to specify that the objects included in it be aligned in the first-level cache. To achieve this, set the SLAB_HWCACHE_ALIGN cache descriptor flag. The kmem_cache_create( ) function handles the request as follows:

• If the object's size is greater than half of a cache line, it is aligned in RAM to a multiple of L1_CACHE_BYTES, that is, at the beginning of the line.
• Otherwise, the object size is rounded up to a submultiple (an exact divisor) of L1_CACHE_BYTES; this ensures that an object will never span across two cache lines.

Clearly, what the slab allocator is doing here is trading memory space for access time: it gets better cache performance by artificially increasing the object size, thus causing additional internal fragmentation.

6.2.10 Slab Coloring

We know from Chapter 2 that the same hardware cache line maps many different blocks of RAM. In this chapter we have also seen that objects of the same size tend to be stored at the same offset within a cache. Objects that have the same offset within different slabs will, with a relatively high probability, end up mapped in the same cache line. The cache hardware might therefore waste memory cycles transferring two objects from the same cache line back and forth to different RAM locations, while other cache lines go underutilized.
The slab allocator tries to reduce this unpleasant cache behavior by a policy called slab coloring: different arbitrary values called colors are assigned to the slabs.

Before examining slab coloring, we have to look at the layout of objects in the cache. Let us consider a cache whose objects are aligned in RAM. Thus, the c_align field of the cache descriptor has a positive value, say aln. Even taking into account the alignment constraint, there are many possible ways to place objects inside the slab. The choices depend on decisions made for the following variables:

num
    Number of objects that can be stored in a slab (its value is in the c_num field of the cache descriptor).

osize
    Object size including the alignment bytes (its value is in the c_offset field) plus object descriptor size (if the descriptor is contained inside the slab).

dsize
    Slab descriptor size; its value is equal to 0 if the slab descriptor is stored outside of the slab.


free
    Number of unused bytes (bytes not assigned to any object) inside the slab.

The total length in bytes of a slab can then be expressed as:

    slab length = (num x osize) + dsize + free

free is always smaller than osize, since otherwise it would be possible to place additional objects inside the slab. However, free could be greater than aln.

The slab allocator takes advantage of the free unused bytes to color the slab. The term "color" is used simply to subdivide the slabs and allow the memory allocator to spread objects out among different linear addresses. In this way, the kernel obtains the best possible performance from the microprocessor's hardware cache.

Slabs having different colors store the first object of the slab in different memory locations, while satisfying the alignment constraint. The number of available colors is free/aln+1. The first color is denoted as 0 and the last one (whose value is in the c_colour field of the cache descriptor) is denoted as free/aln.

If a slab is colored with color col, the offset of the first object (with respect to the slab initial address) is equal to col x aln bytes; this value is stored in the s_offset field of the slab descriptor. Figure 6-6 illustrates how the placement of objects inside the slab depends on the slab color. Coloring essentially leads to moving some of the free area of the slab from the end to the beginning.

Figure 6-6. Slab with color col and alignment aln

Coloring works only when free is large enough.
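The slab length formula and the color arithmetic can be worked through with concrete numbers. The sketch below is illustrative only: the three function names are hypothetical, and the figures in the usage note are invented values chosen so that free stays smaller than osize.

```c
#include <assert.h>
#include <stddef.h>

/* Unused bytes left in a slab: slab length - (num x osize) - dsize. */
size_t slab_free_bytes(size_t slab_len, size_t num, size_t osize, size_t dsize)
{
    assert(slab_len >= num * osize + dsize);
    return slab_len - num * osize - dsize;
}

/* Number of available colors: free/aln + 1 (aln must be nonzero). */
size_t slab_color_count(size_t free_bytes, size_t aln)
{
    assert(aln != 0);
    return free_bytes / aln + 1;
}

/* Offset of the first object for color `col`: col x aln bytes. */
size_t slab_color_offset(size_t col, size_t aln)
{
    return col * aln;
}
```

As a made-up example, a 4096-byte slab holding 25 objects of 160 bytes plus a 32-byte slab descriptor leaves 64 free bytes; with aln equal to 32 this yields three colors (0, 1, and 2), placing the first object at offset 0, 32, or 64.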
Clearly, if no alignment is required for the objects or if the number of unused bytes inside the slab is smaller than the required alignment (free < aln), the only possible slab coloring is the one having the color 0, that is, the one that assigns a zero offset to the first object.

The various colors are distributed equally among slabs of a given object type by storing the current color in a field of the cache descriptor called c_colour_next. The kmem_cache_grow( ) function assigns the color specified by c_colour_next to a new slab and then decrements the value of this field. After reaching 0, it wraps around again to the maximum available value:


    if (!(offset = cachep->c_colour_next--))
        cachep->c_colour_next = cachep->c_colour;
    offset *= cachep->c_align;
    slabp->s_offset = offset;

In this way, each slab is created with a different color from the previous one, up to the maximum available colors.

6.2.11 Allocating an Object to a Cache

New objects may be obtained by invoking the kmem_cache_alloc( ) function. The parameter cachep points to the cache descriptor from which the new free object must be obtained. kmem_cache_alloc( ) first checks whether the cache descriptor exists; it then retrieves from the c_freep field the address of the s_nextp field of the first slab that includes at least one free object:

    slabp = cachep->c_freep;

If slabp does not point to a slab, it then jumps to alloc_new_slab and invokes kmem_cache_grow( ) to add a new slab to the cache:

    if (slabp->s_magic != SLAB_MAGIC_ALLOC)
        goto alloc_new_slab;

The value SLAB_MAGIC_ALLOC in the s_magic field indicates that the slab contains at least one free object.
If the slab is full, slabp points to the cachep->c_offset field, and thus slabp->s_magic coincides with cachep->c_magic: in this case, however, this field contains a magic number for the cache different from SLAB_MAGIC_ALLOC.

After obtaining a slab with a free object, the function increments the counter containing the number of objects currently allocated in the slab:

    slabp->s_inuse++;

It then loads bufp with the address of the first free object inside the slab and, correspondingly, updates the slabp->s_freep field of the slab descriptor to point to the next free object:

    bufp = slabp->s_freep;
    slabp->s_freep = bufp->buf_nextp;

If slabp->s_freep becomes NULL, the slab no longer includes free objects, so the c_freep field of the cache descriptor must be updated:

    if (!slabp->s_freep)
        cachep->c_freep = slabp->s_nextp;

Notice that there is no need to change the position of the slab descriptor inside the list since it remains partially ordered. Now the function must derive the address of the free object and update the object descriptor.

If the slabp->s_index field is null, the object descriptors are stored right after the objects inside the slab. In this case, the address of the slab descriptor is first stored in the object descriptor's single field to denote the fact that the object is no longer free; then the object address is derived by subtracting from the address of the object descriptor the object size included in the cachep->c_offset field:

    if (!slabp->s_index) {
        bufp->buf_slabp = slabp;
        objp = ((void*)bufp) - cachep->c_offset;
    }

If the slabp->s_index field is not null, it points to a memory area outside of the slab where the object descriptors are stored. In this case, the function first computes the relative position of the object descriptor in the outside memory area; it then multiplies this number by the object size; finally, it adds the result to the address of the first object in the slab, thus yielding the address of the object to be returned. As in the previous case, the object descriptor single field is updated and points now to the object:

    if (slabp->s_index) {
        objp = ((bufp-slabp->s_index)*cachep->c_offset) + slabp->s_mem;
        bufp->buf_objp = objp;
    }

The function terminates by returning the address of the new object:

    return objp;

6.2.12 Releasing an Object from a Cache

The kmem_cache_free( ) function releases an object previously obtained by the slab allocator. Its parameters are cachep, the address of the cache descriptor, and objp, the address of the object to be released. The function starts by checking the parameters, after which it determines the address of the object descriptor and that of the slab containing the object.
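When object descriptors are kept outside the slab, the position of a descriptor in the external area and the address of its object determine each other, as the address arithmetic in Section 6.2.11 shows. The following sketch isolates that arithmetic; it uses plain char pointers and hypothetical function names rather than the kernel's structures, so it should be read as an illustration of the index mapping only.

```c
#include <assert.h>
#include <stddef.h>

/* Address of the object whose external descriptor has position `index`:
 * address of the first object plus index times the object size. */
char *object_from_index(char *first_obj, size_t obj_size, size_t index)
{
    return first_obj + index * obj_size;
}

/* Inverse mapping: position of the descriptor that belongs to the
 * object at `obj`, given where the first object starts. */
size_t index_from_object(const char *first_obj, size_t obj_size, const char *obj)
{
    assert(obj_size != 0 && obj >= first_obj);
    return (size_t)(obj - first_obj) / obj_size;
}
```

The first function corresponds to what an allocation needs (descriptor position to object address); the second is the same relation read in the opposite direction, which is what a release needs.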
The function uses the cachep->c_flags flag, included in the cache descriptor, to determine whether the object descriptor is located inside or outside of the slab.

In the former case, it determines the address of the object descriptor by adding the object's size to its initial address. The address of the slab descriptor is then extracted from the appropriate field in the object descriptor:

    if (!SLAB_BUFCTL(cachep->c_flags)) {
        bufp = (kmem_bufctl_t *)(objp+cachep->c_offset);
        slabp = bufp->buf_slabp;
    }

In the latter case, it determines the address of the slab descriptor from the prev field of the descriptor of the page frame containing the object (refer to Section 6.2.6 for the role of prev). The address of the object descriptor is derived by first computing the sequence number of the object inside the slab (the difference between the object address and the first object address, divided by the object length). This number is then used to determine the position of the object descriptor starting from the beginning of the outside area pointed to by the slabp->s_index field of the slab descriptor. To be on the safe side, the function checks that the object's address passed as a parameter coincides with the address that its object descriptor says it should have:


    if (SLAB_BUFCTL(cachep->c_flags)) {
        slabp = (kmem_slab_t *)((&mem_map[MAP_NR(objp)])->prev);
        bufp = &slabp->s_index[(objp - slabp->s_mem) / cachep->c_offset];
        if (objp != bufp->buf_objp)
            goto bad_obj_addr;
    }

Now the function checks whether the slabp->s_magic field of the slab descriptor contains the correct magic number and whether the slabp->s_inuse field is greater than 0. If everything is okay, it decrements the value of slabp->s_inuse and inserts the object into the slab list of free objects:

    slabp->s_inuse--;
    bufp->buf_nextp = slabp->s_freep;
    slabp->s_freep = bufp;

If bufp->buf_nextp is NULL, the list of free objects includes only one element: the object that is being released. In this case, the slab was previously filled to capacity and it might be necessary to reinsert its slab descriptor in a new position in the list of slab descriptors. (Remember that completely filled slabs appear before slabs with some free objects in the partially ordered list.) This is done by the kmem_cache_one_free( ) function:

    if (!bufp->buf_nextp)
        kmem_cache_one_free(cachep, slabp);

If the slab includes other free objects besides the one being released, it is necessary to check whether all objects are free. As in the previous case, this would make it necessary to reinsert the slab descriptor in a new position in the list of slab descriptors.
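The per-slab free list and the two situations that may require moving the slab descriptor can be modeled in a few lines. This is a simplified sketch under stated assumptions: struct bufctl, struct slab, and both functions are hypothetical stand-ins, and the return codes are invented to make the two cases visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified descriptors: one link per object, one record per slab. */
struct bufctl {
    struct bufctl *next;    /* plays the role of buf_nextp */
};

struct slab {
    struct bufctl *freep;   /* first free object, like s_freep */
    unsigned inuse;         /* allocated objects, like s_inuse */
};

/* Take the first free descriptor off the list; NULL if the slab is full. */
struct bufctl *slab_take(struct slab *s)
{
    struct bufctl *b = s->freep;
    if (b == NULL)
        return NULL;
    s->freep = b->next;
    s->inuse++;
    return b;
}

/* Put a descriptor back on the free list. Returns 1 if the slab was full
 * before the call, 2 if the slab is now completely free, 0 otherwise. */
int slab_give_back(struct slab *s, struct bufctl *b)
{
    assert(s->inuse > 0);
    s->inuse--;
    b->next = s->freep;
    s->freep = b;
    if (b->next == NULL)
        return 1;
    return s->inuse == 0 ? 2 : 0;
}
```

Return value 1 corresponds to the case in which the released object is the only element of the free list, and return value 2 to the case in which the last allocated object of a slab has just been released; in both cases the slab may no longer sit in the right region of the partially ordered list.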
The move is done by the kmem_cache_full_free( ) function:

    if (bufp->buf_nextp)
        if (!slabp->s_inuse)
            kmem_cache_full_free(cachep, slabp);

The kmem_cache_free( ) function terminates here.

6.2.13 General Purpose Objects

As stated in Section 6.1.2, infrequent requests for memory areas are handled through a group of general caches whose objects have geometrically distributed sizes ranging from a minimum of 32 to a maximum of 131072 bytes.

Objects of this type are obtained by invoking the kmalloc( ) function:

    void * kmalloc(size_t size, int flags)
    {
        cache_sizes_t *csizep = cache_sizes;
        for (; csizep->cs_size; csizep++) {
            if (size > csizep->cs_size)
                continue;
            return __kmem_cache_alloc(csizep->cs_cachep, flags);
        }
        printk(KERN_ERR "kmalloc: Size (%lu) too large\n",
               (unsigned long) size);
        return NULL;
    }

The function uses the cache_sizes table to locate the cache descriptor of the cache containing objects of the right size. It then calls kmem_cache_alloc( ) to allocate the object. [3]

[3] Actually, for efficiency reasons, the code of kmem_cache_alloc( ) is copied inside the body of kmalloc( ). The __kmem_cache_alloc( ) function, which implements kmem_cache_alloc( ), is declared inline.

Objects obtained by invoking kmalloc( ) can be released by calling kfree( ): [4]

[4] A similar function called kfree_s( ) requires an additional parameter, namely, the size of the object to be released. This function was used in previous versions of Linux where the size of the memory area had to be determined before releasing it. It is still used by some modules of the filesystem.

    void kfree(const void *objp)
    {
        struct page *page;
        int nr;
        if (!objp)
            goto null_ptr;
        nr = MAP_NR(objp);
        if (nr >= num_physpages)
            goto bad_ptr;
        page = &mem_map[nr];
        if (PageSlab(page)) {
            kmem_cache_t *cachep;
            cachep = (kmem_cache_t *)(page->next);
            if (cachep && (cachep->c_flags & SLAB_CFLGS_GENERAL)) {
                __kmem_cache_free(cachep, objp);
                return;
            }
        }
    bad_ptr:
        printk(KERN_ERR "kfree: Bad obj %p\n", objp);
        *(int *) 0 = 0; /* FORCE A KERNEL DUMP */
    null_ptr:
        return;
    }

The proper cache descriptor is identified by reading the next field of the descriptor of the first page frame containing the memory area.
If this field points to a valid descriptor, the memory area is released by invoking kmem_cache_free( ).

6.3 Noncontiguous Memory Area Management

We already know from an earlier discussion that it is preferable to map memory areas into sets of contiguous page frames, thus making better use of the cache and achieving lower average memory access times. Nevertheless, if the requests for memory areas are infrequent, it makes sense to consider an allocation schema based on noncontiguous page frames accessed through contiguous linear addresses. The main advantage of this schema is to avoid external fragmentation, while the disadvantage is that it is necessary to fiddle with the kernel page tables. Clearly, the size of a noncontiguous memory area must be a multiple of 4096. Linux uses noncontiguous memory areas sparingly, for instance, to allocate data structures for active swap areas (see Section 16.2.3 in Chapter 16), to allocate space for a module (see Appendix B), or to allocate buffers to some I/O drivers.

6.3.1 Linear Addresses of Noncontiguous Memory Areas

To find a free range of linear addresses, we can look in the area starting from PAGE_OFFSET (usually 0xc0000000, the beginning of the fourth gigabyte). We learned in Section 2.5.4 in Chapter 2 that the kernel reserves this whole upper area of memory to map available RAM for kernel use. But available RAM occupies only a small fraction of the gigabyte, starting at the PAGE_OFFSET address. All the linear addresses above that reserved area are available for mapping noncontiguous memory areas. The linear address that corresponds to the end of physical memory is stored in the high_memory variable.

Figure 6-7 shows how linear addresses are assigned to noncontiguous memory areas. A safety interval of size 8 MB (macro VMALLOC_OFFSET) is inserted between the end of the physical memory and the first memory area; its purpose is to "capture" out-of-bounds memory accesses. For the same reason, additional safety intervals of size 4 KB are inserted to separate noncontiguous memory areas.

Figure 6-7. The linear address interval starting from PAGE_OFFSET

The VMALLOC_START macro defines the starting address of the linear space reserved for noncontiguous memory areas. It is defined as follows:

    #define VMALLOC_START (((unsigned long) high_memory + \
            VMALLOC_OFFSET) & ~(VMALLOC_OFFSET-1))

6.3.2 Descriptors of Noncontiguous Memory Areas

Each noncontiguous memory area is associated with a descriptor of type struct vm_struct:

    struct vm_struct {
        unsigned long flags;
        void * addr;
        unsigned long size;
        struct vm_struct * next;
    };

These descriptors are inserted in a simple list by means of the next field; the address of the first element of the list is stored in the vmlist variable. The addr field contains the linear address of the first memory cell of the area; the size field contains the size of the area plus 4096 (the size of the previously mentioned interarea safety interval).
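As a worked example of the rounding performed by VMALLOC_START, the following user-space sketch recomputes the macro for a given end-of-RAM address. The helper name vmalloc_start_for( ) and the use of a fixed 32-bit type are assumptions made for illustration; only the 8 MB offset and the formula itself come from the text above.

```c
#include <stdint.h>

/* 8 MB safety interval, as described in the text. */
#define VMALLOC_OFFSET (8UL * 1024 * 1024)

/*
 * Hypothetical helper (not a kernel function): applies the
 * VMALLOC_START formula to a 32-bit linear address standing in for
 * high_memory. Adding VMALLOC_OFFSET and then clearing the low 23 bits
 * yields an 8 MB-aligned address that lies above high_memory by more
 * than 0 and at most 8 MB.
 */
static uint32_t vmalloc_start_for(uint32_t high_memory)
{
    return (uint32_t)((high_memory + VMALLOC_OFFSET) &
                      ~(VMALLOC_OFFSET - 1));
}
```

With 64 MB of RAM mapped at 0xc0000000, high_memory is 0xc4000000 and the sketch yields 0xc4800000, a full 8 MB gap. With 63 MB of RAM (high_memory equal to 0xc3f00000) it yields 0xc4000000, so the gap shrinks to 1 MB: because of the alignment, the safety interval is 8 MB only when the end of RAM is itself 8 MB-aligned.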


The get_vm_area( ) function creates new descriptors of type struct vm_struct; its parameter size specifies the size of the new memory area:

    struct vm_struct * get_vm_area(unsigned long size)
    {
        unsigned long addr;
        struct vm_struct **p, *tmp, *area;

        area = (struct vm_struct *) kmalloc(sizeof(*area), GFP_KERNEL);
        if (!area)
            return NULL;
        addr = VMALLOC_START;
        for (p = &vmlist; (tmp = *p) ; p = &tmp->next) {
            if (size + addr < (unsigned long) tmp->addr)
                break;
            addr = tmp->size + (unsigned long) tmp->addr;
            if (addr > 0xffffd000-size) {
                kfree(area);
                return NULL;
            }
        }
        area->addr = (void *)addr;
        area->size = size + PAGE_SIZE;
        area->next = *p;
        *p = area;
        return area;
    }

The function first calls kmalloc( ) to obtain a memory area for the new descriptor. It then scans the list of descriptors of type struct vm_struct looking for an available range of linear addresses that includes at least size+4096 addresses. If such an interval exists, the function initializes the fields of the descriptor and terminates by returning the address of the descriptor, whose addr field holds the initial linear address of the noncontiguous memory area. Otherwise, when addr + size exceeds the 4 GB limit, get_vm_area( ) releases the descriptor and returns NULL.

6.3.3 Allocating a Noncontiguous Memory Area

The vmalloc( ) function allocates a noncontiguous memory area to the kernel. The parameter size denotes the size of the requested area. If the function is able to satisfy the request, then it returns the initial linear address of the new area; otherwise, it returns a NULL pointer:

    void * vmalloc(unsigned long size)
    {
        void * addr;
        struct vm_struct *area;

        size = (size+PAGE_SIZE-1)&PAGE_MASK;
        if (!size || size > (num_physpages << PAGE_SHIFT))
            return NULL;
        area = get_vm_area(size);
        if (!area)
            return NULL;
        addr = area->addr;
        if (vmalloc_area_pages((unsigned long) addr, size)) {
            vfree(addr);
            return NULL;
        }
        return addr;
    }

The function starts by rounding up the value of the size parameter to a multiple of 4096 (the page frame size). It also performs a sanity check to make sure the size is greater than 0 and less than or equal to the total size of the existing page frames. If the size fits available memory, vmalloc( ) invokes get_vm_area( ), which creates a new descriptor and assigns a range of linear addresses to the memory area. Then vmalloc( ) invokes vmalloc_area_pages( ) to request noncontiguous page frames and terminates by returning the initial linear address of the noncontiguous memory area.

The vmalloc_area_pages( ) function makes use of two parameters: address, the initial linear address of the area, and size, its size. The linear address of the end of the area is assigned to the end local variable:

    end = address + size;

The function then uses the pgd_offset_k macro to derive the entry in the Page Global Directory related to the initial linear address of the area:

    dir = pgd_offset_k(address);

The function then executes the following cycle:

    while (address < end) {
        pmd_t *pmd = pmd_alloc_kernel(dir, address);
        if (!pmd)
            return -ENOMEM;
        if (alloc_area_pmd(pmd, address, end - address))
            return -ENOMEM;
        set_pgdir(address, *dir);
        address = (address + PGDIR_SIZE) & PGDIR_MASK;
        dir++;
    }

In each cycle, it first invokes pmd_alloc_kernel( ) to create a Page Middle Directory for the new area. It then calls alloc_area_pmd( ) to allocate all the Page Tables associated with the new Page Middle Directory. Next, it invokes set_pgdir( ) to update the entry corresponding to the new Page Middle Directory in all existing Page Global Directories (see Section 2.5.4 in Chapter 2). It adds the constant 2^22, that is, the size of the range of linear addresses spanned by a single Page Middle Directory, to the current value of address, and it increments the pointer dir into the Page Global Directory.

The cycle is repeated until all page table entries referring to the noncontiguous memory area have been set up.

The alloc_area_pmd( ) function executes a similar cycle for all the Page Tables that a Page Middle Directory points to:


    while (address < end) {
        pte_t * pte = pte_alloc_kernel(pmd, address);
        if (!pte)
            return -ENOMEM;
        if (alloc_area_pte(pte, address, end - address))
            return -ENOMEM;
        address = (address + PMD_SIZE) & PMD_MASK;
        pmd++;
    }

The pte_alloc_kernel( ) function (see Section 2.5.2 in Chapter 2) allocates a new Page Table and updates the corresponding entry in the Page Middle Directory. Next, alloc_area_pte( ) allocates all the page frames corresponding to the entries in the Page Table. The value of address is increased by 2^22, that is, the size of the linear address interval spanned by a single Page Table, and the cycle is repeated.

The main cycle of alloc_area_pte( ) is:

    while (address < end) {
        unsigned long page;
        if (!pte_none(*pte))
            printk("alloc_area_pte: page already exists\n");
        page = __get_free_page(GFP_KERNEL);
        if (!page)
            return -ENOMEM;
        set_pte(pte, mk_pte(page, PAGE_KERNEL));
        address += PAGE_SIZE;
        pte++;
    }

Each page frame is allocated through __get_free_page( ). The physical address of the new page frame is written into the Page Table by the set_pte and mk_pte macros. The cycle is repeated after adding the constant 4096, that is, the length of a page frame, to address.

6.3.4 Releasing a Noncontiguous Memory Area

The vfree( ) function releases noncontiguous memory areas. Its parameter addr contains the initial linear address of the area to be released. vfree( ) first scans the list pointed to by vmlist to find the address of the area descriptor associated with the area to be released:

    for (p = &vmlist ; (tmp = *p) ; p = &tmp->next) {
        if (tmp->addr == addr) {
            *p = tmp->next;
            vmfree_area_pages((unsigned long)(tmp->addr),
                              tmp->size);
            kfree(tmp);
            return;
        }
    }
    printk("Trying to vfree( ) nonexistent vm area (%p)\n", addr);

The size field of the descriptor specifies the size of the area to be released. The area itself is released by invoking vmfree_area_pages( ), while the descriptor is released by invoking kfree( ).
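The scan in vfree( ) relies on a pointer-to-pointer idiom: p always holds the address of the link that points to the current descriptor, so unlinking needs no special case for the head of the list. A minimal user-space sketch of the idiom follows; struct area, area_list, and area_unlink( ) are hypothetical names standing in for struct vm_struct, vmlist, and the loop inside vfree( ), and the sketch leaves the release of page frames and of the descriptor to the caller.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct vm_struct. */
struct area {
    void *addr;
    unsigned long size;
    struct area *next;
};

/* Hypothetical stand-in for the vmlist variable. */
static struct area *area_list;

/*
 * Unlink and return the descriptor whose addr field equals addr,
 * or return NULL if no such descriptor is on the list.
 */
static struct area *area_unlink(const void *addr)
{
    struct area **p, *tmp;

    for (p = &area_list; (tmp = *p) != NULL; p = &tmp->next) {
        if (tmp->addr == addr) {
            *p = tmp->next;   /* same statement for head, middle, tail */
            tmp->next = NULL;
            return tmp;
        }
    }
    return NULL;
}
```

Because p starts at &area_list, removing the first descriptor simply overwrites the list head, exactly as removing any later descriptor overwrites the next field of its predecessor.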


The vmfree_area_pages( ) function takes two parameters: the initial linear address and the size of the area. It executes the following cycle to reverse the actions performed by vmalloc_area_pages( ):

    while (address < end) {
        free_area_pmd(dir, address, end - address);
        address = (address + PGDIR_SIZE) & PGDIR_MASK;
        dir++;
    }

In turn, free_area_pmd( ) reverses the actions of alloc_area_pmd( ) in the cycle:

    while (address < end) {
        free_area_pte(pmd, address, end - address);
        address = (address + PMD_SIZE) & PMD_MASK;
        pmd++;
    }

Again, free_area_pte( ) reverses the activity of alloc_area_pte( ) in the cycle:

    while (address < end) {
        pte_t page = *pte;
        pte_clear(pte);
        address += PAGE_SIZE;
        pte++;
        if (pte_none(page))
            continue;
        if (pte_present(page)) {
            free_page(pte_page(page));
            continue;
        }
        printk("Whee... Swapped out page in kernel page table\n");
    }

Each page frame assigned to the noncontiguous memory area is released by means of the buddy system free_page( ) function. The corresponding entry in the Page Table is set to 0 by the pte_clear macro.

6.4 Anticipating Linux 2.4

Linux 2.2 has two buddy systems: the first one handles page frames suitable for ISA DMA, while the second one handles page frames not suitable for ISA DMA. Linux 2.4 adds a third buddy system for the high physical memory, that is, for the page frames not permanently mapped by the kernel. Using a high-memory page frame implies changing an entry in a special kernel Page Table to map the page frame physical addresses in the 4 GB linear address space.

Actually, Linux 2.4 views the three portions of RAM as different "zones." Each zone has its own counters and watermarks to monitor the number of free page frames. When a memory allocation request takes place, the kernel first tries to fetch the page frames from the most suitable zone; if it fails, it may fall back on another zone.

The slab allocator is mostly unchanged. However, Linux 2.4 allows a slab allocator cache that is no longer useful to be destroyed. Recall that in Linux 2.2 a slab allocator cache can be dynamically created but not destroyed. Modules that create their own slab allocator cache when loaded are now expected to destroy it when unloaded.


Chapter 7. Process Address Space

As seen in the previous chapter, a kernel function gets dynamic memory in a fairly straightforward manner by invoking one of a variety of functions: __get_free_pages( ) to get pages from the buddy system algorithm, kmem_cache_alloc( ) or kmalloc( ) to use the slab allocator for specialized or general-purpose objects, and vmalloc( ) to get a noncontiguous memory area. If the request can be satisfied, each of these functions returns a linear address identifying the beginning of the allocated dynamic memory area.

These simple approaches work for two reasons:

• The kernel is the highest priority component of the operating system: if some kernel function makes a request for dynamic memory, it must have some valid reason to issue that request, and there is no point in trying to defer it.
• The kernel trusts itself: all kernel functions are assumed error-free, so it does not need to insert any protection against programming errors.

When allocating memory to User Mode processes, the situation is entirely different:

• Process requests for dynamic memory are considered nonurgent. When a process's executable file is loaded, for instance, it is unlikely that the process will address all the pages of code in the near future. Similarly, when a process invokes malloc( ) to get additional dynamic memory, it doesn't mean the process will soon access all the additional memory obtained. So as a general rule, the kernel tries to defer allocating dynamic memory to User Mode processes.
• Since user programs cannot be trusted, the kernel must be prepared to catch all addressing errors caused by processes in User Mode.

As we shall see in this chapter, the kernel succeeds in deferring the allocation of dynamic memory to processes by making use of a new kind of resource. When a User Mode process asks for dynamic memory, it doesn't get additional page frames; instead, it gets the right to use a new range of linear addresses, which become part of its address space. This interval is called a memory region.

We start in Section 7.1 by discussing how the process views dynamic memory. We then describe the basic components of the process address space in Section 7.3. Next, we examine in detail the role played by the page fault exception handler in deferring the allocation of page frames to processes. We then illustrate how the kernel creates and deletes whole process address spaces. Last, we discuss the APIs and system calls related to address space management.

7.1 The Process's Address Space

The address space of a process consists of all linear addresses that the process is allowed to use. Each process sees a different set of linear addresses; the address used by one process bears no relation to the address used by another. As we shall see later, the kernel may dynamically modify a process address space by adding or removing intervals of linear addresses.


The kernel represents intervals of linear addresses by means of resources called memory regions, which are characterized by an initial linear address, a length, and some access rights. For reasons of efficiency, both the initial address and the length of a memory region must be multiples of 4096, so that the data identified by each memory region entirely fills up the page frames allocated to it. Let us briefly mention typical situations in which a process gets new memory regions:

• When the user types a command at the console, the shell process creates a new process to execute the command. As a result, a fresh address space, thus a set of memory regions, is assigned to the new process (see Section 7.5 later in this chapter and Chapter 19).
• A running process may decide to load an entirely different program. In this case, the process ID remains unchanged but the memory regions used before loading the program are released, and a new set of memory regions is assigned to the process (see Section 19.4 in Chapter 19).
• A running process may perform a "memory mapping" on a file (or on a portion of it). In such cases, the kernel assigns a new memory region to the process to map the file (see Section 15.2 in Chapter 15).
• A process may keep adding data on its User Mode stack until all addresses in the memory region that map the stack have been used. In such cases, the kernel may decide to expand the size of that memory region (see Section 7.4 later in this chapter).
• A process may create an IPC shared memory region to share data with other cooperating processes. In such cases, the kernel assigns a new memory region to the process to implement this construct (see Section 18.3.5 in Chapter 18).
• A process may expand its dynamic area (the heap) through a function such as malloc( ). As a result, the kernel may decide to expand the size of the memory region assigned to the heap (see Section 7.6 later in this chapter).

Table 7-1 illustrates some of the system calls related to the previously mentioned tasks. With the exception of brk( ), which is discussed at the end of this chapter, the system calls are described in other chapters.

Table 7-1. System Calls Related to Memory Region Creation and Deletion

System Call   Description
brk( )        Changes the heap size of the process
execve( )     Loads a new executable file, thus changing the process address space
exit( )       Terminates the current process and destroys its address space
fork( )       Creates a new process, and thus a new address space
mmap( )       Creates a memory mapping for a file, thus enlarging the process address space
munmap( )     Destroys a memory mapping for a file, thus contracting the process address space
shmat( )      Creates a shared memory region
shmdt( )      Destroys a shared memory region

As we shall see in Section 7.4, it is essential for the kernel to identify the memory regions currently owned by a process (that is, the address space of a process) since that allows the "Page fault" exception handler to efficiently distinguish between two types of invalid linear addresses that cause it to be invoked:

• Those caused by programming errors.


• Those caused by a missing page; even though the linear address belongs to the process's address space, the page frame corresponding to that address has yet to be allocated.

The latter addresses are not invalid from the process's point of view; the kernel handles the page fault by providing the page frame and letting the process continue.

7.2 The Memory Descriptor

All information related to the process address space is included in a table referenced by the mm field of the process descriptor. This table is a structure of type mm_struct as follows:

    struct mm_struct {
        struct vm_area_struct *mmap, *mmap_avl, *mmap_cache;
        pgd_t * pgd;
        atomic_t count;
        int map_count;
        struct semaphore mmap_sem;
        unsigned long context;
        unsigned long start_code, end_code, start_data, end_data;
        unsigned long start_brk, brk, start_stack;
        unsigned long arg_start, arg_end, env_start, env_end;
        unsigned long rss, total_vm, locked_vm;
        unsigned long def_flags;
        unsigned long cpu_vm_mask;
        unsigned long swap_cnt;
        unsigned long swap_address;
        void * segments;
    };

For the present discussion, the most important fields are:

pgd and segments
    Point, respectively, to the Page Global Directory and Local Descriptor Table of the process.

rss
    Specifies the number of page frames allocated to the process.

total_vm
    Denotes the size of the process address space expressed as a number of pages.

locked_vm
    Counts the number of "locked" pages, that is, pages that cannot be swapped out (see Chapter 16).


count
    Denotes the number of processes that share the same struct mm_struct descriptor. If count is greater than 1, the processes are lightweight processes sharing the same address space, that is, using the same memory descriptor.

The mm_alloc( ) function is invoked to get a new memory descriptor. Since these descriptors are stored in a slab allocator cache, mm_alloc( ) calls kmem_cache_alloc( ), initializes the new memory descriptor by duplicating the content of the memory descriptor of current, and sets the count field to 1.

Conversely, the mmput( ) function decrements the count field of a memory descriptor. If that field becomes 0, the function releases the Local Descriptor Table, the memory region descriptors (see later in this chapter), the page tables referenced by the memory descriptor, and the memory descriptor itself.

The mmap, mmap_avl, and mmap_cache fields are discussed in the next section.

7.3 Memory Regions

Linux implements memory regions by means of descriptors of type vm_area_struct:

    struct vm_area_struct {
        struct mm_struct * vm_mm;
        unsigned long vm_start;
        unsigned long vm_end;
        struct vm_area_struct *vm_next;
        pgprot_t vm_page_prot;
        unsigned short vm_flags;
        short vm_avl_height;
        struct vm_area_struct *vm_avl_left, *vm_avl_right;
        struct vm_area_struct *vm_next_share, **vm_pprev_share;
        struct vm_operations_struct * vm_ops;
        unsigned long vm_offset;
        struct file * vm_file;
        unsigned long vm_pte;
    };

Each memory region descriptor identifies a linear address interval. The vm_start field contains the first linear address of the interval, while the vm_end field contains the first linear address outside of the interval; vm_end - vm_start thus denotes the length of the memory region. The vm_mm field points to the mm_struct memory descriptor of the process that owns the region. We shall describe the other fields of vm_area_struct later.

Memory regions owned by a process never overlap, and the kernel tries to merge regions when a new one is allocated right next to an existing one. Two adjacent regions can be merged if their access rights match.

As shown in Figure 7-1, when a new range of linear addresses is added to the process address space, the kernel checks whether an already existing memory region can be enlarged (case a). If not, a new memory region is created (case b). Similarly, if a range of linear addresses is removed from the process address space, the kernel resizes the affected memory regions (case c). In some cases, the resizing forces a memory region to be split into two smaller ones (case d). [1]

[1] Removing a linear address interval may theoretically fail because no free memory is available for a new memory descriptor.

Figure 7-1. Adding or removing a linear address interval

7.3.1 Memory Region Data Structures

All the regions owned by a process are linked together in a simple list. Regions appear in the list in ascending order by memory address; however, any two consecutive regions may be separated by an area of unused memory addresses. The vm_next field of each vm_area_struct element points to the next element in the list. The kernel finds the memory regions through the mmap field of the process memory descriptor, which points to the first memory region descriptor in the list.

The map_count field of the memory descriptor contains the number of regions owned by the process. A process may own up to MAX_MAP_COUNT different memory regions (this value is usually set to 65536).

Figure 7-2 illustrates the relationships among the address space of a process, its memory descriptor, and the list of memory regions.
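The ordering invariant described above can be sketched in user space as follows. The types struct region and struct mem_desc and the function region_insert( ) are hypothetical simplifications of vm_area_struct, mm_struct, and the kernel's own insertion code; the sketch handles only the linked list, rejects overlapping regions, and makes no attempt to merge adjacent ones.

```c
#include <stddef.h>

/* Hypothetical, reduced versions of vm_area_struct and mm_struct. */
struct region {
    unsigned long vm_start;   /* first address inside the region */
    unsigned long vm_end;     /* first address after the region */
    struct region *vm_next;
};

struct mem_desc {
    struct region *mmap;      /* first region, lowest addresses */
    int map_count;            /* number of regions on the list */
};

/*
 * Insert r so that the list stays sorted by address.
 * Returns 0 on success, -1 if r overlaps an existing region.
 */
static int region_insert(struct mem_desc *mm, struct region *r)
{
    struct region **p = &mm->mmap;

    /* Skip every region that ends at or before the start of r. */
    while (*p && (*p)->vm_end <= r->vm_start)
        p = &(*p)->vm_next;

    /* The following region, if any, must begin at or after the end of r. */
    if (*p && (*p)->vm_start < r->vm_end)
        return -1;

    r->vm_next = *p;
    *p = r;
    mm->map_count++;
    return 0;
}
```

Since vm_end is the first address outside a region, two regions may touch (one region's vm_end equal to the next region's vm_start) without overlapping, which is why the comparisons use <= and < respectively.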


Figure 7-2. Descriptors related to the address space of a process

A frequent operation performed by the kernel is to search for the memory region that includes a specific linear address. Since the list is sorted, the search can terminate as soon as a memory region that ends after the specific linear address has been found.

However, using the list is convenient only if the process has very few memory regions, say, fewer than a few tens of them. Searching, inserting elements, and deleting elements in the list involve a number of operations whose times are linearly proportional to the list length.

Although most Linux processes use very few memory regions, there are some large applications like object-oriented databases that one might consider "pathological" in that they have many hundreds or even thousands of regions. In such cases, the memory region list management becomes very inefficient, hence the performance of the memory-related system calls degrades to an intolerable point.

When processes have a large number of memory regions, Linux stores their descriptors in data structures called AVL trees, which were invented in 1962 by Adelson-Velskii and Landis.

In an AVL tree, each element (or node) usually has two children: a left child and a right child. The elements in the AVL tree are sorted: for each node N, all elements of the subtree rooted at the left child of N precede N, while, conversely, all elements of the subtree rooted at the right child of N follow N (see Figure 7-3 (a); the key of the node is written inside the node itself).

Every node N of an AVL tree has a balancing factor, which shows how well balanced the branches under the node are. The balancing factor is the depth of the subtree rooted at N's left child minus the depth of the subtree rooted at N's right child. Every node of a properly balanced AVL tree must have a balancing factor equal to -1, 0, or +1 (see Figure 7-3 (a); the balancing factor of the node is written to the left of the node itself).
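The balancing factor defined above can be computed recursively. The sketch below is not kernel code; struct node, depth( ), balance_factor( ), and is_avl( ) are hypothetical names used only to make the definition concrete, and the sketch assumes that an empty subtree has depth 0.

```c
#include <stddef.h>

/* Hypothetical tree node; not the kernel's vm_area_struct. */
struct node {
    int key;
    struct node *left, *right;
};

/* Depth of the subtree rooted at n; an empty subtree has depth 0. */
static int depth(const struct node *n)
{
    int l, r;

    if (!n)
        return 0;
    l = depth(n->left);
    r = depth(n->right);
    return 1 + (l > r ? l : r);
}

/* Depth of the left subtree minus depth of the right subtree. */
static int balance_factor(const struct node *n)
{
    return depth(n->left) - depth(n->right);
}

/* Returns 1 if every node has a balancing factor of -1, 0, or +1. */
static int is_avl(const struct node *n)
{
    int bf;

    if (!n)
        return 1;
    bf = balance_factor(n);
    return bf >= -1 && bf <= 1 && is_avl(n->left) && is_avl(n->right);
}
```

Recomputing depths on every call keeps the sketch short but costs time proportional to the subtree size; a real implementation would instead cache a per-node height and update it during insertions and deletions.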


Figure 7-3. Example of AVL trees

Searching for an element in an AVL tree is very efficient, since it requires a number of operations proportional to the base-2 logarithm of the tree size. In other words, doubling the number of memory regions adds just one more iteration to the operation.

Inserting and deleting an element in an AVL tree is also efficient, since the algorithm can quickly traverse the tree in order to locate the position at which the element will be inserted or from which it will be removed. However, such operations could make the AVL tree unbalanced. For instance, let's suppose that an element having value 11 must be inserted in the AVL tree shown in Figure 7-3 (a). Its proper position is the left child of the node having key 12, but once it is inserted, the balancing factor of the node having key 13 becomes -2. In order to rebalance the AVL tree, the algorithm performs a "rotation" on the subtree rooted at the node having the key 13, thus producing the new AVL tree shown in Figure 7-3 (b). This looks complicated, but inserting or deleting an element in an AVL tree requires a small number of operations, a number proportional to the logarithm of the tree size.

Still, AVL trees have their drawbacks. The functions that handle them are a lot more complex than the functions that handle lists.
When the number of elements is small, it is far more efficient to put them in a list instead of in an AVL tree.

Therefore, in order to store the memory regions of a process, Linux generally makes use of the linked list referred to by the mmap field of the memory descriptor; it starts using an AVL tree only when the number of memory regions of the process reaches AVL_MIN_MAP_COUNT (usually 32 elements). Thus, the memory descriptor of a process includes another field named mmap_avl pointing to the AVL tree. This field has the value NULL until the kernel decides it needs to create the tree. Once an AVL tree has been created to handle memory regions of a process, Linux keeps both the linked list and the AVL tree up to date. Both data structures contain pointers to the same memory region descriptors. When inserting or removing a memory region descriptor, the kernel searches the previous and next elements through the AVL tree and uses them to quickly update the list without scanning it.

The addresses of the left and right children of every AVL node are stored in the vm_avl_left and vm_avl_right fields, respectively, of the vm_area_struct descriptor. This descriptor also includes the vm_avl_height field, which stores the height of the subtree rooted at the memory region itself. The tree is sorted on the vm_end field value.
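To make the role of the three AVL fields concrete, here is a minimal sketch of a single right rotation on a cut-down descriptor. It is an illustration under stated assumptions, not the kernel's avl_rebalance( ): the avl_region type keeps only the fields named above, and the helpers height, update_height, balance, and rotate_right are hypothetical names.

```c
#include <stddef.h>

/* Cut-down stand-in for vm_area_struct; only the AVL-related fields. */
struct avl_region {
    unsigned long vm_end;               /* sort key of the tree */
    struct avl_region *vm_avl_left;
    struct avl_region *vm_avl_right;
    int vm_avl_height;                  /* height of the subtree rooted here */
};

static int height(const struct avl_region *r)
{
    return r ? r->vm_avl_height : 0;
}

static void update_height(struct avl_region *r)
{
    int l = height(r->vm_avl_left);
    int h = height(r->vm_avl_right);
    r->vm_avl_height = 1 + (l > h ? l : h);
}

/* Left height minus right height, read from the cached heights. */
static int balance(const struct avl_region *r)
{
    return height(r->vm_avl_left) - height(r->vm_avl_right);
}

/*
 * Right rotation: the left child of n becomes the new subtree root and n
 * becomes its right child. The caller must ensure n->vm_avl_left != NULL.
 * The ordering on vm_end is preserved.
 */
static struct avl_region *rotate_right(struct avl_region *n)
{
    struct avl_region *l = n->vm_avl_left;

    n->vm_avl_left = l->vm_avl_right;
    l->vm_avl_right = n;
    update_height(n);   /* n is now lower in the tree: fix it first */
    update_height(l);
    return l;
}
```

Caching the height in each node is what lets the balance test run in constant time, which is presumably why the descriptor carries a vm_avl_height field at all.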


The avl_rebalance( ) function receives a path in a memory region's AVL tree as a parameter. It rebalances the tree, if necessary, by properly rotating a subtree branching off from a node of the path. The function is invoked by the avl_insert( ) and avl_remove( ) functions, which insert and remove a memory region descriptor in a tree, respectively. Linux also makes use of the avl_insert_neighbours( ) function to insert an element into the tree and return the addresses of the nearest nodes at the left and the right of the new element.

7.3.2 Memory Region Access Rights

Before moving on, we should clarify the relation between a page and a memory region. As mentioned in Chapter 2, we use the term "page" to refer both to a set of linear addresses and to the data contained in this group of addresses. In particular, we denote the linear address interval ranging between 0 and 4095 as page 0, the linear address interval ranging between 4096 and 8191 as page 1, and so forth.
Each memory region thus consists of a set of pages having consecutive page numbers.

We have already discussed in previous chapters two kinds of flags associated with a page:

• A few flags such as Read/Write, Present, or User/Supervisor stored in each page table entry (see Section 2.4.1 in Chapter 2).
• A set of flags stored in the flags field of each page descriptor (see Section 6.1 in Chapter 6).

The first kind of flag is used by the Intel 80x86 hardware to check whether the requested kind of addressing can be performed; the second kind is used by Linux for many different purposes (see Table 6-1).

We now introduce a third kind of flag: those associated with the pages of a memory region. They are stored in the vm_flags field of the vm_area_struct descriptor (see Table 7-2). Some flags offer the kernel information about all the pages of the memory region, such as what they contain and what rights the process has to access each page. Other flags describe the region itself, such as how it can grow.

Table 7-2. The Memory Region Flags

Flag Name        Description
VM_DENYWRITE     The region maps a file that cannot be opened for writing.
VM_EXEC          Pages can be executed.
VM_EXECUTABLE    Pages contain executable code.
VM_GROWSDOWN     The region can expand toward lower addresses.
VM_GROWSUP       The region can expand toward higher addresses.
VM_IO            The region maps the I/O address space of a device.
VM_LOCKED        Pages are locked and cannot be swapped out.
VM_MAYEXEC       VM_EXEC flag may be set.
VM_MAYREAD       VM_READ flag may be set.
VM_MAYSHARE      VM_SHARED flag may be set.
VM_MAYWRITE      VM_WRITE flag may be set.
VM_READ          Pages can be read.
VM_SHARED        Pages can be shared by several processes.
VM_SHM           Pages are used for IPC's shared memory.
VM_WRITE         Pages can be written.

Page access rights included in a memory region descriptor may be combined arbitrarily: it is possible, for instance, to allow the pages of a region to be executed but not to be read. In order to implement this protection scheme efficiently, the read, write, and execute access rights associated with the pages of a memory region must be duplicated in all the corresponding page table entries, so that checks can be directly performed by the Paging Unit circuitry. In other words, the page access rights dictate what kinds of access should generate a "Page fault" exception. As we shall see shortly, Linux delegates the job of figuring out what caused the page fault to the page fault handler, which implements several page-handling strategies.

The initial values of the page table flags (which must be the same for all pages in the memory region, as we have seen) are stored in the vm_page_prot field of the vm_area_struct descriptor. When adding a page, the kernel sets the flags in the corresponding page table entry according to the value of the vm_page_prot field.

However, translating the memory region's access rights into the page protection bits is not straightforward, for the following reasons:

• In some cases, a page access should generate a "Page fault" exception even when its access type is granted by the page access rights specified in the vm_flags field of the corresponding memory region.
For instance, the kernel might decide to store two identical, writable private pages (whose VM_SHARED flags are cleared) belonging to two different processes into the same page frame; in this case, an exception should be generated when either one of the processes tries to modify the page (see Section 7.4.4 later in this chapter).
• Intel 80x86 processors' page tables have just two protection bits, namely the Read/Write and User/Supervisor flags. Moreover, the User/Supervisor flag of any page included in a memory region must always be set, since the page must always be accessible by User Mode processes.

In order to overcome the hardware limitation of the Intel microprocessors, Linux adopts the following rules:

• The read access right always implies the execute access right.
• The write access right always implies the read access right.

Moreover, in order to correctly defer the allocation of page frames through the Copy On Write technique (see Section 7.4.4 later in this chapter), the page frame is write-protected whenever the corresponding page must not be shared by several processes.
Therefore, the 16 possible combinations of the read, write, execute, and share access rights are scaled down to the following three:

• If the page has both write and share access rights, the Read/Write bit is set.
• If the page has the read or execute access right but does not have either the write or the share access right, the Read/Write bit is cleared.
• If the page does not have any access rights, the Present bit is cleared, so that each access generates a "Page fault" exception. However, in order to distinguish this condition from the real page-not-present case, Linux also sets the Page size bit to 1. [2]
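The three cases above can be written down as a small function. This is a simplified sketch, not the kernel's protection_map: the function name downscale_prot is hypothetical, the VM_* values are assumed to occupy the four low-order bits of vm_flags, the PTE_* constants follow the 80x86 page table entry layout, and other bits that a real entry would carry (such as Accessed) are left out.

```c
/* Access-right bits, assumed to be the four low-order bits of vm_flags. */
enum {
    VM_READ   = 0x1,
    VM_WRITE  = 0x2,
    VM_EXEC   = 0x4,
    VM_SHARED = 0x8
};

/* 80x86 page table entry bits used in this sketch. */
enum {
    PTE_PRESENT = 0x001,
    PTE_RW      = 0x002,   /* Read/Write */
    PTE_USER    = 0x004,   /* User/Supervisor */
    PTE_PSE     = 0x080    /* Page size */
};

/* Downscale a region's access rights to page table protection bits. */
static unsigned long downscale_prot(unsigned long vm_flags)
{
    unsigned long rights = vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
    unsigned long pte;

    /* No access rights: Present cleared, Page size set as a marker. */
    if (rights == 0)
        return PTE_PSE;

    /* Any right implies a present page accessible from User Mode. */
    pte = PTE_PRESENT | PTE_USER;

    /* Writable only if the page is both writable and shared. */
    if ((vm_flags & VM_WRITE) && (vm_flags & VM_SHARED))
        pte |= PTE_RW;

    return pte;
}
```

Note that a private writable region ends up write-protected: the first write then raises a "Page fault" exception, which gives the kernel the chance to handle it.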


[2] You might consider this use of the Page size bit to be a dirty trick, since the bit was meant to indicate the real page size. But Linux can get away with the deception because the Intel chip checks the Page size bit in Page Directory entries, but not in Page Table entries.

The downscaled protection bits corresponding to each combination of access rights are stored in the protection_map array.

7.3.3 Memory Region Handling

Now that we have a basic understanding of the data structures and state information that control memory handling, we can look at a group of low-level functions that operate on memory region descriptors. They should be considered as auxiliary functions that simplify the implementation of do_mmap( ) and do_munmap( ). Those two functions, which are described in Section 7.3.4 and Section 7.3.5 later in this chapter, respectively, enlarge and shrink the address space of a process. Working at a higher level than the functions we consider here, they do not receive a memory region descriptor as their parameter, but rather the initial address, the length, and the access rights of a linear address interval.

7.3.3.1 Finding the closest region to a given address

The find_vma( ) function acts on two parameters: the address mm of a process memory descriptor and a linear address addr. It locates the first memory region whose vm_end field is greater than addr and returns the address of its descriptor; if no such region exists, it returns a NULL pointer.
Notice that the region selected by find_vma( ) does not necessarily include addr, since addr might lie before the region's vm_start.

Each memory descriptor includes a mmap_cache field that stores the descriptor address of the region that was last referenced by the process. This additional field is introduced to reduce the time spent in looking for the region that contains a given linear address: locality of address references in programs makes it highly likely that if the last linear address checked belonged to a given region, the next one to be checked belongs to the same region.

The function thus starts by checking whether the region identified by mmap_cache includes addr. If so, it returns the region descriptor pointer:

    vma = mm->mmap_cache;
    if (vma && vma->vm_end > addr && vma->vm_start <= addr)
        return vma;

Otherwise, the memory regions of the process must be scanned. If the process does not use an AVL tree, the function simply scans the linked list:

    if (!mm->mmap_avl) {
        vma = mm->mmap;
        while (vma && vma->vm_end <= addr)
            vma = vma->vm_next;
        if (vma)
            mm->mmap_cache = vma;
        return vma;
    }

Otherwise, the function looks up the memory region in the AVL tree:


    tree = mm->mmap_avl;
    vma = NULL;
    for (;;) {
        if (tree == NULL)
            break;
        if (tree->vm_end > addr) {
            vma = tree;
            if (tree->vm_start <= addr)
                break;
            tree = tree->vm_avl_left;
        } else
            tree = tree->vm_avl_right;
    }
    if (vma)
        mm->mmap_cache = vma;
    return vma;

The kernel also makes use of the find_vma_prev( ) function, which returns the descriptor addresses of the memory region that precedes the linear address given as parameter and of the memory region that follows it.

7.3.3.2 Finding a region that overlaps a given address interval

The find_vma_intersection( ) function finds the first memory region that overlaps a given linear address interval; the mm parameter points to the memory descriptor of the process, while the start_addr and end_addr linear addresses specify the interval:

    vma = find_vma(mm, start_addr);
    if (vma && end_addr <= vma->vm_start)
        vma = NULL;
    return vma;

The function returns a NULL pointer if no such region exists. To be exact, if find_vma( ) returns a valid address but the memory region found starts after the end of the linear address interval, vma is set to NULL.

7.3.3.3 Finding a free address interval

The get_unmapped_area( ) function searches the process address space to find an available linear address interval. The len parameter specifies the interval length, while the addr parameter may specify the address from which the search is started.
If the search is successful, the function returns the initial address of the new interval; otherwise, it returns 0:

    if (len > PAGE_OFFSET)
        return 0;
    if (!addr)
        addr = PAGE_OFFSET / 3;
    addr = (addr + 0xfff) & 0xfffff000;
    for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
        if (addr + len > PAGE_OFFSET)
            return 0;
        if (!vmm || addr + len <= vmm->vm_start)
            return addr;
        addr = vmm->vm_end;
    }
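The list scan of find_vma( ) and the first-fit search of get_unmapped_area( ) can be tried out in user space with a stripped-down region list. The following is a sketch under stated assumptions rather than kernel code: the region type keeps only three fields, the names find_region, find_free_interval, SKETCH_PAGE_OFFSET, and SKETCH_PAGE_SIZE are hypothetical, the 3 GB limit is hard-coded, and the limit test is written so that it cannot overflow.

```c
#include <stddef.h>

#define SKETCH_PAGE_OFFSET 0xc0000000UL  /* assumed 3 GB User Mode limit */
#define SKETCH_PAGE_SIZE   0x1000UL      /* 4 KB pages */

/* Cut-down region descriptor: the interval is [vm_start, vm_end). */
struct region {
    unsigned long vm_start, vm_end;
    struct region *vm_next;              /* list sorted by address */
};

/* First region whose vm_end is greater than addr, or NULL if none. */
static struct region *find_region(struct region *head, unsigned long addr)
{
    struct region *r = head;

    while (r && r->vm_end <= addr)
        r = r->vm_next;
    return r;
}

/* First-fit search for a hole of len bytes; returns 0 on failure. */
static unsigned long find_free_interval(struct region *head,
                                        unsigned long addr,
                                        unsigned long len)
{
    struct region *r;

    if (len > SKETCH_PAGE_OFFSET || addr > SKETCH_PAGE_OFFSET)
        return 0;
    if (!addr)
        addr = SKETCH_PAGE_OFFSET / 3;
    /* Round addr up to the next page boundary. */
    addr = (addr + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);

    for (r = find_region(head, addr); ; r = r->vm_next) {
        /* Not enough room left below the limit (overflow-safe test). */
        if (addr > SKETCH_PAGE_OFFSET || len > SKETCH_PAGE_OFFSET - addr)
            return 0;
        /* No region ahead, or the hole before it is large enough. */
        if (!r || addr + len <= r->vm_start)
            return addr;
        /* Hole too small: retry right after this region. */
        addr = r->vm_end;
    }
}
```

With regions at 0x40000000-0x40002000, 0x40002000-0x40003000, and 0x40005000-0x40006000, a request for 0x2000 bytes lands in the hole at 0x40003000, while a request for 0x3000 bytes skips that hole and is placed after the last region.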


The get_unmapped_area( ) function starts by checking to make sure the interval length is within the limit imposed on User Mode linear addresses, usually 3 GB. If addr is zero, the search's starting point is set to one-third of the User Mode linear address space. To be on the safe side, the function rounds up the value of addr to a multiple of 4 KB. Starting from addr, it then invokes find_vma( ) and walks the region list with increasing values of addr to find the required free interval. During this search, the following cases may occur:

• The requested interval is larger than the portion of linear address space yet to be scanned (addr + len > PAGE_OFFSET): since there are not enough linear addresses to satisfy the request, return 0.
• The hole following the last scanned region is not large enough (vmm != NULL && vmm->vm_start < addr + len): consider the next region.
• If neither one of the preceding conditions holds, a large enough hole has been found: return addr.

7.3.3.4 Inserting a region in the memory descriptor list

insert_vm_struct( ) inserts a vm_area_struct structure in the list of memory region descriptors and, if necessary, in the AVL tree.
It makes use of two parameters: mm, which specifies the address of a process memory descriptor, and vmp, which specifies the address of the vm_area_struct descriptor to be inserted:

    if (!mm->mmap_avl) {
        pprev = &mm->mmap;
        while (*pprev && (*pprev)->vm_start <= vmp->vm_start)
            pprev = &(*pprev)->vm_next;
    } else {
        struct vm_area_struct *prev, *next;
        avl_insert_neighbours(vmp, &mm->mmap_avl, &prev, &next);
        pprev = (prev ? &prev->vm_next : &mm->mmap);
    }
    vmp->vm_next = *pprev;
    *pprev = vmp;

If the process makes use of the AVL tree, the avl_insert_neighbours( ) function is invoked to insert the memory region descriptor in the proper position; otherwise, insert_vm_struct( ) scans forward through the linked list using the pprev local variable until it finds the descriptor that should precede vmp. At the end of the search, pprev points to the vm_next field of the memory region descriptor that should precede vmp in the list, hence *pprev yields the address of the memory region descriptor that should follow vmp. The descriptor can thus be inserted into the list.

    mm->map_count++;
    if (mm->map_count >= AVL_MIN_MAP_COUNT && !mm->mmap_avl)
        build_mmap_avl(mm);

The map_count field of the process memory descriptor is then incremented by 1. Moreover, if the process was not using the AVL tree up to now but the number of memory regions becomes greater than or equal to AVL_MIN_MAP_COUNT, the build_mmap_avl( ) function is invoked:


    void build_mmap_avl(struct mm_struct * mm)
    {
        struct vm_area_struct * vma;

        mm->mmap_avl = NULL;
        for (vma = mm->mmap; vma; vma = vma->vm_next)
            avl_insert(vma, &mm->mmap_avl);
    }

From now on, the process will use an AVL tree.

If the region contains a memory mapped file, the function performs additional tasks that are described in Chapter 16.

No explicit function exists for removing a region from the memory descriptor list (see Section 7.3.5).

7.3.3.5 Merging contiguous regions

The merge_segments( ) function attempts to merge together the memory regions included in a given linear address interval. As illustrated in Figure 7-1, this can be achieved only if the contiguous regions have the same access rights. The parameters of merge_segments( ) are a memory descriptor pointer mm and two linear addresses start_addr and end_addr, which delimit the interval. The function finds the last memory region that ends before start_addr and puts the address of its descriptor in the prev local variable. Then it iteratively executes the following actions:

• Loads the mpnt local variable with prev->vm_next, that is, the descriptor address of the first memory region that starts after start_addr. If no such region exists, no merging is possible.
• Cycles through the list as long as prev->vm_start is smaller than end_addr.
Checks whether it is possible to merge the memory regions associated with prev and mpnt. This is possible if:
    o The memory regions are contiguous: prev->vm_end == mpnt->vm_start.
    o They have the same flags: prev->vm_flags == mpnt->vm_flags.
    o When they map files or are shared among processes, they satisfy additional requirements to be discussed in later chapters.
If merging is possible, removes the memory region descriptor from the list and, if necessary, from the AVL tree.
• Decrements the map_count field of the memory descriptor by 1, and resumes the search by setting prev so that it points to the merged memory region descriptor.

The function ends by setting the mmap_cache field of the memory descriptor to NULL, since the memory region cache could now refer to a memory region that no longer exists.

7.3.4 Allocating a Linear Address Interval

Now let's discuss how new linear address intervals are allocated. In order to do this, the do_mmap( ) function creates and initializes a new memory region for the current process.


However, after a successful allocation, the memory region could be merged with other memory regions defined for the process.

The function makes use of the following parameters:

file and off
    File descriptor pointer file and file offset off are used if the new memory region will map a file into memory. This topic will be discussed in Chapter 15. In this section, we'll assume that no memory mapping is required and that file and off are both NULL.

addr
    This linear address specifies where the search for a free interval must start (see the previous description of the get_unmapped_area( ) function).

len
    The length of the linear address interval.

prot
    This parameter specifies the access rights of the pages included in the memory region. Possible flags are PROT_READ, PROT_WRITE, PROT_EXEC, and PROT_NONE. The first three flags mean the same things as the VM_READ, VM_WRITE, and VM_EXEC flags. PROT_NONE indicates that the process has none of those access rights.

flag
    This parameter specifies the remaining memory region flags:

    MAP_GROWSDOWN, MAP_LOCKED, MAP_DENYWRITE, and MAP_EXECUTABLE
        Their meanings are identical to those of the flags listed in Table 7-2.

    MAP_SHARED and MAP_PRIVATE
        The former flag specifies that the pages in the memory region can be shared among several processes; the latter flag has the opposite effect.
        Both flags refer to the VM_SHARED flag in the vm_area_struct descriptor.

    MAP_ANONYMOUS
        No file is associated with the memory region (see Chapter 15).

    MAP_FIXED
        The initial linear address of the interval must be the one specified in the addr parameter.


    MAP_NORESERVE
        The function doesn't have to do a preliminary check of the number of free page frames.

The do_mmap( ) function starts by checking whether the parameter values are correct and whether the request can be satisfied. In particular, it checks for the following conditions that prevent it from satisfying the request:

• The linear address interval includes addresses greater than PAGE_OFFSET.
• The process has already mapped too many memory regions: the value of the map_count field of its mm memory descriptor exceeds the MAX_MAP_COUNT value.
• The file parameter is equal to NULL and the flag parameter specifies that the pages of the new linear address interval must be shared.
• The flag parameter specifies that the pages of the new linear address interval must be locked in RAM, and the number of pages locked by the process exceeds the threshold stored in the rlim[RLIMIT_MEMLOCK].rlim_cur field of the process descriptor.

If any of the preceding conditions holds, do_mmap( ) terminates by returning a negative value.
If the linear address interval has a zero length, the function returns without performing any action.

The next step consists of obtaining a linear address interval: if the MAP_FIXED flag is set, a check is made on the proper alignment of the addr value; otherwise, the get_unmapped_area( ) function is invoked to get it:

    if (flags & MAP_FIXED) {
        if (addr & 0xfff)    /* addr must be a multiple of 4 KB */
            return -EINVAL;
    } else {
        addr = get_unmapped_area(addr, len);
        if (!addr)
            return -ENOMEM;
    }

Now a vm_area_struct descriptor must be allocated for the new region. This is done by invoking the kmem_cache_alloc( ) slab allocator function:

    vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
    if (!vma)
        return -ENOMEM;

The memory region descriptor is then initialized. Notice how the value of the vm_flags field is determined both by the prot and flags parameters (joined together by means of the vm_flags( ) function) and by the def_flags field of the memory descriptor. The latter field allows the kernel to define a set of flags that should be set for every memory region in the process. [3]

[3] Actually, this field is modified only by the mlockall( ) system call, which can be used to set the VM_LOCKED flag, thus locking all future pages of the calling process in RAM.


    vma->vm_mm = current->mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags(prot, flags) | current->mm->def_flags;
    vma->vm_flags |= VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
    vma->vm_page_prot = protection_map[vma->vm_flags & 0x0f];

The do_mmap( ) function then checks whether any of these error conditions holds:

• The process already includes in its address space a memory region that overlaps the linear address interval ranging from addr to addr+len; this check is performed by the do_munmap( ) function, whose return value signals whether such an overlap exists.
• The size in pages of the process address space exceeds the threshold stored in the rlim[RLIMIT_AS].rlim_cur field of the process descriptor.
• The MAP_NORESERVE flag was not set in the flags parameter, the new memory region must contain private writable pages, and the number of free page frames is less than the size (in pages) of the linear address interval; this last check is performed by the vm_enough_memory( ) function.

If any of the preceding conditions holds, do_mmap( ) releases the vm_area_struct descriptor obtained and terminates by returning the -ENOMEM value.

Once all checks have been performed, do_mmap( ) increments the size of current's address space stored in the total_vm field of the memory descriptor.
It then invokes insert_vm_struct( ), which inserts the new region in the list of regions owned by current (and, if necessary, in its AVL tree), and merge_segments( ), which checks whether regions can be merged. Since the new region may be destroyed by a merge, the values of vm_flags and vm_start, which may be needed later, are saved in the flags and addr local variables:

    current->mm->total_vm += len >> PAGE_SHIFT;
    flags = vma->vm_flags;
    addr = vma->vm_start;
    insert_vm_struct(current->mm, vma);
    merge_segments(current->mm, vma->vm_start, vma->vm_end);

The final step is executed only if the MAP_LOCKED flag is set. First, the number of pages in the memory region is added to the locked_vm field of the memory descriptor. Then the make_pages_present( ) function is invoked to allocate all the pages of the memory region in succession and lock them in RAM. The core code of make_pages_present( ) is:

    vma = find_vma(current->mm, addr);
    write = (vma->vm_flags & VM_WRITE) != 0;
    end = addr + len;    /* fixed upper bound of the interval */
    while (addr < end) {
        handle_mm_fault(current, vma, addr, write);
        addr += PAGE_SIZE;
    }

As we shall see in Section 7.4.2, handle_mm_fault( ) allocates one page and sets its page table entry according to the vm_flags field of the memory region descriptor.

Finally, the do_mmap( ) function terminates by returning the linear address of the new memory region.
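From User Mode, a request of this kind normally reaches the kernel through the mmap( ) system call. The sketch below shows one way a program might ask for an anonymous private region, touch it, and release it; the helper name map_touch_unmap is hypothetical, and the example assumes a POSIX system that provides MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS on some C libraries */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Maps len bytes of private anonymous memory, writes to every byte,
 * then unmaps the interval. Returns 0 on success, -1 on any failure.
 */
static int map_touch_unmap(size_t len)
{
    unsigned char *p;
    int ok;

    /* addr = NULL lets the kernel pick the linear address interval. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Writing to the pages forces page frames to be assigned. */
    memset(p, 0xab, len);
    ok = (p[0] == 0xab && p[len - 1] == 0xab);

    if (munmap(p, len) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

The PROT_READ | PROT_WRITE argument plays the role of the prot parameter described above, and MAP_PRIVATE | MAP_ANONYMOUS that of the flag parameter; no file is involved, which matches the assumption made in this section.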


7.3.5 Releasing a Linear Address Interval

The do_munmap( ) function deletes a linear address interval from the address space of the current process. The parameters are the starting address addr of the interval and its length len. The interval to be deleted does not usually correspond to a memory region: it may be included in one memory region, or it may span two or more regions.

The function goes through two main phases. First, it scans the list of memory regions owned by the process and removes all regions that overlap the linear address interval. In the second phase, the function updates the process page tables and reinserts a downsized version of the memory regions that were removed during the first phase.

7.3.5.1 First phase: scanning the memory regions

A preliminary check is made on the parameter values: if the linear address interval includes addresses greater than PAGE_OFFSET, if addr is not a multiple of 4096, or if the linear address interval has a zero length, the function returns a negative error code.

Next, the function locates the first memory region that overlaps the linear address interval to be deleted:

    mpnt = find_vma_prev(current->mm, addr, &prev);
    if (!mpnt || mpnt->vm_start >= addr + len)
        return 0;

If the linear address interval is located inside a memory region, its deletion will split the region into two smaller ones.
In this case, do_munmap( ) checks whether current is allowedto obtain an additional memory region:if ((mpnt->vm_start < addr && mpnt->vm_end > addr + len) &&current->mm->map_count > MAX_MAP_COUNT)return -ENOMEM;<strong>The</strong> function then attempts to get a new vm_area_struct d<strong>es</strong>criptor. <strong>The</strong>re may be no needfor it, but the function mak<strong>es</strong> the requ<strong>es</strong>t anyway so that it can terminate right away if theallocation fails. This cautious approach simplifi<strong>es</strong> the code since it allows an easy error exit:extra = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);if (!extra)return -ENOMEM;Now the function builds up a list including all d<strong>es</strong>criptors of the memory regions that overlapthe linear addr<strong>es</strong>s interval. This list is created by setting the vm_next field of the memoryregion d<strong>es</strong>criptor (temporarily) so it points to the previous item in the list; this field thus actsas a backward link. As each region is added to this backward list, a local variable named freepoints to the last inserted element. <strong>The</strong> regions inserted in the list are also removed from thelist of memory regions owned by the proc<strong>es</strong>s and, if nec<strong>es</strong>sary, from the AVL tree:199


    npp = (prev ? &prev->vm_next : &current->mm->mmap);
    free = NULL;
    for ( ; mpnt && mpnt->vm_start < addr + len; mpnt = *npp) {
        *npp = mpnt->vm_next;
        mpnt->vm_next = free;
        free = mpnt;
        if (current->mm->mmap_avl)
            avl_remove(mpnt, &current->mm->mmap_avl);
    }

7.3.5.2 Second phase: updating the page tables

A while loop is used to scan the list of memory regions built in the first phase, starting with the memory region descriptor that free points to.

In each iteration, the mpnt local variable points to the descriptor of a memory region in the list. The map_count field of the current->mm memory descriptor is decremented (since the region has been removed in the first phase from the list of regions owned by the process) and a check is made (by means of two question-mark conditional expressions) to determine whether the mpnt region must be eliminated or simply downsized:

    current->mm->map_count--;
    st = addr < mpnt->vm_start ? mpnt->vm_start : addr;
    end = addr + len;
    end = end > mpnt->vm_end ? mpnt->vm_end : end;
    size = end - st;

The st and end local variables delimit the linear address interval in the mpnt memory region that should be deleted; the size local variable specifies the length of the interval.

Next, do_munmap( ) releases the page frames allocated for the pages included in the interval from st to end:

    zap_page_range(current->mm, st, size);
    flush_tlb_range(current->mm, st, end);

The zap_page_range( ) function deallocates the page frames included in the interval from st to end and updates the corresponding page table entries. The function invokes in nested fashion the zap_pmd_range( ) and zap_pte_range( ) functions for scanning the page tables; the latter function uses the pte_clear macro to clear the page table entries and the free_pte( ) function to free the corresponding page frames.

The flush_tlb_range( ) function is then invoked to invalidate the TLB entries corresponding to the interval from st to end. In the Intel 80x86 architecture that function simply invokes __flush_tlb( ), thus invalidating all TLB entries.

The last action performed in each iteration of the do_munmap( ) loop is to check whether a downsized version of the mpnt memory region must be reinserted in the list of regions of current:

    extra = unmap_fixup(mpnt, st, size, extra);

The unmap_fixup( ) function considers four possible cases:


• The memory region has been totally canceled. Return the address stored in the extra local variable, thus signaling that the extra memory region descriptor can be released by invoking kmem_cache_free( ).

• Only the end of the memory region (its highest addresses) has been removed, that is:

    (mpnt->vm_start < st) && (mpnt->vm_end == end)

  In this case, update the vm_end field of mpnt, invoke insert_vm_struct( ) to insert the downsized region in the list of regions belonging to the process, and return the address stored in extra.

• Only the beginning of the memory region (its lowest addresses) has been removed, that is:

    (mpnt->vm_start == st) && (mpnt->vm_end > end)

  In this case, update the vm_start field of mpnt, invoke insert_vm_struct( ) to insert the downsized region in the list of regions belonging to the process, and return the address stored in extra.

• The linear address interval is in the middle of the memory region, that is:

    (mpnt->vm_start < st) && (mpnt->vm_end > end)

  Update the vm_start and vm_end fields of mpnt and extra (the previously allocated extra memory region descriptor) so that they refer to the linear address intervals, respectively, from mpnt->vm_start to st and from end to mpnt->vm_end. Then invoke insert_vm_struct( ) twice to insert the two regions in the list of regions belonging to the process (and, if necessary, in the AVL tree) and return NULL, thus preserving the extra memory region descriptor previously allocated.

This terminates the description of what must be done in a single iteration of the second-phase loop.

After handling all the memory region descriptors in the list built during the first phase, do_munmap( ) checks if the additional extra memory descriptor has been used. If extra is NULL, the descriptor has been used; otherwise, do_munmap( ) invokes kmem_cache_free( ) to release it. Finally, if the process address space has been modified, do_munmap( ) sets the mmap_cache field of the memory descriptor to NULL and returns 0.

7.4 Page Fault Exception Handler

As stated previously, the Linux "Page fault" exception handler must distinguish exceptions caused by programming errors from those caused by a reference to a page that legitimately belongs to the process address space but simply hasn't been allocated yet.

The memory region descriptors allow the exception handler to perform its job quite efficiently. The do_page_fault( ) function, which is the "Page fault" interrupt service routine, compares the linear address that caused the page fault against the memory regions of the current process; it can thus determine the proper way to handle the exception according to the scheme illustrated in Figure 7-4.
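Before moving on to the fault handler proper, it may help to see the four unmap_fixup( ) cases from the end of Section 7.3.5.2 as plain interval arithmetic. The code below is only a sketch under stated assumptions, not the kernel function: struct region is a simplified stand-in for vm_area_struct that keeps only the two bounds, fixup( ) and the FIXUP_* names are invented for illustration, and the interval [st, end) is assumed to be already clamped to the region, as do_munmap( ) does before the call.

```c
/* Simplified stand-in for vm_area_struct: only the interval bounds. */
struct region {
    unsigned long vm_start;
    unsigned long vm_end;
};

enum fixup_case {
    FIXUP_WHOLE,   /* the region disappears entirely              */
    FIXUP_TAIL,    /* the end of the region is cut off            */
    FIXUP_HEAD,    /* the beginning of the region is cut off      */
    FIXUP_HOLE     /* a hole is punched: the region splits in two */
};

/*
 * Remove [st, end) from *r, assuming vm_start <= st < end <= vm_end.
 * In the FIXUP_HOLE case, *extra receives the piece above the hole.
 */
static enum fixup_case fixup(struct region *r, unsigned long st,
                             unsigned long end, struct region *extra)
{
    if (st == r->vm_start && end == r->vm_end)
        return FIXUP_WHOLE;
    if (end == r->vm_end) {            /* here vm_start < st */
        r->vm_end = st;
        return FIXUP_TAIL;
    }
    if (st == r->vm_start) {           /* here vm_end > end */
        r->vm_start = end;
        return FIXUP_HEAD;
    }
    extra->vm_start = end;             /* upper piece: [end, vm_end)  */
    extra->vm_end = r->vm_end;
    r->vm_end = st;                    /* lower piece: [vm_start, st) */
    return FIXUP_HOLE;
}
```

Only the last case consumes the spare descriptor, which is why the kernel function returns NULL there and returns the unused extra pointer in the other three cases.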


Figure 7-4. Overall scheme for the page fault handler

In practice, things are a lot more complex since the page fault handler must recognize several particular subcases that fit awkwardly into the overall scheme, and it must distinguish several kinds of legal access. A detailed flow diagram of the handler is illustrated in Figure 7-5.

Figure 7-5. The flow diagram of the page fault handler
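As a rough orientation, the top-level decision that the rest of this section works through can be condensed into a few lines of C. This is a deliberately coarse sketch based on the prose of Sections 7.4.1 and 7.4.2, not a transcription of either figure: classify_fault( ) and the FAULT_* names are hypothetical, and the three boolean inputs stand for tests that the real handler performs on the memory regions and on the error code.

```c
#include <stdbool.h>

enum fault_outcome {
    FAULT_LEGAL,        /* allocate a page frame (demand paging or COW) */
    FAULT_SIGSEGV,      /* illegal access in User Mode: signal process  */
    FAULT_KERNEL_PATH   /* Kernel Mode: run fixup code or report a bug  */
};

/*
 * in_address_space: the faulty address lies in a memory region
 *                   (or in a stack region that can be expanded).
 * rights_match:     the access type is allowed by the region flags.
 * user_mode:        the exception occurred in User Mode.
 */
static enum fault_outcome classify_fault(bool in_address_space,
                                         bool rights_match,
                                         bool user_mode)
{
    if (in_address_space && rights_match)
        return FAULT_LEGAL;                /* the good_area path */
    /* the bad_area path: the outcome depends on the CPU mode */
    return user_mode ? FAULT_SIGSEGV : FAULT_KERNEL_PATH;
}
```

The real handler has more exits than this (allocation failures, for instance, lead to a SIGBUS signal), so the three outcomes should be read as the main branches only.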


The identifiers good_area, bad_area, and no_context are labels appearing in do_page_fault( ) that should help you to relate the blocks of the flow diagram to specific lines of code.

The do_page_fault( ) function accepts the following input parameters:

• The regs address of a pt_regs structure containing the values of the microprocessor registers when the exception occurred.

• A 3-bit error_code, which is pushed on the stack by the control unit when the exception occurred (see Section 4.2.5 in Chapter 4). The bits have the following meanings.

    o If bit 0 is clear, the exception was caused by an access to a page that is not present (the Present flag in the Page Table entry is clear); otherwise, if bit 0 is set, the exception was caused by an invalid access right.
    o If bit 1 is clear, the exception was caused by a read or execute access; if set, the exception was caused by a write access.
    o If bit 2 is clear, the exception occurred while the processor was in Kernel Mode; otherwise, it occurred in User Mode.

The first operation of do_page_fault( ) consists of reading the linear address that caused the page fault. When the exception occurs, the CPU control unit stores that value in the cr2 control register:

    asm("movl %%cr2,%0":"=r" (address));
    tsk = current;
    mm = tsk->mm;

The linear address is saved in the address local variable. The function also saves the pointers to the process descriptor and the memory descriptor of current in the tsk and mm local variables, respectively.

As shown at the top of Figure 7-5, do_page_fault( ) first checks whether the exception occurred while handling an interrupt or executing a kernel thread:

    if (in_interrupt( ) || mm == &init_mm)
        goto no_context;

In both cases, do_page_fault( ) does not try to compare the linear address with the memory regions of current, since it would not make any sense: interrupt handlers and kernel threads never use linear addresses below PAGE_OFFSET, and thus never rely on memory regions.

Let us suppose that the page fault did not occur in an interrupt handler or in a kernel thread. Then the function must inspect the memory regions owned by the process to determine whether the faulty linear address is included in the process address space:

    vma = find_vma(mm, address);
    if (!vma)
        goto bad_area;
    if (vma->vm_start


Now the function has determined that address is not included in any memory region; however, it must perform an additional check, since the faulty address may have been caused by a push or pusha instruction on the User Mode stack of the process.

Let us make a short digression to explain how stacks are mapped into memory regions. Each region that contains a stack expands toward lower addresses; its VM_GROWSDOWN flag is set, thus the value of its vm_end field remains fixed while the value of its vm_start field may be decreased. The region boundaries include, but do not delimit precisely, the current size of the User Mode stack. The reasons for the fuzz factor are:

• The region size is a multiple of 4 KB (it must include complete pages) while the stack size is arbitrary.
• Page frames assigned to a region are never released until the region is deleted; in particular, the value of the vm_start field of a region that includes a stack can only decrease; it can never increase. Even if the process executes a series of pop instructions, the region size remains unchanged.

It should now be clear how a process that has filled up the last page frame allocated to its stack may cause a "Page fault" exception: the push refers to an address outside of the region (and to a nonexistent page frame). Notice that this kind of exception is not caused by a programming error; it must thus be handled separately by the page fault handler.

We now return to the description of do_page_fault( ), which checks for the case described previously:

    if (!(vma->vm_flags & VM_GROWSDOWN))
        goto bad_area;
    if (error_code & 4 /* User Mode */
        && address + 32 < regs->esp)
        goto bad_area;
    if (expand_stack(vma, address))
        goto bad_area;
    goto good_area;

If the VM_GROWSDOWN flag of the region is set and the exception occurred in User Mode, the function checks whether address is smaller than the regs->esp stack pointer (it should be only a little smaller). Since a few stack-related assembly language instructions (like pusha) perform a decrement of the esp register only after the memory access, a 32-byte tolerance interval is granted to the process. If the address is high enough (within the tolerance granted), the code invokes the expand_stack( ) function to check whether the process is allowed to extend both its stack and its address space; if everything is OK, it sets the vm_start field of vma to address and returns 0; otherwise, it returns 1.

Note that the preceding code skips the tolerance check whenever the VM_GROWSDOWN flag of the region is set and the exception did not occur in User Mode. Those conditions mean that the kernel is addressing the User Mode stack and that the code should always run expand_stack( ).
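The stack-growth test just described is easy to try out in isolation. The function below is a user-space sketch of the two conditions that precede the call to expand_stack( ); it is not the kernel code. The names may_grow_stack( ), PF_USER, and STACK_SLACK are invented for this example, and the numeric value given to VM_GROWSDOWN is an assumption made only so that the sketch compiles on its own.

```c
#include <stdbool.h>

#define VM_GROWSDOWN 0x0100UL  /* assumption: value chosen for this sketch  */
#define PF_USER      4UL       /* bit 2 of error_code: fault in User Mode   */
#define STACK_SLACK  32UL      /* tolerance for instructions such as pusha  */

/*
 * Decide whether a fault at `address`, lying below a region whose flags
 * are `vm_flags`, may be treated as a request to grow the stack.
 */
static bool may_grow_stack(unsigned long vm_flags, unsigned long error_code,
                           unsigned long address, unsigned long esp)
{
    if (!(vm_flags & VM_GROWSDOWN))
        return false;                  /* not a stack region */
    if ((error_code & PF_USER) && address + STACK_SLACK < esp)
        return false;                  /* too far below the stack pointer */
    return true;                       /* the caller would try expand_stack() */
}
```

With a stack pointer of 0xbfffe000, a User Mode access 16 bytes below it passes the test, while one 64 bytes below it does not; the same 64-byte offset is accepted when the fault occurs in Kernel Mode, because the tolerance check is skipped.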


7.4.1 Handling a Faulty Address Outside the Address Space

If address does not belong to the process address space, do_page_fault( ) proceeds to execute the statements at the label bad_area. If the error occurred in User Mode, it sends a SIGSEGV signal to current (see Section 9.2 in Chapter 9) and terminates:

    bad_area:
    if (error_code & 4) { /* User Mode */
        tsk->tss.cr2 = address;
        tsk->tss.error_code = error_code;
        tsk->tss.trap_no = 14;
        force_sig(SIGSEGV, tsk);
        return;
    }

If, however, the exception occurred in Kernel Mode (bit 2 of error_code is clear), there are still two alternatives:

• The exception occurred while using some linear address that has been passed to the kernel as parameter of a system call.
• The exception is due to a real kernel bug.

The function distinguishes these two alternatives as follows:

    no_context:
    if ((fixup = search_exception_table(regs->eip)) != 0) {
        regs->eip = fixup;
        return;
    }

In the first case, it jumps to some "fixup code," which typically sends a SIGSEGV signal to current or terminates a system call handler with a proper error code (see Section 8.2.6 in Chapter 8).

In the second case, the function prints a complete dump of the CPU registers and the Kernel Mode stack on the console and on a system message buffer, then kills the current process by invoking the do_exit( ) function (see Chapter 19). This is the so-called "Kernel oops" error, named after the message displayed. The dumped values can be used by kernel hackers to reconstruct the conditions that triggered the bug, and thus to find and correct it.

7.4.2 Handling a Faulty Address Inside the Address Space

If address belongs to the process address space, do_page_fault( ) proceeds to the statement labeled good_area:


    good_area:
    write = 0;
    if (error_code & 2) { /* write access */
        if (!(vma->vm_flags & VM_WRITE))
            goto bad_area;
        write++;
    } else /* read access */
        if (error_code & 1 ||
            !(vma->vm_flags & (VM_READ | VM_EXEC)))
            goto bad_area;

If the exception was caused by a write access, the function checks whether the memory region is writable. If not, it jumps to the bad_area code; if so, it sets the write local variable to 1.

If the exception was caused by a read or execute access, the function checks whether the page is already present in RAM. If it is, the exception occurred because the process tried to access a privileged page frame (one whose User/Supervisor flag is clear) in User Mode, so the function jumps to the bad_area code. [4] If the page is not present, the function also checks whether the memory region is readable or executable.

[4] However, this case should never happen, since the kernel does not assign privileged page frames to the processes.

If the memory region access rights match the access type that caused the exception, the handle_mm_fault( ) function is invoked:

    if (!handle_mm_fault(tsk, vma, address, write)) {
        tsk->tss.cr2 = address;
        tsk->tss.error_code = error_code;
        tsk->tss.trap_no = 14;
        force_sig(SIGBUS, tsk);
        if (!(error_code & 4)) /* Kernel Mode */
            goto no_context;
    }

The handle_mm_fault( ) function returns 1 if it succeeded in allocating a new page frame for the process; otherwise, it returns an appropriate error code so that do_page_fault( ) can send a SIGBUS signal to the process. It acts on four parameters:

tsk
    A pointer to the descriptor of the process that was running on the CPU when the exception occurred

vma
    A pointer to the descriptor of the memory region including the linear address that caused the exception

address
    The linear address that caused the exception


write_access
    Set to 1 if tsk attempted to write in address and to 0 if tsk attempted to read or execute it

The function starts by checking whether the Page Middle Directory and the Page Table used to map address exist. Even if address belongs to the process address space, the corresponding page tables might not have been allocated, so the task of allocating them precedes everything else:

    pgd = pgd_offset(vma->vm_mm, address);
    pmd = pmd_alloc(pgd, address);
    if (!pmd)
        return -1;
    pte = pte_alloc(pmd, address);
    if (!pte)
        return -1;

The pgd local variable contains the Page Global Directory entry that refers to address; pmd_alloc( ) is invoked to allocate, if needed, a new Page Middle Directory. [5] pte_alloc( ) is then invoked to allocate, if needed, a new Page Table. If both operations are successful, the pte local variable points to the Page Table entry that refers to address. The handle_pte_fault( ) function is then invoked to inspect the Page Table entry corresponding to address:

    return handle_pte_fault(tsk, vma, address, write_access, pte);

[5] On Intel 80x86 microprocessors, this kind of allocation never occurs since Page Middle Directories are included in the Page Global Directory.

The handle_pte_fault( ) function determines how to allocate a new page frame for the process:

• If the accessed page is not present (that is, if it is not already stored in any page frame), the kernel allocates a new page frame and initializes it properly; this technique is called demand paging.
• If the accessed page is present but is marked read only (that is, if it is already stored in a page frame), the kernel allocates a new page frame and initializes its contents by copying the old page frame data; this technique is called Copy On Write.

7.4.3 Demand Paging

The term demand paging denotes a dynamic memory allocation technique that consists of deferring page frame allocation until the last possible moment, that is, until the process attempts to address a page that is not present in RAM, thus causing a "Page fault" exception.

The motivation behind demand paging is that processes do not address all the addresses included in their address space right from the start; in fact, some of these addresses may never be used by the process. Moreover, the program locality principle (see Section 2.4.6 in Chapter 2) ensures that, at each stage of program execution, only a small subset of the process pages are really referenced, and therefore the page frames containing the temporarily useless pages can be used by other processes. Demand paging is thus preferable to global allocation (assigning all page frames to the process right from the start and leaving them in memory


until program termination) since it increases the average number of free page frames in the system and hence allows better use of the available free memory. From another viewpoint, it allows the system as a whole to get a better throughput with the same amount of RAM.

The price to pay for all these good things is system overhead: each "Page fault" exception induced by demand paging must be handled by the kernel, thus wasting CPU cycles. Fortunately, the locality principle ensures that once a process starts working with a group of pages, it will stick with them without addressing other pages for quite a while: "Page fault" exceptions may thus be considered rare events.

An addressed page may not be present in main memory for the following reasons:

• The page was never accessed by the process. The kernel can recognize this case since the Page Table entry is filled with zeros, that is, the pte_none macro returns the value 1.
• The page was already accessed by the process, but its content is temporarily saved on disk. The kernel can recognize this case since the Page Table entry is not filled with zeros (however, the Present flag is cleared, since the page is not present in RAM).

The handle_pte_fault( ) function distinguishes the two cases by inspecting the Page Table entry that refers to address:

    entry = *pte;
    if (!pte_present(entry)) {
        if (pte_none(entry))
            return do_no_page(tsk, vma, address, write_access, pte);
        return do_swap_page(tsk, vma, address, pte, entry, write_access);
    }

We'll examine the case in which the page is saved on disk (do_swap_page( ) function) in Section 16.6 in Chapter 16.

In the other situation, when the page was never accessed, the do_no_page( ) function is invoked. There are two ways to load the missing page, depending on whether the page is mapped to a disk file. The function determines this by checking a field called nopage in the vma memory region descriptor, which points to the function that loads the missing page from disk into RAM if the page is mapped to a file. Therefore, the possibilities are:

• The vma->vm_ops->nopage field is not NULL. In this case, the memory region maps a disk file and the field points to the function that loads the page. This case will be covered in Section 15.2 in Chapter 15 and in Section 18.3.5 in Chapter 18.
• Either the vm_ops field or the vma->vm_ops->nopage field is NULL. In this case, the memory region does not map a file on disk, that is, it is an anonymous mapping. Thus, do_no_page( ) invokes the do_anonymous_page( ) function to get a new page frame:

    if (!vma->vm_ops || !vma->vm_ops->nopage)
        return do_anonymous_page(tsk, vma, page_table, write_access);
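The three states of a Page Table entry that handle_pte_fault( ) tells apart can be illustrated with a small classifier. This is a simplified sketch rather than the kernel macros: it assumes a 32-bit 80x86 entry in which only the Present flag (bit 0) decides presence, and the names pte_classify( ), PTE_PRESENT, and the pte_state values are invented for the example.

```c
#include <stdint.h>

#define PTE_PRESENT 0x1u   /* bit 0 of an 80x86 Page Table entry */

enum pte_state {
    PTE_NEVER_ACCESSED,    /* all zeros: the case handled by do_no_page()  */
    PTE_SWAPPED_OUT,       /* nonzero, Present clear: do_swap_page() case  */
    PTE_IN_RAM             /* Present set: the page frame exists           */
};

/* Classify a raw entry in the same order as the tests in the text. */
static enum pte_state pte_classify(uint32_t entry)
{
    if (entry & PTE_PRESENT)
        return PTE_IN_RAM;
    if (entry == 0)
        return PTE_NEVER_ACCESSED;     /* what pte_none() checks for */
    return PTE_SWAPPED_OUT;
}
```

An all-zero entry is reported as never accessed, a nonzero entry with bit 0 clear as swapped out, and any entry with bit 0 set as present in RAM.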


The do_anonymous_page( ) function handles write and read requests separately:

    if (write_access) {
        page = __get_free_page(GFP_USER);
        memset((void *)(page), 0, PAGE_SIZE);
        entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
                                               vma->vm_page_prot)));
        vma->vm_mm->rss++;
        tsk->min_flt++;
        set_pte(pte, entry);
        return 1;
    }

When handling a write access, the function invokes __get_free_page( ) and fills the new page frame with zeros by using the memset macro. The function then increments the min_flt field of tsk to keep track of the number of minor page faults (those that require only a new page frame) caused by the process and the rss field of the vma->vm_mm process memory descriptor to keep track of the number of page frames allocated to the process. [6] The Page Table entry is then set to the physical address of the page frame, which is marked as writable and dirty.

[6] Linux records the number of minor page faults for each process. This information, together with several other statistics, may be used to tune the system. The value stored in the rss field of memory descriptors is also used by the kernel to select the region from which to steal page frames (see Section 16.7 in Chapter 16).

Conversely, when handling a read access, the content of the page is irrelevant because the process is addressing it for the first time. It is safer to give to the process a page filled with zeros rather than an old page filled with information written by some other process. Linux goes one step further in the spirit of demand paging. There is no need to assign a new page frame filled with zeros to the process right away, since we might as well give it an existing page called zero page, thus deferring further page frame allocation. The zero page is allocated statically during kernel initialization in the empty_zero_page variable (an array of 1024 long integers filled with zeros); it is stored in the sixth page frame, starting from physical address 0x00005000, and it can be referenced by means of the ZERO_PAGE macro.

The Page Table entry is thus set with the physical address of the zero page:

    entry = pte_wrprotect(mk_pte(ZERO_PAGE, vma->vm_page_prot));
    set_pte(pte, entry);
    return 1;

Since the page is marked as nonwritable, if the process attempts to write in it, the Copy On Write mechanism will be activated. Then, and only then, will the process get a page of its own to write in. The mechanism is described in the next section.

7.4.4 Copy On Write

First-generation Unix systems implemented process creation in a rather clumsy way: when a fork( ) system call was issued, the kernel duplicated the whole parent address space in the literal sense of the word and assigned the copy to the child process. This activity was quite time-consuming since it required:


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>• Allocating page fram<strong>es</strong> for the page tabl<strong>es</strong> of the child proc<strong>es</strong>s• Allocating page fram<strong>es</strong> for the pag<strong>es</strong> of the child proc<strong>es</strong>s• Initializing the page tabl<strong>es</strong> of the child proc<strong>es</strong>s• Copying the pag<strong>es</strong> of the parent proc<strong>es</strong>s into the corr<strong>es</strong>ponding pag<strong>es</strong> of the childproc<strong>es</strong>sThis way of creating an addr<strong>es</strong>s space involved many memory acc<strong>es</strong>s<strong>es</strong>, used up many CPUcycl<strong>es</strong>, and entirely spoiled the cache contents. Last but not least, it was often pointl<strong>es</strong>sbecause many child proc<strong>es</strong>s<strong>es</strong> start their execution by loading a new program, thus discardingentirely the inherited addr<strong>es</strong>s space (see Chapter 19).Modern Unix kernels, including <strong>Linux</strong>, follow a more efficient approach called Copy OnWrite, or COW. <strong>The</strong> idea is quite simple: instead of duplicating page fram<strong>es</strong>, they are sharedbetween the parent and the child proc<strong>es</strong>s. However, as long as they are shared, they cannot bemodified. Whenever the parent or the child proc<strong>es</strong>s attempts to write into a shared page frame,an exception occurs, and at this point the kernel duplicat<strong>es</strong> the page into a new page framethat it marks as writable. 
<strong>The</strong> original page frame remains write-protected: when the otherproc<strong>es</strong>s tri<strong>es</strong> to write into it, the kernel checks whether the writing proc<strong>es</strong>s is the only ownerof the page frame; in such a case, it mak<strong>es</strong> the page frame writable for the proc<strong>es</strong>s.<strong>The</strong> count field of the page d<strong>es</strong>criptor is used to keep track of the number of proc<strong>es</strong>s<strong>es</strong> thatare sharing the corr<strong>es</strong>ponding page frame. Whenever a proc<strong>es</strong>s releas<strong>es</strong> a page frame or aCopy On Write is executed on it, its count field is decremented; the page frame is freed onlywhen count becom<strong>es</strong> NULL.Let us now d<strong>es</strong>cribe how <strong>Linux</strong> implements COW. When handle_ pte_fault( )determin<strong>es</strong> that the "Page fault" exception was caused by a requ<strong>es</strong>t to write into a writeprotectedpage pr<strong>es</strong>ent in memory, it execut<strong>es</strong> the following instructions:if (pte_pr<strong>es</strong>ent(pte)) {entry = pte_mkyoung(entry);set_ pte(pte, entry);flush_tlb_page(vma, addr<strong>es</strong>s);if (write_acc<strong>es</strong>s) {if (!pte_write(entry))return do_wp_page(tsk, vma, addr<strong>es</strong>s, pte);entry = pte_mkdirty(entry);set_pte(pte, entry);flush_tlb_page(vma, addr<strong>es</strong>s);}return 1;}First, the pte_mkyoung( ) and set_pte( ) functions are invoked in order to set theAcc<strong>es</strong>sed bit in the Page Table entry of the page that caused the exception. This settingmak<strong>es</strong> the page "younger" and reduc<strong>es</strong> its chanc<strong>es</strong> of being swapped out to disk (see Chapter16). 
If the exception was caused by a write-protection violation, handle_pte_fault( ) returns the value yielded by the do_wp_page( ) function; otherwise, some error condition has been detected (for instance, a page inside the User Mode process address space with the User/Supervisor flag equal to 0), and the function returns the value 1.
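
The visible effect of Copy On Write can be observed from User Mode. The following is a minimal sketch, not kernel code: it assumes a POSIX system, and the function name cow_isolated and the variable value are made up for illustration. The child writes to a page it inherited from its parent; the parent then confirms that its own copy is untouched, which is the behavior the page fault path described above has to guarantee.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Lives in a data page that parent and child share right after fork(). */
static volatile int value = 42;

/* Returns 1 if a write performed by the child leaves the parent's copy
 * of the page unchanged, 0 if it does not, and -1 on error. */
int cow_isolated(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* The first write to the shared page faults; the kernel gives
         * the child a private copy before the store completes. */
        value = 99;
        _exit(value == 99 ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* The parent still sees the original contents. */
    return value == 42;
}
```

The sketch shows only the isolation that results from the mechanism; it cannot show whether a given kernel copied the page lazily or eagerly.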


The do_wp_page( ) function starts by loading the pte local variable with the Page Table entry referenced by the page_table parameter and by getting a new page frame:

    pte = *page_table;
    new_page = __get_free_page(GFP_USER);

Since the allocation of a page frame can block the process, the function performs the following consistency checks on the Page Table entry once the page frame has been obtained:

• Whether the page has been swapped out while the process waited for a free page frame (pte and *page_table do not have the same value)
• Whether the page is no longer in RAM (the page's Present flag is 0 in its Page Table entry)
• Whether the page can now be written (the page's Read/Write flag is 1 in its Page Table entry)

If any of these conditions occurs, do_wp_page( ) releases the page frame obtained previously and returns the value 1.

Now the function updates the number of minor page faults and stores in the page_map local variable a pointer to the page descriptor of the page that caused the exception:

    tsk->min_flt++;
    page_map = mem_map + MAP_NR(old_page);

Next, the function must determine whether the page must really be duplicated. If only one process owns the page, Copy On Write does not apply and the process should be free to write the page. Thus, the page frame is marked as writable so that it will not cause further "Page fault" exceptions when writes are attempted, the previously allocated new page frame is released, and the function terminates with a return value of 1.
This check is made by reading the value of the count field of the page descriptor [7]:

    if (page_map->count == 1) {
        set_pte(page_table, pte_mkdirty(pte_mkwrite(pte)));
        flush_tlb_page(vma, address);
        if (new_page)
            free_page(new_page);
        return 1;
    }

[7] Actually, the check is slightly more complicated, since the count field is also incremented when the page is inserted into the swap cache (see Section 16.3 in Chapter 16).

Conversely, if the page frame is shared among two or more processes, the function copies the content of the old page frame (old_page) into the newly allocated one (new_page):


    if (old_page == ZERO_PAGE)
        memset((void *) new_page, 0, PAGE_SIZE);
    else
        memcpy((void *) new_page, (void *) old_page, PAGE_SIZE);
    set_pte(page_table, pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
    flush_tlb_page(vma, address);
    __free_page(page_map);
    return 1;

If the old page is the zero page, the new frame is efficiently filled with zeros by using the memset macro. Otherwise, the page frame content is copied using the memcpy macro. Special handling for the zero page is not strictly required, but it improves the system performance since it preserves the microprocessor hardware cache by making fewer address references.

The Page Table entry is then updated with the physical address of the new page frame, which is also marked as writable and dirty. Finally, the function invokes __free_page( ) to decrement the usage counter of the old page frame.

7.5 Creating and Deleting a Process Address Space

Out of the six typical cases mentioned in Section 7.1 in which a process gets new memory regions, the first one (issuing a fork( ) system call) requires the creation of a whole new address space for the child process. Conversely, when a process terminates, the kernel destroys its address space.
In this section we'll discuss how these two activities are performed by Linux.

7.5.1 Creating a Process Address Space

We have mentioned in Section 3.3.1 in Chapter 3 that the kernel invokes the copy_mm( ) function while creating a new process. This function takes care of the process address space creation by setting up all page tables and memory descriptors of the new process.

Each process usually has its own address space, but lightweight processes can be created by calling __clone( ) with the CLONE_VM flag set. These share the same address space; that is, they are allowed to address the same set of pages.

Following the COW approach described earlier, traditional processes inherit the address space of their parent: pages stay shared as long as they are only read. When one of the processes attempts to write one of them, however, the page is duplicated; after some time, a forked process usually gets its own address space different from that of the parent process. Lightweight processes, on the other hand, use the address space of their parent process: Linux implements them simply by not duplicating the address space.
Lightweight processes can be created considerably faster than normal processes, and the sharing of pages can also be considered a benefit so long as the parent and children coordinate their accesses carefully.

If the new process has been created by means of the __clone( ) system call and if the CLONE_VM flag of the flag parameter is set, copy_mm( ) gives the clone the address space of its parent:


    if (clone_flags & CLONE_VM) {
        mmget(current->mm);
        copy_segments(nr, tsk, NULL);
        SET_PAGE_DIR(tsk, current->mm->pgd);
        return 0;
    }

The copy_segments( ) function sets up the LDT for the clone process, because even a lightweight process must have a separate LDT entry in the GDT. The SET_PAGE_DIR macro sets the Page Global Directory of the new process and stores the Page Global Directory address in the mm->pgd field of the new memory descriptor.

If the CLONE_VM flag is not set, copy_mm( ) must create a new address space (even though no memory is allocated within that address space until the process requests an address). The function allocates a new memory descriptor and stores its address in the mm field of the new process descriptor; it then initializes several fields in the new process descriptor to 0 and, as in the previous case, sets up the LDT descriptor by invoking copy_segments( ):

    mm = mm_alloc( );
    if (!mm)
        return -ENOMEM;
    tsk->mm = mm;
    copy_segments(nr, tsk, mm);

Next, copy_mm( ) invokes new_page_tables( ) to allocate the Page Global Directory. The last entries of this table, which correspond to linear addresses greater than PAGE_OFFSET, are copied from the Page Global Directory of the swapper process, while the remaining entries are set to 0 (in particular, the Present and Read/Write flags are cleared).
Finally, new_page_tables( ) stores the Page Global Directory address in the mm->pgd field of the new memory descriptor. The dup_mmap( ) function is then invoked to duplicate both the memory regions and the Page Tables of the parent process:

    new_page_tables(tsk);
    dup_mmap(mm);
    return 0;

The dup_mmap( ) function scans the list of regions owned by the parent process, starting from the one pointed to by current->mm->mmap. It duplicates each vm_area_struct memory region descriptor encountered and inserts the copy in the list of regions owned by the child process.

Right after inserting a new memory region descriptor, dup_mmap( ) invokes copy_page_range( ) to create, if necessary, the Page Tables needed to map the group of pages included in the memory region and to initialize the new Page Table entries. In particular, any page frame corresponding to a private, writable page (VM_SHARED flag off and VM_MAYWRITE flag on) is marked as read only for both the parent and the child, so that it will be handled with the Copy On Write mechanism. Finally, if the number of memory regions is greater than or equal to AVL_MIN_MAP_COUNT, the memory region AVL tree of the child process is created by invoking the build_mmap_avl( ) function.
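
The choice that copy_mm( ) makes between sharing and duplicating can be modeled outside the kernel. The sketch below is a toy under stated assumptions: struct toy_mm, toy_copy_mm( ), and TOY_MAX_REGIONS are hypothetical names, the region list is reduced to an array of start addresses, and page tables are not modeled at all.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_MAX_REGIONS 8

/* Hypothetical, much simplified stand-in for a memory descriptor. */
struct toy_mm {
    int count;                                  /* processes using it */
    int nregions;                               /* regions in use */
    unsigned long region_start[TOY_MAX_REGIONS];
};

/* Mimics the decision taken by copy_mm(): share the parent's descriptor
 * when share_vm is nonzero (the CLONE_VM case), otherwise build a
 * private duplicate. Returns NULL if the allocation fails. */
struct toy_mm *toy_copy_mm(struct toy_mm *parent, int share_vm)
{
    if (share_vm) {
        parent->count++;            /* one more user, nothing copied */
        return parent;
    }

    struct toy_mm *child = malloc(sizeof *child);
    if (!child)
        return NULL;
    memcpy(child, parent, sizeof *child);   /* duplicate the region list */
    child->count = 1;                       /* the child is the only user */
    return child;
}
```

A lightweight process obtained with share_vm set sees the same descriptor as its parent, so any region added by one is visible to the other; a duplicated descriptor starts as a copy and then evolves independently.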


7.5.2 Deleting a Process Address Space

When a process terminates, the kernel invokes the exit_mm( ) function to release the address space owned by that process. Since the process is entering the TASK_ZOMBIE state, the function assigns the address space of the swapper process to it:

    flush_tlb_mm(mm);
    tsk->mm = &init_mm;
    tsk->swappable = 0;
    SET_PAGE_DIR(tsk, swapper_pg_dir);
    mm_release( );
    mmput(mm);

The function then invokes mm_release( ) and mmput( ) to release the process address space. The first function clears the fs and gs segmentation registers and restores the LDT of the process to default_ldt; the second function decrements the value of the mm->count field and releases the LDT, the memory region descriptors, and the page tables referred to by mm. Finally, the mm memory descriptor itself is released.

7.6 Managing the Heap

Each Unix process owns a specific memory region called heap, which is used to satisfy the process's dynamic memory requests.
The start_brk and brk fields of the memory descriptor delimit the starting and ending address, respectively, of that region.

The following C library functions can be used by the process to request and release dynamic memory:

malloc(size)
    Request size bytes of dynamic memory; if the allocation succeeds, it returns the linear address of the first memory location.

calloc(n,size)
    Request an array consisting of n elements of size size; if the allocation succeeds, it initializes the array components to 0 and returns the linear address of the first element.

free(addr)
    Release the memory region allocated by malloc( ) or calloc( ) having initial address addr.

brk(addr)
    Modify the size of the heap directly; the addr parameter specifies the new value of current->mm->brk, and the return value is the new ending address of the memory region (the process must check whether it coincides with the requested addr value).
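
The end of the heap can be watched from User Mode. The sketch below assumes a Linux system with glibc and uses sbrk( ), a libc helper built on top of brk( ) that the text above does not cover; grow_break is a made-up name. It asks for more heap, measures how far the end of the region moved, and then gives the memory back.

```c
#define _DEFAULT_SOURCE
#include <stdint.h>
#include <unistd.h>

/* Grows the heap by `bytes` with sbrk() and reports how far the end of
 * the heap actually moved; returns -1 if the request is refused. */
long grow_break(long bytes)
{
    char *before = sbrk(0);                 /* current end of the heap */
    if (sbrk((intptr_t)bytes) == (void *)-1)
        return -1;
    char *after = sbrk(0);                  /* end of the heap after growing */
    sbrk(-(intptr_t)bytes);                 /* shrink back to where we started */
    return (long)(after - before);
}
```

Programs normally leave this to malloc( ) and free( ); moving the break by hand while the allocator is also using the heap is safe here only because the sketch restores the break before returning.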


The brk( ) function differs from the other functions listed because it is the only one implemented as a system call: all the other functions are implemented in the C library by making use of brk( ) and mmap( ).

When a process in User Mode invokes the brk( ) system call, the kernel executes the sys_brk(addr) function (see Chapter 8). This function verifies first whether the addr parameter falls inside the memory region that contains the process code; if so, it returns immediately:

    mm = current->mm;
    if (addr < mm->end_code)
        return mm->brk;

Since the brk( ) system call acts on a memory region, it allocates and deallocates whole pages. Therefore, the function aligns the value of addr to a multiple of PAGE_SIZE, then compares the result with the value of the brk field of the memory descriptor:

    newbrk = (addr + 0xfff) & 0xfffff000;
    oldbrk = (mm->brk + 0xfff) & 0xfffff000;
    if (oldbrk == newbrk) {
        mm->brk = addr;
        return mm->brk;
    }

If the process has asked to shrink the heap, sys_brk( ) invokes the do_munmap( ) function to do the job and then returns:

    if (addr <= mm->brk) {
        if (!do_munmap(newbrk, oldbrk-newbrk))
            mm->brk = addr;
        return mm->brk;
    }

If the process has asked to enlarge the heap, sys_brk( ) checks first whether the process is allowed to do so.
If the process is trying to allocate memory outside its limit, the function simply returns the original value of mm->brk without allocating more memory:

    rlim = current->rlim[RLIMIT_DATA].rlim_cur;
    if (rlim < RLIM_INFINITY && addr - mm->end_code > rlim)
        return mm->brk;

The function then checks whether the enlarged heap would overlap some other memory region belonging to the process and, if so, returns without doing anything:

    if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
        return mm->brk;

The last check before proceeding to the expansion consists of verifying whether the available free virtual memory is sufficient to support the enlarged heap (see Section 7.3.4):

    if (!vm_enough_memory((newbrk-oldbrk) >> PAGE_SHIFT))
        return mm->brk;
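
The page rounding that sys_brk( ) applies before any of these checks can be tried in isolation. The sketch below assumes 4 KB pages and 32-bit addresses, as in the listings above; toy_page_align( ), toy_brk_page_delta( ), and TOY_PAGE_SIZE are illustrative names, not kernel symbols.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 0x1000u           /* 4 KB pages, as on the 80x86 */

/* Rounds addr up to the next page boundary; equivalent to
 * (addr + 0xfff) & 0xfffff000 for 32-bit addresses. */
uint32_t toy_page_align(uint32_t addr)
{
    return (addr + (TOY_PAGE_SIZE - 1)) & ~(TOY_PAGE_SIZE - 1);
}

/* Number of whole pages to map (positive) or unmap (negative) when the
 * end of the heap moves from old_brk to new_brk. A result of 0 means
 * both values fall in the same page and no mapping has to change. */
int32_t toy_brk_page_delta(uint32_t old_brk, uint32_t new_brk)
{
    uint32_t oldbrk = toy_page_align(old_brk);
    uint32_t newbrk = toy_page_align(new_brk);
    return (int32_t)(newbrk / TOY_PAGE_SIZE) - (int32_t)(oldbrk / TOY_PAGE_SIZE);
}
```

For instance, moving the end of the heap from 0x8001 to 0x8fff changes no mapping, because both addresses round up to 0x9000; this is the oldbrk == newbrk case in which sys_brk( ) only records the new value.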


If everything is OK, the do_mmap( ) function is invoked with the MAP_FIXED flag set: if it returns the oldbrk value, the allocation was successful and sys_brk( ) returns the value addr; otherwise, it returns the old mm->brk value:

    if (do_mmap(NULL, oldbrk, newbrk-oldbrk,
                PROT_READ|PROT_WRITE|PROT_EXEC,
                MAP_FIXED|MAP_PRIVATE, 0) == oldbrk)
        mm->brk = addr;
    return mm->brk;

7.7 Anticipating Linux 2.4

Besides minor optimizations and adjustments, the process address space is handled in the same way by Linux 2.4.


Chapter 8. System Calls

Operating systems offer processes running in User Mode a set of interfaces to interact with hardware devices such as the CPU, disks, printers, and so on. Putting an extra layer between the application and the hardware has several advantages. First, it makes programming easier, freeing users from studying low-level programming characteristics of hardware devices. Second, it greatly increases system security, since the kernel can check the correctness of the request at the interface level before attempting to satisfy it. Last but not least, these interfaces make programs more portable since they can be compiled and executed correctly on any kernel that offers the same set of interfaces.

Unix systems implement most interfaces between User Mode processes and hardware devices by means of system calls issued to the kernel. This chapter examines in detail how system calls are implemented by the Linux kernel.

8.1 POSIX APIs and System Calls

Let us start by stressing the difference between an application programmer interface (API) and a system call. The former is a function definition that specifies how to obtain a given service, while the latter is an explicit request to the kernel made via a software interrupt.

Unix systems include several libraries of functions that provide APIs to programmers. Some of the APIs defined by the libc standard C library refer to wrapper routines, that is, routines whose only purpose is to issue a system call.
Usually, each system call corresponds to a wrapper routine; the wrapper routine defines the API that application programs should refer to.

The converse is not true, by the way: an API does not necessarily correspond to a specific system call. First of all, the API could offer its services directly in User Mode. (For something abstract like math functions, there may be no reason to make system calls.) Second, a single API function could make several system calls. Moreover, several API functions could make the same system call but wrap extra functionality around it. For instance, in Linux the malloc( ), calloc( ), and free( ) POSIX APIs are implemented in the libc library: the code in that library keeps track of the allocation and deallocation requests and uses the brk( ) system call in order to enlarge or shrink the process heap (see Section 7.6 in Chapter 7).

The POSIX standard refers to APIs and not to system calls. A system can be certified as POSIX-compliant if it offers the proper set of APIs to the application programs, no matter how the corresponding functions are implemented. As a matter of fact, several non-Unix systems have been certified as POSIX-compliant since they offer all traditional Unix services in User Mode libraries.

From the programmer's point of view, the distinction between an API and a system call is irrelevant: the only things that matter are the function name, the parameter types, and the meaning of the return code. From the kernel designer's point of view, however, the distinction does matter since system calls belong to the kernel, while User Mode libraries don't.
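
The difference between calling an API and requesting a system call explicitly can be made concrete. The sketch below assumes Linux with glibc, whose generic syscall( ) function and SYS_getpid constant are not discussed in this chapter; the function name wrapper_matches_raw_call is invented for the example. It obtains the process ID once through the getpid( ) wrapper and once through an explicit request, and compares the two.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the getpid() wrapper and an explicit request for the
 * same system call report the same process ID, 0 otherwise. */
int wrapper_matches_raw_call(void)
{
    pid_t via_wrapper = getpid();               /* libc API */
    long via_syscall = syscall(SYS_getpid);     /* explicit kernel request */
    return (long)via_wrapper == via_syscall;
}
```

On current systems the library may enter the kernel with a different instruction than the one described in this chapter, which is exactly why application code is better off relying on the API than on the entry mechanism.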


Most wrapper routines return an integer value, whose meaning depends on the corresponding system call. A return value of -1 denotes in most cases, but not always, that the kernel was unable to satisfy the process request. A failure in the system call handler may be caused by invalid parameters, a lack of available resources, hardware problems, and so on. The specific error code is contained in the errno variable, which is defined in the libc library.

Each error code is associated with a macro, which yields a corresponding positive integer value. The POSIX standard specifies the macro names of several error codes. In Linux on Intel 80x86 systems, those macros are defined in a header file called include/asm-i386/errno.h. To allow portability of C programs among Unix systems, the include/asm-i386/errno.h header file is included, in turn, in the standard /usr/include/errno.h C library header file. Other systems have their own specialized subdirectories of header files.

8.2 System Call Handler and Service Routines

When a User Mode process invokes a system call, the CPU switches to Kernel Mode and starts the execution of a kernel function.
In Linux the system calls must be invoked by executing the int $0x80 Assembly instruction, which raises the programmed exception having vector 128 (see Section 4.4.1 and Section 4.2.5 in Chapter 4).

Since the kernel implements many different system calls, the process must pass a parameter called the system call number to identify the required system call; the eax register is used for that purpose. As we shall see in Section 8.2.3 later in this chapter, additional parameters are usually passed when invoking a system call.

All system calls return an integer value. The conventions for these return values are different from those for wrapper routines. In the kernel, positive or null values denote a successful termination of the system call, while negative values denote an error condition. In the latter case, the value is the negation of the error code that must be returned to the application program.
The errno variable is not set or used by the kernel.

The system call handler, which has a structure similar to that of the other exception handlers, performs the following operations:

• Saves the contents of most registers in the Kernel Mode stack (this operation is common to all system calls and is coded in assembly language).
• Handles the system call by invoking a corresponding C function called the system call service routine.
• Exits from the handler by means of the ret_from_sys_call( ) function (this function is coded in assembly language).

The name of the service routine associated with the xyz( ) system call is usually sys_xyz( ); there are, however, a few exceptions to this rule.

Figure 8-1 illustrates the relationships among the application program that invokes a system call, the corresponding wrapper routine, the system call handler, and the system call service routine. The arrows denote the execution flow between the functions.
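
The two return conventions described in this section, a negated error code from the service routine and -1 plus errno from the wrapper, can be connected with a few lines of C. The sketch below illustrates the translation only and is not the code of any real C library; toy_wrapper_result( ) is a hypothetical name, and the error code is stored through a pointer rather than in errno so that the function has no side effects.

```c
#include <errno.h>

/* Converts the value produced by a service routine (a negative error
 * code on failure) into the wrapper convention: -1 is returned and the
 * positive error code is stored in *err. Successful values pass through
 * unchanged and *err is left alone. */
long toy_wrapper_result(long kernel_ret, int *err)
{
    if (kernel_ret < 0) {
        *err = (int)-kernel_ret;    /* e.g. -ENOSYS becomes ENOSYS */
        return -1;
    }
    return kernel_ret;
}
```

A real wrapper has to be more careful than this, because some system calls can legitimately return values that look negative when interpreted as signed integers.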


Figure 8-1. Invoking a system call

In order to associate each system call number with its corresponding service routine, the kernel makes use of a system call dispatch table; this table is stored in the sys_call_table array and has NR_syscalls entries (usually 256): the nth entry contains the service routine address of the system call having number n.

The NR_syscalls macro is just a static limit on the maximum number of implementable system calls: it does not indicate the number of system calls actually implemented. Indeed, any entry of the dispatch table may contain the address of the sys_ni_syscall( ) function, which is the service routine of the "nonimplemented" system calls: it just returns the error code -ENOSYS.

8.2.1 Initializing System Calls

The trap_init( ) function, invoked during kernel initialization, sets up the IDT entry corresponding to vector 128 as follows:

    set_system_gate(0x80, &system_call);

The call loads the following values into the gate descriptor fields (see Section 4.4.1 in Chapter 4):

Segment Selector
    The __KERNEL_CS Segment Selector of the kernel code segment.

Offset
    Pointer to the system_call( ) exception handler.

Type
    Set to 15. Indicates that the exception is a Trap and that the corresponding handler does not disable maskable interrupts.


DPL (Descriptor Privilege Level)
    Set to 3; this allows processes in User Mode to invoke the exception handler (see Section 4.2.5 in Chapter 4).

8.2.2 The system_call( ) Function

The system_call( ) function implements the system call handler. It starts by saving the system call number and all the CPU registers that may be used by the exception handler on the stack, except for eflags, cs, eip, ss, and esp, which have already been saved automatically by the control unit (see Section 4.2.5 in Chapter 4). The SAVE_ALL macro, which was already discussed in Section 4.6.3 in Chapter 4, also loads the Segment Selector of the kernel data segment in ds and es:

    system_call:
        pushl %eax
        SAVE_ALL
        movl %esp, %ebx
        andl $0xffffe000, %ebx

The function also stores in ebx the address of the current process descriptor; this is done by taking the value of the kernel stack pointer and rounding it down to a multiple of 8 KB (see Section 3.1.2 in Chapter 3).

A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to NR_syscalls, the system call handler terminates:

        cmpl $(NR_syscalls), %eax
        jb nobadsys
        movl $(-ENOSYS), 24(%esp)
        jmp ret_from_sys_call
    nobadsys:

If the system call number is not valid, the function stores the -ENOSYS value in the stack location where the eax register has been saved (at offset 24 from the current stack top). It then jumps to ret_from_sys_call( ).
In this way, when the process resumes its execution in User Mode, it will find a negative return code in eax.

Next, the system_call( ) function checks whether the PF_TRACESYS flag included in the flags field of current is equal to 1, that is, whether the system call invocations of the executed program are being traced by some debugger. If this is the case, system_call( ) invokes the syscall_trace( ) function twice, once right before and once right after the execution of the system call service routine. This function stops current and thus allows the debugging process to collect information about it.

Finally, the specific service routine associated with the system call number contained in eax is invoked:

    call *sys_call_table(0, %eax, 4)

Since each entry in the dispatch table is 4 bytes long, the kernel finds the address of the service routine to be invoked by first multiplying the system call number by 4, adding the initial address of the sys_call_table dispatch table, and extracting a pointer to the service routine from that slot in the table.

When the service routine terminates, system_call( ) gets its return code from eax and stores it in the stack location where the User Mode value of the eax register has been saved. It then jumps to ret_from_sys_call( ), which terminates the execution of the system call handler (see Section 4.7.2 in Chapter 4):

    movl %eax, 24(%esp)
    jmp ret_from_sys_call

When the process resumes its execution in User Mode, it will find in eax the return code of the system call.

8.2.3 Parameter Passing

Like ordinary functions, system calls often require some input/output parameters, which may consist of actual values (i.e., numbers) or addresses of functions and variables in the address space of the User Mode process. Since the system_call( ) function is the unique entry point for all system calls in Linux, each of them has at least one parameter: the system call number passed in the eax register. For instance, if an application program invokes the fork( ) wrapper routine, the eax register is set to 5 before executing the int $0x80 Assembly instruction. Because the register is set by the wrapper routines included in the libc library, programmers do not usually care about the system call number.

The fork( ) system call does not require other parameters.
However, many system calls do require additional parameters, which must be explicitly passed by the application program. For instance, the mmap( ) system call may require up to six parameters (besides the system call number).

The parameters of ordinary functions are passed by writing their values in the active program stack (either the User Mode stack or the Kernel Mode stack). But system call parameters are usually passed to the system call handler in the CPU registers, then copied onto the Kernel Mode stack, since system call service routines are ordinary C functions.

Why doesn't the kernel copy parameters directly from the User Mode stack to the Kernel Mode stack? First of all, working with two stacks at the same time is complex; moreover, the use of registers makes the structure of the system call handler similar to that of other exception handlers.

However, in order to pass parameters in registers, two conditions must be satisfied:

• The length of each parameter cannot exceed the length of a register, that is, 32 bits. [1]
• The number of parameters must not exceed six (including the system call number passed in eax), since the Intel Pentium has a very limited number of registers.

[1] We refer as usual to the 32-bit architecture of the Intel 80x86 processors. The discussion in this section does not apply to Compaq's Alpha 64-bit processors.
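
The combination of a dispatch table indexed by the system call number and parameters carried in registers can be imitated in ordinary C. The sketch below is a User Mode toy, not the kernel's assembly code: struct toy_regs, toy_call_table, toy_sys_add( ), and toy_system_call( ) are hypothetical names, the table has four entries instead of NR_syscalls, and only three parameter registers are forwarded.

```c
#include <errno.h>

#define TOY_NR_SYSCALLS 4

/* Hypothetical register snapshot: eax selects the call, the other
 * fields carry its parameters. */
struct toy_regs { long eax, ebx, ecx, edx, esi, edi; };

typedef long (*toy_service_t)(long, long, long);

/* Service routine for every slot that has no real implementation. */
static long toy_ni_syscall(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

/* A made-up service routine taking two parameters. */
static long toy_sys_add(long a, long b, long c)
{
    (void)c;
    return a + b;
}

/* Dispatch table: the nth entry is the routine for call number n. */
static const toy_service_t toy_call_table[TOY_NR_SYSCALLS] = {
    toy_ni_syscall, toy_sys_add, toy_ni_syscall, toy_ni_syscall
};

/* Validates the call number, then invokes the selected routine with the
 * parameters found in ebx, ecx, and edx. */
long toy_system_call(const struct toy_regs *regs)
{
    if (regs->eax < 0 || regs->eax >= TOY_NR_SYSCALLS)
        return -ENOSYS;
    return toy_call_table[regs->eax](regs->ebx, regs->ecx, regs->edx);
}
```

Filling unused slots with a routine that returns -ENOSYS, rather than leaving them empty, keeps the dispatch step free of special cases: every valid index leads to a callable function.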


The first condition is always true since, according to the POSIX standard, large parameters that cannot be stored in a 32-bit register must be passed by specifying their addresses. A typical example is the settimeofday( ) system call, which must read two 64-bit structures.

However, system calls that have more than six parameters exist: in such cases, a single register is used to point to a memory area in the process address space that contains the parameter values. Of course, programmers do not have to care about this workaround. As with any C call, parameters are automatically saved on the stack when the wrapper routine is invoked. This routine will find the appropriate way to pass the parameters to the kernel.

The six registers used to store system call parameters are, in increasing order: eax (for the system call number), ebx, ecx, edx, esi, and edi. As seen before, system_call( ) saves the values of these registers on the Kernel Mode stack by using the SAVE_ALL macro. Therefore, when the system call service routine goes to the stack, it finds the return address to system_call( ), followed by the parameter stored in ebx (that is, the first parameter of the system call), the parameter stored in ecx, and so on (see Section 4.6.3 in Chapter 4). This stack configuration is exactly the same as in an ordinary function call, and therefore the service routine can easily refer to its parameters by using the usual C-language constructs.

Let's look at an example.
<strong>The</strong> sys_write( ) service routine, which handl<strong>es</strong> the write( )system call, is declared as:int sys_write (unsigned int fd, const char * buf,unsigned int count)<strong>The</strong> C compiler produc<strong>es</strong> an assembly language function that expects to find the fd, buf, andcount parameters on top of the stack, right below the return addr<strong>es</strong>s, in the locations used tosave the contents of the ebx, ecx, and edx registers, r<strong>es</strong>pectively.In a few cas<strong>es</strong>, even if the system call do<strong>es</strong>n't make use of any parameters, the corr<strong>es</strong>pondingservice routine needs to know the contents of the CPU registers right before the system callwas issued. As an example, the do_fork( ) function that implements fork( ) needs to knowthe value of the registers in order to duplicate them in the child proc<strong>es</strong>s TSS. In th<strong>es</strong>e cas<strong>es</strong>, asingle parameter of type pt_regs allows the service routine to acc<strong>es</strong>s the valu<strong>es</strong> saved in the<strong>Kernel</strong> Mode stack by the SAVE_ALL macro (see Section 4.6.4 in Chapter 4):int sys_fork (struct pt_regs regs)<strong>The</strong> return value of a service routine must be written into the eax register. This isautomatically done by the C compiler when a return n; instruction is executed.8.2.4 Verifying the ParametersAll system call parameters must be carefully checked before the kernel attempts to satisfy auser requ<strong>es</strong>t. 
<strong>The</strong> type of check depends both on the system call and on the specific parameter.Let us go back to the write( ) system call introduced before: the fd parameter should be afile d<strong>es</strong>criptor that d<strong>es</strong>crib<strong>es</strong> a specific file, so sys_write( ) must check whether fd really isa file d<strong>es</strong>criptor of a file previously opened and whether the proc<strong>es</strong>s is allowed to write into it(see Section 1.5.6 in Chapter 1). If any of th<strong>es</strong>e conditions is not true, the handler must returna negative value, in this case the error code -EBADF.222
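The path from saved registers to a checked C parameter can be modeled in user space. The following is only a sketch under stated assumptions, not kernel code: struct toy_regs, toy_fd_writable, toy_sys_write( ), and toy_system_call( ) are hypothetical names, the register values are plain integers, and the dispatcher handles a single system call. The -EBADF result and the use of ebx, ecx, and edx for the three parameters follow the text; the -ENOSYS result for an unknown number is an assumption of the sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Saved register values, in the role played by the SAVE_ALL stack frame. */
struct toy_regs {
    uint32_t eax, ebx, ecx, edx, esi, edi;
};

/* Hypothetical per-process table: nonzero if the descriptor is open for writing. */
#define TOY_NR_OPEN 4
static int toy_fd_writable[TOY_NR_OPEN] = { 0, 1, 1, 0 };

/* System call number used for write( ) in this sketch. */
#define TOY_NR_write 4

/* Service routine: an ordinary C function taking three parameters. */
static long toy_sys_write(uint32_t fd, uint32_t buf, uint32_t count)
{
    (void)buf;  /* the buffer address is not dereferenced in this model */
    if (fd >= TOY_NR_OPEN || !toy_fd_writable[fd])
        return -EBADF;          /* not an open, writable descriptor */
    return (long)count;         /* pretend every byte was written */
}

/* Dispatcher: eax selects the routine; ebx, ecx, edx carry its parameters. */
static long toy_system_call(const struct toy_regs *regs)
{
    switch (regs->eax) {
    case TOY_NR_write:
        return toy_sys_write(regs->ebx, regs->ecx, regs->edx);
    default:
        return -ENOSYS;         /* assumed result for an unknown number */
    }
}
```

Filling a struct toy_regs with TOY_NR_write in eax and a descriptor in ebx, then calling toy_system_call( ), shows that the service routine itself never looks at registers: it sees only ordinary C parameters, and a bad descriptor surfaces as a negative return value.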


One type of checking, however, is common to all system calls: whenever a parameter specifies an address, the kernel must check whether it is inside the process address space. There are two possible ways to perform this check:

• Verify that the linear address belongs to the process address space and, if so, that the memory region including it has the proper access rights.
• Verify just that the linear address is lower than PAGE_OFFSET (i.e., that it doesn't fall within the range of addresses reserved to the kernel).

Previous Linux kernels performed the first type of checking. But it is quite time-consuming since it must be executed for each address parameter included in a system call; furthermore, it is usually pointless because faulty programs are not very common.

Therefore, the Linux 2.2 kernel performs the second type of checking. It is much more efficient because it does not require any scan of the process memory region descriptors. Obviously, it is a very coarse check: verifying that the linear address is smaller than PAGE_OFFSET is a necessary but not sufficient condition for its validity. But there's no risk in confining the kernel to this limited kind of check because other errors will be caught later.

The approach followed in Linux 2.2 is thus to defer the real checking until the last possible moment, that is, until the Paging Unit translates the linear address into a physical one. We shall discuss in Section 8.2.6 later in this chapter how the "Page fault" exception handler succeeds in detecting those bad addresses issued in Kernel Mode that have been passed as parameters by User Mode processes.

One might wonder at this point why the coarse check is performed at all. This type of checking is actually crucial to preserve both process address spaces and the kernel address space from illegal accesses. We have seen in Chapter 2 that the RAM is mapped starting from PAGE_OFFSET. This means that kernel routines are able to address all pages present in memory. Thus, if the coarse check were not performed, a User Mode process might pass an address belonging to the kernel address space as a parameter and then be able to read or write any page present in memory without causing a "Page fault" exception!

The check on addresses passed to system calls is performed by the verify_area( ) function, which acts on two [2] parameters denoted as addr and size. The function checks the address interval delimited by addr and addr + size - 1, and is essentially equivalent to the following C function:

    int verify_area(const void * addr, unsigned long size)
    {
        unsigned long a = (unsigned long) addr;
        if (a + size < a || a + size > current->addr_limit.seg)
            return -EFAULT;
        return 0;
    }

[2] A third parameter named type specifies whether the system call should read or write the referred memory locations. It is used only in systems having buggy versions of the Intel 80486 microprocessor, in which writing in Kernel Mode to a write-protected page does not generate a page fault. We don't discuss this case further.

The function verifies first whether addr + size, the highest address to be checked, is larger than 2^32 - 1; since unsigned long integers and pointers are represented by the GNU C compiler


(gcc) as 32-bit numbers, this is equivalent to checking for an overflow condition. The function also checks whether addr + size exceeds the value stored in the addr_limit.seg field of current. This field usually has the value PAGE_OFFSET-1 for normal processes and the value 0xffffffff for kernel threads. The value of the addr_limit.seg field can be dynamically changed by the get_fs and set_fs macros; this allows the kernel to invoke system call service routines directly and pass addresses in the kernel data segment to them.

The access_ok macro performs the same check as verify_area( ). It yields 1 if the specified address interval is valid and 0 otherwise.

8.2.5 Accessing the Process Address Space

System call service routines quite often need to read or write data contained in the process's address space. Linux includes a set of macros that make this access easier. We'll describe two of them, called get_user( ) and put_user( ). The first can be used to read 1, 2, or 4 consecutive bytes from an address, while the second can be used to write data of those sizes into an address.

Each function accepts two arguments, a value x to transfer and a variable ptr. The second variable also determines how many bytes to transfer. Thus, in get_user(x,ptr), the size of the variable pointed to by ptr causes the function to expand into a __get_user_1( ), __get_user_2( ), or __get_user_4( ) assembly language function.
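Both ideas just described, the coarse range check and the dispatch on the size of the pointed-to variable, can be tried out in user space. The sketch below is an illustration under several assumptions: toy_addr_limit stands in for current->addr_limit.seg, the names toy_range_ok( ), toy_get_user_n( ), and toy_get_user( ) are hypothetical, the dispatch is a C switch rather than a set of assembly language functions, and rejecting unsupported sizes with -EFAULT is a choice made only for this example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for current->addr_limit.seg; wide open by default. */
static uintptr_t toy_addr_limit = UINTPTR_MAX;

/*
 * Coarse check in the style of access_ok: yields 1 if the interval
 * starting at addr and spanning size bytes neither wraps around the
 * top of the address space nor ends beyond the limit, 0 otherwise.
 */
static int toy_range_ok(const void *addr, size_t size)
{
    uintptr_t start = (uintptr_t)addr;
    uintptr_t end = start + size;       /* unsigned arithmetic wraps on overflow */

    if (end < start || end > toy_addr_limit)
        return 0;
    return 1;
}

/* Reads 1, 2, or 4 bytes from ptr and zero-extends them into *x. */
static int toy_get_user_n(uint32_t *x, const void *ptr, size_t n)
{
    uint8_t b;
    uint16_t w;
    uint32_t l;

    if (!toy_range_ok(ptr, n))
        return -EFAULT;
    switch (n) {
    case 1: memcpy(&b, ptr, 1); *x = b; return 0;   /* role of movzbl */
    case 2: memcpy(&w, ptr, 2); *x = w; return 0;   /* role of movzwl */
    case 4: memcpy(&l, ptr, 4); *x = l; return 0;   /* role of movl */
    default: return -EFAULT;                        /* unsupported size (sketch's choice) */
    }
}

/* The size of the variable pointed to by ptr selects the transfer width. */
#define toy_get_user(x, ptr) toy_get_user_n(&(x), (ptr), sizeof(*(ptr)))
```

With a uint32_t destination, toy_get_user(out, p) transfers one, two, or four bytes depending only on the type of p. Lowering toy_addr_limit below the end of the source variable makes the same call fail with -EFAULT, which mirrors the effect of addr_limit.seg on a normal process.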
Let us consider one of them, for instance, __get_user_2( ):

    __get_user_2:
        addl $1, %eax
        jc bad_get_user
        movl %esp, %edx
        andl $0xffffe000, %edx
        cmpl 12(%edx), %eax
        jae bad_get_user
    2:  movzwl -1(%eax), %edx
        xorl %eax, %eax
        ret
    bad_get_user:
        xorl %edx, %edx
        movl $-EFAULT, %eax
        ret

The eax register contains the address ptr of the first byte to be read. The first six instructions essentially perform the same checks as the verify_area( ) function: they ensure that the 2 bytes to be read have addresses less than 4 GB as well as less than the addr_limit.seg field of the current process. (This field is stored at offset 12 in the process descriptor, which appears in the first operand of the cmpl instruction.)

If the addresses are valid, the function executes the movzwl instruction to store the data to be read in the 2 least significant bytes of the edx register while setting the high-order bytes of edx to 0; then it sets a return code of 0 in eax and terminates. If the addresses are not valid, the function clears edx, sets the -EFAULT value into eax, and terminates.

The put_user(x,ptr) macro is similar to the one discussed before, except that it writes the value x into the process address space starting from address ptr. Depending on the size of x


(1, 2, or 4 bytes), it invokes the __put_user_1( ), __put_user_2( ), or __put_user_4( ) function. Let's consider __put_user_4( ) for our example this time. The function performs the usual checks on the ptr address stored in the eax register, then executes a movl instruction to write the 4 bytes stored in the edx register. The function returns the value 0 in the eax register if it succeeds, and -EFAULT otherwise.

Several other functions and macros are available to access the process address space in Kernel Mode; they are listed in Table 8-1. Notice that many of them also have a variant prefixed by two underscores (__). The ones without initial underscores take extra time to check the validity of the linear address interval requested, while the ones with the underscores bypass that check. Whenever the kernel must repeatedly access the same memory area in the process address space, it is more efficient to check the address once at the start, then access the process area without making any further checks.

Table 8-1. Functions and Macros that Access the Process Address Space

Function                                  Action
get_user, __get_user                      Reads an integer value from user space (1, 2, or 4 bytes)
put_user, __put_user                      Writes an integer value to user space (1, 2, or 4 bytes)
get_user_ret, __get_user_ret              Like get_user, but returns a specified value on error
put_user_ret, __put_user_ret              Like put_user, but returns a specified value on error
copy_from_user, __copy_from_user          Copies a block of arbitrary size from user space
copy_to_user, __copy_to_user              Copies a block of arbitrary size to user space
copy_from_user_ret                        Like copy_from_user, but returns a specified value on error
copy_to_user_ret                          Like copy_to_user, but returns a specified value on error
strncpy_from_user, __strncpy_from_user    Copies a null-terminated string from user space
strlen_user, strnlen_user                 Returns the length of a null-terminated string in user space
clear_user, __clear_user                  Fills a memory area in user space with zeros

8.2.6 Dynamic Address Checking: The Fixup Code

As seen previously, the verify_area( ) function and the access_ok macro make only a coarse check on the validity of linear addresses passed as parameters of a system call. Since


they do not ensure that these addresses are included in the process address space, a process could cause a "Page fault" exception by passing a wrong address.

Before describing how the kernel detects this type of error, let us specify the three cases in which "Page fault" exceptions may occur in Kernel Mode:

• The kernel attempts to address a page belonging to the process address space, but either the corresponding page frame does not exist, or the kernel is trying to write a read-only page.
• Some kernel function includes a programming bug that causes the exception to be raised when that program is executed; alternatively, the exception might be caused by a transient hardware error.
• The case introduced in this chapter: a system call service routine attempts to read or write into a memory area whose address has been passed as a system call parameter, but that address does not belong to the process address space.

These cases must be distinguished by the page fault handler, since the actions to be taken are quite different. In the first case, the handler must allocate and initialize a new page frame (see Section 7.4.3 and Section 7.4.4 in Chapter 7); in the second case, the handler must perform a kernel oops (see Section 7.4.1 in Chapter 7); in the third case, the handler must terminate the system call by returning a proper error code.

The page fault handler can easily recognize the first case by determining whether the faulty linear address is included in one of the memory regions owned by the process. Let us now explain how the handler distinguishes the remaining two cases.

8.2.6.1 The exception tables

The key to determining the source of a page fault lies in the narrow range of calls that the kernel uses to access the process address space. Only the small group of functions and macros described in the previous section are ever used to access that address space; thus, if the exception is caused by an invalid parameter, the instruction that caused it must be included in one of the functions or be generated by expanding one of the macros. If you add up the code in all these functions and macros, they consist of a fairly small set of addresses.

Therefore, it would not take much effort to put the address of any kernel instruction that accesses the process address space into a structure called the exception table. If we succeed in doing this, the rest is easy. When a "Page fault" exception occurs in Kernel Mode, the do_page_fault( ) handler examines the exception table: if it includes the address of the instruction that triggered the exception, the error is caused by a bad system call parameter; otherwise, it is caused by some more serious bug.

Linux defines several exception tables. The main exception table is automatically generated by the C compiler when building the kernel program image. It is stored in the __ex_table section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ex_table and __stop___ex_table.

Moreover, each dynamically loaded module of the kernel (see Appendix B, Modules) includes its own local exception table. This table is automatically generated by the C compiler when


building the module image, and it is loaded in memory when the module is inserted in the running kernel.

Each entry of an exception table is an exception_table_entry structure having two fields:

insn
    The linear address of an instruction that accesses the process address space

fixup
    The address of the assembly language code to be invoked when a "Page fault" exception triggered by the instruction located at insn occurs

The fixup code consists of a few assembly language instructions that solve the problem triggered by the exception. As we shall see later in this section, the fix usually consists of inserting a sequence of instructions that forces the service routine to return an error code to the User Mode process. Such instructions are usually defined in the same macro or function that accesses the process address space; sometimes, they are placed by the C compiler in a separate section of the kernel code segment called .fixup.

The search_exception_table( ) function is used to search for a specified address in all exception tables: if the address is included in a table, the function returns the corresponding fixup address; otherwise, it returns 0. Thus the page fault handler do_page_fault( ) executes the following statements:

    if ((fixup = search_exception_table(regs->eip)) != 0) {
        regs->eip = fixup;
        return;
    }

The regs->eip field contains the value of the eip register saved on the Kernel Mode stack when the exception occurred. If the value in the register (the instruction pointer) is in an exception table, do_page_fault( ) replaces the saved value with the address returned by search_exception_table( ). Then the page fault handler terminates and the interrupted program resumes with execution of the fixup code.

8.2.6.2 Generating the exception tables and the fixup code

The GNU Assembler .section directive allows programmers to specify which section of the executable file contains the code that follows. As we shall see in Chapter 19, an executable file includes a code segment, which in turn may be subdivided into sections. Thus, the following assembly language instructions add an entry into an exception table; the "a" attribute specifies that the section must be loaded in memory together with the rest of the kernel image:

    .section __ex_table, "a"
        .long faulty_instruction_address, fixup_code_address
    .previous
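Each such pair of .long values contributes one entry made of an instruction address and a fixup address. The resulting table, and the lookup that do_page_fault( ) performs on it, can be modeled in a few lines of C. This is a sketch under stated assumptions: it uses a single table searched linearly, made-up addresses, and the hypothetical names toy_ex_table, toy_search_exception_table( ), and toy_fixup_fault( ); the kernel's own search_exception_table( ) must also consult the tables of loaded modules, and its search strategy is not shown here.

```c
#include <stddef.h>

/* Two-field entry, mirroring the insn/fixup pair described in the text. */
struct toy_exception_table_entry {
    unsigned long insn;     /* address of an instruction that may fault */
    unsigned long fixup;    /* address of the code to resume at */
};

/* Made-up addresses, for illustration only. Two entries share one fixup. */
static const struct toy_exception_table_entry toy_ex_table[] = {
    { 0xc0101000UL, 0xc0109000UL },
    { 0xc0101010UL, 0xc0109000UL },
    { 0xc0102abcUL, 0xc0109040UL },
};

/* Returns the fixup address registered for addr, or 0 if addr is not listed. */
static unsigned long toy_search_exception_table(unsigned long addr)
{
    size_t i;
    size_t n = sizeof(toy_ex_table) / sizeof(toy_ex_table[0]);

    for (i = 0; i < n; i++)
        if (toy_ex_table[i].insn == addr)
            return toy_ex_table[i].fixup;
    return 0;
}

/*
 * Models the handler's decision: if the saved instruction pointer is
 * listed, overwrite it with the fixup address and report 1; otherwise
 * leave it alone and report 0 (the fault is a genuine kernel bug).
 */
static int toy_fixup_fault(unsigned long *saved_eip)
{
    unsigned long fixup = toy_search_exception_table(*saved_eip);

    if (fixup != 0) {
        *saved_eip = fixup;
        return 1;
    }
    return 0;
}
```

The point of the design is that no check is made on the fast path: the table costs nothing until a fault actually occurs, and only then is the faulting address looked up.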


The .previous directive forces the assembler to insert the code that follows into the section that was active when the last .section directive was encountered.

Let us consider again the __get_user_1( ), __get_user_2( ), and __get_user_4( ) functions mentioned before:

    __get_user_1:
        [...]
    1:  movzbl (%eax), %edx
        [...]
    __get_user_2:
        [...]
    2:  movzwl -1(%eax), %edx
        [...]
    __get_user_4:
        [...]
    3:  movl -3(%eax), %edx
        [...]
    bad_get_user:
        xorl %edx, %edx
        movl $-EFAULT, %eax
        ret
    .section __ex_table,"a"
        .long 1b, bad_get_user
        .long 2b, bad_get_user
        .long 3b, bad_get_user
    .previous

The instructions that access the process address space are those labeled as 1, 2, and 3. The fixup code is common to the three functions and is labeled as bad_get_user. Each exception table entry consists simply of two labels. The first one is a numeric label with a b suffix to indicate that the label is a "backward" one: in other words, it appears in a previous line of the program. The fixup code at bad_get_user returns an EFAULT error code to the process that issued the system call.

Let us consider a second example, the strlen_user(string) macro. This returns the length of a null-terminated string in the process address space or the value 0 on error. The macro essentially yields the following assembly language instructions:

        movl $0, %eax
        movl $0x7fffffff, %ecx
        movl %ecx, %edx
        movl string, %edi
    0:  repne; scasb
        subl %ecx, %edx
        movl %edx, %eax
    1:
    .section .fixup,"ax"
    2:  movl $0, %eax
        jmp 1b
    .previous
    .section __ex_table,"a"
        .long 0b, 2b
    .previous


The ecx and edx registers are initialized with the 0x7fffffff value, which represents the maximum allowed length for the string. The repne; scasb assembly language instructions iteratively scan the string pointed to by the edi register, looking for the value 0 (the end of string \0 character) in eax. Since the ecx register is decremented at each iteration, the eax register will ultimately store the total number of bytes scanned in the string; that is, the length of the string.

The fixup code of the macro is inserted into the .fixup section. The "ax" attributes specify that the section must be loaded in memory and that it contains executable code. If a page fault exception is generated by the instructions at label 0, the fixup code is executed: it simply loads the value 0 in eax, thus forcing the macro to return an error code instead of the string length, then jumps to the 1 label, which corresponds to the instruction following the macro.

8.3 Wrapper Routines

Although system calls are mainly used by User Mode processes, they can also be invoked by kernel threads, which cannot make use of library functions. In order to simplify the declarations of the corresponding wrapper routines, Linux defines a set of six macros called _syscall0 through _syscall5.

The numbers 0 through 5 in the name of each macro correspond to the number of parameters used by the system call (excluding the system call number). The macros may also be used to simplify the declarations of the wrapper routines in the libc standard library; however, they cannot be used to define wrapper routines for system calls having more than five parameters (excluding the system call number) or for system calls that yield nonstandard return values.

Each macro requires exactly 2 + 2 × n parameters, with n being the number of parameters of the system call. The first two parameters specify the return type and the name of the system call; each additional pair of parameters specifies the type and the name of the corresponding system call parameter. Thus, for instance, the wrapper routine of the fork( ) system call may be generated by:

    _syscall0(int,fork)

while the wrapper routine of the write( ) system call may be generated by:

    _syscall3(int,write,int,fd,const char *,buf,unsigned int,count)

In the latter case, the macro yields the following code:


    int write(int fd,const char * buf,unsigned int count)
    {
        long __res;
        asm("int $0x80"
            : "=a" (__res)
            : "0" (__NR_write), "b" ((long)fd),
              "c" ((long)buf), "d" ((long)count));
        if ((unsigned long)__res >= (unsigned long)-125) {
            errno = -__res;
            __res = -1;
        }
        return (int) __res;
    }

The __NR_write macro is derived from the second parameter of _syscall3; it expands into the system call number of write( ). When compiling the preceding function, the following assembly language code is produced:

    write:
        pushl %ebx            ; push ebx into stack
        movl 8(%esp), %ebx    ; put first parameter in ebx
        movl 12(%esp), %ecx   ; put second parameter in ecx
        movl 16(%esp), %edx   ; put third parameter in edx
        movl $4, %eax         ; put __NR_write in eax
        int $0x80             ; invoke system call
        cmpl $-126, %eax      ; check return code
        jbe .L1               ; if no error, jump
        negl %eax             ; complement the value of eax
        movl %eax, errno      ; put result in errno
        movl $-1, %eax        ; set eax to -1
    .L1: popl %ebx            ; pop ebx from stack
        ret                   ; return to calling program

Notice how the parameters of the write( ) function are loaded into the CPU registers before the int $0x80 instruction is executed. The value returned in eax must be interpreted as an error code if it lies between -1 and -125 (the kernel assumes that the largest error code defined in include/asm-i386/errno.h is 125). If this is the case, the wrapper routine will store the value of -eax in errno and return the value -1; otherwise, it will return the value of eax.

8.4 Anticipating Linux 2.4

Besides adding a few new system calls, Linux 2.4 does not introduce any change to the system call mechanism of Linux 2.2.
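The return-value convention applied by the wrapper routines of Section 8.3 can be isolated in a small C function and tried in user space. This is a sketch, not libc code: toy_syscall_return( ) and its toy_errno parameter are hypothetical names, and the bound of 125 is simply the one quoted in the text for Linux 2.2 on the Intel 80x86 architecture.

```c
/* Largest error code assumed by the text (include/asm-i386/errno.h). */
#define TOY_MAX_ERRNO 125

/*
 * Converts the raw value left in eax by the kernel into the value a
 * wrapper routine returns. Raw values from -1 down to -TOY_MAX_ERRNO
 * are errors: the positive code is stored through toy_errno and -1 is
 * returned. Any other value is passed through unchanged.
 */
static long toy_syscall_return(long raw, int *toy_errno)
{
    /* One unsigned comparison covers the whole range -125..-1. */
    if ((unsigned long)raw >= (unsigned long)-TOY_MAX_ERRNO) {
        *toy_errno = (int)-raw;
        return -1;
    }
    return raw;
}
```

Casting to unsigned long maps the negative error codes onto the very top of the unsigned range, so a single comparison replaces a two-sided test. This corresponds to the cmpl $-126, %eax and jbe pair in the assembly language listing.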


Chapter 9. Signals

Signals were introduced by the first Unix systems to simplify interprocess communication. The kernel also uses them to notify processes of system events. In contrast to interrupts and exceptions, most signals are visible to User Mode processes.

Signals have been around for 30 years with only minor changes. Due to their relative simplicity and efficiency, they continue to be widely used, although as we shall see in Chapter 18, other higher-level tools have been introduced for the same purpose.

The first sections of this chapter examine in detail how signals are handled by the Linux kernel, then we discuss the system calls that allow processes to exchange signals.

9.1 The Role of Signals

A signal is a very short message that may be sent to a process or to a group of processes. The only information given to the process is usually the number identifying the signal; there is no room in standard signals for arguments, a message, or other accompanying information.

A set of macros whose names start with the prefix SIG is used to identify signals; we have already made a few references to them in previous chapters. For instance, the SIGCHLD macro has been mentioned in Section 3.3.1 in Chapter 3. This macro, which expands into the value 17 in Linux, yields the identifier of the signal that is sent to a parent process when some child stops or terminates. The SIGSEGV macro, which expands into the value 11, has been mentioned in Section 7.4 in Chapter 7: it yields the identifier of the signal that is sent to a process when it makes an invalid memory reference.

Signals serve two main purposes:

• To make a process aware that a specific event has occurred
• To force a process to execute a signal handler function included in its code

Of course, the two purposes are not mutually exclusive, since often a process must react to some event by executing a specific routine.

Table 9-1 lists the first 31 signals handled by Linux 2.2 for the Intel 80x86 architecture (some signal numbers such as SIGCHLD or SIGSTOP are architecture-dependent; furthermore, some signals are defined only for specific architectures). Besides the signals described in this table, the POSIX standard has introduced a new class of signals called "real-time." They will be discussed separately in Section 9.4 later in this chapter.


Understanding the Linux Kernel

Table 9-1. The First 31 Signals in Linux/i386

#   Signal Name  Default Action  Comment                                     POSIX
1   SIGHUP       Abort           Hangup of controlling terminal or process   Yes
2   SIGINT       Abort           Interrupt from keyboard                     Yes
3   SIGQUIT      Dump            Quit from keyboard                          Yes
4   SIGILL       Dump            Illegal instruction                         Yes
5   SIGTRAP      Dump            Breakpoint for debugging                    No
6   SIGABRT      Dump            Abnormal termination                        Yes
6   SIGIOT       Dump            Equivalent to SIGABRT                       No
7   SIGBUS       Abort           Bus error                                   No
8   SIGFPE       Dump            Floating point exception                    Yes
9   SIGKILL      Abort           Forced process termination                  Yes
10  SIGUSR1      Abort           Available to processes                      Yes
11  SIGSEGV      Dump            Invalid memory reference                    Yes
12  SIGUSR2      Abort           Available to processes                      Yes
13  SIGPIPE      Abort           Write to pipe with no readers               Yes
14  SIGALRM      Abort           Real timer clock                            Yes
15  SIGTERM      Abort           Process termination                         Yes
16  SIGSTKFLT    Abort           Coprocessor stack error                     No
17  SIGCHLD      Ignore          Child process stopped or terminated         Yes
18  SIGCONT      Continue        Resume execution, if stopped                Yes
19  SIGSTOP      Stop            Stop process execution                      Yes
20  SIGTSTP      Stop            Stop process issued from tty                Yes
21  SIGTTIN      Stop            Background process requires input           Yes
22  SIGTTOU      Stop            Background process requires output          Yes
23  SIGURG       Ignore          Urgent condition on socket                  No
24  SIGXCPU      Abort           CPU time limit exceeded                     No
25  SIGXFSZ      Abort           File size limit exceeded                    No
26  SIGVTALRM    Abort           Virtual timer clock                         No
27  SIGPROF      Abort           Profile timer clock                         No
28  SIGWINCH     Ignore          Window resizing                             No
29  SIGIO        Abort           I/O now possible                            No
29  SIGPOLL      Abort           Equivalent to SIGIO                         No
30  SIGPWR       Abort           Power supply failure                        No
31  SIGUNUSED    Abort           Not used                                    No

A number of system calls allow programmers to send signals and determine how their processes exploit the signals they receive. Table 9-2 describes these calls succinctly; their behavior is described in detail later in Section 9.5.
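As a rough user-space illustration of two of the calls summarized in Table 9-2, the sketch below installs a handler with sigaction( ) and then sends a signal to the calling process with kill( ). The names demo_handler, demo_last_signal, and demo_send_to_self are hypothetical and exist only for this example; the sketch assumes a single-threaded POSIX process, in which the signal is delivered before kill( ) returns.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Written by the handler; sig_atomic_t can be stored safely from a handler. */
static volatile sig_atomic_t demo_last_signal = 0;

/* Hypothetical handler: records which signal was received. */
static void demo_handler(int signo)
{
    demo_last_signal = signo;
}

/*
 * Installs demo_handler for signo, sends signo to the calling process,
 * and returns the signal number recorded by the handler (-1 on error).
 */
int demo_send_to_self(int signo)
{
    struct sigaction act;

    assert(signo > 0);
    memset(&act, 0, sizeof(act));
    act.sa_handler = demo_handler;
    sigemptyset(&act.sa_mask);
    if (sigaction(signo, &act, NULL) != 0)
        return -1;

    demo_last_signal = 0;
    if (kill(getpid(), signo) != 0)     /* sending phase */
        return -1;

    /* Single-threaded and unblocked: the handler has already run here. */
    return demo_last_signal;
}
```

Calling demo_send_to_self(SIGUSR1) exercises both phases discussed in this chapter: kill( ) performs the sending, and the handler invocation is the receiving.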


Table 9-2. System Calls Related to Signals

System Call          Description
kill( )              Send a signal to a process.
sigaction( )         Change the action associated with a signal.
signal( )            Similar to sigaction( ).
sigpending( )        Check whether there are pending signals.
sigprocmask( )       Modify the set of blocked signals.
sigsuspend( )        Wait for a signal.
rt_sigaction( )      Change the action associated with a real-time signal.
rt_sigpending( )     Check whether there are pending real-time signals.
rt_sigprocmask( )    Modify the set of blocked real-time signals.
rt_sigqueueinfo( )   Send a real-time signal to a process.
rt_sigsuspend( )     Wait for a real-time signal.
rt_sigtimedwait( )   Similar to rt_sigsuspend( ).

An important characteristic of signals is that they may be sent at any time to processes whose state is usually unpredictable. Signals sent to a nonrunning process must be saved by the kernel until that process resumes execution. Blocking signals (described later) require signals to be queued, which exacerbates the problem of signals being raised before they can be delivered.

Therefore, the kernel distinguishes two different phases related to signal transmission:

Signal sending
    The kernel updates the descriptor of the destination process to represent that a new signal has been sent.

Signal receiving
    The kernel forces the destination process to react to the signal by changing its execution state or by starting the execution of a specified signal handler or both.

Each signal sent can be received no more than once. Signals are consumable resources: once they have been received, all process descriptor information that refers to their previous existence is canceled.

Signals that have been sent but not yet received are called pending signals. At any time, only one pending signal of a given type may exist for a process; additional pending signals of the same type to the same process are not queued but simply discarded. In general, a signal may remain pending for an unpredictable amount of time. Indeed, the following factors must be taken into consideration:

• Signals are usually received only by the currently running process (that is, by the current process).
• Signals of a given type may be selectively blocked by a process (see Section 9.5.4): in this case, the process will not receive the signal until it removes the block.
• When a process executes a signal-handler function, it usually "masks" the corresponding signal, that is, it automatically blocks the signal until the handler terminates. A signal handler thus cannot be interrupted by another occurrence of the handled signal, so the function doesn't need to be reentrant. A masked signal is always blocked, but the converse does not hold.

Although the notion of signals is intuitive, the kernel implementation is rather complex. The kernel must:

• Remember which signals are blocked by each process.
• When switching from Kernel Mode to User Mode, check whether a signal for any process has arrived. This happens at almost every timer interrupt, that is, roughly every 10 ms.
• Determine whether the signal can be ignored. This happens when all of the following conditions are fulfilled:
    o The destination process is not traced by another process (the PF_TRACED flag in the process descriptor flags field is equal to 0). [1]
    o The signal is not blocked by the destination process.
    o The signal is being ignored by the destination process (either because the process has explicitly ignored it or because the process did not change the default action of the signal and that action is "ignore").
• Handle the signal, which may require switching the process to a handler function at any point during its execution and restoring the original execution context after the function returns.

[1] If a process receives a signal while it is being traced, the kernel stops the process and notifies the tracing process by sending a SIGCHLD signal to it. The tracing process may, in turn, resume execution of the traced process by means of a SIGCONT signal.

Moreover, Linux must take into account the different semantics for signals adopted by BSD and System V; furthermore, it must comply with the rather cumbersome POSIX requirements.

9.1.1 Actions Performed upon Receiving a Signal

There are three ways in which a process can respond to a signal:

• Explicitly ignore the signal.
• Execute the default action associated with the signal (see Table 9-1). This action, which is predefined by the kernel, depends on the signal type and may be any one of the following:

    Abort
        The process is destroyed (killed).

    Dump
        The process is destroyed (killed) and a core file containing its execution context is created, if possible; this file may be used for debug purposes.


    Ignore
        The signal is ignored.

    Stop
        The process is stopped, that is, put in a TASK_STOPPED state (see Section 3.1.1 in Chapter 3).

    Continue
        If the process is stopped (TASK_STOPPED), it is put into the TASK_RUNNING state.

• Catch the signal by invoking a corresponding signal-handler function.

Notice that blocking a signal is different from ignoring it: a signal is never received while it is blocked; an ignored signal is always received, and there is simply no further action.

The SIGKILL and SIGSTOP signals cannot be explicitly ignored or caught, and thus their default actions must always be executed. Therefore, SIGKILL and SIGSTOP allow a user with appropriate privileges to destroy and to stop, respectively, any process [2] regardless of the defenses taken by the program it is executing.

[2] Actually, there are two exceptions: all signals sent to process 0 (swapper) are discarded, while those sent to process 1 (init) are always discarded unless they are caught. Therefore, process 0 never dies, while process 1 dies only when the init program terminates.

9.1.2 Data Structures Associated with Signals

The basic data structure used to store the signals sent to a process is a sigset_t array of bits, one for each signal type:

    typedef struct {
        unsigned long sig[2];
    } sigset_t;

Since each unsigned long number consists of 32 bits, the maximum number of signals that may be declared in Linux is 64 (the _NSIG macro denotes this value). No signal has the number 0, so signal number n corresponds to bit n - 1: the first 31 bits in the first element of sigset_t are the standard ones listed in Table 9-1. The bits in the second element are the real-time signals. The following fields are included in the process descriptor to keep track of the signals sent to the process:

signal
    A sigset_t variable that denotes the signals sent to the process

blocked
    A sigset_t variable that denotes the blocked signals

sigpending
    A flag set if one or more nonblocked signals are pending

sig
    A pointer to a signal_struct data structure that describes how each signal must be handled

The signal_struct structure, in turn, is defined as follows:

    struct signal_struct {
        atomic_t           count;
        struct k_sigaction action[64];
        spinlock_t         siglock;
    };

As mentioned in Section 3.3.1 in Chapter 3, this structure may be shared by several processes by invoking the clone( ) system call with the CLONE_SIGHAND flag set. [3] The count field specifies the number of processes that share the signal_struct structure, while the siglock field is used to ensure exclusive access to its fields (see Chapter 11). The action field is an array of 64 k_sigaction structures that specify how each signal must be handled.

[3] If this is not done, about 1300 bytes are added to the process data structures just to take care of signal handling.

Some architectures assign properties to a signal that are visible only to the kernel. Thus, the properties of a signal are stored in a k_sigaction structure, which contains both the properties hidden from the User Mode process and the more familiar sigaction structure that holds all the properties a User Mode process can see. Actually, on the Intel platform all signal properties are visible to User Mode processes. So the k_sigaction structure simply reduces to a single sa structure of type sigaction, which includes the following fields:

sa_handler
    This field specifies the type of action to be performed; its value can be a pointer to the signal handler, SIG_DFL (that is, the value 0) to specify that the default action must be executed, or SIG_IGN (that is, the value 1) to specify that the signal must be explicitly ignored.

sa_flags
    This set of flags specifies how the signal must be handled; some of them are listed in Table 9-3.

sa_mask
    This sigset_t variable specifies the signals to be masked when running the signal handler.
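The three sigaction fields just described are the same ones a User Mode program fills in when it calls sigaction( ). The following is a minimal sketch, not kernel code: it sets sa_handler, sa_flags, and sa_mask, uses the SA_RESETHAND flag to request a one-shot handler, and then queries the action to observe that it has been reset to SIG_DFL. The names demo_on_signal, demo_calls, and demo_one_shot are hypothetical, and a single-threaded POSIX process is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Counts handler invocations. */
static volatile sig_atomic_t demo_calls = 0;

/* Hypothetical handler used only by this sketch. */
static void demo_on_signal(int signo)
{
    (void)signo;
    demo_calls++;
}

/*
 * Installs a one-shot handler for signo, raises the signal once, and
 * returns 1 if the handler ran exactly once and the action was then
 * reset to SIG_DFL, 0 if not, and -1 on error.
 */
int demo_one_shot(int signo)
{
    struct sigaction act, now;

    /* SIGKILL and SIGSTOP cannot be caught. */
    assert(signo != SIGKILL && signo != SIGSTOP);

    memset(&act, 0, sizeof(act));
    act.sa_handler = demo_on_signal;    /* sa_handler: action to perform   */
    act.sa_flags = SA_RESETHAND;        /* sa_flags: reset after first use */
    sigemptyset(&act.sa_mask);          /* sa_mask: signals to mask ...    */
    sigaddset(&act.sa_mask, SIGUSR2);   /* ... while the handler runs      */
    if (sigaction(signo, &act, NULL) != 0)
        return -1;

    demo_calls = 0;
    if (raise(signo) != 0)              /* returns after the handler does  */
        return -1;

    /* Query the current action without changing it. */
    if (sigaction(signo, NULL, &now) != 0)
        return -1;
    return demo_calls == 1 && now.sa_handler == SIG_DFL;
}
```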


Table 9-3. Flags Specifying How to Handle a Signal

Flag Name                   Description
SA_NOCLDSTOP                Do not send SIGCHLD to the parent when the process is stopped.
SA_NODEFER, SA_NOMASK       Do not mask the signal while executing the signal handler.
SA_RESETHAND, SA_ONESHOT    Reset to default action after executing the signal handler.
SA_ONSTACK                  Use an alternate stack for the signal handler (see Section 9.3.3).
SA_RESTART                  Interrupted system calls are automatically restarted (see Section 9.3.4).
SA_SIGINFO                  Provide additional information to the signal handler (see Section 9.5.2).

9.1.3 Operations on Signal Data Structures

Several functions and macros are used by the kernel to handle signals. In the following description, set is a pointer to a sigset_t variable, nsig is the number of a signal, and mask is an unsigned long bit mask.

sigaddset(set,nsig) and sigdelset(set,nsig)
    Sets the bit of the sigset_t variable corresponding to signal nsig to 1 or 0, respectively. In practice, sigaddset( ) reduces to:

        set->sig[(nsig - 1) / 32] |= 1UL << ((nsig - 1) % 32);

    and sigdelset( ) to:

        set->sig[(nsig - 1) / 32] &= ~(1UL << ((nsig - 1) % 32));

sigaddsetmask(set,mask) and sigdelsetmask(set,mask)
    Sets all the bits of the sigset_t variable whose corresponding bits of mask are on to 1 or 0, respectively. The corresponding functions reduce to:

        set->sig[0] |= mask;

    and to:

        set->sig[0] &= ~mask;

sigismember(set,nsig)
    Returns the value of the bit of the sigset_t variable corresponding to the signal nsig. In practice, this function reduces to:

        1 & (set->sig[(nsig - 1) / 32] >> ((nsig - 1) % 32))

sigmask(nsig)
    Yields a bit mask in which only the bit corresponding to the signal nsig is set. In other words, if the kernel needs to set, clear, or test a bit in an element of sigset_t that corresponds to a particular signal, it can derive the proper bit through this macro.
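The bit arithmetic behind these set operations can be tried out in user space. The sketch below is an illustration under stated assumptions, not the kernel's code: demo_sigset_t and the demo_ functions are hypothetical names, and uint32_t words stand in for the 32-bit unsigned long of the i386 so that the example behaves the same on 64-bit machines.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSIG 64    /* mirrors the 64-signal limit given in the text */

/* Two 32-bit words, standing in for "unsigned long sig[2]" on i386. */
typedef struct {
    uint32_t sig[2];
} demo_sigset_t;

/* Mask with only the bit of signal nsig set (first word only: 1..32). */
uint32_t demo_sigmask(int nsig)
{
    assert(nsig >= 1 && nsig <= 32);
    return UINT32_C(1) << (nsig - 1);
}

/* Signal nsig lives in word (nsig - 1) / 32, at bit (nsig - 1) % 32. */
void demo_sigaddset(demo_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= DEMO_NSIG);
    set->sig[(nsig - 1) / 32] |= UINT32_C(1) << ((nsig - 1) % 32);
}

void demo_sigdelset(demo_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= DEMO_NSIG);
    set->sig[(nsig - 1) / 32] &= ~(UINT32_C(1) << ((nsig - 1) % 32));
}

/* Returns 1 if signal nsig is in the set, 0 otherwise. */
int demo_sigismember(const demo_sigset_t *set, int nsig)
{
    assert(nsig >= 1 && nsig <= DEMO_NSIG);
    return 1 & (set->sig[(nsig - 1) / 32] >> ((nsig - 1) % 32));
}
```

For instance, adding signal 9 sets bit 8 of the first word, while adding signal 33 sets bit 0 of the second word, which is where the real-time signals are kept.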


signal_pending(p)
    Returns the value 1 (true) if the process identified by the *p process descriptor has nonblocked pending signals and the value 0 (false) if it doesn't. The function is implemented as a simple check on the sigpending field of the process descriptor.

recalc_sigpending(t)
    Checks whether the process identified by the process descriptor at *t has nonblocked pending signals, by looking at the signal and blocked fields of the process, then sets the sigpending field properly as follows:

        ready  = t->signal.sig[1] &~ t->blocked.sig[1];
        ready |= t->signal.sig[0] &~ t->blocked.sig[0];
        t->sigpending = (ready != 0);

sigandsets(d,s1,s2), sigorsets(d,s1,s2), and signandsets(d,s1,s2)
    Performs a logical AND, a logical OR, and a logical NAND (computed as s1 AND NOT s2), respectively, between the sigset_t variables to which s1 and s2 point; the result is stored in the sigset_t variable to which d points.

dequeue_signal(mask, info)
    Checks whether the current process has nonblocked pending signals. If so, returns the lowest-numbered pending signal and updates the data structures to indicate it is no longer pending. This task involves clearing the corresponding bit in current->signal, updating the value of current->sigpending, and storing the signal number of the dequeued signal into the *info table. In the mask parameter each bit that is set represents a blocked signal:

        sig = 0;
        /* x holds the pending signals that are not blocked */
        if ((x = current->signal.sig[0] & ~mask->sig[0]) != 0)
            sig = 1 + ffz(~x);
        else if ((x = current->signal.sig[1] & ~mask->sig[1]) != 0)
            sig = 33 + ffz(~x);
        if (sig) {
            sigdelset(&current->signal, sig);
            recalc_sigpending(current);
        }
        return sig;

    The collection of currently pending signals is ANDed with the nonblocked signals (the complement of mask). If anything is left, it represents a signal that should be delivered to the process. The ffz( ) function returns the index of the first zero bit in its parameter; applied to the complement of x, it yields the index of the lowest bit set in x, and this value is used to compute the lowest-numbered signal to be delivered.
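The selection rule used by dequeue_signal( ), namely "lowest-numbered signal that is pending and not blocked," can be sketched in user space as follows. This is a simplified model under stated assumptions rather than the kernel routine: demo_sigset_t, demo_lowest_bit, and demo_dequeue are hypothetical names, uint32_t words replace the 32-bit unsigned long of the i386, and a plain loop takes the place of ffz( ).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Two 32-bit words, standing in for "unsigned long sig[2]" on i386. */
typedef struct {
    uint32_t sig[2];
} demo_sigset_t;

/* Index of the lowest set bit of x; x must be nonzero. */
static int demo_lowest_bit(uint32_t x)
{
    int i = 0;

    while (((x >> i) & 1u) == 0)
        i++;
    return i;
}

/*
 * Returns the lowest-numbered signal that is pending and not blocked,
 * clearing its pending bit; returns 0 if no such signal exists.
 */
int demo_dequeue(demo_sigset_t *pending, const demo_sigset_t *blocked)
{
    int word, sig = 0;

    assert(pending != NULL && blocked != NULL);
    for (word = 0; word < 2 && sig == 0; word++) {
        /* Pending signals of this word that are not blocked. */
        uint32_t ready = pending->sig[word] & ~blocked->sig[word];

        if (ready != 0)
            sig = 32 * word + 1 + demo_lowest_bit(ready);
    }
    if (sig != 0)   /* the signal is consumed: it is no longer pending */
        pending->sig[(sig - 1) / 32] &= ~(UINT32_C(1) << ((sig - 1) % 32));
    return sig;
}
```

With signals 2, 15, and 34 pending and signal 2 blocked, successive calls return 15, then 34, then 0, and signal 2 stays pending until it is unblocked.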


flush_signals(t)
    Deletes all signals sent to the process identified by the process descriptor at *t. This is done by clearing both the t->sigpending and the t->signal fields and by emptying the real-time queue of signals (see Section 9.4).

9.2 Sending a Signal

When a signal is sent to a process, either from the kernel or from another process, the kernel delivers it by invoking the send_sig_info( ), send_sig( ), force_sig( ), or force_sig_info( ) functions. These accomplish the first phase of signal handling described earlier in Section 9.1: updating the process descriptor as needed. They do not directly perform the second phase of receiving the signal but, depending on the type of signal and the state of the process, may wake up the process and force it to receive the signal.

9.2.1 The send_sig_info( ) and send_sig( ) Functions

The send_sig_info( ) function acts on three parameters:

sig
    The signal number.

info
    Either the address of a siginfo_t table associated with real-time signals or one of two special values: 0 means that the signal has been sent by a User Mode process, while 1 means that it has been sent by the kernel. The siginfo_t data structure has information that must be passed to the process receiving the real-time signal, such as the PID of the sender process and the UID of its owner.

t
    A pointer to the descriptor of the destination process.

The send_sig_info( ) function starts by checking whether the parameters are correct:

    if (sig < 0 || sig > 64)
        return -EINVAL;

The function then checks whether the signal is being sent by a User Mode process. This occurs when info is equal to 0 or when the si_code field of the siginfo_t table is negative or zero (the positive values of this field are reserved to identify the kernel function that sent the signal):


    if ((!info || ((unsigned long)info != 1 && (info->si_code <= 0)))
        && ((sig != SIGCONT) || (current->session != t->session))
        && (current->euid ^ t->suid)
        && (current->euid ^ t->uid)
        && (current->uid ^ t->suid)
        && (current->uid ^ t->uid)
        && !capable(CAP_KILL))
        return -EPERM;

If the signal is sent by a User Mode process, the function determines whether the operation is allowed. The signal is delivered only if the owner of the sending process has the proper capability (see Chapter 19), the signal is SIGCONT and the destination process is in the same login session as the sending process, or both processes belong to the same user.

If the sig parameter has the value 0, the function returns immediately without sending any signal: since 0 is not a valid signal number, it is used to allow the sending process to check whether it has the required privileges to send a signal to the destination process. The function also returns if the destination process is in the TASK_ZOMBIE state, indicated by checking whether its signal_struct data structure (referenced by the sig field) has been released:

    if (!sig || !t->sig)
        return 0;

Some types of signals might nullify other pending signals for the destination process. Therefore, the function checks whether one of the following cases occurs:

• sig is a SIGKILL or SIGCONT signal. If the destination process is stopped, it is put in the TASK_RUNNING state so that it will be able to execute the do_exit( ) function; moreover, if the destination process has SIGSTOP, SIGTSTP, SIGTTOU, or SIGTTIN pending signals, they are removed:

    if (t->state == TASK_STOPPED)
        wake_up_process(t);
    t->exit_code = 0;
    sigdelsetmask(&t->signal, (sigmask(SIGSTOP) |
                               sigmask(SIGTSTP) | sigmask(SIGTTOU) |
                               sigmask(SIGTTIN)));
    recalc_sigpending(t);

• sig is a SIGSTOP, SIGTSTP, SIGTTIN, or SIGTTOU signal. If the destination process has a pending SIGCONT signal, the pending SIGCONT is discarded:

    sigdelset(&t->signal, SIGCONT);
    recalc_sigpending(t);

Next, send_sig_info( ) checks whether the new signal can be handled immediately. In this case, the function also takes care of the receiving phase of the signal:

    if (ignored_signal(sig, t)) {
    out:
        if (t->state == TASK_INTERRUPTIBLE && signal_pending(t))
            wake_up_process(t);
        return 0;
    }


The ignored_signal( ) function returns the value 1 when all three conditions for ignoring a signal mentioned in Section 9.1 are satisfied. However, in order to fulfill a POSIX requirement, the SIGCHLD signal is handled specially. POSIX distinguishes between explicitly setting the "ignore" action for the SIGCHLD signal and leaving the default in place (even if the default is to ignore the signal). In order to let the kernel clean up a terminated child process and prevent it from becoming a zombie (see Section 3.4.2 in Chapter 3), the parent must explicitly set the action to "ignore" the signal. So ignored_signal( ) handles this case as follows: if the signal is explicitly ignored, ignored_signal( ) returns 0, but if the default action was "ignore" and the process didn't change that default, ignored_signal( ) returns 1.

If ignored_signal( ) returns 1, the siginfo_t table of the destination process must not be updated; however, if the process is in the TASK_INTERRUPTIBLE state and if it has other nonblocked pending signals, send_sig_info( ) invokes the wake_up_process( ) function to wake it up.

If ignored_signal( ) returns 0, the phase of signal receiving has to be deferred; therefore, send_sig_info( ) may have to modify the data structures of the destination process to let it know later that a new signal has been sent to it. Since standard signals are not queued, send_sig_info( ) must check whether another instance of the same signal is already pending, then leave its mark on the proper data structures of the process descriptor:

    if (sigismember(&t->signal, sig))
        goto out;
    sigaddset(&t->signal, sig);
    if (!sigismember(&t->blocked, sig))
        t->sigpending = 1;
    goto out;

The sigaddset( ) function is invoked to set the proper bit in t->signal. The t->sigpending flag is also set, unless the destination process has blocked the sig signal. The function terminates in the usual way by waking up, if necessary, the destination process. In Section 9.3, we'll discuss the actions performed by the process.

The send_sig( ) function is similar to send_sig_info( ). However, the info parameter is replaced by a priv flag, which is true if the signal is sent by the kernel and false if it is sent by a process. The send_sig( ) function is implemented as a special case of send_sig_info( ):

    return send_sig_info(sig, (void*)(priv != 0), t);

9.2.2 The force_sig_info( ) and force_sig( ) Functions

The force_sig_info( ) function is used by the kernel to send signals that cannot be explicitly ignored or blocked by the destination process. The function's parameters are the same as those of send_sig_info( ). The force_sig_info( ) function acts on the signal_struct data structure that is referenced by the sig field included in the descriptor t of the destination process:


    if (t->sig->action[sig-1].sa.sa_handler == SIG_IGN)
        t->sig->action[sig-1].sa.sa_handler = SIG_DFL;
    sigdelset(&t->blocked, sig);
    return send_sig_info(sig, info, t);

force_sig( ) is similar to force_sig_info( ). Its use is limited to signals sent by the kernel; it can be implemented as a special case of the force_sig_info( ) function:

    force_sig_info(sig, (void*)1L, t);

9.3 Receiving a Signal

We assume that the kernel has noticed the arrival of a signal and has invoked one of the functions in the previous section to prepare the process descriptor of the process that is supposed to receive the signal. But in case that process was not running on the CPU at that moment, the kernel deferred the task of waking the process, if necessary, and making it receive the signal. We now turn to the activities that the kernel performs to ensure that pending signals of a process are handled.

As mentioned in Section 4.7.1 in Chapter 4, the kernel checks whether there are nonblocked pending signals before allowing a process to resume its execution in User Mode. This check is performed in ret_from_intr( ) every time an interrupt or an exception has been handled by the kernel routines.

In order to handle the nonblocked pending signals, the kernel invokes the do_signal( ) function, which receives two parameters:

regs
    The address of the stack area where the User Mode register contents of the current process have been saved

oldset
    The address of a variable where the function is supposed to save the bit mask array of blocked signals (actually, this parameter is NULL when invoked from ret_from_intr( ))

The function starts by checking whether the interrupt occurred while the process was running in User Mode; if not, it simply returns:

    if ((regs->xcs & 3) != 3)
        return 1;

However, as we'll see in Section 9.3.4, this does not mean that a system call cannot be interrupted by a signal.

If the oldset parameter is NULL, the function initializes it with the address of the current->blocked field:


    if (!oldset)
        oldset = &current->blocked;

The heart of the do_signal( ) function consists of a loop that repeatedly invokes dequeue_signal( ) until no more nonblocked pending signals are left. The return code of dequeue_signal( ) is stored in the signr local variable: if its value is 0, it means that all pending signals have been handled and do_signal( ) can finish. As long as a nonzero value is returned, a pending signal is waiting to be handled and dequeue_signal( ) is invoked again after do_signal( ) handles the current signal.

If the current receiver process is being monitored by some other process, the do_signal( ) function invokes notify_parent( ) and schedule( ) to make the monitoring process aware of the signal handling.

Then do_signal( ) loads the ka local variable with the address of the k_sigaction data structure of the signal to be handled:

    ka = &current->sig->action[signr-1];

Depending on the contents, three kinds of actions may be performed: ignoring the signal, executing a default action, or executing a signal handler.

9.3.1 Ignoring the Signal

When a received signal is explicitly ignored, the do_signal( ) function normally just continues with a new execution of the loop and therefore considers another pending signal. One exception exists, as described earlier:

    if (ka->sa.sa_handler == SIG_IGN) {
        if (signr == SIGCHLD)
            while (sys_wait4(-1, NULL, WNOHANG, NULL) > 0)
                /* nothing */;
        continue;
    }

If the signal received is SIGCHLD, the sys_wait4( ) service routine of the wait4( ) system call is invoked to force the process to read information about its children, thus cleaning up memory left over by the terminated child processes (see Section 3.4 in Chapter 3).

9.3.2 Executing the Default Action for the Signal

If ka->sa.sa_handler is equal to SIG_DFL, do_signal( ) must perform the default action of the signal. The only exception comes when the receiving process is init, in which case the signal is discarded as described in Section 9.1.1:

    if (current->pid == 1)
        continue;

For other processes, since the default action depends on the type of signal, the function executes a switch statement based on the value of signr.

The signals whose default action is "ignore" are easily handled:


    case SIGCONT: case SIGCHLD: case SIGWINCH:
        continue;

The signals whose default action is "stop" may stop the current process. In order to do this, do_signal( ) sets the state of current to TASK_STOPPED and then invokes the schedule( ) function (see Section 10.2.4.2 in Chapter 10). The do_signal( ) function also sends a SIGCHLD signal to the parent process of current, unless the parent has set the SA_NOCLDSTOP flag of SIGCHLD:

    case SIGTSTP: case SIGTTIN: case SIGTTOU:
        if (is_orphaned_pgrp(current->pgrp))
            continue;
        /* fall through */
    case SIGSTOP:
        current->state = TASK_STOPPED;
        current->exit_code = signr;
        if (!(SA_NOCLDSTOP &
              current->p_pptr->sig->action[SIGCHLD-1].sa.sa_flags))
            notify_parent(current, SIGCHLD);
        schedule( );
        continue;

The difference between SIGSTOP and the other signals is subtle: SIGSTOP always stops the process, while the other signals stop the process only if it is not in an "orphaned process group"; the POSIX standard specifies that a process group is not orphaned as long as there is a process in the group that has a parent in a different process group but in the same session.

The signals whose default action is "dump" may create a core file in the process working directory; this file lists the complete contents of the process's address space and CPU registers. After do_signal( ) creates the core file, it kills the process. The default action of the remaining 18 signals is "abort," which consists of just killing the process:

    exit_code = signr;
    case SIGQUIT: case SIGILL: case SIGTRAP:
    case SIGABRT: case SIGFPE: case SIGSEGV:
        if (current->binfmt
            && current->binfmt->core_dump
            && current->binfmt->core_dump(signr, regs))
            exit_code |= 0x80;
        /* fall through */
    default:
        sigaddset(&current->signal, signr);
        current->flags |= PF_SIGNALED;
        do_exit(exit_code);

The do_exit( ) function receives as its input parameter the signal number ORed with a flag set when a core dump has been performed. That value is used to determine the exit code of the process. The function terminates the current process, and hence never returns (see Chapter 19).

9.3.3 Catching the Signal

If the signal has a specific handler, the do_signal( ) function must enforce its execution. It does this by invoking handle_signal( ):


    handle_signal(signr, ka, &info, oldset, regs);
    return 1;

Notice how do_signal( ) returns after having handled a single signal: other pending signals won't be considered until the next invocation of do_signal( ). This approach ensures that real-time signals will be dealt with in the proper order (see Section 9.4).

Executing a signal handler is a rather complex task because of the need to juggle stacks carefully while switching between User Mode and Kernel Mode. We'll explain exactly what is entailed here.

Signal handlers are functions defined by User Mode processes and included in the User Mode code segment. The handle_signal( ) function runs in Kernel Mode while signal handlers run in User Mode; this means that the current process must first execute the signal handler in User Mode before being allowed to resume its "normal" execution. Moreover, when the kernel attempts to resume the normal execution of the process, the Kernel Mode stack no longer contains the hardware context of the interrupted program because the Kernel Mode stack is emptied at every transition from User Mode to Kernel Mode.

An additional complication is that signal handlers may invoke system calls: in this case, after having executed the service routine, control must be returned to the signal handler instead of to the code of the interrupted program.

The solution adopted in Linux consists of copying the hardware context saved in the Kernel Mode stack onto the User Mode stack of the current process. The User Mode stack is also modified in such a way that, when the signal handler terminates, the sigreturn( ) system call is automatically invoked to copy the hardware context back on the Kernel Mode stack and restore the original content of the User Mode stack.

Figure 9-1 illustrates the flow of execution of the functions involved in catching a signal. A nonblocked signal is sent to a process. When an interrupt or exception occurs, the process switches into Kernel Mode. Right before returning to User Mode, the kernel executes the do_signal( ) function, which in turn handles the signal (by invoking handle_signal( )) and sets up the User Mode stack (by invoking setup_frame( )). When the process switches again to User Mode, it starts executing the signal handler because the handler's starting address was forced into the program counter. When that function terminates, the return code placed on the User Mode stack by the setup_frame( ) function is executed. This code invokes the sigreturn( ) system call, whose service routine copies the hardware context of the normal program in the Kernel Mode stack and restores the User Mode stack back to its original state (by invoking restore_sigcontext( )). When the system call terminates, the normal program can thus resume its execution.
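Seen from User Mode, the whole round trip described above is invisible: the interrupted code simply continues after the handler has run. The following is a minimal sketch of that behavior, assuming a single-threaded POSIX process; the names on_usr1 and catch_once are hypothetical and do not appear in the kernel sources discussed here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler; sig_atomic_t can be stored safely from one. */
static volatile sig_atomic_t caught_signal = 0;

/* Hypothetical handler: runs in User Mode on the frame set up by the kernel. */
static void on_usr1(int sig)
{
    caught_signal = sig;
}

/*
 * Hypothetical helper: installs on_usr1 for SIGUSR1, sends the signal to the
 * calling process, and returns the signal number recorded by the handler
 * (0 if the handler did not run).
 */
int catch_once(void)
{
    struct sigaction sa;
    sigset_t unblock;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    /* Make sure the signal is not blocked, so it is delivered at once. */
    sigemptyset(&unblock);
    sigaddset(&unblock, SIGUSR1);
    rc = sigprocmask(SIG_UNBLOCK, &unblock, NULL);
    assert(rc == 0);

    caught_signal = 0;
    rc = raise(SIGUSR1);
    assert(rc == 0);

    /* Reached only after the handler has returned to the interrupted code. */
    return caught_signal;
}
```

Calling catch_once( ) from a test program should yield SIGUSR1, which shows that control came back to the caller once the handler finished.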


Figure 9-1. Catching a signal

Let us now examine in detail how this scheme is carried out.

9.3.3.1 Setting up the frame

In order to properly set the User Mode stack of the process, the handle_signal( ) function invokes either setup_frame( ) (for signals without siginfo_t table) or setup_rt_frame( ).

The setup_frame( ) function receives four parameters, which have the following meanings:

sig
    Signal number

ka
    Address of the k_sigaction table associated with the signal

oldset
    Address of a bit mask array of blocked signals

regs
    Address in the Kernel Mode stack area where the User Mode register contents have been saved

The function pushes onto the User Mode stack a data structure called a frame, which contains the information needed to handle the signal and to ensure the correct return to the handle_signal( ) function. A frame is a sigframe table that includes the following fields (see Figure 9-2):


Figure 9-2. Frame on the User Mode stack

pretcode
    Return address of the signal handler function; it points to the retcode field (later in this list) in the same table.

sig
    The signal number; this is the parameter required by the signal handler.

sc
    Structure of type sigcontext containing the hardware context of the User Mode process right before switching to Kernel Mode (this information is copied from the Kernel Mode stack of current). It also contains a bit array that specifies the blocked standard signals of the process.

fpstate
    Structure of type _fpstate that may be used to store the floating point registers of the User Mode process (see Section 3.2.4 in Chapter 3).

extramask
    Bit array that specifies the blocked real-time signals.

retcode
    Eight-byte code issuing a sigreturn( ) system call; this code is executed when returning from the signal handler.

The setup_frame( ) function starts by invoking get_sigframe( ) to compute the first memory location of the frame. That memory location is usually [4] in the User Mode stack, thus the function returns the value:

    (regs->esp - sizeof(struct sigframe)) & 0xfffffff8

[4] Linux allows processes to specify an alternate stack for their signal handlers by invoking the sigaltstack( ) system call; this feature is also requested by the X/Open standard. When an alternate stack is present, the get_sigframe( ) function returns an address inside that stack. We don't discuss this feature further, since it is conceptually similar to standard signal handling.


Since stacks grow toward lower addresses, the initial address of the frame is obtained by subtracting its size from the address of the current stack top and aligning the result to a multiple of 8.

The returned address is then verified by means of the access_ok macro; if it is valid, the function repeatedly invokes __put_user( ) to fill all the fields of the frame. Once this is done, it modifies the regs area of the Kernel Mode stack, thus ensuring that control will be transferred to the signal handler when current resumes its execution in User Mode:

    regs->esp = (unsigned long) frame;
    regs->eip = (unsigned long) ka->sa.sa_handler;

The setup_frame( ) function terminates by resetting the segmentation registers saved on the Kernel Mode stack to their default value. Now the information needed by the signal handler is on the top of the User Mode stack.

The setup_rt_frame( ) function is very similar to setup_frame( ), but it puts on the User Mode stack an extended frame (stored in the rt_sigframe data structure) that also includes the content of the siginfo_t table associated with the signal.

9.3.3.2 Evaluating the signal flags

After setting up the User Mode stack, the handle_signal( ) function checks the values of the flags associated with the signal.

If the received signal has the SA_ONESHOT flag set, it must be reset to its default action so that further occurrences of the same signal will not trigger the execution of the signal handler:

    if (ka->sa.sa_flags & SA_ONESHOT)
        ka->sa.sa_handler = SIG_DFL;

Moreover, if the signal does not have the SA_NODEFER flag set, the signals in the sa_mask field of the sigaction table must be blocked during the execution of the signal handler:

    if (!(ka->sa.sa_flags & SA_NODEFER)) {
        sigorsets(&current->blocked,
                  &current->blocked,
                  &ka->sa.sa_mask);
        sigaddset(&current->blocked, sig);
        recalc_sigpending(current);
    }

The function then returns to do_signal( ), which also returns immediately.

9.3.3.3 Starting the signal handler

When do_signal( ) returns, the current process resumes its execution in User Mode. Because of the preparation by setup_frame( ) described earlier, the eip register points to the first instruction of the signal handler, while esp points to the first memory location of the frame that has been pushed on top of the User Mode stack. As a result, the signal handler is executed.
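The effect of the two flags evaluated in Section 9.3.3.2 can be observed from User Mode. The sketch below is only an illustration under stated assumptions: a single-threaded POSIX process in which SIGUSR2 is not blocked beforehand, and the POSIX flag name SA_RESETHAND, which Linux treats as a synonym of SA_ONESHOT. All function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* 1 if the handled signal was in the blocked mask while the handler ran. */
static volatile sig_atomic_t self_blocked = -1;

/* Hypothetical handler: records whether its own signal is currently blocked. */
static void probe_mask(int sig)
{
    sigset_t cur;

    /* With a NULL new set, sigprocmask( ) only reports the current mask. */
    if (sigprocmask(SIG_BLOCK, NULL, &cur) == 0)
        self_blocked = sigismember(&cur, sig);
}

/* Installs probe_mask for sig with the given flags and delivers sig once. */
static void deliver_with_flags(int sig, int flags)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = probe_mask;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = flags;
    rc = sigaction(sig, &sa, NULL);
    assert(rc == 0);

    self_blocked = -1;
    rc = raise(sig);
    assert(rc == 0);
}

/* Without SA_NODEFER, the signal itself is blocked during its handler. */
int handler_blocks_own_signal(void)
{
    deliver_with_flags(SIGUSR2, 0);
    return self_blocked == 1;
}

/* With SA_NODEFER, the signal stays unblocked during its handler. */
int nodefer_leaves_signal_unblocked(void)
{
    deliver_with_flags(SIGUSR2, SA_NODEFER);
    return self_blocked == 0;
}

/* With SA_RESETHAND, the action is back to SIG_DFL after one delivery. */
int oneshot_resets_action(void)
{
    struct sigaction after;
    int rc;

    deliver_with_flags(SIGUSR2, SA_RESETHAND);
    rc = sigaction(SIGUSR2, NULL, &after);
    assert(rc == 0);
    return after.sa_handler == SIG_DFL;
}
```

The mask and reset checks are kept in separate functions because POSIX allows an implementation to treat SA_RESETHAND as if SA_NODEFER were also set, so combining them in one test would not be portable.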


9.3.3.4 Terminating the signal handler

When the signal handler terminates, the return address on top of the stack points to the code in the retcode field of the frame. For signals without siginfo_t table, the code is equivalent to the following assembly language instructions:

    popl %eax
    movl $__NR_sigreturn, %eax
    int $0x80

Therefore, the signal number (that is, the sig field of the frame) is discarded from the stack, and the sigreturn( ) system call is then invoked.

The sys_sigreturn( ) function receives as its parameter the pt_regs data structure regs, which contains the hardware context of the User Mode process (see Section 8.2.3 in Chapter 8). It can thus derive the frame address inside the User Mode stack:

    frame = (struct sigframe *)(regs.esp - 8);

The function reads from the sc field of the frame the bit array of signals that were blocked before invoking the signal handler and writes it in the blocked field of current. As a result, all signals that have been masked for the execution of the signal handler are unblocked. The recalc_sigpending( ) function is then invoked.

The sys_sigreturn( ) function must at this point copy the process hardware context from the sc field of the frame to the Kernel Mode stack; it then removes the frame from the User Mode stack by invoking the restore_sigcontext( ) function.

For signals having a siginfo_t table, the mechanism is very similar. The return code in the retcode field of the extended frame invokes the rt_sigreturn( ) system call; the corresponding sys_rt_sigreturn( ) service routine copies the process hardware context from the extended frame to the Kernel Mode stack and restores the original User Mode stack content by removing the extended frame from it.

9.3.4 Reexecution of System Calls

In some cases, the request associated with a system call cannot be immediately satisfied by the kernel; when this happens the process that issued the system call is put in a TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE state.

If the process is put in a TASK_INTERRUPTIBLE state and some other process sends a signal to it, the kernel puts it in the TASK_RUNNING state without completing the system call (see Section 4.7 in Chapter 4). When this happens, the system call service routine does not complete its job but returns an EINTR, ERESTARTNOHAND, ERESTARTSYS, or ERESTARTNOINTR error code. The process receives the signal while switching back to User Mode.

In practice, the only error code a User Mode process can get in this situation is EINTR, which means that the system call has not been completed. (The application programmer may check this code and decide whether to reissue the system call.) The remaining error codes are used


internally by the kernel to specify whether the system call may be reexecuted automatically after the signal handler termination.

Table 9-4 lists the error codes related to unfinished system calls and their impact for each of the three possible signal actions. The meaning of the terms appearing in the entries is the following:

Terminate
    The system call will not be automatically reexecuted; the process will resume its execution in User Mode at the instruction following the int $0x80 one and the eax register will contain the -EINTR value.

Reexecute
    The kernel forces the User Mode process to reload the eax register with the system call number and to reexecute the int $0x80 instruction; the process is not aware of the reexecution and the error code is not passed to it.

Depends
    The system call is reexecuted only if the SA_RESTART flag of the received signal is set; otherwise, the system call terminates with a -EINTR error code.

Table 9-4. Reexecution of System Calls

    Signal      Error codes and their impact on system call execution
    action      EINTR       ERESTARTSYS   ERESTARTNOHAND   ERESTARTNOINTR
    Default     Terminate   Reexecute     Reexecute        Reexecute
    Ignore      Terminate   Reexecute     Reexecute        Reexecute
    Catch       Terminate   Depends       Terminate        Reexecute

When receiving a signal, the kernel must be sure that the process really issued a system call before attempting to reexecute it. This is where the orig_eax field of the regs hardware context plays a critical role. Let us recall how this field is initialized when the interrupt or exception handler starts:

Interrupt
    The field contains the IRQ number associated with the interrupt minus 256 (see Section 4.6.3 in Chapter 4).

0x80 exception
    The field contains the system call number (see Section 8.2.2 in Chapter 8).

Other exceptions
    The field contains the value -1 (see Section 4.5.1 in Chapter 4).
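Table 9-4, together with the orig_eax rule in the list above, can be modeled as a small decision function. This is a sketch for illustration only, not kernel code: the enumerations and the must_reexecute name are hypothetical stand-ins for the signal action and for the kernel's error codes. The first test relies on the fact that, of the three cases listed, only the 0x80 exception leaves a nonnegative value in orig_eax.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three rows of Table 9-4. */
enum sig_action { ACT_DEFAULT, ACT_IGNORE, ACT_CATCH };

/* Hypothetical labels for the four columns of Table 9-4. */
enum sys_error {
    ERR_EINTR,
    ERR_RESTARTSYS,
    ERR_RESTARTNOHAND,
    ERR_RESTARTNOINTR
};

/*
 * Returns true when the interrupted system call should be reexecuted.
 * orig_eax is the value saved at kernel entry; sa_restart tells whether
 * the SA_RESTART flag is set for the received signal.
 */
bool must_reexecute(long orig_eax, enum sig_action action,
                    enum sys_error err, bool sa_restart)
{
    /* A negative value means the process was not inside a system call. */
    if (orig_eax < 0)
        return false;

    switch (err) {
    case ERR_EINTR:
        return false;                           /* always "Terminate" */
    case ERR_RESTARTNOINTR:
        return true;                            /* always "Reexecute" */
    case ERR_RESTARTNOHAND:
        return action != ACT_CATCH;             /* no restart if caught */
    case ERR_RESTARTSYS:
        return action != ACT_CATCH || sa_restart;   /* "Depends" */
    }
    return false;
}
```

For instance, a caught signal that interrupts a call returning the ERESTARTSYS code leads to reexecution only when sa_restart is true, which is the "Depends" entry of the table.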


Therefore, a nonnegative value in the orig_eax field means that the signal has woken up a TASK_INTERRUPTIBLE process that was sleeping in a system call. The service routine recognizes that the system call was interrupted, and thus returns one of the previously mentioned error codes.

If the signal is explicitly ignored or if its default action has been executed, do_signal( ) analyzes the error code of the system call to decide whether the unfinished system call must be automatically reexecuted, as specified in Table 9-4. If the call must be restarted, the function modifies the regs hardware context so that, when the process is back in User Mode, eip points to the int $0x80 instruction and eax contains the system call number:

    if (regs->orig_eax >= 0) {
        if (regs->eax == -ERESTARTNOHAND ||
            regs->eax == -ERESTARTSYS ||
            regs->eax == -ERESTARTNOINTR) {
            regs->eax = regs->orig_eax;
            regs->eip -= 2;
        }
    }

The regs->eax field has been filled with the return code of a system call service routine (see Section 8.2.2 in Chapter 8).

If the signal has been caught, handle_signal( ) analyzes the error code and, possibly, the SA_RESTART flag of the sigaction table to decide whether the unfinished system call must be reexecuted:

    if (regs->orig_eax >= 0) {
        switch (regs->eax) {
            case -ERESTARTNOHAND:
                regs->eax = -EINTR;
                break;
            case -ERESTARTSYS:
                if (!(ka->sa.sa_flags & SA_RESTART)) {
                    regs->eax = -EINTR;
                    break;
                }
                /* fallthrough */
            case -ERESTARTNOINTR:
                regs->eax = regs->orig_eax;
                regs->eip -= 2;
        }
    }

If the system call must be restarted, handle_signal( ) proceeds exactly as do_signal( ); otherwise, it returns an -EINTR error code to the User Mode process.

9.4 Real-Time Signals

The POSIX standard introduced a new class of signals denoted as real-time signals; the corresponding signal numbers range from 32 to 63. The main difference with respect to standard signals is that real-time signals of the same kind may be queued. This ensures that multiple signals sent will be received. Although the Linux kernel does not make use of


real-time signals, it fully supports the POSIX standard by means of several specific system calls (see Section 9.5.6).

The queue of real-time signals is implemented as a list of signal_queue elements:

    struct signal_queue {
        struct signal_queue *next;
        siginfo_t info;
    };

The info table of type siginfo_t was explained in Section 9.2.1; the next field points to the next element in the list.

Each process descriptor has two specific fields: sigqueue points to the first element of the queue of received real-time signals, while sigqueue_tail points to the next field of the last element of the queue.

When sending a signal, the send_sig_info( ) function checks whether its number is greater than 31; if so, it inserts the signal in the queue of real-time signals for the destination process.

Similarly, when receiving a signal, dequeue_signal( ) checks whether the signal number of the pending signal is greater than 31; if so, it extracts from the queue the element corresponding to the received signal. If the queue does not contain other signals of the same type, the function also clears the corresponding bit in current->signal.

9.5 System Calls Related to Signal Handling

As stated in the introduction of this chapter, programs running in User Mode are allowed to send and receive signals. This means that a set of system calls must be defined to allow these kinds of operations. Unfortunately, due to historical reasons, several noncompatible system calls exist that serve essentially the same purpose. In order to ensure full compatibility with older Unix versions, Linux supports both older system calls and newer ones introduced in the POSIX standard. We shall describe some of the most significant POSIX system calls.

9.5.1 The kill( ) System Call

The kill(pid,sig) system call is commonly used to send signals; its corresponding service routine is the sys_kill( ) function. The integer pid parameter has several meanings, depending on its numerical value:

pid > 0
    The sig signal is sent to the process whose PID is equal to pid.

pid = 0
    The sig signal is sent to all processes in the same group as the calling process.


pid = -1
    The signal is sent to all processes, except swapper (PID 0), init (PID 1), and current.

pid < -1
    The signal is sent to all processes in the process group -pid.

The sys_kill( ) function invokes kill_something_info( ). This in turn invokes either send_sig_info( ), to send the signal to a single process, or kill_pg_info( ), to scan all processes and invoke send_sig_info( ) for each process in the destination group.

System V and BSD Unix variants also have a killpg( ) system call, which is able to explicitly send a signal to a group of processes. In Linux the function is implemented as a library function that makes use of the kill( ) system call.

9.5.2 Changing a Signal Action

The sigaction(sig,act,oact) system call allows users to specify an action for a signal; of course, if no signal action is defined, the kernel executes the default action associated with the received signal.

The corresponding sys_sigaction( ) service routine acts on two parameters: the sig signal number and the act table of type sigaction that specifies the new action. A third oact optional output parameter may be used to get the previous action associated with the signal.

The function checks first whether the act address is valid. Then it fills the sa_handler, sa_flags, and sa_mask fields of a new_ka local variable of type k_sigaction with the corresponding fields of *act:

    __get_user(new_ka.sa.sa_handler, &act->sa_handler);
    __get_user(new_ka.sa.sa_flags, &act->sa_flags);
    __get_user(mask, &act->sa_mask);
    new_ka.sa.sa_mask.sig[0] = mask;
    new_ka.sa.sa_mask.sig[1] = 0;

The function invokes do_sigaction( ) to copy the new new_ka table into the entry at the sig-1 position of current->sig->action:

    k = &current->sig->action[sig-1];
    if (act) {
        *k = *act;
        sigdelsetmask(&k->sa.sa_mask, sigmask(SIGKILL)
                                      | sigmask(SIGSTOP));
        if (k->sa.sa_handler == SIG_IGN
            || (k->sa.sa_handler == SIG_DFL
                && (sig == SIGCONT ||
                    sig == SIGCHLD ||
                    sig == SIGWINCH))) {
            sigdelset(&current->signal, sig);
            recalc_sigpending(current);
        }
    }
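The last branch of the fragment above removes a pending signal when its new action is "ignore." That effect, and the use of the oact output parameter, can be observed from User Mode. The following is a rough sketch, assuming a single-threaded POSIX process with Linux-like behavior; the ignore_discards_pending name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if a blocked, pending SIGUSR1 disappears
 * from the pending set as soon as its action is changed to SIG_IGN.
 */
int ignore_discards_pending(void)
{
    sigset_t block, old_mask, pending;
    struct sigaction ignore, previous;
    int rc, before, after;

    /* Block SIGUSR1 so that raising it leaves it pending. */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    rc = sigprocmask(SIG_BLOCK, &block, &old_mask);
    assert(rc == 0);

    rc = raise(SIGUSR1);
    assert(rc == 0);
    rc = sigpending(&pending);
    assert(rc == 0);
    before = sigismember(&pending, SIGUSR1);

    /* The third argument plays the role of oact: it gets the old action. */
    memset(&ignore, 0, sizeof(ignore));
    ignore.sa_handler = SIG_IGN;
    sigemptyset(&ignore.sa_mask);
    rc = sigaction(SIGUSR1, &ignore, &previous);
    assert(rc == 0);

    rc = sigpending(&pending);
    assert(rc == 0);
    after = sigismember(&pending, SIGUSR1);

    /* Put back the previous action and the previous blocked mask. */
    rc = sigaction(SIGUSR1, &previous, NULL);
    assert(rc == 0);
    rc = sigprocmask(SIG_SETMASK, &old_mask, NULL);
    assert(rc == 0);

    return before == 1 && after == 0;
}
```

Restoring the saved action at the end is the typical use of the old-action output: a library can change a signal action temporarily without losing what the program had installed.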


The POSIX standard requires that setting a signal action either to SIG_IGN, or to SIG_DFL when the default action is "ignore," will cause any pending signal of the same type to be discarded. Moreover, notice that, no matter what the requested masked signals are for the signal handler, SIGKILL and SIGSTOP are never masked.

If the oact parameter is not NULL, the contents of the previous sigaction table are copied to the process address space at the address specified by that parameter:

    if (oact) {
        __put_user(old_ka.sa.sa_handler, &oact->sa_handler);
        __put_user(old_ka.sa.sa_flags, &oact->sa_flags);
        __put_user(old_ka.sa.sa_mask.sig[0], &oact->sa_mask);
    }

For compatibility with BSD Unix variants, Linux provides the signal( ) system call, which is still widely used by programmers. The corresponding sys_signal( ) service routine just invokes do_sigaction( ):

    new_sa.sa.sa_handler = handler;
    new_sa.sa.sa_flags = SA_ONESHOT | SA_NOMASK;
    ret = do_sigaction(sig, &new_sa, &old_sa);
    return ret ? ret : (unsigned long)old_sa.sa.sa_handler;

9.5.3 Examining the Pending Blocked Signals

The sigpending( ) system call allows a process to examine the set of pending blocked signals, that is, those that have been raised while blocked. This system call fetches only the standard signals.

The corresponding sys_sigpending( ) service routine acts on a single parameter, set, namely, the address of a user variable where the array of bits must be copied:

    pending = current->blocked.sig[0] & current->signal.sig[0];
    if (copy_to_user(set, &pending, sizeof(*set)))
        return -EFAULT;
    return 0;

9.5.4 Modifying the Set of Blocked Signals

The sigprocmask( ) system call allows processes to modify the set of blocked signals; like sigpending( ), this system call applies only to the standard signals.

The corresponding sys_sigprocmask( ) service routine acts on three parameters:

oset
    Pointer in the process address space to a bit array where the previous bit mask must be stored

set
    Pointer in the process address space to the bit array containing the new bit mask


how
    Flag that may have one of the following values:

    SIG_BLOCK
        The *set bit mask array specifies the signals that must be added to the bit mask array of blocked signals.

    SIG_UNBLOCK
        The *set bit mask array specifies the signals that must be removed from the bit mask array of blocked signals.

    SIG_SETMASK
        The *set bit mask array specifies the new bit mask array of blocked signals.

The function invokes copy_from_user( ) to copy the value pointed to by the set parameter into the new_set local variable and copies the bit mask array of standard blocked signals of current into the old_set local variable. It then acts as the how flag specifies on these two variables:

    if (copy_from_user(&new_set, set, sizeof(*set)))
        return -EFAULT;
    new_set &= ~(sigmask(SIGKILL)|sigmask(SIGSTOP));
    old_set = current->blocked.sig[0];
    if (how == SIG_BLOCK)
        sigaddsetmask(&current->blocked, new_set);
    else if (how == SIG_UNBLOCK)
        sigdelsetmask(&current->blocked, new_set);
    else if (how == SIG_SETMASK)
        current->blocked.sig[0] = new_set;
    else
        return -EINVAL;
    recalc_sigpending(current);
    if (oset) {
        if (copy_to_user(oset, &old_set, sizeof(*oset)))
            return -EFAULT;
    }
    return 0;

9.5.5 Suspending the Process

The sigsuspend( ) system call puts the process in the TASK_INTERRUPTIBLE state, after having blocked the standard signals specified by a bit mask array to which the mask parameter points. The process will wake up only when a nonignored, nonblocked signal is sent to it.

The corresponding sys_sigsuspend( ) service routine executes these statements:


    mask &= ~(sigmask(SIGKILL) | sigmask(SIGSTOP));
    saveset = current->blocked;
    current->blocked.sig[0] = mask;
    current->blocked.sig[1] = 0;
    recalc_sigpending(current);
    regs->eax = -EINTR;
    while (1) {
        current->state = TASK_INTERRUPTIBLE;
        schedule( );
        if (do_signal(regs, &saveset))
            return -EINTR;
    }

The schedule( ) function selects another process to run. When the process that issued the sigsuspend( ) system call is executed again, sys_sigsuspend( ) invokes the do_signal( ) function in order to receive the signal that has woken up the process. If that function returns the value 1, the signal is not ignored; therefore, the system call terminates by returning the error code -EINTR.

The sigsuspend( ) system call may appear redundant, since the combined execution of sigprocmask( ) and sleep( ) apparently yields the same result. But this is not true: because of interleaving of process executions, one must be conscious that invoking a system call to perform action A followed by another system call to perform action B is not equivalent to invoking a single system call that performs action A and then action B.

In this particular case, sigprocmask( ) might unblock a signal that will be received before invoking sleep( ). If this happens, the process might remain in a TASK_INTERRUPTIBLE state forever, waiting for the signal that was already received. On the other hand, the sigsuspend( ) system call does not allow signals to be sent after unblocking and before the schedule( ) invocation because other processes cannot grab the CPU during that time interval.

9.5.6 System Calls for Real-Time Signals

Since the system calls previously examined apply only to standard signals, additional system calls must be introduced to allow User Mode processes to handle real-time signals.

Several system calls for real-time signals (rt_sigaction( ), rt_sigpending( ), rt_sigprocmask( ), and rt_sigsuspend( )) are similar to those described earlier and won't be further discussed.

Two other system calls have been introduced to deal with queues of real-time signals:

rt_sigqueueinfo( )
    Sends a real-time signal so that it is added to the real-time signal queue of the destination process

rt_sigtimedwait( )
    Similar to rt_sigsuspend( ), but the process remains suspended only for a fixed time interval
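The queuing behavior served by these two system calls can be tried out through the C library functions sigqueue( ) and sigtimedwait( ), which on Linux are built on top of them. The sketch below assumes a single-threaded POSIX process; the count_queued_rt_signals name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: queues n instances of SIGRTMIN to the calling process
 * while the signal is blocked, then dequeues them without waiting. Returns
 * the number of instances received, or -1 if they arrived out of order.
 */
int count_queued_rt_signals(int n)
{
    sigset_t set, old;
    siginfo_t info;
    struct timespec no_wait = { 0, 0 };
    union sigval value;
    int i, rc, received = 0, in_order = 1;

    /* Keep the signal blocked so that every instance stays queued. */
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    rc = sigprocmask(SIG_BLOCK, &set, &old);
    assert(rc == 0);

    /* Tag each instance with its position in the sending order. */
    for (i = 0; i < n; i++) {
        value.sival_int = i;
        rc = sigqueue(getpid(), SIGRTMIN, value);
        assert(rc == 0);
    }

    /* A zero timeout makes sigtimedwait( ) poll instead of sleeping. */
    while (sigtimedwait(&set, &info, &no_wait) == SIGRTMIN) {
        if (info.si_value.sival_int != received)
            in_order = 0;
        received++;
    }

    rc = sigprocmask(SIG_SETMASK, &old, NULL);
    assert(rc == 0);
    return in_order ? received : -1;
}
```

With a standard signal in place of SIGRTMIN, repeated sends while the signal is blocked would collapse into a single pending instance, which is the difference that the real-time queue is meant to remove.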


We do not discuss these system calls because they are quite similar to those used for standard signals.

9.6 Anticipating Linux 2.4

Signals are pretty much the same in Linux 2.2 and Linux 2.4.


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>Chapter 10. Proc<strong>es</strong>s SchedulingLike any time-sharing system, <strong>Linux</strong> achiev<strong>es</strong> the magical effect of an apparent simultaneousexecution of multiple proc<strong>es</strong>s<strong>es</strong> by switching from one proc<strong>es</strong>s to another in a very short timeframe. Proc<strong>es</strong>s switch itself was discussed in Chapter 3; this chapter deals with scheduling,which is concerned with when to switch and which proc<strong>es</strong>s to choose.<strong>The</strong> chapter consists of three parts. <strong>The</strong> Section 10.1 introduc<strong>es</strong> the choic<strong>es</strong> made by <strong>Linux</strong> toschedule proc<strong>es</strong>s<strong>es</strong> in the abstract. Section 10.2 discuss<strong>es</strong> the data structur<strong>es</strong> used toimplement scheduling and the corr<strong>es</strong>ponding algorithm. Finally, Section 10.3 d<strong>es</strong>crib<strong>es</strong>the system calls that affect proc<strong>es</strong>s scheduling.10.1 Scheduling Policy<strong>The</strong> scheduling algorithm of traditional Unix operating systems must fulfill several conflictingobjectiv<strong>es</strong>: fast proc<strong>es</strong>s r<strong>es</strong>ponse time, good throughput for background jobs, avoidance ofproc<strong>es</strong>s starvation, reconciliation of the needs of low- and high-priority proc<strong>es</strong>s<strong>es</strong>, and so on.<strong>The</strong> set of rul<strong>es</strong> used to determine when and how selecting a new proc<strong>es</strong>s to run is calledscheduling policy.<strong>Linux</strong> scheduling is based on the time-sharing technique already introduced in Section 5.4.3in Chapter 5: several proc<strong>es</strong>s<strong>es</strong> are allowed to run "concurrently," which means that the CPUtime is roughly divided into "slic<strong>es</strong>," one for 
each runnable process. [1] Of course, a single processor can run only one process at any given instant. If a currently running process is not terminated when its time slice or quantum expires, a process switch may take place. Time-sharing relies on timer interrupts and is thus transparent to processes. No additional code needs to be inserted in the programs in order to ensure CPU time-sharing.

[1] Recall that stopped and suspended processes cannot be selected by the scheduling algorithm to run on the CPU.

The scheduling policy is also based on ranking processes according to their priority. Complicated algorithms are sometimes used to derive the current priority of a process, but the end result is the same: each process is associated with a value that denotes how appropriate it is to be assigned to the CPU.

In Linux, process priority is dynamic. The scheduler keeps track of what processes are doing and adjusts their priorities periodically; in this way, processes that have been denied the use of the CPU for a long time interval are boosted by dynamically increasing their priority. Correspondingly, processes running for a long time are penalized by decreasing their priority.

When speaking about scheduling, processes are traditionally classified as "I/O-bound" or "CPU-bound."
The former make heavy use of I/O devices and spend much time waiting for I/O operations to complete; the latter are number-crunching applications that require a lot of CPU time.

An alternative classification distinguishes three classes of processes:


Interactive processes
    These interact constantly with their users, and therefore spend a lot of time waiting for keypresses and mouse operations. When input is received, the process must be woken up quickly, or the user will find the system to be unresponsive. Typically, the average delay must fall between 50 and 150 ms. The variance of such delay must also be bounded, or the user will find the system to be erratic. Typical interactive programs are command shells, text editors, and graphical applications.

Batch processes
    These do not need user interaction, and hence they often run in the background. Since such processes do not need to be very responsive, they are often penalized by the scheduler. Typical batch programs are programming language compilers, database search engines, and scientific computations.

Real-time processes
    These have very strong scheduling requirements. Such processes should never be blocked by lower-priority processes, they should have a short response time and, most important, such response time should have a minimum variance. Typical real-time programs are video and sound applications, robot controllers, and programs that collect data from physical sensors.

The two classifications we just offered are somewhat independent. For instance, a batch process can be either I/O-bound (e.g., a database server) or CPU-bound (e.g., an image-rendering program).
While in Linux real-time programs are explicitly recognized as such by the scheduling algorithm, there is no way to distinguish between interactive and batch programs. In order to offer a good response time to interactive applications, Linux (like all Unix kernels) implicitly favors I/O-bound processes over CPU-bound ones.

Programmers may change the scheduling parameters by means of the system calls illustrated in Table 10-1. More details will be given in Section 10.3.

Table 10-1. System Calls Related to Scheduling

System Call                   Description
nice( )                       Change the priority of a conventional process.
getpriority( )                Get the maximum priority of a group of conventional processes.
setpriority( )                Set the priority of a group of conventional processes.
sched_getscheduler( )         Get the scheduling policy of a process.
sched_setscheduler( )         Set the scheduling policy and priority of a process.
sched_getparam( )             Get the scheduling priority of a process.
sched_setparam( )             Set the priority of a process.
sched_yield( )                Relinquish the processor voluntarily without blocking.
sched_get_priority_min( )     Get the minimum priority value for a policy.
sched_get_priority_max( )     Get the maximum priority value for a policy.
sched_rr_get_interval( )      Get the time quantum value for the Round Robin policy.
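As a rough illustration of how a user program might call a few of the system calls in Table 10-1, the following C sketch queries the scheduling parameters of the calling process. It assumes a Linux system providing the standard <sched.h> and <sys/resource.h> headers; the helper names policy_name( ), fifo_priority_span( ), and show_scheduling_info( ) are invented for this example and are not part of any library.

```c
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>
#include <time.h>

/* Hypothetical helper: map a policy constant to a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_OTHER: return "SCHED_OTHER";
    default:          return "unknown";
    }
}

/* Hypothetical helper: width of the SCHED_FIFO static priority range. */
int fifo_priority_span(void)
{
    return sched_get_priority_max(SCHED_FIFO) -
           sched_get_priority_min(SCHED_FIFO);
}

/* Print the scheduling parameters of the calling process (pid 0). */
void show_scheduling_info(void)
{
    struct timespec ts;
    int prio;

    printf("policy: %s\n", policy_name(sched_getscheduler(0)));

    errno = 0;                      /* -1 can be a legitimate return value */
    prio = getpriority(PRIO_PROCESS, 0);
    if (errno == 0)
        printf("nice value: %d\n", prio);

    printf("SCHED_FIFO priorities: %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));

    if (sched_rr_get_interval(0, &ts) == 0)
        printf("round robin quantum: %ld.%09ld s\n",
               (long)ts.tv_sec, (long)ts.tv_nsec);
}
```

Calling show_scheduling_info( ) from main( ) in an ordinary shell session would normally report the SCHED_OTHER policy, since switching to a real-time policy with sched_setscheduler( ) requires suitable privileges.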


Most system calls shown in the table apply to real-time processes, thus allowing users to develop real-time applications. However, Linux does not support the most demanding real-time applications because its kernel is nonpreemptive (see Section 10.2.5).

10.1.1 Process Preemption

As mentioned in the first chapter, Linux processes are preemptive. If a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run (usually the process that just became runnable). Of course, a process may also be preempted when its time quantum expires. As mentioned in Section 5.4.3 in Chapter 5, when this occurs, the need_resched field of the current process is set, so the scheduler is invoked when the timer interrupt handler terminates.

For instance, let us consider a scenario in which only two programs (a text editor and a compiler) are being executed. The text editor is an interactive program, therefore it has a higher dynamic priority than the compiler. Nevertheless, it is often suspended, since the user alternates between pauses for think time and data entry; moreover, the average delay between two keypresses is relatively long. However, as soon as the user presses a key, an interrupt is raised, and the kernel wakes up the text editor process.
The kernel also determines that the dynamic priority of the editor is higher than the priority of current, the currently running process (that is, the compiler), and hence it sets the need_resched field of this process, thus forcing the scheduler to be activated when the kernel finishes handling the interrupt. The scheduler selects the editor and performs a task switch; as a result, the execution of the editor is resumed very quickly and the character typed by the user is echoed to the screen. When the character has been processed, the text editor process suspends itself waiting for another keypress, and the compiler process can resume its execution.

Be aware that a preempted process is not suspended, since it remains in the TASK_RUNNING state; it simply no longer uses the CPU.

Some real-time operating systems feature preemptive kernels, which means that a process running in Kernel Mode can be interrupted after any instruction, just as it can in User Mode. The Linux kernel is not preemptive, which means that a process can be preempted only while running in User Mode; nonpreemptive kernel design is much simpler, since most synchronization problems involving the kernel data structures are easily avoided (see Section 11.2.1 in Chapter 11).

10.1.2 How Long Must a Quantum Last?

The quantum duration is critical for system performance: it should be neither too long nor too short.

If the quantum duration is too short, the system overhead caused by task switches becomes excessively high.
For instance, suppose that a task switch requires 10 milliseconds; if the quantum is also set to 10 milliseconds, then at least 50% of the CPU cycles will be dedicated to task switches. [2]

[2] Actually, things could be much worse than this; for example, if the time required for a task switch is counted in the process quantum, all CPU time will be devoted to task switches and no process can progress toward its termination. Anyway, you get the point.
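The arithmetic behind the 50% figure can be sketched in a few lines of C. The function name switch_overhead_percent( ) is made up for this illustration, and the model is deliberately simple: it assumes that every quantum is followed by exactly one task switch and that the switch time is not counted in the quantum.

```c
#include <assert.h>

/*
 * Percentage of CPU time spent on task switches when each quantum of
 * quantum_ms milliseconds is followed by one switch costing switch_ms.
 * Hypothetical helper, not taken from the kernel sources.
 */
int switch_overhead_percent(int switch_ms, int quantum_ms)
{
    assert(switch_ms >= 0 && quantum_ms > 0);
    return (100 * switch_ms) / (switch_ms + quantum_ms);
}
```

With a 10 ms switch and a 10 ms quantum the function returns 50; stretching the quantum to 190 ms while keeping the same switch cost brings the overhead down to 5.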


If the quantum duration is too long, processes no longer appear to be executed concurrently. For instance, let's suppose that the quantum is set to five seconds; each runnable process makes progress for about five seconds, but then it stops for a very long time (typically, five seconds times the number of runnable processes).

It is often believed that a long quantum duration degrades the response time of interactive applications. This is usually false. As described in Section 10.1.1 earlier in this chapter, interactive processes have a relatively high priority, therefore they quickly preempt the batch processes, no matter how long the quantum duration is.

In some cases, a quantum duration that is too long degrades the responsiveness of the system. For instance, suppose that two users concurrently enter two commands at the respective shell prompts; one command is CPU-bound, while the other is an interactive application. Both shells fork a new process and delegate the execution of the user's command to it; moreover, suppose that such new processes have the same priority initially (Linux does not know in advance if an executed program is batch or interactive). Now, if the scheduler selects the CPU-bound process to run, the other process could wait for a whole time quantum before starting its execution.
Therefore, if such duration is long, the system could appear to be unresponsive to the user who launched it.

The choice of quantum duration is always a compromise. The rule of thumb adopted by Linux is: choose a duration as long as possible, while keeping good system response time.

10.2 The Scheduling Algorithm

The Linux scheduling algorithm works by dividing the CPU time into epochs. In a single epoch, every process has a specified time quantum whose duration is computed when the epoch begins. In general, different processes have different time quantum durations. The time quantum value is the maximum CPU time portion assigned to the process in that epoch. When a process has exhausted its time quantum, it is preempted and replaced by another runnable process. Of course, a process can be selected several times by the scheduler in the same epoch, as long as its quantum has not been exhausted; for instance, if it suspends itself to wait for I/O, it preserves some of its time quantum and can be selected again during the same epoch. The epoch ends when all runnable processes have exhausted their quantum; in this case, the scheduler algorithm recomputes the time-quantum durations of all processes and a new epoch begins.

Each process has a base time quantum: it is the time-quantum value assigned by the scheduler to the process if it has exhausted its quantum in the previous epoch.
The users can change the base time quantum of their processes by using the nice( ) and setpriority( ) system calls (see Section 10.3 later in this chapter). A new process always inherits the base time quantum of its parent.

The INIT_TASK macro sets the value of the base time quantum of process 0 (swapper) to DEF_PRIORITY; that macro is defined as follows:

    #define DEF_PRIORITY (20*HZ/100)

Since HZ, which denotes the frequency of timer interrupts, is set to 100 for IBM PCs (see Section 5.1.3 in Chapter 5), the value of DEF_PRIORITY is 20 ticks, that is, about 210 ms.
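The value of the macro can be checked with a small, stand-alone C sketch. HZ is hard-coded to 100 here purely for illustration, and ticks_to_ms( ) is a hypothetical helper; it uses a nominal tick of 1000/HZ milliseconds, so 20 ticks come out as 200 ms, in line with the approximate figure quoted above.

```c
/* Values copied from the text; HZ is fixed at 100 only for this sketch. */
#define HZ 100                      /* timer interrupts per second */
#define DEF_PRIORITY (20*HZ/100)    /* base time quantum, in ticks */

/* Hypothetical helper: convert ticks to milliseconds (nominal tick). */
int ticks_to_ms(int ticks)
{
    return ticks * 1000 / HZ;
}
```

Because the macro scales with HZ, a platform with a faster timer would get more ticks per quantum but roughly the same wall-clock duration.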


Users rarely change the base time quantum of their processes, so DEF_PRIORITY also denotes the base time quantum of most processes in the system.

In order to select a process to run, the Linux scheduler must consider the priority of each process. Actually, there are two kinds of priority:

Static priority
    This kind is assigned by the users to real-time processes and ranges from 1 to 99. It is never changed by the scheduler.

Dynamic priority
    This kind applies only to conventional processes; it is essentially the sum of the base time quantum (which is therefore also called the base priority of the process) and of the number of ticks of CPU time left to the process before its quantum expires in the current epoch.

Of course, the static priority of a real-time process is always higher than the dynamic priority of a conventional one: the scheduler will start running conventional processes only when there is no real-time process in a TASK_RUNNING state.

10.2.1 Data Structures Used by the Scheduler

We recall from Section 3.1 in Chapter 3 that the process list links together all process descriptors, while the runqueue list links together the process descriptors of all runnable processes, that is, of those in a TASK_RUNNING state.
In both cases, the init_task process descriptor plays the role of list header.

Each process descriptor includes several fields related to scheduling:

need_resched
    A flag checked by ret_from_intr( ) to decide whether to invoke the schedule( ) function (see Section 4.7.1 in Chapter 4).

policy
    The scheduling class. The values permitted are:

    SCHED_FIFO
        A First-In, First-Out real-time process. When the scheduler assigns the CPU to the process, it leaves the process descriptor in its current position in the runqueue list. If no other higher-priority real-time process is runnable, the process will continue to use the CPU as long as it wishes, even if other real-time processes having the same priority are runnable.


    SCHED_RR
        A Round Robin real-time process. When the scheduler assigns the CPU to the process, it puts the process descriptor at the end of the runqueue list. This policy ensures a fair assignment of CPU time to all SCHED_RR real-time processes that have the same priority.

    SCHED_OTHER
        A conventional, time-shared process.

    The policy field also encodes a SCHED_YIELD binary flag. This flag is set when the process invokes the sched_yield( ) system call (a way of voluntarily relinquishing the processor without the need to start an I/O operation or go to sleep; see Section 10.3.3). The scheduler puts the process descriptor at the bottom of the runqueue list (see Section 10.3).

rt_priority
    The static priority of a real-time process. Conventional processes do not make use of this field.

priority
    The base time quantum (or base priority) of the process.

counter
    The number of ticks of CPU time left to the process before its quantum expires; when a new epoch begins, this field contains the time-quantum duration of the process. Recall that the update_process_times( ) function decrements the counter field of the current process by 1 at every tick.

When a new process is created, do_fork( ) sets the counter field of both current (the parent) and p (the child) processes in the following way:

    current->counter >>= 1;
    p->counter = current->counter;

In other words, the number of ticks left to the parent is split in two halves, one for the parent and one for the child. This is done to prevent users from getting an unlimited amount of CPU time by using the following method: the parent process creates a child process that runs the same code and then kills itself; by properly adjusting the creation rate, the child process would always get a fresh quantum before the quantum of its parent expires. This programming trick does not work since the kernel does not reward forks. Similarly, a user cannot hog an unfair share of the processor by starting lots of background processes in a shell or by opening a lot of windows on a graphical desktop.
More generally speaking, a process cannot hog resources (unless it has privileges to give itself a real-time policy) by forking multiple descendants.
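The effect of the two do_fork( ) statements can be reproduced in user space with a short C sketch. The struct task type and both function names are hypothetical stand-ins introduced only for this example (the real process descriptor has many more fields); the sketch merely shows that a chain of "fork, then let the parent exit" leaves each new child with fewer ticks, never more.

```c
#include <assert.h>

/* Hypothetical stand-in for the scheduling fields of a process descriptor. */
struct task {
    int counter;   /* ticks left in the current epoch */
    int priority;  /* base time quantum, in ticks */
};

/* Mimic do_fork(): the parent keeps half of its ticks, the child gets the same half. */
void fork_split_counter(struct task *parent, struct task *child)
{
    parent->counter >>= 1;
    child->counter = parent->counter;
    child->priority = parent->priority;  /* base quantum is inherited */
}

/*
 * Simulate the trick described in the text: fork, let the parent exit,
 * and repeat in the child. Returns the ticks owned by the last child.
 */
int surviving_counter(int initial_counter, int forks)
{
    struct task parent = { initial_counter, 20 };
    struct task child = { 0, 0 };
    int i;

    assert(initial_counter >= 0 && forks >= 0);
    for (i = 0; i < forks; i++) {
        fork_split_counter(&parent, &child);
        parent = child;  /* the old parent exits; the child forks next */
    }
    return parent.counter;
}
```

Starting from 20 ticks, three generations leave the surviving process with 2 ticks and five generations leave it with none, so repeated forking shrinks the remaining quantum rather than renewing it.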


Notice that the priority and counter fields play different roles for the various kinds of processes. For conventional processes, they are used both to implement time-sharing and to compute the process dynamic priority. For SCHED_RR real-time processes, they are used only to implement time-sharing. Finally, for SCHED_FIFO real-time processes, they are not used at all, because the scheduling algorithm regards the quantum duration as unlimited.

10.2.2 The schedule( ) Function

schedule( ) implements the scheduler. Its objective is to find a process in the runqueue list and then assign the CPU to it. It is invoked, directly or in a lazy way, by several kernel routines.

10.2.2.1 Direct invocation

The scheduler is invoked directly when the current process must be blocked right away because the resource it needs is not available. In this case, the kernel routine that wants to block it proceeds as follows:

1. Inserts current in the proper wait queue
2. Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE
3. Invokes schedule( )
4. Checks if the resource is available; if not, goes to step 2
5. Once the resource is available, removes current from the wait queue

As can be seen, the kernel routine checks repeatedly whether the resource needed by the process is available; if not, it yields the CPU to some other process by invoking schedule( ).
Later, when the scheduler once again grants the CPU to the process, the availability of the resource is again checked.

You may have noticed that these steps are similar to those performed by the sleep_on( ) and interruptible_sleep_on( ) functions described in Section 3.1.4 in Chapter 3. However, the functions we discuss here immediately remove the process from the wait queue as soon as it is woken up.

The scheduler is also directly invoked by many device drivers that execute long iterative tasks. At each iteration cycle, the driver checks the value of the need_resched field and, if necessary, invokes schedule( ) to voluntarily relinquish the CPU.

10.2.2.2 Lazy invocation

The scheduler can also be invoked in a lazy way by setting the need_resched field of current to 1. Since a check on the value of this field is always made before resuming the execution of a User Mode process (see Section 4.7 in Chapter 4), schedule( ) will definitely be invoked in the near future.

Lazy invocation of the scheduler is performed in the following cases:

• When current has used up its quantum of CPU time; this is done by the update_process_times( ) function.


• When a process is woken up and its priority is higher than that of the current process; this task is performed by the reschedule_idle( ) function, which is invoked by the wake_up_process( ) function (see Section 3.1.2 in Chapter 3):

    if (goodness(current, p) > goodness(current, current))
        current->need_resched = 1;

  (The goodness( ) function will be described later in Section 10.2.3.)

• When a sched_setscheduler( ) or sched_yield( ) system call is issued (see Section 10.3 later in this chapter).

10.2.2.3 Actions performed by schedule( )

Before actually scheduling a process, the schedule( ) function starts by running the functions left by other kernel control paths in various queues. The function invokes run_task_queue( ) on the tq_scheduler task queue. Linux puts a function in that task queue when it must defer its execution until the next schedule( ) invocation:

    run_task_queue(&tq_scheduler);

The function then executes all active unmasked bottom halves. These are usually present to perform tasks requested by device drivers (see Section 4.6.6 in Chapter 4):

    if (bh_active & bh_mask)
        do_bottom_half( );

Now comes the actual scheduling, and therefore a potential process switch.

The value of current is saved in the prev local variable and the need_resched field of prev is set to 0.
The key outcome of the function is to set another local variable called next so that it points to the descriptor of the process selected to replace prev.

First, a check is made to determine whether prev is a Round Robin real-time process (policy field set to SCHED_RR) that has exhausted its quantum. If so, schedule( ) assigns a new quantum to prev and puts it at the bottom of the runqueue list:

    if (!prev->counter && prev->policy == SCHED_RR) {
        prev->counter = prev->priority;
        move_last_runqueue(prev);
    }

Now schedule( ) examines the state of prev. If it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function wakes up the process as follows. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution:

    if (prev->state == TASK_INTERRUPTIBLE &&
        signal_pending(prev))
        prev->state = TASK_RUNNING;


If prev is not in the TASK_RUNNING state, schedule( ) was directly invoked by the process itself because it had to wait on some external resource; therefore, prev must be removed from the runqueue list:

    if (prev->state != TASK_RUNNING)
        del_from_runqueue(prev);

Next, schedule( ) must select the process to be executed in the next time quantum. To that end, the function scans the runqueue list. It starts from the process referenced by the next_run field of init_task, which is the descriptor of process 0 (swapper). The objective is to store in next the process descriptor pointer of the highest priority process. In order to do this, next is initialized to the first runnable process to be checked, and c is initialized to its "goodness" (see Section 10.2.3):

    if (prev->state == TASK_RUNNING) {
        next = prev;
        if (prev->policy & SCHED_YIELD) {
            prev->policy &= ~SCHED_YIELD;
            c = 0;
        } else
            c = goodness(prev, prev);
    } else {
        c = -1000;
        next = &init_task;
    }

If the SCHED_YIELD flag of prev->policy is set, prev has voluntarily relinquished the CPU by issuing a sched_yield( ) system call. In this case, the function assigns a zero goodness to it.

Now schedule( ) repeatedly invokes the goodness( ) function on the runnable processes to determine the best candidate:

    p = init_task.next_run;
    while (p != &init_task) {
        weight = goodness(prev, p);
        if (weight > c) {
            c = weight;
            next = p;
        }
        p = p->next_run;
    }

The while loop selects the first process in the runqueue having maximum weight.
If the previous process is runnable, it is preferred with respect to other runnable processes having the same weight.

Notice that if the runqueue list is empty (no runnable process exists except for swapper), the cycle is not entered and next points to init_task. Moreover, if all processes in the runqueue list have a priority less than or equal to the priority of prev, no process switch will take place and the old process will continue to be executed.

A further check must be made at the exit of the loop to determine whether c is 0. This occurs only when all the processes in the runqueue list have exhausted their quantum, that is, all of


them have a zero counter field. When this happens, a new epoch begins, therefore schedule( ) assigns to all existing processes (not only to the TASK_RUNNING ones) a fresh quantum, whose duration is the sum of the priority value plus half the counter value:

    if (!c) {
        for_each_task(p)
            p->counter = (p->counter >> 1) + p->priority;
    }

In this way, suspended or stopped processes have their dynamic priorities periodically increased. As stated earlier, the rationale for increasing the counter value of suspended or stopped processes is to give preference to I/O-bound processes. However, even after an infinite number of increases, the value of counter can never become larger than twice [3] the priority value.

[3] Assume both priority and counter equal to P; then the geometric series P × (1 + 1/2 + 1/4 + 1/8 + ...) converges to 2 × P.

Now comes the concluding part of schedule( ): if a process other than prev has been selected, a process switch must take place.
Before performing it, however, the context_swtch field of kstat is increased by 1 to update the statistics maintained by the kernel:

    if (prev != next) {
        kstat.context_swtch++;
        switch_to(prev,next);
    }
    return;

Notice that the return statement that exits from schedule( ) will not be performed right away by the next process but at a later time by the prev one when the scheduler selects it again for execution.

10.2.3 How Good Is a Runnable Process?

The heart of the scheduling algorithm includes identifying the best candidate among all processes in the runqueue list. This is what the goodness( ) function does. It receives as input parameters prev (the descriptor pointer of the previously running process) and p (the descriptor pointer of the process to evaluate). The integer value c returned by goodness( ) measures the "goodness" of p and has the following meanings:

c = -1000
    p must never be selected; this value is returned when the runqueue list contains only init_task.

c = 0
    p has exhausted its quantum. Unless p is the first process in the runqueue list and all runnable processes have also exhausted their quantum, it will not be selected for execution.


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>0 < c < 1000c >= 1000p is a conventional proc<strong>es</strong>s that has not exhausted its quantum; a higher value of cdenot<strong>es</strong> a higher level of goodn<strong>es</strong>s.p is a real-time proc<strong>es</strong>s; a higher value of c denot<strong>es</strong> a higher level of goodn<strong>es</strong>s.<strong>The</strong> goodn<strong>es</strong>s( ) function is equivalent to:if (p->policy != SCHED_OTHER)return 1000 + p->rt_priority;if (p->counter == 0)return 0;if (p->mm == prev->mm)return p->counter + p->priority + 1;return p->counter + p->priority;If the proc<strong>es</strong>s is real-time, its goodn<strong>es</strong>s is set to at least 1000. If it is a conventional proc<strong>es</strong>sthat has exhausted its quantum, its goodn<strong>es</strong>s is set to 0; otherwise, it is set to p->counter +p->priority.A small bonus is given to p if it shar<strong>es</strong> the addr<strong>es</strong>s space with prev (i.e., if their proc<strong>es</strong>sd<strong>es</strong>criptors' mm fields point to the same memory d<strong>es</strong>criptor). <strong>The</strong> rationale for this bonus is thatif p runs right after prev, it will use the same page tabl<strong>es</strong>, hence the same memory; some ofthe valuable data may still be in the hardware cache.10.2.4 <strong>The</strong> <strong>Linux</strong>/SMP Scheduler<strong>The</strong> <strong>Linux</strong> scheduler must be slightly modified in order to support the symmetricmultiproc<strong>es</strong>sor (SMP) architecture. 
Actually, each processor runs the schedule( ) function on its own, but processors must exchange information in order to boost system performance.

When the scheduler computes the goodness of a runnable process, it should consider whether that process was previously running on the same CPU or on another one. A process that was running on the same CPU is always preferred, since the hardware cache of the CPU could still include useful data. This rule helps in reducing the number of cache misses.

Let us suppose, however, that CPU 1 is running a process when a second, higher-priority process that was last running on CPU 2 becomes runnable. Now the kernel is faced with an interesting dilemma: should it immediately execute the higher-priority process on CPU 1, or should it defer that process's execution until CPU 2 becomes available? In the former case, the contents of the hardware cache are discarded; in the latter case, the parallelism of the SMP architecture may not be fully exploited when CPU 2 is running the idle process (swapper).

In order to achieve good system performance, Linux/SMP adopts an empirical rule to solve the dilemma. The adopted choice is always a compromise, and the trade-off mainly depends on the size of the hardware caches integrated into each CPU: the larger the CPU cache is, the more convenient it is to keep a process bound to that CPU.
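Before moving on to the SMP-specific data structures, the uniprocessor goodness( ) rules of Section 10.2.3 can be exercised in a small user-space sketch. This is not kernel code: struct toy_task, the TOY_SCHED_* constants, toy_goodness( ), and toy_pick_next( ) are hypothetical names introduced only for illustration, and the selection loop is a simplified stand-in for the scan that schedule( ) performs over the runqueue list.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative policy constants; the kernel defines its own values. */
enum { TOY_SCHED_OTHER = 0, TOY_SCHED_FIFO = 1, TOY_SCHED_RR = 2 };

/* Only the fields that the goodness rules in the text consult. */
struct toy_task {
    int policy;        /* TOY_SCHED_OTHER for conventional processes */
    int rt_priority;   /* static priority of a real-time process */
    int counter;       /* ticks left in the current quantum */
    int priority;      /* base priority */
    const void *mm;    /* stands in for the memory descriptor pointer */
};

/* Same four cases as the code shown in Section 10.2.3. */
int toy_goodness(const struct toy_task *prev, const struct toy_task *p)
{
    assert(prev != NULL && p != NULL);
    if (p->policy != TOY_SCHED_OTHER)
        return 1000 + p->rt_priority;        /* real-time: at least 1000 */
    if (p->counter == 0)
        return 0;                            /* quantum exhausted */
    if (p->mm == prev->mm)
        return p->counter + p->priority + 1; /* shared address space bonus */
    return p->counter + p->priority;
}

/* Return the index of the task with the highest goodness, or -1 if n is 0. */
int toy_pick_next(const struct toy_task *prev,
                  const struct toy_task *tasks, int n)
{
    int best = -1;
    int best_c = -1000;   /* the "must never be selected" value */

    for (int i = 0; i < n; i++) {
        int c = toy_goodness(prev, &tasks[i]);
        if (c > best_c) {
            best_c = c;
            best = i;
        }
    }
    return best;
}
```

Because a real-time process always scores at least 1000, it wins over any conventional process regardless of the latter's counter and priority, which is the ordering the four ranges of c are meant to guarantee.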


10.2.4.1 Linux/SMP scheduler data structures

An aligned_data table includes one data structure for each processor, which is used mainly to obtain the descriptors of current processes quickly. Each element is filled by every invocation of the schedule( ) function and has the following structure:

    struct schedule_data {
        struct task_struct * curr;
        unsigned long last_schedule;
    };

The curr field points to the descriptor of the process running on the corresponding CPU, while last_schedule specifies when schedule( ) selected curr as the running process.

Several SMP-related fields are included in the process descriptor. In particular, the avg_slice field keeps track of the average quantum duration of the process, and the processor field stores the logical identifier of the last CPU that executed it.

The cacheflush_time variable contains a rough estimate of the minimal number of CPU cycles it takes to entirely overwrite the hardware cache content. It is initialized by the smp_tune_scheduling( ) function to:

Intel Pentium processors have a hardware cache of 8 KB, so their cacheflush_time is initialized to a few hundred CPU cycles, that is, a few microseconds.
Recent Intel processors have larger hardware caches, and therefore the minimal cache flush time could range from 50 to 100 microseconds.

As we shall see later, if cacheflush_time is greater than the average time slice of some currently running process, no process preemption is performed because it is convenient in this case to bind processes to the processors that last executed them.

10.2.4.2 The schedule( ) function

When the schedule( ) function is executed on an SMP system, it carries out the following operations:

1. Performs the initial part of schedule( ) as usual.
2. Stores the logical identifier of the executing processor in the this_cpu local variable; such value is read from the processor field of prev (that is, of the process to be replaced).
3. Initializes the sched_data local variable so that it points to the schedule_data structure of the this_cpu CPU.
4. Invokes goodness( ) repeatedly to select the new process to be executed; this function also examines the processor field of the processes and gives a substantial bonus (PROC_CHANGE_PENALTY, usually 15) to the process that was last executed on the this_cpu CPU.
5. If needed, recomputes process dynamic priorities as usual.
6. Sets sched_data->curr to next.
7. Sets next->has_cpu to 1 and next->processor to this_cpu.
8. Stores the current Time Stamp Counter value in the t local variable.
9. Stores the last time slice duration of prev in the this_slice local variable; this value is the difference between t and sched_data->last_schedule.
10. Sets sched_data->last_schedule to t.
11. Sets the avg_slice field of prev to (prev->avg_slice+this_slice)/2; in other words, updates the average.
12. Performs the context switch.
13. When the kernel returns here, the original previous process has been selected again by the scheduler; the prev local variable now refers to the process that has just been replaced. If prev is still runnable and it is not the idle task of this CPU, invokes the reschedule_idle( ) function on it (see the next section).
14. Sets the has_cpu field of prev to 0.

10.2.4.3 The reschedule_idle( ) function

The reschedule_idle( ) function is invoked when a process p becomes runnable (see Section 10.2.2). On an SMP system, the function determines whether the process should preempt the current process of some CPU. It performs the following operations:

1. If p is a real-time process, always attempts to perform preemption: go to step 3.
2. Returns immediately (does not attempt to preempt) if there is a CPU whose current process satisfies both of the following conditions: [4]
   o cacheflush_time is greater than the average time slice of the current process. If this is true, the process is not dirtying the cache significantly.
   o Both p and the current process need the global kernel lock (see Section 11.4.6 in Chapter 11) in order to access some critical kernel data structure. This check is performed because replacing a process holding the lock with another one that needs it is not fruitful.
3. If the p->processor CPU (the one on which p was last running) is idle, selects it.
4. Otherwise, computes the difference:

       goodness(tsk, p) - goodness(tsk, tsk)

   for each task tsk running on some CPU and selects the CPU for which the difference is greatest, provided it is a positive value.
5. If a CPU has been selected, sets the need_resched field of the corresponding running process and sends a "reschedule" message to that processor (see Section 11.4.7 in Chapter 11).

[4] These conditions look like voodoo magic; perhaps, they are empirical rules that make the SMP scheduler work better.

10.2.5 Performance of the Scheduling Algorithm

The scheduling algorithm of Linux is both self-contained and relatively easy to follow. For that reason, many kernel hackers love to try to make improvements. However, the scheduler is a rather mysterious component of the kernel. While you can change its performance significantly by modifying just a few key parameters, there is usually no theoretical support to justify the results obtained. Furthermore, you can't be sure that the positive (or negative) results obtained will continue to hold when the mix of requests submitted by the users (real-time, interactive, I/O-bound, background, etc.) varies significantly. Actually, for almost every proposed scheduling strategy, it is possible to derive an artificial mix of requests that yields poor system performance.

Let us try to outline some pitfalls of the Linux scheduler. As it turns out, some of these limitations become significant on large systems with many users. On a single workstation that is running a few tens of processes at a time, the Linux scheduler is quite efficient. Since Linux was born on an Intel 80386 and continues to be most popular in the PC world, we consider the current Linux scheduler quite appropriate.

10.2.5.1 The algorithm does not scale well

If the number of existing processes is very large, it is inefficient to recompute all dynamic priorities at once.

In old traditional Unix kernels, the dynamic priorities were recomputed every second, thus the problem was even worse. Linux tries instead to minimize the overhead of the scheduler. Priorities are recomputed only when all runnable processes have exhausted their time quantum.
Therefore, when the number of processes is large, the recomputation phase is more expensive but is executed less frequently.

This simple approach has the disadvantage that when the number of runnable processes is very large, I/O-bound processes are seldom boosted, and therefore interactive applications have a longer response time.

10.2.5.2 The predefined quantum is too large for high system loads

The system responsiveness experienced by users depends heavily on the system load, which is the average number of processes that are runnable, and hence waiting for CPU time. [5]

[5] The uptime program returns the system load for the past 1, 5, and 15 minutes. The same information can be obtained by reading the /proc/loadavg file.

As mentioned before, system responsiveness depends also on the average time-quantum duration of the runnable processes. In Linux, the predefined time quantum appears to be too large for high-end machines having a very high expected system load.

10.2.5.3 I/O-bound process boosting strategy is not optimal

The preference for I/O-bound processes is a good strategy to ensure a short response time for interactive programs, but it is not perfect. Indeed, some batch programs with almost no user interaction are I/O-bound. For instance, consider a database search engine that must typically read lots of data from the hard disk or a network application that must collect data from a remote host on a slow link. Even if these kinds of processes do not need a short response time, they are boosted by the scheduling algorithm.

On the other hand, interactive programs that are also CPU-bound may appear unresponsive to the users, since the increment of dynamic priority due to I/O blocking operations may not compensate for the decrement due to CPU usage.
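The boost discussed in this section comes from the epoch recomputation of Section 10.2.2, where each process receives counter/2 + priority at the start of a new epoch. A minimal user-space sketch, assuming a process that sleeps through every epoch so that its counter is never decremented, shows how the value climbs toward, but does not exceed, twice the priority. The function names and the sample priority of 20 are assumptions made for illustration, not kernel identifiers.

```c
#include <assert.h>

/* One epoch boundary for one process: counter = counter/2 + priority. */
int toy_new_epoch_counter(int counter, int priority)
{
    assert(counter >= 0 && priority > 0);
    return (counter >> 1) + priority;
}

/*
 * Apply the recomputation for `epochs` consecutive epochs to a process
 * that never runs in between, so nothing is subtracted from its counter.
 */
int toy_sleeper_counter(int counter, int priority, int epochs)
{
    for (int i = 0; i < epochs; i++)
        counter = toy_new_epoch_counter(counter, priority);
    return counter;
}
```

With a priority of 20, a process that used its whole quantum restarts each epoch at 20, while one that keeps sleeping moves through 30, 35, 37, and so on, settling just below 40. Any I/O-bound process gets this advantage, whether or not it is interactive, which is the weakness noted above.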


10.2.5.4 Support for real-time applications is weak

As stated in the first chapter, nonpreemptive kernels are not well suited for real-time applications, since processes may spend several milliseconds in Kernel Mode while handling an interrupt or exception. During this time, a real-time process that becomes runnable cannot be resumed. This is unacceptable for real-time applications, which require predictable and low response times. [6]

[6] The Linux kernel has been modified in several ways so it can handle a few hard real-time jobs if they remain short. Basically, hardware interrupts are trapped and kernel execution is monitored by a kind of "superkernel." These changes do not make Linux a true real-time system, though.

Future versions of Linux will likely address this problem, either by implementing SVR4's "fixed preemption points" or by making the kernel fully preemptive.

However, kernel preemption is just one of several necessary conditions for implementing an effective real-time scheduler. Several other issues must be considered. For instance, real-time processes often must use resources also needed by conventional processes. A real-time process may thus end up waiting until a lower-priority process releases some resource. This phenomenon is called priority inversion. Moreover, a real-time process could require a kernel service that is granted on behalf of another lower-priority process (for example, a kernel thread). This phenomenon is called hidden scheduling. An effective real-time scheduler should address and resolve such problems.

10.3 System Calls Related to Scheduling

Several system calls have been introduced to allow processes to change their priorities and scheduling policies. As a general rule, users are always allowed to lower the priorities of their processes. However, if they want to modify the priorities of processes belonging to some other user or if they want to increase the priorities of their own processes, they must have superuser privileges.

10.3.1 The nice( ) System Call

The nice( ) [7] system call allows processes to change their base priority. The integer value contained in the increment parameter is used to modify the priority field of the process descriptor. The nice Unix command, which allows users to run programs with modified scheduling priority, is based on this system call.

[7] Since this system call is usually invoked to lower the priority of a process, users who invoke it for their processes are "nice" toward other users.

The sys_nice( ) service routine handles the nice( ) system call. Although the increment parameter may have any value, absolute values larger than 40 are trimmed down to 40. Traditionally, negative values correspond to requests for priority increments and require superuser privileges, while positive ones correspond to requests for priority decrements.

The function starts by copying the value of increment into the newprio local variable. In the case of a negative increment, the function invokes the capable( ) function to verify whether the process has a CAP_SYS_NICE capability. We shall discuss that function, together with the notion of capability, in Chapter 19. If the user turns out to have the capability required to change priorities, sys_nice( ) changes the sign of newprio and it sets the increase local flag:


    increase = 0;
    newprio = increment;
    if (increment < 0) {
        if (!capable(CAP_SYS_NICE))
            return -EPERM;
        newprio = -increment;
        increase = 1;
    }

If newprio has a value larger than 40, the function trims it down to 40. At this point, the newprio local variable may have any value from 0 to 40, inclusive. The value is then converted according to the priority scale used by the scheduling algorithm. Since the highest base priority allowed is 2 x DEF_PRIORITY, the new value is:

    newprio x (2 x DEF_PRIORITY) / 40, rounded to the nearest integer

The resulting value is copied into increment with the proper sign:

    if (newprio > 40)
        newprio = 40;
    newprio = (newprio * DEF_PRIORITY + 10) / 20;
    increment = newprio;
    if (increase)
        increment = -increment;

Since newprio is an integer variable, the expression in the code is equivalent to the formula shown earlier.

The function then sets the final value of priority by subtracting the value of increment from it. However, the final base priority of the process cannot be smaller than 1 or larger than 2 x DEF_PRIORITY:

    if (current->priority - increment < 1)
        current->priority = 1;
    else if (current->priority - increment > DEF_PRIORITY*2)
        current->priority = DEF_PRIORITY*2;
    else
        current->priority -= increment;
    return 0;

A niced process changes over time like any other process, getting extra priority if necessary or dropping back in deference to other processes.

10.3.2 The getpriority( ) and setpriority( ) System Calls

The nice( ) system call affects only the process that invokes it. Two other system calls, denoted as getpriority( ) and setpriority( ), act on the base priorities of all processes in a given group. getpriority( ) returns 20 plus the highest base priority among all processes in a given group; setpriority( ) sets the base priority of all processes in a given group to a given value.
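Both nice( ) and the two calls just introduced ultimately adjust the base priority, so it may help to replay the scaling arithmetic of sys_nice( ) from Section 10.3.1 outside the kernel. The sketch below is an illustration under stated assumptions: TOY_DEF_PRIORITY is a made-up value (the kernel defines its own DEF_PRIORITY), the function names are hypothetical, and the CAP_SYS_NICE check is deliberately left out.

```c
#include <assert.h>

/* Hypothetical value for illustration; not the kernel's DEF_PRIORITY. */
#define TOY_DEF_PRIORITY 10

/*
 * Convert a nice increment to the scheduler's priority scale: trim the
 * magnitude to 40, scale it with rounding, and restore the sign.
 */
int toy_scale_increment(int increment)
{
    int magnitude;

    if (increment > 40 || increment < -40)
        magnitude = 40;
    else
        magnitude = increment < 0 ? -increment : increment;

    /* Integer division with +10 rounds to the nearest step. */
    magnitude = (magnitude * TOY_DEF_PRIORITY + 10) / 20;
    return increment < 0 ? -magnitude : magnitude;
}

/* Subtract the scaled increment and clamp to [1, 2 * TOY_DEF_PRIORITY]. */
int toy_apply_nice(int priority, int increment)
{
    int result = priority - toy_scale_increment(increment);

    assert(priority >= 1 && priority <= 2 * TOY_DEF_PRIORITY);
    if (result < 1)
        result = 1;
    if (result > 2 * TOY_DEF_PRIORITY)
        result = 2 * TOY_DEF_PRIORITY;
    return result;
}
```

A positive increment lowers the base priority and a negative one raises it; whatever the request, the result stays between 1 and twice the default priority.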


The kernel implements these system calls by means of the sys_getpriority( ) and sys_setpriority( ) service routines. Both of them act essentially on the same group of parameters:

which
    Identifies the group of processes; it can assume one of the following values:

    PRIO_PROCESS
        Select the processes according to their process ID (pid field of the process descriptor).

    PRIO_PGRP
        Select the processes according to their group ID (pgrp field of the process descriptor).

    PRIO_USER
        Select the processes according to their user ID (uid field of the process descriptor).

who
    Value of the pid, pgrp, or uid field (depending on the value of which) to be used for selecting the processes. If who is 0, its value is set to that of the corresponding field of the current process.

niceval
    The new base priority value (needed only by sys_setpriority( )). It should range between -20 (highest priority) and +20 (minimum priority).

As stated before, only processes with a CAP_SYS_NICE capability are allowed to increase their own base priority or to modify that of other processes.

As we have seen in Chapter 8, system calls return a negative value only if some error occurred.
For that reason, getpriority( ) does not return a normal nice value ranging between -20 and 20, but rather a nonnegative value ranging between 0 and 40.

10.3.3 System Calls Related to Real-Time Processes

We now introduce a group of system calls that allow processes to change their scheduling discipline and, in particular, to become real-time processes. As usual, a process must have a CAP_SYS_NICE capability in order to modify the values of the rt_priority and policy process descriptor fields of any process, including itself.

10.3.3.1 The sched_getscheduler( ) and sched_setscheduler( ) system calls

The sched_getscheduler( ) system call queries the scheduling policy currently applied to the process identified by the pid parameter. If pid equals 0, the policy of the calling process will be retrieved. On success, the system call returns the policy for the process: SCHED_FIFO, SCHED_RR, or SCHED_OTHER. The corresponding sys_sched_getscheduler( ) service routine invokes find_task_by_pid( ), which locates the process descriptor corresponding to the given pid and returns the value of its policy field.

The sched_setscheduler( ) system call sets both the scheduling policy and the associated parameters for the process identified by the parameter pid. If pid is equal to 0, the scheduler parameters of the calling process will be set.

The corresponding sys_sched_setscheduler( ) function checks whether the scheduling policy specified by the policy parameter and the new static priority specified by the param->sched_priority parameter are valid. It also checks whether the process has CAP_SYS_NICE capability or whether its owner has superuser rights. If everything is OK, it executes the following statements:

    p->policy = policy;
    p->rt_priority = param->sched_priority;
    if (p->next_run)
        move_first_runqueue(p);
    current->need_resched = 1;

10.3.3.2 The sched_getparam( ) and sched_setparam( ) system calls

The sched_getparam( ) system call retrieves the scheduling parameters for the process identified by pid. If pid is 0, the parameters of the current process are retrieved.
The corresponding sys_sched_getparam( ) service routine, as one would expect, finds the process descriptor pointer associated with pid, stores its rt_priority field in a local variable of type sched_param, and invokes copy_to_user( ) to copy it into the process address space at the address specified by the param parameter.

The sched_setparam( ) system call is similar to sched_setscheduler( ): it differs from the latter by not letting the caller set the policy field's value. [8] The corresponding sys_sched_setparam( ) service routine is almost identical to sys_sched_setscheduler( ), but the policy of the affected process is never changed.

[8] This anomaly is caused by a specific requirement of the POSIX standard.

10.3.3.3 The sched_yield( ) system call

The sched_yield( ) system call allows a process to relinquish the CPU voluntarily without being suspended; the process remains in a TASK_RUNNING state, but the scheduler puts it at the end of the runqueue list. In this way, other processes having the same dynamic priority will have a chance to run. The call is used mainly by SCHED_FIFO processes.

The corresponding sys_sched_yield( ) service routine executes these statements:

    if (current->policy == SCHED_OTHER)
        current->policy |= SCHED_YIELD;
    current->need_resched = 1;
    move_last_runqueue(current);

Notice that the SCHED_YIELD flag is set in the policy field of the process descriptor only if the process is a conventional SCHED_OTHER process. As a result, the next invocation of schedule( ) will view this process as one that has exhausted its time quantum (see how schedule( ) handles the SCHED_YIELD flag).

10.3.3.4 The sched_get_priority_min( ) and sched_get_priority_max( ) system calls

The sched_get_priority_min( ) and sched_get_priority_max( ) system calls return, respectively, the minimum and the maximum real-time static priority value that can be used with the scheduling policy identified by the policy parameter.

The sys_sched_get_priority_min( ) service routine returns 1 if policy denotes a real-time policy (SCHED_FIFO or SCHED_RR), 0 otherwise.

The sys_sched_get_priority_max( ) service routine returns 99 (the highest priority) if policy denotes a real-time policy, 0 otherwise.

10.3.3.5 The sched_rr_get_interval( ) system call

The sched_rr_get_interval( ) system call should get the round robin time quantum for the named real-time process.

The corresponding sys_sched_rr_get_interval( ) service routine does not operate as expected, since it always returns a 150-millisecond value in the timespec structure pointed to by tp. This system call remains effectively unimplemented in Linux.

10.4 Anticipating Linux 2.4

Linux 2.4 introduces a subtle optimization concerning TLB flushing for kernel threads and zombie processes.
As a result, the active Page Global Directory is set by the schedule( ) function rather than by the switch_to macro.

The Linux 2.4 scheduling algorithm for SMP machines has been improved and simplified. Whenever a new process becomes runnable, the kernel checks whether the preferred CPU of the process, that is, the CPU on which it was last running, is idle; in this case, the kernel assigns the process to that CPU. Otherwise, the kernel assigns the process to another idle CPU, if any. If all CPUs are busy, the kernel checks whether the process has enough priority to preempt the process running on the preferred CPU. If not, the kernel tries to preempt some other CPU only if the new runnable process is real-time or if it has short average time slices compared to the hardware cache rewriting time. (Roughly, preemption occurs if the new runnable process is interactive and the preferred CPU will not reschedule shortly.)
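As a closing illustration for the system calls of Section 10.3.3, the following user-space sketch queries the static priority range of a policy and the policy of the calling process through the standard <sched.h> wrappers. It assumes a Linux system with a POSIX C library; priority_range( ) and current_policy( ) are hypothetical helper names, and the values actually returned depend on the running kernel.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Store the minimum and maximum static priority usable with `policy`.
 * Returns 0 on success, -1 if either query fails.
 */
int priority_range(int policy, int *lo, int *hi)
{
    assert(lo != NULL && hi != NULL);

    int min = sched_get_priority_min(policy);
    int max = sched_get_priority_max(policy);

    if (min == -1 || max == -1)
        return -1;
    *lo = min;
    *hi = max;
    return 0;
}

/* Policy of the calling process (pid 0 means "myself"), or -1 on error. */
int current_policy(void)
{
    return sched_getscheduler(0);
}
```

A program would typically call priority_range(SCHED_FIFO, &lo, &hi) before filling in the sched_priority field passed to sched_setscheduler( ), rather than hard-coding the limits described in the text.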


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>Chapter 11. <strong>Kernel</strong> SynchronizationYou could think of the kernel as a server that answers requ<strong>es</strong>ts; th<strong>es</strong>e requ<strong>es</strong>ts can come eitherfrom a proc<strong>es</strong>s running on a CPU or an external device issuing an interrupt requ<strong>es</strong>t. We makethis analogy to underscore that parts of the kernel are not run serially but in an interleavedway. Thus, they can give rise to race conditions, which must be controlled through propersynchronization techniqu<strong>es</strong>. A general introduction to th<strong>es</strong>e topics can be found in Section 1.6in Chapter 1.We start this chapter by reviewing when, and to what extent, kernel requ<strong>es</strong>ts are executed inan interleaved fashion. We then introduce four basic synchronization techniqu<strong>es</strong> implementedby the kernel and illustrate how they are applied by means of exampl<strong>es</strong>.<strong>The</strong> next two sections deal with the extension of the <strong>Linux</strong> kernel to multiproc<strong>es</strong>sorarchitectur<strong>es</strong>. 
<strong>The</strong> first d<strong>es</strong>crib<strong>es</strong> some hardware featur<strong>es</strong> of the Symmetric Multiproc<strong>es</strong>sor(SMP) architecture, while the second discuss<strong>es</strong> additional mutual exclusion techniqu<strong>es</strong>adopted by the SMP version of the <strong>Linux</strong> kernel.11.1 <strong>Kernel</strong> Control PathsAs we said, kernel functions are executed following a requ<strong>es</strong>t that may be issued in twopossible ways:• A proc<strong>es</strong>s executing in User Mode caus<strong>es</strong> an exception, for instance by executing anint 0x80 assembly language instruction.• An external device sends a signal to a Programmable Interrupt Controller by using anIRQ line, and the corr<strong>es</strong>ponding interrupt is enabled.<strong>The</strong> sequence of instructions executed in <strong>Kernel</strong> Mode to handle a kernel requ<strong>es</strong>t is denoted askernel control path : when a User Mode proc<strong>es</strong>s issu<strong>es</strong> a system call requ<strong>es</strong>t, for instance, thefirst instructions of the corr<strong>es</strong>ponding kernel control path are those included in the initial partof the system_call( ) function, while the last instructions are those included in theret_from_sys_call( ) function.In Section 4.3 in Chapter 4, a kernel control path was defined as a sequence of instructionsexecuted by the kernel to handle a system call, an exception, or an interrupt. 
Kernel control paths play a role similar to that of processes, except that they are much more rudimentary: first, no descriptor of any kind is attached to them; second, they are not scheduled through a single function, but rather by inserting sequences of instructions that stop or resume the paths into the kernel code.

In the simplest cases, the CPU executes a kernel control path sequentially from the first instruction to the last. When one of the following events occurs, however, the CPU interleaves kernel control paths:

• A context switch occurs. As we have seen in Chapter 10, a context switch can occur only when the schedule( ) function is invoked.
• An interrupt occurs while the CPU is running a kernel control path with interrupts enabled. In this case, the first kernel control path is left unfinished and the CPU starts processing another kernel control path to handle the interrupt.

It is important to interleave kernel control paths in order to implement multiprocessing. In addition, as already noticed in Section 4.3 in Chapter 4, interleaving improves the throughput of programmable interrupt controllers and device controllers.

While interleaving kernel control paths, special care must be applied to data structures that contain several related member variables, for instance, a buffer and an integer indicating its length. All statements affecting such a data structure must be put into a single critical section; otherwise, it is in danger of being corrupted.

11.2 Synchronization Techniques

Chapter 1 introduced the concepts of race condition and critical region for processes. The same definitions apply to kernel control paths. In this chapter, a race condition can occur when the outcome of some computation depends on how two or more interleaved kernel control paths are nested. A critical region is any section of code that should be completely executed by each kernel control path that begins it, before another kernel control path can enter it.

We now examine how kernel control paths can be interleaved while avoiding race conditions among shared data.
We'll distinguish four broad types of synchronization techniques:

• Nonpreemptability of processes in Kernel Mode
• Atomic operations
• Interrupt disabling
• Locking

11.2.1 Nonpreemptability of Processes in Kernel Mode

As already pointed out, the Linux kernel is not preemptive, that is, a running process cannot be preempted (replaced by a higher-priority process) while it remains in Kernel Mode. In particular, the following assertions always hold in Linux:

• No process running in Kernel Mode may be replaced by another process, except when the former voluntarily relinquishes control of the CPU. [1]
• Interrupt or exception handling can interrupt a process running in Kernel Mode; however, when the interrupt handler terminates, the kernel control path of the process is resumed.
• A kernel control path performing interrupt or exception handling can be interrupted only by another control path performing interrupt or exception handling.

[1] Of course, all context switches are performed in Kernel Mode. However, a context switch may occur only when the current process is going to return to User Mode.

Thanks to the above assertions, kernel control paths dealing with nonblocking system calls are atomic with respect to other control paths started by system calls. This simplifies the implementation of many kernel functions: any kernel data structures that are not updated by interrupt or exception handlers can be safely accessed. However, if a process in Kernel Mode voluntarily relinquishes the CPU, it must ensure that all data structures are left in a consistent state. Moreover, when it resumes its execution, it must recheck the value of all previously accessed data structures that could be changed. The change could be caused by a different kernel control path, possibly running the same code on behalf of a separate process.

11.2.2 Atomic Operations

The easiest way to prevent race conditions is by ensuring that an operation is atomic at the chip level: the operation must be executed in a single instruction. These very small atomic operations can be found at the base of other, more flexible mechanisms to create critical sections.

Thus, an atomic operation is something that can be performed by executing a single assembly language instruction in an "atomic" way, that is, without being interrupted in the middle.

Let's review Intel 80x86 instructions according to that classification:

• Assembly language instructions that make zero or one memory access are atomic.
• Read/modify/write assembly language instructions such as inc or dec that read data from memory, update it, and write the updated value back to memory are atomic if no other processor has taken the memory bus after the read and before the write.
Memory bus stealing, naturally, never happens in a uniprocessor system, because all memory accesses are made by the same processor.
• Read/modify/write assembly language instructions whose opcode is prefixed by the lock byte (0xf0) are atomic even on a multiprocessor system. When the control unit detects the prefix, it "locks" the memory bus until the instruction is finished. Therefore, other processors cannot access the memory location while the locked instruction is being executed.
• Assembly language instructions whose opcode is prefixed by a rep byte (0xf2, 0xf3), which forces the control unit to repeat the same instruction several times, are not atomic: the control unit checks for pending interrupts before executing a new iteration.

When you write C code, you cannot guarantee that the compiler will use a single, atomic instruction for an operation like a=a+1 or even for a++. Thus, the Linux kernel provides special functions (see Table 11-1) that it implements as single, atomic assembly language instructions; on multiprocessor systems each such instruction is prefixed by a lock byte.
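
As a rough user-space analogy (not the kernel's own implementation), the following sketch uses the C11 <stdatomic.h> interface to build a counter whose increment and decrement-and-test steps are indivisible. The demo_atomic_t type and the demo_ function names are hypothetical and chosen only for illustration. On 80x86, compilers typically translate such operations into lock-prefixed instructions, but that is an assumption about the toolchain rather than a guarantee.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space counter type; this is NOT the kernel's atomic_t. */
typedef struct {
    atomic_int value;
} demo_atomic_t;

/* Store i into the counter as one indivisible write. */
static void demo_atomic_set(demo_atomic_t *v, int i)
{
    atomic_store(&v->value, i);
}

/* Read the counter as one indivisible load. */
static int demo_atomic_read(demo_atomic_t *v)
{
    return atomic_load(&v->value);
}

/* Add 1 as a single read-modify-write step. */
static void demo_atomic_inc(demo_atomic_t *v)
{
    atomic_fetch_add(&v->value, 1);
}

/*
 * Subtract 1 and report whether the new value is zero.
 * atomic_fetch_sub( ) returns the value held BEFORE the subtraction,
 * so the new value is zero exactly when the old one was 1.
 */
static bool demo_atomic_dec_and_test(demo_atomic_t *v)
{
    return atomic_fetch_sub(&v->value, 1) == 1;
}
```

A caller would typically use the decrement-and-test form for reference counting: the single caller that observes a true result knows it has dropped the last reference, because the decrement and the test cannot be separated by another processor's update.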


Table 11-1. Atomic Operations in C

Function                                 Description
atomic_read(v)                           Return *v.
atomic_set(v,i)                          Set *v to i.
atomic_add(i,v)                          Add i to *v.
atomic_sub(i,v)                          Subtract i from *v.
atomic_inc(v)                            Add 1 to *v.
atomic_dec(v)                            Subtract 1 from *v.
atomic_dec_and_test(v)                   Subtract 1 from *v and return 1 if the result is zero, 0 otherwise.
atomic_inc_and_test_greater_zero(v)      Add 1 to *v and return 1 if the result is positive, 0 otherwise.
atomic_clear_mask(mask,addr)             Clear all bits of addr specified by mask.
atomic_set_mask(mask,addr)               Set all bits of addr specified by mask.

11.2.3 Interrupt Disabling

For any section of code too large to be defined as an atomic operation, more complicated means of providing critical sections are needed. To ensure that no window is left open for a race condition to slip in, even a window one instruction long, these critical sections always have an atomic operation at their base.

Interrupt disabling is one of the key mechanisms used to ensure that a sequence of kernel statements is treated as a critical section. It allows a kernel control path to continue executing even when hardware devices issue IRQ signals, thus providing an effective way to protect data structures that are also accessed by interrupt handlers.

However, interrupt disabling alone does not always prevent kernel control path interleaving. Indeed, a kernel control path could raise a "Page fault" exception, which in turn could suspend the current process (and thus the corresponding kernel control path). Or again, a kernel control path could directly invoke the schedule( ) function.
This happens during most disk I/O operations because they are potentially blocking, that is, they may force the process to sleep until the I/O operation completes. Therefore, the kernel must never execute a blocking operation when interrupts are disabled, since the system could freeze.

Interrupts can be disabled by means of the cli assembly language instruction, which is yielded by the __cli( ) and cli( ) macros. Interrupts can be enabled by means of the sti assembly language instruction, which is yielded by the __sti( ) and sti( ) macros. On a uniprocessor system cli( ) is equivalent to __cli( ) and sti( ) is equivalent to __sti( ); however, as we shall see later in this chapter, these macros are quite different on a multiprocessor system.

When the kernel enters a critical section, it clears the IF flag of the eflags register in order to disable interrupts. But at the end of the critical section, the kernel can't simply set the flag again. Interrupts can execute in nested fashion, so the kernel does not know what the IF flag was before the current control path executed. Each control path must therefore save the old setting of the flag and restore that setting at the end.

In order to save the eflags content, the kernel uses the __save_flags macro; on a uniprocessor system it is identical to the save_flags macro. In order to restore the eflags content, the kernel uses the __restore_flags and (on a uniprocessor system) restore_flags macros. Typically, these macros are used in the following way:

    __save_flags(old);
    __cli( );
    [...]
    __restore_flags(old);

The __save_flags macro copies the content of the eflags register into the old local variable; the IF flag is then cleared by __cli( ). At the end of the critical region, the __restore_flags macro restores the original content of eflags; therefore, interrupts are enabled only if they were enabled before this control path issued the __cli( ) macro.

Linux offers several additional synchronization macros that are important on a multiprocessor system (see Section 11.4.2 later in this chapter) but are somewhat redundant on a uniprocessor system (see Table 11-2). Notice that some functions do not perform any visible operation. They just act as "barriers" for the gcc compiler, since they prevent the compiler from optimizing the code by moving around assembly language instructions. The lck parameter is always ignored.

Table 11-2.
Interrupt Disabling/Enabling Macros on a Uniprocessor System

Macro                                    Description
spin_lock_init(lck)                      No operation
spin_lock(lck)                           No operation
spin_unlock(lck)                         No operation
spin_unlock_wait(lck)                    No operation
spin_trylock(lck)                        Return always 1
spin_lock_irq(lck)                       __cli( )
spin_unlock_irq(lck)                     __sti( )
spin_lock_irqsave(lck, flags)            __save_flags(flags); __cli( )
spin_unlock_irqrestore(lck, flags)       __restore_flags(flags)
read_lock_irq(lck)                       __cli( )
read_unlock_irq(lck)                     __sti( )
read_lock_irqsave(lck, flags)            __save_flags(flags); __cli( )
read_unlock_irqrestore(lck, flags)       __restore_flags(flags)
write_lock_irq(lck)                      __cli( )
write_unlock_irq(lck)                    __sti( )
write_lock_irqsave(lck, flags)           __save_flags(flags); __cli( )
write_unlock_irqrestore(lck, flags)      __restore_flags(flags)

Let us recall a few examples of how these macros are used in functions introduced in previous chapters:

• The add_wait_queue( ) and remove_wait_queue( ) functions protect the wait queue list with the write_lock_irqsave( ) and write_unlock_irqrestore( ) functions.
• The setup_x86_irq( ) function adds a new interrupt handler for a specific IRQ; the spin_lock_irqsave( ) and spin_unlock_irqrestore( ) functions are used to protect the corresponding list of handlers.
• The run_timer_list( ) function protects the dynamic timer data structures with the spin_lock_irq( ) and spin_unlock_irq( ) functions.
• The handle_signal( ) function protects the blocked field of current with the spin_lock_irq( ) and spin_unlock_irq( ) functions.

Because of its simplicity, interrupt disabling is widely used by kernel functions for implementing critical regions. Clearly, the critical regions obtained by interrupt disabling must be short, because any kind of communication between the I/O device controllers and the CPU is blocked when the kernel enters one. Longer critical regions should be implemented by means of locking.

11.2.4 Locking Through Kernel Semaphores

A widely used synchronization technique is locking: when a kernel control path must access a shared data structure or enter a critical region, it must acquire a "lock" for it. A resource protected by a locking mechanism is quite similar to a resource confined in a room whose door is locked when someone is inside. If a kernel control path wishes to access the resource, it tries to "open the door" by acquiring the lock. It will succeed only if the resource is free. Then, as long as it wants to use the resource, the door remains locked.
When the kernel control path releases the lock, the door is unlocked and another kernel control path may enter the room.

Linux offers two kinds of locking: kernel semaphores, which are widely used both on uniprocessor systems and multiprocessor ones, and spin locks, which are used only on multiprocessor systems. We'll discuss just kernel semaphores here; the other solution will be discussed in Section 11.4.2 later in this chapter. When a kernel control path tries to acquire a busy resource protected by a kernel semaphore, the corresponding process is suspended. It will become runnable again when the resource is released.

Kernel semaphores are objects of type struct semaphore and have these fields:

count
    Stores an integer value. If it is greater than 0, the resource is free, that is, it is currently available. Conversely, if count is less than or equal to 0, the semaphore is busy, that is, the protected resource is currently unavailable. In the latter case, the absolute value of count denotes the number of kernel control paths waiting for the resource. Zero means that a kernel control path is using the resource but no other kernel control path is waiting for it.

wait
    Stores the address of a wait queue list that includes all sleeping processes that are currently waiting for the resource. Of course, if count is greater than or equal to 0, the wait queue is empty.

waking
    Ensures that, when the resource is freed and the sleeping processes are woken up, only one of them succeeds in acquiring the resource. We'll see this field in operation soon.
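
The meaning of the count field can be made concrete with a small single-threaded model. It is only a sketch: it assumes that an acquire attempt decrements count and that a release increments it, and it ignores the wait queue, the waking field, and all locking. The toy_sem type and the toy_ functions are hypothetical names, not kernel identifiers.

```c
#include <stdbool.h>

/* Hypothetical model of the count field alone; nothing here sleeps or locks. */
struct toy_sem {
    int count;
};

/*
 * Model an acquire attempt: decrement count.
 * Returns true if the caller obtained the resource (count is still >= 0),
 * false if a real caller would have to be put to sleep.
 */
static bool toy_acquire(struct toy_sem *s)
{
    s->count--;
    return s->count >= 0;
}

/*
 * Model a release: increment count.
 * Returns true if count is still <= 0, i.e. at least one waiter exists
 * and would have to be woken up.
 */
static bool toy_release(struct toy_sem *s)
{
    s->count++;
    return s->count <= 0;
}

/* Number of waiting paths implied by count: |count| when negative, else 0. */
static int toy_waiters(const struct toy_sem *s)
{
    return s->count < 0 ? -s->count : 0;
}
```

Starting from a count of 1, the first toy_acquire( ) succeeds and leaves count at 0 (busy, nobody waiting); two further attempts fail and leave count at -2, which toy_waiters( ) reports as two waiting paths.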


The count field is decremented when a process tries to acquire the lock and incremented when a process releases it. The MUTEX and MUTEX_LOCKED macros may be used to initialize a semaphore for exclusive access: they set the count field, respectively, to 1 (free resource with exclusive access) and 0 (busy resource with exclusive access currently granted to the process that initializes the semaphore). Note that a semaphore could also be initialized with an arbitrary positive value n for count: in this case, at most n processes will be allowed to concurrently access the resource.

When a process wishes to acquire a kernel semaphore lock, it invokes the down( ) function. The implementation of down( ) is quite involved, but it is essentially equivalent to the following:

    void down(struct semaphore * sem)
    {
        /* BEGIN CRITICAL SECTION */
        --sem->count;
        if (sem->count < 0) {
        /* END CRITICAL SECTION */
            unsigned long flags;
            struct wait_queue wait = { current, NULL };
            current->state = TASK_UNINTERRUPTIBLE;
            add_wait_queue(&sem->wait, &wait);
            for (;;) {
                spin_lock_irqsave(&semaphore_wake_lock, flags);
                if (sem->waking > 0) {
                    sem->waking--;
                    break;
                }
                spin_unlock_irqrestore(&semaphore_wake_lock, flags);
                schedule( );
                current->state = TASK_UNINTERRUPTIBLE;
            }
            spin_unlock_irqrestore(&semaphore_wake_lock, flags);
            current->state = TASK_RUNNING;
            remove_wait_queue(&sem->wait, &wait);
        }
    }

The function decrements the count field of the *sem semaphore, then checks whether its value is negative.
The decrement and the test must be atomically executed, otherwise another kernel control path could concurrently access the field value, with disastrous results (see Section 1.6.5 in Chapter 1). Therefore, these two operations are implemented by means of the following assembly language instructions:

    movl sem, %ecx
    lock            /* only for multiprocessor systems */
    decl (%ecx)
    js 2f

On a multiprocessor system, the decl instruction is prefixed by a lock prefix to ensure the atomicity of the decrement operation (see Section 11.2.2).

If count is greater than or equal to 0, the current process acquires the resource and the execution continues normally. Otherwise, count is negative and the current process must be suspended. It is inserted into the wait queue list of the semaphore and put to sleep by directly invoking the schedule( ) function.

The process is woken up when the resource is freed. Nonetheless, it cannot assume that the resource is now available, since several processes in the semaphore wait queue could be waiting for it. In order to select a winning process, the waking field is used: when the releasing process is going to wake up the processes in the wait queue, it increments waking; each awakened process then enters a critical region of the down( ) function and tests whether waking is positive. If an awakened process finds the field to be positive, it decrements waking and acquires the resource; otherwise it goes back to sleep. The critical region is protected by the semaphore_wake_lock global spin lock and by interrupt disabling.

Notice that an interrupt handler or a bottom half must not invoke down( ), since this function suspends the process when the semaphore is busy. [2] For that reason, Linux provides the down_trylock( ) function, which may be safely used by one of the previously mentioned asynchronous functions. It is identical to down( ) except when the resource is busy: in this case, the function returns immediately instead of putting the process to sleep.

[2] Exception handlers can block on a semaphore.
Linux takes special care to avoid the particular kind of race condition in which two nested kernel control paths compete for the same semaphore; naturally, one of them waits forever because the other cannot run and free the semaphore.

A slightly different function called down_interruptible( ) is also defined. It is widely used by device drivers since it allows processes that receive a signal while being blocked on a semaphore to give up the "down" operation. If the sleeping process is awakened by a signal before getting the needed resource, the function increments the count field of the semaphore and returns the value -EINTR. On the other hand, if down_interruptible( ) runs to normal completion and gets the resource, it returns 0. The device driver may thus abort the I/O operation when the return value is -EINTR.

When a process releases a kernel semaphore lock, it invokes the up( ) function, which is essentially equivalent to the following:

    void up(struct semaphore * sem)
    {
        /* BEGIN CRITICAL SECTION */
        ++sem->count;
        if (sem->count <= 0) {
        /* END CRITICAL SECTION */
            unsigned long flags;
            spin_lock_irqsave(&semaphore_wake_lock, flags);
            if (atomic_read(&sem->count) <= 0)
                sem->waking++;
            spin_unlock_irqrestore(&semaphore_wake_lock, flags);
            wake_up(&sem->wait);
        }
    }

The function increments the count field of the *sem semaphore, then checks whether its value is negative or null. The increment and the test must be atomically executed, so these two operations are implemented by means of the following assembly language instructions:

    movl sem, %ecx
    lock
    incl (%ecx)
    jle 2f

If the new value of count is positive, no process is waiting for the resource, and thus the function terminates. Otherwise, it must wake up the processes in the semaphore wait queue. In order to do this, it increments the waking field, which is protected by the semaphore_wake_lock spin lock and by interrupt disabling, then invokes wake_up( ) on the semaphore wait queue.

The increment of the waking field is included in a critical region because there can be several processes that concurrently access the same protected resource; therefore, a process could start executing up( ) while the waiting processes have already been woken up and one of them is already accessing the waking field. This also explains why up( ) checks whether count is nonpositive right before incrementing waking: another process could have executed the up( ) function after the first count check and before entering the critical region.

We now examine how semaphores are used in Linux. Since the kernel is nonpreemptive, only a few semaphores are needed. Indeed, on a uniprocessor system race conditions usually occur either when a process is blocked during a disk I/O operation or when an interrupt handler accesses a global kernel data structure.
Other kinds of race conditions may occur in multiprocessor systems, but in such cases Linux tends to make use of spin locks (see Section 11.4.2 later in this chapter).

The following sections discuss a few typical examples of semaphore use.

11.2.4.1 Slab cache list semaphore

The list of slab cache descriptors (see Section 6.2.2 in Chapter 6) is protected by the cache_chain_sem semaphore, which grants an exclusive right to access and modify the list. A race condition is possible when kmem_cache_create( ) adds a new element in the list, while kmem_cache_shrink( ) and kmem_cache_reap( ) sequentially scan the list. However, these functions are never invoked while handling an interrupt, and they can never block while accessing the list. Since the kernel is nonpreemptive, this semaphore plays an active role only in multiprocessor systems.

11.2.4.2 Memory descriptor semaphore

Each memory descriptor of type mm_struct includes its own semaphore in the mmap_sem field (see Section 7.2 in Chapter 7). The semaphore protects the descriptor against race conditions that could arise because a memory descriptor can be shared among several lightweight processes.

For instance, let us suppose that the kernel must create or extend a memory region for some process; in order to do this, it invokes the do_mmap( ) function, which allocates a new vm_area_struct data structure. In doing so, the current process could be suspended if no free memory is available, and another process sharing the same memory descriptor could run. Without the semaphore, any operation of the second process that requires access to the memory descriptor (for instance, a page fault due to a Copy On Write) could lead to severe data corruption.

11.2.4.3 Inode semaphore

This example refers to filesystem handling, which this book has not examined yet. Therefore, we shall limit ourselves to giving the general picture without going into too many details. As we shall see in Chapter 12, Linux stores the information on a disk file in a memory object called an inode. The corresponding data structure includes its own semaphore in the i_sem field.

A huge number of race conditions can occur during filesystem handling. Indeed, each file on disk is a resource held in common for all users, since all processes may (potentially) access the file content, change its name or location, destroy or duplicate it, and so on.

For example, let us suppose that a process is listing the files contained in some directory. Each disk operation is potentially blocking, and therefore even in uniprocessor systems other processes could access the same directory and modify its content while the first process is in the middle of the listing operation. Or again, two different processes could modify the same directory at the same time.
All these race conditions are avoided by protecting the directory file with the inode semaphore.

11.2.5 Avoiding Deadlocks on Semaphores

Whenever a program uses two or more semaphores, the potential for deadlock is present because two different paths could end up waiting for each other to release a semaphore. A typical deadlock condition occurs when a kernel control path gets the lock for semaphore A and is waiting for semaphore B, while another kernel control path holds the lock for semaphore B and is waiting for semaphore A. Linux has few problems with deadlocks on semaphore requests, since each kernel control path usually needs to acquire just one semaphore at a time.

However, in a couple of cases the kernel must get two semaphore locks. This occurs in the service routines of the rmdir( ) and the rename( ) system calls (notice that in both cases two inodes are involved in the operation). In order to avoid such deadlocks, semaphore requests are performed in the order given by addresses: the semaphore request whose semaphore data structure is located at the lowest address is issued first.

11.3 The SMP Architecture

Symmetrical multiprocessing (SMP) denotes a multiprocessor architecture in which no CPU is selected as the Master CPU, but rather all of them cooperate on an equal basis, hence the name "symmetrical." As usual, we shall focus on Intel SMP architectures.

How many independent CPUs are most profitably included in a multiprocessor system is a hot issue. The troubles are mainly due to the impressive progress reached in the area of cache systems. Many of the benefits introduced by hardware caches are lost by wasting bus cycles in synchronizing the local hardware caches located on the CPU chips. The higher the number of CPUs, the worse the problem becomes.


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>From the kernel d<strong>es</strong>ign point of view, however, we can completely ignore this issue: an SMPkernel remains the same no matter how many CPUs are involved. <strong>The</strong> big jump in complexityoccurs when moving from one CPU (a uniproc<strong>es</strong>sor system) to two.Before proceeding in d<strong>es</strong>cribing the chang<strong>es</strong> that had to be made to <strong>Linux</strong> in order to make ita true SMP kernel, we shall briefly review the hardware featur<strong>es</strong> of the Pentium dualproc<strong>es</strong>singsystems. <strong>The</strong>se featur<strong>es</strong> lie in the following areas of computer architecture:• Shared memory• Hardware cache synchronization• Atomic operations• Distributed interrupt handling• Interrupt signals for CPU synchronizationSome hardware issu<strong>es</strong> are completely r<strong>es</strong>olved within the hardware, so we don't have to saymuch about them.11.3.1 Common MemoryAll the CPUs share the same memory; that is, they are connected to a common bus. Thismeans that RAM chips may be acc<strong>es</strong>sed concurrently by independent CPUs. Since read orwrite operations on a RAM chip must be performed serially, a hardware circuit called amemory arbiter is inserted between the bus and every RAM chip. Its role is to grant acc<strong>es</strong>s toa CPU if the chip is free and to delay it if the chip is busy. Even uniproc<strong>es</strong>sor systems makeuse of memory arbiters, since they include a specialized proc<strong>es</strong>sor called DMA that operat<strong>es</strong>concurrently with the CPU (see Section 13.1.4, in Chapter 13).In the case of multiproc<strong>es</strong>sor systems, the structure of the arbiter is more complex since it hasmore input ports. 
The dual Pentium, for instance, maintains a two-port arbiter at each chip entrance and requires that the two CPUs exchange synchronization messages before attempting to use the bus. From the programming point of view, the arbiter is hidden since it is managed by hardware circuits.

11.3.2 Hardware Support to Cache Synchronization

Section 2.4.6 in Chapter 2 explained that the contents of the hardware cache and the RAM maintain their consistency at the hardware level. The same approach holds in the case of a dual processor. As shown in Figure 11-1, each CPU has its own local hardware cache. But now updating becomes more time-consuming: whenever a CPU modifies its hardware cache, it must check whether the same data is contained in the other hardware cache and, if so, notify the other CPU to update it with the proper value. This activity is often called cache snooping. Luckily, all this is done at the hardware level and is of no concern to the kernel.


Figure 11-1. The caches in a dual processor

11.3.3 SMP Atomic Operations

Atomic operations for uniprocessor systems have already been introduced in Section 11.2.2. Since standard read-modify-write instructions actually access the memory bus twice, they are not atomic on a multiprocessor system.

Let us give a simple example of what might happen if an SMP kernel used standard instructions. Consider the semaphore implementation described in Section 11.2.4 earlier in this chapter and assume that the down( ) function decrements and tests the count field of the semaphore with a simple decl assembly language instruction. What happens if two processes running on two different CPUs simultaneously execute the decl instruction on the same semaphore? Well, decl is a read-modify-write instruction that accesses the same memory location twice: once to read the old value and again to write the new value.

At first, both CPUs are trying to read the same memory location, but the memory arbiter steps in to grant access to one of them and delay the other. However, when the first read operation is complete the delayed CPU reads exactly the same (old) value from the memory location. Both CPUs then try to write the same (new) value on the memory location; again, the bus memory access is serialized by the memory arbiter, but eventually both write operations will succeed and the memory location will contain the old value decremented by 1. But of course, the global result is completely incorrect.
For instance, if count was previously set to 1, both kernel control paths will simultaneously gain mutually exclusive access to the protected resource.

Since the early days of the Intel 80286, lock instruction prefixes have been introduced to solve that kind of problem. From the programmer's point of view, lock is just a special byte that is prefixed to an assembly language instruction. When the control unit detects a lock byte, it locks the memory bus so that no other processor can access the memory location specified by the destination operand of the following assembly language instruction. The bus lock is released only when the instruction has been executed. Therefore, read-modify-write instructions prefixed by lock are atomic even in a multiprocessor environment.

The Pentium allows a lock prefix on 18 different instructions. Moreover, some instructions, like xchg, do not require the lock prefix because the bus lock is implicitly enforced by the CPU's control unit.
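The race described in this section can be replayed deterministically in user space. The following C sketch is only an illustration, not kernel code: racy_down_pair is a hypothetical helper that hand-interleaves the two bus accesses of an unlocked decl on two simulated CPUs, and locked_down uses the C11 atomic_fetch_sub call as a stand-in for a lock-prefixed decrement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical replay of two CPUs executing an unlocked "decl count".
 * Each decl is split into its two bus accesses (read, then write), and
 * the arbiter serializes them as: read A, read B, write A, write B.
 * Returns how many CPUs believe they acquired the semaphore. */
int racy_down_pair(int *count)
{
    assert(count);
    int a = *count;      /* CPU A reads the old value              */
    int b = *count;      /* CPU B reads the same old value         */
    *count = a - 1;      /* CPU A writes old - 1                   */
    *count = b - 1;      /* CPU B writes old - 1 once more         */
    /* In this model, down() succeeds when the new value is >= 0. */
    return (a - 1 >= 0) + (b - 1 >= 0);
}

/* Stand-in for "lock; decl count": the read and the write form one
 * indivisible step, so every caller observes a distinct old value. */
bool locked_down(atomic_int *count)
{
    int old = atomic_fetch_sub(count, 1);
    return old - 1 >= 0;
}
```

With count initially 1, the racy replay lets both simulated CPUs in and leaves count at 0, whereas two calls to locked_down admit only the first caller and leave count at -1.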


11.3.4 Distributed Interrupt Handling

Being able to deliver interrupts to any CPU in the system is crucial for fully exploiting the parallelism of the SMP architecture. For that reason, Intel has introduced a new component designated as the I/O APIC (I/O Advanced Programmable Interrupt Controller), which replaces the old 8259A Programmable Interrupt Controller.

Figure 11-2 illustrates in a schematic way the structure of a multi-APIC system. Each CPU chip has its own integrated Local APIC. An Interrupt Controller Communication (ICC) bus connects a frontend I/O APIC to the Local APICs. The IRQ lines coming from the devices are connected to the I/O APIC, which therefore acts as a router with respect to the Local APICs.

Figure 11-2. APIC system

Each Local APIC has 32-bit registers, an internal clock, a timer device, 240 different interrupt vectors, and two additional IRQ lines reserved for local interrupts, which are typically used to reset the system.

The I/O APIC consists of a set of IRQ lines, a 24-entry Interrupt Redirection Table, programmable registers, and a message unit for sending and receiving APIC messages over the ICC bus. Unlike IRQ pins of the 8259A, interrupt priority is not related to pin number: each entry in the Redirection Table can be individually programmed to indicate the interrupt vector and priority, the destination processor, and how the processor is selected.
The information in the Redirection Table is used to translate any external IRQ signal into a message to one or more Local APIC units via the ICC bus.

Interrupt requests can be distributed among the available CPUs in two ways:

Fixed mode
    The IRQ signal is delivered to the Local APICs listed in the corresponding Redirection Table entry.

Lowest-priority mode
    The IRQ signal is delivered to the Local APIC of the processor which is executing the process with the lowest priority. Any Local APIC has a programmable task priority register, which contains the priority of the currently running process. It must be modified by the kernel at each task switch.

Another important feature of the APIC allows CPUs to generate interprocessor interrupts. When a CPU wishes to send an interrupt to another CPU, it stores the interrupt vector and the identifier of the target's Local APIC in the Interrupt Command Register of its own Local APIC. A message is then sent via the ICC bus to the target's Local APIC, which therefore issues a corresponding interrupt to its own CPU.

We'll discuss in Section 11.4.7 later in this chapter how the SMP version of Linux makes use of these interprocessor interrupts.

11.4 The Linux/SMP Kernel

Linux 2.2 support for SMP is compliant with Version 1.4 of the Intel MultiProcessor Specification, which establishes a multiprocessor platform interface standard while maintaining full PC/AT binary compatibility.

As we have seen in Section 11.2.1 earlier in this chapter, race conditions are relatively limited in Linux on a uniprocessor system, so interrupt disabling and kernel semaphores can be used to protect data structures that are asynchronously accessed by interrupt or exception handlers. In a multiprocessor system, however, things are much more complicated: several processes may be running in Kernel Mode, and therefore data structure corruption can occur even if no running process is preempted.
The usual way to synchronize access to SMP kernel data structures is by means of semaphores and spin locks (see Section 11.4.2).

Before discussing in detail how Linux 2.2 serializes the accesses to kernel data structures in multiprocessor systems, let us make a brief digression to how this goal was achieved when Linux first introduced SMP support. In order to facilitate the transition from a uniprocessor kernel to a multiprocessor one, the old 2.0 version of Linux/SMP adopted this drastic rule:

    At any given instant, at most one processor is allowed to access the kernel data structures and to handle the interrupts.

This rule dictates that each processor wishing to access the kernel data structures must get a global lock. As long as it holds the lock, it has exclusive access to all kernel data structures. Of course, since the processor will also handle any incoming interrupts, the data structures that are asynchronously accessed by interrupt and exception handlers must still be protected with interrupt disabling and kernel semaphores.

Although very simple, this approach has a serious drawback: processes spend a significant fraction of their computing time in Kernel Mode, therefore this rule may force I/O-bound processes to be sequentially executed. The situation was far from satisfactory, hence the rule was not strictly enforced in the next stable version of Linux/SMP (2.2).
Instead, many locks were added, each of which grants exclusive access to a single kernel data structure or a single critical region. Therefore, several processes are allowed to concurrently run in Kernel Mode as long as each of them accesses different data structures protected by locks. However, a global kernel lock is still present (see Section 11.4.6 later in this chapter), since not all kernel data structures have been protected with specific locks.


Figure 11-3 illustrates the more flexible Linux 2.2 system. Five kernel control paths (P0, P1, P2, P3, and P4) are trying to access two critical regions (C1 and C2). Kernel control path P0 is inside C1, while P2 and P4 are waiting to enter it. At the same time, P1 is inside C2, while P3 is waiting to enter it. Notice that P0 and P1 could run concurrently. The lock for critical region C3 is open since no kernel control path needs to enter it.

Figure 11-3. Protecting critical regions with several locks

11.4.1 Main SMP Data Structures

In order to handle several CPUs, the kernel must be able to represent the activity that takes place on each of them. In this section we'll consider some significant kernel data structures that have been added to allow multiprocessing.

The most important information is what process is currently running on each CPU, but this information actually does not require a new CPU-specific data structure. Instead, each CPU retrieves the current process through the same current macro defined for uniprocessor systems: since it extracts the process descriptor address from the esp stack pointer register, it yields a value that is CPU-dependent.

A first group of new CPU-specific variables refers to the SMP architecture. Linux/SMP has a hard-wired limit on the number of CPUs, which is defined by the NR_CPUS macro (usually 32).

During the initialization phase, Linux running on the booting CPU probes whether other CPUs exist (some CPU slots of an SMP board may be empty).
As a result, both a counter and a bitmap are initialized: max_cpus stores the number of existing CPUs while cpu_present_map specifies which slots contain a CPU.

An existing CPU is not necessarily activated, that is, initialized and recognized by the kernel. Another pair of variables, a counter called smp_num_cpus and a bitmap called cpu_online_map, keeps track of the activated CPUs. If some CPU cannot be properly initialized, the kernel clears the corresponding bit in cpu_online_map.

Each active CPU is identified in Linux by a sequential logical number called CPU ID, which does not necessarily coincide with the CPU slot number. The cpu_number_map and __cpu_logical_map arrays allow conversion between CPU IDs and CPU slot numbers.
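The relationship between slot numbers, bitmaps, and CPU IDs can be sketched in user-space C. This is a model under stated assumptions, not the kernel's code: present_map, online_map, num_cpus, number_map, and logical_map are stand-ins for the kernel variables named above, activate_cpu is a hypothetical helper, and the direction given to each conversion array (slot to ID for number_map, ID to slot for logical_map) is an assumption of the sketch.

```c
#include <assert.h>

#define NR_CPUS 32                 /* hard-wired limit quoted in the text */

/* Illustrative stand-ins for the kernel variables. */
unsigned long present_map;         /* bit n set: slot n contains a CPU    */
unsigned long online_map;          /* bit n set: CPU in slot n activated  */
int num_cpus;                      /* number of activated CPUs            */
int number_map[NR_CPUS];           /* assumed: slot number -> CPU ID      */
int logical_map[NR_CPUS];          /* assumed: CPU ID -> slot number      */

/* Hypothetical helper: activate the CPU found in a slot and give it the
 * next sequential CPU ID. Returns the ID, or -1 if the slot is empty. */
int activate_cpu(int slot)
{
    if (slot < 0 || slot >= NR_CPUS)
        return -1;
    if (!(present_map & (1UL << slot)))
        return -1;                      /* empty slot: nothing to activate */
    if (online_map & (1UL << slot))
        return number_map[slot];        /* already activated */
    assert(num_cpus < NR_CPUS);
    int id = num_cpus++;
    online_map |= 1UL << slot;
    number_map[slot] = id;
    logical_map[id] = slot;
    return id;
}
```

If slots 0 and 2 are populated and slot 1 is empty, activating them in order yields CPU IDs 0 and 1, so CPU ID 1 maps back to slot 2: the logical numbering stays dense even when the slots are not.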


The process descriptor includes the following fields representing the relationships between the process and a processor:

has_cpu
    Flag denoting whether the process is currently running (value 1) or not running (value 0)

processor
    Logical number of the CPU that is running the process, or NO_PROC_ID if the process is not running

The smp_processor_id( ) macro returns the value of current->processor, that is, the logical number of the CPU that executes the process.

When a new process is created by fork( ), the has_cpu and processor fields of its descriptor are initialized respectively to 0 and to the value NO_PROC_ID. When the schedule( ) function selects a new process to run, it sets its has_cpu field to 1 and its processor field to the logical number of the CPU that is doing the task switch. The corresponding fields of the process being replaced are set to 0 and to NO_PROC_ID, respectively.

During system initialization smp_num_cpus different swapper processes are created. Each of them has a PID equal to 0 and is bound to a specific CPU. As usual, a swapper process is executed only when the corresponding CPU is idle.

11.4.2 Spin Locks

Spin locks are a locking mechanism designed to work in a multiprocessing environment.
They are similar to the kernel semaphores described earlier, except that when a process finds the lock closed by another process, it "spins" around repeatedly, executing a tight instruction loop.

Of course, spin locks would be useless in a uniprocessor environment, since the waiting process would keep running, and therefore the process that is holding the lock would not have any chance to release it. In a multiprocessing environment, however, spin locks are much more convenient, since their overhead is very small. In other words, a context switch takes a significant amount of time, so it is more efficient for each process to keep its own CPU and simply spin while waiting for a resource.

Each spin lock is represented by a spinlock_t structure consisting of a single lock field; the values 0 and 1 correspond, respectively, to the "unlocked" and the "locked" state. The SPIN_LOCK_UNLOCKED macro initializes a spin lock to 0.

The functions that operate on spin locks are based on atomic read/modify/write operations; this ensures that the spin lock will be properly updated by a process running on a CPU even if other processes running on different CPUs attempt to modify the spin lock at the same time. [3]

[3] Spin locks, ironically enough, are global and therefore must themselves be protected against concurrent access.
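The structure and the atomic update just described can be modeled in portable C11. The sketch below is a user-space illustration, not the kernel's implementation: the spinlock_model_t type and the model_spin_* functions are hypothetical names, and the C11 atomic_exchange call stands in for the atomic read/modify/write instruction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* User-space model of spinlock_t: 0 means unlocked, 1 means locked. */
typedef struct { atomic_int lock; } spinlock_model_t;

#define SPIN_LOCK_MODEL_UNLOCKED { 0 }

/* One atomic read-modify-write: store 1 and report whether the old
 * value was 0, that is, whether this caller closed the lock. */
bool model_spin_trylock(spinlock_model_t *slp)
{
    return atomic_exchange(&slp->lock, 1) == 0;
}

/* Spin until the lock is obtained. While the lock is busy it is only
 * read; the atomic exchange is retried once the lock looks free. */
void model_spin_lock(spinlock_model_t *slp)
{
    while (!model_spin_trylock(slp)) {
        while (atomic_load(&slp->lock) != 0)
            ;                           /* tight loop */
    }
}

/* Release the lock; the caller is expected to hold it. */
void model_spin_unlock(spinlock_model_t *slp)
{
    assert(atomic_load(&slp->lock) == 1);
    atomic_store(&slp->lock, 0);
}
```

Because the exchange both writes 1 and returns the previous value in a single step, two CPUs racing for a free lock cannot both see 0; exactly one of them wins and the other falls into the inner read-only loop.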


The spin_lock macro is used to acquire a spin lock. It takes the address slp of the spin lock as its parameter and yields essentially the following code:

    1: lock; btsl $0, slp
       jnc 3f
    2: testb $1,slp
       jne 2b
       jmp 1b
    3:

The btsl atomic instruction copies into the carry flag the value of bit 0 in *slp, then sets the bit. A test is then performed on the carry flag: if it is null, it means that the spin lock was unlocked and hence normal execution continues at label 3 (the f suffix denotes the fact that the label is a "forward" one: it appears in a later line of the program). Otherwise, the tight loop at label 2 (the b suffix denotes a "backward" label) is executed until the spin lock assumes the value 0. Then execution restarts from label 1, since it would be unsafe to proceed without checking whether another processor has grabbed the lock. [4]

[4] The actual implementation of spin_lock is slightly more complicated. The code at label 2, which is executed only if the spin lock is busy, is included in an auxiliary section so that in the most frequent case (free spin lock) the hardware cache is not filled with code that won't be executed.
In our discussion we omit these optimization details.

The spin_unlock macro releases a previously acquired spin lock; it essentially yields the following code:

    lock; btrl $0, slp

The btrl atomic assembly language instruction clears bit 0 of the spin lock *slp.

Several other macros have been introduced to handle spin locks; their definitions on a multiprocessor system are described in Table 11-3 (see Table 11-2 for their definitions on a uniprocessor system).

Table 11-3. Spin Lock Macros on a Multiprocessor System

Macro                               Description
spin_lock_init(slp)                 Set slp->lock to 0
spin_trylock(slp)                   Set slp->lock to 1, return 1 if got the lock, 0 otherwise
spin_unlock_wait(slp)               Cycle until slp->lock becomes 0
spin_lock_irq(slp)                  __cli( ); spin_lock(slp)
spin_unlock_irq(slp)                spin_unlock(slp); __sti( )
spin_lock_irqsave(slp,flags)        __save_flags(flags); __cli( ); spin_lock(slp)
spin_unlock_irqrestore(slp,flags)   spin_unlock(slp); __restore_flags(flags)

11.4.3 Read/Write Spin Locks

Read/write spin locks have been introduced to increase the amount of concurrency inside the kernel. They allow several kernel control paths to simultaneously read the same data structure, as long as no kernel control path modifies it. If a kernel control path wishes to write to the structure, it must acquire the write version of the read/write lock, which grants exclusive access to the resource. Of course, allowing concurrent reads on data structures improves system performance.

Figure 11-4 illustrates two critical regions, C1 and C2, protected by read/write locks. Kernel control paths R0 and R1 are reading the data structures in C1 at the same time, while W0 is waiting to acquire the lock for writing. Kernel control path W1 is writing the data structures in C2, while both R2 and W2 are waiting to acquire the lock for reading and writing, respectively.

Figure 11-4. Read/write spin locks

Each read/write spin lock is a rwlock_t structure; its lock field is a 32-bit counter that represents the number of kernel control paths currently reading the protected data structure. The highest-order bit of the lock field is the write lock: it is set when a kernel control path is modifying the data structure. [5] The RW_LOCK_UNLOCKED macro initializes the lock field of a read/write spin lock to 0. The read_lock macro, applied to the address rwlp of a read/write spin lock, essentially yields the following code:

    1: lock; incl rwlp
       jns 3f
       lock; decl rwlp
    2: cmpl $0, rwlp
       js 2b
       jmp 1b
    3:

[5] It would also be set if there are more than 2,147,483,647 readers: of course, such a huge limit is never reached.

After increasing by 1 the value of rwlp->lock, the function checks whether the field has a negative value, that is, whether it is already locked for writing. If not, execution continues at label 3.
Otherwise, the macro restores the previous value and spins around until the highest-order bit becomes 0; then it starts back from the beginning.

The read_unlock function, applied to the address rwlp of a read/write spin lock, yields the following assembly language instruction:

    lock; decl rwlp

The write_lock function applied to the address rwlp of a read/write spin lock yields the following instructions:


    1: lock; btsl $31, rwlp
       jc 2f
       testl $0x7fffffff, rwlp
       je 3f
       lock; btrl $31, rwlp
    2: cmp $0, rwlp
       jne 2b
       jmp 1b
    3:

The highest-order bit of rwlp->lock is set. If its old value was 1, the write lock is already busy, and therefore the execution continues at label 2. Here the macro executes a tight loop waiting for the lock field to become 0 (meaning that the write lock was released). If the old value of the highest-order bit was 0 (meaning there is no write lock), the macro checks whether there are readers. If so, the write lock is released and the macro waits until lock becomes 0; otherwise, the CPU has exclusive access to the resource, so execution continues at label 3.

Finally, the write_unlock macro, applied to the address rwlp of a read/write spin lock, yields the following instruction:

    lock; btrl $31, rwlp

Table 11-4 lists the interrupt-safe versions of the macros described in this section.

Table 11-4. Read/Write Spin Lock Macros on a Multiprocessor System

Function                              Description
read_lock_irq(rwlp)                   __cli( ); read_lock(rwlp)
read_unlock_irq(rwlp)                 read_unlock(rwlp); __sti( )
write_lock_irq(rwlp)                  __cli( ); write_lock(rwlp)
write_unlock_irq(rwlp)                write_unlock(rwlp); __sti( )
read_lock_irqsave(rwlp,flags)         __save_flags(flags); __cli( ); read_lock(rwlp)
read_unlock_irqrestore(rwlp,flags)    read_unlock(rwlp); __restore_flags(flags)
write_lock_irqsave(rwlp,flags)        __save_flags(flags); __cli( ); write_lock(rwlp)
write_unlock_irqrestore(rwlp,flags)   write_unlock(rwlp); __restore_flags(flags)

11.4.4 Linux/SMP Interrupt Handling

We stated previously that, on Linux/SMP, interrupts are broadcast by the I/O APIC to all Local APICs; that is, to all CPUs. This means that all CPUs having the IF flags set will receive the same interrupt. However, only one CPU must handle the interrupt, although all of them must acknowledge to their Local APICs that they received it.

In order to do this, each IRQ main descriptor (see Section 4.6.2 in Chapter 4) includes an IRQ_INPROGRESS flag. If it is set, the corresponding interrupt handler is already running on some CPU. Therefore, when each CPU acknowledges to its Local APIC that the interrupt was accepted, it checks whether the flag is already set. If it is, the CPU does not handle the interrupt and exits back to what it was running; otherwise, the CPU sets the flag and starts executing the interrupt handler.
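The check-and-set on IRQ_INPROGRESS can be sketched as follows. This is a user-space model under stated assumptions, not the kernel's code: struct irq_desc_model, claim_irq, and finish_irq are hypothetical names, the flag value is chosen arbitrarily for the sketch, and a C11 atomic_flag stands in for whatever lock makes the check and the update one indivisible step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define IRQ_INPROGRESS 1u          /* flag value chosen for this sketch */

struct irq_desc_model {
    atomic_flag lock;              /* stand-in lock guarding status     */
    unsigned int status;
};

/* Called by every CPU that received the broadcast interrupt. Returns
 * true if the caller must run the handler, false if another CPU is
 * already running it. */
bool claim_irq(struct irq_desc_model *desc)
{
    bool mine = false;
    while (atomic_flag_test_and_set(&desc->lock))
        ;                                   /* spin until the lock is free */
    if (!(desc->status & IRQ_INPROGRESS)) {
        desc->status |= IRQ_INPROGRESS;     /* this CPU takes the interrupt */
        mine = true;
    }
    atomic_flag_clear(&desc->lock);
    return mine;
}

/* Called by the handling CPU once the interrupt handler has finished. */
void finish_irq(struct irq_desc_model *desc)
{
    while (atomic_flag_test_and_set(&desc->lock))
        ;
    assert(desc->status & IRQ_INPROGRESS);
    desc->status &= ~IRQ_INPROGRESS;
    atomic_flag_clear(&desc->lock);
}
```

The first CPU to call claim_irq on a descriptor gets true and runs the handler; every other CPU gets false and returns to what it was doing until finish_irq clears the flag.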


Of course, accesses to the IRQ main descriptor must be mutually exclusive; therefore, each CPU always acquires the irq_controller_lock spin lock before checking the value of IRQ_INPROGRESS. The same lock also prevents several CPUs from fiddling with the interrupt controller simultaneously; this precaution is necessary for old SMP machines that have just one external interrupt controller shared by all CPUs.

The IRQ_INPROGRESS flag ensures that each specific interrupt handler is atomic with respect to itself among all CPUs. However, several CPUs may concurrently handle different interrupts. The global_irq_count variable contains the number of interrupt handlers that are being executed at each given instant on all CPUs. This value could be greater than the number of CPUs, since any interrupt handler can be interrupted by another interrupt handler of a different kind. Similarly, the local_irq_count array stores the number of interrupt handlers being executed on each CPU.

As we have already seen, the kernel must often disable interrupts in order to prevent corruption of a kernel data structure that may be accessed by interrupt handlers. Of course, local CPU interrupt disabling provided by the __cli( ) macro is not enough, since it does not prevent some other CPU from accessing the kernel data structure. The usual solution consists of acquiring a spin lock with an IRQ-safe macro (like spin_lock_irqsave).

In a few cases, however, interrupts should be disabled on all CPUs.
In order to achieve such a result, the kernel does not clear the IF flags on all CPUs; instead it uses the global_irq_lock spin lock to delay the execution of the interrupt handlers. The global_irq_holder variable contains the logical identifier of the CPU that is holding the lock. The get_irqlock( ) function acquires the spin lock and waits for the termination of all interrupt handlers running on the other CPUs. Moreover, if the caller is not a bottom half itself, the function waits for the termination of all bottom halves running on the other CPUs. No further interrupt handler on other CPUs will start running until the lock is released by invoking release_irqlock( ).

Global interrupt disabling is performed by the cli( ) macro, which just invokes the __global_cli( ) function:

    __save_flags(flags);
    if (!(flags & (1


Global interrupt enabling is performed by the sti( ) macro, which just invokes the __global_sti( ) function:

    cpu = smp_processor_id( );
    if (!local_irq_count[cpu])
        release_irqlock(cpu);
    __sti( );

Linux also provides SMP versions of the __save_flags and __restore_flags macros, which are called save_flags and restore_flags: they save and reload, respectively, information controlling the interrupt handling for the executing CPU. As illustrated in Figure 11-5, save_flags yields an integer value that depends on three conditions; restore_flags performs actions based on the value yielded by save_flags.

Figure 11-5. Actions performed by save_flags( ) and restore_flags( )

Finally, the synchronize_irq( ) function is called when a kernel control path wishes to synchronize itself with all interrupt handlers:

    if (atomic_read(&global_irq_count)) {
        cli();
        sti();
    }

By invoking cli( ), the function acquires the global_irq_lock spin lock and then waits until all executing interrupt handlers terminate; once this is done, it reenables interrupts. The synchronize_irq( ) function is usually called by device drivers when they want to make sure that all activities carried on by interrupt handlers are over.


11.4.5 Linux/SMP Bottom Half Handling

Bottom halves are handled much like interrupt handlers, but no bottom half can ever run concurrently with other bottom halves. Moreover, disabling interrupts also disables the running of bottom halves. The global_bh_count variable is a flag that specifies whether a bottom half is currently active on some CPU. The synchronize_bh( ) function is called when a kernel control path must wait for the termination of a currently executing bottom half.

The global_bh_lock variable is used to disable the execution of bottom halves on all CPUs; in other words, it ensures that some critical region is atomic with respect to all bottom halves on all CPUs.

The start_bh_atomic( ) function, which locks out bottom halves, consists of:

    atomic_inc(&global_bh_lock);
    synchronize_bh();

The complementary end_bh_atomic( ) function is used to reenable the bottom halves by executing:

    atomic_dec(&global_bh_lock);

Therefore, the do_bottom_half( ) function starts bottom halves only if:

• No other bottom half is currently running on any CPU (global_bh_count is null).
• The bottom halves are not disabled (global_bh_lock is null).
• No interrupt handler is running on any CPU (global_irq_count is null).
• Interrupts are globally enabled (global_irq_lock is free).

Serial execution of bottom halves is inherited from previous versions of Linux.
Allowing bottom halves to be executed concurrently would require a full revision of all device drivers that use them.

11.4.6 Global and Local Kernel Locks

As we have already mentioned, in Version 2.2 of Linux/SMP a global kernel lock named kernel_flag is still widely used. In Version 2.0, this spin lock was relatively crude, ensuring simply that only one processor at a time could run in Kernel Mode. The 2.2 kernel is considerably more flexible and no longer relies on a single spin lock; however, the global kernel lock is still used to protect a very large number of kernel data structures, namely:

• All data structures related to the Virtual Filesystem and to file handling (see Chapter 12)
• Most kernel data structures related to networking
• All kernel data structures for interprocess communication (IPC); see Chapter 18
• Several less important kernel data structures

The global kernel lock still exists because introducing new locks is not trivial: both deadlocks and race conditions must be carefully avoided.


All system call service routines related to files, including the ones related to file memory mapping, must acquire the global kernel lock before starting their operations and must release it when they terminate. Therefore, a very large number of system calls cannot concurrently execute on Linux/SMP.

Every process descriptor includes a lock_depth field, which allows the same process to acquire the global kernel lock several times. Therefore, two consecutive requests for it will not hang the processor (as they would with a normal spin lock). If the process does not want the lock, the field has the value -1. If the process wants it, the field value plus 1 specifies how many times the lock has been requested. The lock_depth field is crucial for interrupt handlers, exception handlers, and bottom halves. Without it, any asynchronous function that tries to get the global kernel lock could generate a deadlock if the current process already owns the lock.

The lock_kernel( ) and unlock_kernel( ) functions are used to get and release the global kernel lock.
The former function is equivalent to:

    if (++current->lock_depth == 0)
        spin_lock(&kernel_flag);

while the latter is equivalent to:

    if (--current->lock_depth < 0)
        spin_unlock(&kernel_flag);

Notice that the if statements of the lock_kernel( ) and unlock_kernel( ) functions need not be executed atomically because lock_depth is not a global variable: each CPU addresses a field of its own current process descriptor. Local interrupts inside the if statements do not induce race conditions either: even if the new kernel control path invokes lock_kernel( ), it must release the global kernel lock before terminating.

Although the global kernel lock still protects a large number of kernel data structures, work is in progress to reduce that number by introducing many additional smaller locks. Table 11-5 lists some kernel data structures that are already protected by specific (read/write) spin locks.
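The counting scheme behind lock_kernel( ) and unlock_kernel( ) can be tried out in user space. The following is only a minimal sketch under stated assumptions, not the kernel's code: struct task stands in for the process descriptor, the spin lock is modeled with a C11 atomic_flag, and holds_kernel_lock( ) is a hypothetical helper added for experimenting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the kernel_flag spin lock (user-space model, an assumption). */
static atomic_flag kernel_flag = ATOMIC_FLAG_INIT;

/* Stand-in for the process descriptor; only lock_depth is modeled. */
struct task {
    int lock_depth;   /* -1: lock not held; n >= 0: requested n + 1 times */
};

static void spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;   /* busy-wait until the current holder clears the flag */
}

static void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Take the real spin lock only on the outermost request. */
static void lock_kernel(struct task *current)
{
    if (++current->lock_depth == 0)
        spin_lock(&kernel_flag);
}

/* Drop the real spin lock only when the outermost request is released. */
static void unlock_kernel(struct task *current)
{
    assert(current->lock_depth >= 0);   /* unlocking an unheld lock is a bug */
    if (--current->lock_depth < 0)
        spin_unlock(&kernel_flag);
}

/* Hypothetical helper: true if this task currently owns the lock. */
static bool holds_kernel_lock(const struct task *current)
{
    return current->lock_depth >= 0;
}
```

A task initialized with lock_depth set to -1 can call lock_kernel( ) twice in a row without spinning on itself; the flag is cleared only by the second unlock_kernel( ).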


Table 11-5. Various Kernel Spin Locks

Spin Lock                Protected Resource
console_lock             Console
dma_spin_lock            DMA's data structures
inode_lock               Inode's data structures
io_request_lock          Block IO subsystem
kbd_controller_lock      Keyboard
page_alloc_lock          Buddy system's data structures
runqueue_lock            Runqueue list
semaphore_wake_lock      Semaphores' waking fields
tasklist_lock (rw)       Process list
taskslot_lock            List of free entries in task
timerlist_lock           Dynamic timer lists
tqueue_lock              Task queues' lists
uidhash_lock             UID hash table
waitqueue_lock (rw)      Wait queues' lists
xtime_lock (rw)          xtime and lost_ticks

As already explained, finer granularity in the lock mechanism enhances system performance, since less serialization is enforced among the processors. For instance, a kernel control path that accesses the runqueue list is allowed to concurrently run with another kernel control path that is servicing a file-related system call. Similarly, using a read/write lock, two kernel control paths may concurrently access the process list as long as neither of them wants to modify it.

11.4.7 Interprocessor Interrupts

Interprocessor interrupts (in short, IPIs) are part of the SMP architecture and are actively used by Linux in order to exchange messages among CPUs. Linux/SMP provides the following
functions to handle them:

send_IPI_all( )
    Sends an IPI to all CPUs (including the sender)

send_IPI_allbutself( )
    Sends an IPI to all CPUs except the sender

send_IPI_self( )
    Sends an IPI to the sender CPU

send_IPI_single( )
    Sends an IPI to a single, specified CPU


Depending on the I/O APIC configuration, the kernel may sometimes need to invoke the send_IPI_self( ) function. The other functions are used to implement interprocessor messages.

Linux/SMP recognizes five kinds of messages, which are interpreted by the receiving CPU as different interrupt vectors:

RESCHEDULE_VECTOR (0x30)
    Sent to a single CPU in order to force the execution of the schedule( ) function on it. The corresponding Interrupt Service Routine (ISR) is named smp_reschedule_interrupt( ). This message is used by reschedule_idle( ) and by send_sig_info( ) to preempt the running process on a CPU.

INVALIDATE_TLB_VECTOR (0x31)
    Sent to all CPUs but the sender, forcing them to invalidate their translation lookaside buffers. The corresponding ISR, named smp_invalidate_interrupt( ), invokes the __flush_tlb( ) function. [7] This message is used whenever the kernel modifies a page table of some process.

    [7] A subtle concurrency problem occurs when trying to flush the translation lookaside buffers of all processors while some of them run with the interrupts disabled. Therefore, while spinning in tight loops, the kernel control paths keep checking whether some CPU has sent an "invalidate TLB" message.

STOP_CPU_VECTOR (0x40)
    Sent to all CPUs but the sender, forcing the receiving CPUs to halt. The corresponding Interrupt Service Routine is named smp_stop_cpu_interrupt( ).
    This message is used only when the kernel detects an unrecoverable internal error.

LOCAL_TIMER_VECTOR (0x41)
    A timer interrupt automatically sent to all CPUs by the I/O APIC. The corresponding Interrupt Service Routine is named smp_apic_timer_interrupt( ).

CALL_FUNCTION_VECTOR (0x50)
    Sent to all CPUs but the sender, forcing those CPUs to run a function passed by the sender. The corresponding ISR is named smp_call_function_interrupt( ). A typical use of this message is to force CPUs to synchronize and to reload the state of the Memory Type Range Registers (MTRRs). Starting with the Pentium Pro model, Intel microprocessors include these additional registers to easily customize cache operations. Linux uses these registers to disable the hardware cache for the addresses mapping the frame buffer of a PCI/AGP graphics card while maintaining the "write combining" mode of operation: the paging unit combines write transfers into larger chunks before copying them into the frame buffer.
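As a compact summary of the five messages, the pairing of vector numbers and Interrupt Service Routines can be written as a lookup table. This is only an illustrative sketch: the vector values and ISR names are the ones listed above, while struct ipi_message, ipi_messages, and find_ipi_message( ) are hypothetical names introduced here and do not appear in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row per interprocessor message (hypothetical summary type). */
struct ipi_message {
    unsigned int vector;   /* interrupt vector seen by the receiving CPU */
    const char *name;      /* symbolic name of the vector */
    const char *isr;       /* name of the Interrupt Service Routine */
};

static const struct ipi_message ipi_messages[] = {
    { 0x30, "RESCHEDULE_VECTOR",     "smp_reschedule_interrupt" },
    { 0x31, "INVALIDATE_TLB_VECTOR", "smp_invalidate_interrupt" },
    { 0x40, "STOP_CPU_VECTOR",       "smp_stop_cpu_interrupt" },
    { 0x41, "LOCAL_TIMER_VECTOR",    "smp_apic_timer_interrupt" },
    { 0x50, "CALL_FUNCTION_VECTOR",  "smp_call_function_interrupt" },
};

/* Return the table row for a vector, or NULL if it is not an IPI vector. */
static const struct ipi_message *find_ipi_message(unsigned int vector)
{
    size_t n = sizeof ipi_messages / sizeof ipi_messages[0];

    for (size_t i = 0; i < n; i++)
        if (ipi_messages[i].vector == vector)
            return &ipi_messages[i];
    return NULL;
}
```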


11.5 Anticipating Linux 2.4

Linux 2.4 slightly changes the way semaphores are implemented. Essentially, they are now more efficient because, when a semaphore is released, usually only one sleeping process is awakened.

As already mentioned, Linux 2.4 enhances support for high-end SMP architectures. It is now possible to make use of multiple external I/O APIC chips, and all the code that handles interprocessor interrupts (IPIs) has been rewritten.

However, the most important change is that Linux 2.4 is much more multithreaded than Linux 2.2. In other words, it makes use of many new spin locks and reduces the role of the global kernel lock, particularly in the networking code. Linux 2.4 is therefore much more efficient on SMP architectures and performs much better as a high-end server.


Chapter 12. The Virtual Filesystem

One of Linux's keys to success is its ability to coexist comfortably with other systems. You can transparently mount disks or partitions that host file formats used by Windows, other Unix systems, or even systems with tiny market shares like the Amiga. Linux manages to support multiple disk types in the same way other Unix variants do, through a concept called the Virtual Filesystem.

The idea behind the Virtual Filesystem is that the internal objects representing files and filesystems in kernel memory embody a wide range of information; there is a field or function to support any operation provided by any real filesystem supported by Linux. For each read, write, or other function called, the kernel substitutes the actual function that supports a native Linux filesystem, the NT filesystem, or whatever other filesystem the file is on.

This chapter discusses the aims, the structure, and the implementation of Linux's Virtual Filesystem. It focuses on three of the five standard Unix file types, namely, regular files, directories, and symbolic links. Device files will be covered in Chapter 13, while pipes will be discussed in Chapter 18.
To show how a real filesystem works, Chapter 17 covers the Second Extended Filesystem that appears on nearly all Linux systems.

12.1 The Role of the VFS

The Virtual Filesystem (also known as Virtual Filesystem Switch or VFS) is a kernel software layer that handles all system calls related to a standard Unix filesystem. Its main strength is providing a common interface to several kinds of filesystems.

For instance, let us assume that a user issues the shell command:

    $ cp /floppy/TEST /tmp/test

where /floppy is the mount point of an MS-DOS diskette and /tmp is a normal Ext2 (Second Extended Filesystem) directory. As shown in Figure 12-1 (a), the VFS is an abstraction layer between the application program and the filesystem implementations. Therefore, the cp program is not required to know the filesystem types of /floppy/TEST and /tmp/test. Instead, cp interacts with the VFS by means of generic system calls well known to anyone who has done Unix programming (see also Section 1.5.6 in Chapter 1); the code executed by cp is shown in Figure 12-1 (b).
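Since the figure is not reproduced in this text, the kind of code cp executes can be sketched as follows. This is a hedged approximation rather than the listing from Figure 12-1 (b): the copy_file( ) name, the 4096-byte buffer, and the 0600 creation mode are assumptions made for illustration. Note that nothing in it depends on the filesystem type of either pathname.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Copy src to dst using only generic system calls.
 * Returns 0 on success and -1 on any error.
 */
static int copy_file(const char *src, const char *dst)
{
    char buf[4096];
    ssize_t n;
    int inf, outf;

    inf = open(src, O_RDONLY);
    if (inf < 0)
        return -1;
    outf = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (outf < 0) {
        close(inf);
        return -1;
    }

    /* read( ) returns 0 at end of file and -1 on error. */
    while ((n = read(inf, buf, sizeof buf)) > 0) {
        char *p = buf;

        /* write( ) may transfer fewer bytes than requested. */
        while (n > 0) {
            ssize_t w = write(outf, p, (size_t)n);
            if (w < 0) {
                close(outf);
                close(inf);
                return -1;
            }
            p += w;
            n -= w;
        }
    }

    close(outf);
    close(inf);
    return n < 0 ? -1 : 0;   /* n holds the result of the last read( ) */
}
```

With the mount points of the example, copy_file("/floppy/TEST", "/tmp/test") would read through the MS-DOS implementation and write through the Ext2 one, while the program itself stays unchanged.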


Figure 12-1. VFS role in a simple file copy operation

Filesystems supported by the VFS may be grouped into three main classes:

Disk-based filesystems
    Manage the memory space available in a local disk partition. The official Linux disk-based filesystem is Ext2. Other well-known disk-based filesystems supported by the VFS are:

    • Filesystems for Unix variants like System V and BSD
    • Microsoft filesystems like MS-DOS, VFAT (Windows 98), and NTFS (Windows NT)
    • ISO9660 CD-ROM filesystem (formerly High Sierra Filesystem)
    • Other proprietary filesystems like HPFS (IBM's OS/2), HFS (Apple's Macintosh), FFS (Amiga's Fast Filesystem), and ADFS (Acorn's machines)

Network filesystems
    Allow easy access to files included in filesystems belonging to other networked computers. Some well-known network filesystems supported by the VFS are NFS, Coda, AFS (Andrew's filesystem), SMB (Microsoft's Windows and IBM's OS/2 LAN Manager), and NCP (Novell's NetWare Core Protocol).

Special filesystems (also called virtual filesystems)
    Do not manage disk space. Linux's /proc filesystem provides a simple interface that allows users to access the contents of some kernel data structures.
    The /dev/pts filesystem is used for pseudo-terminal support as described in the Open Group's Unix98 standard.

In this book we describe only the Ext2 filesystem, which is the topic of Chapter 17; the other filesystems will not be covered for lack of space.


As mentioned in Section 1.5 in Chapter 1, Unix directories build a tree whose root is the / directory. The root directory is contained in the root filesystem, which in Linux is usually of type Ext2. All other filesystems can be "mounted" on subdirectories of the root filesystem. [1]

[1] When a filesystem is mounted on some directory, the contents of the directory in the parent filesystem are no longer accessible, since any pathname including the mount point will refer to the mounted filesystem. However, the original directory's content will show up again when the filesystem is unmounted. This somewhat surprising feature of Unix filesystems is used by system administrators to hide files; they simply mount a filesystem on the directory containing the files to be hidden.

A disk-based filesystem is usually stored in a hardware block device like a hard disk, a floppy, or a CD-ROM. A useful feature of Linux's VFS allows it to handle virtual block devices like /dev/loop0, which may be used to mount filesystems stored in regular files. As a possible application, a user may protect his own private filesystem by storing an encrypted version of it in a regular file.

The first Virtual Filesystem was included in Sun Microsystems's SunOS in 1986. Since then, most Unix filesystems include a VFS.
Linux's VFS, however, supports the widest range of filesystems.

12.1.1 The Common File Model

The key idea behind the VFS consists of introducing a common file model capable of representing all supported filesystems. This model strictly mirrors the file model provided by the traditional Unix filesystem. This is not surprising, since Linux wants to run its native filesystem with minimum overhead. However, each specific filesystem implementation must translate its physical organization into the VFS's common file model.

For instance, in the common file model each directory is regarded as a normal file, which contains a list of files and other directories. However, several non-Unix disk-based filesystems make use of a File Allocation Table (FAT), which stores the position of each file in the directory tree: in these filesystems, directories are not files. In order to stick to the VFS's common file model, the Linux implementations of such FAT-based filesystems must be able to construct on the fly, when needed, the files corresponding to the directories. Such files exist only as objects in kernel memory.

More essentially, the Linux kernel cannot hardcode a particular function to handle an operation such as read( ) or ioctl( ).
Instead, it must use a pointer for each operation; the pointer is made to point to the proper function for the particular filesystem being accessed.

Let's illustrate this concept by showing how the read( ) shown in Figure 12-1 would be translated by the kernel into a call specific to the MS-DOS filesystem. The application's call to read( ) makes the kernel invoke sys_read( ), just like any other system call. The file is represented by a file data structure in kernel memory, as we shall see later in the chapter. This data structure contains a field called f_op that contains pointers to functions specific to MS-DOS files, including a function that reads a file. sys_read( ) finds the pointer to this function and invokes it. Thus, the application's read( ) is turned into the rather indirect call:

    file->f_op->read(...);

Similarly, the write( ) operation triggers the execution of a proper Ext2 write function associated with the output file. In short, the kernel is responsible for assigning the right set of


pointers to the file variable associated with each open file, then for invoking the call specific to each filesystem that the f_op field points to.

One can think of the common file model as object-oriented, where an object is a software construct that defines both a data structure and the methods that operate on it. For reasons of efficiency, Linux is not coded in an object-oriented language like C++. Objects are thus implemented as data structures with some fields pointing to functions that correspond to the object's methods.

The common file model consists of the following object types:

The superblock object
    Stores information concerning a mounted filesystem. For disk-based filesystems, this object usually corresponds to a filesystem control block stored on disk.

The inode object
    Stores general information about a specific file. For disk-based filesystems, this object usually corresponds to a file control block stored on disk. Each inode object is associated with an inode number, which uniquely identifies the file within the filesystem.

The file object
    Stores information about the interaction between an open file and a process.
    This information exists only in kernel memory during the period each process accesses a file.

The dentry object
    Stores information about the linking of a directory entry with the corresponding file. Each disk-based filesystem stores this information in its own particular way on disk.

Figure 12-2 illustrates with a simple example how processes interact with files. Three different processes have opened the same file, two of them using the same hard link. In this case, each of the three processes makes use of its own file object, while only two dentry objects are required, one for each hard link. Both dentry objects refer to the same inode object, which identifies the superblock object and, together with the latter, the common disk file.
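The indirect call file->f_op->read(...) described earlier in this section, and the idea of objects as data structures whose fields point to functions, can be mimicked in a few lines of user-space C. What follows is a toy model under stated assumptions, not kernel source: the structures are cut down to the bare minimum, the data and f_pos fields form a hypothetical in-memory backing store, and msdos_file_read( ), ext2_file_read( ), and the two call counters exist only to show which implementation was reached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct file;

/* Table of methods; only read is modeled here. */
struct file_operations {
    long (*read)(struct file *f, char *buf, size_t count);
};

/* Toy file object: a method table plus a hypothetical in-memory file. */
struct file {
    const struct file_operations *f_op;   /* installed when the file is opened */
    const char *data;                     /* NUL-terminated file contents */
    size_t f_pos;                         /* current file pointer */
};

static int msdos_reads, ext2_reads;       /* count calls per implementation */

/* Shared helper: copy up to count bytes starting at the file pointer. */
static long copy_out(struct file *f, char *buf, size_t count)
{
    size_t left = strlen(f->data) - f->f_pos;

    if (count > left)
        count = left;
    memcpy(buf, f->data + f->f_pos, count);
    f->f_pos += count;
    return (long)count;
}

static long msdos_file_read(struct file *f, char *buf, size_t count)
{
    msdos_reads++;
    return copy_out(f, buf, count);
}

static long ext2_file_read(struct file *f, char *buf, size_t count)
{
    ext2_reads++;
    return copy_out(f, buf, count);
}

static const struct file_operations msdos_fops = { .read = msdos_file_read };
static const struct file_operations ext2_fops  = { .read = ext2_file_read };

/* Generic entry point: it never names a specific filesystem. */
static long sys_read(struct file *f, char *buf, size_t count)
{
    if (f->f_op == NULL || f->f_op->read == NULL)
        return -1;                        /* method not implemented */
    return f->f_op->read(f, buf, count);
}
```

The generic sys_read( ) stays the same whichever table is installed in f_op; supporting another filesystem in this model only means supplying another file_operations table.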


Figure 12-2. Interaction between processes and VFS objects

Besides providing a common interface to all filesystem implementations, the VFS has another important role related to system performance. The most recently used dentry objects are contained in a disk cache named the dentry cache, which speeds up the translation from a file pathname to the inode of the last pathname component.

Generally speaking, a disk cache is a software mechanism that allows the kernel to keep in RAM some information that is normally stored on a disk, so that further accesses to that data can be quickly satisfied without a slow access to the disk itself. [2] Besides the dentry cache, Linux uses other disk caches, like the buffer cache and the page cache, which will be described in forthcoming chapters.

[2] Notice how a disk cache differs from a hardware cache or a memory cache, neither of which has anything to do with disks or other devices. A hardware cache is a fast static RAM that speeds up requests directed to the slower dynamic RAM (see Section 2.4.6 in Chapter 2). A memory cache is a software mechanism introduced to bypass the Kernel Memory Allocator (see Section 6.2.1 in Chapter 6).

12.1.2 System Calls Handled by the VFS

Table 12-1 illustrates the VFS system calls that refer to filesystems, regular files, directories, and symbolic links. A few other system calls handled by the VFS, such as ioperm( ), ioctl( ), pipe( ), and mknod( ), refer to device files and pipes and hence will be discussed in later chapters.
A last group of system calls handled by the VFS, such as socket( ), connect( ), bind( ), and protocols( ), refer to sockets and are used to implement networking; they will not be covered in this book. Some of the kernel service routines that correspond to the system calls listed in Table 12-1 are discussed either in this chapter or in Chapter 17.


Table 12-1. Some System Calls Handled by the VFS

System Call Name                                        Description
mount( ) umount( )                                      Mount/Unmount filesystems
sysfs( )                                                Get filesystem information
statfs( ) fstatfs( ) ustat( )                           Get filesystem statistics
chroot( )                                               Change root directory
chdir( ) fchdir( ) getcwd( )                            Manipulate current directory
mkdir( ) rmdir( )                                       Create and destroy directories
getdents( ) readdir( ) link( ) unlink( ) rename( )      Manipulate directory entries
readlink( ) symlink( )                                  Manipulate soft links
chown( ) fchown( ) lchown( )                            Modify file owner
chmod( ) fchmod( ) utime( )                             Modify file attributes
stat( ) fstat( ) lstat( ) access( )                     Read file status
open( ) close( ) creat( ) umask( )                      Open and close files
dup( ) dup2( ) fcntl( )                                 Manipulate file descriptors
select( ) poll( )                                       Asynchronous I/O notification
truncate( ) ftruncate( )                                Change file size
lseek( ) _llseek( )                                     Change file pointer
read( ) write( ) readv( ) writev( ) sendfile( )         File I/O operations
pread( ) pwrite( )                                      Seek file and access it
mmap( ) munmap( )                                       File memory mapping
fdatasync( ) fsync( ) sync( ) msync( )                  Synchronize file data
flock( )                                                Manipulate file lock

We said earlier that the VFS is a layer between application programs and specific filesystems. However, in some cases a file operation can be performed by the VFS itself, without invoking a lower-level procedure. For instance, when a process closes an open file, the file on disk doesn't usually need to be touched, and hence the VFS simply releases the corresponding file object.
Similarly, when the lseek( ) system call modifies a file pointer, which is an attribute related to the interaction between an opened file and a process, the VFS needs to modify only the corresponding file object without accessing the file on disk and therefore does not have to invoke a specific filesystem procedure. In some sense, the VFS could be considered as a "generic" filesystem that relies, when necessary, on specific ones.

12.2 VFS Data Structures

Each VFS object is stored in a suitable data structure, which includes both the object attributes and a pointer to a table of object methods. The kernel may dynamically modify the methods of the object, and hence it may install specialized behavior for the object.

The following sections explain the VFS objects and their interrelationships in detail.

12.2.1 Superblock Objects

A superblock object consists of a super_block structure whose fields are described in Table 12-2.


Table 12-2. The Fields of the Superblock Object

Type                          Field              Description
struct list_head              s_list             Pointers for superblock list
kdev_t                        s_dev              Device identifier
unsigned long                 s_blocksize        Block size in bytes
unsigned char                 s_blocksize_bits   Block size in number of bits
unsigned char                 s_lock             Lock flag
unsigned char                 s_rd_only          Read-only flag
unsigned char                 s_dirt             Modified (dirty) flag
struct file_system_type *     s_type             Filesystem type
struct super_operations *     s_op               Superblock methods
struct dquot_operations *     dq_op              Disk quota methods
unsigned long                 s_flags            Mount flags
unsigned long                 s_magic            Filesystem magic number
unsigned long                 s_time             Time of last superblock change
struct dentry *               s_root             Dentry object of mount directory
struct wait_queue *           s_wait             Mount wait queue
struct inode *                s_ibasket          Future development
short int                     s_ibasket_count    Future development
short int                     s_ibasket_max      Future development
struct list_head              s_dirty            List of modified inodes
union                         u                  Specific filesystem information

All superblock objects (one per mounted filesystem) are linked together in a circular doubly linked list. The addresses of the first and last elements of the list are stored in the next and prev fields, respectively, of the s_list field in the super_blocks variable. This field has the data type struct list_head, which is also found in the s_dirty field of the superblock and in a number of other places in the kernel; it consists simply of pointers to the next and previous elements of a list. Thus, the s_list field of a superblock object includes the pointers to the two adjacent superblock objects in the list.
Figure 12-3 illustrates how the list_head elements, next and prev, are embedded in the superblock object.

The last u union field includes superblock information that belongs to a specific filesystem; for instance, as we shall see later in Chapter 17, if the superblock object refers to an Ext2 filesystem, the field stores an ext2_sb_info structure, which includes the disk allocation bit masks and other data of no concern to the VFS common file model.

In general, data in the u field is duplicated in memory for reasons of efficiency. Any disk-based filesystem needs to access and update its allocation bitmaps in order to allocate or release disk blocks. The VFS allows these filesystems to act directly on the u union field of the superblock in memory, without accessing the disk.

This approach leads to a new problem, however: the VFS superblock might end up no longer synchronized with the corresponding superblock on disk. It is thus necessary to introduce an s_dirt flag, which specifies whether the superblock is dirty, that is, whether the data on the disk must be updated. The lack of synchronization leads to the familiar problem of a corrupted filesystem when a site's power goes down without giving the user the chance to shut down a system cleanly. As we shall see in Section 14.1.5 in Chapter 14, Linux minimizes this problem by periodically copying all dirty superblocks to disk.
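The embedded list_head and the s_dirt flag described above can be modeled in user space as follows. This is a simplified sketch, not the kernel's definitions: struct super_block keeps only three of the fields from Table 12-2, kdev_t is replaced by a plain integer, and the sb_entry( ) macro, count_dirty_superblocks( ), and flush_dirty_superblocks( ) are hypothetical helpers introduced for illustration (the last one merely clears the flag where a real kernel would write to disk).

```c
#include <assert.h>
#include <stddef.h>

/* Pointers to the next and previous elements of a list. */
struct list_head {
    struct list_head *next, *prev;
};

/* Cut-down superblock: the list pointers are embedded in the object. */
struct super_block {
    struct list_head s_list;   /* links to the adjacent superblocks */
    unsigned short s_dev;      /* device identifier (kdev_t modeled as integer) */
    unsigned char s_dirt;      /* nonzero if the on-disk copy is stale */
};

/* Head of the circular list; an empty list points to itself. */
static struct list_head super_blocks = { &super_blocks, &super_blocks };

/* Insert entry right after head. */
static void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Recover the superblock from a pointer to its embedded s_list field. */
#define sb_entry(ptr) \
    ((struct super_block *)((char *)(ptr) - offsetof(struct super_block, s_list)))

/* Walk the circular list once and count the dirty superblocks. */
static int count_dirty_superblocks(void)
{
    int n = 0;

    for (struct list_head *p = super_blocks.next; p != &super_blocks; p = p->next)
        if (sb_entry(p)->s_dirt)
            n++;
    return n;
}

/* Pretend to write every dirty superblock to disk; return how many. */
static int flush_dirty_superblocks(void)
{
    int n = 0;

    for (struct list_head *p = super_blocks.next; p != &super_blocks; p = p->next) {
        struct super_block *sb = sb_entry(p);

        if (sb->s_dirt) {
            sb->s_dirt = 0;    /* a real kernel would update the disk here */
            n++;
        }
    }
    return n;
}
```

Embedding the pointers in the object, rather than allocating separate list nodes, lets the same list_head type link any kind of structure; the price is the pointer arithmetic hidden in sb_entry( ).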


Figure 12-3. The superblock list

The methods associated with a superblock are called superblock operations. They are described by the super_operations structure whose address is included in the s_op field.

Each specific filesystem can define its own superblock operations. When the VFS needs to invoke one of them, say read_inode( ), it executes:

    sb->s_op->read_inode(inode);

where sb stores the address of the superblock object involved. The read_inode field of the super_operations table contains the address of the suitable function, which is thus directly invoked.

Let us briefly describe the superblock operations, which implement higher-level operations like deleting files or mounting disks. They are listed in the order they appear in the super_operations table:

read_inode(inode)
    Fills the fields of the inode object whose address is passed as the parameter from the data on disk; the i_ino field of the inode object identifies the specific filesystem inode on disk to be read.

write_inode(inode)
    Updates a filesystem inode with the contents of the inode object passed as the parameter; the i_ino field of the inode object identifies the filesystem inode on disk that is concerned.

put_inode(inode)
    Releases the inode object whose address is passed as the parameter.
As usual,releasing an object do<strong>es</strong> not nec<strong>es</strong>sarily mean freeing memory since other proc<strong>es</strong>s<strong>es</strong>may still use that object.delete_inode(inode)Delet<strong>es</strong> the data blocks containing the file, the disk inode, and the VFS inode.310


notify_change(dentry, iattr)
Changes some attributes of the inode according to the iattr parameter. If the notify_change field is NULL, the VFS falls back on the write_inode( ) method.

put_super(super)
Releases the superblock object whose address is passed as the parameter (because the corresponding filesystem is unmounted).

write_super(super)
Updates a filesystem superblock with the contents of the object indicated.

statfs(super, buf, bufsize)
Returns statistics on a filesystem by filling the buf buffer.

remount_fs(super, flags, data)
Remounts the filesystem with new options (invoked when a mount option must be changed).

clear_inode(inode)
Like put_inode, but also releases all pages that contain data concerning the file that corresponds to the indicated inode.

umount_begin(super)
Interrupts a mount operation, because the corresponding unmount operation has been started (used only by network filesystems).

The preceding methods are available to all possible filesystem types. However, only a subset of them applies to each specific filesystem; the fields corresponding to unimplemented methods are set to NULL. Notice that no read_super method to read a superblock is defined: how could the kernel invoke a method of an object yet to be read from disk? We'll find the read_super method in another object describing the filesystem type (see Section 12.3 later in this chapter).

12.2.2 Inode Objects

All information needed by the filesystem to handle a file is included in a data structure called an inode. A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. An inode object in memory consists of an inode structure whose fields are described in Table 12-3.
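The method-table convention used by the superblock operations of the previous section, and by the inode operations of this one, can be illustrated outside the kernel. The following is a minimal user-space sketch, not kernel code: all toy_ and toyfs_ names are hypothetical, and returning -1 for a missing method is an assumption made for the example (the real VFS decides method by method what a NULL field means, as the notify_change fallback shows).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct toy_inode {
    unsigned long i_ino;
    int loaded;                      /* set once the "disk" data was read */
};

/* Method table: one function pointer per operation, NULL if unimplemented. */
struct toy_super_operations {
    void (*read_inode)(struct toy_inode *);
    void (*write_inode)(struct toy_inode *);
};

struct toy_super_block {
    const struct toy_super_operations *s_op;
};

/* A toy filesystem that implements read_inode only. */
static void toyfs_read_inode(struct toy_inode *inode)
{
    inode->loaded = 1;
}

static const struct toy_super_operations toyfs_sops = {
    .read_inode  = toyfs_read_inode,
    .write_inode = NULL,             /* unimplemented method */
};

/* Generic layer: dispatch through s_op, checking for NULL first.
 * Returns 0 on success, -1 if the filesystem lacks the method. */
static int toy_vfs_read_inode(struct toy_super_block *sb, struct toy_inode *inode)
{
    assert(sb != NULL && sb->s_op != NULL);
    if (sb->s_op->read_inode == NULL)
        return -1;
    sb->s_op->read_inode(inode);     /* same shape as sb->s_op->read_inode(inode) */
    return 0;
}

static int toy_vfs_write_inode(struct toy_super_block *sb, struct toy_inode *inode)
{
    assert(sb != NULL && sb->s_op != NULL);
    if (sb->s_op->write_inode == NULL)
        return -1;
    sb->s_op->write_inode(inode);
    return 0;
}
```

The generic layer never needs to know which filesystem it is talking to; it only follows the pointer stored in the object.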


Table 12-3. The Fields of the Inode Object

| Type | Field | Description |
|---|---|---|
| struct list_head | i_hash | Pointers for the hash list |
| struct list_head | i_list | Pointers for the inode list |
| struct list_head | i_dentry | Pointers for the dentry list |
| unsigned long | i_ino | inode number |
| unsigned int | i_count | Usage counter |
| kdev_t | i_dev | Device identifier |
| umode_t | i_mode | File type and access rights |
| nlink_t | i_nlink | Number of hard links |
| uid_t | i_uid | Owner identifier |
| gid_t | i_gid | Group identifier |
| kdev_t | i_rdev | Real device identifier |
| off_t | i_size | File length in bytes |
| time_t | i_atime | Time of last file access |
| time_t | i_mtime | Time of last file write |
| time_t | i_ctime | Time of last inode change |
| unsigned long | i_blksize | Block size in bytes |
| unsigned long | i_blocks | Number of blocks of the file |
| unsigned long | i_version | Version number, automatically incremented after each use |
| unsigned long | i_nrpages | Number of pages containing file data |
| struct semaphore | i_sem | inode semaphore |
| struct semaphore | i_atomic_write | inode semaphore for atomic write |
| struct inode_operations * | i_op | inode operations |
| struct super_block * | i_sb | Pointer to superblock object |
| struct wait_queue * | i_wait | inode wait queue |
| struct file_lock * | i_flock | Pointer to file lock list |
| struct vm_area_struct * | i_mmap | Pointer to memory regions used to map the file |
| struct page * | i_pages | Pointer to page descriptor |
| struct dquot ** | i_dquot | inode disk quotas |
| unsigned long | i_state | inode state flag |
| unsigned int | i_flags | Filesystem mount flag |
| unsigned char | i_pipe | True if file is a pipe |
| unsigned char | i_sock | True if file is a socket |
| int | i_writecount | Usage counter for writing process |
| unsigned int | i_attr_flags | File creation flags |
| __u32 | i_generation | Reserved for future development |
| union | u | Specific filesystem information |

The final u union field is used to include inode information that belongs to a specific filesystem. For instance, as we shall see in Chapter 17, if the inode object refers to an Ext2 file, the field stores an ext2_inode_info structure.

Each inode object duplicates some of the data included in the disk inode, for instance, the number of blocks allocated to the file. When the value of the i_state field is equal to I_DIRTY, the inode is dirty, that is, the corresponding disk inode must be updated. Other values of the i_state field are I_LOCK (which means that the inode object is locked) and I_FREEING (which means that the inode object is being freed).

Each inode object always appears in one of the following circular doubly linked lists:

• The list of unused inodes. The first and last elements of this list are referenced by the next and prev fields, respectively, of the inode_unused variable. This list acts as a memory cache.
• The list of in-use inodes. The first and last elements are referenced by the inode_in_use variable.
• The list of dirty inodes. The first and last elements are referenced by the s_dirty field of the corresponding superblock object.

Each of the lists just mentioned links together the i_list fields of the proper inode objects.

Inode objects belonging to the "in use" or "dirty" lists are also included in a hash table named inode_hashtable. The hash table speeds up the search of the inode object when the kernel knows both the inode number and the address of the superblock object corresponding to the filesystem that includes the file. [3] Since hashing may induce collisions, the inode object includes an i_hash field that contains a backward and a forward pointer to other inodes that hash to the same position; this field creates a doubly linked list of those inodes.

[3] Actually, a Unix process may open a file and then unlink it: the i_nlink field of the inode could become 0, yet the process is still able to act on the file. In this particular case, the inode is removed from the hash table, even if it still belongs to the in-use or dirty list.

The methods associated with an inode object are also called inode operations. They are described by an inode_operations structure, whose address is included in the i_op field. The structure also includes a pointer to the file operation methods (see Section 12.2.3). Here are the inode operations, in the order they appear in the inode_operations table:

create(dir, dentry, mode)
Creates a new disk inode for a regular file associated with a dentry object in some directory.

lookup(dir, dentry)
Searches a directory for an inode corresponding to the filename included in a dentry object.

link(old_dentry, dir, new_dentry)
Creates a new hard link that refers to the file specified by old_dentry in the directory dir; the new hard link has the name specified by new_dentry.

unlink(dir, dentry)
Removes the hard link of the file specified by a dentry object from a directory.


symlink(dir, dentry, symname)
Creates a new inode for a symbolic link associated with a dentry object in some directory.

mkdir(dir, dentry, mode)
Creates a new inode for a directory associated with a dentry object in some directory.

rmdir(dir, dentry)
Removes from a directory the subdirectory whose name is included in a dentry object.

mknod(dir, dentry, mode, rdev)
Creates a new disk inode for a special file associated with a dentry object in some directory. The mode and rdev parameters specify, respectively, the file type and the device's major number.

rename(old_dir, old_dentry, new_dir, new_dentry)
Moves the file identified by old_dentry from the old_dir directory to the new_dir one. The new filename is included in the dentry object that new_dentry points to.

readlink(dentry, buffer, buflen)
Copies into a memory area specified by buffer the file pathname corresponding to the symbolic link specified by the dentry.

follow_link(inode, dir)
Translates a symbolic link specified by an inode object; if the symbolic link is a relative pathname, the lookup operation starts from the specified directory.

readpage(file, pg)
Reads a page of data from an open file. As we shall see in Chapter 15, regular files are read by this method.

writepage(file, pg)
Writes a page of data into an open file. Most filesystems do not make use of this method when writing regular files.

bmap(inode, block)
Returns the logical block number corresponding to the file block number of the file associated with an inode.


truncate(inode)
Modifies the size of the file associated with an inode. Before invoking this method, it is necessary to set the i_size field of the inode object to the required new size.

permission(inode, mask)
Checks whether the specified access mode is allowed for the file associated with inode.

smap(inode, sector)
Similar to bmap( ), but determines the disk sector number; used by FAT-based filesystems.

updatepage(inode, pg, buf, offset, count, sync)
Updates, if needed, a page of data of a file associated with an inode (usually invoked by network filesystems, which may have to wait a long time before updating remote files).

revalidate(dentry)
Updates the cached attributes of a file specified by a dentry object (usually invoked by the network filesystem).

The methods just listed are available to all possible inodes and filesystem types. However, only a subset of them applies to any specific inode and filesystem; the fields corresponding to unimplemented methods are set to NULL.

12.2.3 File Objects

A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and consists of a file structure, whose fields are described in Table 12-4. Notice that file objects have no corresponding image on disk, and hence no "dirty" field is included in the file structure to specify that the file object has been modified.

The main information stored in a file object is the file pointer, that is, the current position in the file from which the next operation will take place. Since several processes may access the same file concurrently, the file pointer cannot be kept in the inode object.
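To see why the file pointer belongs to the file object rather than to the inode, consider two independent opens of the same file. The following is a minimal user-space sketch under simplifying assumptions: the toy_ structures are hypothetical, the file contents live in a memory buffer, and toy_read( ) merely imitates the effect of a read on f_pos.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the kernel's real structures. */
struct toy_inode {
    const char *data;                /* file contents, shared by all openers */
    size_t i_size;                   /* file length in bytes */
};

struct toy_file {
    struct toy_inode *inode;         /* the file being accessed */
    size_t f_pos;                    /* file pointer, private to this open */
};

/* Read up to count bytes at the caller's own file pointer and advance it. */
static size_t toy_read(struct toy_file *f, char *buf, size_t count)
{
    assert(f != NULL && f->inode != NULL && buf != NULL);
    if (f->f_pos >= f->inode->i_size)
        return 0;                    /* already at end of file */

    size_t left = f->inode->i_size - f->f_pos;
    if (count > left)
        count = left;
    memcpy(buf, f->inode->data + f->f_pos, count);
    f->f_pos += count;               /* only this file object moves */
    return count;
}
```

Two toy_file objects that point to the same toy_inode advance independently; had the offset been stored in the inode, each reader would disturb the other.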


Table 12-4. The Fields of the File Object

| Type | Field | Description |
|---|---|---|
| struct file * | f_next | Pointer to next file object |
| struct file ** | f_pprev | Pointer to previous file object |
| struct dentry * | f_dentry | Pointer to associated dentry object |
| struct file_operations * | f_op | Pointer to file operation table |
| mode_t | f_mode | Process access mode |
| loff_t | f_pos | Current file offset (file pointer) |
| unsigned int | f_count | File object's usage counter |
| unsigned int | f_flags | Flags specified when opening the file |
| unsigned long | f_reada | Read-ahead flag |
| unsigned long | f_ramax | Maximum number of pages to be read-ahead |
| unsigned long | f_raend | File pointer after last read-ahead |
| unsigned long | f_ralen | Number of read-ahead bytes |
| unsigned long | f_rawin | Number of read-ahead pages |
| struct fown_struct | f_owner | Data for asynchronous I/O via signals |
| unsigned int | f_uid | User's UID |
| unsigned int | f_gid | User's GID |
| int | f_error | Error code for network write operation |
| unsigned long | f_version | Version number, automatically incremented after each use |
| void * | private_data | Needed for tty driver |

Each file object is always included in one of the following circular doubly linked lists:

• The list of "unused" file objects. This list acts both as a memory cache for the file objects and as a reserve for the superuser; it allows the superuser to open a file even if the dynamic memory in the system is exhausted. Since the objects are unused, their f_count fields are null. The address of the first element in the list is stored in the free_filps variable. The kernel makes sure that the list always contains at least NR_RESERVED_FILES objects, usually 10.
• The list of "in use" file objects. Each element in the list is used by at least one process, and hence its f_count field is not null. The address of the first element in the list is stored in the inuse_filps variable.

Regardless of which list a file object is in at the moment, its f_next field points to the next element in the list, while the f_pprev field points to the f_next field of the previous element.

The size of the list of "unused" file objects is stored in the nr_free_files variable. The get_empty_filp( ) function is invoked when the VFS must allocate a new file object. The function checks whether the "unused" list has more than NR_RESERVED_FILES items, in which case one can be used for the newly opened file. Otherwise, it falls back to normal memory allocation.

As we explained in Section 12.1.1, each filesystem includes its own set of file operations that perform such activities as reading and writing a file. When the kernel loads an inode into memory from disk, it stores a pointer to these file operations in a file_operations structure whose address is contained in the default_file_ops field of the inode_operations structure of the inode object. When a process opens the file, the VFS initializes the f_op field of the new file object with the address stored in the inode so that further calls to file operations can use these functions. If necessary, the VFS may later modify the set of file operations by storing a new value in f_op.

The following list describes the file operations in the order in which they appear in the file_operations table:

llseek(file, offset, whence)
Updates the file pointer.

read(file, buf, count, offset)
Reads count bytes from a file starting at position *offset; the value *offset (which usually corresponds to the file pointer) is then incremented.

write(file, buf, count, offset)
Writes count bytes into a file starting at position *offset; the value *offset (which usually corresponds to the file pointer) is then incremented.

readdir(dir, dirent, filldir)
Returns the next directory entry of a directory in dirent; the filldir parameter contains the address of an auxiliary function that extracts the fields in a directory entry.

poll(file, poll_table)
Checks whether there is activity on a file and goes to sleep until something happens on it.

ioctl(inode, file, cmd, arg)
Sends a command to an underlying hardware device. This method applies only to device files.

mmap(file, vma)
Performs a memory mapping of the file into a process address space (see Section 15.2 in Chapter 15).

open(inode, file)
Opens a file by creating a new file object and linking it to the corresponding inode object (see Section 12.5.1 later in this chapter).


flush(file)
Called when a reference to an open file is closed, that is, the f_count field of the file object is decremented. The actual purpose of this method is filesystem-dependent.

release(inode, file)
Releases the file object. Called when the last reference to an open file is closed, that is, the f_count field of the file object becomes 0.

fsync(file, dentry)
Writes all cached data of the file to disk.

fasync(file, on)
Enables or disables asynchronous I/O notification by means of signals.

check_media_change(dev)
Checks whether there has been a change of media since the last operation on the device file (applicable to block devices that support removable media, such as floppies and CD-ROMs).

revalidate(dev)
Restores the consistency of a device (used by network filesystems after a media change has been recognized on a remote device).

lock(file, cmd, file_lock)
Applies a lock to the file (see Section 12.6 later in this chapter).

The methods just described are available to all possible file types. However, only a subset of them applies to a specific file type; the fields corresponding to unimplemented methods are set to NULL.

12.2.4 Special Handling for Directory File Objects

Directories must be handled with care because several processes can change their contents concurrently. Explicit locking, which is frequently performed on regular files (see Section 12.6 later in this chapter), is not well suited for directories because it prevents other processes from accessing the whole subtree of files rooted at the locked directory. Therefore, the f_version field of the file object is used together with the i_version field of the inode object to ensure that accesses to each directory file maintain consistency.

We'll explain the use of these fields by describing the most common operation in which they are needed, the readdir( ) system call. Each invocation of this call is supposed to return a directory entry and update the directory's file pointer so that the next invocation of the same system call will return the next directory entry. But the directory could be modified by another process that concurrently accesses it. Without some kind of consistency check, the readdir( ) system call could return the wrong directory entry. Long intervals (potentially hours) could elapse between a process's calls to readdir( ), and the process may choose to stop calling it at any time, so we don't want to lock the directory. What we want is a way to make readdir( ) adapt to changes.

The problem is solved by introducing the global_event variable, which plays the role of version stamp. Whenever the inode object of a directory file is modified, global_event is increased by 1, and the new version stamp is stored in the i_version field of the object. Whenever a file object is created or its file pointer is modified, global_event is increased by 1, and the new version stamp is stored in the f_version field of the object. When servicing the readdir( ) system call, the VFS checks whether the version stamps contained in the i_version and f_version fields coincide. If not, the directory may have been modified by some other process after the previous execution of readdir( ).

When the readdir( ) call detects this consistency problem, it recomputes the directory's file pointer by reading again the whole directory contents. The system call returns the directory entry immediately following the entry that was returned by the process's last readdir( ). f_version is then set to i_version to indicate that readdir( ) is now synchronized with the actual state of the directory.

12.2.5 Dentry Objects

We mentioned in Section 12.1.1 that each directory is considered by the VFS as a normal file that contains a list of files and other directories. We shall discuss in Chapter 17 how directories are implemented on a specific filesystem. Once a directory entry has been read into memory, however, it is transformed by the VFS into a dentry object based on the dentry structure, whose fields are described in Table 12-5. A dentry object is created by the kernel for every component of a pathname that a process looks up; the dentry object associates the component to its corresponding inode. For example, when looking up the /tmp/test pathname, the kernel creates a dentry object for the / root directory, a second dentry object for the tmp entry of the root directory, and a third dentry object for the test entry of the /tmp directory.

Notice that dentry objects have no corresponding image on disk, and hence no field is included in the dentry structure to specify that the object has been modified. Dentry objects are stored in a slab allocator cache called dentry_cache; dentry objects are thus created and destroyed by invoking kmem_cache_alloc( ) and kmem_cache_free( ).
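The one-dentry-per-component rule from the /tmp/test example can be sketched in user space. This is only an illustration under stated assumptions: toy_dentry and toy_path_to_dentries( ) are hypothetical names, dentries are stored in a caller-supplied array rather than a slab cache, no inodes are involved, and the root is made its own parent as a convention of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_NAME_MAX 16              /* illustrative limit for this sketch */

/* Hypothetical, simplified stand-in for a dentry object. */
struct toy_dentry {
    char d_name[TOY_NAME_MAX];       /* pathname component */
    struct toy_dentry *d_parent;     /* dentry of the containing directory */
};

/* Fill out[] with one dentry per component of an absolute pathname.
 * Returns the number of dentries built, or -1 on error. */
static int toy_path_to_dentries(const char *path, struct toy_dentry *out, int max)
{
    assert(path != NULL && out != NULL);
    if (path[0] != '/' || max < 1)
        return -1;                   /* only absolute pathnames are handled */

    int n = 0;
    strcpy(out[n].d_name, "/");
    out[n].d_parent = &out[n];       /* sketch convention: root is its own parent */
    n++;

    const char *p = path;
    while (*p != '\0') {
        while (*p == '/')
            p++;                     /* skip separators */
        size_t len = strcspn(p, "/");
        if (len == 0)
            break;                   /* trailing slash or end of string */
        if (n == max || len >= TOY_NAME_MAX)
            return -1;
        memcpy(out[n].d_name, p, len);
        out[n].d_name[len] = '\0';
        out[n].d_parent = &out[n - 1];
        n++;
        p += len;
    }
    return n;
}
```

For the /tmp/test pathname the function builds three entries, named /, tmp, and test, each pointing to the previous one through d_parent.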


Table 12-5. The Fields of the Dentry Object

| Type | Field | Description |
|---|---|---|
| int | d_count | Dentry object usage counter |
| unsigned int | d_flags | Dentry flags |
| struct inode * | d_inode | Inode associated with filename |
| struct dentry * | d_parent | Dentry object of parent directory |
| struct dentry * | d_mounts | For a mount point, the dentry of the root of the mounted filesystem |
| struct dentry * | d_covers | For the root of a filesystem, the dentry of the mount point |
| struct list_head | d_hash | Pointers for list in hash table entry |
| struct list_head | d_lru | Pointers for unused list |
| struct list_head | d_child | Pointers for the list of dentry objects included in parent directory |
| struct list_head | d_subdirs | For directories, list of dentry objects of subdirectories |
| struct list_head | d_alias | List of associated inodes (alias) |
| struct qstr | d_name | Filename |
| unsigned long | d_time | Used by d_revalidate method |
| struct dentry_operations * | d_op | Dentry methods |
| struct super_block * | d_sb | Superblock object of the file |
| unsigned long | d_reftime | Time when dentry was discarded |
| void * | d_fsdata | Filesystem-dependent data |
| unsigned char | d_iname[16] | Space for short filename |

Each dentry object may be in one of four states:

Free
The dentry object contains no valid information and is not used by the VFS. The corresponding memory area is handled by the slab allocator.

Unused
The dentry object is not currently used by the kernel. The d_count usage counter of the object is null, but the d_inode field still points to the associated inode. The dentry object contains valid information, but its contents may be discarded if necessary to reclaim memory.

In use
The dentry object is currently used by the kernel. The d_count usage counter is positive and the d_inode field points to the associated inode object. The dentry object contains valid information and cannot be discarded.

Negative
The inode associated with the dentry no longer exists, because the corresponding disk inode has been deleted. The d_inode field of the dentry object is set to NULL, but the object still remains in the dentry cache so that further lookup operations to the same file pathname can be quickly resolved. The term "negative" is misleading since no negative value is involved.

12.2.6 The Dentry Cache

Since reading a directory entry from disk and constructing the corresponding dentry object requires considerable time, it makes sense to keep in memory dentry objects that you've finished with but might need later. For instance, people often edit a file and then compile it or edit, then print or copy, then edit. In cases like these, the same file needs to be accessed repeatedly.

In order to maximize efficiency in handling dentries, Linux uses a dentry cache, which consists of two kinds of data structures:

• A set of dentry objects in the in-use, unused, or negative state.
• A hash table to derive the dentry object associated with a given filename and a given directory quickly. As usual, if the required object is not included in the dentry cache, the hashing function returns a null value.

The dentry cache also acts as a controller for an inode cache. The inodes in kernel memory that are associated with unused dentries are not discarded, since the dentry cache is still using them and therefore their i_count fields are not null. Thus, the inode objects are kept in RAM and can be quickly referenced by means of the corresponding dentries.

All the "unused" dentries are included in a doubly linked "Least Recently Used" list sorted by time of insertion. In other words, the dentry object that was last released is put in front of the list, so the least recently used dentry objects are always near the end of the list. When the dentry cache has to shrink, the kernel removes elements from the tail of this list so that the most recently used objects are preserved. The addresses of the first and last elements of the LRU list are stored in the next and prev fields of the dentry_unused variable. The d_lru field of the dentry object contains pointers to the adjacent dentries in the list.

Each "in use" dentry object is inserted into a doubly linked list specified by the i_dentry field of the corresponding inode object (since each inode could be associated with several hard links, a list is required). The d_alias field of the dentry object stores the addresses of the adjacent elements in the list. Both fields are of type struct list_head.

An "in use" dentry object may become "negative" when the last hard link to the corresponding file is deleted. In this case, the dentry object is moved into the LRU list of unused dentries. Each time the kernel shrinks the dentry cache, negative dentries move toward the tail of the LRU list so that they are gradually freed (see Section 16.7.3 in Chapter 16).

The hash table is implemented by means of a dentry_hashtable array. Each element is a pointer to a list of dentries that hash to the same hash table value. The array's size depends on the amount of RAM installed in the system. The d_hash field of the dentry object contains pointers to the adjacent elements in the list associated with a single hash value. The hash function produces its value from both the address of the dentry object of the directory and the filename.
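A lookup keyed on the pair (directory dentry, filename) can be sketched as follows. This is a minimal user-space illustration, not the kernel's implementation: the toy_ names, the fixed table size, the particular hash mixing, and the singly linked collision chains are all assumptions made to keep the example short (the kernel sizes the table from available RAM and chains entries through the d_hash list_head).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HASH_SIZE 64             /* fixed size, for the sketch only */

/* Hypothetical, simplified stand-in for a dentry object. */
struct toy_dentry {
    const char *d_name;              /* filename component */
    struct toy_dentry *d_parent;     /* dentry of the containing directory */
    struct toy_dentry *d_hash_next;  /* next dentry in the same bucket */
};

static struct toy_dentry *toy_hashtable[TOY_HASH_SIZE];

/* Combine the directory dentry's address with the filename. */
static unsigned toy_hash(const struct toy_dentry *parent, const char *name)
{
    uintptr_t h = (uintptr_t)parent >> 4;

    while (*name != '\0')
        h = h * 31 + (unsigned char)*name++;
    return (unsigned)(h % TOY_HASH_SIZE);
}

/* Insert a dentry at the head of its bucket. */
static void toy_d_add(struct toy_dentry *d)
{
    assert(d != NULL && d->d_parent != NULL);
    unsigned b = toy_hash(d->d_parent, d->d_name);

    d->d_hash_next = toy_hashtable[b];
    toy_hashtable[b] = d;
}

/* Return the cached dentry for (parent, name), or NULL on a cache miss. */
static struct toy_dentry *toy_d_lookup(struct toy_dentry *parent, const char *name)
{
    struct toy_dentry *d;

    for (d = toy_hashtable[toy_hash(parent, name)]; d != NULL; d = d->d_hash_next)
        if (d->d_parent == parent && strcmp(d->d_name, name) == 0)
            return d;                /* both directory and name must match */
    return NULL;
}
```

Because the directory's address is part of the key, two files with the same name in different directories resolve to different dentries; the chain walk must still compare both parent and name, since distinct keys can land in the same bucket.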


The methods associated with a dentry object are called dentry operations; they are described by the dentry_operations structure, whose address is stored in the d_op field. Although some filesystems define their own dentry methods, the fields are usually NULL, and the VFS replaces them with default functions. Here are the methods, in the order they appear in the dentry_operations table:

d_revalidate(dentry)
Determines whether the dentry object is still valid before using it for translating a file pathname. The default VFS function does nothing, although network filesystems may specify their own functions.

d_hash(dentry, hash)
Creates a hash value; this is a filesystem-specific hash function for the dentry hash table. The dentry parameter identifies the directory containing the component. The hash parameter points to a structure containing both the pathname component to be looked up and the value produced by the hash function.

d_compare(dir, name1, name2)
Compares two filenames; name1 should belong to the directory referenced by dir. The default VFS function is a normal string match. However, each filesystem can implement this method in its own way. For instance, MS-DOS does not distinguish capital from lowercase letters.

d_delete(dentry)
Called when the last reference to a dentry object is deleted (d_count becomes 0). The default VFS function does nothing.

d_release(dentry)
Called when a dentry object is going to be freed (released to the slab allocator). The default VFS function does nothing.

d_iput(dentry, ino)
Called when a dentry object becomes "negative," that is, it loses its inode. The default VFS function invokes iput( ) to release the inode object.

12.2.7 Files Associated with a Process

We mentioned in Section 1.5 in Chapter 1 that each process has its own current working directory and its own root directory. This information is stored in an fs_struct kernel table, whose address is contained in the fs field of the process descriptor.


    struct fs_struct {
        atomic_t count;
        int umask;
        struct dentry * root, * pwd;
    };

The count field specifies the number of processes sharing the same fs_struct table (see Section 3.3.1 in Chapter 3). The umask field is used by the umask( ) system call to set initial file permissions on newly created files.

A second table, whose address is contained in the files field of the process descriptor, specifies which files are currently opened by the process. It is a files_struct structure whose fields are illustrated in Table 12-6. A process cannot have more than NR_OPEN (usually, 1024) file descriptors. It is possible to define a smaller, dynamic bound on the maximum number of allowed open files by changing the rlim[RLIMIT_NOFILE] structure in the process descriptor.

Table 12-6. The Fields of the files_struct Structure

Type            Field               Description
int             count               Number of processes sharing this table
int             max_fds             Current maximum number of file objects
int             max_fdset           Current maximum number of file descriptors
int             next_fd             Maximum file descriptors ever allocated plus 1
struct file **  fd                  Pointer to array of file object pointers
fd_set *        close_on_exec       Pointer to file descriptors to be closed on exec( )
fd_set *        open_fds            Pointer to open file descriptors
fd_set          close_on_exec_init  Initial set of file descriptors to be closed on exec( )
fd_set          open_fds_init       Initial set of file descriptors
struct file *   fd_array[32]        Initial array of file object pointers

The fd field points to an array of pointers to file objects. The size of the array is stored in the max_fds field. Usually, fd points to the fd_array field of the files_struct structure, which includes 32 file object pointers. If the process opens more than 32 files, the kernel allocates a new, larger array of file pointers and stores its address in the fd field; it also updates the max_fds field.

For every file with an entry in the fd array, the array index is the file descriptor. Usually, the first element (index 0) of the array is associated with the standard input of the process, the second with the standard output, and the third with the standard error (see Figure 12-4). Unix processes use the file descriptor as the main file identifier. Notice that, thanks to the dup( ), dup2( ), and fcntl( ) system calls, two file descriptors may refer to the same opened file, that is, two elements of the array could point to the same file object. Users see this all the time when they use shell constructs like 2>&1 to redirect the standard error to the standard output.

The open_fds field contains the address of the open_fds_init field, which is a bitmap that identifies the file descriptors of currently opened files. The max_fdset field stores the number of bits in the bitmap. Since the fd_set data structure includes 1024 bits, there is usually no need to expand the size of the bitmap. However, the kernel may dynamically expand the size of the bitmap if this turns out to be necessary, much as in the case of the array of file objects.
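The relationship between the fd array, the open_fds bitmap, and dup( ) can be pictured with a small user-space model. What follows is only a sketch under stated assumptions: the toy_ names, the fixed 32-slot table, and the single-word bitmap are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NFDS 32                 /* mirrors the 32-entry fd_array */

/* Hypothetical stand-in for a file object: only the usage counter. */
struct toy_file {
    int f_count;
};

/* Hypothetical, fixed-size stand-in for files_struct. */
struct toy_files {
    struct toy_file *fd[TOY_NFDS];  /* array index == file descriptor */
    unsigned int open_fds;          /* bit n set <=> descriptor n is open */
};

/* Stores file in the lowest free slot; returns the descriptor or -1. */
static int toy_fd_install(struct toy_files *files, struct toy_file *file)
{
    assert(files != NULL && file != NULL);
    for (int fd = 0; fd < TOY_NFDS; fd++) {
        if (!(files->open_fds & (1u << fd))) {
            files->open_fds |= 1u << fd;
            files->fd[fd] = file;
            file->f_count++;        /* one more reference to the object */
            return fd;
        }
    }
    return -1;                      /* table full; the toy never grows it */
}

/* Model of dup(): a second descriptor pointing to the same file object. */
static int toy_dup(struct toy_files *files, int oldfd)
{
    if (oldfd < 0 || oldfd >= TOY_NFDS || !(files->open_fds & (1u << oldfd)))
        return -1;                  /* oldfd is not an open descriptor */
    return toy_fd_install(files, files->fd[oldfd]);
}
```

After installing two files and duplicating descriptor 1, slots 1 and 2 of the array hold the same pointer and the shared object's counter is 2, which is the situation a shell creates with 2>&1.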


Figure 12-4. The fd array

The kernel provides an fget( ) function to be invoked when it starts using a file object. This function receives as its parameter a file descriptor fd. It returns the address in current->files->fd[fd], that is, the address of the corresponding file object, or NULL if no file corresponds to fd. In the first case, fget( ) increments by 1 the file object usage counter f_count.

The kernel also provides an fput( ) function to be invoked when a kernel control path finishes using a file object. This function receives as its parameter the address of a file object and decrements its usage counter f_count. Moreover, if this field becomes 0, the function invokes the release method of the file operations (if defined), releases the associated dentry object, decrements the i_writeaccess field in the inode object (if the file was opened for writing), and finally moves the file object from the "in use" list to the "unused" one.

12.3 Filesystem Mounting

Now we'll focus on how the VFS keeps track of the filesystems it is supposed to support. Two basic operations must be performed before making use of a filesystem: registration and mounting.

Registration is done either when the system boots or when the module implementing the filesystem is being loaded. Once a filesystem has been registered, its specific functions are available to the kernel, so that kind of filesystem can be mounted on the system's directory tree.

Each filesystem has its own root directory. The filesystem whose root directory is the root of the system's directory tree is called the root filesystem. Other filesystems can be mounted on the system's directory tree: the directories on which they are inserted are called mount points.

12.3.1 Filesystem Registration

Often, the user configures Linux to recognize all the filesystems needed when compiling the kernel for her system. But the code for a filesystem actually may either be included in the kernel image or dynamically loaded as a module (see Appendix B). The VFS must keep track of all filesystems whose code is currently included in the kernel. It does this by performing filesystem registrations.
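Registration amounts to keeping a linked list of filesystem-type descriptors that can be searched by name. The user-space sketch below models that idea only; the toy_ names and the trimmed-down descriptor are assumptions made for illustration, and rejecting a duplicate name is a choice of the sketch rather than something stated in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down filesystem-type descriptor. */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;   /* first item of the list */

/*
 * Returns the address of the link that points to the entry called name,
 * or the address of the terminating NULL link if there is no such entry.
 */
static struct toy_fs_type **toy_find(const char *name)
{
    struct toy_fs_type **p = &toy_file_systems;

    while (*p != NULL && strcmp((*p)->name, name) != 0)
        p = &(*p)->next;
    return p;
}

/* Appends fs to the list; returns -1 if the name is already taken. */
static int toy_register(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    p = toy_find(fs->name);
    if (*p != NULL)
        return -1;
    fs->next = NULL;
    *p = fs;
    return 0;
}

/* Unlinks fs from the list; returns -1 if it was not registered. */
static int toy_unregister(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find(fs->name);

    if (*p != fs)
        return -1;
    *p = fs->next;
    fs->next = NULL;
    return 0;
}

/* Lookup by name, in the spirit of get_fs_type( ). */
static struct toy_fs_type *toy_get_fs_type(const char *name)
{
    return *toy_find(name);
}
```

Walking the list through a pointer to the link, rather than a pointer to the node, lets insertion and removal share the same search routine without special-casing the head of the list.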


Each registered filesystem is represented as a file_system_type object whose fields are illustrated in Table 12-7. All filesystem-type objects are inserted into a singly linked list. The file_systems variable points to the first item.

Table 12-7. The Fields of the file_system_type Object

Type                        Field       Description
const char *                name        Filesystem name
int                         fs_flag     Mount flags
struct super_block *(*)( )  read_super  Method for reading superblock
struct file_system_type *   next        Pointer to next list element

During system initialization, the filesystem_setup( ) function is invoked to register the filesystems specified at compile time. For each filesystem type, the register_filesystem( ) function is invoked with a parameter pointing to the proper file_system_type object, which is thus inserted into the filesystem-type list.

The register_filesystem( ) function is also invoked when a module implementing a filesystem is loaded. In this case, the filesystem may also be unregistered (by invoking the unregister_filesystem( ) function) when the module is unloaded.

The get_fs_type( ) function, which receives a filesystem name as its parameter, scans the list of registered filesystems and returns a pointer to the corresponding file_system_type object if it is present.

12.3.2 Mounting the Root Filesystem

Mounting the root filesystem is a crucial part of system initialization. While the system boots, it finds the major number of the disk containing the root filesystem in the ROOT_DEV variable. The root filesystem can be specified as a device file in the /dev directory either when compiling the kernel or by passing a suitable option to the initial bootstrap loader. Similarly, the mount flags of the root filesystem are stored in the root_mountflags variable. The user specifies these flags either by using the /sbin/rdev external program on a compiled kernel image or by passing suitable options to the initial bootstrap loader (see Appendix A).

During system initialization, right after the filesystem_setup( ) invocation, the mount_root( ) function is executed. It performs the following operations (assuming that the filesystem to be mounted is a disk-based one): [4]

1. Initializes a dummy, local file object filp. The f_mode field is set according to the mount flags of the root, while all other fields are set to 0.
2. Creates a dummy inode object and initializes it by setting its i_rdev field to ROOT_DEV.
3. Invokes the blkdev_open( ) function, passing the dummy inode and the file object. As we shall see later in Chapter 13, the function checks whether the disk exists and is properly working.
4. Releases the dummy inode object, which was needed just to verify the disk.
5. Scans the filesystem-type list. For each file_system_type object, invokes read_super( ) to attempt to read the corresponding superblock. This function checks that the device is not already mounted and attempts to fill a superblock object by using the method to which the read_super field of the file_system_type object points. Since each filesystem-specific method uses unique magic numbers, all read_super( ) invocations will fail except the one that attempts to fill the superblock by using the method of the filesystem really used on the root device. The read_super( ) method also creates an inode object and a dentry object for the root directory; the dentry object maps / to the inode object.
6. Sets the root and pwd fields of the fs_struct table of current (the init process) to the dentry object of the root directory.
7. Invokes add_vfsmnt( ) to insert a first element into the list of mounted filesystems (see next section).

[4] Diskless workstations can mount the root directory over a network-based filesystem such as NFS, but we don't describe how this is done.

12.3.3 Mounting a Generic Filesystem

Once the root filesystem has been initialized, additional filesystems may be mounted. Each of them must have its own mount point, which is just an already existing directory in the system's directory tree.

All mounted filesystems are included in a list, whose first element is referenced by the vfsmntlist variable. Each element is a structure of type vfsmount, whose fields are shown in Table 12-8.

Table 12-8. The Fields of the vfsmount Data Structure

Type                        Field        Description
kdev_t                      mnt_dev      Device number
char *                      mnt_devname  Device name
char *                      mnt_dirname  Mount point
unsigned int                mnt_flags    Device flags
struct super_block *        mnt_sb       Superblock pointer
struct quota_mount_options  mnt_dquot    Disk quota mount options
struct vfsmount *           mnt_next     Pointer to next list element

Three low-level functions are used to handle the list and are invoked by the service routines of the mount( ) and umount( ) system calls. The add_vfsmnt( ) and remove_vfsmnt( ) functions add and remove, respectively, an element in the list. The lookup_vfsmnt( ) function searches for a specific mounted filesystem and returns the address of the corresponding vfsmount data structure.

The mount( ) system call is used to mount a filesystem; its sys_mount( ) service routine acts on the following parameters:

• The pathname of a device file containing the filesystem or NULL if it is not required (for instance, when the filesystem to be mounted is network-based)
• The pathname of the directory on which the filesystem will be mounted (the mount point)
• The filesystem type, which must be the name of a registered filesystem
• The mount flags (permitted values are listed in Table 12-9)
• A pointer to a filesystem-dependent data structure (which may be NULL)
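From user space, the five parameters listed above map one-to-one onto the arguments of the mount( ) library wrapper declared in <sys/mount.h> on Linux. The helper below is a hedged illustration: mount_readonly( ) is a hypothetical name, the choice of MS_RDONLY and MS_NOSUID is arbitrary, and the call succeeds only for a process with the required privilege and an existing device and directory.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>      /* mount(), MS_RDONLY, MS_NOSUID (Linux) */

/*
 * Hypothetical helper: mounts the filesystem of the given type found on
 * dev at the directory dir, read-only and ignoring setuid/setgid bits.
 * Returns 0 on success, -1 on failure (errno is left set by mount()).
 */
static int mount_readonly(const char *dev, const char *dir, const char *type)
{
    unsigned long flags = MS_RDONLY | MS_NOSUID;    /* mount flags */

    /* The last argument is the filesystem-dependent data; none here. */
    if (mount(dev, dir, type, flags, NULL) < 0) {
        fprintf(stderr, "mount %s on %s: %s\n", dev, dir, strerror(errno));
        return -1;
    }
    return 0;
}
```

A call such as mount_readonly("/dev/hda1", "/mnt", "ext2") would ask the kernel to mount the first partition of the first IDE disk on /mnt; without the required capability the call fails and the helper returns -1.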


Table 12-9. Filesystem Mounting Options

Macro           Value  Description
MS_MANDLOCK     0x040  Mandatory locking allowed.
MS_NOATIME      0x400  Do not update file access time.
MS_NODEV        0x004  Forbid access to device files.
MS_NODIRATIME   0x800  Do not update directory access time.
MS_NOEXEC       0x008  Disallow program execution.
MS_NOSUID       0x002  Ignore setuid and setgid flags.
MS_RDONLY       0x001  Files can only be read.
MS_REMOUNT      0x020  Remount the filesystem.
MS_SYNCHRONOUS  0x010  Write operations are immediate.
S_APPEND        0x100  Allow append-only file.
S_IMMUTABLE     0x200  Allow immutable file.
S_QUOTA         0x080  Initialize disk quota.

The sys_mount( ) function performs the following operations:

1. Checks whether the process has the required capability to mount a filesystem.
2. If the MS_REMOUNT option has been specified, invokes do_remount( ) to modify the mount flags and terminates.
3. Otherwise, gets a pointer to the proper file_system_type object by invoking get_fs_type( ).
4. If the filesystem to be mounted refers to a hardware device like /dev/hda1, checks whether the device exists and is operational. This is done as follows:
   a. Invokes namei( ) to get the dentry object of the corresponding device file (see Section 12.4 later in this chapter).
   b. Checks whether the inode associated with the device file refers to a valid block device (see Section 13.2.1 in Chapter 13).
   c. Initializes a dummy file object that refers to the device file, then opens the device file by using the open method of the file operations. If this operation succeeds, the device is operational.
5. If the filesystem to be mounted does not refer to a hardware device, gets a fictitious block device with major number 0 by invoking get_unnamed_dev( ).
6. Invokes do_mount( ), passing the parameters dev (device number), dev_name (device filename), dir_name (mount point), type (filesystem type), flags (mount flags), and data (pointer to optional data area). This function mounts the required filesystem by performing the following operations:
   a. Invokes namei( ) to locate the dir_d dentry object corresponding to dir_name; if it does not exist, creates it (see Figure 12-5 (a)).
   b. Acquires the mount_sem semaphore, which is used to serialize the mounting and unmounting operations.
   c. Checks to make sure that dir_d->d_inode is the inode of a directory and that the directory is not the root of a filesystem that is already mounted (dir_d->d_covers must be equal to dir_d).
   d. Invokes read_super( ) to get the superblock object sb of the new filesystem. (If the object does not exist, it is created and filled with information read from the dev device.) The s_root field of the superblock object points to the dentry object of the root directory of the filesystem to be mounted (see Figure 12-5 (b)).
   e. The previous operation could have suspended the current process; therefore, checks that no other process is using the superblock and that no process has already succeeded in mounting the same filesystem.
   f. Invokes add_vfsmnt( ) in order to insert a new element in the list of mounted filesystems.
   g. Sets the d_mounts field of dir_d to the s_root field of the superblock, that is, to the root directory of the mounted filesystem.
   h. Sets the d_covers field of the dentry object of the root directory of the mounted filesystem to dir_d (see Figure 12-5 (c)).
   i. Releases the mount_sem semaphore.

Figure 12-5. Mounting a filesystem

Now, the dir_d dentry object of the mount point is linked through the d_mounts field to the root directory dentry object of the mounted filesystem; this latter object is linked to the dir_d dentry object through the d_covers field.

12.3.4 Unmounting a Filesystem

The umount( ) system call is used to unmount a filesystem. The corresponding sys_umount( ) service routine acts on two parameters: a filename (either a mount directory or a block device file) and a set of flags. It performs the following actions:

1. Checks whether the process has the required capability to unmount the filesystem.
2. Invokes namei( ) on the filename to derive the dentry pointer to the associated dentry object.


3. If the filename refers to the mount point, derives the device identifier from dentry->d_inode->i_sb->s_dev. In other words, the function goes from the dentry object of the mount point to the relative inode, then to the corresponding superblock, and finally to the device identifier.
4. Otherwise, if the filename refers to the device file, derives the device identifier from dentry->d_inode->i_rdev.
5. Invokes dput(dentry) to release the dentry object.
6. Flushes the buffers of the device (see Section 14.1 in Chapter 14).
7. Acquires the mount_sem semaphore.
8. Invokes do_umount( ), which performs the following operations:
   a. Invokes get_super( ) to get the pointer sb of the superblock of the mounted filesystem.
   b. Invokes shrink_dcache_sb( ) to remove the dentry objects that refer to the dev device without disturbing other dentries. The dentry object of the root directory of the mounted filesystem will not be removed, since it is still used by the process doing the unmount.
   c. Invokes fsync_dev( ) to force all "dirty" buffers that refer to the dev device to be written to disk.
   d. If dev is the root device (dev == ROOT_DEV), it cannot be unmounted. If it has not been already remounted, remounts it with the MS_RDONLY flag set and returns.
   e. Checks whether the usage counter of the dentry object corresponding to the root directory of the filesystem to be unmounted is greater than 1. If so, some process is accessing a file in the filesystem, so returns an error code.
   f. Decrements the usage counter of sb->s_root->d_covers (the dentry object of the mount point directory).
   g. Sets sb->s_root->d_covers->d_mounts to sb->s_root->d_covers. This removes the link from the dentry object of the mount point to the dentry object of the root directory of the filesystem.
   h. Releases the dentry object to which sb->s_root (the root directory of the previously mounted filesystem) points and sets sb->s_root to NULL.
   i. If the superblock has been modified and the superblock's write_super method is defined, executes it.
   j. If defined, invokes the put_super( ) method of the superblock.
   k. Sets sb->s_dev to 0.
   l. Invokes remove_vfsmnt( ) to remove the proper element from the list of mounted filesystems.
9. Invokes fsync_dev( ) to force a write to disk for all remaining "dirty" buffers that refer to the dev device (presumably, the buffers containing the superblock information), then invokes the release( ) method of the device file operations.
10. Releases the mount_sem semaphore.

12.4 Pathname Lookup

We illustrate in this section how the VFS derives an inode from the corresponding file pathname. When a process must identify a file, it passes its file pathname to some VFS system call, such as open( ), mkdir( ), rename( ), stat( ), and so on.
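Pathname lookup has to cross mount points, and it does so through the d_mounts and d_covers links that do_mount( ) sets up and do_umount( ) resets (Section 12.3). The following user-space sketch models just those two links; the toy_ names are hypothetical, usage counters and locking are left out, and the extra d_mounts test in toy_mount( ) is a guard added for the toy rather than a check named in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical dentry reduced to the two mount-related links. */
struct toy_dentry {
    struct toy_dentry *d_mounts;  /* root of the fs mounted here, or itself */
    struct toy_dentry *d_covers;  /* mount point this root covers, or itself */
};

/* A dentry not involved in any mount points to itself through both links. */
static void toy_dentry_init(struct toy_dentry *d)
{
    assert(d != NULL);
    d->d_mounts = d;
    d->d_covers = d;
}

/* Links mount point dir_d and filesystem root; returns -1 if either is busy. */
static int toy_mount(struct toy_dentry *dir_d, struct toy_dentry *root)
{
    if (dir_d->d_covers != dir_d)   /* dir_d is itself a mounted root */
        return -1;
    if (dir_d->d_mounts != dir_d || root->d_covers != root)
        return -1;                  /* toy-only guard against double mounts */
    dir_d->d_mounts = root;
    root->d_covers = dir_d;
    return 0;
}

/* Undoes toy_mount(): both dentries point to themselves again. */
static void toy_umount(struct toy_dentry *root)
{
    struct toy_dentry *dir_d = root->d_covers;

    dir_d->d_mounts = dir_d;
    root->d_covers = root;
}

/* What a lookup sees at d: d itself, or the root mounted on top of it. */
static struct toy_dentry *toy_follow_mount(struct toy_dentry *d)
{
    return d->d_mounts;
}
```

Because an unmounted dentry points to itself, the lookup side needs no special case: it always continues from d_mounts, which is the dentry itself unless something is mounted there.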


The standard procedure for performing this task consists of analyzing the pathname and breaking it into a sequence of filenames. All filenames except the last must identify directories.

If the first character of the pathname is /, the pathname is absolute, and the search starts from the directory identified by current->fs->root (the process root directory). Otherwise, the pathname is relative, and the search starts from the directory identified by current->fs->pwd (the process current directory).

Having in hand the inode of the initial directory, the code examines the entry matching the first name to derive the corresponding inode. Then the directory file having that inode is read from disk and the entry matching the second name is examined to derive the corresponding inode. This procedure is repeated for each name included in the path.

The dentry cache considerably speeds up the procedure, since it keeps the most recently used dentry objects in memory. As we have seen before, each such object associates a filename in a specific directory to its corresponding inode. In many cases, therefore, the analysis of the pathname can avoid reading the intermediate directories from the disk.

However, things are not as simple as they look, since the following Unix and VFS filesystem features must be taken into consideration:

• The access rights of each directory must be checked to verify whether the process is allowed to read the directory's content.
• A filename can be a symbolic link that corresponds to an arbitrary pathname: in that case, the analysis must be extended to all components of that pathname.
• Symbolic links may induce circular references: the kernel must take this possibility into account and break endless loops when they occur.
• A filename can be the mount point of a mounted filesystem: this situation must be detected, and the lookup operation must continue into the new filesystem.

The namei( ) and lnamei( ) functions derive an inode from a pathname. The difference between them is that namei( ) follows a symbolic link if it appears as the last component in a pathname without trailing slashes, while lnamei( ) does not. Both functions delegate the heavy work by invoking the lookup_dentry( ) function, which acts on three parameters: name points to a file pathname, base points to the dentry object of the directory from which to start searching, and lookup_flags is a bit array that includes the following flags:

LOOKUP_FOLLOW
    If the last component of the pathname is a symbolic link, interpret (follow) it. This flag is set when lookup_dentry( ) is invoked by namei( ) and cleared when it is invoked by lnamei( ).

LOOKUP_DIRECTORY
    The last component of the pathname must be a directory.


LOOKUP_SLASHOK
    A trailing / in the pathname is allowed even if the last filename does not exist.

LOOKUP_CONTINUE
    There are still filenames to be examined in the pathname. This flag is used internally by lookup_dentry( ).

The lookup_dentry( ) function is recursive, since it may end up invoking itself. Therefore, name could represent the still unresolved trailing portion of a complete pathname. In this case, base points to the dentry object of the last resolved pathname component. lookup_dentry( ) executes the following actions:

1. Examines both the first character of name and the value of base to identify the directory from which the search should start. Three cases can occur.
   o The first character of name is /: the pathname is an absolute pathname, thus base is set to current->fs->root.
   o The first character of name is different from "/" and base is NULL: the pathname is a relative pathname and base is set to current->fs->pwd.
   o The first character of name is different from "/" and base is not NULL: the pathname is a relative pathname and base is left unchanged. (This case should occur only when lookup_dentry( ) is recursively invoked.)
2. Gets the inode of the initial directory from base->d_inode.
3. Clears the LOOKUP_CONTINUE, LOOKUP_DIRECTORY, and LOOKUP_SLASHOK flags in lookup_flags.
4. Iteratively repeats the following procedure on each filename included in the path. If an error condition is encountered, exits from the cycle and returns a NULL dentry pointer, else returns the dentry pointer corresponding to the file pathname. At the start of each iteration, name points to the next filename to be examined and base points to the dentry object of the current directory.
   a. Checks whether the process is allowed to access the base directory (if defined, uses the permission method of inode).
   b. Computes a hash value from the first component name in name to be used in searching for the corresponding entry in the dentry cache. Moreover, if the base directory is in a filesystem that has its own d_hash( ) dentry hashing method, invokes base->d_op->d_hash( ) to compute the hash value based on the directory, the component name, and the previous hash value.
   c. Updates name so that it points to the first character of the next component name (if any), skipping any "/" separator.
   d. Sets the flags local variable to the value previously set in lookup_flags. Additionally, if the currently resolved component was followed by a trailing /, sets the LOOKUP_DIRECTORY flag (requiring a check on whether the component is a directory) and the LOOKUP_FOLLOW flag (interprets the component even if it is a symbolic link). Moreover, if there is a non-null component after the component currently resolved, sets the LOOKUP_CONTINUE flag.
   e. Invokes reserved_lookup( ) to perform the following actions:
      a. If the first component name is a single period (.), sets the dentry local variable to base.


      b. If the first component name is a double period (..) and base is equal to current->fs->root, sets the dentry local variable to base (because the process is already in its root directory).
      c. If the first component name is a double period (..) and base is not equal to current->fs->root, sets the dentry local variable to base->d_covers->d_parent. Usually, d_covers points to base itself and dentry is set to the directory that includes base; however, if the base directory is the root of a mounted filesystem, the d_covers field points to the dentry object of the mount point and dentry is set to the directory that includes the mount point.

      If the first component name is neither (.) nor (..), invokes cached_lookup( ), passing as parameters base and the hash number previously derived. If the dentry hash table includes the required object, returns its address in dentry.

      If the required dentry object is not in the dentry cache, invokes real_lookup( ) to read the directory from disk and creates a new dentry object. This function, which acts on the base and name parameters, performs the following steps:

      d. Gets the i_sem semaphore of the directory inode.
      e. Reexecutes cached_lookup( ), since the required dentry object could have been inserted in the cache while the process was waiting for the directory semaphore.
      f. We assume that the previous attempt failed. Invokes d_alloc( ) to allocate a new dentry object.
      g. Invokes the lookup method of the inode associated with the base directory to find the directory entry containing the required name and fills the new dentry object. This method is filesystem-dependent. We'll describe its Ext2 implementation in Chapter 17.
      h. Releases the i_sem semaphore of the directory inode.
      i. Returns the address of the new object in dentry or an error code if the entry was not found.
   f. Invokes follow_mount( ) to check whether the d_mounts field of dentry has the same value as dentry. If not, dentry is the mount point of a filesystem. In this case, the old dentry object is replaced by the one having the address in dentry->d_mounts.
   g. Invokes do_follow_link( ) to check whether name is a symbolic link. This function receives as its parameters base, dentry, and flags and executes the following steps:
      a. If the LOOKUP_FOLLOW flag is not set, returns immediately. Since the flag is cleared when lookup_dentry( ) is invoked by lnamei( ), this ensures that lnamei( ) does not follow a symbolic link if it appears as the last component in a pathname without trailing slashes.
      b. Checks whether dentry->d_inode contains the follow_link method. If not, the inode is not a symbolic link, so the function returns the dentry input parameter.


    c. Invokes the follow_link inode method. This filesystem-dependent function reads the pathname associated with the symbolic link from the disk and recursively invokes lookup_dentry( ) to resolve it. The function then returns a pointer to the dentry object referred to by the symbolic link (we shall describe in Chapter 17 how symbolic links are handled by Ext2).

    Since lookup_dentry( ) invokes do_follow_link( ), which may in turn invoke the follow_link inode method, which invokes, in turn, lookup_dentry( ), recursive cycles of function calls may be created. The link_count field of current is used to avoid endless recursive calls due to circular references inside the symbolic links. This field is incremented before each recursive execution of follow_link( ) and decremented right after. If it reaches the value 5, do_follow_link( ) terminates with an error code. Therefore, while there is no limit on the number of symbolic links in a pathname, the level of nesting of symbolic links can be at most five.
h. If everything went smoothly, base now points to the dentry object associated with the currently resolved component, so sets inode to base->d_inode.
i. If the LOOKUP_DIRECTORY flag of flags is not set, the currently resolved component is the last one in the file pathname, so returns the address in base. Note that base could point to a negative dentry object; that is, there might be no associated inode. This is fine for the lookup operation, since the last component must not be followed.
j. Otherwise, if LOOKUP_DIRECTORY is set, there is a slash after the currently resolved component. There are two cases to consider:
    • inode points to a valid inode object. In this case, checks that it is a directory by seeing whether the lookup method of the inode operations is defined; if not, returns an error code. Then either starts a new cycle iteration if the LOOKUP_CONTINUE flag in flags is set (meaning that the currently resolved component is not the last one) or returns the address in base (meaning that the component is the last one, even if it is followed by a trailing slash).
    • inode is NULL (meaning that base points to a negative dentry object). Returns base only if LOOKUP_CONTINUE is cleared and LOOKUP_SLASHOK is set; otherwise, returns an error code. Since a negative dentry object represents a file that was removed, it must not appear in the middle of a pathname lookup (which happens when LOOKUP_CONTINUE is set). Moreover, a negative dentry object must not appear as the last component in the pathname when a trailing slash is present, unless explicitly allowed by setting LOOKUP_SLASHOK.

12.5 Implementations of VFS System Calls

For the sake of brevity, we cannot discuss the implementation of all VFS system calls listed in Table 12-1. However, it could be useful to sketch out the implementation of a few system calls, just to show how VFS's data structures interact.


Let us reconsider the example proposed at the beginning of this chapter: a user issues a shell command that copies an MS-DOS file /floppy/TEST to an Ext2 file /tmp/test. The command shell invokes an external program like cp, which we assume executes the following code fragment:

    inf = open("/floppy/TEST", O_RDONLY, 0);
    outf = open("/tmp/test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    do {
        l = read(inf, buf, 4096);
        write(outf, buf, l);
    } while (l);
    close(outf);
    close(inf);

Actually, the code of the real cp program is more complicated, since it must also check for possible error codes returned by each system call. In our example, we just focus our attention on the "normal" behavior of a copy operation.

12.5.1 The open( ) System Call

The open( ) system call is serviced by the sys_open( ) function, which receives as parameters the pathname filename of the file to be opened, some access mode flags flags, and a permission bit mask mode if the file must be created. If the system call succeeds, it returns a file descriptor, that is, the index in the current->files->fd array of pointers to file objects; otherwise, it returns -1.

In our example, open( ) is invoked twice: the first time to open /floppy/TEST for reading (O_RDONLY flag) and the second time to open /tmp/test for writing (O_WRONLY flag). If /tmp/test does not already exist, it will be created (O_CREAT flag) with exclusive read and write access for the owner (octal 0600 number in the third parameter).

Conversely, if the file already exists, it will be rewritten from scratch (O_TRUNC flag). Table 12-10 lists all flags of the open( ) system call.
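The effect of the flags used in the second open( ) call can be checked from user space. The following is only a minimal sketch, not the code of the real cp program: the helper name open_output and its exclusive parameter are hypothetical, introduced here for illustration. It opens a file for writing as in our example and, on request, replaces O_TRUNC with O_EXCL so that the call fails when the file already exists.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open `path` for writing, as in the cp example.
 * If `exclusive` is nonzero, O_EXCL is used instead of O_TRUNC, so an
 * existing file makes open() fail with errno set to EEXIST.
 * Returns a file descriptor, or -1 on error. */
int open_output(const char *path, int exclusive)
{
    int flags = O_WRONLY | O_CREAT;

    if (exclusive)
        flags |= O_EXCL;      /* with O_CREAT: fail if the file exists */
    else
        flags |= O_TRUNC;     /* rewrite an existing file from scratch */

    /* 0600: read and write access for the owner only (subject to umask). */
    return open(path, flags, 0600);
}
```

Calling open_output("/tmp/test", 0) mirrors the call in the example; passing a nonzero second argument shows how a program can refuse to overwrite an existing file.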


Table 12-10. The Flags of the open( ) System Call

Flag Name      Description
-----------    ---------------------------------------------------------
FASYNC         Asynchronous I/O notification via signals
O_APPEND       Write always at end of the file
O_CREAT        Create the file if it does not exist
O_DIRECTORY    Fail if file is not a directory
O_EXCL         With O_CREAT, fail if the file already exists
O_LARGEFILE    Large file (size greater than 2 GB)
O_NDELAY       Same as O_NONBLOCK
O_NOCTTY       Never consider the file as a controlling terminal
O_NOFOLLOW     Do not follow a trailing symbolic link in pathname
O_NONBLOCK     No system calls will block on the file
O_RDONLY       Open for reading
O_RDWR         Open for both reading and writing
O_SYNC         Synchronous write (block until physical write terminates)
O_TRUNC        Truncate the file
O_WRONLY       Open for writing

Let us describe the operation of the sys_open( ) function. It performs the following:

1. Invokes getname( ) to read the file pathname from the process address space.
2. Invokes get_unused_fd( ) to find an empty slot in current->files->fd. The corresponding index (the new file descriptor) is stored in the fd local variable.
3. Invokes the filp_open( ) function, passing as parameters the pathname, the access mode flags, and the permission bit mask. This function, in turn, executes the following steps:
    a. Invokes get_empty_filp( ) to get a new file object.
    b. Sets the f_flags and f_mode fields of the file object according to the values of the flags and mode parameters.
    c. Invokes open_namei( ), which executes the following operations:
        a. Invokes lookup_dentry( ) to interpret the file pathname and gets the dentry object associated with the requested file.
        b. Performs a series of checks to verify whether the process is permitted to open the file as specified by the values of the flags parameter. If so, returns the address of the dentry object; otherwise, returns an error code.
    d. If the access is for writing, checks the value of the i_writecount field of the inode object. A negative value means that the file has been memory-mapped, specifying that write accesses must be denied (see Section 15.2 in Chapter 15). In this case, returns an error code. Any other value specifies the number of processes that are currently writing into the file. In the latter case, increments the counter.
    e. Initializes the fields of the file object; in particular, sets the f_op field to the contents of the i_op->default_file_ops field of the inode object. This sets up all the right functions for future file operations.
    f. If the open method of the (default) file operations is defined, invokes it.
    g. Clears the O_CREAT, O_EXCL, O_NOCTTY, and O_TRUNC flags in f_flags.
    h. Returns the address of the file object.
4. Sets current->files->fd[fd] to the address of the file object.


5. Returns fd.

12.5.2 The read( ) and write( ) System Calls

Let's return to the code in our cp example. The open( ) system calls return two file descriptors, which are stored in the inf and outf variables. Then the program starts a loop: at each iteration, a portion of the /floppy/TEST file is copied into a local buffer (read( ) system call), and then the data in the local buffer is written into the /tmp/test file (write( ) system call).

The read( ) and write( ) system calls are quite similar. Both require three parameters: a file descriptor fd, the address buf of a memory area (the buffer containing the data to be transferred), and a number count that specifies how many bytes should be transferred. Of course, read( ) will transfer the data from the file into the buffer, while write( ) will do the opposite. Both system calls return the number of bytes that were successfully transferred or -1 to signal an error condition. [5]

[5] A return value less than count does not mean that an error occurred. The kernel is always allowed to terminate the system call even if not all requested bytes were transferred, and the user application must accordingly check the return value and reissue, if necessary, the system call. Typically, a small value is returned when reading from a pipe or a terminal device, when reading past the end of the file, or when the system call is interrupted by a signal. The End-Of-File condition (EOF) can easily be recognized by a null return value from read( ). This condition will not be confused with an abnormal termination due to a signal, because if read( ) is interrupted by a signal before any data was read, an error occurs.

The read or write operation always takes place at the file offset specified by the current file pointer (field f_pos of the file object). Both system calls update the file pointer by adding the number of transferred bytes to it.

In short, both sys_read( ) (the read( )'s service routine) and sys_write( ) (the write( )'s service routine) perform almost the same steps:

1. Invokes fget( ) to derive from fd the address file of the corresponding file object and increments the usage counter file->f_count.
2. Checks whether the flags in file->f_mode allow the requested access (read or write operation).
3. Invokes locks_verify_area( ) to check whether there are mandatory locks for the file portion to be accessed (see Section 12.6 later in this chapter).
4. If executing a write operation, acquires the i_sem semaphore included in the inode object. This semaphore prevents a process from writing into the file while another process is flushing to disk buffers relative to the same file (see Section 14.1.5 in Chapter 14). It also prevents two processes from writing into the same file at the same time. Notice that, unless the O_APPEND flag is set, POSIX does not require serialized file accesses: if a programmer wants exclusive access to a file, he must use a file lock (see next section). Thus, it is possible that a process is reading from a file while another process is writing to it.
5. Invokes either file->f_op->read or file->f_op->write to transfer the data. Both functions return the number of bytes that were actually transferred. As a side effect, the file pointer is properly updated.
6. Invokes fput( ) to decrement the usage counter file->f_count.
7. Returns the number of bytes actually transferred.
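Since both system calls may transfer fewer bytes than requested, a careful program reissues them as the footnote above suggests. The following user-space sketch shows one way to write the copy loop of our example with those checks in place. The function names write_all and copy_fd are hypothetical, and retrying after EINTR is our own choice for this sketch; the real cp program may behave differently.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly `count` bytes from `buf` to `fd`, reissuing write() after a
 * short transfer or an interruption by a signal (EINTR).
 * Returns 0 on success, -1 on error (errno is set by write()). */
int write_all(int fd, const char *buf, size_t count)
{
    while (count > 0) {
        ssize_t n = write(fd, buf, count);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was written */
            return -1;
        }
        buf += n;               /* skip the bytes already transferred */
        count -= (size_t)n;
    }
    return 0;
}

/* Copy everything from `inf` to `outf`. Returns 0 on success, -1 on error. */
int copy_fd(int inf, int outf)
{
    char buf[4096];

    for (;;) {
        ssize_t n = read(inf, buf, sizeof buf);
        if (n == 0)
            return 0;           /* null return value: end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data was read */
            return -1;
        }
        if (write_all(outf, buf, (size_t)n) < 0)
            return -1;
    }
}
```

With inf and outf obtained as in the example, copy_fd(inf, outf) replaces the do/while loop; the file pointer of each file object still advances by the number of bytes actually transferred.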


12.5.3 The close( ) System Call

The loop in our example code terminates when the read( ) system call returns the value 0, that is, when all bytes of /floppy/TEST have been copied into /tmp/test. The program can then close the open files, since the copy operation has been completed.

The close( ) system call receives as its parameter fd the file descriptor of the file to be closed. The sys_close( ) service routine performs the following operations:

1. Gets the file object address stored in current->files->fd[fd]; if it is NULL, returns an error code.
2. Sets current->files->fd[fd] to NULL. Releases the file descriptor fd by clearing the corresponding bits in the open_fds and close_on_exec fields of current->files (see Chapter 19 for the Close on Execution flag).
3. Invokes filp_close( ), which performs the following operations:
    a. Invokes the flush method of the file operations, if defined
    b. Releases any mandatory lock on the file
    c. Invokes fput( ) to release the file object
4. Returns the error code of the flush method (usually 0).

12.6 File Locking

When a file can be accessed by more than one process, a synchronization problem occurs: what happens if two processes try to write to the same file location? Or again, what happens if a process reads from a file location while another process is writing into it?

In traditional Unix systems, concurrent accesses to the same file location produce unpredictable results. However, the systems provide a mechanism that allows the processes to lock a file region so that concurrent accesses may be easily avoided.

The POSIX standard requires a file-locking mechanism based on the fcntl( ) system call. It is possible to lock an arbitrary region of a file (even a single byte) or to lock the whole file (including data appended in the future). Since a process can choose to lock just a part of a file, it can also hold multiple locks on different parts of the file.

This kind of lock does not keep out another process that is ignorant of locking. Like a critical region in code, the lock is considered "advisory" because it doesn't work unless other processes cooperate in checking the existence of a lock before accessing the file. Therefore, POSIX's locks are known as advisory locks.

Traditional BSD variants implement advisory locking through the flock( ) system call. This call does not allow a process to lock a file region, just the whole file.


Traditional System V variants provide the lockf( ) system call, which is just an interface to fcntl( ). More importantly, System V Release 3 introduced mandatory locking: the kernel checks that every invocation of the open( ), read( ), and write( ) system calls does not violate a mandatory lock on the file being accessed. Therefore, mandatory locks are enforced even between noncooperative processes. [6] A file is marked as a candidate for mandatory locking by setting its set-group bit (SGID) and clearing the group-execute permission bit. Since the set-group bit makes no sense when the group-execute bit is off, the kernel interprets that combination as a hint to use mandatory locks instead of advisory ones.

[6] Oddly enough, a process may still unlink (delete) a file even if some other process owns a mandatory lock on it! This perplexing situation is possible because, when a process deletes a file hard link, it does not modify its contents but only the contents of its parent directory.

Whether processes use advisory or mandatory locks, they can make use of both shared read locks and exclusive write locks. Any number of processes may have read locks on some file region, but only one process can have a write lock on it at the same time. Moreover, it is not possible to get a write lock when another process owns a read lock for the same file region and vice versa (see Table 12-11).

Table 12-11. Whether a Lock Is Granted

                 Requested Lock
Current Locks    Read    Write
-------------    ----    -----
No lock          Yes     Yes
Read locks       Yes     No
Write lock       No      No

12.6.1 Linux File Locking

Linux supports all forms of file locking: advisory and mandatory locks and the fcntl( ), flock( ), and lockf( ) system calls. However, the lockf( ) system call is just a library wrapper routine, and therefore will not be discussed here.

Mandatory locks can be enabled and disabled on a per-filesystem basis using the MS_MANDLOCK flag of the mount( ) system call. The default is to switch off mandatory locking: in this case, both flock( ) and fcntl( ) create advisory locks. When the flag is set, flock( ) still produces advisory locks, while fcntl( ) produces mandatory locks if the file has the set-group bit on and the group-execute bit off; it produces advisory locks otherwise.

Besides the checks in the read( ) and write( ) system calls, the kernel takes into consideration the existence of mandatory locks when servicing all system calls that could modify the contents of a file. For instance, an open( ) system call with the O_TRUNC flag set fails if any mandatory lock exists for the file.

A lock produced by fcntl( ) is of type FL_POSIX, while a lock produced by flock( ) is of type FL_LOCK. These two types of locks may safely coexist, but neither one has any effect on the other. Therefore, a file locked through fcntl( ) does not appear locked to flock( ) and vice versa.

An FL_POSIX lock is always associated with a process and with an inode; the lock is automatically released either when the process dies or when a file descriptor is closed (even if the process opened the same file twice or duplicated a file descriptor). Moreover, FL_POSIX locks are never inherited by the child across a fork( ).

An FL_LOCK lock is always associated with a file object. When a lock is requested, the kernel replaces any other lock that refers to the same file object. This happens only when a process wants to change an already owned read lock into a write one or vice versa. Moreover, when a file object is being freed by the fput( ) function, all FL_LOCK locks that refer to the file object are destroyed. However, there could be other FL_LOCK read locks set by other processes for the same file (inode), and they still remain active.

12.6.2 File-Locking Data Structures

The file_lock data structure represents file locks; its fields are shown in Table 12-12. All file_lock data structures are included in a doubly linked list. The address of the first element is stored in file_lock_table, while the fields fl_nextlink and fl_prevlink store the addresses of the adjacent elements in the list.

Table 12-12. The Fields of the file_lock Data Structure

Type                           Field           Description
---------------------------    ------------    ---------------------------------------
struct file_lock *             fl_next         Next element in inode list
struct file_lock *             fl_nextlink     Next element in global list
struct file_lock *             fl_prevlink     Previous element in global list
struct file_lock *             fl_nextblock    Next element in process list
struct file_lock *             fl_prevblock    Previous element in process list
struct files_struct *          fl_owner        Owner's files_struct
unsigned int                   fl_pid          PID of the process owner
struct wait_queue *            fl_wait         Wait queue of blocked processes
struct file *                  fl_file         Pointer to file object
unsigned char                  fl_flags        Lock flags
unsigned char                  fl_type         Lock type
off_t                          fl_start        Starting offset of locked region
off_t                          fl_end          Ending offset of locked region
void (*)(struct file_lock *)   fl_notify       Callback function when lock is unblocked
union                          u               Filesystem-specific information

All file_lock structures that refer to the same file on disk are collected in a singly linked list, whose first element is pointed to by the i_flock field of the inode object. The fl_next field of the file_lock structure specifies the next element in the list.

When a process tries to get an advisory or mandatory lock, it may be suspended until the previously allocated lock on the same file region is released. All processes sleeping on some lock are inserted into a wait queue, whose address is stored in the fl_wait field of the file_lock structure. Moreover, all processes sleeping on any file locks are inserted into a global circular list implemented by means of the fl_nextblock and fl_prevblock fields.

The following sections examine the differences between the two lock types.
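To make the inode list and the rules of Table 12-11 more concrete, here is a user-space sketch. It is not kernel code: toy_lock keeps only four of the fields of Table 12-12, and the names toy_lock, toy_type, toy_overlap, and toy_conflicts are hypothetical. We also assume that fl_end is an inclusive offset and ignore lock ownership, so the sketch says nothing about a process changing a lock it already owns.

```c
#include <stddef.h>

enum toy_type { TOY_RDLCK, TOY_WRLCK };   /* read (shared) or write (exclusive) */

/* Simplified stand-in for struct file_lock. */
struct toy_lock {
    struct toy_lock *fl_next;   /* next element in the inode list */
    enum toy_type    fl_type;   /* lock type */
    long             fl_start;  /* starting offset of locked region */
    long             fl_end;    /* ending offset (assumed inclusive here) */
};

/* Two regions overlap if neither one ends before the other starts. */
static int toy_overlap(const struct toy_lock *a, const struct toy_lock *b)
{
    return a->fl_start <= b->fl_end && b->fl_start <= a->fl_end;
}

/* Walk the singly linked list starting at `head` (the role played by the
 * i_flock field) and return the first lock that conflicts with `req`, or
 * NULL if the request can be granted. Following Table 12-11, two locks on
 * overlapping regions conflict unless both are read locks. */
const struct toy_lock *toy_conflicts(const struct toy_lock *head,
                                     const struct toy_lock *req)
{
    const struct toy_lock *fl;

    for (fl = head; fl != NULL; fl = fl->fl_next) {
        if (!toy_overlap(fl, req))
            continue;           /* different regions never conflict */
        if (fl->fl_type == TOY_WRLCK || req->fl_type == TOY_WRLCK)
            return fl;          /* at least one write lock involved */
    }
    return NULL;
}
```

A caller would either insert the request in the list when toy_conflicts( ) returns NULL or, in the kernel's case, put the requesting process to sleep on the conflicting lock's wait queue.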


12.6.3 FL_LOCK Locks

The flock( ) system call acts on two parameters: the fd file descriptor of the file to be acted upon and a cmd parameter that specifies the lock operation. A cmd parameter of LOCK_SH requests a shared lock for reading, LOCK_EX requests an exclusive lock for writing, and LOCK_UN releases the lock. If the LOCK_NB value is ORed to the LOCK_SH or LOCK_EX operation, the system call does not block; in other words, if the lock cannot be immediately obtained, the system call returns an error code. Note that it is not possible to specify a region inside the file: the lock always applies to the whole file.

When the sys_flock( ) service routine is invoked, it performs the following steps:

1. Checks whether fd is a valid file descriptor; if not, returns an error code. Gets the address of the corresponding file object.
2. Invokes flock_make_lock( ) to initialize a file_lock structure by setting the fl_flags field to FL_LOCK; sets the fl_type field to F_RDLCK, F_WRLCK, or F_UNLCK, depending on the value of cmd, and sets the fl_file field to the address of the file object.
3. If the lock must be acquired, checks that the process has both read and write permission on the open file; if not, returns an error code.
4. Invokes flock_lock_file( ), passing as parameters the file object pointer filp, a pointer caller to the initialized file_lock structure, and a flag wait. This last parameter is set if the system call should block and cleared otherwise. This function performs, in turn, the following actions:
    a. Searches the list that filp->f_dentry->d_inode->i_flock points to. If an FL_LOCK lock for the same file object is found and an F_UNLCK operation is required, removes the file_lock element from the inode list and the global list, wakes up all processes sleeping in the lock's wait queue, frees the file_lock structure, and returns.
    b. Otherwise, searches the inode list again to verify that no existing FL_LOCK lock conflicts with the requested one. There must be no FL_LOCK write lock in the inode list, and moreover there must be no FL_LOCK lock at all if the process is requesting a write lock. However, a process may want to change the type of a lock it already owns; this is done by issuing a second flock( ) system call. Therefore, the kernel always allows the process to change locks that refer to the same file object. If a conflicting lock is found and the LOCK_NB flag was specified, returns an error code; otherwise, inserts the current process in the circular list of blocked processes and invokes interruptible_sleep_on( ) to suspend it.
    c. Otherwise, if no incompatibility exists, inserts the file_lock structure into the global lock list and the inode list, then returns (success).

12.6.4 FL_POSIX Locks

When used to lock files, the fcntl( ) system call acts on three parameters: the fd file descriptor of the file to be acted upon, a cmd parameter that specifies the lock operation, and an fl pointer to a flock structure.


Locks of type FL_POSIX are able to protect an arbitrary file region, even a single byte. The region is specified by three fields of the flock structure. l_start is the initial offset of the region and is relative to the beginning of the file (if field l_whence is set to SEEK_SET), to the current file pointer (if l_whence is set to SEEK_CUR), or to the end of the file (if l_whence is set to SEEK_END). The l_len field specifies the length of the file region (or 0, which means that the region extends beyond the end of the file).

The sys_fcntl( ) service routine behaves differently depending on the value of the flag set in the cmd parameter:

F_GETLK
    Determines whether the lock described by the flock structure conflicts with some FL_POSIX lock already obtained by another process. In that case, the flock structure is overwritten with the information about the existing lock.

F_SETLK
    Sets the lock described by the flock structure. If the lock cannot be acquired, the system call returns an error code.

F_SETLKW
    Sets the lock described by the flock structure. If the lock cannot be acquired, the system call blocks; that is, the calling process is put to sleep.

When sys_fcntl( ) acquires a lock, it performs the following:

1. Reads the flock structure from user space.
2. Gets the file object corresponding to fd.
3. Checks whether the lock should be a mandatory one. In that case, returns an error code if the file has a shared memory mapping (see Section 15.2 in Chapter 15).
4. Invokes the posix_make_lock( ) function to initialize a new file_lock structure.
5. Returns an error code if the file does not allow the access mode specified by the type of the requested lock.
6. Invokes the lock method of the file operations, if defined.
7. Invokes the posix_lock_file( ) function, which executes the following actions:
    a. Invokes posix_locks_conflict( ) for each FL_POSIX lock in the inode's lock list. The function checks whether the lock conflicts with the requested one. Essentially, there must be no FL_POSIX write lock for the same region in the inode list, and there may be no FL_POSIX lock at all for the same region if the process is requesting a write lock. However, locks owned by the same process never conflict; this allows a process to change the characteristics of a lock it already owns.
    b. If a conflicting lock is found and fcntl( ) was invoked with the F_SETLK flag, returns an error code. Otherwise, the current process should be suspended. In this case, invokes posix_locks_deadlock( ) to check that no deadlock condition is being created among processes waiting for FL_POSIX locks, then inserts the current process in the circular list of blocked processes and invokes interruptible_sleep_on( ) to suspend it.
    c. As soon as the inode's lock list includes no conflicting lock, checks all the FL_POSIX locks of the current process that overlap the file region that the current process wants to lock and combines and splits adjacent areas as required. For example, if the process requested a write lock for a file region that falls inside a read-locked wider region, the previous read lock is split into two parts covering the nonoverlapping areas, while the central region is protected by the new write lock. In case of overlaps, newer locks always replace older ones.
    d. Inserts the new file_lock structure in the global lock list and in the inode list.
8. Returns the value 0 (success).

12.7 Anticipating Linux 2.4

The Linux 2.4 VFS handles eight new filesystems, among them the udf for handling DVD devices. The maximum file size has been considerably increased (at least from the VFS point of view) by expanding the i_size field of the inode from 32 to 64 bits.

Additional access types can now be specified when opening a file: one refers to "raw" write requests that do not make use of the buffer cache.


Chapter 13. Managing I/O Devices

The Virtual File System in the last chapter depends on lower-level functions to carry out each read, write, or other operation in a manner suited to each device. The previous chapter included a brief discussion of how operations are handled by different filesystems. In this chapter, we'll look at how the kernel invokes the operations on actual devices.

In Section 13.1 we give a brief survey of the Intel 80x86 I/O architecture. In Section 13.2 we show how the VFS associates a "device file" with each different hardware device so that application programs can use all kinds of devices in the same way. Most of the chapter focuses on the two types of drivers, character and block.

The aim of this chapter is to illustrate the overall organization of device drivers in Linux. Readers interested in developing device drivers on their own may want to refer to Alessandro Rubini's Linux Device Drivers book from O'Reilly.

13.1 I/O Architecture

In order to make a computer work properly, data paths must be provided that let information flow between CPU(s), RAM, and the score of I/O devices that can be connected nowadays to a personal computer. These data paths, which are denoted collectively as the bus, act as the primary communication channel inside the computer.

Several types of buses, such as the ISA, EISA, PCI, and MCA, are currently in use. In this section we'll discuss the functional characteristics common to all PC architectures, without giving details about a specific bus type.

In fact, what is commonly denoted as bus consists of three specialized buses:

Data bus
    A group of lines that transfers data in parallel. The Pentium has a 64-bit-wide data bus.

Address bus
    A group of lines that transmits an address in parallel. The Pentium has a 32-bit-wide address bus.

Control bus
    A group of lines that transmits control information to the connected circuits. The Pentium makes use of control lines to specify, for instance, whether the bus is used to allow data transfers between a processor and the RAM or alternatively between a processor and an I/O device. Control lines also determine whether a read or a write transfer must be performed.

When the bus connects the CPU to an I/O device, it is called an I/O bus. In this case, Intel 80x86 microprocessors use 16 out of the 32 address lines to address I/O devices and 8, 16, or 32 out of the 64 data lines to transfer data. The I/O bus, in turn, is connected to each I/O


device by means of a hierarchy of hardware components including up to three elements: I/O ports, interfaces, and device controllers. Figure 13-1 shows the components of the I/O architecture.

Figure 13-1. PC's I/O architecture

13.1.1 I/O Ports

Each device connected to the I/O bus has its own set of I/O addresses, which are usually called I/O ports. In the IBM PC architecture, the I/O address space provides up to 65,536 8-bit I/O ports. Two consecutive 8-bit ports may be regarded as a single 16-bit port, which must start on an even address. Similarly, two consecutive 16-bit ports may be regarded as a single 32-bit port, which must start on an address that is a multiple of 4. Four special assembly language instructions called in, ins, out, and outs allow the CPU to read from and write into an I/O port. While executing one of these instructions, the CPU makes use of the address bus to select the required I/O port and of the data bus to transfer data between a CPU register and the port.

I/O ports may also be mapped into addresses of the physical address space: the processor is then able to communicate with an I/O device by issuing assembly language instructions that operate directly on memory (for instance, mov, and, or, and so on). Modern hardware devices tend to prefer mapped I/O, since it is faster and can be combined with DMA.

An important objective for system designers is to offer a unified approach to I/O programming without sacrificing performance. Toward that end, the I/O ports of each device are structured into a set of specialized registers as shown in Figure 13-2. The CPU writes into the control register the commands to be sent to the device and reads from the status register a value that represents the internal state of the device. The CPU also fetches data from the device by reading bytes from the input register and pushes data to the device by writing bytes into the output register.
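The way two consecutive 8-bit ports form one 16-bit port, as described earlier in this section, can be tried out without touching real hardware. The sketch below simulates the 65,536-port I/O address space with a plain array; every name in it (sim_ports, sim_inb, sim_outb, sim_inw, sim_outw) is hypothetical, and the low-byte-first ordering is an assumption that mirrors the little-endian convention of the 80x86. It enforces the even-address rule exactly as the text states it.

```c
#include <stdint.h>

#define SIM_NUM_PORTS 65536u  /* size of the I/O address space given in the text */

/* Simulated I/O address space: one byte per 8-bit port. */
static uint8_t sim_ports[SIM_NUM_PORTS];

/* 8-bit accesses: any port address is acceptable. */
static void sim_outb(uint16_t port, uint8_t value) { sim_ports[port] = value; }
static uint8_t sim_inb(uint16_t port) { return sim_ports[port]; }

/* 16-bit write: uses two consecutive 8-bit ports, low byte first.
 * Returns -1 if the port does not start on an even address. */
static int sim_outw(uint16_t port, uint16_t value)
{
    if (port % 2 != 0)
        return -1;
    sim_ports[port] = (uint8_t)(value & 0xff);
    sim_ports[port + 1] = (uint8_t)(value >> 8);
    return 0;
}

/* 16-bit read: reassembles the value from the same two 8-bit ports. */
static int sim_inw(uint16_t port, uint16_t *value)
{
    if (port % 2 != 0)
        return -1;
    *value = (uint16_t)(sim_ports[port] | (sim_ports[port + 1] << 8));
    return 0;
}
```

Writing 0xABCD with sim_outw( ) to port 0x10 leaves 0xCD in port 0x10 and 0xAB in port 0x11, which is what two separate sim_inb( ) calls would then return.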


Figure 13-2. Specialized I/O ports

In order to lower costs, the same I/O port is often used for different purposes. For instance, some bits describe the device state, while others specify the command to be issued to the device. Similarly, the same I/O port may be used as an input register or an output register.

13.1.2 I/O Interfaces

An I/O interface is a hardware circuit inserted between a group of I/O ports and the corresponding device controller. It acts as an interpreter that translates the values in the I/O ports into commands and data for the device. In the opposite direction, it detects changes in the device state and correspondingly updates the I/O port that plays the role of status register. This circuit can also be connected through an IRQ line to a Programmable Interrupt Controller, so that it issues interrupt requests on behalf of the device.

There are two types of interfaces:

Custom I/O interfaces
    Devoted to one specific hardware device. In some cases, the device controller is located in the same card [1] that contains the I/O interface. The devices attached to a custom I/O interface can be either internal devices (devices located inside the PC's cabinet) or external devices (devices located outside the PC's cabinet).

[1] Each card must be inserted in one of the available free bus slots of the PC. If the card can be connected to an external device through an external cable, the card sports a suitable connector in the rear panel of the PC.

General-purpose I/O interfaces
    Used to connect several different hardware devices. Devices attached to a general-purpose I/O interface are always external devices.

13.1.2.1 Custom I/O interfaces

Just to give an idea of how much variety is encompassed by custom I/O interfaces, thus by the devices currently installed in a PC, we'll list some of the most commonly found:

Keyboard interface
    Connected to a keyboard controller that includes a dedicated microprocessor. This microprocessor decodes the combination of pressed keys, generates an interrupt, and puts the corresponding scan code in an input register.


Graphic interface
    Packed together with the corresponding controller in a graphic card that has its own frame buffer, as well as a specialized processor and some code stored in a Read-Only Memory chip (ROM). The frame buffer is an on-board memory containing the graphics description of the current screen contents.

Disk interface
    Connected by a cable to the disk controller, which is usually integrated with the disk. For instance, the IDE interface is connected by a 40-wire flat conductor cable to an intelligent disk controller that can be found on the disk itself.

Bus mouse interface
    The corresponding controller is included in the mouse, which is connected via a cable to the interface.

Network interface
    Packed together with the corresponding controller in a network card used to receive or transmit network packets. Although there are several widely adopted network standards, Ethernet is the most common.

13.1.2.2 General-purpose I/O interfaces

Modern PCs include several general-purpose I/O interfaces, which are used to connect a wide range of external devices. The most common interfaces are:

Parallel port
    Traditionally used to connect printers, it can also be used to connect removable disks, scanners, backup units, other computers, and so on. The data is transferred 1 byte (8 bits) at a time.

Serial port
    Like the parallel port, but the data is transferred 1 bit at a time. It includes a Universal Asynchronous Receiver and Transmitter (UART) chip to string out the bytes to be sent into a sequence of bits and to reassemble the received bits into bytes. Since it is intrinsically slower than the parallel port, this interface is mainly used to connect external devices that do not operate at a high speed, like modems, mice, and printers.

Universal serial bus (USB)
    A recent general-purpose I/O interface that is quickly gaining in popularity. It operates at a high speed, and it may be used for the external devices traditionally connected to the parallel port and the serial port.


PCMCIA interface
    Included mostly on portable computers. The external device, which has the shape of a credit card, can be inserted into and removed from a slot without rebooting the system. The most common PCMCIA devices are hard disks, modems, network cards, and RAM expansions.

SCSI (Small Computer System Interface) interface
    A circuit that connects the main PC bus to a secondary bus called the SCSI bus. The SCSI-2 bus allows up to eight PCs and external devices (hard disks, scanners, CD-ROM writers, and so on) to be connected together. Wide SCSI-2 and the recent SCSI-3 interfaces allow you to connect 16 devices or more if additional interfaces are present. The SCSI standard is the communication protocol used to connect devices via the SCSI bus.

13.1.3 Device Controllers

A complex device may require a device controller to drive it. Essentially, the controller plays two important roles:

• It interprets the high-level commands received from the I/O interface and forces the device to execute specific actions by sending proper sequences of electrical signals to it.
• It converts and properly interprets the electrical signals received from the device and modifies (through the I/O interface) the value of the status register.

A typical device controller is the disk controller, which receives high-level commands such as "write this block of data" from the microprocessor (through the I/O interface) and converts them into low-level disk operations such as "position the disk head on the right track" and "write the data inside the track." Modern disk controllers are very sophisticated, since they can keep the disk data in fast memory caches and can reorder the CPU's high-level requests to optimize them for the actual disk geometry.

Simpler devices do not have a device controller; the Programmable Interrupt Controller (see Section 4.2 in Chapter 4) and the Programmable Interval Timer (see Section 5.1.3 in Chapter 5) are examples of such devices.

13.1.4 Direct Memory Access (DMA)

All PCs include an auxiliary processor called the Direct Memory Access Controller, or DMAC, which can be instructed to transfer data between the RAM and an I/O device. Once activated by the CPU, the DMAC is able to carry on the data transfer on its own; when the data transfer has been completed, the DMAC issues an interrupt request. The conflicts occurring when both CPU and DMAC need to access the same memory location at the same time are resolved by a hardware circuit called a memory arbiter (see also Section 11.3.1 in Chapter 11).


The DMAC is mostly used by disk drivers and other slow devices that transfer a large number of bytes at once. Because setup time for the DMAC is relatively high, it is more efficient to directly use the CPU for the data transfer when the number of bytes is small.

The first DMACs for the old ISA buses were complex and hard to program. More recent DMACs for the PCI and SCSI buses rely on dedicated hardware circuits in the buses and make life easier for device driver developers.

Until now we have distinguished three kinds of memory addresses: logical and linear addresses, which are used internally by the CPU, and physical addresses, which are the memory addresses used by the CPU to physically drive the data bus. However, there is a fourth kind of memory address, the so-called bus address: it corresponds to the memory addresses used by all hardware devices except the CPU to drive the data bus. In the PC architecture, bus addresses coincide with physical addresses; however, in other architectures, like Sun's SPARC and Compaq's Alpha, these two kinds of addresses differ.

Why should the kernel be concerned at all about bus addresses? Well, in a DMA operation the data transfer takes place without CPU intervention: the data bus is directly driven by the I/O device and the DMAC. Therefore, when the kernel sets up a DMA operation, it must write the bus address of the memory buffer involved into the proper I/O ports of the DMAC or I/O device.

13.2 Associating Files with I/O Devices

As mentioned in Chapter 1, Unix-like operating systems are based on the notion of a file, which is just an information container structured as a sequence of bytes. According to this approach, I/O devices are treated as files; thus, the same system calls used to interact with regular files on disk can be used to directly interact with I/O devices. As an example, the same write( ) system call may be used to write data into a regular file, or to send it to a printer by writing to the /dev/lp0 device file. Let's now examine in more detail how this scheme is carried out.

13.2.1 Device Files

Device files are used to represent most of the I/O devices supported by Linux. Besides its name, each device file has three main attributes:

Type
    Either block or character (we'll discuss the difference shortly).

Major number
    A number ranging from 1 to 255 that identifies the device type. Usually, all device files having the same major number and the same type share the same set of file operations, since they are handled by the same device driver.


Minor number
    A number that identifies a specific device among a group of devices that share the same major number.

The mknod( ) system call is used to create device files. It receives the name of the device file, its type, and the major and minor numbers as parameters. The last two parameters are merged in a 16-bit dev_t number: the eight most significant bits identify the major number, while the remaining ones identify the minor number. The MAJOR and MINOR macros extract the two values from the 16-bit number, while the MKDEV macro merges a major and minor number into a 16-bit number. Actually, dev_t is the data type specifically used by application programs; the kernel uses the kdev_t data type. In Linux 2.2 both types reduce to an unsigned short integer, but kdev_t will become a complete device file descriptor in some future Linux version.

Device files are usually included in the /dev directory. Table 13-1 illustrates the attributes of some device files. [2] Notice how the same major number may be used to identify both a character and a block device.

[2] The official registry of allocated device numbers and /dev directory nodes is stored in the Documentation/devices.txt file. The major numbers of the devices supported may also be found in the include/linux/major.h file.

Table 13-1. Examples of Device Files

Name          Type   Major  Minor  Description
/dev/fd0      block  2      0      Floppy disk
/dev/hda      block  3      0      First IDE disk
/dev/hda2     block  3      2      Second primary partition of first IDE disk
/dev/hdb      block  3      64     Second IDE disk
/dev/hdb3     block  3      67     Third primary partition of second IDE disk
/dev/ttyp0    char   3      0      Terminal
/dev/console  char   5      1      Console
/dev/lp1      char   6      1      Parallel printer
/dev/ttyS0    char   4      64     First serial port
/dev/rtc      char   10     135    Real time clock
/dev/null     char   1      3      Null device (black hole)

Usually, a device file is associated with a hardware device, like a hard disk (for instance, /dev/hda), or with some physical or logical portion of a hardware device, like a disk partition (for instance, /dev/hda2). In some cases, however, a device file is not associated with any real hardware device, but represents a fictitious logical device. For instance, /dev/null is a device file corresponding to a "black hole": all data written into it are simply discarded, and the file always appears empty.

As far as the kernel is concerned, the name of the device file is irrelevant. If you created a device file named /tmp/disk of type "block" with major number 3 and minor number 0, it would be equivalent to the /dev/hda device file shown in the table. On the other hand, device filenames may be significant for some application programs. As an example, a communication program might assume that the first serial port is associated with the /dev/ttyS0 device file. But usually most application programs can be configured to interact with arbitrarily named device files.
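The 16-bit packing of major and minor numbers described earlier in this section is simple enough to reproduce in a few lines. The sketch below is only an illustration of that layout under the stated assumption of 8 bits for each number; the names sketch_dev_t, sketch_major, sketch_minor, and sketch_mkdev are hypothetical and are not the kernel's MAJOR, MINOR, and MKDEV macros.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 16-bit device number described in the text. */
typedef uint16_t sketch_dev_t;

#define SKETCH_MINORBITS 8
#define SKETCH_MINORMASK ((1u << SKETCH_MINORBITS) - 1)

/* The eight most significant bits hold the major number. */
static inline unsigned sketch_major(sketch_dev_t dev)
{
    return dev >> SKETCH_MINORBITS;
}

/* The eight least significant bits hold the minor number. */
static inline unsigned sketch_minor(sketch_dev_t dev)
{
    return dev & SKETCH_MINORMASK;
}

/* Merge a major and a minor number into one 16-bit value. */
static inline sketch_dev_t sketch_mkdev(unsigned major, unsigned minor)
{
    return (sketch_dev_t)((major << SKETCH_MINORBITS) |
                          (minor & SKETCH_MINORMASK));
}
```

Taking /dev/hdb3 from Table 13-1 (major 3, minor 67), sketch_mkdev(3, 67) yields 0x0343. Note that /dev/hda and /dev/ttyp0 both pack to 0x0300; only the device file type (block or character) tells them apart.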


13.2.1.1 Block versus character devices

Block devices have the following characteristics:

• They are able to transfer a fixed-size block of data in a single I/O operation.
• Blocks stored in the device can be addressed randomly: the time needed to transfer a data block can be assumed independent of the block address inside the device and of the current device state.

Typical examples of block devices are hard disks, floppy disks, and CD-ROMs. RAM disks, which are obtained by configuring portions of the RAM as fast hard disks and can make temporary storage very efficient for application programs, are also treated as block devices.

Character devices have the following characteristics:

• They are able to transfer arbitrary-sized data in a single I/O operation. Actually, some character devices, such as printers, transfer 1 byte at a time, while others, such as tape units, transfer variable-length blocks of data.
• They usually address characters sequentially.

13.2.1.2 Network cards

Some I/O devices have no corresponding device file. The most significant example is network cards. Essentially, a network card places outgoing data on a line going to remote computer systems and receives packets from those systems into kernel memory. Although this book does not cover networking, it is worth spending a few moments on the kernel and programming interfaces to these cards.

Starting with BSD, all Unix systems assign a different symbolic name to each network card included in the computer; for instance, the first Ethernet card gets the eth0 name. However, the name does not correspond to any device file and has no corresponding inode.

Instead of using the filesystem, the system administrator has to set up a relationship between the device name and a network address. Therefore, data communication between application programs and the network interface is not based on the standard file-related system calls; it is based instead on the socket( ), bind( ), listen( ), accept( ), and connect( ) system calls, which act on network addresses. This group of system calls, introduced first by Unix BSD, has become the standard programming model for network devices.

13.2.2 VFS Handling of Device Files

Device files live in the system directory tree but are intrinsically different from regular files and directories. When a process accesses a regular file, it is accessing some data blocks in some disk partition through a filesystem, but when a process accesses a device file, it is just driving a hardware device. For instance, a process might access a device file to read the room temperature from a digital thermometer connected to the computer. It is the VFS's responsibility to hide the differences between device files and regular files from application programs.
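Seen from an application, this uniformity means that code written for regular files needs no changes to work on a device file. The following minimal sketch uses only the standard open( ), read( ), write( ), and close( ) calls; the helper names write_string and read_some are hypothetical, and /dev/null is used as the target simply because it is present on any Linux system and needs no special hardware.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Writes a string to the file at 'path', which may be a regular file
 * or a device file. Returns the number of bytes written, or -1. */
static long write_string(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t n = write(fd, text, strlen(text));
    close(fd);
    return (long)n;
}

/* Reads up to 'len' bytes from the file at 'path' into 'buf'.
 * Returns the number of bytes read (0 at end of file), or -1. */
static long read_some(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);
    close(fd);
    return (long)n;
}
```

Calling write_string("/dev/null", "hello") reports 5 bytes written even though the data is discarded, and read_some( ) on the same path returns 0, since the file always appears empty. Nothing in the helpers knows that the path names a device; hiding that difference is the job the VFS performs behind these calls.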


In order to do this, the VFS changes the default file operations of an opened device file; as a result, any system call on the device file will be translated to an invocation of a device-related function instead of the corresponding function of the hosting filesystem. The device-related function acts on the hardware device to perform the operation requested by the process. [3]

[3] Notice that, thanks to the name-resolving mechanism explained in Section 12.4 in Chapter 12, symbolic links to device files work just like device files.

The set of device-related functions that control an I/O device is called a device driver. Since each device has a unique I/O controller, and thus unique commands and unique state information, most I/O device types have their own drivers.

13.2.2.1 Device file class descriptors

Each class of device files having the same major number and the same type is described by a device_struct data structure, which includes two fields: the name (name) of the device class and a pointer (fops) to the file operation table. All device_struct descriptors for character device files are included in the chrdevs table. It includes 255 elements, one for each possible major number. (No device file can have major number 255, since that value is reserved for future extensions.) Similarly, all 255 descriptors for block device files are included in the blkdevs table. The first entry of both tables is always empty, since no device file can have major number 0.

The chrdevs and blkdevs tables are initially empty. The register_chrdev( ) and register_blkdev( ) functions are used to insert a new entry into one of the tables, while unregister_chrdev( ) and unregister_blkdev( ) are used to remove an entry.

As an example, the descriptor for the parallel printer driver class is inserted in the chrdevs table as follows:

    register_chrdev(6, "lp", &lp_fops);

The first parameter denotes the major number, the second denotes the device class name, and the last is a pointer to the table of file operations.

If a device driver is statically included in the kernel, the corresponding device file class is registered during system initialization. However, if a device driver is dynamically loaded as a module (see Appendix B), the corresponding device file class is registered when the module is loaded and unregistered when the module is unloaded.

13.2.2.2 Opening a device file

We discussed in Section 12.5.1 in Chapter 12 how files are opened. Let us suppose that a process opens a device file. The VFS initializes, if necessary, the file object, the dentry object, and the inode object that refer to the device file. In particular, if the inode object does not already exist, the VFS invokes the read_inode method of the proper superblock object to retrieve the file information from disk. In doing so, the method records the device major and minor numbers in the i_rdev field of the inode object and the device file type in the i_mode field (S_IFCHR for character device files or S_IFBLK for block device files). Moreover, it installs a pointer to the appropriate inode operations as follows:


    if ((inode->i_mode & 00170000) == S_IFCHR)
        inode->i_op = &chrdev_inode_operations;
    else if ((inode->i_mode & 00170000) == S_IFBLK)
        inode->i_op = &blkdev_inode_operations;

All fields of the chrdev_inode_operations and blkdev_inode_operations tables are null except for the default_file_ops fields, which point to the def_chr_fops table and to the def_blk_fops table, respectively. All the methods of def_chr_fops and def_blk_fops in turn are null except for the open methods, which point to the chrdev_open( ) function and to the blkdev_open( ) function, respectively.

The filp_open( ) function fills the new file object and in particular initializes the f_op field with the contents of the i_op->default_file_ops field of the inode object. As a consequence, the file operation table will be def_chr_fops or def_blk_fops. Then filp_open( ) invokes the open method, thus executing either chrdev_open( ) or blkdev_open( ). These functions essentially perform three operations:

1. Derive the major number of the device driver from the i_rdev field of the inode object:

    major = MAJOR(inode->i_rdev);

2. Install the proper file operations for the device file:

    filp->f_op = chrdevs[major].fops;

   (The example, of course, is for character device files; blkdev_open( ) uses the blkdevs table instead.)

3. Invoke, if defined, the open method of the file operations table:

    if (filp->f_op != NULL && filp->f_op->open != NULL)
        return filp->f_op->open(inode, filp);

Notice that the final invocation of the open( ) method does not cause recursion, since now the field contains the address of a device-dependent function whose job is to set up the device. Typically, that function performs the following operations:

1. If the device driver is included in a kernel module, increments its usage counter, so that it cannot be unloaded until the device file is closed. (Appendix B describes how users can load and unload modules.)
2. If the device driver handles several devices of the same kind, selects the proper one by making use of the minor number and further specializes, if needed, the table of file operations.
3. Checks whether the device really exists and is currently working.
4. If necessary, sends an initialization command sequence to the hardware device.
5. Initializes the data structures of the device driver.
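The registration table and the open-time dispatch described in this section can be imitated in ordinary user-space C. The sketch below is a simplified model, not kernel code: every identifier carries a sim_ prefix to mark it as hypothetical, the file object is reduced to two fields, the open method takes a single argument, and the return values (-1 for failure, 0 when a driver defines no open method) are assumptions made for the illustration.

```c
#include <stddef.h>

#define SIM_MAX_CHRDEV 255  /* table size given in the text */

struct sim_file;

/* Reduced file operation table: only the open method is modeled. */
struct sim_file_operations {
    int (*open)(struct sim_file *filp);
};

/* One descriptor per device class, as in device_struct. */
struct sim_device_struct {
    const char *name;
    const struct sim_file_operations *fops;
};

/* Reduced file object: device number plus the installed operations. */
struct sim_file {
    unsigned short rdev;  /* major in the high byte, minor in the low byte */
    const struct sim_file_operations *f_op;
};

static struct sim_device_struct sim_chrdevs[SIM_MAX_CHRDEV];

/* Inserts a class descriptor; entry 0 always stays empty. */
static int sim_register_chrdev(unsigned major, const char *name,
                               const struct sim_file_operations *fops)
{
    if (major == 0 || major >= SIM_MAX_CHRDEV || sim_chrdevs[major].fops != NULL)
        return -1;
    sim_chrdevs[major].name = name;
    sim_chrdevs[major].fops = fops;
    return 0;
}

/* Removes a class descriptor. */
static int sim_unregister_chrdev(unsigned major)
{
    if (major == 0 || major >= SIM_MAX_CHRDEV)
        return -1;
    sim_chrdevs[major].name = NULL;
    sim_chrdevs[major].fops = NULL;
    return 0;
}

/* Generic open: derive the major number, install the driver's file
 * operations, then invoke the driver's own open method if defined. */
static int sim_chrdev_open(struct sim_file *filp)
{
    unsigned major = filp->rdev >> 8;
    if (major >= SIM_MAX_CHRDEV)
        return -1;
    filp->f_op = sim_chrdevs[major].fops;
    if (filp->f_op == NULL)
        return -1;  /* no driver registered for this major number */
    if (filp->f_op->open != NULL)
        return filp->f_op->open(filp);
    return 0;
}

/* Example driver open method: selects the device by minor number. */
static int sim_lp_open(struct sim_file *filp)
{
    return filp->rdev & 0xff;
}

static const struct sim_file_operations sim_lp_fops = { .open = sim_lp_open };
```

After sim_register_chrdev(6, "lp", &sim_lp_fops), opening a simulated file whose rdev encodes major 6 and minor 1 reaches sim_lp_open( ), which returns 1; once the class is unregistered, the same open fails. The second call into an open method does not recurse because f_op has been replaced by the driver's table before the call is made.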


13.3 Device Drivers

We have seen that the VFS uses a canonical set of common functions (open, read, lseek, and so on) to control a device. The actual implementation of all these functions is delegated to the device driver. Since each device has a unique I/O controller, and thus unique commands and unique state information, most I/O devices have their own drivers.

We shall not attempt to describe any of the hundreds of existing device drivers but concentrate rather on how the kernel supports them. In doing so, we shall describe several I/O architecture features that must be taken into consideration by device driver programmers.

13.3.1 Level of Kernel Support

The kernel can support access to hardware devices in three possible ways:

No support at all
    The application program interacts directly with the device's I/O ports by issuing suitable in and out assembly language instructions.

Minimal support
    The kernel does not recognize the hardware device but only its I/O interface. User programs are able to treat the interface as a sequential device capable of reading and/or writing sequences of characters.

Extended support
    The kernel recognizes the hardware device and handles the I/O interface itself. In fact, there might not even be a device file for the device.

The most common example of the first approach, which does not rely on any kernel device driver, is how the X Window System handles the graphic display. The approach is quite efficient, although it restrains the X server from making use of the hardware interrupts issued by the I/O device. This approach also requires some additional effort in order to allow the X server to access the required I/O ports. As mentioned in Section 3.2.2 in Chapter 3, the iopl( ) and ioperm( ) system calls grant a process the privilege to access I/O ports. They can be invoked only by programs having root privileges. But such programs can be made available to users by setting the setuid flag of the executable file, so that it runs with the UID of its owner, the superuser (UID 0) (see Section 19.1.1 in Chapter 19).

The minimal support approach is used to handle external hardware devices connected to a general-purpose I/O interface. The kernel takes care of the I/O interface by offering a device file (and thus a device driver); the application program handles the external hardware device by reading and writing the device file.

Minimal support is preferable to extended support because it keeps the kernel size small. However, among the general-purpose I/O interfaces commonly found on a PC, only the serial port is handled with this approach. Thus, a serial mouse is directly controlled by an
application program, like the X server, and a serial modem always requires a communication program like Minicom, Seyon, or a PPP (Point-to-Point Protocol) daemon.

Minimal support has a limited range of applications because it cannot be used when the external device must interact heavily with internal kernel data structures. As an example, consider a removable hard disk that is connected to a general-purpose I/O interface. An application program cannot interact with all kernel data structures and functions needed to recognize the disk and to mount its filesystem, so extended support is mandatory in this case.

In general, any hardware device directly connected to the I/O bus, such as the internal hard disk, is handled according to the extended support approach: the kernel must provide a device driver for each such device. External devices attached to the parallel port, the Universal Serial Bus (USB), the PCMCIA port found in many laptops, or the SCSI interface (in short, any general-purpose I/O interface except the serial port) also require extended support.

It is worth noting that the standard file-related system calls like open( ), read( ), and write( ) do not always give the application full control of the underlying hardware device. In fact, the lowest-common-denominator approach of the VFS does not include room for special commands that some devices need, nor does it let an application check whether the device is in some specific internal state.

The POSIX ioctl( ) system call has been introduced to satisfy such needs. Besides the file descriptor of the device file and a second 32-bit parameter specifying the request, the system call can accept an arbitrary number of additional parameters. For example, specific ioctl( ) requests exist to get the CD-ROM sound volume or to eject the CD-ROM media. Application programs may simulate the user interface of a CD player using these kinds of ioctl( ) requests.

13.3.2 Monitoring I/O Operations

The duration of an I/O operation is often unpredictable. It can depend on mechanical considerations (the current position of a disk head with respect to the block to be transferred), on truly random events (when a data packet will arrive on the network card), or on human factors (when a user will press a key on the keyboard or when she will notice that a paper jam occurred in the printer). In any case, the device driver that started an I/O operation must rely on a monitoring technique that signals either the termination of the I/O operation or a time-out.

In the case of a terminated operation, the device driver reads the status register of the I/O interface to determine if the I/O operation was carried out successfully. In the case of a time-out, the driver knows that something went wrong, since the maximum time interval allowed to complete the operation elapsed and nothing happened.

The two techniques available to monitor the end of an I/O operation are called the polling mode and the interrupt mode.

13.3.2.1 Polling mode

According to this technique, the CPU checks (polls) the device's status register repeatedly until its value signals that the I/O operation has been completed. We have already encountered
a technique based on polling in Section 11.4.2 in Chapter 11: when a processor tries to acquire a busy spin lock, it repeatedly polls the variable until its value becomes 0. However, polling applied to I/O operations is usually more elaborate, since the delays involved may be huge and the driver must remember to check for possible time-outs, too. In order to avoid wasting precious machine cycles, device drivers voluntarily relinquish the CPU after each polling operation so that other runnable processes can continue their execution:

    for (;;) {
        /* Stop as soon as the device reports completion. */
        if (read_status(device) & DEVICE_END_OPERATION)
            break;
        schedule( );          /* let other runnable processes execute */
        if (--count == 0)     /* rough time-out: give up after count polls */
            break;
    }

The count variable, which was initialized before entering the loop, is decremented at each iteration, and thus can be used to implement a rough time-out mechanism. Alternatively, a more precise time-out mechanism could be implemented by reading the value of the tick counter jiffies at each iteration (see Section 5.3 in Chapter 5) and comparing it with the old value read before starting the wait loop.

13.3.2.2 Interrupt mode

Interrupt mode can be used only if the I/O controller is capable of signaling, via an IRQ line, the end of an I/O operation. The device driver starts the I/O operation and invokes interruptible_sleep_on( ) or sleep_on( ), passing as the parameter a pointer to the I/O device wait queue.

When the interrupt occurs, the interrupt handler invokes wake_up( ) to wake up all processes sleeping in the device wait queue. The awakened device driver can thus check the result of the I/O operation.

Time-out control is implemented through static or dynamic timers (see Chapter 5); the timer must be set to the right time before starting the I/O operation and removed when the operation terminates.

13.3.3 Accessing I/O Ports

The in, out, ins, and outs assembly language instructions access I/O ports. The following auxiliary functions are included in the kernel to simplify such accesses:

inb( ), inw( ), inl( )
    Read 1, 2, or 4 consecutive bytes, respectively, from an I/O port. The suffix "b," "w," or "l" refers, respectively, to a byte (8 bits), a word (16 bits), and a long (32 bits).

inb_p( ), inw_p( ), inl_p( )
    Read 1, 2, or 4 consecutive bytes, respectively, from an I/O port and then execute a "dummy" instruction to introduce a pause.
outb( ), outw( ), outl( )
    Write 1, 2, or 4 consecutive bytes, respectively, to an I/O port.

outb_p( ), outw_p( ), outl_p( )
    Write 1, 2, or 4 consecutive bytes, respectively, to an I/O port and then execute a "dummy" instruction to introduce a pause.

insb( ), insw( ), insl( )
    Read sequences of consecutive bytes, in groups of 1, 2, or 4, respectively, from an I/O port. The length of the sequence is specified as a parameter of the functions.

outsb( ), outsw( ), outsl( )
    Write sequences of consecutive bytes, in groups of 1, 2, or 4, respectively, to an I/O port.

While accessing I/O ports is simple, detecting which I/O ports have been assigned to I/O devices may not be, in particular for systems based on an ISA bus. Often a device driver must blindly write into some I/O port in order to probe the hardware device; if, however, this I/O port is already used by some other hardware device, a system crash could occur. In order to prevent such situations, the kernel keeps track of I/O ports assigned to each hardware device by means of the iotable table. Any device driver may thus use the following three functions:

request_region( )
    Assigns a given interval of I/O ports to an I/O device

check_region( )
    Checks whether a given interval of I/O ports is free or whether some of them have already been assigned to some I/O device

release_region( )
    Releases a given interval of I/O ports previously assigned to an I/O device

The I/O addresses currently assigned to I/O devices can be obtained from the /proc/ioports file.

13.3.4 Requesting an IRQ

We have seen in Section 4.6.7 in Chapter 4 that the assignment of IRQs to devices is usually made dynamically, right before using them, since several devices may share the same IRQ line. To make sure the IRQ is obtained when needed but not requested in a redundant manner when it is already in use, device drivers usually adopt the following schema:


• A usage counter keeps track of the number of processes that are currently accessing the device file. The counter is incremented in the open method of the device file and decremented in the release method. [4]
• The open method checks the value of the usage counter before the increment. If the counter is 0, the device driver must allocate the IRQ and enable interrupts on the hardware device. Therefore, it invokes request_irq( ) and configures the I/O controller properly.
• The release method checks the value of the usage counter after the decrement. If the counter is 0, no more processes are using the hardware device, so the method invokes free_irq( ), thus releasing the IRQ line, and disables interrupts on the I/O controller.

[4] More precisely, the usage counter keeps track of the number of file objects referring to the device file, since clone processes could share the same file object.

13.3.5 Putting DMA to Work

As mentioned in Section 13.1.4, several I/O drivers make use of the Direct Memory Access Controller (DMAC) to speed up operations. The DMAC interacts with the device's I/O controller to perform a data transfer; as we shall see later, the kernel includes an easy-to-use set of routines to program the DMAC. The I/O controller signals to the CPU, via an IRQ, when the data transfer has finished.

When a device driver sets up a DMA operation for some I/O device, it must specify the memory buffer involved by using bus addresses. The kernel provides the virt_to_bus and bus_to_virt macros, respectively, to translate a linear address into a bus address and vice versa.

As with IRQ lines, the DMAC is a resource that must be assigned dynamically to the drivers that need it. The way the driver starts and ends DMA operations depends on the type of bus.

13.3.5.1 DMA for ISA bus

Each ISA DMAC can control a limited number of channels. Each channel includes an independent set of internal registers, so that the DMAC can control several data transfers at the same time.

Device drivers normally reserve and release the ISA DMAC in the following manner. As usual, the device driver relies on a usage counter to detect when a device file is no longer accessed by any process. The driver performs the following:

• In the open method of the device file, increments the device's usage counter. If the previous value was 0, the driver performs the following operations:
    1. Invokes request_irq( ) to allocate the IRQ line used by the ISA DMAC
    2. Invokes request_dma( ) to allocate a DMA channel
    3. Notifies the hardware device that it should use DMA and issue interrupts
    4. Allocates, if necessary, a storage area for the DMA buffer
• When the DMA operation must be started, performs the following operations in the proper methods of the device file (typically, read and write):
    1. Invokes set_dma_mode( ) to set the channel to read or write mode.
    2. Invokes set_dma_addr( ) to set the bus address of the DMA buffer. (Only the 24 least-significant bits of the address are sent to the DMAC, so the buffer must be included in the first 16 MB of RAM.)
    3. Invokes set_dma_count( ) to set the number of bytes to be transferred.
    4. Invokes enable_dma( ) to enable the DMA channel.
    5. Puts the current process in the device's wait queue and suspends it. When the DMAC terminates the transfer operation, the device's I/O controller issues an interrupt and the corresponding interrupt handler wakes up the sleeping process.
    6. Once awakened, invokes disable_dma( ) to disable the DMA channel.
    7. Invokes get_dma_residue( ) to check whether all bytes have been transferred.
• In the release method of the device file, decrements the device's usage counter. If it becomes 0, executes the following operations:
    1. Disables the DMA and the corresponding interrupt on the hardware device
    2. Invokes free_dma( ) to release the DMA channel
    3. Invokes free_irq( ) to release the IRQ line used for DMA

13.3.5.2 DMA for PCI bus

Making use of DMA for a PCI bus is much simpler, since the DMAC is somewhat integrated into the I/O interface. As usual, in the open method, the device driver must allocate the IRQ line used for signaling the termination of the DMA operation. However, there is no need to allocate a DMA channel, since each hardware device directly controls the electrical signals of the PCI bus. To start a DMA operation, the device driver simply writes in some I/O port of the hardware device the bus address of the DMA buffer, the transfer direction, and the size of the data; the driver then suspends the current process. The release method releases the IRQ line when the file object is closed by the last process.

13.3.6 Device Controller's Local Memory

Several hardware devices include their own memory, which is often called I/O shared memory. For instance, all recent graphic cards include a few megabytes of RAM called a frame buffer, which is used to store the screen image to be displayed on the monitor.

13.3.6.1 Mapping addresses

Depending on the device and on the bus type, I/O shared memory in the PC's architecture may be mapped within three different physical address ranges:

For most devices connected to the ISA bus
    The I/O shared memory is usually mapped into the physical addresses ranging from 0xa0000 to 0xfffff; this gives rise to the "hole" between 640 KB and 1 MB mentioned in Section 2.5.3 of Chapter 2.

For some old devices using the VESA Local Bus (VLB)
    This is a specialized bus mainly used by graphic cards: the I/O shared memory is mapped into the physical addresses ranging from 0xe00000 to 0xffffff, that is
    between 14 MB and 16 MB. These devices, which further complicate the initialization of the paging tables, are going out of production.

For devices connected to the PCI bus
    The I/O shared memory is mapped into very large physical addresses, well above the end of RAM's physical addresses. This kind of device is much simpler to handle.

13.3.6.2 Accessing the I/O shared memory

How does the kernel access an I/O shared memory location? Let's start with the PC's architecture, which is relatively simple to handle, and then extend the discussion to other architectures.

Remember that kernel programs act on linear addresses, so the I/O shared memory locations must be expressed as addresses greater than PAGE_OFFSET. In the following discussion, we assume that PAGE_OFFSET is equal to 0xc0000000, that is, that the kernel linear addresses are in the fourth gigabyte.

Kernel drivers must translate I/O physical addresses of I/O shared memory locations into linear addresses in kernel space. In the PC architecture, this can be achieved simply by ORing the 32-bit physical address with the 0xc0000000 constant. For instance, suppose the kernel needs to store in t1 the value in the I/O location at physical address 0x000b0fe4 and in t2 the value in the I/O location at physical address 0xfc000000. One might think that the following statements could do the job:

    t1 = *((unsigned char *)(0xc00b0fe4));
    t2 = *((unsigned char *)(0xfc000000));

During the initialization phase, the kernel has mapped the available RAM's physical addresses into the initial portion of the fourth gigabyte of the linear address space. Therefore, the Paging Unit maps the 0xc00b0fe4 linear address appearing in the first statement back to the original I/O physical address 0x000b0fe4, which falls inside the "ISA hole" between 640 KB and 1 MB (see Section 2.5 in Chapter 2). This works fine.

There is a problem, however, for the second statement because the I/O physical address is greater than the last physical address of the system RAM. Therefore, the 0xfc000000 linear address does not necessarily correspond to the 0xfc000000 physical address. In such cases, the kernel page tables must be modified in order to include a linear address that maps the I/O physical address: this can be done by invoking the ioremap( ) function. This function, which is similar to vmalloc( ), invokes get_vm_area( ) to create a new vm_struct descriptor (see Section 6.3.2 in Chapter 6) for a linear address interval having the size of the required I/O shared memory area. The ioremap( ) function then properly updates the corresponding page table entries of all processes.

The correct form for the second statement might therefore look like:

    io_mem = ioremap(0xfb000000, 0x200000);
    t2 = *((unsigned char *)(io_mem + 0x100000));


The first statement creates a new 2 MB linear address interval that maps the physical addresses starting from 0xfb000000; the second one reads the memory location having the physical address 0xfc000000. To remove the mapping later, the device driver must use the iounmap( ) function.

Now let's consider architectures other than the PC. In this case, adding to an I/O physical address the 0xc0000000 constant to obtain the corresponding linear address does not always work. In order to improve kernel portability, Linux therefore includes the following macros to access the I/O shared memory:

readb, readw, readl
    Read 1, 2, or 4 bytes, respectively, from an I/O shared memory location

writeb, writew, writel
    Write 1, 2, or 4 bytes, respectively, into an I/O shared memory location

memcpy_fromio, memcpy_toio
    Copy a block of data from an I/O shared memory location to dynamic memory and vice versa

memset_io
    Fills an I/O shared memory area with a fixed value

The recommended way to access the 0xfc000000 I/O location is thus:

    io_mem = ioremap(0xfb000000, 0x200000);
    t2 = readb(io_mem + 0x100000);

Thanks to these macros, all dependencies on platform-specific ways of accessing the I/O shared memory can be hidden.

13.4 Character Device Handling

Handling a character device is relatively easy, since no data buffering is needed and no disk caches are involved. Of course, character devices differ in their requirements: some of them must implement a sophisticated communication protocol to drive the hardware device, while others just have to read a few values from a couple of I/O ports of the hardware devices. For instance, the device driver of a multiport serial card device (a hardware device offering many serial ports) is much more complicated than the device driver of a bus mouse.

Let's briefly sketch out the functioning of a very simple character device driver, namely the driver of the Logitech bus mouse. It is associated with the /dev/logibm character device file, which has major number 10 and minor number 0.

Suppose that a process opens the /dev/logibm file; as explained in Section 13.2.2 earlier in this chapter, the VFS ends up invoking the open method of the device file operations common to all character devices having major number 10. This device class covers a series of
heterogeneous devices, and hence the method, a function called misc_open( ), installs a yet more specialized set of file operations according to the device's minor number. As the final result, the field f_op of the file object points to the bus_mouse_fops table, and the open_mouse( ) function is invoked. This function performs the following operations:

1. Checks whether the bus mouse is connected.
2. Requests the IRQ line used by the bus mouse, that is IRQ5, and registers the mouse_interrupt( ) Interrupt Service Routine.
3. Initializes a small mouse data structure of type mouse_status, which stores the information about the status of the bus mouse. This status information includes which buttons are pressed, along with the horizontal and vertical displacements of the mouse pointer after the last read of the device file.
4. Writes a suitable value into the 0x23e control register to enable bus mouse interrupts (the Logitech bus mouse uses I/O ports from 0x23c to 0x23f).

The mouse data structure is filled asynchronously: every time the user changes the mouse position or presses a mouse button, the mouse controller generates an interrupt, and hence the mouse_interrupt( ) function is activated. It performs the following operations:

1. Asks the bus mouse device about its state by writing suitable commands in the 0x23e control register and reading the corresponding values from the 0x23c input register.
2. Updates the mouse data structure.
3. Writes the value 0 in the 0x23e control register to reenable bus mouse interrupts (they are automatically disabled by the bus mouse device each time one of them occurs).

The process must read the /dev/logibm file to get the mouse status. Each read( ) system call ends up invoking the read_mouse( ) function associated with the read method of the file operations. It performs the following operations:

1. Checks that the process requested at least 3 bytes and returns -EINVAL otherwise.
2. Checks whether the mouse status has changed after the last read operation of /dev/logibm; if not, returns -EAGAIN.
3. Invokes disable_irq( ) to disable interrupt handling of IRQ5, and reads the values stored in the mouse data structure; then reenables interrupt handling of IRQ5 by invoking enable_irq( ).
4. Writes into the User Mode buffer 3 bytes representing the mouse status (buttons status, horizontal and vertical displacements) after the last read operation.
5. If the process requested more than 3 bytes, fills the rest of the User Mode buffer with zeros.
6. Returns the number of written bytes.

13.5 Block Device Handling

Typical block devices like hard disks have very high average access times. Each operation requires several milliseconds to complete, mainly because the hard disk controller must move the heads on the disk surface to reach the exact position where the data is recorded. However, when the heads are correctly placed, data transfer can be sustained at rates of tens of megabytes per second.
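A small calculation makes the weight of the access time concrete. The sketch below estimates the effective throughput of a single disk operation that first pays a fixed access time and then moves data at the sustained rate. The figures in print_examples( ) (a 10 ms access time and a 20 MB/s sustained rate) are illustrative assumptions, not measurements of any particular disk, and both function names are hypothetical.

```c
#include <stdio.h>

/*
 * Effective throughput, in bytes per second, of one disk operation that
 * pays a fixed access time and then transfers `bytes` bytes at the
 * sustained transfer rate.
 */
double effective_rate(double bytes, double access_sec, double rate_bytes_per_sec)
{
    double total_sec = access_sec + bytes / rate_bytes_per_sec;

    return bytes / total_sec;
}

/* Prints the effective rate for a few transfer sizes (assumed figures). */
void print_examples(void)
{
    const double access_sec = 0.010;                 /* assumed: 10 ms */
    const double rate = 20.0 * 1024 * 1024;          /* assumed: 20 MB/s */
    const double sizes[] = { 512.0, 4096.0, 1024.0 * 1024 };
    int i;

    for (i = 0; i < 3; i++)
        printf("%8.0f bytes -> %10.0f bytes/s\n",
               sizes[i], effective_rate(sizes[i], access_sec, rate));
}
```

With these assumed figures, a 512-byte transfer stays well under 100 KB/s because almost all of the time is spent positioning the heads, while a 1 MB transfer recovers most of the sustained rate.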


In order to achieve acceptable performance, hard disks and similar devices transfer several adjacent bytes at once. In the following discussion, we'll say that groups of bytes are adjacent when they are recorded on the disk surface in such a manner that a single seek operation can access them.

The organization of Linux block device handlers is quite involved. We won't be able to discuss in detail all the functions that have been included in the kernel to support the handlers. But we'll outline the general software architecture and introduce the main data structures. Kernel support for block device handlers includes the following features:

• Offers a uniform interface through the VFS
• Implements efficient read-ahead of disk data
• Provides disk caching for the data

The kernel basically distinguishes two kinds of I/O data transfer:

Buffer I/O operations
    Here the transferred data is kept in buffers, the kernel's generic memory containers for disk-based data. Each buffer is associated with a specific block identified by a device number and a block number. Linux misleadingly calls these operations "synchronous I/O operations." The term "synchronous" is not well-suited in this context because a buffer I/O operation is really asynchronous: in other words, the kernel control path that starts the operation may continue its execution without waiting for the operation to end. The term is probably inherited from very old versions of Linux.

Page I/O operations
    Here the transferred data is kept in page frames; each page frame contains data belonging to a regular file. Since this data is not necessarily stored in adjacent disk blocks, it is identified by the file's inode and by an offset within the file. Again, Linux inappropriately calls these operations "asynchronous I/O operations."

Buffer I/O operations are most often used either when a process directly reads a block device file or when the kernel reads particular types of blocks in a filesystem (for example, a block containing inodes or a superblock). In Linux 2.2, buffer operations are also used to write disk-based regular files. Page I/O operations are used mainly for reading regular files, file memory mapping, and swapping. Both kinds of I/O data transfer rely on the same driver to access a block device, but the kernel uses different algorithms and buffering techniques with them.

13.5.1 Sectors, Blocks, and Buffers

Each data transfer operation for a block device acts on a group of adjacent bytes called a sector. In most disk devices, the size of a sector is 512 bytes, although a few devices have recently appeared that make use of larger sectors (1024 and 2048 bytes). Notice that the sector should be considered the basic unit of data transfer: it is never possible to transfer less than a sector, although most disk devices are capable of transferring several adjacent sectors at once.

The kernel stores the sector size of each hardware block device in a table named hardsect_size. Each element in the table is indexed by the major number and the minor
<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>number of the corr<strong>es</strong>ponding block device file. Thus, hardsect_size[3][2] repr<strong>es</strong>ents th<strong>es</strong>ector size of /dev/hda2, the second primary partition of the first IDE disk (see Table 13-1).If hardsect_size[M] is NULL, all block devic<strong>es</strong> sharing the major number M have a standardsector size of 512 byt<strong>es</strong>.Block device drivers transfer a large number of adjacent byt<strong>es</strong> called a block in a singleoperation. A block should not be confused with a sector: the sector is the basic unit of datatransfer for the hardware device, while the block is simply a group of adjacent byt<strong>es</strong> involvedin an I/O operation requ<strong>es</strong>ted by a device driver.In <strong>Linux</strong>, the block size must be a power of 2 and cannot be larger than a page frame.Moreover, it must be a multiple of the sector size, since each block must include an integralnumber of sectors. <strong>The</strong>refore, on PC architecture, the permitted block siz<strong>es</strong> are 512, 1024,2048, and 4096 byt<strong>es</strong>. <strong>The</strong> same block device driver may operate with several block siz<strong>es</strong>,since it has to handle a set of device fil<strong>es</strong> sharing the same major number, while each blockdevice file has its own predefined block size. For instance, a block device driver could handlea hard disk with two partitions containing an Ext2 fil<strong>es</strong>ystem and a swap area (see Chapter 16,and Chapter 17). 
In this case, the device driver makes use of two different block sizes: 1024 bytes for the Ext2 partition [5] and 4096 bytes for the swap partition.

[5] 1024 is the standard Ext2 block size, although other block sizes are allowed.

The kernel stores the block size in a table named blksize_size; each element in the table is indexed by the major number and the minor number of the corresponding block device file. If blksize_size[M] is NULL, all block devices sharing the major number M have a standard block size of 1024 bytes.

Each block requires its own buffer, which is a RAM memory area used by the kernel to store the block's content. When a device driver reads a block from disk, it fills the corresponding buffer with the values obtained from the hardware device; similarly, when a device driver writes a block on disk, it updates the corresponding group of adjacent bytes on the hardware device with the actual values of the associated buffer. The size of a buffer always matches the size of the corresponding block.

13.5.2 An Overview of Buffer I/O Operations

Figure 13-3 illustrates the architecture of a generic block device driver and the main components that interact with it when servicing a buffer I/O operation.


Figure 13-3. Block device handler architecture for buffer I/O operations

A block device driver is usually split into two parts: a high-level driver, which interfaces with the VFS layer, and a low-level driver, which handles the hardware device.

Suppose a process issues a read( ) or write( ) system call on a device file. The VFS executes the read or write method of the corresponding file object, and thus invokes a procedure within the high-level block device handler. This procedure performs all actions related to the read or write request that are specific to the hardware device. The kernel offers two general functions called block_read( ) and block_write( ) that take care of almost everything (see Section 13.5.4 later in this chapter). Therefore, in most cases, the high-level hardware device drivers need do nothing, and the read and write methods of the device file point, respectively, to block_read( ) and block_write( ).

However, some block device handlers require their own customized high-level device drivers. A significant example is the device driver of the floppy disk: it must check that the disk in the drive has not been changed by the user since the last disk access; if a new disk has been inserted, the device driver must invalidate all buffers already filled with data of the old disk media.

Even when a high-level device driver includes its own read and write methods, they usually end up invoking block_read( ) and block_write( ).
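The delegation just described can be sketched in user-space C. This is only a minimal illustration under stated assumptions, not kernel code: the demo_file and demo_file_ops types, the media_changed and cache_invalidated fields, and the stubbed generic_block_read( ) function are hypothetical stand-ins for the file object, its method table, the floppy driver's disk-change check, and block_read( ).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct demo_file;

/* Hypothetical method table: only the read method is modeled. */
struct demo_file_ops {
    long (*read)(struct demo_file *filp, char *buf, size_t count, long *ppos);
};

/* Hypothetical, cut-down file object. */
struct demo_file {
    const struct demo_file_ops *f_op;
    long f_pos;
    int media_changed;      /* set when the user swapped the disk */
    int cache_invalidated;  /* counts invalidations, for illustration only */
};

/* Stub playing the role of the generic block_read( ): fills buf with 'x'. */
long generic_block_read(struct demo_file *filp, char *buf, size_t count, long *ppos)
{
    (void)filp;
    memset(buf, 'x', count);
    *ppos += (long)count;
    return (long)count;
}

/* Customized method: do the device-specific check, then delegate. */
long floppy_like_read(struct demo_file *filp, char *buf, size_t count, long *ppos)
{
    if (filp->media_changed) {
        filp->cache_invalidated++;   /* stands for invalidating old buffers */
        filp->media_changed = 0;
    }
    return generic_block_read(filp, buf, count, ppos);
}

/* Most drivers point the method straight at the generic function. */
const struct demo_file_ops plain_disk_ops = { .read = generic_block_read };
/* A removable-media driver installs its own wrapper instead. */
const struct demo_file_ops floppy_like_ops = { .read = floppy_like_read };

/* Conceptually what the VFS does: call through the method pointer. */
long demo_vfs_read(struct demo_file *filp, char *buf, size_t count)
{
    assert(filp != NULL && filp->f_op != NULL && filp->f_op->read != NULL);
    return filp->f_op->read(filp, buf, count, &filp->f_pos);
}
```

In this sketch the caller never needs to know which variant is installed: both tables expose the same read method, and the wrapper adds its device-specific step before handing the transfer to the generic routine.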
The block_read( ) and block_write( ) functions translate the access request involving an I/O device file into a request for some blocks from the corresponding hardware device. As we'll see in Section 14.1 in Chapter 14, the blocks required may already be in main memory, so both block_read( ) and block_write( ) invoke the getblk( ) function to check the cache first in case a block was prefetched or has stayed unchanged since an earlier access. If the block is not in the cache, getblk( ) allocates a new buffer for it, and the data must then be requested from the disk by invoking ll_rw_block( ) (see Section 13.5.9). This latter function activates a low-level driver that handles the device controller to perform the requested operation on the block device.

Buffer I/O operations are also triggered when the VFS accesses some specific block on a block device directly. For instance, if the kernel must read an inode from a disk filesystem, it must transfer the data from blocks of the corresponding disk partition. Direct access to specific blocks is performed by the bread( ) and breada( ) functions (see Section 13.5.5), which in turn invoke the getblk( ) and ll_rw_block( ) functions previously mentioned.

Since block devices are slow, buffer I/O data transfers are always handled asynchronously: the low-level device driver programs the DMAC and the disk controller and then terminates. When the transfer completes, an interrupt is issued, and the low-level device driver is activated a second time to clean up the data structures involved in the I/O operation. In this way, no kernel control path must be suspended until a data transfer completes (unless the kernel control path explicitly has to wait for some block of data).

13.5.3 The Role of Read-Ahead

Many disk accesses are sequential. As we shall see in Chapter 17, files are stored on disk in large groups of adjacent sectors, so that they can be retrieved quickly with few moves of the disk heads. When a program reads or copies a file, it usually accesses it sequentially, from the first byte to the last one. Therefore, many adjacent sectors on disk are likely to be fetched in several I/O operations.

Read-ahead is a technique that consists of reading several adjacent blocks of a block device in advance, before they are actually requested. In most cases, read-ahead significantly enhances disk performance, since it lets the disk controller handle fewer commands that refer to larger groups of adjacent sectors. Moreover, system responsiveness improves.
A process that is sequentially reading a block device can get the requested data faster because the driver performs fewer disk accesses.

However, read-ahead is of no use for random accesses to block devices; in that case, it is actually detrimental since it tends to waste space in the disk caches with useless information. Therefore, the kernel stops read-ahead when it determines that the most recently issued I/O access is not sequential to the previous one. The f_reada field of the file object is a flag that is set when read-ahead is enabled for the corresponding file (or block device file) and cleared otherwise.

The kernel stores in a table named read_ahead the amount of data (expressed as a number of standard 512-byte sectors) to be read in advance when a device file is being read sequentially. A value of zero specifies a default of eight 512-byte sectors, that is, 4 KB. All block device files having the same major number share the same predefined number of 512-byte sectors to be read in advance; therefore, each element in read_ahead is indexed by the major device number.

13.5.4 The block_read( ) and block_write( ) Functions

The block_read( ) and block_write( ) functions are invoked by a high-level device driver whenever a process issues a read or write operation on a device file. For example, the superformat program formats a diskette by writing blocks into the /dev/fd0 device file. The write method of the corresponding file object invokes the block_write( ) function.

The block_read( ) and block_write( ) functions receive the following parameters:


filp
    Address of a file object associated with the device file.

buf
    Address of a memory area in User Mode address space. block_read( ) writes the data fetched from the block device into this memory area; conversely, block_write( ) reads the data to be written on the block device from the memory area.

count
    Number of bytes to be transferred.

ppos
    Address of a variable containing an offset in the device file; usually, this parameter points to filp->f_pos, that is, to the file pointer of the device file.

The block_read( ) function performs the following operations:

1. Derives the major number and the minor number of the block device from filp->f_dentry->d_inode->i_rdev.
2. Derives the block size of the device file from blksize_size.
3. Computes from *ppos and the block size the sequential number of the first block to be read on the device. Also computes the offset of the first byte to be read inside that block.
4. Derives the size of the block hardware device. This value is stored in a table named blk_size. As with similar data structures introduced earlier in the chapter, each element is indexed by the major number and the minor number of the corresponding device file and represents the size of the block device in units of 1024 bytes. If necessary, modifies count in order to prevent any read operation from going beyond the end of the device.
5. Computes the number of blocks to be read from the device from a combination of count, the block size, and the offset inside the first block. If filp->f_reada is set, also takes into consideration the number of blocks to be read in advance, which is specified in the read_ahead table.
6. For any block to be read, performs the following operations:
   a. Searches for the block in the buffer cache by using the getblk( ) function (see Section 14.1 in Chapter 14). If it is not found, a new buffer is allocated and inserted into the cache.
   b. If the buffer does not contain valid data (for instance, because it has been allocated just now), starts a read operation by using the ll_rw_block( ) function (see Section 13.5.9), and suspends the current process until the data has been transferred into the buffer.
   c. If the block has been requested by the process, that is, if it is not read in advance, copies the buffer content into the user memory area pointed to by buf.
   Actually, the algorithm is more elaborate than what we've just explained, since it is optimized to make maximum use of the buffer cache. The function operates by requesting large groups of blocks from the low-level driver at once; it does not wait until all of them have been transferred before searching for the next group of blocks in the buffer cache. However, the final result is the same: after this step, all buffers of the blocks involved contain valid data, and the bytes requested by the user process are copied into the user memory area.
7. Adds to *ppos the number of bytes copied into the user memory area.
8. Sets the filp->f_reada flag, so that the read-ahead mechanism will be used next time (unless the process modifies the file pointer, in which case the flag is cleared).
9. Returns the number of bytes copied into the user memory area.

The block_write( ) function is similar to block_read( ), so we won't describe it in detail. However, some important differences should be underlined:

• Before starting the write operation, the block_write( ) function must check whether the block hardware device is read-only and, if so, return an error code. This happens, for example, when attempting to write on a block device file associated with a CD-ROM disk. The ro_bits table includes a bit for each block hardware device: a bit is set if the corresponding device cannot be written and cleared if it can be written.
• The block_write( ) function must check the offset of the first byte to be written inside the first block. If the offset is not zero and the buffer cache does not already contain valid data for the first block, the function must read the block from disk before rewriting it. In fact, since the block device driver operates on whole blocks, the portion of the first block that precedes the bytes being written must be preserved by the write operation. Similarly, the function must also read from disk the last block to be written before rewriting it, unless the last byte to be written falls in the last position of the last block.
• The block_write( ) function does not necessarily invoke ll_rw_block( ) to force a write to disk. Usually, it just marks the buffers of the blocks to be written as "dirty," thus deferring the actual updating of the corresponding sectors on disk (see Section 14.1.5 in Chapter 14). However, the function does invoke ll_rw_block( ) if the call opening the block device file has specified the O_SYNC flag. In this case, the calling process wants to wait (sleep) until the data has been physically written in the hardware device, so that the disk always reflects what the process thinks it contains.

13.5.5 The bread( ) and breada( ) Functions

The bread( ) function checks whether a specific block is already included in the buffer cache; if not, the function reads the block from a block device. bread( ) is widely used by filesystems to read from disk bitmaps, inodes, and other block-based data structures. (Recall that block_read( ) is used instead of bread( ) when a process wants to read a block device file.) The bread( ) function receives as parameters the device identifier, the block number, and the block size, and performs the following operations:

1. Invokes the getblk( ) function to search for the block in the buffer cache; if the block is not included in the cache, getblk( ) allocates a new buffer for it.
2. If the buffer already contains up-to-date data, terminates.
3. Invokes ll_rw_block( ) to start the read operation.


4. Waits until the data transfer completes. This is done by invoking a function named wait_on_buffer( ), which inserts the current process in the b_wait wait queue and suspends the process until the buffer is unlocked.

breada( ) is very similar to bread( ), but it also reads in advance some extra blocks in addition to the one required. Notice that there is no function that directly writes some block to disk. Write operations are never critical for system performance, and thus are always deferred (see Section 14.1.5 in Chapter 14).

13.5.6 Buffer Heads

The buffer head is a descriptor of type buffer_head associated with each buffer. It contains all the information needed by the kernel to know how to handle the buffer; thus, before operating on each buffer the kernel checks its buffer head.

The buffer head fields are listed in Table 13-2. The b_data field of each buffer head stores the starting address of the corresponding buffer. Since a page frame may store several buffers, the b_this_page field points to the buffer head of the next buffer in the page. This field facilitates the storage and retrieval of entire page frames. The b_blocknr field stores the logical block number, that is, the index of the block inside the disk partition.

Table 13-2. The Fields of a Buffer Head

Type                     Field          Description
unsigned long            b_blocknr      Logical block number
unsigned long            b_size         Block size
kdev_t                   b_dev          Virtual device identifier
kdev_t                   b_rdev         Real device identifier
unsigned long            b_rsector      Number of initial sector in real device
unsigned long            b_state        Buffer status flags
unsigned int             b_count        Block usage counter
char *                   b_data         Pointer to buffer
unsigned long            b_flushtime    Flushing time for buffer
struct wait_queue *      b_wait         Buffer wait queue
struct buffer_head *     b_next         Next item in collision hash list
struct buffer_head **    b_pprev        Previous item in collision hash list
struct buffer_head *     b_this_page    Per-page buffer list
struct buffer_head *     b_next_free    Next item in list
struct buffer_head *     b_prev_free    Previous item in list
unsigned int             b_list         LRU list including the buffer
struct buffer_head *     b_reqnext      Request's buffer list
void (*)( )              b_end_io       I/O completion method
void *                   b_dev_id       Specialized device driver data

The b_state field stores the following flags:

BH_Uptodate
    Set if the buffer contains valid data. The value of this flag is returned by the buffer_uptodate( ) function.


BH_Dirty
    Set if the buffer is dirty, that is, if it contains data that must be written to the block device. The value of this flag is returned by the buffer_dirty( ) function.

BH_Lock
    Set if the buffer is locked, which happens if the buffer is involved in a disk transfer. The value of this flag is returned by the buffer_locked( ) function.

BH_Req
    Set if the corresponding block has been requested (see next section) and has valid (up-to-date) data. The value of this flag is returned by the buffer_req( ) function.

BH_Protected
    Set if the buffer is protected (protected buffers never get freed). The value of this flag is returned by the buffer_protected( ) function. This flag is used only to implement RAM disks on top of the buffer cache.

The b_dev field identifies the virtual device containing the block stored in the buffer, while the b_rdev field identifies the real device. This distinction, which is meaningless for simple hard disks, has been introduced to model RAID (Redundant Array of Independent Disks) storage units consisting of several disks operating in parallel. For reasons of safety and efficiency, files stored in a RAID array are scattered across several disks that the applications think of as a single logical disk. Besides the b_blocknr field representing the logical block number, it is thus necessary to specify the specific disk unit in the b_rdev field, and the corresponding sector number in the b_rsector field.

13.5.7 Block Device Requests

Although block device drivers are able to transfer a single block at a time, the kernel does not perform an individual I/O operation for each block to be accessed on disk: this would lead to poor disk performance, since locating the physical position of a block on the disk surface is quite time-consuming. Instead, the kernel tries, whenever possible, to cluster several blocks and handle them as a whole, thus reducing the average number of head movements.

When a process, the VFS layer, or any other kernel component wishes to read or write a disk block, it actually creates a block device request. That request essentially describes the requested block and the kind of operation to be performed on it (read or write). However, the kernel does not satisfy a request as soon as it is created: the I/O operation is just scheduled and will be performed at a later time. This artificial delay is paradoxically the crucial mechanism for boosting the performance of block devices. When a new block data transfer is requested, the kernel checks whether it can be satisfied by slightly enlarging a previous request that is still waiting, that is, whether the new request can be satisfied without further seek operations. Since disks tend to be accessed sequentially, this simple mechanism is very effective.
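The "enlarge a waiting request" check can be sketched as follows. This is a simplified model under stated assumptions rather than the kernel's own code: the demo_request structure, the DEMO_READ and DEMO_WRITE constants, and the try_back_merge( ) function are hypothetical, and the sketch only considers appending a block at the end of a pending request on the same device and in the same transfer direction.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_READ  0
#define DEMO_WRITE 1

/* Hypothetical, simplified request: a run of adjacent sectors on one device. */
struct demo_request {
    int dev;                   /* device identifier */
    int cmd;                   /* DEMO_READ or DEMO_WRITE */
    unsigned long sector;      /* first sector of the run */
    unsigned long nr_sectors;  /* length of the run, in sectors */
};

/*
 * Try to satisfy a new block transfer by enlarging a pending request.
 * The merge succeeds only if the device and direction match and the new
 * block starts exactly where the pending run ends, so that no further
 * seek is needed. Returns 1 if merged, 0 if a separate request is needed.
 */
int try_back_merge(struct demo_request *pending, int dev, int cmd,
                   unsigned long sector, unsigned long count)
{
    assert(pending != NULL && count > 0);
    if (pending->dev != dev || pending->cmd != cmd)
        return 0;
    if (pending->sector + pending->nr_sectors != sector)
        return 0;
    pending->nr_sectors += count;
    return 1;
}
```

For a process reading a device sequentially with two-sector blocks, each new block begins at the sector where the previous one ended, so in this model successive blocks keep extending the same pending request instead of creating new ones.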


Deferring requests complicates block device handling. For instance, suppose that a process opens a regular file and, consequently, a filesystem driver wants to read the corresponding inode from disk. The high-level block device driver puts the request on a queue and the process is suspended until the block storing the inode is transferred. However, the high-level block device driver cannot be blocked, because any other process trying to access the same disk would be blocked as well.

In order to keep the block device driver from being suspended, each I/O operation is processed asynchronously, as already mentioned in Section 13.5.2. Thus, no kernel control path is forced to wait until a data transfer completes. In particular, block device drivers are interrupt-driven (see Section 13.3.2 earlier in this chapter), so that the high-level driver can terminate its execution as soon as it has issued the block request.
The low-level driver, which is activated at a later time, invokes a so-called strategy routine, which takes the request from a queue and satisfies it by issuing suitable commands to the disk controller. When the I/O operation terminates, the disk controller raises an interrupt and the corresponding handler invokes the strategy routine again, if necessary, to process another request in the queue.

Each block device driver maintains its own request queues; there should be one request queue for each physical block device, so that the requests can be ordered in such a way as to increase disk performance. The strategy routine can thus sequentially scan the queue and service all requests with the minimum number of head movements.

Each block device request is represented by a request descriptor, which is stored in the request data structure illustrated in Table 13-3. The direction of the data transfer is stored in the cmd field: it is either READ (from block device to RAM) or WRITE (from RAM to block device). The rq_status field is used to specify the status of the request: for most block devices, it is simply set either to RQ_INACTIVE (request descriptor not in use) or to RQ_ACTIVE (valid request, to be serviced or already being serviced by the low-level device driver).

Table 13-3. The Fields of a Request Descriptor

Type                     Field                Description
int                      rq_status            Request status
kdev_t                   rq_dev               Device identifier
int                      cmd                  Requested operation
int                      errors               Success or failure code
unsigned long            sector               First sector number
unsigned long            nr_sector            Number of sectors of request
unsigned long            current_nr_sector    Number of sectors of current block
char *                   buffer               Memory area for I/O transfer
struct semaphore *       sem                  Request semaphore
struct buffer_head *     bh                   First buffer descriptor
struct buffer_head *     bhtail               Last buffer descriptor
struct request *         next                 Request queue link

The request may encompass many adjacent blocks on the same device. The rq_dev field identifies the block device, while the sector field specifies the number of the first sector corresponding to the first block in the request. Both nr_sector and current_nr_sector specify the number of sectors to be transferred. As we'll later see in Section 13.5.10, the sector, nr_sector, and current_nr_sector fields could be dynamically updated while the request is being serviced.

All buffer heads of the blocks in the request are collected in a singly linked list. The b_reqnext field of each buffer head points to the next element in the list, while the bh and bhtail fields of the request descriptor point, respectively, to the first element and the last element in the list.

The buffer field of the request descriptor points to the memory area used for the actual data transfer. If the request involves a single block, buffer is just a copy of the b_data field of the buffer head. However, if the request encompasses several blocks whose buffers are not consecutive in memory, the buffers are linked through the b_reqnext fields of their buffer heads as shown in Figure 13-4. On a read, the low-level device driver could choose to allocate a large memory area referred to by buffer, read all sectors of the request at once, and then copy the data into the various buffers. Similarly, for a write, the low-level device driver could copy the data from many nonconsecutive buffers into a single memory area referred to by buffer and then perform the whole data transfer at once.

Figure 13-4. A request descriptor and its buffers and sectors

Figure 13-4 illustrates a request descriptor encompassing three blocks. The buffers of two of them are consecutive in RAM, while the third buffer is by itself.
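The list structure behind such a three-block request can be sketched in a few lines of C. The types below are hypothetical, cut-down versions of the two descriptors that keep only the fields discussed above (b_blocknr, b_data, b_reqnext, bh, bhtail, buffer); the helper names demo_request_add( ) and demo_request_blocks( ) are likewise invented for illustration. Pointing buffer at the first block's data is an assumption made for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down buffer head. */
struct demo_buffer_head {
    unsigned long b_blocknr;             /* logical block number */
    char *b_data;                        /* start of the buffer in RAM */
    struct demo_buffer_head *b_reqnext;  /* next buffer head in the request */
};

/* Hypothetical, cut-down request descriptor. */
struct demo_blk_request {
    struct demo_buffer_head *bh;      /* first buffer head in the list */
    struct demo_buffer_head *bhtail;  /* last buffer head in the list */
    char *buffer;                     /* memory area for the transfer */
};

/* Append a buffer head at the tail of the request's singly linked list. */
void demo_request_add(struct demo_blk_request *req, struct demo_buffer_head *bh)
{
    assert(req != NULL && bh != NULL);
    bh->b_reqnext = NULL;
    if (req->bh == NULL) {
        req->bh = bh;
        req->buffer = bh->b_data;  /* assumption: start from the first buffer */
    } else {
        req->bhtail->b_reqnext = bh;
    }
    req->bhtail = bh;
}

/* Count the blocks in the request by following the b_reqnext links. */
size_t demo_request_blocks(const struct demo_blk_request *req)
{
    size_t n = 0;
    const struct demo_buffer_head *bh;

    for (bh = req->bh; bh != NULL; bh = bh->b_reqnext)
        n++;
    return n;
}
```

Because the request keeps both bh and bhtail, appending a block at the end takes constant time, while the driver can still walk every buffer from the head when it services the request.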
In the figure, the corresponding buffer heads identify the logical blocks on the block device; the blocks must necessarily be adjacent. Each logical block includes two sectors. The sector field of the request descriptor points to the first sector of the first block on disk, and the b_reqnext field of each buffer head points to the next buffer head.

The kernel statically allocates a fixed number of request descriptors to handle all the requests for block devices: there are NR_REQUEST descriptors (usually 128) stored in the all_requests array. Since the efficiency of read operations has a larger impact on system performance than does the efficiency of write operations (because the data to be read is probably needed for some computation to progress), the last third of request descriptors in all_requests is reserved for read operations.

The fixed number of request descriptors may become, under very heavy loads and high disk activity, a bottleneck. A dearth of free descriptors may force processes to wait until an ongoing data transfer terminates. Thus, a wait_for_request wait queue is used to queue processes waiting for a free request element. The get_request_wait( ) function tries to get a free request descriptor and puts the current process to sleep in the wait queue if none is found; the get_request( ) function is similar but simply returns NULL if no free request descriptor is available.

13.5.8 Request Queues and Block Device Driver Descriptors

A request queue is a singly linked list whose elements are request descriptors.
The next field in each request descriptor points to the next item in the queue and is null for the last element. The list is usually ordered first according to the device identifier and next according to the number of the initial sector.

As mentioned earlier, device drivers usually have one request queue for each disk they serve. However, some device drivers have just one request queue that includes the requests for all physical devices handled by the driver. This approach simplifies the design of the driver but degrades overall performance, since no simple ordering strategy can be imposed on the queue.

The address of the request being serviced, together with a few other pieces of relevant information, is stored in a descriptor associated with each block device driver. The descriptor is a data structure of type blk_dev_struct, whose fields are listed in Table 13-4. The descriptors for all the block devices are stored in the blk_dev table, which is indexed by the major number of the block device.

Table 13-4. The Fields of a Block Device Driver Descriptor

Type                             Field              Description
void *(*)(void)                  request_fn         Strategy routine
void *                           data               Driver's private data
struct request                   plug               Dummy plug request
struct request *                 current_request    Current request in single common queue
struct request **(*)(kdev_t)     queue              Method for getting a request from one of the queues
struct tq_struct                 plug_tq            Plug task queue element

If the block device driver has a single request queue for all physical block devices, the queue field is null and the current_request field points to the descriptor of the request being serviced in the queue. If the queue is empty, current_request is null.

Conversely, if the block device driver maintains several queues, the queue field points to a custom driver method that receives the identifier of the block device file, selects one of the queues according to the device number, then returns the address of the descriptor of the request being serviced, if any. In this case, the current_request field points to the descriptor of the request being serviced, if any. (There can be at most one request at a time, since the same device driver does not allow requests to be processed concurrently even if they refer to different disks.)

The request_fn field contains the address of the driver's strategy routine, the crucial function in the low-level block device driver that actually interacts with the physical block device (usually the disk controller) in order to start the data transfer specified by a request in the queue.

13.5.9 The ll_rw_block( ) Function

The ll_rw_block( ) function creates a block device request; as we have seen earlier in this chapter, it is invoked from several places in the kernel and device drivers. It receives the following parameters:

• The type of operation, rw, whose value can be READ, WRITE, READA, or WRITEA. The last two operation types differ from the former in that the function does not block when no request descriptor is available.
• The number, nr, of blocks to be transferred.
• A bh array of nr pointers to buffer heads describing the blocks (all of them must have the same block size and must refer to the same block device).

The buffer heads have been previously initialized, so each specifies the block number, the block size, and the virtual device identifier (see Section 13.5.6). All blocks must belong to the same virtual device.

The function enters a loop considering all non-null elements of the bh array. For each buffer head, it performs the following actions:

1. Checks that the block size b_size matches the block size of the virtual device b_dev.
2. Sets the real device identifier (usually just sets b_rdev to be b_dev).
3. Sets the sector number b_rsector according to the block number and the block size.
4. If the operation is WRITE or WRITEA, checks that the block device is not read-only.
5. Sets the BH_Req flag of the buffer head to show other kernel control paths that the block has been requested.
6. Invokes the make_request( ) function, passing to it the real device's major number, the type of I/O operation, and the address of the buffer head.

The make_request( ) function, in turn, performs the following operations:

1. Sets the BH_Lock flag of the buffer head.
2. Checks that b_rsector does not exceed the number of sectors of the block device.
3. If the block must be read, checks that it is not already valid (that is, the BH_Uptodate flag must be off). If the block must be written, checks that it is actually dirty (that is, the BH_Dirty flag must be on). If either one of these conditions does not hold, returns without requesting the data transfer, because it would be useless.
4. Disables local interrupts and gets the io_request_lock spin lock (see Section 11.4.2 in Chapter 11).
5. Invokes the queue method, if defined, or reads the current_request field in the block device descriptor to get the address of the real device's request queue.
6. Performs one of the following substeps:


   a. If the request queue is empty, inserts a new request descriptor in it and schedules activation of the strategy routine at a later time.
   b. If the request queue is not empty, inserts a new request descriptor in it, trying to cluster it with other requests already queued. As we'll see shortly, there is no need to schedule the activation of the strategy routine.

Let's look closer at the last two substeps.

13.5.9.1 Scheduling the activation of the strategy routine

As we saw earlier, it's expedient to delay activation of the strategy routine in order to increase the chances of clustering requests for adjacent blocks. The delay is accomplished through a technique known as device plugging and unplugging.

If the real device's request queue is empty and the device is not already plugged, make_request( ) does a device plugging: it sets the current_request field of the block device driver descriptor to the address of a dummy request descriptor, namely, the plug field of the same block device driver descriptor. The function then allocates a new request descriptor and initializes it with the information read from the buffer head. Next, make_request( ) inserts the new request descriptor into the proper real device's request queue. If there is just one queue, the request is inserted into the queue right after the dummy element consisting of the plug field in the block device descriptor.
Finally, make_request( ) inserts the plug_tq task queue descriptor (statically included in the block device driver descriptor) in the tq_disk task queue (see Section 4.6.6 in Chapter 4) to cause the device's strategy routine to be activated later. Actually, the task queue element refers to the unplug_device( ) function, which executes the device's strategy routine.

The kernel checks periodically whether the tq_disk task queue contains any plug_tq task queue elements. This occurs in a kernel thread such as kswapd and bdflush or when the kernel must wait for some resource related to block device drivers, such as buffers or request descriptors. During the tq_disk check, the kernel removes any element in the queue and executes the corresponding unplug_device( ) function. This activity is referred to as unplugging the device.

13.5.9.2 Extending the request queue

If the request queue is not empty, the low-level block device driver keeps handling requests until the queue has been emptied (see the next section), so make_request( ) does not have to schedule the activation of the strategy routine.

In this case, make_request( ) just modifies the request queue by adding a new element or by merging the new request with existing elements; the second case is known as block clustering. Block clustering is implemented only for blocks belonging to certain block devices, namely the EIDE and SCSI hard disks, the floppy disk, and a few others.
Moreover, a block can be included in a request only if all the following conditions are satisfied:
• The block to be inserted belongs to the same block device as the other blocks in the request and is adjacent to them: it either immediately precedes the first block in the request or immediately follows the last block in the request.


• The blocks in the request have the same I/O operation type (READ or WRITE) as the block to be inserted.
• The extended request does not exceed the allowed maximum number of sectors. This value is stored in the max_sectors table, which is indexed by the major number and the minor number of the block device. The default value is 244 sectors.
• The request is not currently being handled by the low-level device driver.

The make_request( ) function scans all the requests in the queue. If one of them satisfies all the conditions just mentioned, the buffer head is inserted in the request's list, and the fields of the request data structure are updated. If the block was appended to the end of a request, the function also tries to merge this request with the next element of the queue.
Nothing else has to be done, and hence make_request( ) releases the io_request_lock spin lock and terminates.

Conversely, if no existing request can include the block, make_request( ) allocates a new request descriptor [6] and initializes it properly with the information read from the buffer head.

[6] If there is no free request descriptor, the current process is suspended until a request descriptor is freed.

Finally, make_request( ) invokes the add_request( ) function, which inserts the new request in the proper position in the request queue, according to its initial sector number. The io_request_lock spin lock is then released and the execution terminates.

13.5.10 Low-Level Request Handling

We have now reached the lowest level of Linux's block device-handling architecture: this level is implemented by the strategy routine, which interacts with the physical block device in order to satisfy the requests collected in the queue.

As mentioned earlier, the strategy routine is usually started after inserting a new request in an empty request queue.
Once activated, the low-level block device driver should handle all requests in the queue and terminate when the queue is empty.

A naive implementation of the strategy routine could be the following: for each element in the queue, interact with the block device controller to service the request and wait until the data transfer completes, then remove the serviced request from the queue and proceed with the next one.

Such an implementation is not very efficient. Even assuming that data can be transferred using DMA, the strategy routine must suspend itself while waiting for I/O completion, and hence an unrelated user process would be heavily penalized. (The strategy routine does not necessarily execute on behalf of the process that has requested the I/O operation but at some random later time, since it is activated by means of the tq_disk task queue.)

Therefore, many low-level block device drivers adopt the following schema:
• The strategy routine handles the current request in the queue and sets up the block device controller so that it raises an interrupt when the data transfer completes. Then the strategy routine terminates.


• When the block device controller raises the interrupt, the interrupt handler activates a bottom half. The bottom half handler removes the request from the queue and reexecutes the strategy routine to service the next request in the queue.

Basically, low-level block device drivers can be further classified into the following:
• Drivers that service each block in a request separately
• Drivers that service several blocks in a request together

Drivers of the second type are much more complicated to design and implement than drivers of the first type. Indeed, although the sectors are adjacent on the physical block devices, the buffers in RAM are not necessarily consecutive. Therefore, any such driver may have to allocate a temporary area for the DMA data transfer, then perform a memory-to-memory copy of the data between the temporary area and each buffer in the request's list.

Since clustered requests refer to adjacent blocks on disk, they improve the performance of both types of drivers because the requests may be serviced by issuing fewer seek commands. Transferring several blocks from disk at once is not as effective in boosting disk performance.

The kernel doesn't offer any support for the second type of drivers: they must handle the request queues and the buffer head lists on their own. The choice to leave the job up to the driver is not capricious or lazy.
Each physical block device is inherently different from all others (for example, a floppy driver groups blocks in disk tracks and transfers a whole track in a single I/O operation), so making general assumptions on how to service each clustered request would make very little sense.

However, the kernel offers a limited degree of support for the low-level block device drivers in the first class. So we'll spend a little more time on such drivers.

A typical strategy routine should perform the following actions:
1. Get the current request from a request queue. If all request queues are empty, terminate the routine.
2. Check that the current request has consistent information. In particular, compare the major number of the block device with the value stored in the rq_rdev field of the request descriptor. Moreover, check that the first buffer head in the list is locked (that is, the BH_Lock flag has been set by make_request( )).
3. Program the block device controller for the data transfer of the first block. The data transfer direction can be found in the cmd field of the request descriptor and the address of the buffer in the buffer field, while the initial sector number and the number of sectors to be transferred are stored in the sector and current_nr_sectors fields, respectively. [7] Also, set up the block device controller so that an interrupt is raised when the DMA data transfer completes.
4. If the routine is handling a block device file for which ll_rw_block( ) accomplishes block clustering, increment the sector field and decrement the nr_sectors field of the request descriptor to keep track of the blocks to be transferred.

[7] Recall that current_nr_sectors contains the number of sectors in the first block of the request, while nr_sectors contains the total number of sectors in the request.


The interrupt handler associated with the termination of the DMA data transfer for the block device should invoke (either directly or via a bottom half) the end_request( ) function. It receives as its parameter the value 1 if the data transfer succeeded and the value 0 if an error occurred. end_request( ) performs the following operations:
1. If an error occurred (parameter value is 0), updates the sector and nr_sectors fields so as to skip the remaining sectors of the block. In step 3a, the buffer content will also be marked as not up-to-date.
2. Removes the buffer head of the transferred block from the request's list.
3. Invokes the b_end_io method of the buffer head. When the getblk( ) function allocates the buffer head, it loads this field with the address of the end_buffer_io_sync( ) function, which performs two operations:
   a. Sets the BH_Uptodate flag of the buffer head to 1 or 0, according to the success or failure of the data transfer
   b. Clears the BH_Lock flag of the buffer head and wakes up all processes sleeping in the wait queue to which the b_wait field of the buffer head points
4. If there is another buffer head on the request's list, performs the following actions:
   a. Sets the current_nr_sectors field of the request descriptor to the number of sectors of the new block
   b. Sets the buffer field with the address of the new buffer (from the b_data field of the new buffer head)
5. Otherwise, if the request's list is empty, all blocks have been processed. Therefore, performs the following operations:
   a. Sets the current request pointer to the next element in the request queue
   b. Sets the rq_status field of the processed request to RQ_INACTIVE
   c. Wakes up all processes sleeping in the wait_for_request wait queue

After invoking end_request( ), the low-level block device driver checks the value of the current_request field in the block device driver descriptor; if it is not NULL, the request queue is not empty, and the strategy routine is executed again. Notice that end_request( ) actually performs two nested iterations: the outer one on the elements of the request queue and the inner one on the elements in the buffer head list of each request. The strategy routine is thus invoked once for each block in the request queue.

13.6 Page I/O Operations

Block devices transfer information one block at a time, while process address spaces (or to be more precise, memory regions allocated for the process) are defined as sets of pages. This mismatch can be hidden to some extent by using page I/O operations (see Section 13.5).
They may be activated in the following cases:
• A process issues a read( ) or write( ) system call on a regular file (see Section 15.1 in Chapter 15).
• A process reads a location of a page that maps a file in memory (see Section 15.2 in Chapter 15).
• The kernel flushes some dirty pages related to a file memory mapping to disk (see Section 15.2.6 in Chapter 15).
• When swapping in or swapping out, the kernel loads from disk or saves to disk the contents of whole page frames (see Chapter 16).


We'll use the rest of this chapter to describe how these operations are carried out.

13.6.1 Starting Page I/O Operations

A page I/O operation is activated by invoking the brw_page( ) function, which receives the following parameters:

rw
   Type of I/O operation (READ or WRITE)
page
   Address of a page descriptor
dev
   Block device number
b
   Array of logical block numbers
size
   Block size
bmap
   Flag specifying whether the block numbers in b were computed by using the bmap method of the inode operations (see Section 12.2.2 in Chapter 12)

The page descriptor refers to the page involved in the page I/O operation. It must already be locked (PG_locked flag on) before invoking brw_page( ) so that no other kernel control path can access it. The page is considered as split into 4096/size buffers; the ith buffer in the page is associated with the block b[i] of device dev.

The function performs the following operations:
1. Invokes create_buffers( ) to allocate temporary buffer heads for all buffers included in the page (such buffer heads are called asynchronous; they will be discussed in Section 14.1.1 in Chapter 14). The function returns the address of the first buffer head, while the b_this_page field of each buffer head points to the buffer head of the next buffer in the page.
2. For each buffer head in the page, performs the following substeps:
   a. Initializes the buffer head fields; since it is an asynchronous buffer head, sets the b_end_io method to end_buffer_io_async( ).
   b. If the bmap parameter flag is not null, checks whether the buffer head refers to a block having number 0. This is because the bmap method of the inode operations uses block number 0 to represent a file hole (see Chapter 17). In this case, fills the buffer with zeros, sets the BH_Uptodate flag of the buffer head, and continues with the next asynchronous buffer head.
   c. Invokes find_buffer( ) to check whether the block associated with the buffer head is already present in memory (see Section 14.1 in Chapter 14). If so, performs the following substeps:
      i. Increments the usage counter of the buffer head found in the cache.
      ii. If the I/O operation is READ and if the buffer in the cache is not up-to-date, invokes ll_rw_block( ) to issue a READ request; then invokes wait_on_buffer( ) to wait for the I/O to complete. Notice that ll_rw_block( ) acts on the buffer head included in the buffer cache, and thus triggers a buffer I/O operation.
      iii. If the I/O operation is READ, copies the data from the buffer in the cache into the page buffer.
      iv. If the I/O operation is WRITE, copies the data from the page buffer into the buffer in the cache, and invokes mark_buffer_dirty( ) to set the BH_Dirty flag of the buffer head in the cache.
      v. Sets the BH_Uptodate flag of the asynchronous buffer head, decrements the usage counter of the buffer head in the cache, and continues with the next asynchronous buffer head.
   d. The block required is not in the cache. Therefore, if the I/O operation is a READ, clears the BH_Uptodate flag of the asynchronous buffer head; if it is a WRITE, sets the BH_Dirty flag.
   e. Inserts the pointer to the asynchronous buffer head into a local array, and continues with the next asynchronous buffer head.
   Now all asynchronous buffer heads have been considered.
3. If the local array of asynchronous buffer head pointers is empty, all requested blocks were included in the buffer cache, thus the page I/O operation is not necessary. In this case, performs the following substeps:
   a. Clears the PG_locked flag of the page descriptor, thus unlocking the page frame.
   b. Sets the PG_uptodate flag of the page descriptor.
   c. Wakes up any process sleeping on the wait wait queue of the page descriptor.
   d. Invokes free_async_buffers( ) to release the asynchronous buffer heads.
   e. Invokes the after_unlock_page( ) function (see Chapter 16). This function releases the page frame if the PG_free_after flag of the page descriptor is set.
   f. Returns the value 0.
4. If we have reached this point, the local array of asynchronous buffer head pointers is not empty, thus a page I/O operation is really necessary. Invokes ll_rw_block( ) to issue an rw request for all buffer heads included in the local array and immediately returns the value 0.

13.6.2 Terminating Page I/O Operations

The ll_rw_block( ) function activates the device driver of the block device being accessed (see Section 13.5.9). As described in Section 13.5.10, the device driver performs the actual data transfer, and then invokes the b_end_io method of all asynchronous buffer heads that have been transferred. The b_end_io field points to the end_buffer_io_async( ) function, which performs the following operations:
1. Invokes the mark_buffer_uptodate( ) function, which in turn performs the following substeps:
   a. Sets the BH_Uptodate flag of the asynchronous buffer head according to the result of the I/O operation.
   b. If the BH_Uptodate flag is set, checks whether all other asynchronous buffer heads in the page are up-to-date; if so, sets the PG_uptodate flag of the page descriptor.
2. Clears the BH_Lock flag of the asynchronous buffer head.
3. If the BH_Uptodate flag is off, sets the PG_error flag of the page descriptor because an error occurred while transferring the block.
4. Decrements the usage counter of the asynchronous buffer head (it becomes 0).
5. Checks whether all asynchronous buffer heads that refer to the page have null usage counters. If so, all data transfers for the buffers in the page have been completed, thus performs the following substeps:
   a. Invokes the free_async_buffers( ) function to release all asynchronous buffer heads.
   b. Clears the PG_locked bit of the page descriptor, thus unlocking the page frame.
   c. Wakes up all processes sleeping in the wait wait queue of the page descriptor.
   d. Invokes the after_unlock_page( ) function (see Chapter 16). This function releases the page frame if the PG_free_after flag of the page descriptor is set.

13.7 Anticipating Linux 2.4

Linux 2.4 heavily changes how I/O device drivers are handled.
The main improvement consists of a new Resource Management Subsystem used to allocate IRQ lines, DMA channels, I/O ports, and so on. Thanks to this new subsystem, Linux now fully supports hot-pluggable Plug-And-Play hardware devices, USB buses, and PCMCIA cards.

Linux 2.4 reorganizes the block device driver layer and adds support for the Logical Volume Manager. The Logical Volume Manager allows filesystems to span several disk partitions and to be resized dynamically. This new feature brings Linux closer to enterprise-class operating systems.

The new kernel introduces a class of character devices called raw I/O devices. These devices allow applications like DBMS to directly access disks without making use of the kernel caches.

Another significant addition is kernel support for Intelligent Input/Output (I2O) hardware. The goal of this new standard, derived from the PCI architecture, is to write OS-independent device drivers for several kinds of devices like disks, SCSI devices, and network cards.

Finally, Linux 2.4 includes the devfs virtual filesystem, which replaces the old static /dev directory of device files. Virtual files appear only when the corresponding device driver is present in the kernel. The device filenames have also been changed. As an example, all disc devices are placed under the /dev/discs directory: /dev/hda might become /dev/discs/disc0, /dev/hdb might become /dev/discs/disc1, and so on. Users can still refer to the old name scheme by properly configuring a device management daemon.


Chapter 14. Disk Caches

This chapter deals with disk caches. It shows how Linux makes use of sophisticated techniques to improve system performance by reducing disk accesses as much as possible.

As mentioned in Section 12.1.1 in Chapter 12, a disk cache is a software mechanism that allows the system to keep in RAM some data normally stored on a disk, so that further accesses to that data can be satisfied quickly without accessing the disk.

Besides the dentry cache, which is used by the VFS to speed up the translation of a file pathname to the corresponding inode, two main disk caches (the buffer cache and the page cache) are used by Linux. Most of this chapter describes the buffer cache, and a short section near the end covers the page cache.

We learned in Section 13.5.1 in Chapter 13 that a buffer is a memory area containing the data of a disk block. Each block refers to physically adjacent bytes on the disk surface; the block size depends on the type of the filesystem it comes from. As suggested by its name, the buffer cache is a disk cache that stores buffers.

Conversely, the page cache is a disk cache storing page frames that contain data belonging to regular files.
It is inherently different from the buffer cache, since page frames in the page cache do not necessarily correspond to physically adjacent disk blocks.

Buffer I/O operations (see Section 13.5 in Chapter 13) make use of the buffer cache only. Page I/O operations use the page cache and optionally the buffer cache as well. As we'll see in the following sections, both caches are implemented by making use of proper data structures storing pointers to buffer heads and page descriptors.

Table 14-1. Use of the Buffer Cache and Page Cache

I/O Operation                     Cache         System Call  Kernel Function
Read a block device file [1]      Buffer        read( )      block_read( )
Write a block device file [1]     Buffer        write( )     block_write( )
Read an Ext2 directory [2]        Buffer        getdents( )  ext2_bread( )
Read an Ext2 regular file [2]     Page          read( )      generic_file_read( )
Write an Ext2 regular file [2]    Page, buffer  write( )     ext2_file_write( )
Access to memory-mapped file [3]  Page          None         filemap_nopage( )
Access to swapped-out page [4]    Page          None         do_swap_page( )

[1] See Section 13.5.4 in Chapter 13.
[2] See Chapter 17.
[3] See Section 15.2 in Chapter 15.
[4] See Chapter 16.

For each type of I/O activity, the table also shows the system call required to start it (if any) and the main corresponding kernel function that handles it.

You'll notice in the table that accesses to memory-mapped files and swapped-out pages do not require system calls; they are transparent to the programmer. Once a file memory mapping has been set up and once swapping has been activated, the application program can access the mapped file or the swapped-out page as if it were present in memory. It is the kernel's responsibility to delay the process until the required page has been located on disk and brought into RAM.

14.1 The Buffer Cache

The whole idea behind the buffer cache is to relieve processes from having to wait for relatively slow disks to retrieve or store data. Thus, it would be counterproductive to write a lot of data at once; instead, data should be written piecemeal at regular intervals so that I/O operations have a minimal impact on the speed of the user processes and on response time experienced by human users.

The kernel maintains a lot of information about each buffer to help it pace the writes, including a "dirty" bit to indicate the buffer has been changed in memory and needs to be written and a timestamp to indicate how long the buffer should be kept in memory before being flushed to disk. Information on buffers is kept in buffer heads (introduced in the previous chapter), so these data structures require maintenance along with the buffers of user data themselves.

The size of the buffer cache may vary. Page frames are allocated on demand when a new buffer is required and one is not available.
When free memory becomes scarce, as we shall see in Chapter 16, buffers are released and the corresponding page frames are recycled.

The buffer cache consists of two kinds of data structures:

• A set of buffer heads describing the buffers in the cache (see Section 13.5.6 in Chapter 13)
• A hash table to help the kernel quickly derive the buffer head that describes the buffer associated with a given pair of device and block numbers

14.1.1 Buffer Head Data Structures

As mentioned in Section 13.5.6 in Chapter 13, each buffer head is stored in a data structure of type buffer_head. These data structures have their own slab allocator cache called bh_cachep, which should not be confused with the buffer cache itself. The slab allocator cache is a memory cache (see Section 3.1.2 in Chapter 3) for the buffer head objects, meaning that it has no interaction with disks and is simply a way of managing memory efficiently. In contrast, the buffer cache is a disk cache for the data in the buffers. The number of allocated buffer heads, that is, the number of objects obtained from the slab allocator, is stored in the nr_buffer_heads variable.

Each buffer used by a block device driver must have a corresponding buffer head that describes the buffer's current status. The converse is not true: a buffer head may be unused, which means it is not bound to any buffer. The kernel keeps a certain number of unused buffer heads to avoid the overhead of constantly allocating and deallocating memory.

In general, a buffer head may be in any one of the following states:


Unused buffer head
The object is available; the values of its fields are meaningless.

Buffer head for a free buffer
Its b_data field points to a free buffer, and its b_dev field has the value B_FREE (0xffff). Notice that the buffer is available, not the buffer head itself.

Buffer head for a cached buffer
Its b_data field points to a buffer stored in the buffer cache.

Asynchronous buffer head
Its b_data field points to a temporary buffer used to implement a page I/O operation (see Section 13.6 in Chapter 13).

Strictly speaking, the buffer cache data structures include only pointers to buffer heads for cached buffers. For the sake of completeness, we shall examine the data structures and the methods used by the kernel to handle all kinds of buffer heads, not just those in the buffer cache.

14.1.1.1 The list of unused buffer heads

All unused buffer heads are collected in a singly linked list, whose first element is addressed by the unused_list variable. Each buffer head stores the address of the next list element in the b_next_free field. The current number of elements in the list is stored in the nr_unused_buffer_heads variable.

The list of unused buffer heads acts as a primary memory cache for the buffer head objects, while the bh_cachep slab allocator cache is a secondary memory cache. When a buffer head is no longer needed, it is inserted into the list of unused buffer heads. Buffer heads are released to the slab allocator (a preliminary step to letting the kernel free the memory associated with them altogether) only when the number of list elements exceeds MAX_UNUSED_BUFFERS (usually 36 elements).
In other words, a buffer head in this list is considered as an allocated object by the slab allocator and as an unused data structure by the buffer cache.

A subset of NR_RESERVED (usually 16) elements in the list is reserved for page I/O operations. This is done to prevent nasty deadlocks caused by the lack of free buffer heads. As we shall see in Chapter 16, if free memory is scarce, the kernel can try to free a page frame by swapping out some page to disk. In order to do this, it requires at least one additional buffer head to perform the page I/O file operation. If the swapping algorithm fails to get a buffer head, it simply keeps waiting and lets writes to files proceed in order to free up buffers, since at least NR_RESERVED buffer heads are going to be released as soon as the ongoing file operations terminate.

The get_unused_buffer_head( ) function is invoked to get a new buffer head. It essentially performs the following operations:


1. Invokes the recover_reusable_buffer_heads( ) function (more on this later).
2. If the list of unused buffer heads has more than NR_RESERVED elements, removes one of them from the list and returns its address.
3. Otherwise, invokes kmem_cache_alloc( ) to allocate a new buffer head; if the operation succeeds, returns its address.
4. No free memory is available. If the buffer head has been requested for a buffer I/O operation, returns NULL (failure).
5. If this point is reached, the buffer head has been requested for a page I/O operation. If the list of unused buffer heads is not empty, removes one element and returns its address.

The put_unused_buffer_head( ) function performs the reverse operation, releasing a buffer head. It inserts the object in the list of unused buffer heads if that list has fewer than MAX_UNUSED_BUFFERS elements; otherwise, it releases the object to the slab allocator.

14.1.1.2 Lists of buffer heads for free buffers

Since Linux uses several block sizes (see Section 13.5.1 in Chapter 13), it uses several circular lists, one for each buffer size, to collect the buffer heads of free buffers. Such lists act as a memory cache. Thanks to them, a free buffer of a given size can be obtained quickly when needed, without relying on the time-consuming Buddy system procedures.

Seven lists of buffer heads for free buffers are defined; the corresponding buffer sizes are 512, 1024, 2048, 4096, 8192, 16384, and 32768 bytes.
The size of a block, however, cannot exceed the size of a page frame; only the first four lists are thus actually used on PC architecture.

The free_list array points to all seven lists; for each list, there is one element in the array to hold the address of the list's first element. The BUFSIZE_INDEX macro accepts a block size as input and derives from it the corresponding index in the array. For instance, buffer size 512 maps to free_list[0], buffer size 1024 to free_list[1], and so on. The lists are doubly linked by means of the b_next_free and b_prev_free fields of each buffer head.

14.1.1.3 Lists of buffer heads for cached buffers

When a buffer belongs to the buffer cache, the flags of the corresponding buffer head describe its current status (see Section 13.5.6 in Chapter 13). For instance, when a block not present in the cache must be read from disk, a new buffer is allocated and the BH_Uptodate flag of the buffer head is cleared because the buffer's contents are meaningless. While filling the buffer by reading from disk, the BH_Lock flag is set to protect the buffer from being reclaimed. If the read operation terminates successfully, the BH_Uptodate flag is set and the BH_Lock flag is cleared. If the block must be written to disk, the buffer content is modified and the BH_Dirty flag is set; the flag will be cleared only after the buffer is successfully written to disk.

Any buffer head associated with a used buffer is contained in a doubly linked list, implemented by means of the b_next_free and b_prev_free fields.
There are three different lists, identified by an index defined as a macro (BUF_CLEAN, BUF_DIRTY, and BUF_LOCKED). We'll define these lists in a moment.

The three lists are introduced to speed up the functions that flush dirty buffers to disk (see Section 14.1.5 later in this chapter). For reasons of efficiency, a buffer head is not moved right away from one list to another when it changes status; this makes the following description a bit murky.

BUF_CLEAN
This list collects buffer heads of nondirty buffers (BH_Dirty flag is off). Notice that buffers in this list are not necessarily up-to-date, that is, they don't necessarily contain valid data. If the buffer is not up-to-date, it could even be locked (BH_Lock is on) and selected to be read from the physical device while being on this list. The buffer heads in this list are guaranteed only to be not dirty; in other words, the corresponding buffers are ignored by the functions that flush dirty buffers to disk.

BUF_DIRTY
This list mainly collects buffer heads of dirty buffers that have not been selected to be written into the physical device, that is, dirty buffers that have not yet been included in a block request for a block device driver (BH_Dirty is on and BH_Lock is off). However, this list could also include nondirty buffers, since in a few cases the BH_Dirty flag of a dirty buffer is cleared without flushing it to disk and without removing the buffer head from the list (for instance, whenever a floppy disk is removed from its drive without unmounting, an event that most probably leads to data loss, of course).

BUF_LOCKED
This list mainly collects buffer heads of dirty buffers that have been selected to be written to the block device (BH_Lock is on; BH_Dirty is clear because the add_request( ) function resets it before including the buffer head in a block request). However, when a write operation for some locked buffer has been completed, the low-level block device handler clears the BH_Lock flag without removing the buffer head from the list (see Section 13.5.10 in Chapter 13).
The buffer heads in this list are guaranteed only to be not dirty, or dirty but selected to be written.

For any buffer head associated with a used buffer, the b_list field of the buffer head stores the index of the list containing the buffer. The lru_list array [5] stores the address of the first element in each list, while the nr_buffers_type array stores the number of elements in each list.

[5] The name of the array derives from the abbreviation for Least Recently Used: in earlier versions of Linux, these lists were ordered according to the time when each buffer was last accessed.

The mark_buffer_dirty( ) and mark_buffer_clean( ) functions set and clear, respectively, the BH_Dirty flag of a buffer head. They also invoke the refile_buffer( ) function, which moves the buffer head into the proper list according to the value of the BH_Dirty and BH_Lock flags.

14.1.1.4 The hash table of cached buffer heads

The addresses of the buffer heads belonging to the buffer cache are inserted into a large hash table. Given a device identifier and a block number, the kernel can use the hash table to quickly derive the address of the corresponding buffer head, if one exists. The hash table noticeably improves kernel performance because checks on buffer heads are frequent. Before starting a buffer I/O operation, the kernel must check whether the required block is already in the buffer cache; in this situation, the hash table lets the kernel avoid a lengthy sequential scan of the lists of cached buffers.

The hash table is stored in the hash_table array, which is allocated during system initialization and whose size depends on the amount of RAM installed on the system. As an example, for systems having 64 MB of RAM, hash_table is stored in 64 page frames and includes 65,536 buffer head pointers (64 frames of 4 KB each, at 4 bytes per pointer). As usual, entries causing a collision are chained in doubly linked lists implemented by means of the b_next and b_pprev fields of each buffer head. The total number of buffer heads in the hash table is stored in the nr_hashed_buffer variable.

The find_buffer( ) function receives as parameters the device number and the block number of the buffer head to be searched for. It hashes the values of the parameters and looks into the hash table to find the first element in the collision list, then checks the b_dev and b_blocknr fields of each element in the list and returns the address of the requested buffer head. If the buffer head is not in the cache, the function returns NULL.

The insert_into_queues( ) and remove_from_queues( ) functions insert an element into the hash table and remove it from the hash table, respectively. Both functions also take care of the buffer head's other data structures.
For instance, when insert_into_queues( ) is invoked on a buffer head that should be cached, the function inserts it into both the proper lru_list and the hash table.

14.1.1.5 Lists of asynchronous buffer heads

Asynchronous buffer heads are used by page I/O file operations (see Section 13.6 in Chapter 13). Even if a page I/O operation transfers a whole page, the actual data transfer is done one block at a time by the proper block device handler. In other words, the operation views the page frame containing the page as a group of buffers. The number of buffers in the group depends on the block size used: a 4 KB page frame may include, for instance, a group of four 1 KB buffers if the block size is 1024 or a single 4 KB buffer if the block size is 4096. During the page I/O operation, any buffer in the page must have its corresponding asynchronous buffer head. These buffer heads, however, are discarded as soon as the I/O operation completes, since from now on the page can be regarded as a whole and referenced by means of its page descriptor.

Since each page can consist of many buffers, the goal at this point is to try to find whether all buffers used by a page have been transferred.

As discussed in Section 13.5.10 in Chapter 13, when a block transfer terminates, the interrupt handler invokes end_request( ). This function takes care of removing the block request from the request queue and invokes the b_end_io method of all buffer heads included in the request.
When a buffer is involved in a page I/O operation (instead of a buffer I/O operation), the b_end_io field points to the end_buffer_io_async( ) function, which decrements the usage counter of the buffer head and checks whether all buffer heads in the page have a null usage counter. If they turn out to be unused, the function invokes free_async_buffers( ) to release the asynchronous buffer heads. Notice that the usage counter of an asynchronous buffer head is used as a flag specifying whether the buffer data has been transferred.

The free_async_buffers( ) function cannot, however, insert the asynchronous buffer heads into the unused list right away, since a customized block device driver's end_request( ) function might need to access them later. Therefore, free_async_buffers( ) inserts these buffer heads in a special list denoted as the reuse list, which is implemented by means of the b_next_free field. The reuse_list variable points to the first element of the list. Elements in the reuse list are moved into the unused list by recover_reusable_buffer_heads( ) just before getting a buffer head from the unused list. But this never happens before end_request( ) terminates, so there is no danger of a race condition involving accesses to the reuse list.

14.1.2 The getblk( ) Function

The getblk( ) function is the main service routine for the buffer cache. When the kernel needs to read or write the contents of some block of a physical device, it must check whether the buffer head for the required buffer is already included in the buffer cache. If the buffer is not there, the kernel must create a new entry in the cache. In order to do this, the kernel invokes getblk( ), specifying as parameters the device identifier, the block number, and the block size. This function returns the address of the buffer head associated with the buffer.

Remember that having a buffer head in the cache does not imply that the data in the buffer is valid. (For instance, the buffer may not yet have been read from disk.)
Any function that reads blocks, such as block_read( ), must check whether the buffer obtained from getblk( ) is up-to-date; if not, it must read the block first from disk before using the buffer.

The getblk( ) function performs the following operations:

1. Invokes find_buffer( ), which makes use of the hash table to check whether the required buffer head is already in the cache.
2. If the buffer head has been found, increments its usage counter (b_count field) and returns its address. The next section explains the purpose of this field.
3. If the buffer head is not in the cache, a new buffer and a new buffer head must be allocated. Derives from the block size an index in the free_list array and checks whether the corresponding free list is empty.
4. If the free list is not empty, performs the following operations:
   a. Removes the first buffer head from the list
   b. Initializes the buffer head with the device identifier, the block number, and the block size; stores in the b_end_io field a pointer to the end_buffer_io_sync( ) function; [6] and sets the b_count usage counter to 1
      [6] The buffer cache is reserved for buffer I/O operations, which require the b_end_io method to point to the end_buffer_io_sync( ) function; asynchronous buffer heads are left out of the buffer cache.
   c. Invokes insert_into_queues( ) to insert the buffer head into the hash table and the lru_list[BUF_CLEAN] list
   d. Returns the address of the buffer head
5. If the free list is empty, invokes the refill_freelist( ) function to replenish it (see Section 14.1.4).


6. Invokes find_buffer( ) to check once more whether some other process has put the buffer in the cache while the kernel control path was waiting for the completion of the previous step. If so, goes to step 2; otherwise, goes to step 3.

14.1.3 Buffer Usage Counter

The b_count field of the buffer head is a usage counter for the corresponding buffer. The counter is incremented right before any operation on the buffer and decremented right after. It acts mainly as a safety lock, since the kernel never destroys a buffer (or its contents) as long as it has a non-null usage counter. Instead, the cached buffers are examined either periodically or when the free memory becomes scarce, and only those buffers having null counters may be destroyed (see Chapter 16). In other words, a buffer with a null usage counter may belong to the buffer cache, but it cannot be determined for how long the buffer will stay in the cache.

When a kernel control path wishes to access a buffer, it should increment the usage counter first. This task is performed by the getblk( ) function, which is usually invoked to locate the buffer, so that the increment need not be done explicitly by higher-level functions. When a kernel control path stops accessing a buffer, it may invoke either brelse( ) or bforget( ) to decrement the corresponding usage counter.

The brelse( ) function receives as its parameter the address of a buffer head. It checks whether the buffer is dirty and, if so, writes the time when the buffer should be flushed in the b_flushtime field of the buffer head (see Section 14.1.5).
The function also invokes refile_buffer( ) to move the buffer head to the proper list, if necessary. Finally, the PG_referenced flag of the page frame containing the buffer is set (see Chapter 16), and the b_count field is decremented.

The bforget( ) function is similar to brelse( ), except that if the usage counter becomes 0 and the buffer is not locked (BH_Lock flag cleared), the buffer head is removed from the buffer cache and inserted into the proper list of free buffers. In other words, the data included in the buffer, as well as the association between the buffer and a specific block of a physical device, is lost.

14.1.4 Buffer Allocation

For reasons of efficiency, buffers are not allocated as single memory objects. Instead, buffers are stored in dedicated pages called buffer pages. All the buffers within a single buffer page must be the same size. Depending on the block size, a buffer page can include eight, four, two, or just one buffer on the PC architecture. The buffer head's b_this_page field links all buffers included in a single buffer page together in a circular list.

If the page descriptor refers to a buffer page, its buffers field points to the buffer head of the first buffer included in the page; otherwise, this field is set to NULL. Figure 14-1 shows a page containing four buffers and the corresponding buffer heads.
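The layout of a buffer page can be illustrated with a small user-space sketch. It is not the kernel's code: the bh_sketch structure, the PAGE_SIZE_SKETCH constant, and both function names are hypothetical, and only the b_data, b_size, and b_this_page fields named in the text are modeled. Assuming a 4 KB page, the sketch carves the page into equal-sized buffers and chains their heads into the circular list described above.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u   /* assumption: 4 KB page frame, as on the PC */

/* Hypothetical, simplified stand-in for the kernel's buffer_head. */
struct bh_sketch {
    char *b_data;                  /* start of this buffer inside the page */
    size_t b_size;                 /* block size in bytes */
    struct bh_sketch *b_this_page; /* next buffer head of the same page */
};

/*
 * Split one page into buffers of block_size bytes and link their heads
 * in a circular list. Returns the number of buffers in the page, or 0
 * if block_size does not divide the page evenly or too few heads were
 * supplied.
 */
size_t link_buffer_page(char *page, size_t block_size,
                        struct bh_sketch *heads, size_t nheads)
{
    assert(page != NULL && heads != NULL);

    if (block_size == 0 || PAGE_SIZE_SKETCH % block_size != 0)
        return 0;

    size_t n = PAGE_SIZE_SKETCH / block_size;
    if (n > nheads)
        return 0;

    for (size_t i = 0; i < n; i++) {
        heads[i].b_data = page + i * block_size;
        heads[i].b_size = block_size;
        /* The last head points back to the first, closing the circle. */
        heads[i].b_this_page = &heads[(i + 1) % n];
    }
    return n;
}

/* Walk the circular list once and count the buffers of the page. */
size_t count_page_buffers(const struct bh_sketch *first)
{
    assert(first != NULL);

    size_t n = 0;
    const struct bh_sketch *bh = first;
    do {
        n++;
        bh = bh->b_this_page;
    } while (bh != first);
    return n;
}
```

With a block size of 1024, link_buffer_page( ) fills in four heads, and following b_this_page from the last head leads back to the first, which is the arrangement with four buffers that the text attributes to Figure 14-1.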


Figure 14-1. A page including four buffers and their buffer heads

The number of buffer pages must never become too small, or buffer I/O operations would be delayed for lack of buffers. The minimum percentage of buffer pages among all page frames is stored in the min_percent field of the buffer_mem table, [7] which is accessible either from the /proc/sys/vm/buffermem file or by using the sysctl( ) system call.

[7] The table also includes two other fields, borrow_percent and max_percent, which are not used in Linux 2.2.

When the getblk( ) function needs a free buffer, it tries to get an element of the list pointing to free buffers of the right size. If that list is empty, the kernel must allocate additional page frames and then create new buffers of the required block size. This task is performed by the refill_freelist( ) function, which receives as a parameter the block size of the buffers to be allocated. Actually, the function just invokes grow_buffers( ), which basically tries to allocate new buffers and returns the value 1 if it succeeded, 0 otherwise. If grow_buffers( ) failed to obtain new buffers because available memory is scarce, refill_freelist( ) wakes up the bdflush kernel thread (see the next section). It then relinquishes the CPU by setting the SCHED_YIELD flag of current and by invoking schedule( ), thus allowing bdflush to run. The getblk( ) function invokes refill_freelist( ) repeatedly until it succeeds.

The grow_buffers( ) function receives as a parameter the size of the buffers to be allocated and performs the following operations:

1. Invokes __get_free_page( ) with priority GFP_BUFFER to get a new page frame from the Buddy system. The GFP_BUFFER priority indicates that the current process could be suspended while executing this function.
2. If no page frame is available, returns 0.
3. If a page frame is available, invokes the create_buffers( ) function, which in turn performs the following operations:
   a. Tries to allocate the buffer heads for all buffers in the page by repeatedly invoking get_unused_buffer_head( ).
   b. If all the buffer heads needed have been obtained, initializes them properly; in particular, sets the b_dev field to B_FREE, the b_size field to the buffer size, and the b_data field to the starting address of the buffer in the page; then links together the buffer heads by means of the b_this_page field. Finally, returns the address of the buffer head of the first buffer in the page.


   c. If not all buffer heads have been obtained, releases all buffer heads already obtained by repeatedly invoking put_unused_buffer_head( ).
   d. If the buffer has been requested for a buffer I/O operation, returns NULL (failure). This will cause grow_buffers( ) to return 0.
   e. If we reach this point, get_unused_buffer_head( ) failed and the buffer has been requested for a page I/O operation. In this case, the unused list is empty and all NR_RESERVED asynchronous buffer heads in the list are being used for other page I/O operations. Executes the function in the tq_disk task queue (see Section 13.5.9 in Chapter 13) and sleeps on the buffer_wait wait queue until some asynchronous buffer head becomes free.
   f. Goes to step a and tries again to allocate all buffer heads for a page.
4. If create_buffers( ) returned NULL, releases the page frame and returns 0.
5. Otherwise, all the buffer heads needed are now available. Inserts the buffer heads corresponding to the new buffers in the proper free list.
6. Adds the number of newly created buffers to nr_buffers, which always stores the total number of existing buffers.
7. Sets the buffers field of the page descriptor to the address of the first buffer head in the page.
8. Updates the buffermem variable, which stores the total number of bytes in the buffer pages.
9. Returns the value 1 (success).

14.1.5 Writing Dirty Buffers to Disk

Unix systems allow the deferred writing of dirty buffers into block devices, since that strategy noticeably improves system performance.
Several write operations on a buffer could be satisfied by just one slow physical update of the corresponding disk block. Moreover, write operations are less critical than read operations since a process is usually not suspended because of delayed writes, while it is most often suspended because of delayed reads. Thanks to deferred writing, any physical block device will service, on the average, many more read requests than write ones.

A dirty buffer might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the delayed-write strategy to its limits has two major drawbacks:

• If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so a lot of file updates made since the time the system was booted are lost.
• The size of the buffer cache, and hence of the RAM required to contain it, would have to be huge, at least as big as the size of the accessed block devices.

Therefore, dirty buffers are flushed (written) to disk under the following conditions:

• The buffer cache gets too full and more buffers are needed, or the number of dirty buffers becomes too large: when one of these conditions occurs, the bdflush kernel thread is activated.
• A buffer has stayed dirty for too long: the kupdate kernel thread regularly flushes old buffers.
• A process requests all the buffers of block devices or of particular files to be flushed: it does this by invoking the sync( ), fsync( ), or fdatasync( ) system call.
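The third condition is the only one under direct control of a user process. As a minimal user-space sketch, assuming a POSIX system, the function below writes a string to a file and then calls fsync( ) so that it does not return until the kernel reports that the file's dirty data has been written out. The function name write_durably and its error convention are illustrative assumptions, not something taken from the kernel sources.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: writes msg to path, then asks the kernel to
 * flush the file's dirty data to the device before returning.
 * Returns 0 on success, -1 on any error (including a short write).
 */
int write_durably(const char *path, const char *msg)
{
    assert(path != NULL && msg != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    size_t len = strlen(msg);

    /* write( ) normally returns once the data sits in a kernel cache;
     * the physical update is deferred. */
    int ok = write(fd, msg, len) == (ssize_t)len;

    /* fsync( ) forces the deferred write for this file to happen now. */
    if (ok && fsync(fd) != 0)
        ok = 0;

    if (close(fd) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling fsync( ) after every write gives up most of the benefit of deferred writing, so programs usually reserve it for data that must survive a crash.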


14.1.5.1 The bdflush kernel thread

The bdflush kernel thread (also called kflushd) is created during system initialization. It executes the bdflush( ) function, which selects some dirty buffers and forces an update of the corresponding blocks on the physical block devices.

Some system parameters control the behavior of bdflush; they are stored in the b_un field of the bdf_prm table and are accessible either by means of the /proc/sys/vm/bdflush file or by invoking the bdflush( ) system call. Each parameter has a default standard value, although it may vary within a minimum and a maximum value stored in the bdflush_min and bdflush_max tables, respectively. The parameters are listed in Table 14-2; remember that 1 tick corresponds to about 10 milliseconds. [8]

[8] The bdf_prm table also includes several other unused fields.

Table 14-2. Buffer Cache Tuning Parameters

Parameter    Default  Min  Max     Description
age_buffer   3000     100  60,000  Time-out in ticks of a normal dirty buffer for being written to disk
age_super    500      100  60,000  Time-out in ticks of a superblock dirty buffer for being written to disk
interval     500      0    6000    Delay in ticks between kupdate activations
ndirty       500      10   5000    Maximum number of dirty buffers written to disk during an activation of bdflush
nfract       40       0    100     Threshold percentage of dirty buffers for waking up bdflush

Deferred writing is effective only if the writes are spread out over time: writing a lot of data at once would degrade system response time more than simply writing each buffer as soon as it becomes dirty. Therefore, not all dirty buffers are written to disk at each activation of bdflush. The maximum number of dirty buffers to be flushed in each activation is stored in the ndirty parameter of bdf_prm.

The kernel thread is woken up in a few specific cases:

• When a buffer head is inserted into the BUF_DIRTY list and the number of elements in the list becomes larger than:

      nr_buffers x bdf_prm.b_un.nfract / 100

  that is, the percentage of dirty buffers exceeds the threshold represented by the nfract system parameter.
• When the grow_buffers( ) function, invoked by refill_freelist( ), fails to replenish a list of free buffers as described earlier in Section 14.1.4.
• When the kernel tries to get some free pages by releasing some buffers in the buffer cache (see Chapter 16).
• When a user presses some specific combinations of keys on the console (usually ALT+SysRq+U and ALT+SysRq+S). These key combinations, which are enabled only if the Linux kernel has been compiled with the Magic SysRq Key option, allow Linux hackers to have some explicit control over kernel behavior.
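The first wake-up condition is plain integer arithmetic. The sketch below restates it as a small C function; the name should_wake_bdflush and its parameter names are assumptions introduced for illustration and do not appear in the kernel. With the default nfract of 40 and 1,000 existing buffers, the threshold is 400, so the thread is woken when the 401st buffer head enters the BUF_DIRTY list.

```c
/* Returns 1 if bdflush should be woken up, 0 otherwise.
 * nr_dirty:   number of buffer heads in the BUF_DIRTY list
 * nr_buffers: total number of existing buffers
 * nfract:     threshold percentage (0..100), as in bdf_prm.b_un.nfract
 * should_wake_bdflush is a hypothetical name used only for this sketch. */
int should_wake_bdflush(unsigned long nr_dirty,
                        unsigned long nr_buffers,
                        unsigned int nfract)
{
    /* Integer arithmetic, as in the formula above: the division truncates. */
    unsigned long threshold = nr_buffers * nfract / 100;
    return nr_dirty > threshold;
}
```

Setting nfract to 0 makes the threshold 0, so under this model any dirty buffer would trigger a wake-up.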


In order to wake up bdflush, the kernel invokes the wakeup_bdflush( ) function. It receives as its parameter a wait flag that indicates whether the calling kernel control path wishes to wait until some buffers have been successfully flushed to disk. The function performs the following actions:

1. Invokes wake_up( ) to wake up the process suspended in the bdflush_wait wait queue. There is just one process in this wait queue, namely bdflush itself.
2. If the wait parameter indicates that the calling process wishes to wait, invokes sleep_on( ) to insert the current process in a specific wait queue named bdflush_done.

At each activation of the bdflush kernel thread, the bdflush( ) function performs the following operations:

1. Initializes the ndirty local variable to 0; this variable denotes the number of dirty buffers written to disk during a single activation of bdflush( ).
2. Scans the BUF_DIRTY and BUF_LOCKED lists of buffer heads. If a dirty, unlocked buffer is found, increments ndirty and invokes ll_rw_block( ) to issue a WRITE request for the buffer. Moreover, if a buffer head in the wrong list is found, invokes refile_buffer( ) on it (see Section 14.1.1 earlier in this chapter).
3. If ndirty is smaller than bdf_prm.b_un.ndirty and there are other buffer heads to be checked, reruns step 2.
4. Invokes run_task_queue( ) to execute the functions in the tq_disk task queue, thus starting the effective low-level block device drivers.
5. Invokes wake_up( ) to wake up all processes suspended in the bdflush_done wait queue.
6. If some buffers have been flushed in this iteration and the percentage of dirty buffers is greater than bdf_prm.b_un.nfract, goes to step 1 and starts a new iteration: the buffer cache still contains too many dirty buffers.
7. Otherwise, suspends the bdflush kernel thread, as follows: invokes flush_signals( ) to flush all pending signals of bdflush and invokes interruptible_sleep_on( ) to insert bdflush in the bdflush_wait wait queue. When the kernel thread is awakened, it will resume its execution from step 1.

14.1.5.2 The kupdate kernel thread

Since the bdflush kernel thread is usually activated only when there are too many dirty buffers or when more buffers are needed and available memory is scarce, some dirty buffers might stay in RAM for an arbitrarily long time before being flushed to disk. The kupdate kernel thread is thus introduced to flush the older dirty buffers. [9]

[9] In an earlier version of Linux 2.2, the same task was achieved by means of the bdflush( ) system call, which was invoked every five seconds by a User Mode system process launched at system startup and which executed the /sbin/update program. In more recent kernel versions, the bdflush( ) system call is used only to allow users to modify the system parameters in the bdf_prm table.

The kernel distinguishes the buffers used by disk superblocks from other buffers. A superblock includes very critical information, and its corruption could lead to severe problems: in fact, the whole partition could become unreadable. As shown in Table 14-2, there are two time-out parameters: age_buffer is the time for normal buffers to age before


kupdate writes them to disk (usually 30 seconds), while age_super is the corresponding time for superblocks (usually 5 seconds).

The interval field of the bdf_prm table stores the delay in ticks between two activations of the kupdate kernel thread (usually five seconds). If this field is set to 0, the kernel thread is normally stopped, and it is activated only when it receives a SIGCONT signal.

When the kernel modifies the contents of some buffer, it sets the b_flushtime field of the corresponding buffer head to the time (in jiffies) when it should later be flushed to disk. The kupdate kernel thread selects only the dirty buffers whose b_flushtime field is smaller than or equal to the current value of jiffies.

The kupdate kernel thread consists of the kupdate( ) function, which executes the following endless loop:

    for (;;) {
        if (bdf_prm.b_un.interval) {
            tsk->state = TASK_INTERRUPTIBLE;
            schedule_timeout(bdf_prm.b_un.interval);
        } else {
            tsk->state = TASK_STOPPED;
            schedule( ); /* wait for SIGCONT */
        }
        sync_old_buffers( );
    }

If bdf_prm.b_un.interval is not 0, the thread suspends itself for the specified number of ticks (see Section 5.4.7 in Chapter 5); otherwise, the thread stops itself until a SIGCONT signal is received (see Section 9.1 in Chapter 9).

The core of the kupdate( ) function consists of the sync_old_buffers( ) function. The operations to be performed are very simple for standard filesystems used with Unix; all the function has to do is write dirty buffers to disk. However, some nonnative filesystems introduce complexities because they store their superblock or inode information in complicated ways. sync_old_buffers( ) executes the following steps:

1. Invokes sync_supers( ), which takes care of superblocks used by filesystems that do not store all the superblock data in a single disk block (an example is Apple Macintosh's HFS). The function accesses the super_blocks array to scan the superblocks of all currently mounted filesystems (see Section 12.3 in Chapter 12). It then invokes, for each superblock, the corresponding write_super superblock operation, if one is defined (see Section 12.2.1 in Chapter 12). The write_super method is not defined for any Unix filesystem.
2. Invokes sync_inodes( ), which takes care of inodes used by filesystems that do not store all the inode data in a single disk block (an example is the MS-DOS filesystem). The function scans the superblocks of all currently mounted filesystems and, for each superblock, the list of dirty inodes to which the s_dirty field of the superblock object points. The function invokes the write_inode superblock operation on each element of the list, if that method is defined. The write_inode method is not defined for any Unix filesystem.


3. Scans the BUF_DIRTY and BUF_LOCKED lists and writes to disk all old dirty buffers, that is, those whose b_flushtime buffer head fields have a value smaller than or equal to jiffies. The code used to perform this step is almost identical to the code used by bdflush( ), but sync_old_buffers( ) does not flush young buffers to disk, and it doesn't limit the number of buffers checked on each activation.
4. Executes the functions in the tq_disk task queue, thus starting up (unplugging) any low-level block device drivers needed to write blocks to disk.

14.1.5.3 The sync( ), fsync( ), and fdatasync( ) system calls

Three different system calls are available to user applications to flush dirty buffers to disk:

sync( )
    Usually issued before a shutdown, since it flushes all dirty buffers to disk

fsync( )
    Allows a process to flush all blocks belonging to a specific open file to disk

fdatasync( )
    Very similar to fsync( ) but doesn't flush the inode block of the file

The core of the sync( ) system call is the fsync_dev( ) function, which performs the following actions:

1. Invokes sync_buffers( ), which scans the BUF_DIRTY and BUF_LOCKED lists and issues a WRITE request, via ll_rw_block( ), for all dirty, unlocked buffers the lists contain
2. Invokes sync_supers( ) to write the dirty superblocks to disk, if necessary, by using the write_super methods (see earlier in this section)
3. Invokes sync_inodes( ) to write the dirty inodes to disk, if necessary, by using the write_inode methods (see earlier in this section)
4. Invokes sync_buffers( ) once again, since sync_supers( ) and sync_inodes( ) might have marked additional buffers as dirty

The fsync( ) system call forces the kernel to write to disk all dirty buffers belonging to the file specified by the fd file descriptor parameter (including the buffer containing its inode, if necessary). The system service routine derives the address of the file object and then invokes its fsync method. This method is filesystem-dependent, since it must know how files are stored on disk in order to be able to identify the dirty buffers associated with a given file. Once the correspondence between file and buffers has been established, the rest of the job can be delegated to ll_rw_block( ). The fsync method suspends the calling process until all dirty buffers of the file have been written to disk. In order to do this, it scans both the BUF_DIRTY and BUF_LOCKED lists and invokes wait_on_buffer( ) for each locked buffer found.

The fdatasync( ) system call is very similar to fsync( ), but it is supposed to write to disk only the buffers that contain the file's data, not those that contain inode information. Since


Linux 2.2 does not have a specific file method for fdatasync( ), this system call uses the fsync method and is thus identical to fsync( ).

14.2 The Page Cache

The page cache, which is thankfully much simpler than the buffer cache, is a disk cache for the data accessed by page I/O operations. As we shall see in Chapter 15, all access to regular files made by read( ), write( ), and mmap( ) system calls is done through the page cache. Of course, the unit of information kept in the cache is a whole page, since page I/O operations transfer whole pages of data. A page does not necessarily contain physically adjacent disk blocks, and thus it cannot be identified by a device number and a block number. Instead, a page in the page cache is identified by a file's inode and by the offset within the file.

There are three main activities related to the page cache: adding a page when accessing a file portion not already in the cache, removing a page when the cache gets too big, and finding the page including a given file offset.

14.2.1 Page Cache Data Structures

The page cache makes use of two main data structures:

A page hash table
    Lets the kernel quickly derive the page descriptor address for the page associated with a specified inode and file offset

An inode queue
    A list of page descriptors corresponding to pages of data of a particular file (distinguished by a unique inode)

Manipulation of the page cache involves adding and removing entries from these data structures, as well as updating the fields in all inode objects referencing cached files.

14.2.1.1 The page hash table

When a process reads a large file, the page cache may become filled with pages related to that file. In such cases, scanning the proper inode queue to find the page that maps the required file portion could become a time-consuming operation.

For that reason, Linux makes use of a hash table of page descriptor pointers named page_hash_table. Its size depends on the amount of available RAM; as an example, for systems having 64 MB of RAM, page_hash_table is stored in 16 page frames and includes 16,384 page descriptor pointers.

The page_hash( ) function derives the address of the corresponding element in the hash table from the address of an inode object and an offset value. As usual, chaining is introduced to handle entries that cause a collision: the next_hash and pprev_hash fields of the page descriptors are used to implement doubly linked lists of entries having the same


hash value. The page_cache_size variable specifies the number of page descriptors included in the collision lists of the page hash table (and therefore in the page cache).

The add_page_to_hash_queue( ) and remove_page_from_hash_queue( ) functions are used to add an element into the hash table and remove an element from it, respectively.

14.2.1.2 The inode queue

A queue of pages is associated with each inode object in kernel memory. The i_pages field of each inode object stores the address of the first page descriptor in its inode queue, while the i_nrpages field stores the length of the list.

The add_page_to_inode_queue( ) and remove_page_from_inode_queue( ) functions are used to insert a page descriptor into an inode queue and to remove it, respectively.

14.2.1.3 Page descriptor fields related to the page cache

When a page frame is included in the page cache, some fields of the corresponding page descriptor have special meanings:

inode
    Contains the address of the inode object of the file to which the data included in the page belongs; if the page does not belong to the page cache, this field is NULL. [10]

[10] As we shall see in Chapter 16, the inode field points to a fictitious inode object when the page includes data of a swap partition; actually, the page belongs to a subset of the page cache named "swap cache." In this chapter, we don't care about this special case.

offset
    Specifies the relative address of the data inside the file.

next
    Points to the next element in the inode queue.

prev
    Points to the previous element in the inode queue.

next_hash
    Points to the next colliding page descriptor in the page hash list.

pprev_hash
    Points to the previous colliding page descriptor in the page hash list.

In addition, when a page frame is inserted into the page cache, the usage counter (count field) of the corresponding page descriptor is incremented. If the count field is exactly 1, the page


frame belongs to the cache but is not being accessed by any process: it can thus be removed from the page cache whenever free memory becomes scarce, as described in Chapter 16.

14.2.2 Page Cache Handling Functions

The high-level functions using the page cache involve finding, adding, and removing a page.

The find_page( ) function receives as parameters the address of an inode object and an offset value. It invokes page_hash( ) to derive the address of the first element in the collision list, then scans the list until the requested page is found. If the page is present, the function increments the count field of the page descriptor, sets the PG_referenced flag, and returns its address; otherwise, it returns NULL.

The add_to_page_cache( ) function inserts a new page descriptor in the page cache. This is achieved by performing the following operations:

1. Increments the count field of the page descriptor
2. Clears the PG_uptodate, PG_error, and PG_referenced flags of the page frame to indicate that the page is present in the cache but not yet filled with data
3. Sets the offset field of the page descriptor with the offset of the data within the file
4. Invokes add_page_to_hash_queue( ) to insert the page descriptor in the hash table
5. Invokes add_page_to_inode_queue( ) to insert the page descriptor in the inode queue and to set the inode field of the page descriptor

The remove_inode_page( ) function removes a page descriptor from the page cache. This is achieved by invoking remove_page_from_hash_queue( ), remove_page_from_inode_queue( ), and __free_page( ) in succession. The latter function decrements the count field of the page descriptor, and releases the page frame to the Buddy system if the counter becomes 0.

14.2.3 Tuning the Page Cache

The page cache tends to quickly grow in size, because any access to previously unaccessed portions of files forces the kernel to allocate a new page frame for the accessed data and to insert that page frame in the cache. As we shall see in Chapter 16, when free memory becomes scarce, the kernel prunes the page cache by releasing the oldest unused pages. However, the page cache size should never fall under some predefined limit, otherwise system performance will quickly degrade. The lower size limit of the page cache can be tuned by means of the min_percent parameter stored in the page_cache table, [11] which specifies the minimum percentage of pages among all page frames that should belong to the page cache. The default value is 2%. The parameter's value can be read or modified either by invoking the sysctl( ) system call or by accessing the /proc/sys/vm/pagecache file.

[11] The page_cache table also includes the borrow_percent and max_percent parameters, which are no longer used.

14.3 Anticipating Linux 2.4

Much work has been done on the page cache. First of all, the page cache makes use preferably of page frames in high memory. Moreover, Linux 2.4 introduces a new kind of object that


represents a file address space: the object refers to a given block of an inode (or of a block device) and includes pointers to both the memory region descriptors mapping the file and the pages containing the file data. The inode object now includes a pointer to the new address space object, and the page cache is indexed by combining the base address of the address space object with the offset inside the file.

The address space object includes methods to read and write a full page of data. These methods take care of inode object management (like updating the file access time), page cache handling, and temporary buffer allocation. This approach leads to a better coupling between the page cache and the buffer cache. Most filesystems can thus use the generic_file_write( ) function (as you'll see in Chapter 15, Linux 2.2 uses this function only for networking filesystems). The synchronization problem between the buffer cache and the page cache has thus been removed: since both read and write operations for regular files make use of the same page cache, it is no longer necessary to synchronize data present in the two caches.

Notice that both the buffer cache and the page cache continue to be used with different purposes. The first acts on disk blocks, the second on pages that have a file image on disk.
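Whether the key is the inode address (Linux 2.2) or the address space object (Linux 2.4), a page cache lookup has the same shape: hash the pair (object address, offset), then walk the collision list. The following user-space sketch models the Linux 2.2 variant described in Section 14.2. It is a simplified illustration, not kernel code: the names cache_add, cache_find, hash_index, and HASH_SIZE are hypothetical, the hash function is invented, the collision list is singly linked (pprev_hash is omitted), and the inode queue and page flags are left out.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_SIZE 1024  /* illustrative; the real size depends on RAM */

struct inode { int dummy; };        /* stand-in for the kernel's inode object */

struct page {
    struct inode *inode;            /* owning file, NULL if not cached */
    unsigned long offset;           /* offset of the data within the file */
    struct page *next_hash;         /* next page in the same collision list */
    int count;                      /* usage counter */
};

static struct page *hash_table[HASH_SIZE];

/* Hypothetical hash function: mixes the inode address with the offset. */
static unsigned int hash_index(const struct inode *inode, unsigned long offset)
{
    uintptr_t key = (uintptr_t)inode ^ (uintptr_t)(offset >> 12);
    return (unsigned int)(key % HASH_SIZE);
}

/* Inserts a page at the head of its collision list and takes a reference. */
void cache_add(struct page *page, struct inode *inode, unsigned long offset)
{
    struct page **head = &hash_table[hash_index(inode, offset)];
    page->inode = inode;
    page->offset = offset;
    page->count++;
    page->next_hash = *head;
    *head = page;
}

/* Walks the collision list; on a hit, takes a reference and returns the page.
 * Returns NULL if no cached page matches (inode, offset). */
struct page *cache_find(struct inode *inode, unsigned long offset)
{
    struct page *p = hash_table[hash_index(inode, offset)];
    for (; p != NULL; p = p->next_hash) {
        if (p->inode == inode && p->offset == offset) {
            p->count++;
            return p;
        }
    }
    return NULL;
}
```

In this model a page whose count is 1 is referenced only by the cache itself, mirroring the rule that such pages are candidates for removal when memory becomes scarce.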


Chapter 15. Accessing Regular Files

Accessing a regular file is a complex activity that involves the VFS abstraction (Chapter 12), the handling of block devices (Chapter 13), and the use of disk caches (Chapter 14). This chapter shows how the kernel builds on all those facilities to accomplish file reads and writes. The topics covered in this chapter apply to regular files stored either in disk-based filesystems or in network filesystems such as NFS or Samba.

The stage we are working at in this chapter starts after the proper read or write method of a particular file has been called (as described in Chapter 12). We show here how each read ends with the desired data delivered to a User Mode process and how each write ends with data marked ready for transfer to disk. The rest of the transfer is handled by the facilities in Chapter 13 and Chapter 14.

In particular, in Section 15.1 we describe how regular files are accessed by means of the read( ) and write( ) system calls. When a process reads from a regular file, data is first moved from the disk itself to a set of buffers in the kernel's address space. This set of buffers is included in a set of pages in the page cache (see Section 13.6 in Chapter 13). Next, the pages are copied into the process's user address space. This chapter deals only with the move from the kernel to the user address space. A write is basically the opposite, although some stages are different from reads in important ways.

We also discuss in Section 15.2 how the kernel allows a process to directly map a regular file into its address space, because that activity also has to deal with pages in kernel memory.

15.1 Reading and Writing a Regular File

Section 13.5.4 in Chapter 13 described how the read( ) and write( ) system calls are implemented. The corresponding service routines end up invoking the file object's read and write methods, which may be filesystem-dependent. For disk-based filesystems, these methods locate the physical blocks containing the data being accessed and activate the block device driver to start the data transfer. However, reading and writing are performed differently in Linux.

Reading a regular file is page-based: the kernel always transfers whole pages of data at once. If a process issues a read( ) system call in order to get a few bytes, and that data is not already in RAM, the kernel allocates a new page frame, fills the page with the suitable portion of the regular file, adds the page to the page cache, and finally copies the requested bytes into the process address space. For most filesystems, reading a page of data from a regular file is just a matter of finding what blocks on disk contain the requested data. Once this is done, the kernel can use one or more page I/O operations to fill the pages.

Write operations for disk-based filesystems are much more complicated to handle, since the file size could change, and therefore the kernel might allocate or release some physical blocks on the disk. Of course, how this is precisely done depends on the filesystem type.

As a matter of fact, the read method of most filesystems is implemented by a common function named generic_file_read( ). However, all disk-based filesystems have a customized write method. Since the Second Extended Filesystem is the most efficient and


powerful one currently available for Linux and since it is thus the standard filesystem on most Linux systems, we discuss how it implements methods like write in Chapter 17. The generic_file_write( ) function is used only by NFS and Samba: since they are network filesystems, the kernel does not care about how the data is physically recorded on the remote disks.

15.1.1 Reading from a Regular File

Let's discuss the generic_file_read( ) function, which implements the read method for regular files of most filesystems. The function acts on the following parameters:

file
    Address of the file object

buf
    Linear address of the User Mode memory area where the characters read from the file must be stored

count
    Number of characters to be read

ppos
    Pointer to a variable storing the file offset from which reading must start

The function verifies that the parameters are correct by invoking access_ok( ) (see Section 8.2.4 in Chapter 8), then invokes do_generic_file_read( ), which performs the following steps:

1. Determines from *ppos whether the file offset from which reading must start is inside or outside the file's read-ahead window (see the next section).
2. Starts a cycle to read all the pages that include the requested count characters and initializes the pos local variable with the value *ppos. During a single iteration, the function reads a page of data by performing the following substeps:
   a. If pos exceeds the file size, exits from the cycle and goes to step 3.
   b. Invokes find_page( ) to check whether the page is already in the page cache.
   c. If the page is not in the page cache, allocates a new page frame, adds it to the page cache, and invokes the readpage method of the inode object to fill it. Although its implementation depends on the filesystem, most disk filesystems rely on a common generic_readpage( ) function, which performs the following operations:
      a. Sets the PG_locked flag of the page so that no other kernel control path can access the page contents.
      b. Increments the usage counter of the page descriptor. This is a fail-safe mechanism ensuring that, if the process that reads the page is killed while sleeping, the page frame won't be released to the Buddy system.


      c. Sets the PG_free_after flag, thus ensuring that the usage counter of the page descriptor is decremented when the page I/O operation terminates (see Section 13.6 in Chapter 13). This flag is necessary because device drivers may invoke the brw_page( ) function without first incrementing the usage counter. Currently no device driver does this, though.
      d. Computes the number of blocks needed to fill the page and, from the file offset of the page, derives the file block number of the first block in the page.
      e. Invokes the bmap method of the inode operation table on each file block number to get the corresponding logical block number.
      f. Invokes brw_page( ) to transfer the blocks in the page (see Section 13.6.1 in Chapter 13).
   d. Invokes the generic_file_readahead( ) function (see the next section).
   e. If the page is locked, invokes wait_on_page( ) to wait for I/O to complete.
   f. Copies the page (or a portion of it) into the process address space, updates pos to point to the next position in the file for a read to take place, and goes to step 2a to continue with the next requested page.
3. Assigns to *ppos the current value of pos, thus storing the next position where a read is to occur for a future invocation of this function.
4. Sets the f_reada field of the file object to 1 (see the next section).
5. Invokes update_atime( ) to store the current time in the i_atime field of the file's inode and to mark the inode as dirty.

15.1.2 Read-Ahead for Regular Files

In Section 13.5.3 in Chapter 13, we discussed how sequential disk accesses benefit from reading several adjacent blocks of a block device in advance, before they are actually requested. The same considerations apply to sequential read operations of regular files. However, read-ahead of regular files requires a more sophisticated algorithm than read-ahead of physical blocks for several reasons:

• Since data is read page by page, the read-ahead algorithm does not have to consider the offsets inside the page, but only the positions of the accessed pages inside the file. A sequence of accesses to pages of the same file is considered sequential if the related pages are close to each other. We'll define the word "close" more precisely in a moment.
• Read-ahead must be restarted from scratch when the current access is not sequential with respect to the previous one (random access).
• Read-ahead should be slowed down or even stopped when a process keeps accessing the same pages over and over again (only a small portion of the file is being used).
• If necessary, the read-ahead algorithm must activate the low-level I/O device driver to make sure that the new pages will be read.

We'll try now to sketch out how Linux implements read-ahead.
However, we won't be able to cover the algorithm in detail because the motivations behind it appear somewhat empirical.

The read-ahead algorithm identifies a set of pages corresponding to a contiguous portion of the file as the read-ahead window. If the next read operation issued by a process falls inside this set of pages, the kernel considers the file access as "sequential" to the previous one. The read-ahead window consists of pages requested by the process or read in advance by the kernel and included in the page cache. The read-ahead window always includes the pages requested in the last read-ahead operation; they are called the read-ahead group. Not all the pages in the read-ahead window or group are necessarily up-to-date. They are invalid (that is, their PG_uptodate flags are cleared) if their transfer from disk has not yet been completed.

The file object includes the following fields related to read-ahead:

f_raend
    Position of the first byte after the read-ahead group and the read-ahead window

f_rawin
    Length in bytes of the current read-ahead window

f_ralen
    Length in bytes of the current read-ahead group

f_ramax
    Maximum number of characters for the next read-ahead operation

f_reada
    Flag specifying whether the file has been accessed sequentially (used only when accessing a block device file; see Section 13.5.4 in Chapter 13)

Figure 15-1 illustrates how some of the fields are used to delimit the read-ahead window and the read-ahead group. The generic_file_readahead( ) function implements the read-ahead algorithm; its overall scheme is shown later in Figure 15-3.

Figure 15-1. Read-ahead window and read-ahead group

15.1.2.1 Read-ahead operations

The kernel performs a read-ahead operation by invoking a function named try_to_read_ahead( ) several times, once for each page to be read ahead.


The number of bytes to be read in advance is stored in the f_ramax field of the file object. This number is initially set to MIN_READAHEAD (usually three pages); it cannot become larger than MAX_READAHEAD (usually 31 pages).

The try_to_read_ahead( ) function checks whether the page considered is already included in the page cache; if not, the page is transferred by invoking the readpage method of the corresponding inode object. The function then doubles the value of the f_ramax field, so that the next read-ahead operation will be much more aggressive (accesses appear to be really sequential). As mentioned, the field is never allowed to exceed MAX_READAHEAD.

Next, the generic_file_readahead( ) function updates the file's read-ahead window and the read-ahead group. As we shall see later, the function may set up either a short read-ahead window or a long one: the former consists of only the most recent read-ahead group while the latter includes the two most recent read-ahead groups (see Figure 15-2).

Figure 15-2. Read-ahead group and window

Finally, the generic_file_readahead( ) function may activate the low-level block device driver by executing the tq_disk task queue (see Figure 15-3).

A special case occurs for the first read-ahead operation, where the previous read-ahead window and the previous read-ahead group are null (all related fields are set to 0).
The number of bytes to be read in advance in the first read-ahead operation is equal to MIN_READAHEAD or the number of bytes requested by the process in its read( ) system call, whichever value is larger.

15.1.2.2 Nonsequential access (outside the read-ahead window)

When a process issues a read request through the read( ) system call, the kernel checks whether the first page of the requested data is included in the current read-ahead window of the corresponding file object.

Suppose the first page is not included in the read-ahead window, perhaps because the file has never been accessed by the process or because the process issued an lseek( ) system call to reposition the current file pointer. The kernel considers every page requested by the process in turn; for each such page, it invokes the generic_file_readahead( ) function, which decides whether a read-ahead operation has to be performed.

Basically, two cases may occur (see Figure 15-3). Either the page is locked, meaning that the actual data transfer for the page has not been completed, or the page is unlocked and therefore up-to-date. (For the sake of simplicity, we will not consider I/O errors occurring while transferring the pages in this chapter.)

If the page is unlocked, the read-ahead operation does not take place. This rule ensures that no read-ahead operation is performed on a file whose data is entirely contained in the page cache. In such cases, any further read-ahead processing would be a waste of CPU time. Conversely, if the page is locked, the kernel starts a read-ahead operation and prepares a short read-ahead window for the next read-ahead.

Figure 15-3. Read-ahead scheme

15.1.2.3 Sequential access (inside the read-ahead window)

Suppose now that the first page accessed by the process through a read( ) system call falls inside the read-ahead window of the previous read-ahead operation. The kernel considers all pages requested by the process in turn; for each of them, it invokes the generic_file_readahead( ) function.
Four cases may occur, depending on whether the page is locked and whether it is included in the read-ahead group (see Figure 15-3):

• If the page is not locked but is included in the read-ahead group, the process has progressed to the point where it is accessing a page transferred from disk by the last read-ahead operation, so the process is dangerously close to the end of the pages read in advance. In this case, another read-ahead is performed and a long read-ahead window is used; moreover, the functions in the tq_disk task queue are executed to be sure that the low-level block device driver will be activated.
• If the page is neither locked nor included in the read-ahead group, the process is accessing a page transferred from disk in the next-to-last read-ahead operation. In this case, no reading in advance is done since the process is lagging with respect to read-ahead. This is the best scenario and shows that the kernel has an ample number of read-ahead pages being read sequentially by the process, just as we would hope.
• If the page is both locked and included in the read-ahead group, the process is accessing a page requested in the last read-ahead operation but probably not yet transferred from disk. In this case, it would be pointless to start an additional read-ahead.
• Finally, if the page is locked but not included in the read-ahead group, the process is accessing a page requested in the next-to-last read-ahead operation. In this case, the kernel performs another read-ahead because the page is likely to be written on disk and an additional read-ahead will cause no harm.

15.1.3 Writing to a Regular File

Recall that the write( ) system call involves moving data from the User Mode address space of the calling process into the kernel data structures, and then to disk. The write method of the file object permits each filesystem type to define a specialized write operation. In Linux 2.2, the write method of each disk-based filesystem is a procedure that basically identifies the disk blocks involved in the write operation, copies the data from the User Mode address space into the corresponding buffers, and then marks those buffers as dirty.
The procedure depends on the type of filesystem; we present one example in Section 17.7 in Chapter 17.

Disk-based filesystems do not directly use the page cache for writing to a regular file. This is a legacy of older versions of Linux, in which the only disk cache was the buffer cache. However, network-based filesystems always use the page cache for writing to a regular file.

The approach used in Linux 2.2, bypassing the page cache, leads to a synchronization problem. When writing takes place, the valid data is in the buffer cache but not in the page cache; more precisely, when the write method changes any portion of a file, any page in the page cache that corresponds to that portion no longer includes meaningful data. As an example of the problem, one process might think it's reading correct data but fail to see the changes written by another process.

In order to solve this problem, all write methods of disk-based filesystems invoke the update_vm_cache( ) function to update the page cache used by reads. This function acts on the following parameters:

inode
    Pointer to the inode object of the file to which the writes took place

pos
    Offset within the file where the writes took place

buf
    Address from which the characters to be written into the file must be fetched

count
    Number of characters to be written

The function updates page by page the portion of the page cache related to the file being written by performing the following operations:

1. Computes from pos the offset relative to a page of the first character to be written.
2. Invokes find_page( ) to search in the page cache for the page frame including the character at file offset pos.
3. If the page is found, performs the following substeps:
   a. Invokes wait_on_page( ) to wait until the page becomes unlocked (in case it is involved in an I/O data transfer).
   b. Fills the page with data from the process address space, starting from the page offset previously computed. This is the data that has already been written into the buffer cache by the filesystem's customized write method.
   c. Invokes free_page( ) to decrement the page's usage counter (it was incremented by the find_page( ) function).
4. If some data in the User Mode address space remains to be copied, updates pos, sets the page offset to 0, and goes to step 2.

Now we'll briefly describe the write operation for a regular file through the page cache. Recall that this operation is used only by network filesystems. In this case, the write method of the file object is implemented by the generic_file_write( ) function, which acts on the following parameters:

file
    File object pointer

buf
    Address from which the characters to be written into the file must be fetched

count
    Number of characters to be written

ppos
    Address of a variable storing the file offset from which writing must start

The function performs the following operations:

1. If the O_APPEND flag of file->flags is on, sets *ppos to the end of the file so that all new data is appended to it.
2. Starts a cycle to update all the pages involved in the write operation. During each iteration, performs the following substeps:
   a. Tries to find the page in the page cache. If it isn't there, allocates a free page and adds it to the page cache.


   b. Invokes wait_on_page( ) and then sets the PG_locked flag of the page descriptor, thus obtaining exclusive access to the page content.
   c. Invokes copy_from_user( ) to fill the page with data coming from the process address space.
   d. Invokes the inode operation's updatepage method. This method is specific to the particular network filesystem being used and is not described in this book. It should ensure that the remote file is properly updated with the newly written data.
   e. Unlocks the page, invokes wake_up( ) to wake up the processes suspended on the page wait queue, and invokes free_page( ) to decrement the page's usage counter (incremented by find_page( )).
3. Updates the value of *ppos to point right after the last character written.

15.2 Memory Mapping

As already mentioned in Section 7.3 in Chapter 7, a memory region can be associated with a file (or with some portion of it) of a disk-based filesystem. This means that an access to a byte within a page of the memory region is translated by the kernel into an operation on the corresponding byte of the regular file. This technique is called memory mapping.

Two kinds of memory mapping exist:

Shared
    Any write operation on the pages of the memory region changes the file on disk; moreover, if a process writes into a page of a shared memory mapping, the changes are visible to all other processes that map the same file.

Private
    Meant to be used when the process creates the mapping just to read the file, not to write it. For this purpose, private mapping is more efficient than shared mapping. However, any write operation on a privately mapped page causes that page to stop mapping the corresponding page of the file. Thus, a write does not change the file on disk, nor is the change visible to any other processes that access the same file.

A process can create a new memory mapping by issuing an mmap( ) system call (see Section 15.2.3 later in this chapter). Programmers must specify either the MAP_SHARED flag or the MAP_PRIVATE flag as a parameter of the system call; as you can probably guess, in the former case the mapping is shared while in the latter it is private. Once the mapping has been created, the process can read the data stored in the file by simply reading from the memory locations of the new memory region. If the memory mapping is shared, the process can also modify the corresponding file by simply writing into the same memory locations. In order to destroy or shrink a memory mapping, the process may use the munmap( ) system call (see Section 15.2.4).

As a general rule, if a memory mapping is shared, the corresponding memory region has the VM_SHARED flag set; if it is private, the VM_SHARED flag is cleared. As we'll see later, an exception to this rule exists for read-only shared memory mappings.


15.2.1 Memory Mapping Data Structures

A memory mapping is represented by a combination of the following data structures:

• The inode object associated with the mapped file
• A file object for each different mapping performed on that file by different processes
• A vm_area_struct descriptor for each different mapping on the file
• A page descriptor for each page frame assigned to a memory region that maps the file

Figure 15-4 illustrates how the data structures are linked together. In the upper left corner we show the inode. The i_mmap field of each inode object points to the first element of a doubly linked list that includes all memory regions that currently map the file; if i_mmap is NULL, the file is not mapped by any memory region. The list contains vm_area_struct descriptors representing memory regions and is implemented by means of the vm_next_share and vm_pprev_share fields.

Figure 15-4. Data structures related to memory mapping

The vm_file field of each memory region descriptor contains the address of a file object for the mapped file; if that field is null, the memory region is not used in a memory mapping. The file object contains fields that allow the kernel to identify both the process that owns the memory mapping and the file being mapped.

The position of the first mapped location is stored in the vm_offset field of the memory region descriptor. The length of the mapped file portion is simply the length of the memory region, and thus can be computed from the vm_start and vm_end fields.


Pages of shared memory mappings are always included in the page cache; pages of private memory mappings are included in the page cache as long as they are unmodified. When a process tries to modify a page of a private memory mapping, the kernel duplicates the page frame and replaces the original page frame with the duplicate in the process Page Table; this is one of the applications of the Copy On Write mechanism that we discussed in Chapter 7. The original page frame still remains in the page cache, although it no longer belongs to the memory mapping since it is replaced by the duplicate. In turn, the duplicate is not inserted into the page cache since it no longer contains valid data representing the file on disk.

Figure 15-4 also shows the page descriptors of the pages included in the page cache that refer to the memory-mapped file. As described in Chapter 14, these descriptors are inserted into a doubly linked list implemented through the next and prev fields. The address of the first list element is in the inode object's i_pages field. Once again, modified pages of private memory mappings do not belong to this list.
Notice that the first memory region in the figure is three pages long, but only two page frames are allocated for it; presumably, the process owning the memory region has never accessed the third page.

15.2.2 Operations Associated with a Memory Region

Memory region descriptors are objects, similar to the superblock, inode, and file objects described in Chapter 12. Like them, memory regions have their own methods. In fact, each vm_area_struct descriptor includes a vm_ops field that points to a vm_operations_struct structure. This table, which contains the methods associated with a memory region, includes the fields illustrated in Table 15-1.

Table 15-1. Memory Region Methods

Method     Description
open       Open the region.
close      Close the region.
unmap      Unmap a linear address interval.
protect    Not used.
sync       Flush the memory region content.
advise     Not used.
nopage     Demand paging.
wppage     Not used.
swapout    Swap out a page belonging to the region.
swapin     Swap in a page belonging to the region.

The memory region operations allow different filesystems to implement their own memory mapping functions. In practice, most filesystems rely on two standard tables of memory region operations: file_shared_mmap, which is used for shared memory mappings, and file_private_mmap, which is used for private memory mappings. Table 15-2 shows the names of the relevant methods.

As usual, when a method has a NULL value, the kernel invokes either a default function or no function at all. If the nopage method has a NULL value, the memory region is anonymous; that is, it does not map any file on disk. This use is discussed in Section 7.4.3 in Chapter 7.


Table 15-2. Methods Used by file_shared_mmap and by file_private_mmap

Method     file_shared_mmap    file_private_mmap
open       NULL                NULL
close      NULL                NULL
unmap      filemap_unmap       NULL
protect    NULL                NULL
sync       filemap_sync        NULL
advise     NULL                NULL
nopage     filemap_nopage      filemap_nopage
wppage     NULL                NULL
swapout    filemap_swapout     NULL
swapin     NULL                NULL

15.2.3 Creating a Memory Mapping

To create a new memory mapping, a process issues an mmap( ) system call, passing the following parameters to it:

• A file descriptor identifying the file to be mapped.
• An offset inside the file specifying the first character of the file portion to be mapped.
• The length of the file portion to be mapped.
• A set of flags. The process must explicitly set either the MAP_SHARED flag or the MAP_PRIVATE flag to specify the kind of memory mapping requested. [1]
• A set of permissions specifying one or more types of access to the memory region: read access (PROT_READ), write access (PROT_WRITE), or execution access (PROT_EXEC).
• An optional linear address, which is taken by the kernel as a hint of where the new memory region should start. If the MAP_FIXED flag is specified and the kernel cannot allocate the new memory region starting from the specified linear address, the system call fails.

[1] The process could also set the MAP_ANONYMOUS flag to specify that the new memory region is anonymous, that is, not associated with any file (see Section 7.4.3 in Chapter 7); however, this flag is a Linux extension and it is not defined by the POSIX standard.

The mmap( ) system call returns the linear address of the first location in the new memory region.
The service routine is implemented by the old_mmap( ) function, which essentially invokes the do_mmap( ) function already described in Section 7.3.4 in Chapter 7. We now complete that description by detailing the steps performed only when creating a memory region that maps a file.

1. Checks whether the mmap file operation for the file to be mapped is defined; if not, returns an error code. A NULL value for mmap in the file operation table indicates that the corresponding file cannot be mapped (for instance, because it is a directory).
2. In addition to the usual consistency checks, compares the kind of memory mapping requested and the flags specified when the file was opened. The flags passed as a parameter of the system call specify the kind of mapping required, while the value of the f_mode field of the file object specifies how the file was opened. Depending on these two sources of information, performs the following checks:


   a. If a shared writable memory mapping is required, checks that the file was opened for writing and that it was not opened in append mode (O_APPEND flag of the open( ) system call).
   b. If a shared memory mapping is required, checks that there is no mandatory lock on the file (see Section 12.6 in Chapter 12).
   c. For any kind of memory mapping, checks that the file was opened for reading.
   If any of these conditions is not fulfilled, returns an error code.
3. When initializing the value of the vm_flags field of the new memory region descriptor, sets the VM_READ, VM_WRITE, VM_EXEC, VM_SHARED, VM_MAYREAD, VM_MAYWRITE, VM_MAYEXEC, and VM_MAYSHARE flags according to the access rights of the file and the kind of requested memory mapping (see Section 7.3.2 in Chapter 7). As an optimization, the VM_SHARED flag is cleared for nonwritable shared memory mapping. This can be done because the process is not allowed to write into the pages of the memory region, thus the mapping is treated the same as a private mapping; however, the kernel actually allows other processes that share the file to access the pages in this memory region.
4. Invokes the mmap method for the file being mapped, passing as parameters the address of the file object and the address of the memory region descriptor. For most filesystems, this method is implemented by the generic_file_mmap( ) function, which performs the following operations:
   a. Initializes the vm_ops field of the memory region descriptor. If VM_SHARED is on, sets the field to file_shared_mmap, otherwise sets the field to file_private_mmap (see Table 15-2).
      In a sense, this step does something similar to the way opening a file initializes the methods of a file object.
   b. Checks from the inode's i_mode field whether the file to be mapped is a regular one. For other types of files, such as a directory or socket, returns an error code.
   c. Checks from the inode's i_op field whether the readpage( ) inode operation is defined. If not, returns an error code.
   d. Invokes update_atime( ) to store the current time in the i_atime field of the file's inode and to mark the inode as dirty.
5. Initializes the vm_file field of the memory region descriptor with the address of the file object and increments the file's usage counter.
6. Recall from Section 7.3.4 in Chapter 7 that do_mmap( ) invokes insert_vm_struct( ). During the execution of this function, inserts the memory region descriptor into the list to which the inode's i_mmap field points.

15.2.4 Destroying a Memory Mapping

When a process is ready to destroy a memory mapping, it invokes the munmap( ) system call, passing the following parameters to it:

• The address of the first location in the linear address interval to be removed
• The length of the linear address interval to be removed

Notice that the munmap( ) system call can be used to either remove or reduce the size of any kind of memory region. Indeed, the sys_munmap( ) service routine of the system call essentially invokes the do_munmap( ) function already described in Section 7.3.5 in Chapter 7. However, if the memory region maps a file, the following additional steps are performed for each memory region included in the range of linear addresses to be released:

1. Checks whether the memory region has an unmap method; if so, invokes it. Usually, private memory mappings do not have such a method since the file must not be updated. Shared memory mappings make use of the filemap_unmap( ) function, which invokes, in turn, filemap_sync( ) (see Section 15.2.6).
2. Invokes remove_shared_vm_struct( ) to remove the memory region descriptor from the inode list pointed to by the i_mmap field.

15.2.5 Demand Paging for Memory Mapping

For reasons of efficiency, page frames are not assigned to a memory mapping right after it has been created but at the last possible moment, that is, when the process attempts to address one of its pages, thus causing a "Page fault" exception.

We have seen in Section 7.4 in Chapter 7 how the kernel verifies whether the faulty address is included in some memory region of the process; in the affirmative case, the kernel checks the page table entry corresponding to the faulty address and invokes the do_no_page( ) function if the entry is null (see Section 7.4.3 in Chapter 7).

The do_no_page( ) function performs all the operations that are common to all types of demand paging, such as allocating a page frame and updating the page tables. It also checks whether the nopage method of the memory region involved is defined.
In Section 7.4.3 in Chapter 7, we described the case in which the method is undefined (anonymous memory region); now we complete the description by discussing the actions performed by the function when the method is defined:

1. Invokes the nopage method, which returns the address of a page frame that contains the requested page.
2. Increments the rss field of the process memory descriptor to indicate that a new page frame has been assigned to the process.
3. Sets up the Page Table entry corresponding to the faulty address with the address of the page frame returned by the nopage method and the page access rights included in the memory region vm_page_prot field. Further actions depend on the type of access:
   o If the process is trying to write into the page, forces the Read/Write and Dirty bits of the Page Table entry to 1. In this case, either the page frame is exclusively assigned to the process, or the page is shared: in both cases, writing to it should be allowed. (This avoids a second useless "Page fault" exception caused by the Copy On Write mechanism.)
   o If the process is trying to read from the page, the memory region VM_SHARED flag is not set, and the page's usage counter is greater than 1, forces the Read/Write bit of the Page Table entry to 0.
     This is because the page's usage counter indicates that other processes are sharing the page; since the page doesn't belong to a shared memory region, it must be handled through the Copy On Write mechanism.

The core of the demand paging algorithm consists of the memory region's nopage method. Generally speaking, it must return the address of a page frame that contains the page accessed by the process. Its implementation depends on the kind of memory region in which the page is included.

When handling memory regions that map files on disk, the nopage method must first search for the requested page in the page cache. If the page is not found, the method must read it from disk. Most filesystems implement the nopage method by means of the filemap_nopage( ) function, which acts on three parameters:

area
   Descriptor address of the memory region including the required page.

address
   Linear address of the required page.

no_share
   Flag specifying whether the page frame returned by the function must not be shared among many processes. The do_no_page( ) function sets this flag only if the process is trying to write into the page and the VM_SHARED flag is off.

The filemap_nopage( ) function executes the following steps:

1. Gets the file object address from the area->vm_file field. Derives the inode object address from the d_inode field of the dentry object to which the f_dentry field of the file object points.
2. Uses the vm_start and vm_offset fields of area to determine the offset within the file of the data corresponding to the page starting from address.
3. Checks whether the file offset exceeds the file size. If this occurs and the VM_SHARED flag is on, returns 0, thus causing a SIGBUS signal to be sent to the process. (Private memory mappings behave differently: a new page frame filled with zeros is assigned to the process.)
4. Invokes find_page( ) to look in the page cache for the page identified by the inode object and the file offset. If the page isn't there, invokes try_to_read_ahead( ) to allocate a new page frame, to add it to the page cache, and to fill its contents with data read from disk. Actually, the kernel tries to read ahead the next page_cluster pages as well (see Chapter 16).
5. Invokes wait_on_page( ) to wait until the required page becomes unlocked (that is, until any current I/O data transfer for the page terminates).
6. If the no_share flag is 0, the page frame can be shared: returns its address.
7. The no_share flag is 1, so the process tried to write into a page of a private memory mapping (or, more precisely, of a memory region whose VM_SHARED flag is off). Therefore, allocates a new page frame by performing the following operations:
   a. Invokes the __get_free_page( ) function
   b. Copies the page included in the page cache into the new page frame
   c. Decrements the usage counter of the page in the page cache in order to undo the increment done by find_page( )
   d. Returns the address of the new page frame
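
Step 2 of filemap_nopage( ) and the rule that governs the no_share flag can be sketched in a few lines of C. This is only an illustration under stated assumptions, not kernel source: struct region, file_offset( ), and needs_private_copy( ) are hypothetical names, only the fields mentioned in the text are modeled, and the numeric value given to VM_SHARED is chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value; only the meaning of the bit matters in this sketch. */
#define VM_SHARED 0x0008UL

/* Hypothetical, reduced stand-in for a memory region descriptor. */
struct region {
    unsigned long vm_start;   /* first linear address of the region */
    unsigned long vm_offset;  /* file offset mapped at vm_start */
    unsigned long vm_flags;   /* VM_* flags */
};

/* Step 2: file offset of the data for the page starting at `address`. */
static unsigned long file_offset(const struct region *area,
                                 unsigned long address)
{
    assert(address >= area->vm_start);  /* address must lie in the region */
    return (address - area->vm_start) + area->vm_offset;
}

/* no_share rule: a private copy of the page is needed only for a write
 * access to a region whose VM_SHARED flag is off. */
static bool needs_private_copy(const struct region *area, bool write_access)
{
    return write_access && !(area->vm_flags & VM_SHARED);
}
```

In this model, a read fault or any fault on a shared region can be served directly from the page cache (step 6), while a write fault on a private region takes the copy path of step 7.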


15.2.6 Flushing Dirty Memory Mapping Pages to Disk

The msync( ) system call can be used by a process to flush to disk dirty pages belonging to a shared memory mapping. It receives as parameters the starting address of an interval of linear addresses, the length of the interval, and a set of flags that have the following meanings.

MS_SYNC
   Asks the system call to suspend the process until the I/O operation completes. In this way the calling process can assume that when the system call terminates, all pages of its memory mapping have been flushed to disk.

MS_ASYNC
   Asks the system call to return immediately without suspending the calling process.

MS_INVALIDATE
   Asks the system call to remove all pages included in the memory mapping from the process address space.

The sys_msync( ) service routine invokes msync_interval( ) on each memory region included in the interval of linear addresses (see Section 7.3.4 in Chapter 7). In turn, the latter function performs the following operations:

1. If the vm_file field of the memory region descriptor is NULL, returns (the memory region doesn't map a file).
2. Invokes the sync method of the memory region operations. In most filesystems, this method is implemented by the filemap_sync( ) function (described shortly).
3. If the MS_SYNC flag is on, invokes the file_fsync( ) function to flush to disk all related file information: the file's inode, the filesystem's superblock, and (by means of sync_buffers( )) all the dirty buffers of the file.

The filemap_sync( ) function copies data included in the memory region to disk. It starts by scanning the Page Table entries corresponding to the linear address intervals included in the memory region. For each page frame found, it performs the following steps:

1. Invokes flush_tlb_page( ) to flush the translation lookaside buffers.
2. If the MS_INVALIDATE flag is off, increments the usage counter of the page descriptor.
3. If the MS_INVALIDATE flag is on, sets the corresponding Page Table entry to 0, thus specifying that the page is no longer present.
4. Invokes the filemap_write_page( ) function, which in turn performs the following substeps:
   a. Increments the usage counter of the file object associated with the file. This is a fail-safe mechanism, so that the file object is not freed if the process terminates while the I/O data transfer is still going on.
   b. Invokes do_write_page( ), which essentially executes the write method of the file operations, thus simulating a write( ) system call on the file. In this case, of course, the data to be written is not taken from a User Mode buffer but from the page being flushed.


   c. Invokes fput( ) to decrement the usage counter of the file object, thus compensating for the increment made in step 4a.
5. Invokes free_page( ) to decrease the usage counter of the page descriptor; this compensates for the increment performed at step 2 if the MS_INVALIDATE flag is off. Otherwise, if the flag is on, the global effect of filemap_sync( ) is to release the page frame (giving it back to the Buddy system if the counter becomes 0).

15.3 Anticipating Linux 2.4

The approach followed remains basically the same. However, if a memory region is recognized as "sequential read," read-ahead is performed while reading pages from disk. Moreover, as already mentioned at the end of Chapter 13, writing to a regular file is much simpler in Linux 2.4 because it can be easily done through page I/O operations.


Chapter 16. Swapping: Methods for Freeing Memory

The disk caches examined in previous chapters used RAM as an extension of the disk; the goal was to improve system response time and the solution was to reduce the number of disk accesses. In this chapter we introduce an opposite approach called swapping: here the kernel uses some space on disk as an extension of RAM. Swapping is transparent to the programmer: once the swapping areas have been properly installed and activated, the processes may run under the assumption that they have all the physical memory available that they can address, never knowing that some of their pages are stored away and retrieved again as needed.

Disk caches enhance system performance at the expense of free RAM, while swapping extends the amount of addressable memory at the expense of access speed. Thus, disk caches are "good" and desirable, while swapping should be regarded as some sort of last resort to be used whenever the amount of free RAM becomes too scarce.

We'll start in Section 16.1 by defining swapping. Then we'll describe in Section 16.2 the main data structures introduced by Linux to implement it. We discuss the swap cache and the low-level functions that transfer pages between RAM and swap areas in both directions.
The two crucial sections are Section 16.5, where we describe the procedure used to select a page to be swapped out to disk, and Section 16.6, where we explain how a page stored in a swap area is read back into RAM when the need occurs.

This chapter effectively concludes our discussion of memory management. Just one topic remains to be covered, namely page frame reclaiming; this is done in the last section, which is related only in part to swapping. With so many disk caches around, including the swap cache, all the available RAM could eventually end up in these caches and no more free RAM would be left. We shall see how the kernel prevents this by monitoring the amount of free RAM and by freeing pages from the caches or from the process address spaces, as the need occurs.

16.1 What Is Swapping?

Swapping serves two main purposes:

• To expand the address space that is effectively usable by a process
• To expand the amount of dynamic RAM (what is left of the RAM once the kernel code and static data structures have been initialized) to load processes

Let's give a few examples of how swapping benefits the user. The simplest is when a program's data structures take up more space than the size of the available RAM. A swap area will allow this program to be loaded without any problem, thus to run correctly. A more subtle example involves users who issue several commands trying to simultaneously run large applications that require a lot of memory. If no swap area is active, the system might reject requests to launch a new application.
In contrast, a swap area allows the kernel to launch it, since some memory can be freed at the expense of some of the already existing processes without killing them.

These two examples illustrate the benefits, but also the drawbacks, of swapping. Simulation of RAM is not like RAM in terms of performance. Every access by a process to a page that is currently swapped-out increases the process execution time by several orders of magnitude. In short, if performance is of great importance, swapping should be used only as a last resort; adding RAM chips still remains the best solution to cope with increasing computing needs. It is fair to say, however, that, in some cases, swapping may be beneficial to the system as a whole. Long-running processes typically access only half of the page frames obtained. Even when some RAM is available, swapping unused pages out and using the RAM for disk cache can improve overall system performance.

Swapping has been around for many years. The first Unix system kernels monitored the amount of free memory constantly. When it became less than a fixed threshold, they performed some swapping-out. This activity consisted of copying the entire address space of a process to disk. Conversely, when the scheduling algorithm selected a swapped-out process, the whole process was swapped in from disk.

This approach has been abandoned by modern Unix kernels, including Linux, mainly because context switches are quite expensive when they involve swapping in swapped-out processes. To compensate for the burden of such swapping activity, the scheduling algorithm must be very sophisticated: it must favor in-RAM processes without completely shutting out the swapped-out ones.

In Linux, swapping is currently performed at the page level rather than at the process address space level. This finer level of granularity has been reached thanks to the inclusion of a hardware paging unit in the CPU.
We recall from Section 2.4.1 in Chapter 2 that each Page Table entry includes a Present flag: the kernel can take advantage of this flag to signal to the hardware that a page belonging to a process address space has been swapped out. Besides that flag, Linux also takes advantage of the remaining bits of the Page Table entry to store the location of the swapped-out page on disk. When a "Page fault" exception occurs, the corresponding exception handler can detect that the page is not present in RAM and invoke the function that swaps the missing page in from the disk.

Much of the algorithm's complexity is thus related to swapping-out. In particular, four main issues must be considered:

• Which kind of page to swap out
• How to distribute pages in the swap areas
• How to select the page to be swapped out
• When to perform page swap-out

Let us give a short preview of how Linux handles these four issues before describing the main data structures and functions related to swapping.

16.1.1 Which Kind of Page to Swap Out

Swapping applies only to the following kinds of pages:

• Pages belonging to an anonymous memory region (for instance, a User Mode stack) of a process
• Modified pages belonging to a private memory mapping of a process
• Pages belonging to an IPC shared memory region (see Section 18.3.5 in Chapter 18)


The remaining kinds of pages are either used by the kernel or used to map files on disk. In the first case, they are ignored by swapping because this simplifies the kernel design; in the second case, the best swap areas for the pages are the files themselves.

16.1.2 How to Distribute Pages in the Swap Areas

Each swap area is organized into slots, where each slot contains exactly one page. When swapping out, the kernel tries to store pages in contiguous slots so as to minimize disk seek time when accessing the swap area; this is an important element of an efficient swapping algorithm.

If more than one swap area is used, things become more complicated. Faster swap areas, that is, swap areas stored in faster disks, get a higher priority. When looking for a free slot, the search starts in the swap area having the highest priority. If there are several of them, swap areas of the same priority are cyclically selected in order to avoid overloading one of them. If no free slot is found in the swap areas having the highest priority, the search continues in the swap areas having a priority next to the highest one, and so on.

16.1.3 How to Select the Page to Be Swapped Out

With the exception of pages belonging to IPC shared memory, which will be discussed in Chapter 18, the general rule for swapping out is to steal pages from the process having the largest number of pages in RAM.
However, a choice must be made among the pages of the process chosen for swap-out: it would be nice to be able to rank them according to some criterion. Several Least Recently Used (LRU) replacement algorithms have been proposed and used in some kernels. The main idea is to associate with each page in RAM a counter storing the age of the page, that is, the interval of time elapsed since the last access to the page. The oldest page of the process can then be swapped out.

Some computer platforms provide sophisticated support for LRU algorithms; for instance, the CPUs of some mainframes automatically update the value of a counter included in each Page Table entry to specify the age of the corresponding page. But Intel 80x86 processors do not offer such a hardware feature, so Linux cannot use a true LRU algorithm. However, when selecting a candidate for swap-out, Linux takes advantage of the Accessed flag included in each Page Table entry, which is automatically set by the hardware when the page is accessed. As we'll see later, this flag is set and cleared in a rather simplistic way to keep pages from being swapped in and out too much.

16.1.4 When to Perform Page Swap-out

Swapping out is useful when the kernel is dangerously low on memory. In fact, the kernel keeps a small reserve of free page frames that can be used only by the most critical functions. This turns out to be essential to avoid system crashes, which might occur when a kernel routine invoked to free resources is unable to obtain the memory area it needs to complete its task.
In order to protect this reserve of free page frames, Linux performs a swap-out on the following occasions:

• By a kernel thread denoted as kswapd, which is activated once every second, whenever the number of free page frames falls below a predefined threshold


• When a memory request to the Buddy system (see Section 6.1.2 in Chapter 6) cannot be satisfied because the number of free page frames would fall below a predefined threshold

Figure 16-1. Main functions related to swapping

Several functions are concerned with swapping. Figure 16-1 illustrates the most important ones. They will be discussed in the following sections.

16.2 Swap Area

The pages swapped out from memory are stored in a swap area, which may be implemented either as a disk partition of its own or as a file included in a larger partition. Several different swap areas may be defined, up to a maximum number specified by the MAX_SWAPFILES macro (usually set to 8).

Having multiple swap areas allows a system administrator to spread a lot of swap space among several disks so that the hardware can act on them concurrently; it also lets swap space be increased at runtime without rebooting the system.

Each swap area consists of a sequence of page slots, that is, of 4096-byte blocks used to contain a swapped-out page. The first page slot of a swap area is used to persistently store some information about the swap area; its format is described by the swap_header union composed of two structures, info and magic. The magic structure consists of just one field, magic.magic, containing a 10-character "magic" string that allows the kernel to unambiguously identify a file or a partition as a swap area; the text of the string depends on the swapping algorithm version.
The field is always located at the end of the first page slot.
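
Since the magic string occupies the last 10 bytes of the 4096-byte first page slot, recognizing a swap area amounts to a fixed-offset comparison. The sketch below is a user-space illustration, not kernel code: has_swap_magic( ) and the two constants are hypothetical names, and the expected string is passed in by the caller because its text depends on the swapping algorithm version.

```c
#include <stdbool.h>
#include <string.h>

#define PAGE_SLOT_SIZE 4096  /* page slot size given in the text */
#define SWAP_MAGIC_LEN 10    /* length of the "magic" string */

/* Returns true if the last SWAP_MAGIC_LEN bytes of the first page slot
 * match `magic`. `first_slot` must point to PAGE_SLOT_SIZE bytes and
 * `magic` to at least SWAP_MAGIC_LEN bytes. */
static bool has_swap_magic(const unsigned char *first_slot, const char *magic)
{
    const unsigned char *field = first_slot + PAGE_SLOT_SIZE - SWAP_MAGIC_LEN;
    return memcmp(field, magic, SWAP_MAGIC_LEN) == 0;
}
```

Placing the field at the end of the slot leaves the beginning of the slot free for the other header data described next.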


The info structure includes the following fields:

info.bootbits
   Not used by the swapping algorithm; this field corresponds to the first 1024 bytes of the swap area, which may store partition data, disk labels, and so on.

info.version
   Swapping algorithm version.

info.last_page
   Last page slot that is effectively usable.

info.nr_badpages
   Number of defective page slots.

info.padding[125]
   Padding bytes.

info.badpages[1]
   Up to 640 numbers specifying the location of defective page slots.

Usually, the system administrator creates a swap partition when creating the other partitions on the Linux system, then uses the /sbin/mkswap command to set up the disk area as a new swap area. That command initializes the fields just described within the first page slot. Since the disk may include some bad blocks, the program also examines all other page slots in order to locate the defective ones. But executing the /sbin/mkswap command leaves the swap area in an inactive state. Each swap area can be activated in a script file at system boot or dynamically after the system is running. An initialized swap area is considered active when it effectively represents an extension of the system RAM (see Section 16.2.3 later in this chapter).

16.2.1 Swap Area Descriptor

Each active swap area has its own swap_info_struct descriptor in memory, whose fields are illustrated in Table 16-1.


Table 16-1. Fields of a Swap Area Descriptor

Type               Field          Description
unsigned int       flags          Swap area flags
kdev_t             swap_device    Device number of the swap device
struct dentry *    swap_file      Dentry of the file or device file
unsigned short *   swap_map       Pointer to array of counters, one for each swap area page slot
unsigned char *    swap_lockmap   Pointer to array of bit locks, one for each swap area page slot
unsigned int       lowest_bit     First page slot to be scanned when searching for a free one
unsigned int       highest_bit    Last page slot to be scanned when searching for a free one
unsigned int       cluster_next   Next page slot to be scanned when searching for a free one
unsigned int       cluster_nr     Number of free page slot allocations before restarting from beginning
int                prio           Swap area priority
int                pages          Number of usable page slots
unsigned long      max            Size of swap area in pages
int                next           Index of next swap area descriptor in the swap_info array

The flags field includes two overlapping subfields:

SWP_USED
   1 if the swap area is active; 0 if it is nonactive.

SWP_WRITEOK
   This 2-bit field is set to 3 if it is possible to write into the swap area; any other value means that writing is not allowed. Since the least-significant bit of this field coincides with the bit used to implement SWP_USED, a swap area can be written only if it is active. The kernel is not allowed to write in a swap area when it is being activated or deactivated.

The swap_map field points to an array of counters, one for each swap area page slot. If the counter is equal to 0, the page slot is free; if it is positive, the page slot is filled with a swapped-out page (the exact meaning of positive values will be discussed in Section 16.3).
If the counter has the value SWAP_MAP_MAX (equal to 32,767), the page stored in the page slot is "permanent" and cannot be removed from the corresponding slot. If the counter has the value SWAP_MAP_BAD (equal to 32,768), the page slot is considered defective, thus unusable.

The swap_lockmap field points to an array of bits, one for each swap area page slot. If a bit is set, the page stored in the page slot is currently being swapped in or swapped out. This bit is thus used as a lock to ensure exclusive access to the page slot during an I/O data transfer.

The prio field is a signed integer that denotes the "goodness" of the swap area. Swap areas implemented on faster disks should have a higher priority, so that they will be used first. Only when they are filled does the swapping algorithm consider lower-priority swap areas. Swap areas having the same priority are cyclically selected in order to distribute swapped-out pages among them. As we shall see in Section 16.2.3, the priority is assigned when the swap area is activated.
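
The encodings of the flags subfields and of the swap_map counters described above can be made concrete with a small sketch. The constant values are the ones stated in the text; the enum slot_state type and the two helper functions are hypothetical names introduced only for illustration, and any counter value the text does not describe is simply treated as "in use" here.

```c
#include <stdbool.h>

#define SWP_USED      1      /* swap area is active */
#define SWP_WRITEOK   3      /* 2-bit field; its low bit overlaps SWP_USED */
#define SWAP_MAP_MAX  32767  /* "permanent" page slot */
#define SWAP_MAP_BAD  32768  /* defective page slot */

/* Hypothetical classification of a swap_map counter. */
enum slot_state { SLOT_FREE, SLOT_IN_USE, SLOT_PERMANENT, SLOT_BAD };

static enum slot_state classify_slot(unsigned short counter)
{
    if (counter == 0)
        return SLOT_FREE;
    if (counter == SWAP_MAP_BAD)
        return SLOT_BAD;
    if (counter == SWAP_MAP_MAX)
        return SLOT_PERMANENT;
    return SLOT_IN_USE;  /* ordinary positive counter */
}

/* Writing is allowed only when both bits of SWP_WRITEOK are set,
 * which implies that the SWP_USED bit is set as well. */
static bool swap_area_writable(unsigned int flags)
{
    return (flags & SWP_WRITEOK) == SWP_WRITEOK;
}
```

Testing both bits at once is what makes "writable" imply "active": a swap area whose flags equal SWP_USED alone is rejected.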


The swap_info array includes MAX_SWAPFILES swap area descriptors. Of course, not all of them are necessarily used, only those having the SWP_USED flag set. Figure 16-2 illustrates the swap_info array, one swap area, and the corresponding array of counters.

Figure 16-2. Swap area data structures

The nr_swapfiles variable stores the index of the last array element that contains, or that has contained, an effectively used swap area descriptor. Despite its name, the variable does not contain the number of active swap areas.

Descriptors of active swap areas are also inserted into a list sorted by the swap area priority. The list is implemented through the next field of the swap area descriptor, which stores the index of the next descriptor in the swap_info array. This use of the field as an index is different from most fields we've seen with the name next, which are usually pointers.

The swap_list variable, of type swap_list_t, includes the following fields:

head
   Index in the swap_info array of the first list element.

next
   Index in the swap_info array of the descriptor of the next swap area to be selected for swapping out pages. This field is used to implement a round-robin algorithm among maximum-priority swap areas with free slots.

The max field stores the size of the swap area in pages, while the pages field stores the number of usable page slots.
<strong>The</strong>se numbers differ because pag<strong>es</strong> do<strong>es</strong> not take intoconsideration the first page slot and the defective page slots.Finally, the nr_swap_pag<strong>es</strong> variable contains the total number of free, nondefective pag<strong>es</strong>lots in all active swap areas.423
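The index-linked, priority-sorted list of descriptors can be illustrated with a minimal sketch. It is not the kernel's code: the descriptor is trimmed to three fields, the value of MAX_SWAPFILES and the use of -1 as the end-of-list marker are assumptions made for the example, and swap_list_insert( ) is a hypothetical helper name. A new area is placed in front of areas having the same priority, which is one possible tie-breaking choice.

```c
#include <assert.h>

#define MAX_SWAPFILES 8   /* assumed value for this sketch */
#define SWP_USED      1

struct swap_info_struct {      /* trimmed-down descriptor */
    unsigned int flags;        /* SWP_USED is set when the descriptor is in use */
    int prio;                  /* higher values are used first */
    int next;                  /* index of the next descriptor, -1 at the end (assumption) */
};

struct swap_list_t {
    int head;                  /* index of the first list element */
    int next;                  /* index of the next area selected for swapping out */
};

static struct swap_info_struct swap_info[MAX_SWAPFILES];
static struct swap_list_t swap_list = { -1, -1 };

/* Hypothetical helper: link descriptor `type` into the list so that the
 * list stays sorted by decreasing priority. */
static void swap_list_insert(int type, int prio)
{
    int prev = -1, i;

    assert(type >= 0 && type < MAX_SWAPFILES);
    swap_info[type].flags = SWP_USED;
    swap_info[type].prio = prio;

    /* Skip every descriptor whose priority is strictly higher. */
    for (i = swap_list.head; i >= 0; i = swap_info[i].next) {
        if (prio >= swap_info[i].prio)
            break;
        prev = i;
    }
    swap_info[type].next = i;
    if (prev < 0)
        swap_list.head = swap_list.next = type;  /* new highest-priority area */
    else
        swap_info[prev].next = type;
}
```

Because the links are array indexes rather than pointers, walking the list is a matter of repeatedly reading swap_info[i].next until the end marker is met.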


16.2.2 Swapped-out Page Identifier

A swapped-out page is uniquely identified quite simply by specifying the index of the swap area in the swap_info array and the page slot index inside the swap area. Since the first page (with index 0) of the swap area is reserved for the swap_header union discussed earlier, the first useful page slot has index 1. The format of a swapped-out page identifier is illustrated in Figure 16-3.

Figure 16-3. Swapped-out page identifier

The SWP_ENTRY(type,offset) macro constructs a swapped-out page identifier from the swap area index type and the page slot index offset. Conversely, the SWP_TYPE and SWP_OFFSET macros extract from a swapped-out page identifier the swap area index and the page slot index, respectively.

When a page is swapped out, its identifier is inserted as the page's entry into the Page Table so the page can be found again when needed. Notice that the least-significant bit of such an identifier, which corresponds to the Present flag, is always cleared to denote the fact that the page is not currently in RAM. However, at least 1 of the 30 most-significant bits has to be set because no page is ever stored in slot 0. It is thus possible to identify, from the value of a Page Table entry, three different cases:

• Null entry: the page does not belong to the process address space.
• First 30 most-significant bits not all equal to 0, last 2 bits equal to 0: the page is currently swapped out.
• Least-significant bit equal to 1: the page is contained in RAM.

Since a page may belong to the address spaces of several processes (see Section 16.3), it may be swapped out from the address space of one process and still remain in main memory; therefore, it is possible to swap out the same page several times. A page is physically swapped out and stored just once, of course, but each subsequent attempt to swap it out increments the swap_map counter.

The swap_duplicate( ) function is invoked while trying to swap out an already swapped-out page. It just verifies that the swapped-out page identifier passed as its parameter is valid and increments the corresponding swap_map counter. More precisely, it performs the following actions:

1. Uses the SWP_TYPE and SWP_OFFSET macros to extract from the parameter the partition number type and the page slot index offset.
2. Checks whether one of the following error conditions occurs:
   a. type is greater than nr_swapfiles.
   b. The SWP_USED flag in swap_info[type].flags is cleared, indicating that the swap area is not active.
   c. offset is greater than swap_info[type].max.


   d. swap_info[type].swap_map[offset] is null, indicating that the page slot is free.

   If any of these cases occurs, returns 0 (invalid identifier).
3. If the previous tests were passed, the swapped-out page identifier locates a valid page. Therefore, increments swap_info[type].swap_map[offset]; however, if the counter is equal to SWAP_MAP_MAX or to SWAP_MAP_BAD, leaves it unchanged.
4. Returns 1 (valid identifier).

16.2.3 Activating and Deactivating a Swap Area

Once a swap area is initialized, the superuser (or, more precisely, any user having the CAP_SYS_ADMIN capability, as described in Section 19.1.1 in Chapter 19) may use the /bin/swapon and /bin/swapoff programs to, respectively, activate and deactivate the swap area. These programs make use of the swapon( ) and swapoff( ) system calls; we'll briefly sketch out the corresponding service routines.

The sys_swapon( ) service routine receives as parameters:

specialfile
    This parameter contains the pathname of the device file (partition) or plain file used to implement the swap area.

swap_flags
    If the SWAP_FLAG_PREFER bit is on, the 15 least-significant bits specify the priority of the swap area.

The function checks the fields of the swap_header union that was put in the first slot when the swap area was created. The main steps performed by the function are:

1. Checks that the current process has the CAP_SYS_ADMIN capability.
2. Searches for the first descriptor in swap_info having the SWP_USED flag cleared, meaning that the corresponding swap area is inactive.
   If there is none, returns an error code, because there are already MAX_SWAPFILES active swap areas.
3. A descriptor for the swap area has been found; sets its SWP_USED flag. Moreover, if the descriptor's index is greater than nr_swapfiles, updates that variable.
4. Sets the prio field of the descriptor. If the swap_flags parameter does not specify a priority, initializes the field with the lowest priority among all active swap areas minus 1 (thus assuming that the last activated swap area is on the slowest block device). If no other swap areas are already active, assigns the value -1.
5. Initializes the swap_file field of the descriptor to the address of the dentry object associated with specialfile, as returned by the namei( ) function (see Section 12.4 in Chapter 12).
6. If the specialfile parameter identifies a block device file, stores the device number into the swap_device field of the descriptor, sets the block size in the proper entry of the blksize_size array to PAGE_SIZE, and invokes blkdev_open( ). The latter function sets up the f_ops field of a new file object associated with the device file and initializes the hardware device (see Section 13.2.2 in Chapter 13). Moreover, checks that the swap area was not already activated by looking at the swap_device field of the other descriptors in swap_info. If it has been activated, returns an error code.
7. If, on the other hand, the specialfile parameter identifies a regular file, checks that the swap area was not already activated by looking at the swap_file->d_inode field of the other descriptors in swap_info. If it was activated, returns an error code.
8. Allocates a page frame and invokes rw_swap_page_nocache( ) (see Section 16.4 later in this chapter) to read the first page of the swap area, which stores the swap_header union.
9. Checks that the magic string in the last 10 characters of the first page is equal to SWAP-SPACE or to SWAPSPACE2 (there are two slightly different versions of the swapping algorithm). If not, the specialfile parameter does not specify an already initialized swap area, so returns an error code.
10. Initializes the lowest_bit, highest_bit, and max fields of the swap area descriptor according to the size of the swap area stored in the info.last_page field of the swap_header union.
11. Invokes vmalloc( ) to create the array of counters associated with the new swap area and store its address in the swap_map field of the swap descriptor. Initializes the elements of the array to 0 or to SWAP_MAP_BAD, according to the list of defective page slots stored in the info.bad_pages field of the swap_header union.
12. Invokes vmalloc( ) again to create the array of locks and store its address in the swap_lockmap field of the swap descriptor; initializes all bits to 0.
13. Computes the number of useful page slots by accessing the info.last_page and info.nr_badpages fields in the first page slot.
14. Sets the flags field of the swap descriptor to SWP_WRITEOK, sets the pages field to the number of useful page slots, and updates the nr_swap_pages variable.
15. Inserts the new swap area descriptor in the list to which the swap_list variable points.
16. Releases the page frame containing the data of the first page of the swap area and returns 0 (success).

The sys_swapoff( ) service routine deactivates a swap area identified by the parameter specialfile. It is much more complex and time-consuming than sys_swapon( ), since the partition to be deactivated might still contain pages belonging to several processes. The function is thus forced to scan the swap area and to swap in all existing pages. Since each swap-in requires a new page frame, it might fail if there are no free page frames left. In this case, the function returns an error code. All this is achieved by performing the following steps:

1. Invokes namei( ) on specialfile to get a pointer to the dentry object associated with the swap area.
2. Scans the list to which swap_list points and locates the descriptor whose swap_file field points to the dentry object associated with specialfile. If no such descriptor exists, an invalid parameter was passed to the function; if the descriptor exists but its SWP_WRITE flag is cleared while its SWP_USED flag is set, the swap area is in the middle of being deactivated. In either case, returns an error code.
3. Removes the descriptor from the list and sets its flags field to SWP_USED so the kernel doesn't store more pages in the swap area before this function deactivates it.


4. Invokes the try_to_unuse( ) function to successively force all pages left in the swap area into RAM and to correspondingly update the page tables of the processes that make use of these pages. For each page slot, the function performs the following substeps:
   a. If the counter associated with the page slot is equal to 0 (no page is stored there) or to SWAP_MAP_BAD, does nothing (continues with the next page slot).
   b. Otherwise, invokes the read_swap_cache( ) function (see Section 16.4 later in this chapter) to allocate, if necessary, a new page frame and fill it with the data stored in the page slot.
   c. Invokes unuse_process( ) on each process in the process list. This time-consuming function scans all Page Table entries of the process and replaces any occurrence of the swapped-out page identifier with the physical address of the page frame. To reflect this move, the function also decrements the page slot counter in the swap_map array and increments the usage counter of the page frame.
   d. Invokes shm_unuse( ) to check whether the swapped-out page is used for an IPC shared memory resource and to properly handle that case (see Section 18.3.5 in Chapter 18).
   e. Removes, if necessary, the page frame from the swap cache (see Section 16.3 later in this chapter).
5. If try_to_unuse( ) fails in allocating all requested page frames, the swap area cannot be deactivated. Reinserts the swap area descriptor in the swap_list list, sets its flags field to SWP_WRITEOK again, and returns an error code.
6. Otherwise, all used page slots have been successfully transferred to RAM. Finishes by releasing the dentry object and the inode object associated with the swap area, releasing the memory areas used to store the swap_map and swap_lockmap arrays, updating the nr_swap_pages variable, and finally returning 0 (success).

16.2.4 Finding a Free Page Slot

As we shall see later, when freeing memory, the kernel swaps out many pages in a short period of time. It is thus important to try to store these pages in contiguous slots so as to minimize disk seek time when accessing the swap area.

A first approach to an algorithm that searches for a free slot could choose one of two simplistic, rather extreme strategies:

• Always start from the beginning of the swap area. This approach may increase the average seek time during swap-out operations, because free page slots may be scattered far away from one another.
• Always start from the last allocated page slot. This approach increases the average seek time during swap-in operations if the swap area is mostly free (as is usually the case): the few occupied page slots may be scattered far away from one another.

Linux adopts a hybrid approach. It always starts from the last allocated page slot unless one of these conditions occurs:

• The end of the swap area is reached
• SWAPFILE_CLUSTER (usually 256) free page slots have been allocated after the last restart from the beginning of the swap area
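The hybrid strategy can be sketched in a few lines of C. This is a simplified illustration, not the kernel's scan_swap_map( ): the descriptor is reduced to the fields the search needs (named after the cluster_nr, cluster_next, lowest_bit, and highest_bit fields discussed in this section), cluster_nr is treated as a countdown of allocations left before the next restart, and global bookkeeping such as nr_swap_pages is left out. The names struct swap_area and scan_swap_map_sketch( ) are hypothetical.

```c
#include <assert.h>

#define SWAPFILE_CLUSTER 256

struct swap_area {                 /* hypothetical, trimmed-down descriptor */
    unsigned short *swap_map;      /* one usage counter per page slot */
    unsigned int lowest_bit;       /* no free slot below this index */
    unsigned int highest_bit;      /* no free slot above this index */
    unsigned int cluster_next;     /* first slot to examine next time */
    unsigned int cluster_nr;       /* allocations left before a restart */
};

/* Returns the index of a newly allocated slot, or 0 if the area is full. */
static unsigned int scan_swap_map_sketch(struct swap_area *si)
{
    unsigned int offset;

    /* First stage: continue from the last allocated page slot. */
    if (si->cluster_nr > 0) {
        for (offset = si->cluster_next; offset <= si->highest_bit; offset++) {
            if (si->swap_map[offset] == 0) {
                si->cluster_nr--;
                goto got_slot;
            }
        }
    }

    /* Second stage: the end of the area was reached or the cluster budget
     * is used up, so restart from the lowest slot that may be free. */
    si->cluster_nr = SWAPFILE_CLUSTER;
    for (offset = si->lowest_bit; offset <= si->highest_bit; offset++)
        if (si->swap_map[offset] == 0)
            goto got_slot;
    return 0;                      /* no free slot in this area */

got_slot:
    si->swap_map[offset] = 1;      /* the slot now has one user */
    if (offset == si->lowest_bit)
        si->lowest_bit++;
    if (offset == si->highest_bit)
        si->highest_bit--;
    si->cluster_next = offset;     /* the next search starts here */
    return offset;
}
```

With a small area, successive calls hand out increasing slot indexes even if a lower slot is freed in the meantime; the freed slot is reused only after the search runs off the end of the area and restarts from lowest_bit.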


The cluster_nr field in the swap_info_struct descriptor stores the number of free page slots allocated. This field is reset to 0 when the function restarts allocation from the beginning of the swap area. The cluster_next field stores the index of the first page slot to be examined in the next allocation. [1]

[1] As you may have noticed, the names of Linux data structures are not always appropriate. In this case, the kernel does not really "cluster" page slots of a swap area.

In order to speed up the search for free page slots, the kernel keeps the lowest_bit and highest_bit fields of each swap area descriptor up-to-date. These fields specify the first and the last page slots that could be free; in other words, any page slot below lowest_bit and above highest_bit is known to be occupied.

The scan_swap_map( ) function is used to find a free page slot. It acts on a single parameter, which points to a swap area descriptor, and returns the index of a free page slot. It returns 0 if the swap area does not contain any free slots. The function performs the following steps:

1. If the cluster_nr field of the swap area descriptor is positive, scans the swap_map array of counters starting from the element at index cluster_next and looks for a null entry. If a null entry is found, decrements the cluster_nr field and goes to step 3.
2. If this point is reached, either the cluster_nr field is null or the search starting from cluster_next didn't find a null entry in the swap_map array.
   It is time to try the second stage of the hybrid search. Reinitializes cluster_nr to SWAPFILE_CLUSTER and restarts scanning of the array from the lowest_bit index. If no null entry is found, returns 0 (the swap area is full).
3. A null entry has been found. Puts the value 1 in the entry, decrements nr_swap_pages, updates if necessary the lowest_bit and highest_bit fields, and sets the cluster_next field to the index of the page slot just allocated.
4. Returns the index of the allocated page slot.

16.2.5 Allocating and Releasing a Page Slot

The get_swap_page( ) function returns the index of a newly allocated page slot or 0 if all swap areas are filled. The function takes into consideration the different priorities of the active swap areas.

Two passes are necessary. The first pass is partial and applies only to areas having the same priority; the function searches such areas in a round-robin fashion for a free slot. If no free page slot is found, a second pass is made starting from the beginning of the swap area list; in this second pass all swap areas are examined. More precisely, the function performs the following steps:

1. If nr_swap_pages is null, returns 0.
2. Starts by considering the swap area pointed to by swap_list.next (recall that the swap area list is sorted by decreasing priorities).
3. If the swap area is active and not being deactivated, invokes scan_swap_map( ) to allocate a free page slot. If scan_swap_map( ) returns a page slot index, the function's job is essentially done, but it must prepare for its next invocation. Thus, it updates swap_list.next to point to the next swap area in the swap area list, if the latter has the same priority (thus continuing the round-robin use of these swap areas). If the next swap area does not have the same priority as the current one, the function sets swap_list.next to the first swap area in the list (so that the next search will start with the swap areas having the highest priority). The function finishes by returning the identifier corresponding to the page slot just allocated.
4. Either the swap area is not writable, or it does not have free page slots. If the next swap area in the swap area list has the same priority as the current one, makes it the current one and goes to step 3.
5. At this point, the next swap area in the swap area list has a lower priority than the previous one. The next step depends on which of the two passes the function is performing.
   a. If this is the first (partial) pass, considers the first swap area in the list and goes to step 3, thus starting the second pass.
   b. Otherwise, checks if there is a next element in the list; if so, considers it and goes to step 3.
6. At this point the list has been completely scanned by the second pass, and no free page slot has been found: returns 0.

The swap_free( ) function is invoked when swapping in a page to decrement the corresponding swap_map counter (see Table 16-1). When the counter reaches 0, the page slot becomes free since its identifier is no longer included in any Page Table entry. The function acts on a single entry parameter that specifies a swapped-out page identifier and performs the following steps:

1. Uses the SWP_TYPE and SWP_OFFSET macros to derive from entry the swap area index and the page slot index.
2. Checks whether the swap area is active; returns right away if it is not.
3. If the priority of the swap area is greater than that of the swap area to which swap_list.next points, sets swap_list.next to swap_list.head, so that the next search for a free page slot starts from the highest priority swap area. In this way, the page slot being released will be reallocated before any other page slot is allocated from lower-priority swap areas.
4. If the swap_map counter corresponding to the page slot being freed is smaller than SWAP_MAP_MAX, decrements it. Recall that entries having the SWAP_MAP_MAX value are considered persistent (undeletable).
5. If the swap_map counter becomes 0, increments the value of nr_swap_pages and updates, if necessary, the lowest_bit and highest_bit fields of the swap area descriptor.

16.3 The Swap Cache

In Linux, a page frame may be shared among several processes in the following cases:

• The page frame is associated with a shared [2] memory mapping (see Section 15.2 in Chapter 15).

  [2] Page frames for private memory mappings are handled through the Copy On Write mechanism, and thus fall under the next case.

• The page frame is handled by means of Copy On Write, perhaps because a new process has been forked (see Section 7.4.4 in Chapter 7).


• The page frame is allocated to an IPC shared memory resource (see Section 18.3.5 in Chapter 18).

As we shall see later in this chapter, page frames used for shared memory mappings are never swapped out. Instead, they are handled by another kernel function that writes their data to the proper files and discards them. However, the other two kinds of shared page frames must be carefully handled by the swapping algorithm.

Because the kernel handles each process separately, a page shared by two processes, A and B, may have been swapped out from the address space of A while it is still in B's address space. To handle this peculiar situation, Linux makes use of a swap cache, which collects all shared page frames that have been copied to swap areas. The swap cache does not exist as a data structure on its own, but the pages in the regular page cache are considered to be in the swap cache if certain fields are set.

The reader might ask at this point why the algorithm does not just swap a shared page out from all the processes' address spaces at the same time, thus avoiding the need for a swap cache. The answer is that there is no quick way to derive from the page frame the list of processes that own it. Scanning all page table entries of all processes looking for an entry with a given physical address would be too costly.

So shared page swapping works like this. Consider a page P shared between two processes, A and B. Suppose that the swapping algorithm scans the page frames of process A and selects P for swapping out: it allocates a new page slot and copies the data stored in P into the new page slot. It then puts the swapped-out page identifier in the corresponding page table entry of process A. Finally, it invokes __free_page( ) to release the page frame. However, the page's usage counter does not become 0 since P is still owned by B. Thus, the swapping algorithm succeeds in transferring the page into the swap area, but it fails to reclaim the corresponding page frame.

Suppose now that the swapping algorithm scans the page frames of process B at a later time and selects P for swapping out. The kernel must recognize that P has already been transferred into a swap area so that the page won't be swapped out a second time. Moreover, it must be able to derive the swapped-out page identifier so that it can increase the page slot usage counter.

Figure 16-4 illustrates schematically the actions performed by the kernel on a shared page that is swapped out from multiple processes at different times. The numbers inside the swap area and inside P represent the page slot usage counter and the page usage counter, respectively. Notice that any usage count includes every process that is using the page or page slot, plus the swap cache if the page is included in it. Four stages are shown:

1. In (a) P is present in the Page Tables of both A and B.
2. In (b) P has been swapped out from A's address space.
3. In (c) P has been swapped out from both the address spaces of A and B but is still included in the swap cache.
4. Finally, in (d) P has been released to the Buddy system.
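The counting rule just stated (each counter includes every process using the page frame or the page slot, plus the swap cache while the page is in it) can be traced through the four stages with a toy model. The sketch below is an illustration under that rule only; it does not reproduce kernel code, and struct shared_page_state, swap_out_from_process( ), and release_from_swap_cache( ) are hypothetical names.

```c
#include <assert.h>

struct shared_page_state {   /* hypothetical bookkeeping for one page P */
    int page_count;          /* usage counter of the page frame */
    int slot_count;          /* usage counter of P's page slot */
    int in_swap_cache;       /* nonzero while P is in the swap cache */
};

/* Swap P out of the address space of one process. */
static void swap_out_from_process(struct shared_page_state *p)
{
    if (!p->in_swap_cache) {
        /* First swap-out: a slot is allocated for this process and P
         * enters the swap cache, which uses both the frame and the slot. */
        p->slot_count = 1;
        p->in_swap_cache = 1;
        p->page_count++;
        p->slot_count++;
    } else {
        /* P is already stored on disk: only one more slot user. */
        p->slot_count++;
    }
    p->page_count--;         /* this process no longer maps the frame */
}

/* Drop P from the swap cache once no process maps the frame. */
static void release_from_swap_cache(struct shared_page_state *p)
{
    assert(p->in_swap_cache && p->page_count == 1);
    p->in_swap_cache = 0;
    p->page_count--;         /* the frame can go back to the Buddy system */
    p->slot_count--;         /* the swap cache no longer uses the slot */
}
```

Starting from a page count of 2 and a slot count of 0 in stage (a), the model passes through (2, 2) in stage (b) and (1, 3) in stage (c), and ends with (0, 2) in stage (d): the page frame is free, while the slot is still referenced by the Page Table entries of A and B.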


Figure 16-4. The role of the swap cache

The swap cache is implemented by the page cache data structures and procedures described in Section 14.2 in Chapter 14. Recall that the page cache includes pages associated with regular files and that a hash table allows the algorithm to quickly derive the address of a page descriptor from the address of an inode object and an offset inside the file. Pages in the swap cache are stored as any other page in the page cache, with the following special treatment:

• The inode field of the page descriptor stores the address of a fictitious inode object contained in the swapper_inode variable.
• The offset field stores the swapped-out page identifier associated with the page.
• The PG_swap_cache flag in the flags field is set.

Moreover, when the page is put in the swap cache, both the count field of the page descriptor and the page slot usage counters are incremented, since the swap cache makes use of both the page frame and the page slot.

The kernel makes use of several functions to handle the swap cache; they are based mainly on those discussed in Section 14.2 in Chapter 14. We'll show later how these relatively low-level functions are invoked by higher-level functions to swap pages in and out as needed.

The functions that handle the swap cache are:


in_swap_cache( )
    Checks the PG_swap_cache flag of a page to determine whether it belongs to the swap cache; if so, it returns the swapped-out page identifier stored in the offset field.

lookup_swap_cache( )
    Acts on a swapped-out page identifier passed as its parameter and returns the page address or 0 if the page is not present in the cache. It invokes find_page( ), passing as parameters the address of the fictitious swapper_inode inode object and the swapped-out page identifier to find the required page. If the page is present in the swap cache, lookup_swap_cache( ) checks whether it is locked; if so, the function invokes wait_on_page( ) to suspend the current process until the page becomes unlocked.

add_to_swap_cache( )
    Inserts a page into the swap cache. The inode and offset fields of the page descriptor are set to the address of the fictitious swapper_inode inode object and to the swapped-out page identifier, respectively. Then the function invokes the add_page_to_hash_queue( ) and add_page_to_inode_queue( ) functions. The page usage counter is also incremented.

is_page_shared( )
    Returns 1 (true) if a page is shared among several processes, 0 otherwise. While this function is not properly specific to the swapping algorithm, it must take account of the value of the page usage counter, the presence or absence of the page in the swap cache, and the usage counter of the corresponding page slot if any. Moreover, the function considers whether the page frame is currently involved in a page I/O operation related to swapping, since in this case the page usage counter is incremented as a fail-safe mechanism (see Section 15.1.1 in Chapter 15).

delete_from_swap_cache( )
    Removes a page from the swap cache by clearing the PG_swap_cache flag and by invoking remove_inode_page( ); moreover, it invokes swap_free( ) to release the corresponding page slot.

free_page_and_swap_cache( )
    Releases a page by invoking __free_page( ). It also checks whether the page is in the swap cache (PG_swap_cache flag set) and owned by just one process (is_page_shared( ) returns 0); in this case, it invokes delete_from_swap_cache( ) to remove the page from the swap cache and to free the corresponding page slot.
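How these helpers fit together can be shown with a toy model. Everything in the sketch is an assumption made for illustration: the page descriptor is reduced to three fields, the bit value of PG_SWAP_CACHE is arbitrary, the page slot counter is passed in explicitly, and the sharing test simply subtracts the two references held by the swap cache (one on the page frame, one on the page slot) from the sum of the two counters. The extra reference taken during swap I/O is ignored, and the function names carry a _sketch suffix to mark them as hypothetical.

```c
#include <assert.h>

#define PG_SWAP_CACHE 0x1            /* arbitrary bit value for this sketch */

struct page_sketch {                 /* hypothetical, reduced page descriptor */
    unsigned int flags;
    unsigned long offset;            /* swapped-out page identifier, if cached */
    int count;                       /* page usage counter */
};

/* Returns the swapped-out page identifier, or 0 if not in the swap cache. */
static unsigned long in_swap_cache_sketch(const struct page_sketch *page)
{
    return (page->flags & PG_SWAP_CACHE) ? page->offset : 0;
}

/* Assumed counting: users of the frame plus users of the slot, minus the
 * two references that belong to the swap cache itself. */
static int is_page_shared_sketch(const struct page_sketch *page,
                                 unsigned short slot_count)
{
    int users = page->count;

    if (page->flags & PG_SWAP_CACHE)
        users += slot_count - 2;
    return users > 1;
}

static void free_page_and_swap_cache_sketch(struct page_sketch *page,
                                            unsigned short *slot_count)
{
    if (in_swap_cache_sketch(page) &&
        !is_page_shared_sketch(page, *slot_count)) {
        /* Stand-in for delete_from_swap_cache( ): drop the references
         * held by the cache on the frame and on the slot. */
        page->flags &= ~PG_SWAP_CACHE;
        page->count--;
        (*slot_count)--;
    }
    page->count--;                   /* stand-in for __free_page( ) */
}
```

For a page used by a single process and also held in the swap cache (page counter 2, slot counter 1), the call releases both the frame and the slot; if a second process still maps the frame, only the caller's reference is dropped and the page stays in the swap cache.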


16.4 Transferring Swap Pages

Transferring swap pages wouldn't be so complicated if there weren't so many race conditions and other potential hazards to guard against. Here are some of the things that have to be checked regularly:

• There may be too many asynchronous swap operations going on; when this occurs, only synchronous I/O operations are started (see Section 16.4.2 later in this chapter).
• The process that owns a page may terminate while the page is being swapped in or out.
• Another process may be in the middle of swapping in a page that the current one is trying to swap out or vice versa.

We'll follow a bottom-up approach in the following sections. First we describe the synchronization mechanisms that avoid data corruption caused by simultaneous I/O operations on the same page frame or on the same page slot. Then we illustrate a few functions used to perform the data transfer of a swapped page.

16.4.1 Locking the Page Frame and the Page Slot

Like any other disk access type, I/O data transfers for swap pages are blocking operations. Therefore, the kernel must take care to avoid simultaneous transfers involving the same page frame, the same page slot, or both.

Race conditions can be avoided on the page frame through the mechanisms discussed in Chapter 13. Specifically, before starting an I/O operation on the page frame, the kernel invokes the wait_on_page( ) function to wait until the PG_locked flag is off. When the function returns, the page frame lock has been acquired, and therefore no other kernel control path can access the page frame's contents during the I/O operation.

But the state of the page slot must also be tracked.
The PG_locked flag of the page descriptor is used once again to ensure exclusive access to the page slot involved in the I/O data transfer. Before starting an I/O operation on a swap page, the kernel usually checks that the page frame involved is included in the swap cache; if not, it adds the page frame into the swap cache.

Let's suppose some process tries to swap in some page while the same page is currently being transferred. Before doing any work related to the swap-in, the kernel looks in the swap cache for a page frame associated with the given swapped-out page identifier. Since the page frame is found, the kernel knows that it must not allocate a new page frame, but must simply use the cached page frame. Moreover, since the PG_locked flag is set, the kernel suspends the kernel control path until the bit becomes 0, so that both the page frame's contents and the page slot in the swap area are preserved until the I/O operation terminates.

In a few specific cases, the PG_locked flag and the swap cache are not sufficient to avoid race conditions. Let us suppose, for instance, that the kernel begins a swap-out operation on some page; therefore, it increments the page usage counter, it allocates a page slot, and then it starts the I/O data transfer. Suppose further that during this operation, the process that owns the page dies. The kernel reclaims the process's memory; that is, it releases all page frames and all page slots used by the process.
Since the page usage counter was incremented before starting the I/O operation, the page frame involved in the swap I/O operation is not released to the Buddy system; however, the page slot usage counter in the swap area descriptor could become 0, thus it could be used for swapping out another page before the previous I/O operation completes. If this happens, a kernel control path could start a write operation in the page slot while another kernel control path is performing another write operation on the same page slot, that is, on the same physical disk portion. This would lead to unpredictable (and unpleasant) results.

In order to avoid this kind of problem, the kernel uses an array of bits, whose address is stored into the swap_lockmap field of the swap area descriptor. Each bit in the array is a lock for a page slot in the swap area. The bit is set before activating an I/O data transfer on the page slot and is cleared when the operation terminates. The scan_swap_map( ) function described in Section 16.2.4 does not consider a page slot as "free" if its lock bit is on, even if its usage counter in the swap_map array is null. The lock_queue wait queue is used to suspend a process until a bit in the swap_lockmap array is cleared.

16.4.2 The rw_swap_page( ) Function

As illustrated in Figure 16-1, the rw_swap_page( ) function is used to swap in or swap out a page. It receives the following parameters:

buffer
    The initial address of the page frame containing the page to be swapped in or swapped out.

entry
    A swapped-out page identifier. This parameter is somewhat redundant since the same information can be derived from the descriptor of the buffer page frame.

rw
    A flag specifying the direction of data transfer: READ for swapping in, WRITE for swapping out.

wait
    A flag specifying whether the kernel control path must block until the I/O operation completes.

The swap-in operation triggered by a page fault is usually synchronous (wait equal to 1), since the process should be suspended until the requested page has been transferred from disk. Conversely, swap-out operations are usually asynchronous (wait equal to 0), since there is no need to suspend the current process until they are completed. However, the kernel defines a limit on the number of asynchronous swap operations concurrently being carried on, in order to avoid flooding the block device driver's request queue. The limit is stored in the swap_cluster field of the pager_daemon variable (see Section 16.7.6 later in this chapter). If the limit is reached, rw_swap_page( ) ignores the value of the wait parameter and proceeds as if it were equal to 1.

In order to swap the page, the function performs the following operations:


1. Computes the address of the page descriptor corresponding to the page frame.
2. Gets the swap area index and the page slot index from entry.
3. Tests and sets the lock bit in the swap_lockmap array corresponding to the page slot. If the bit was already set, the page is in the middle of being swapped in or out, so we want to wait for that operation to complete. Therefore, executes the functions in the tq_disk task queue, thus unplugging any block device driver that is waiting, and sleeps on the lock_queue wait queue until the ongoing swap operation terminates.
4. Sets the PG_swap_unlock_after flag of the page in order to ensure that the flag in swap_lockmap will be cleared at the end of the swap operation that the function will start in step 10. (The effect of the flag is discussed later in this section.)
5. If the data transfer is for a swap-in operation (rw set to READ), clears the PG_uptodate flag of the page frame. The flag will be set again only if the swap-in operation terminates successfully.
6. Increments the page usage counter, so that the page frame is not released to the Buddy system even if the owning process dies (the fail-safe mechanism discussed in the previous section). Also sets the PG_free_after flag, thus ensuring that the usage counter of the page descriptor is decremented when the page I/O operation terminates (see Section 13.6 in Chapter 13). This flag represents a belt-and-suspenders kind of caution because it might be possible to invoke the brw_page( ) function without first incrementing the usage counter. Actually, the kernel never does this.
7. If the swap area is a disk partition, locates the block associated with the page slot (swap partitions have 4096-byte blocks). Otherwise, if the swap area is a regular file, invokes the bmap method of the corresponding inode object (or equivalently the smap method for sector-based filesystems like MS-DOS) to derive the logical numbers of the blocks associated with the page slot.
8. If nr_async_pages is greater than pager_daemon.swap_cluster, forces the wait parameter to 1 (too many asynchronous swap operations are being carried on).
9. If wait is null, increments nr_async_pages. Sets the PG_decr_after flag of the page in order to ensure that the variable will be decremented again when the swap operation to be started on the next step terminates. (Like PG_swap_unlock_after, the PG_decr_after flag will be discussed shortly.)
10. Invokes the brw_page( ) function to start the actual I/O operation. As described in Section 13.6.1 in Chapter 13, this function returns before the data transfer is completed.
11. If wait is null, returns without waiting for the completion of the data transfer.
12. Otherwise (wait equal to 1), invokes wait_on_page( ) to suspend the current process until the page frame becomes unlocked, that is, until the I/O data transfer terminates.

Notice that the rw_swap_page( ) function relies on the brw_page( ) function to perform the data transfer. As described in Section 13.6.2 in Chapter 13, whenever the block device driver terminates the data transfer of a block in the page slot, the b_end_io method taken from the corresponding asynchronous buffer head is invoked.
This method is implemented by the end_buffer_io_async( ) function, which in turn invokes after_unlock_page( ) if all blocks in the page have been transferred. The latter function performs the following operations:

1. If the PG_decr_after flag of the page is on, clears it and decrements the nr_async_pages variable. As we've seen, this variable helps to put an upper limit on the number of current asynchronous page swaps.
2. If the PG_swap_unlock_after flag of the page is on, clears it and invokes swap_after_unlock_page( ). This function clears the lock bit in the swap_lockmap array associated with the page slot and wakes up the processes sleeping in the lock_queue wait queue. Thus, processes waiting for the end of the I/O operation on the page slot can start again.
3. If the PG_free_after flag of the page is on, clears it and invokes __free_page( ) to release the page frame, thus compensating for the increment of the page usage counter performed by rw_swap_page( ) as a fail-safe mechanism.

16.4.3 The read_swap_cache_async( ) Function

As shown in Figure 16-1, the read_swap_cache_async( ) function is invoked to swap in a page either from the swap cache or from disk. The function receives the following parameters:

entry
    A swapped-out page identifier

wait
    A flag specifying whether the kernel is allowed to suspend the current process until the swap's I/O operation completes

Despite the function's name, the wait parameter determines whether the I/O swap operation must be synchronous or asynchronous. The read_swap_cache macro is often used to invoke read_swap_cache_async( ) passing the value 1 to the wait parameter.

The function performs the following operations:

1. Invokes swap_duplicate( ) on entry to check whether the page slot is valid and to increment the page slot usage counter. (The increment is a fail-safe mechanism applied to the page slot during the swap cache lookup, to avoid problems in case the process that caused the page fault dies before terminating the swap-in.)
2. Invokes lookup_swap_cache( ) to search for the page in the swap cache. If the page is found, invokes swap_free( ) on entry to decrement the page slot usage counter and ends by returning the page's address.
3. The page is not included in the swap cache. Invokes __get_free_page( ) to allocate a new page frame. Then invokes lookup_swap_cache( ) again, because the process may have been suspended while waiting for the new page frame, and some other kernel control path could have created the page in the interim. As in the previous step, if the requested page is found, decreases the page slot usage counter, releases the page frame just allocated, and returns the address of the requested page.
4. Invokes add_to_swap_cache( ) to initialize the inode and offset fields of the descriptor of the new page frame and to insert it into the swap cache.
5. Sets the PG_locked flag of the page frame. Since the page frame is new, no other kernel control path could access it, therefore no check on the previous flag value is necessary.
6. Invokes rw_swap_page( ) to read the page's contents from the swap area, passing to that function the READ parameter, the swapped-out page identifier, the page frame address, and the wait parameter. As a result, the required page is copied in the page frame.
7. Returns the page's address.
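The lookup, allocate, and look-up-again pattern in the steps above can be sketched in a self-contained toy model. This is not the kernel's code: the swap cache is reduced to an array indexed by slot, every toy_ name is a hypothetical stand-in, the read is always synchronous, and keeping the extra slot reference on behalf of the swap cache on the slow path is an assumption of the model.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_SLOTS 8

struct toy_page {
    unsigned long entry;  /* swapped-out page identifier */
    int locked;           /* models PG_locked */
    int data;             /* stands in for the page contents */
};

static struct toy_page *toy_cache[TOY_SLOTS];  /* swap cache, indexed by slot */
static int toy_slot_count[TOY_SLOTS];          /* page slot usage counters */
static int toy_disk[TOY_SLOTS];                /* fake swap area contents */
static long toy_race_entry = -1;               /* slot filled by a simulated racing path */

/* Allocate a frame. If toy_race_entry is set, pretend we slept and another
 * kernel control path inserted that page into the swap cache meanwhile. */
static struct toy_page *toy_alloc_page(void)
{
    struct toy_page *p = calloc(1, sizeof *p);
    if (toy_race_entry >= 0) {
        struct toy_page *other = calloc(1, sizeof *other);
        if (other) {
            other->entry = (unsigned long)toy_race_entry;
            other->data = toy_disk[toy_race_entry];
            toy_cache[toy_race_entry] = other;
            toy_slot_count[toy_race_entry]++;  /* the racing path's own reference */
        }
        toy_race_entry = -1;
    }
    return p;
}

struct toy_page *toy_read_swap_cache(unsigned long entry)
{
    struct toy_page *found, *fresh;

    if (entry >= TOY_SLOTS || toy_slot_count[entry] == 0)
        return NULL;                       /* step 1: slot is not valid */
    toy_slot_count[entry]++;               /* step 1: fail-safe reference */

    found = toy_cache[entry];              /* step 2: first lookup */
    if (found) {
        toy_slot_count[entry]--;
        return found;
    }

    fresh = toy_alloc_page();              /* step 3: may "sleep" */
    if (!fresh) {
        toy_slot_count[entry]--;
        return NULL;
    }
    found = toy_cache[entry];              /* step 3: look up again */
    if (found) {
        toy_slot_count[entry]--;
        free(fresh);                       /* release the frame just allocated */
        return found;
    }

    fresh->entry = entry;                  /* step 4: insert into the cache */
    toy_cache[entry] = fresh;
    fresh->locked = 1;                     /* step 5: new frame, no test needed */
    fresh->data = toy_disk[entry];         /* step 6: synchronous read */
    fresh->locked = 0;
    return fresh;                          /* step 7 */
}
```

The second lookup is what keeps two frames from being created for the same page slot when the allocator blocks.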


16.4.4 The rw_swap_page_nocache( ) Function

In a few cases, the kernel wants to read a page from a swap area without putting it in the swap cache. This happens, for instance, when servicing the swapon( ) system call: the kernel reads the first page of a swap area, which contains the swap_header union, then immediately discards the page frame. Swapped-out pages of IPC shared memory regions are also never included in the swap cache (see Section 18.3.5 in Chapter 18).

The rw_swap_page_nocache( ) function receives as parameters the type of I/O operation (READ or WRITE), a swapped-out page identifier, and the address of a page frame. It performs the following operations:

1. Invokes wait_on_page( ) and then sets the PG_locked flag of the page frame.
2. Initializes the inode field of the page descriptor with the address of the swapper_inode inode object, sets the offset field to the swapped-out page identifier, and sets the PG_swap_cache flag. Notice, however, that the function does not insert the page frame in the swap cache data structures: the PG_swap_cache flag and the inode and offset fields of the page descriptor are initialized just to satisfy the consistency checks of the rw_swap_page( ) function.
3. Increments the page usage counter (fail-safe mechanism).
4. Invokes rw_swap_page( ) to start the I/O swap operation.
5. Decrements the page usage counter, clears the PG_swap_cache flag, and sets the inode field of the page descriptor to 0.

16.5 Page Swap-Out

Section 16.7 explains when pages are swapped out. As we indicated at the beginning of the chapter, swapping out pages is a last resort and appears as part of a general strategy to free memory that uses other tactics as well. In this section, we show how the kernel performs swap-out. This is achieved by the swap_out( ) function, which acts on the following parameters.

priority
    An integer value ranging from 0 to 6 that specifies how much time the kernel should spend trying to locate a page to be swapped; lower values correspond to longer search times. We shall describe how this parameter is set in Section 16.7.5 later in this chapter.

gfp_mask
    If the function has been invoked as a consequence of a memory allocation request, this parameter is a copy of the gfp_mask parameter passed to the allocator function (see Section 6.1.1 in Chapter 6). The parameter tells the kernel how to treat the page, notably how urgent the request is and whether the kernel control path can be suspended.

The swap_out( ) function scans existing processes and tries to swap out the pages referenced in each process's page tables. It terminates as soon as one of the following conditions occurs:


• The function succeeds in swapping out a page.
• The function performs some blocking operation. It doesn't bother resuming activity because the process being examined could have been destroyed while the current process was sleeping, and thus no further page scanning would be needed.
• The function failed to swap out a page after scanning a predefined number of processes. This is because the kernel does not want to spend too much time in swap-out activities. Specifically, each invocation of swap_out( ) considers at most nr_tasks/(priority+1) processes.

How can the kernel select the processes to be penalized? In each invocation, the swap_out( ) function scans the process list and finds the process having the largest value for the swap_cnt field of the process descriptor. If all processes have null swap_cnt fields, the function scans the process list again and sets each swap_cnt field to the number of page frames assigned to the corresponding process (the number can be found in the mm->rss field of the process descriptor). In this way, processes with many page frames are generally more penalized (in the long run) than processes owning fewer page frames.

After having selected a process, swap_out( ) invokes the swap_out_process( ) function, passing it the process descriptor pointer and the gfp_mask parameter (see Figure 16-1). If the latter function returns the value 1, swap_out( ) terminates its execution, since either a page frame has been swapped out or the current process has been suspended. Otherwise, swap_out( ) tries to select another process, until it reaches the maximum number of processes to be examined.

The swap_out_process( ) function scans all the memory regions of a process and invokes the swap_out_vma( ) function on each one. The address of the first memory region scanned by swap_out_process( ) is stored in the swap_address field of the process descriptor: since this field identifies the memory region last scanned in the previous invocation of the function, all memory regions of the process are penalized equally. (This appears to be the best strategy since the kernel has no information on how often each memory region is accessed.) swap_out_process( ) continues to invoke swap_out_vma( ) until that function returns the value 1 or the end of the memory region list is reached. In the latter case, the swap_cnt field in the process descriptor is set to 0, so that the process will not be considered again by swap_out( ) before it examines all the other processes in the system that remain to be considered.

The swap_out_vma( ) function checks that the memory region is not locked, that is, that its VM_LOCKED flag is equal to 0. It then starts a sequence in which it considers all entries in the process's Page Global Directory that refer to linear addresses in the memory region. For each such entry, the function invokes the swap_out_pgd( ) function, which in turn considers all entries in a Page Middle Directory corresponding to address intervals in the memory region. For each such entry, swap_out_pgd( ) invokes the swap_out_pmd( ) function, which considers all entries in a Page Table referencing pages in the memory region. For each such page, swap_out_pmd( ) invokes the try_to_swap_out( ) function, which finally determines whether the page can be swapped out. If try_to_swap_out( ) returns the value 1 (meaning the page frame was freed or the current process is suspended), the chain of nested invocations of the swap_out_vma( ), swap_out_pgd( ), swap_out_pmd( ), and try_to_swap_out( ) functions terminates.
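The victim-selection rule described earlier in this section (pick the largest swap_cnt, reload every swap_cnt from mm->rss when all of them are zero, and examine at most nr_tasks/(priority+1) processes per call) can be sketched as follows. This is a simplified model, not the kernel's code: struct toy_task and both functions are hypothetical names, and the process list is reduced to an array.

```c
#include <stddef.h>

/* Toy process descriptor: only the two fields the policy looks at. */
struct toy_task {
    int rss;       /* page frames currently assigned (models mm->rss) */
    int swap_cnt;  /* remaining "penalty budget" of the process */
};

/* Returns the index of the task with the largest swap_cnt. If every
 * swap_cnt is 0, reloads each one from rss and scans once more.
 * Returns -1 when no task owns any page frame. */
int toy_pick_victim(struct toy_task *tasks, size_t n)
{
    for (int pass = 0; pass < 2; pass++) {
        int best = -1;
        int best_cnt = 0;

        for (size_t i = 0; i < n; i++) {
            if (tasks[i].swap_cnt > best_cnt) {
                best_cnt = tasks[i].swap_cnt;
                best = (int)i;
            }
        }
        if (best >= 0)
            return best;

        /* All counters are zero: start a new round. */
        for (size_t i = 0; i < n; i++)
            tasks[i].swap_cnt = tasks[i].rss;
    }
    return -1;
}

/* Upper bound on the processes examined by one invocation. */
int toy_scan_budget(int nr_tasks, int priority)
{
    return nr_tasks / (priority + 1);
}
```

A caller modelling swap_out( ) would set swap_cnt to 0 for a process whose memory regions were all scanned without success, then call toy_pick_victim( ) again until the budget is used up. With 70 tasks, priority 6 allows 10 processes to be examined and priority 0 allows all 70.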


16.5.1 The try_to_swap_out( ) Function

The try_to_swap_out( ) function attempts to free a given page frame, either discarding or swapping out its contents. The parameters of the function are:

tsk
    Process descriptor pointer

vma
    Memory region descriptor pointer

address
    Initial linear address of the page

page_table
    Address of the Page Table entry of tsk that maps address

gfp_mask
    The gfp_mask parameter of the swap_out( ) function, which is passed along the chain of function invocations described at the end of the previous section

The function returns the value 1 either if it has succeeded in swapping out the page or if it has executed a blocking I/O operation. In this second case, continuing to swap out would be risky since the function might act on a process that no longer actually exists. The function returns 0 if it decided not to swap.

try_to_swap_out( ) must recognize many different situations demanding different responses, but the responses all share many of the same basic operations. In particular, the function performs the following steps:

1. Considers the Page Table entry at address page_table. If the Present bit is null, no page frame is allocated, so there is nothing to swap and the function returns 0.
2. If the Accessed flag of the Page Table entry is set, the page frame is young. In this case, clears the Accessed flag, sets the PG_referenced flag of the page descriptor, invokes flush_tlb_page( ) to invalidate the TLB entry associated with the page, and returns 0. Only pages whose Accessed flag is null can be swapped out. Since this bit is automatically set by the CPU's paging unit at every page access, the page can be swapped out only if it was not accessed after the previous invocation of try_to_swap_out( ) on it. As mentioned previously, the Accessed flag offers a limited degree of hardware support that allows Linux to make use of a primitive LRU replacement algorithm.
3. If the PG_reserved or PG_locked flag of the page descriptor is set, returns 0 (the page cannot be swapped out).
4. If the PG_DMA flag is equal to 0 and the gfp_mask parameter specifies that the freed page frame should be used for an ISA DMA buffer, returns 0.


5. If the page belongs to the swap cache, it is shared with some other process and it has already been stored in a swap area. In this case, the page must be marked as swapped out but no memory transfer is performed. Does the following:
    a. Gets the swapped-out page's identifier from the offset field of the page descriptor
    b. Invokes swap_duplicate( ) to increment the page slot usage counter
    c. Writes the swapped-out page identifier into the Page Table entry
    d. Decrements the mm->rss counter of the process
    e. Invokes flush_tlb_page( ) to invalidate the TLB entry associated with the page
    f. Invokes __free_page( ) to decrement the page usage counter
    g. Returns 0 (no page has been swapped out)
6. If the Dirty bit of the Page Table entry is null, the page is "clean"; there is no need to write it back to disk, since the kernel is always able to restore its contents with the demand paging mechanism. Therefore, performs the following substeps to remove the page from the process's address space:
    a. Sets the Page Table entry to 0
    b. Decrements the mm->rss counter of the process
    c. Invokes flush_tlb_page( ) to invalidate the TLB entry associated with the page
    d. Invokes __free_page( ) to decrement the page usage counter
    e. Returns 0 (no page has been swapped out)
7. The page is dirty and it can be swapped out; however, checks whether the kernel is allowed to perform I/O operations (that is, if the __GFP_IO flag in the gfp_mask parameter is set); if not, returns 0. The __GFP_IO flag is cleared when the kernel control path cannot be suspended (for instance, because it is executing an interrupt handler).
8. Checks whether the vma memory region that contains the page has its own swapout method. If so, performs the following substeps:
    a. Sets the Page Table entry to 0
    b. Decrements the mm->rss counter of the process
    c. Invokes flush_tlb_page( ) to invalidate the TLB entry associated with the page
    d. Invokes the swapout method; if this function returns an error code, sends a SIGBUS signal to the process tsk
    e. Invokes __free_page( ) to decrement the page usage counter
    f. Returns 1 (the swapout method invoked in step 8d could block, so the swap_out( ) function must terminate)
9. The swapout method of the memory region is not defined, thus the page must be explicitly swapped out. (This is the most frequent case.) Performs the following substeps:
    a. Invokes get_swap_page( ) to allocate a new page slot.
    b. Decrements the mm->rss field of the process and increments its nswap field (a counter of swapped-out pages).
    c. Writes the swapped-out page identifier into the Page Table entry.
    d. Invokes flush_tlb_page( ) to invalidate the TLB entry associated with the page.
    e. Invokes swap_duplicate( ) to increase the page slot usage counter; it will now have the value 2, one increment for the process and the other for the swap cache.
    f. Invokes add_to_swap_cache( ) to add the page into the swap cache.
    g. Preparatory to the swapping operation to be started in the next step, sets the PG_locked flag. (We don't have to test the flag, because we did so already in step 3. No other kernel path could have set the flag since then, because the function didn't perform any blocking operation.)
    h. Invokes rw_swap_page( ) to start an asynchronous swapping operation to write the page into the swap area.
    i. Invokes __free_page( ) to decrement the page usage counter.
    j. Returns 1 (a page was swapped out).

16.5.2 Swapping Out Pages from Shared Memory Mappings

As we saw in Section 15.2 in Chapter 15, pages in a shared memory mapping correspond to portions of regular files on disk. For that reason, the kernel does not store them in swap areas but rather updates the corresponding files.

Shared memory mapping regions define their own swapout method; as shown in Table 15-2 in Chapter 15, this method is implemented by the filemap_swapout( ) function, which just invokes the filemap_write_page( ) function to force the page to be written on disk.

In this case, however, the filemap_write_page( ) function does not explicitly invoke the do_write_page( ) function as described in Section 15.2.6 in Chapter 15. The reason is that running the function could induce the following nasty race condition: suppose the kernel gets a critical filesystem lock and then starts swapping out some pages as a consequence of a memory allocation request. The do_write_page( ) function might try to acquire the same lock, thus inducing a deadlock.

In order to avoid this problem, the only part of the kernel allowed to swap out pages belonging to shared memory mappings is a kernel thread named kpiod, which services all I/O requests in a special input queue. Since kpiod is a separate kernel thread from the process executing filemap_write_page( ), no deadlock may occur. Even if kpiod is suspended while trying to get the filesystem lock, the process executing filemap_write_page( ) can proceed and eventually release that lock.

The kpiod kernel thread is woken up whenever a new request is added to its input queue; each request refers to a page of a shared memory region to be written to disk. Since the kernel may attempt to swap out several pages at once (see Section 16.7 later in this chapter), several requests may accumulate in the kpiod input queue. The kernel thread continues to process requests until the queue becomes empty.

Each element in the queue is a descriptor of type pio_request, which includes the fields illustrated in Table 16-2. The pio_first and pio_last variables point to the first and last elements in the queue, respectively. The pio_request descriptors are handled by the pio_request_cache slab allocator cache.

Table 16-2. Fields of a pio_request Descriptor

Type                    Field     Description
struct pio_request *    next      Next element in queue
struct file *           file      File object pointer
unsigned long           offset    File offset
unsigned long           page      Page initial address
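The descriptor of Table 16-2 and a first/last queue of the kind the text describes can be sketched in user-space C. This is a minimal model under assumptions: the toy_ names are hypothetical, malloc( ) stands in for the pio_request_cache slab cache, and the real kernel may represent the tail of the queue differently.

```c
#include <stdlib.h>

struct file;  /* opaque here; only a pointer to it is stored */

/* Same four fields as Table 16-2, under a toy name. */
struct toy_pio_request {
    struct toy_pio_request *next;  /* next element in queue */
    struct file *file;             /* file object pointer */
    unsigned long offset;          /* file offset */
    unsigned long page;            /* page initial address */
};

static struct toy_pio_request *toy_pio_first;  /* head of the queue */
static struct toy_pio_request *toy_pio_last;   /* tail of the queue */

/* Appends a request at the tail. Returns 0 on success, -1 if no memory. */
int toy_pio_enqueue(struct file *file, unsigned long offset, unsigned long page)
{
    struct toy_pio_request *r = malloc(sizeof *r);

    if (!r)
        return -1;
    r->next = NULL;
    r->file = file;
    r->offset = offset;
    r->page = page;

    if (toy_pio_last)
        toy_pio_last->next = r;   /* link after the current tail */
    else
        toy_pio_first = r;        /* queue was empty */
    toy_pio_last = r;
    return 0;
}

/* Removes and returns the oldest request, or NULL if the queue is empty.
 * The caller owns the returned descriptor and must free() it. */
struct toy_pio_request *toy_pio_dequeue(void)
{
    struct toy_pio_request *r = toy_pio_first;

    if (!r)
        return NULL;
    toy_pio_first = r->next;
    if (!toy_pio_first)
        toy_pio_last = NULL;      /* queue became empty */
    return r;
}
```

Requests leave the queue in the order in which they were added, so a consumer that loops on toy_pio_dequeue( ) until it returns NULL drains the queue the way the text describes for kpiod.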


When invoked by filemap_swapout( ), the filemap_write_page( ) function invokes make_pio_request( ) to add a request to the kpiod input queue, instead of the usual do_write_page( ) that does its own data transfer. The make_pio_request( ) function performs the following operations:

1. Increments the usage counter of the page to be written.
2. Allocates a new pio_request descriptor. If no memory is available, tries to prune some disk caches without doing any actual I/O operation. The function does this by invoking try_to_free_pages( ) with the __GFP_IO flag cleared in its parameter (see Section 16.7.4 later in this chapter). make_pio_request( ) then tries again to allocate the pio_request descriptor.
3. Initializes the fields of the pio_request descriptor.
4. Inserts the pio_request descriptor in the request queue.
5. Wakes up the processes (actually, the kpiod kernel thread) in the pio_wait wait queue.

Thus, the make_pio_request( ) function does not trigger any I/O operation; instead, it wakes up the kpiod kernel thread. The thread executes the kpiod( ) function, which considers all requests in the input queue and invokes the do_write_page( ) function on each of them to write the corresponding page to disk. The page counter is then decremented and the pio_request descriptor is released to the slab allocator. When all requests in the queue have been processed, kpiod( ) inserts itself in the pio_wait wait queue and puts itself to sleep.

kpiod( ) must guard against another potential error. In general, when a kernel thread requests some free page frames and free memory is low, it starts reclaiming pages. In order to do this, it may need to request a few additional page frames. However, during this new request the thread should never try to reclaim pages, or infinite recursion might occur. For this reason, a PF_MEMALLOC flag is defined in each process. It essentially forbids recursive invocations of try_to_free_pages( ), so the kernel always sets the flag before invoking that function and clears it when the function returns. In particular, the value of this flag is checked by __get_free_pages( ); if it is set, the try_to_free_pages( ) function is never invoked. The kpiod kernel thread always runs with PF_MEMALLOC set.

16.6 Page Swap-In

Swap-in must take place when a process attempts to address a page within its address space that has been swapped out to disk. The "Page fault" exception handler triggers a swap-in operation when the following conditions occur (see Section 7.4.2 in Chapter 7):

• The page including the address that caused the exception is a valid one, that is, it belongs to a memory region of the current process.
• The page is not present in memory, that is, the Present flag in the Page Table entry is cleared.
• The Page Table entry associated with the page is not null, which means it contains a swapped-out page identifier.

As described in Section 7.4.3 in Chapter 7, the handle_pte_fault( ) function, invoked by the do_page_fault( ) exception handler, checks whether the Page Table entry is non-null. If so, it invokes do_swap_page( ), which acts on the following parameters:


tsk
    Process descriptor address of the process that caused the "Page fault" exception

address
    Linear address that caused the exception

vma
    Memory region descriptor address of the region that includes address

page_table
    Address of the Page Table entry that maps address

entry
    Identifier of the swapped-out page

write_access
    Flag denoting whether the attempted access was a read or a write

Linux allows each memory region to include a customized function for performing swap-in. A region that needs such a customized function stores a pointer to it in the swapin field of its descriptor. Until recently, IPC shared memory regions had a special swapin method. But from Linux 2.2 on, no memory regions have a customized method. If such a method were provided, do_swap_page( ) would perform the following operations:

1. Invoke the swapin method. It returns a Page Table entry value, which contains the address of the page frame to be assigned to the process.
2. Write the value returned from the swapin method into the Page Table entry that page_table points to.
3. If the page frame usage counter is greater than 1 and the memory region is shared, clear the Read/Write flag of the Page Table entry.
4. Increment the mm->rss and the tsk->maj_flt fields of the process.
5. Release the kernel_flag global kernel lock, which had been obtained when entering the exception handler.
6. Return the value 1.

Conversely, when the swapin method is not defined, do_swap_page( ) invokes the general swap_in( ) function. It acts on the same parameters as do_swap_page( ) and performs the following steps:

1. Invokes lookup_swap_cache( ) to check whether the swap cache already contains the page specified by entry. If so, goes to step 4.
2. Invokes the swapin_readahead( ) function to read from the swap area a group of 2^n pages, including the requested one. The value n is stored into the page_cluster variable, which is usually set to 4, but it could be lower if the system has less than 32 MB of memory. Each page is read by invoking the read_swap_cache_async( ) function, specifying a null wait parameter (asynchronous swap operation).
3. Invokes read_swap_cache( ) on entry, just in case the swapin_readahead( ) function failed to read the requested page (for instance, because too many asynchronous swap operations were already being carried out by the system). Recall that read_swap_cache( ) activates a synchronous swap operation. As a consequence, the current process will be suspended until the page has been read from disk.
4. Checks whether the entry to which page_table points differs from entry. If so, another kernel control path has already swapped in the requested page. Therefore, invokes free_page_and_swap_cache( ) to release the page obtained previously and returns.
5. Invokes swap_free( ) to decrement the usage counter of the page slot corresponding to entry.
6. Increments the mm->rss and min_flt fields of the process.
7. If the page is shared by several processes or the process is attempting only a read on it, the page stays in the swap cache. However, the Page Table of the process must be updated so the process can find the page. Therefore, writes the physical address of the requested page and the protection bits found in the vm_page_prot field of the memory region into the Page Table entry to which page_table points.
8. Otherwise, if the page is not shared and the process attempted to write it, there is no reason to keep it in the swap cache, since it is private to the process. Therefore, invokes delete_from_swap_cache( ) and writes the same information described by the previous step into the Page Table entry. However, sets the Read/Write and Dirty bits to 1.

16.7 Freeing Page Frames

Page frames can be freed in several possible ways:

• By reclaiming an unused page frame within a cache. Depending on the type of cache, the following functions are used:

    shrink_mmap( )
        Used for the page cache, swap cache, and buffer cache

    shrink_dcache_memory( )
        Used for the dentry cache

    kmem_cache_reap( )
        Used for the slab cache (see Section 6.2.7 in Chapter 6)

• By swapping out a page belonging to an anonymous memory region of a process or a modified page belonging to a private memory mapping.
• By swapping out a page belonging to an IPC shared memory region.

As we shall see shortly, the choice among these possibilities is done in a rather empirical way, with very little support from theory. The situation is somewhat similar to evaluating the factors that determine the dynamic priority of a process. The main objective is to get a tuning of the parameters that achieves good system performance, without asking too many questions about why it works.

16.7.1 Monitoring the Free Memory

Besides the nr_free_pages variable, which expresses the current number of free page frames, the kernel relies on two values, a kind of low and high watermark. These values are stored in a structure called freepages (it also has a low field that is no longer used in Linux 2.2):

min
    Minimum number of page frames reserved to the kernel to perform crucial operations (e.g., for swapping pages to disk). free_area_init( ) initializes this field to 2 x n, where n denotes the size of primary memory expressed in megabytes. The resulting value must lie in the range 10 to 256.

high
    The threshold of nr_free_pages that indicates to the kernel that enough free memory is available. In this case, no swapping is done; free_area_init( ) initializes this threshold value to 3 x freepages.min.

The contents of the min and high fields can be modified by writing into the file /proc/sys/vm/freepages.

16.7.2 Reclaiming Pages from the Page, Swap, and Buffer Caches

In order to reclaim page frames from the disk caches, the kernel makes use of the shrink_mmap( ) function. It returns 1 if it succeeds in freeing a page frame belonging to the page cache, the swap cache, or the buffer cache; otherwise, it returns 0. The function acts on two parameters:

priority
    Fraction of total number of page frames to be checked before the function gives up and terminates with a return value of 0. The parameter's value ranges from 0 (very urgent: shrink everything) to 6 (nonurgent: try to shrink a bit).

gfp_mask
    Flags specifying the kind of page frame to be freed.

The function scans the mem_map array and looks for a page that can be freed. To fit the bill, the page must belong to one of the above caches, must be unlocked, must not be used by any process, and must have the PG_DMA flag set if the page frame is requested for ISA DMA. Moreover, it must not have been recently accessed.

A problem of fairness exists, similar to the one encountered by swap_out_process( ) when choosing the first memory region of a process to be checked. When shrink_mmap( ) is invoked, it should not always start scanning the mem_map array from the beginning, or pages with lower physical addresses would have much less chance of being in a disk cache than pages with higher physical addresses. The clock [3] static local variable plays the same role as the swap_address field of a process descriptor: it points to the next page frame to be checked in the mem_map array.

[3] The name of this local variable derives from the idea of the hand of a clock moving circularly. The function has nothing to do with system timers, of course.

The function scans the page descriptors of mem_map by performing the following steps:

1. Initializes the local variable count to the number n/2^p of unlocked, nonshared page frames that should be checked during this activation of the function. Here, n is the number of page frames in the system as found in the num_physpages variable, and p is equal to priority.
2. If count is greater than 0, increments clock and performs the following substeps on the page descriptor in mem_map[clock]:
    a. If the page is locked, if its PG_DMA flag is cleared while the gfp_mask parameter specifies an ISA DMA page, or if its usage counter is not equal to 1, skips the page and restarts step 2 on the next page.
    b. The page is unlocked and nonshared, so decrements count.
    c. If the PG_swap_cache flag is set, the page belongs to the swap cache. It can be reclaimed if either of the following conditions holds:
        • Its PG_referenced flag is off, which means that the page has not been accessed since the last invocation of shrink_mmap( ). (This flag acts like the Accessed flag shown earlier as a simple way to hold back swapping.)
        • The page slot usage counter is 1 (no process is referencing it).
       If the page can be reclaimed, invokes delete_from_swap_cache( ), clears the PG_referenced flag, and returns 1 (a page frame has been freed).
    d. If the PG_referenced flag of the page is set, the page has been recently accessed, thus it cannot be reclaimed: clears the flag and restarts step 2 on the next page.
    e. If the page belongs to the buffer cache (that is, the buffers field of the page descriptor is not null) and the buffer cache size is greater than the threshold specified by the buffer_mem.min_percent system parameter, invokes try_to_free_buffers( ) to check if all buffers in the page are unused. In particular, this function performs the following operations:
        a. Considers all buffers in the page to determine whether they can be released. They must all be free (that is, their usage counters must be null), unlocked, not dirty, and not protected. If one of them fails these tests, very little can be done. Invokes wakeup_bdflush( ) (see Section 14.1.5 in Chapter 14) and returns 0 to signal that the page has not been freed.
        b. All buffers are unused. Invokes remove_from_queues( ) and put_unused_buffer_head( ) repeatedly to release the corresponding buffer heads.
        c. Decrements nr_buffers by the number of buffers in the page and decrements buffermem by 4 KB.


        d. Wakes up the processes suspended for lack of buffer heads and sleeping in the buffer_wait wait queue (see Section 14.1.4 in Chapter 14).
        e. Invokes __free_page( ) to release the page frame to the Buddy system, and returns 1 to signal that the page has been freed.
       If try_to_free_buffers( ) returns the value 0, the page cannot be freed: goes to step 2. Otherwise, returns the value 1.
    f. If the page belongs to the page cache (that is, the inode field of the page descriptor is not null) and the page cache size is greater than the threshold specified by the page_cache.min_percent system parameter, invokes remove_inode_page( ) (see Section 14.2.2 in Chapter 14) to remove the page from the page cache and releases the page frame to the Buddy system, then returns the value 1.
3. If this point is reached, no page frame has been freed: returns the value 0.

16.7.3 Reclaiming Pages from the Dentry and Inode Caches

Dentry objects themselves aren't big, but freeing one of them has a cascading effect that can ultimately free a lot of memory by releasing several data structures. The shrink_dcache_memory( ) function is invoked to remove dentry objects from the dentry cache. Clearly, only dentry objects not referenced by any process (defined as unused dentries in Section 12.2.5 in Chapter 12) can be removed.

Since the dentry cache objects are allocated through the slab allocator, the shrink_dcache_memory( ) function may force some slabs to become free, thus some page frames may be consequently reclaimed by kmem_cache_reap( ) (see Section 6.2.7 in Chapter 6). Moreover, the dentry cache acts as a controller of the inode cache. Therefore, when a dentry object is released, the buffer storing the corresponding inode becomes unused, and the shrink_mmap( ) function may release the corresponding buffer page.

The shrink_dcache_memory( ) function receives the same parameters as the shrink_mmap( ) function. It checks whether the kernel is allowed to perform I/O operations (if the __GFP_IO flag is set in the gfp_mask parameter) and, if so, invokes prune_dcache( ).

Two parameters are passed to the latter function: the number of dentry objects d_nr to be released and the number of inode objects i_nr to be released (because removing a dentry may induce an inode to be released as well). prune_dcache( ) stops shrinking the dentry cache as soon as one of the two targets has been reached. The value of the first parameter d_nr depends on priority. If it is 0, shrink_dcache_memory( ) passes the value 0 to prune_dcache( ), which means that all unused dentry objects will be removed. Otherwise, d_nr is computed to be n/priority, where n is the total number of unused dentry objects. The shrink_dcache_memory( ) function passes -1 as a second parameter to prune_dcache( ), which means that no limit is enforced on the number of released inodes.

The prune_dcache( ) function scans the list of unused dentries and invokes prune_one_dentry( ) on each object to be released. The latter function, in turn, performs the following operations:

1. Removes the dentry object from the dentry hash table and from the list of dentry objects in its parent's directory.


2. Invokes dentry_iput( ), which releases the dentry's inode using the d_iput dentry method, if defined, or the iput( ) function.
3. Invokes dput( ) on the parent dentry of dentry. As a result, its usage counter is decremented.
4. Returns the dentry object to the slab allocator (see Section 6.2.12 in Chapter 6).

16.7.4 The try_to_free_pages( ) Function

The try_to_free_pages( ) function is invoked:

• By the __get_free_pages( ) function (see Section 6.1.1 in Chapter 6) when the number of free page frames falls below the threshold specified in freepages.min and the PF_MEMALLOC flag of the current process is cleared
• By the make_pio_request( ) function (see Section 16.5.2 earlier in this chapter) when it fails to allocate a new pio_request descriptor

The function receives as its parameter a set of flags gfp_mask, whose meaning is exactly the same as the corresponding parameter of the __get_free_pages( ) function. In particular, the __GFP_IO flag is set if the kernel is allowed to activate I/O data transfers, while the __GFP_WAIT flag is set if the kernel is allowed to discard the contents of page frames in order to free memory.

The function performs only two operations:

• Wakes up the kswapd kernel thread (see Section 16.7.6 later in this chapter)
• If the __GFP_WAIT flag in gfp_mask is set, invokes do_try_to_free_pages( ), passing to it the gfp_mask parameter

16.7.5 The do_try_to_free_pages( ) Function

The do_try_to_free_pages( ) function is invoked by try_to_free_pages( ) and by the kswapd kernel thread. It receives the usual gfp_mask parameter and tries to free at least SWAP_CLUSTER_MAX page frames (usually 32). A few auxiliary functions are invoked to do the job. Some of them return after releasing a single page frame, so they must be invoked repeatedly.

The algorithm implemented by do_try_to_free_pages( ) is quite reasonable, since the page frames are released according to their usage. For instance, the algorithm favors the preservation of page frames used by the dentry cache over the preservation of unused page frames in the slab allocator caches. Moreover, do_try_to_free_pages( ) tries to free memory by invoking the functions that do the reclaiming with decreasing priority values. In general, a lower value for priority means that more iterations will be performed by the functions before quitting. do_try_to_free_pages( ) gives up when all functions have been invoked with a 0 priority.

In particular, the function executes the following actions:

1. Acquires the global kernel lock by invoking lock_kernel( ).
2. Invokes kmem_cache_reap(gfp_mask) to reclaim page frames from the slab allocator caches.


3. Sets a priority local variable to 6 (the lowest priority).
4. Tries to free pages over a series of more and more thorough searches, driven by increasing the priority (that is, lowering the value of the priority variable) on each iteration. To be specific, while priority is greater than or equal to 0 and the number of released page frames is lower than SWAP_CLUSTER_MAX, performs the following substeps:
    a. Invokes shrink_mmap(priority, gfp_mask) repeatedly until it fails in releasing a page frame belonging to the page cache, to the swap cache, or to the buffer cache or until the number of released page frames reaches SWAP_CLUSTER_MAX
    b. If the kernel is allowed to write pages to disk (if the __GFP_IO flag in gfp_mask is set), invokes shm_swap(priority, gfp_mask) repeatedly until it fails in releasing a page frame belonging to an IPC shared memory region or until the number of released page frames reaches SWAP_CLUSTER_MAX
    c. Invokes swap_out(priority, gfp_mask) repeatedly until it fails in releasing to the Buddy system a page frame belonging to some process or until the number of released page frames reaches SWAP_CLUSTER_MAX
    d. Invokes shrink_dcache_memory(priority, gfp_mask) to release free elements in the dentry cache
    e. Decrements priority and goes to the start of the loop
5. Invokes unlock_kernel( ).
6. Returns 1 if at least SWAP_CLUSTER_MAX page frames have been released, 0 otherwise.

16.7.6 The kswapd Kernel Thread

The kswapd kernel thread is another kernel mechanism that activates the reclamation of memory. Why is it necessary? Is it not sufficient to invoke try_to_free_pages( ) when free memory becomes scarce and another memory allocation request is issued?

Unfortunately, this is not the case. Some memory allocation requests are performed by interrupt and exception handlers, which cannot block the current process waiting for some page frame to be freed; moreover, some memory allocation requests are done by kernel control paths that have already acquired exclusive access to critical resources and that, therefore, cannot activate I/O data transfers. In the infrequent case in which all memory allocation requests are done by such sorts of kernel control paths, the kernel would never be able to free memory.

In order to avoid this situation, the kswapd kernel thread is activated once every 10 seconds. The thread executes the kswapd( ) function, which at each activation performs the following operations:

1. If nr_free_pages is greater than the freepages.high threshold, no memory reclaiming is necessary: goes to step 5.
2. Invokes do_try_to_free_pages( ) with gfp_mask set to __GFP_IO. In order to avoid recursive invocations of the function, the kernel thread executes with the PF_MEMALLOC flag set (see Section 16.5.2 earlier in this chapter). If the function does not succeed in freeing SWAP_CLUSTER_MAX page frames, goes to step 5.
3. If the need_resched field of current is equal to 0, goes to step 1 (no higher priority process is runnable, so continues to reclaim memory).


4. The need_resched field is equal to 1: yields the CPU to some other process by invoking schedule( ). The kswapd kernel thread remains runnable. When the thread resumes execution, goes to step 1.
5. Sets the state of current to TASK_INTERRUPTIBLE.
6. Invokes schedule_timeout( ), passing as its parameter the value 10*HZ, thus forcing the process to suspend itself and resume execution 10 seconds later. Then goes to step 1.

16.8 Anticipating Linux 2.4

Swapping must now take into consideration the existence of RAM zones; much of the swapping code has thus been rewritten in a simpler and cleaner way, mainly thanks to the new page cache implementation. The swap cache is still implemented on top of the page cache, but the swapper_inode fictitious inode object has been replaced by a file address space object. The kpiod kernel thread has been removed, because it is now safe to directly swap out pages of shared memory mappings. Moreover, the arrays of locks associated with each swap area are no longer used.

The most interesting change concerns the policy used to select the process from which to steal pages when reclaiming memory: it is the one that performed fewer page faults (recall that in Linux 2.2 it is the one that owns the largest number of page frames).
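Looking back at Section 16.7.1, the arithmetic behind the two watermarks is easy to lose in prose, so here is a minimal user-space sketch of it. The structure and function names (freepages_sketch, init_freepages_sketch( ), and the two predicates) are hypothetical and chosen only for illustration. The sketch also assumes that an out-of-range value for min is clamped into the range 10 to 256, and that the two thresholds are compared as described for __get_free_pages( ) in Section 16.7.4 and for kswapd( ) in Section 16.7.6.

```c
#include <assert.h>

/* Only the two fields still in use in Linux 2.2, per Section 16.7.1. */
struct freepages_sketch {
    unsigned int min;    /* page frames reserved for crucial operations */
    unsigned int high;   /* above this, enough free memory is available */
};

/* Compute the watermarks from the memory size in megabytes:
 * min = 2 x n, kept within 10..256 (clamping is an assumption),
 * high = 3 x min. */
struct freepages_sketch init_freepages_sketch(unsigned int mem_mb)
{
    struct freepages_sketch fp;
    unsigned int min;

    assert(mem_mb > 0 && mem_mb <= 1024u * 1024u);
    min = 2 * mem_mb;
    if (min < 10)
        min = 10;
    if (min > 256)
        min = 256;
    fp.min = min;
    fp.high = 3 * min;
    return fp;
}

/* The allocator asks for reclaiming when free frames fall below min. */
int allocator_needs_reclaim(const struct freepages_sketch *fp,
                            unsigned int nr_free_pages)
{
    return nr_free_pages < fp->min;
}

/* kswapd skips reclaiming only when free frames exceed high. */
int kswapd_needs_reclaim(const struct freepages_sketch *fp,
                         unsigned int nr_free_pages)
{
    return !(nr_free_pages > fp->high);
}
```

For a machine with 64 MB of memory, this gives min equal to 128 page frames and high equal to 384 page frames; with 512 MB, min is capped at 256 and high becomes 768.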

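The escalating search performed by do_try_to_free_pages( ) in Section 16.7.5 can be summarized with a short user-space sketch. This is not the kernel's code: the reclaiming functions are passed in as callbacks that free at most one page frame per call, GFP_IO_FLAG is a placeholder bit rather than the real value of __GFP_IO, and the names reclaim_fn, reclaimers, drain_reclaimer( ), free_pages_sketch( ), toy_take( ), and toy_none( ) are hypothetical. The kernel lock, kmem_cache_reap( ), and shrink_dcache_memory( ) are left out because the text does not say how they contribute to the count of released page frames.

```c
#include <assert.h>
#include <stddef.h>

#define SWAP_CLUSTER_MAX 32   /* "usually 32", per Section 16.7.5 */
#define GFP_IO_FLAG 0x1       /* placeholder bit, not the real __GFP_IO */

/* A reclaimer frees at most one page frame per call:
 * it returns 1 on success and 0 on failure. */
typedef int (*reclaim_fn)(int priority, int gfp_mask);

struct reclaimers {
    reclaim_fn shrink_mmap;   /* page, swap, and buffer caches */
    reclaim_fn shm_swap;      /* IPC shared memory regions */
    reclaim_fn swap_out;      /* pages belonging to processes */
};

/* Toy reclaimers for experimentation: one draws from a finite pool,
 * the other never succeeds. */
int toy_cache_pages = 0;

int toy_take(int priority, int gfp_mask)
{
    (void)priority;
    (void)gfp_mask;
    if (toy_cache_pages > 0) {
        toy_cache_pages--;
        return 1;
    }
    return 0;
}

int toy_none(int priority, int gfp_mask)
{
    (void)priority;
    (void)gfp_mask;
    return 0;
}

/* Invoke fn repeatedly until it fails or the target is reached;
 * returns the updated number of released page frames. */
int drain_reclaimer(reclaim_fn fn, int priority, int gfp_mask, int freed)
{
    while (freed < SWAP_CLUSTER_MAX && fn(priority, gfp_mask))
        freed++;
    return freed;
}

/* Returns 1 if at least SWAP_CLUSTER_MAX frames were released, else 0. */
int free_pages_sketch(const struct reclaimers *r, int gfp_mask)
{
    int freed = 0;
    int priority;

    assert(r != NULL && r->shrink_mmap && r->shm_swap && r->swap_out);
    /* Start at 6 (least thorough) and work down to 0 (most thorough). */
    for (priority = 6; priority >= 0 && freed < SWAP_CLUSTER_MAX; priority--) {
        freed = drain_reclaimer(r->shrink_mmap, priority, gfp_mask, freed);
        if (gfp_mask & GFP_IO_FLAG)   /* only if disk writes are allowed */
            freed = drain_reclaimer(r->shm_swap, priority, gfp_mask, freed);
        freed = drain_reclaimer(r->swap_out, priority, gfp_mask, freed);
    }
    return freed >= SWAP_CLUSTER_MAX ? 1 : 0;
}
```

With toy_take( ) as the only working reclaimer and a pool of 40 pages, the function stops after releasing 32 of them at priority 6 and returns 1; with a pool of 10, it runs through every priority value down to 0 and returns 0.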

Chapter 17. The Ext2 Filesystem

In this chapter, we finish our extensive discussion of I/O and filesystems by taking a look at the details the kernel has to take care of when interacting with a particular filesystem. Since the Second Extended Filesystem (Ext2) is native to Linux and is used on virtually every Linux system, it made a natural choice for this discussion. Furthermore, Ext2 illustrates a lot of good practices in its support for modern filesystem features with fast performance. To be sure, other filesystems will embody new and interesting requirements, because they are designed for other operating systems, but we cannot examine the oddities of various filesystems and platforms in this book.

After introducing Ext2 in Section 17.1, we describe the data structures needed, just as in other chapters. Since we are looking at a particular way to store data on a disk, we have to consider two versions of data structures: Section 17.2 shows the data structures stored by Ext2 on the disk, while Section 17.3 shows how they are duplicated in memory.

Then we get to the operations performed on the filesystem. In Section 17.4, we discuss how Ext2 is created in a disk partition. The next sections describe the kernel activities performed whenever the disk is used. Most of these are relatively low-level activities dealing with the allocation of disk space to inodes and data blocks. Then we'll discuss how Ext2 regular files are read and written.

17.1 General Characteristics

Each Unix-like operating system makes use of its own filesystem. Although all such filesystems comply with the POSIX interface, each of them is implemented in a different way.

The first versions of Linux were based on the Minix filesystem. As Linux matured, the Extended Filesystem (Ext FS) was introduced; it included several significant extensions but offered unsatisfactory performance. The Second Extended Filesystem (Ext2) was introduced in 1994: besides including several new features, it is quite efficient and robust and has become the most widely used Linux filesystem.

The following features contribute to the efficiency of Ext2:

• When creating an Ext2 filesystem, the system administrator may choose the optimal block size (from 1024 to 4096 bytes), depending on the expected average file length. For instance, a 1024-byte block size is preferable when the average file length is smaller than a few thousand bytes because this leads to less internal fragmentation, that is, less of a mismatch between the file length and the portion of the disk that stores it (see also Section 6.2 in Chapter 6, where internal fragmentation was discussed for dynamic memory). On the other hand, larger block sizes are usually preferable for files greater than a few thousand bytes because this leads to fewer disk transfers, thus reducing system overhead.
• When creating an Ext2 filesystem, the system administrator may choose how many inodes to allow for a partition of a given size, depending on the expected number of files to be stored on it. This maximizes the effectively usable disk space.
• The filesystem partitions disk blocks into groups. Each group includes data blocks and inodes stored in adjacent tracks. Thanks to this structure, files stored in a single block group can be accessed with a lower average disk seek time.


• The filesystem preallocates disk data blocks to regular files before they are actually used. Thus, when the file increases in size, several blocks are already reserved at physically adjacent positions, reducing file fragmentation.
• Fast symbolic links are supported. If the pathname of the symbolic link (see Section 1.5.2 in Chapter 1) has 60 bytes or fewer, it is stored in the inode and can thus be translated without reading a data block.

Moreover, the Second Extended File System includes other features that make it both robust and flexible:

• A careful implementation of the file-updating strategy that minimizes the impact of system crashes. For instance, when creating a new hard link for a file, the counter of hard links in the disk inode is incremented first, and the new name is added to the proper directory next. In this way, if a hardware failure occurs after the inode update but before the directory can be changed, the directory is consistent, even if the inode's hard link counter is wrong. Deleting the file does not lead to catastrophic results, although the file's data blocks cannot be automatically reclaimed. If the reverse were done (changing the directory before updating the inode), the same hardware failure would produce a dangerous inconsistency: deleting the original hard link would remove its data blocks from disk, yet the new directory entry would refer to an inode that no longer exists. If that inode number is used later for another file, writing into the stale directory entry will corrupt the new file.
• Support for automatic consistency checks on the filesystem status at boot time. The checks are performed by the /sbin/e2fsck external program, which may be activated not only after a system crash, but also after a predefined number of filesystem mountings (a counter is incremented after each mount operation) or after a predefined amount of time has elapsed since the most recent check.
• Support for immutable files (they cannot be modified) and for append-only files (data can be added only to the end of them). Even the superuser is not allowed to override these kinds of protection.
• Compatibility with both the Unix System V Release 4 and the BSD semantics of the Group ID for a new file. In SVR4 the new file assumes the Group ID of the process that creates it; in BSD the new file inherits the Group ID of the directory containing it. Ext2 includes a mount option that specifies which semantics is used.

Several additional features are being considered for the next major version of the Ext2 filesystem. Some of them have already been coded and are available as external patches. Others are just planned, but in some cases fields have already been introduced in the Ext2 inode for them. The most significant features are:

Block fragmentation
System administrators usually choose large block sizes for accessing recent disks. As a result, small files stored in large blocks waste a lot of disk space. This problem can be solved by allowing several files to be stored in different fragments of the same block.


Access Control Lists
Instead of classifying the users of a file under three classes (owner, group, and others), an access control list (ACL) is associated with each file to specify the access rights for any specific users or combinations of users.

Handling of compressed and encrypted files
These new options, which must be specified when creating a file, will allow users to store compressed and/or encrypted versions of their files on disk.

Logical deletion
An undelete option will allow users to easily recover, if needed, the contents of a previously removed file.

17.2 Disk Data Structures

The first block in any Ext2 partition is never managed by the Ext2 filesystem, since it is reserved for the partition boot sector (see Appendix A). The rest of the Ext2 partition is split into block groups, each of which has the layout shown in Figure 17-1. As you will notice from the figure, some data structures must fit in exactly one block while others may require more than one block. All the block groups in the filesystem have the same size and are stored sequentially, so the kernel can derive the location of a block group in a disk simply from its integer index.

Figure 17-1. Layouts of an Ext2 partition and of an Ext2 block group

Block groups reduce file fragmentation, since the kernel tries to keep the data blocks belonging to a file in the same block group if possible. Each block in a block group contains one of the following pieces of information:

• A copy of the filesystem's superblock
• A copy of the group of block group descriptors
• A data block bitmap
• A group of inodes
• An inode bitmap
• A chunk of data belonging to a file; that is, a data block

If a block does not contain any meaningful information, it is said to be free.
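Because all block groups have the same size and are stored sequentially, finding one takes a single multiplication. The following is a minimal sketch rather than kernel code: it assumes that the first group starts at the block number recorded in the superblock's s_first_data_block field and that every group spans s_blocks_per_group blocks. The helper names group_first_block and block_to_group are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of the first block of block group `group`,
 * assuming groups are laid out back to back starting at
 * `first_data_block`. */
static uint32_t group_first_block(uint32_t first_data_block,
                                  uint32_t blocks_per_group,
                                  uint32_t group)
{
    assert(blocks_per_group > 0);
    return first_data_block + group * blocks_per_group;
}

/* Inverse mapping (also hypothetical): index of the block group
 * that contains block number `block`. */
static uint32_t block_to_group(uint32_t first_data_block,
                               uint32_t blocks_per_group,
                               uint32_t block)
{
    assert(blocks_per_group > 0 && block >= first_data_block);
    return (block - first_data_block) / blocks_per_group;
}
```

With 8192 blocks per group and a first data block of 1, group 2 would start at block 16385 under these assumptions.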


As can be seen from Figure 17-1, both the superblock and the group descriptors are duplicated in each block group. Only the superblock and the group descriptors included in block group 0 are used by the kernel, while the remaining superblocks and group descriptors are left unchanged; in fact, the kernel doesn't even look at them. When the /sbin/e2fsck program executes a consistency check on the filesystem status, it refers to the superblock and the group descriptors stored in block group 0, then copies them into all other block groups. If data corruption occurs and the main superblock or the main group descriptors in block group 0 become invalid, the system administrator can instruct /sbin/e2fsck to refer to the old copies of the superblock and the group descriptors stored in a block group other than the first. Usually, the redundant copies store enough information to allow /sbin/e2fsck to bring the Ext2 partition back to a consistent state.

How many block groups are there? Well, that depends both on the partition size and on the block size. The main constraint is that the block bitmap, which is used to identify the blocks that are used and free inside a group, must be stored in a single block. Therefore, in each block group there can be at most 8×b blocks, where b is the block size in bytes. Thus, the total number of block groups is roughly s/(8×b), where s is the partition size in blocks.

As an example, let's consider an 8 GB Ext2 partition with a 4 KB block size. In this case, each 4 KB block bitmap describes 32 K data blocks, that is, 128 MB. Therefore, at most 64 block groups are needed. Clearly, the smaller the block size, the larger the number of block groups.

17.2.1 Superblock

An Ext2 disk superblock is stored in an ext2_super_block structure, whose fields are listed in Table 17-1. The __u8, __u16, and __u32 data types denote unsigned numbers of length 8, 16, and 32 bits respectively, while the __s8, __s16, and __s32 data types denote signed numbers of length 8, 16, and 32 bits.
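Before turning to the individual superblock fields, the block-group arithmetic from the 8 GB example above can be checked with a few lines of C. This is only a sketch under simplifying assumptions: every group is taken to hold the maximum number of blocks that one bitmap block can describe, and the boot block is ignored. The function names are illustrative and do not appear in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Blocks that a single bitmap block can describe: one bit per block. */
static uint64_t max_blocks_per_group(uint32_t block_size)
{
    return 8ULL * block_size;
}

/* Number of block groups for a partition of `partition_bytes` bytes,
 * rounding up so that a partial last group is counted. */
static uint64_t block_group_count(uint64_t partition_bytes,
                                  uint32_t block_size)
{
    assert(block_size > 0);
    uint64_t blocks = partition_bytes / block_size;        /* s */
    uint64_t per_group = max_blocks_per_group(block_size); /* 8 x b */
    return (blocks + per_group - 1) / per_group;
}
```

Calling block_group_count(8ULL << 30, 4096) yields 64, matching the example; with a 1 KB block size the same partition would need 1024 groups, which illustrates why smaller blocks mean more block groups.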


Table 17-1. The Fields of the Ext2 Superblock

Type        Field                     Description
__u32       s_inodes_count            Total number of inodes
__u32       s_blocks_count            Filesystem size in blocks
__u32       s_r_blocks_count          Number of reserved blocks
__u32       s_free_blocks_count       Free blocks counter
__u32       s_free_inodes_count       Free inodes counter
__u32       s_first_data_block        Number of first useful block (always 1)
__u32       s_log_block_size          Block size
__s32       s_log_frag_size           Fragment size
__u32       s_blocks_per_group        Number of blocks per group
__u32       s_frags_per_group         Number of fragments per group
__u32       s_inodes_per_group        Number of inodes per group
__u32       s_mtime                   Time of last mount operation
__u32       s_wtime                   Time of last write operation
__u16       s_mnt_count               Mount operations counter
__u16       s_max_mnt_count           Number of mount operations before check
__u16       s_magic                   Magic signature
__u16       s_state                   Status flag
__u16       s_errors                  Behavior when detecting errors
__u16       s_minor_rev_level         Minor revision level
__u32       s_lastcheck               Time of last check
__u32       s_checkinterval           Time between checks
__u32       s_creator_os              OS where filesystem was created
__u32       s_rev_level               Revision level
__u16       s_def_resuid              Default UID for reserved blocks
__u16       s_def_resgid              Default GID for reserved blocks
__u32       s_first_ino               Number of first nonreserved inode
__u16       s_inode_size              Size of on-disk inode structure
__u16       s_block_group_nr          Block group number of this superblock
__u32       s_feature_compat          Compatible features bitmap
__u32       s_feature_incompat        Incompatible features bitmap
__u32       s_feature_ro_compat       Read-only-compatible features bitmap
__u8 [16]   s_uuid                    128-bit filesystem identifier
char [16]   s_volume_name             Volume name
char [64]   s_last_mounted            Pathname of last mount point
__u32       s_algorithm_usage_bitmap  Used for compression
__u8        s_prealloc_blocks         Number of blocks to preallocate
__u8        s_prealloc_dir_blocks     Number of blocks to preallocate for directories
__u8 [818]  s_padding                 Nulls to pad out 1024 bytes

The s_inodes_count field stores the number of inodes, while the s_blocks_count field stores the number of blocks in the Ext2 filesystem.

The s_log_block_size field expresses the block size as a power of 2, using 1024 bytes as the unit. Thus, 0 denotes 1024-byte blocks, 1 denotes 2048-byte blocks, and so on. The s_log_frag_size field is currently equal to s_log_block_size, since block fragmentation is not yet implemented.
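The encoding of s_log_block_size amounts to a left shift of 1024. The sketch below decodes it and derives the filesystem size in bytes from s_blocks_count; the two function names are hypothetical, and the assertion restricts the input to the block sizes discussed in this chapter (1024, 2048, and 4096 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Block size in bytes encoded by s_log_block_size:
 * 0 -> 1024, 1 -> 2048, 2 -> 4096. */
static uint32_t ext2_block_size(uint32_t s_log_block_size)
{
    assert(s_log_block_size <= 2);   /* sizes covered in this chapter */
    return 1024u << s_log_block_size;
}

/* Filesystem size in bytes: number of blocks times block size.
 * The product is computed in 64 bits to avoid overflow. */
static uint64_t ext2_fs_bytes(uint32_t s_blocks_count,
                              uint32_t s_log_block_size)
{
    return (uint64_t)s_blocks_count * ext2_block_size(s_log_block_size);
}
```

For instance, a superblock with s_blocks_count equal to 2,097,152 and s_log_block_size equal to 2 describes an 8 GB filesystem.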


The s_blocks_per_group, s_frags_per_group, and s_inodes_per_group fields store the number of blocks, fragments, and inodes in each block group, respectively.

Some disk blocks are reserved for the superuser (or for some other user or group of users selected by the s_def_resuid and s_def_resgid fields). These blocks allow the system administrator to continue to use the filesystem even when no more free blocks are available for normal users.

The s_mnt_count, s_max_mnt_count, s_lastcheck, and s_checkinterval fields set up the Ext2 filesystem to be checked automatically at boot time. These fields cause /sbin/e2fsck to run after a predefined number of mount operations has been performed, or when a predefined amount of time has elapsed since the last consistency check. (Both kinds of checks can be used together.) The consistency check is also enforced at boot time if the filesystem has not been cleanly unmounted (for instance, after a system crash) or when the kernel discovers some errors in it. The s_state field stores the value 0 if the filesystem is mounted or was not cleanly unmounted, 1 if it was cleanly unmounted, and 2 if it contains errors.

17.2.2 Group Descriptor and Bitmap

Each block group has its own group descriptor, an ext2_group_desc structure whose fields are illustrated in Table 17-2.

Table 17-2. The Fields of the Ext2 Group Descriptor

Type       Field                 Description
__u32      bg_block_bitmap       Block number of block bitmap
__u32      bg_inode_bitmap       Block number of inode bitmap
__u32      bg_inode_table        Block number of first inode table block
__u16      bg_free_blocks_count  Number of free blocks in the group
__u16      bg_free_inodes_count  Number of free inodes in the group
__u16      bg_used_dirs_count    Number of directories in the group
__u16      bg_pad                Alignment to word
__u32 [3]  bg_reserved           Nulls to pad out 24 bytes

The bg_free_blocks_count, bg_free_inodes_count, and bg_used_dirs_count fields are used when allocating new inodes and data blocks. These fields determine the most suitable block group in which to allocate each data structure. The bitmaps are sequences of bits, where the value 0 specifies that the corresponding inode or data block is free and the value 1 specifies that it is used. Since each bitmap must be stored inside a single block and since the block size can be 1024, 2048, or 4096 bytes, a single bitmap describes the state of 8192, 16,384, or 32,768 blocks.

17.2.3 Inode Table

The inode table consists of a series of consecutive blocks, each of which contains a predefined number of inodes. The block number of the first block of the inode table is stored in the bg_inode_table field of the group descriptor.

All inodes have the same size, 128 bytes. A 1024-byte block contains 8 inodes, while a 4096-byte block contains 32 inodes. To figure out how many blocks are occupied by the inode
table, divide the total number of inodes in a group (stored in the s_inodes_per_group field of the superblock) by the number of inodes per block.

Each Ext2 inode is an ext2_inode structure whose fields are illustrated in Table 17-3.

Table 17-3. The Fields of an Ext2 Disk Inode

Type                   Field          Description
__u16                  i_mode         File type and access rights
__u16                  i_uid          Owner identifier
__u32                  i_size         File length in bytes
__u32                  i_atime        Time of last file access
__u32                  i_ctime        Time that inode last changed
__u32                  i_mtime        Time that file contents last changed
__u32                  i_dtime        Time of file deletion
__u16                  i_gid          Group identifier
__u16                  i_links_count  Hard links counter
__u32                  i_blocks       Number of data blocks of the file
__u32                  i_flags        File flags
union                  osd1           Specific operating system information
__u32 [EXT2_N_BLOCKS]  i_block        Pointers to data blocks
__u32                  i_version      File version (for NFS)
__u32                  i_file_acl     File access control list
__u32                  i_dir_acl      Directory access control list
__u32                  i_faddr        Fragment address
union                  osd2           Specific operating system information

Many fields related to POSIX specifications are similar to the corresponding fields of the VFS's inode object and have already been discussed in Section 12.2.2 in Chapter 12. The remaining ones refer to the Ext2-specific implementation and deal mostly with block allocation.

In particular, the i_size field stores the effective length of the file in bytes, while the i_blocks field stores the number of data blocks (in units of 512 bytes) that have been allocated to the file.

The values of i_size and i_blocks are not necessarily related. Since a file is always stored in an integer number of blocks, a nonempty file receives at least one data block (since fragmentation is not yet implemented) and i_size may be smaller than 512 × i_blocks. On the other hand, as we shall see in Section 17.6.4 later in this chapter, a file may contain holes. In that case, i_size may be greater than 512 × i_blocks.

The i_block field is an array of EXT2_N_BLOCKS (usually 15) pointers to blocks used to identify the data blocks allocated to the file (see Section 17.6.3 later in this chapter).

The 32 bits reserved for the i_size field limit the file size to 4 GB. Actually, the highest-order bit of the i_size field is not used, thus the maximum file size is limited to 2 GB. However, the Ext2 filesystem includes a "dirty trick" that allows larger files on 64-bit architectures like Compaq's Alpha. Essentially, the i_dir_acl field of the inode, which is not used for regular files, represents a 32-bit extension of the i_size field. Therefore, the file size
is stored in the inode as a 64-bit integer. The 64-bit version of the Ext2 filesystem is somewhat compatible with the 32-bit version because an Ext2 filesystem created on a 64-bit architecture may be mounted on a 32-bit architecture, and vice versa. However, on a 32-bit architecture a large file cannot be accessed.

Recall that the VFS model requires each file to have a different inode number. In Ext2, there is no need to store the inode number of a file on disk because its value can be derived from the block group number and the relative position inside the inode table. As an example, suppose that each block group contains 4096 inodes and that we want to know the address on disk of inode 13021. Since inode numbers start at 1, the inode belongs to block group (13021 - 1) / 4096 = 3, counting block groups from 0, and its position inside that group is (13021 - 1) mod 4096 = 732; its disk address is therefore stored in the 733rd entry of the corresponding inode table. As you can see, the inode number is just a key used by the Ext2 routines to quickly retrieve the proper inode descriptor on disk.

17.2.4 How Various File Types Use Disk Blocks

The different types of files recognized by Ext2 (regular files, pipes, etc.) use data blocks in different ways. Some files store no data and therefore need no data blocks at all. This section discusses the storage requirements for each type.

17.2.4.1 Regular file

Regular files are the most common case and receive almost all the attention in this chapter. But a regular file needs data blocks only when it starts to have data. When first created, a regular file is empty and needs no data blocks; it can also be emptied by the truncate( ) system call. Both situations are common; for instance, when you issue a shell command that includes the string >filename, the shell creates an empty file or truncates an existing one.

17.2.4.2 Directory

Ext2 implements directories as a special kind of file whose data blocks store filenames together with the corresponding inode numbers. In particular, such data blocks contain structures of type ext2_dir_entry_2. The fields of that structure are shown in Table 17-4. The structure has a variable length, since its last field, name, is a variable-length array of up to EXT2_NAME_LEN characters (usually 255). Moreover, for reasons of efficiency, the length of a directory entry is always a multiple of 4, and therefore null characters (\0) are added for padding at the end of the filename if necessary. The name_len field stores the actual file name length (see Figure 17-2).

Table 17-4. The Fields of an Ext2 Directory Entry

Type                  Field      Description
__u32                 inode      Inode number
__u16                 rec_len    Directory entry length
__u8                  name_len   File name length
__u8                  file_type  File type
char [EXT2_NAME_LEN]  name       File name

The file_type field stores a value that specifies the file type (see Table 17-5). The rec_len field may be interpreted as a pointer to the next valid directory entry: it is the offset to be added to the starting address of the directory entry to get the starting address of the next valid
directory entry. In order to delete a directory entry, it is sufficient to set its inode field to 0 and to suitably increment the value of the rec_len field of the previous valid entry. Read the rec_len field of Figure 17-2 carefully; you'll see that the oldfile entry has been deleted because the rec_len field of usr is set to 12+16 (the lengths of the usr and oldfile entries).

Figure 17-2. An example of an Ext2 directory

17.2.4.3 Symbolic link

As stated before, if the pathname of the symbolic link has up to 60 characters, it is stored in the i_block field of the inode, which consists of an array of 15 4-byte integers; no data block is thus required. If the pathname is longer than 60 characters, however, a single data block is required.

17.2.4.4 Device file, pipe, and socket

No data blocks are required for these kinds of file. All the necessary information is stored in the inode.

Table 17-5. Ext2 File Types

file_type  Description
0          Unknown
1          Regular file
2          Directory
3          Character device
4          Block device
5          Named pipe
6          Socket
7          Symbolic link

17.3 Memory Data Structures

For the sake of efficiency, most information stored in the disk data structures of an Ext2 partition is copied into RAM when the filesystem is mounted, thus allowing the kernel to avoid many subsequent disk read operations. To get an idea of how often some data structures change, consider some fundamental operations:
• When a new file is created, the values of the s_free_inodes_count field in the Ext2 superblock and of the bg_free_inodes_count field in the proper group descriptor must be decremented.
• If the kernel appends some data to an existing file, so that the number of data blocks allocated for it increases, the values of the s_free_blocks_count field in the Ext2 superblock and of the bg_free_blocks_count field in the group descriptor must be modified.
• Even just rewriting a portion of an existing file involves an update of the s_wtime field of the Ext2 superblock.

Since all Ext2 disk data structures are stored in blocks of the Ext2 partition, the kernel uses the buffer cache to keep them up-to-date (see Section 14.1.5 in Chapter 14).

Table 17-6 specifies, for each type of data related to Ext2 filesystems and files, the data structure used on the disk to represent its data, the data structure used by the kernel in memory, and a rule of thumb used to determine how much caching is used. Data that is updated very frequently is always cached; that is, the data is permanently stored in memory and included in the buffer cache until the corresponding Ext2 partition is unmounted. The kernel gets this result by keeping the buffer's usage counter greater than 0 at all times.

Table 17-6. VFS Images of Ext2 Data Structures

Type              Disk Data Structure  Memory Data Structure  Caching Mode
Superblock        ext2_super_block     ext2_sb_info           Always cached
Group descriptor  ext2_group_desc      ext2_group_desc        Always cached
Block bitmap      Bit array in block   Bit array in buffer    Fixed limit
Inode bitmap      Bit array in block   Bit array in buffer    Fixed limit
Inode             ext2_inode           ext2_inode_info        Dynamic
Data block        Unspecified          Buffer                 Dynamic
Free inode        ext2_inode           None                   Never
Free block        Unspecified          None                   Never

The never-cached data is not kept in the buffer cache since it does not represent meaningful information.

In between these extremes lie two other modes: fixed limit and dynamic. In the fixed limit mode, a specific number of data structures can be kept in the buffer cache; older ones are flushed to disk when the number is exceeded. In the dynamic mode, the data is kept in the buffer cache as long as the associated object (an inode or block) is in use; when the file is closed or the block is deleted, the shrink_mmap( ) function may remove the associated data from the cache and write it back to disk.

17.3.1 The ext2_sb_info and ext2_inode_info Structures

When an Ext2 filesystem is mounted, the u field of the VFS superblock, which contains filesystem-specific data, is loaded with a structure of type ext2_sb_info so that the kernel can find out things related to the filesystem as a whole. This structure includes the following information:
• Most of the disk superblock fields
• The block bitmap cache, tracked by the s_block_bitmap and s_block_bitmap_number arrays (see next section)
• The inode bitmap cache, tracked by the s_inode_bitmap and s_inode_bitmap_number arrays (see next section)
• An s_sbh pointer to the buffer head of the buffer containing the disk superblock
• An s_es pointer to the buffer containing the disk superblock
• The number of group descriptors, s_desc_per_block, that can be packed in a block
• An s_group_desc pointer to an array of buffer heads of buffers containing the group descriptors (usually, a single entry is sufficient)
• Other data related to mount state, mount options, and so on

Similarly, when an inode object pertaining to an Ext2 file is initialized, the u field is loaded with a structure of type ext2_inode_info, which includes this information:

• Most of the fields found in the disk's inode structure that are not kept in the generic VFS inode object (see Table 12-3 in Chapter 12)
• The fragment size and the fragment number (not yet used)
• The block_group block group index to which the inode belongs (see Section 17.2 earlier in this chapter)
• The i_alloc_block and i_alloc_count fields, which are used for data block preallocation (see Section 17.6.5 later in this chapter)
• The i_osync field, which is a flag specifying whether the disk inode should be synchronously updated (see Section 17.7 later in this chapter)

17.3.2 Bitmap Caches

When the kernel mounts an Ext2 filesystem, it allocates a buffer for the Ext2 disk superblock and reads its contents from disk. The buffer is released only when the Ext2 filesystem is unmounted. When the kernel must modify a field in the Ext2 superblock, it simply writes the new value in the proper position of the corresponding buffer and then marks the buffer as dirty.

Unfortunately, this approach cannot be adopted for all Ext2 disk data structures. The tenfold increase in disk capacity reached in recent years has induced a tenfold increase in the size of inode and data block bitmaps, so we have reached the point at which it is no longer convenient to keep all the bitmaps in RAM at the same time.

For instance, consider a 4 GB disk with a 1 KB block size. Since each bitmap fills all the bits of a single block, each of them describes the status of 8192 blocks, that is, of 8 MB of disk storage. The number of block groups is 4096 MB/8 MB=512. Since each block group requires both an inode bitmap and a data block bitmap, 1 MB of RAM would be required to store all 1024 bitmaps in memory!

The solution adopted to limit the memory requirements of the Ext2 descriptors is to use, for any mounted Ext2 filesystem, two caches of size EXT2_MAX_GROUP_LOADED (usually 8). One cache stores the most recently accessed inode bitmaps, while the other cache stores the most recently accessed block bitmaps. Buffers containing bitmaps included in a cache have a usage counter greater than 0, therefore they are never freed by shrink_mmap( ) (see Section 16.7.2
in Chapter 16). Conversely, buffers containing bitmaps not included in a bitmap cache have a null usage counter, and thus they can be freed if free memory becomes scarce.

Each cache is implemented by means of two arrays of EXT2_MAX_GROUP_LOADED elements. One array contains the indexes of the block groups whose bitmaps are currently in the cache, while the other array contains pointers to the buffer heads that refer to those bitmaps.

The ext2_sb_info structure stores the arrays pertaining to the inode bitmap cache: indexes of block groups are found in the s_inode_bitmap_number field and pointers to buffer heads in the s_inode_bitmap field. The corresponding arrays for the block bitmap cache are stored in the s_block_bitmap_number and s_block_bitmap fields.

The load_inode_bitmap( ) function loads the inode bitmap of a specified block group and returns the cache position in which the bitmap can be found.

If the bitmap is not already in the bitmap cache, load_inode_bitmap( ) invokes read_inode_bitmap( ). The latter function gets the number of the block containing the bitmap from the bg_inode_bitmap field of the group descriptor, then invokes bread( ) to allocate a new buffer and read the block from disk if it is not already included in the buffer cache.

If the number of block groups in the Ext2 partition is less than or equal to EXT2_MAX_GROUP_LOADED, the index of the cache array position in which the bitmap is inserted always matches the block group index passed as the parameter to the load_inode_bitmap( ) function.

Otherwise, if there are more block groups than cache positions, a bitmap is removed from cache, if necessary, by using a Least Recently Used (LRU) policy, and the requested bitmap is inserted in the first cache position. Figure 17-3 illustrates the three possible cases in which the bitmap in block group 5 is referenced: where the requested bitmap is already in cache, where the bitmap is not in cache but there is a free position, and where the bitmap is not in cache and there is no free position.

Figure 17-3. Adding a bitmap to the cache

The load_block_bitmap( ) and read_block_bitmap( ) functions are very similar to load_inode_bitmap( ) and read_inode_bitmap( ), but they refer to the block bitmap cache of an Ext2 partition.
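The LRU behavior described for the case of more block groups than cache positions can be pictured with a small user-space sketch. It is not the kernel's load_inode_bitmap( ): the structure, the function names, and the dummy read_bitmap( ) loader are all hypothetical, buffer heads are replaced by opaque pointers, and the direct-mapped case for partitions with at most EXT2_MAX_GROUP_LOADED groups is left out. Slot 0 always holds the most recently used bitmap, and the last slot is the one evicted when the cache is full.

```c
#include <assert.h>
#include <string.h>

#define MAX_GROUP_LOADED 8   /* mirrors EXT2_MAX_GROUP_LOADED */

struct bitmap_cache {
    int loaded;                            /* number of valid slots */
    unsigned group[MAX_GROUP_LOADED];      /* block group index per slot */
    const void *bitmap[MAX_GROUP_LOADED];  /* stand-in for buffer heads */
};

/* Hypothetical loader standing in for a disk read: it just returns
 * a dummy non-NULL pointer. */
static const void *read_bitmap(unsigned group)
{
    static const char dummy[1];
    (void)group;
    return dummy;
}

/* Make `group` the most recently used entry; return its slot (always 0). */
static int load_bitmap(struct bitmap_cache *c, unsigned group)
{
    const void *bm;
    int i;

    assert(c->loaded >= 0 && c->loaded <= MAX_GROUP_LOADED);
    for (i = 0; i < c->loaded; i++)
        if (c->group[i] == group)
            break;

    if (i < c->loaded) {
        bm = c->bitmap[i];              /* hit: reuse the cached bitmap */
    } else {
        bm = read_bitmap(group);        /* miss: fetch the bitmap */
        if (c->loaded < MAX_GROUP_LOADED)
            c->loaded++;                /* a free slot is available */
        i = c->loaded - 1;              /* otherwise the last slot is evicted */
    }

    /* Shift slots 0..i-1 down by one, then put the entry in slot 0. */
    memmove(&c->group[1], &c->group[0], (size_t)i * sizeof c->group[0]);
    memmove(&c->bitmap[1], &c->bitmap[0], (size_t)i * sizeof c->bitmap[0]);
    c->group[0] = group;
    c->bitmap[0] = bm;
    return 0;
}
```

The three branches correspond to the three cases of Figure 17-3: a hit moves the entry to the front, a miss with a free position grows the cache, and a miss on a full cache overwrites the least recently used entry.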


Figure 17-4 illustrates the memory data structures of a mounted Ext2 filesystem. In our example, there are three block groups whose descriptors are stored in three blocks on disk; therefore, the s_group_desc field of the ext2_sb_info points to an array of three buffer heads. We have shown just one inode bitmap having index 2 and one block bitmap having index 4, although the kernel may keep in the bitmap caches 2 x EXT2_MAX_GROUP_LOADED bitmaps, and even more may be stored in the buffer cache.

Figure 17-4. Ext2 memory data structures

17.4 Creating the Filesystem

Formatting a disk partition or a floppy is not the same thing as creating a filesystem on it. Formatting allows the disk driver to read and write blocks on the disk, while creating a filesystem means setting up the structures described in detail earlier in this chapter.

Modern hard disks come preformatted from the factory and need not be reformatted; floppy disks may be formatted by using the /usr/bin/superformat utility program.

Ext2 filesystems are created by the /sbin/mke2fs utility program; it assumes the following default options, which may be modified by the user with flags on the command line:

• Block size: 1024 bytes
• Fragment size: block size
• Number of allocated inodes: one for each group of 4096 bytes
• Percentage of reserved blocks: 5%

The program performs the following actions:

1. Initializes the superblock and the group descriptors
2. Optionally, checks whether the partition contains defective blocks: if so, creates a list of defective blocks
3. For each block group, reserves all the disk blocks needed to store the superblock, the group descriptors, the inode table, and the two bitmaps
4. Initializes the inode bitmap and the data block bitmap of each block group to 0
5. Initializes the inode table of each block group


6. Creates the / root directory
7. Creates the lost+found directory, which is used by /sbin/e2fsck to link the lost and defective blocks found
8. Updates the inode bitmap and the data block bitmap of the block group in which the two previous directories have been created
9. Groups the defective blocks (if any) in the lost+found directory

Let's consider, for the sake of concreteness, how an Ext2 1.4 MB floppy disk is initialized by /sbin/mke2fs with the default options.

Once mounted, it will appear to the VFS as a volume consisting of 1390 blocks, each one 1024 bytes in length. To examine the disk's contents, we can execute the Unix command:

$ dd if=/dev/fd0 bs=1k count=1440 | od -tx1 -Ax > /tmp/dump_hex

to get in the /tmp directory a file containing the hexadecimal dump of the floppy disk contents. [1]

[1] Some information on an Ext2 filesystem could also be obtained by using the /sbin/dumpe2fs and /sbin/debugfs utility programs.

By looking at that file, we can see that, due to the limited capacity of the disk, a single group descriptor is sufficient. We also notice that the number of reserved blocks is set to 72 (5% of 1440) and that, according to the default option, the inode table must include 1 inode for each 4096 bytes, that is, 360 inodes stored in 45 blocks.

Table 17-7 summarizes how the Ext2 filesystem is created on a floppy disk when the default options are selected.

Table 17-7. Ext2 Block Allocation for a Floppy Disk

Block      Content
0          Boot block
1          Superblock
2          Block containing a single block group descriptor
3          Data block bitmap
4          Inode bitmap
5-49       Inode table: inodes up to 10: reserved; inode 11: lost+found; inodes 12-360: free
50         Root directory (includes ., .., and lost+found)
51         lost+found directory (includes . and ..)
52-62      Reserved blocks preallocated for lost+found directory
63-1439    Free blocks

17.5 Ext2 Methods

Many of the VFS methods described in Chapter 12 have a corresponding Ext2 implementation. Since it would take a whole book to describe all of them, we'll limit ourselves to briefly reviewing the methods implemented in Ext2. Once the disk and the memory data structures are clearly understood, the reader should be able to follow the code of the Ext2 functions that implement them.


17.5.1 Ext2 Superblock Operations

All VFS superblock operations have a specific implementation in Ext2, with the exception of the clear_inode and umount_begin VFS methods. The addresses of the superblock methods are stored into the ext2_sops array of pointers.

17.5.2 Ext2 Inode Operations

Many of the VFS inode operations have a specific implementation in Ext2, which depends on the type of the file to which the inode refers. Table 17-8 illustrates the inode operations implemented for inodes that refer to regular and directory files; their addresses are stored in the ext2_file_inode_operations and in the ext2_dir_inode_operations tables, respectively. Recall that the VFS uses its own generic functions when the corresponding Ext2 method is undefined (NULL pointer).

Table 17-8. Ext2 Inode Operations for Regular and Directory Files

VFS Inode Operation    Ext2 File Inode Method    Ext2 Directory Inode Method
lookup                 NULL                      ext2_lookup( )
link                   NULL                      ext2_link( )
unlink                 NULL                      ext2_unlink( )
symlink                NULL                      ext2_symlink( )
mkdir                  NULL                      ext2_mkdir( )
rmdir                  NULL                      ext2_rmdir( )
create                 NULL                      ext2_create( )
mknod                  NULL                      ext2_mknod( )
rename                 NULL                      ext2_rename( )
readlink               NULL                      NULL
follow_link            NULL                      NULL
readpage               generic_readpage( )       NULL
writepage              NULL                      NULL
bmap                   ext2_bmap( )              NULL
truncate               ext2_truncate( )          NULL
permission             ext2_permission( )        ext2_permission( )
smap                   NULL                      NULL
updatepage             NULL                      NULL
revalidate             NULL                      NULL

If the inode refers to a symbolic link, all inode methods are NULL except for readlink and follow_link, which are implemented by ext2_readlink( ) and ext2_follow_link( ), respectively. The addresses of those methods are stored in the ext2_symlink_inode_operations table.

If the inode refers to a character device file, to a block device file, or to a named pipe (see Section 18.2 in Chapter 18), the inode operations do not depend on the filesystem. They are specified in the chrdev_inode_operations, blkdev_inode_operations, and fifo_inode_operations tables, respectively.
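The choice of table by file type can be summarized with a small user-space sketch. The function name inode_operations_for is hypothetical, and the sketch returns only the table names quoted in the text rather than real method tables; the file type tests come from Python's standard stat module.

```python
import stat


def inode_operations_for(mode):
    """Return the name of the inode operations table for a file mode.

    `mode` plays the role of the inode's mode field; the returned
    strings are the table names used in the text above.
    """
    if stat.S_ISREG(mode):
        return "ext2_file_inode_operations"
    if stat.S_ISDIR(mode):
        return "ext2_dir_inode_operations"
    if stat.S_ISLNK(mode):
        return "ext2_symlink_inode_operations"
    # The remaining tables do not depend on the filesystem.
    if stat.S_ISCHR(mode):
        return "chrdev_inode_operations"
    if stat.S_ISBLK(mode):
        return "blkdev_inode_operations"
    if stat.S_ISFIFO(mode):
        return "fifo_inode_operations"
    return None  # other file types are outside the scope of this sketch
```

For example, inode_operations_for(stat.S_IFDIR | 0o755) selects the directory table, while a named pipe (stat.S_IFIFO) selects a table that is shared by all filesystems.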


17.5.3 Ext2 File Operations

The file operations specific to the Ext2 filesystem are listed in Table 17-9. As you can see, the read and mmap VFS methods are implemented by generic functions that are common to many filesystems. The addresses of these methods are stored in the ext2_file_operations table.

Table 17-9. Ext2 File Operations

VFS File Operation     Ext2 Method
lseek                  ext2_file_lseek( )
read                   generic_file_read( )
write                  ext2_file_write( )
readdir                NULL
poll                   NULL
ioctl                  ext2_ioctl( )
mmap                   generic_file_mmap( )
open                   ext2_open_file( )
flush                  NULL
release                ext2_release_file( )
fsync                  ext2_sync_file( )
fasync                 NULL
check_media_change     NULL
revalidate             NULL
lock                   NULL

17.6 Managing Disk Space

The storage of a file on disk differs from the view the programmer has of the file in two ways: blocks can be scattered around the disk (although the filesystem tries hard to keep blocks sequential to improve access time), and files may appear to a programmer to be bigger than they really are because a program can introduce holes into them (through the lseek( ) system call).

In this section we explain how the Ext2 filesystem manages the disk space, that is, how it allocates and deallocates inodes and data blocks. Two main problems must be addressed:

• Space management must make every effort to avoid file fragmentation, that is, the physical storage of a file in several, small pieces located in nonadjacent disk blocks. File fragmentation increases the average time of sequential read operations on the files, since the disk heads must be frequently repositioned during the read operation. [2] This problem is similar to the external fragmentation of RAM discussed in Section 6.1.2 in Chapter 6.

[2] Please note that fragmenting a file across block groups (A Bad Thing) is quite different from the not-yet-implemented fragmentation of blocks in order to store many files in one block (A Good Thing).

• Space management must be time-efficient; that is, the kernel should be able to quickly derive from a file offset the corresponding logical block number in the Ext2 partition. In doing so, the kernel should limit as much as possible the number of accesses to addressing tables stored on disk, since each such intermediate access considerably increases the average file access time.


17.6.1 Creating Inodes

The ext2_new_inode( ) function creates an Ext2 disk inode, returning the address of the corresponding inode object (or NULL in case of failure). It acts on two parameters: the address dir of the inode object that refers to the directory into which the new inode must be inserted and a mode that indicates the type of inode being created. The latter argument also includes an MS_SYNCHRONOUS flag that requires the current process to be suspended until the inode is allocated. The function performs the following actions:

1. Invokes get_empty_inode( ) to allocate a new inode object and initializes its i_sb field to the superblock address stored in dir->i_sb.
2. Invokes lock_super( ) to get exclusive access to the superblock object. The function tests and sets the value of the s_lock field and, if necessary, suspends the current process until the flag becomes 0.
3. If the new inode is a directory, tries to place it so that directories are evenly scattered through partially filled block groups. In particular, allocates the new directory in the block group that has the maximum number of free blocks among all block groups having a number of free inodes greater than the average. (The average is the total number of free inodes divided by the number of block groups.)
4. If the new inode is not a directory, allocates it in a block group having a free inode. Selects the group by starting from the one containing the parent directory and moving farther and farther away from it. To be precise:
   a. Performs a quick logarithmic search starting from the block group that includes the parent directory dir. The algorithm searches log(n) block groups, where n is the total number of block groups. The algorithm jumps further and further ahead until it finds an available block group, as follows: if we call the number of the starting block group i, the algorithm considers block groups i mod n, (i+1) mod n, (i+1+2) mod n, (i+1+2+4) mod n, . . .
   b. If the logarithmic search fails to find a block group with a free inode, performs an exhaustive linear search starting from the first block group.
5. Invokes load_inode_bitmap( ) to get the inode bitmap of the selected block group and searches for the first null bit in it, thus obtaining the number of the first free disk inode.
6. Allocates the disk inode: sets the corresponding bit in the inode bitmap and marks the buffer containing the bitmap as dirty. Moreover, if the filesystem has been mounted specifying the MS_SYNCHRONOUS flag, invokes ll_rw_block( ) and waits until the write operation terminates (see Section 12.3.3 in Chapter 12).
7. Decrements the bg_free_inodes_count field of the block group descriptor. If the new inode is a directory, increments bg_used_dirs_count. Marks the buffer containing the group descriptor as dirty.
8. Decrements the s_free_inodes_count field of the disk superblock and marks the buffer containing it as dirty. Sets the s_dirt field of the VFS's superblock object to 1.
9. Initializes the fields of the inode object. In particular, sets the inode number i_ino and copies the value of xtime.tv_sec into i_atime, i_mtime, and i_ctime. Also loads the i_block_group field in the ext2_inode_info structure with the block group index. Refer to Table 17-3 for the meaning of these fields.
10. Inserts the new inode object into inode_hashtable.
11. Invokes mark_inode_dirty( ) to move the inode object into the superblock's dirty inode list (see Section 12.2.2 in Chapter 12).
12. Invokes unlock_super( ) to release the superblock object.


13. Returns the address of the new inode object.

17.6.2 Deleting Inodes

The ext2_free_inode( ) function deletes a disk inode, which is identified by an inode object whose address is passed as the parameter. The kernel should invoke the function after a series of cleanup operations involving internal data structures and the data in the file itself: it should come after the inode object has been removed from the inode hash table, after the last hard link referring to that inode has been deleted from the proper directory, and after the file is truncated to length 0 in order to reclaim all its data blocks (see Section 17.6.6 later in this chapter). It performs the following actions:

1. Invokes lock_super( ) to get exclusive access to the superblock object.
2. Computes from the inode number and the number of inodes in each block group the index of the block group containing the disk inode.
3. Invokes load_inode_bitmap( ) to get the inode bitmap.
4. Invokes clear_inode( ) to perform the following operations:
   a. Release all pages in the page cache associated with the inode, suspending the current process if some of them are locked. (The pages could be locked because the kernel could be in the process of reading or writing them, and there is no way to stop the block device driver.)
   b. Invoke the clear_inode method of the superblock object, if defined; the Ext2 filesystem, however, does not define it.
5. Increments the bg_free_inodes_count field of the group descriptor. If the deleted inode is a directory, decrements the bg_used_dirs_count field. Marks the buffer containing the group descriptor as dirty.
6. Increments the s_free_inodes_count field of the disk superblock and marks the buffer that contains it as dirty. Also sets the s_dirt field of the superblock object to 1.
7. Clears the bit corresponding to the disk inode in the inode bitmap and marks the buffer containing the bitmap as dirty. Moreover, if the filesystem has been mounted with the MS_SYNCHRONOUS flag, invokes ll_rw_block( ) and waits until the write operation on the bitmap's buffer terminates.
8. Invokes unlock_super( ) to unlock the superblock object.

17.6.3 Data Blocks Addressing

Each nonempty regular file consists of a group of data blocks. Such blocks may be referred to either by their relative position inside the file (their file block number) or by their position inside the disk partition (their logical block number, explained in Section 13.5.6 in Chapter 13).

Deriving the logical block number of the corresponding data block from an offset f inside a file is a two-step process:

• Derive from the offset f the file block number, that is, the index of the block containing the character at offset f.
• Translate the file block number to the corresponding logical block number.
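The first of these two steps is plain integer arithmetic, as the following user-space sketch shows. The helper name split_offset is hypothetical; besides the file block number it also returns the position of the byte inside that block, which is needed whenever a block is read or written only in part.

```python
def split_offset(f, block_size):
    """Split byte offset f into (file block number, offset within the block)."""
    if f < 0 or block_size <= 0:
        raise ValueError("offset must be >= 0 and block size must be > 0")
    # Quotient: index of the block inside the file.
    # Remainder: position of the byte inside that block.
    return divmod(f, block_size)


# With 4 KB blocks, offsets 0..4095 fall in file block 0,
# offsets 4096..8191 in file block 1, and so on.
print(split_offset(4095, 4096))
print(split_offset(4096, 4096))
```

The second step cannot be expressed as a formula, because it depends on tables stored in the inode and on disk.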


Since Unix files do not include any control character, it is quite easy to derive the file block number containing the f-th character of a file: simply take the quotient of f and the filesystem's block size and round down to the nearest integer.

For instance, let's assume a block size of 4 KB. If f is smaller than 4096, the character is contained in the first data block of the file, which has file block number 0. If f is equal to or greater than 4096 and less than 8192, the character is contained in the data block having file block number 1 and so on.

This is fine as far as file block numbers are concerned. However, translating a file block number into the corresponding logical block number is not nearly as straightforward, since the data blocks of an Ext2 file are not necessarily adjacent on disk.

The Ext2 filesystem must thus provide a method to store on disk the connection between each file block number and the corresponding logical block number. This mapping, which goes back to early versions of Unix from AT&T, is implemented partly inside the inode. It also involves some specialized data blocks, which may be considered an inode extension used to handle large files.

The i_block field in the disk inode is an array of EXT2_N_BLOCKS components containing logical block numbers. In the following discussion, we assume that EXT2_N_BLOCKS has the default value, namely 15. The array represents the initial part of a larger data structure, which is illustrated in Figure 17-5. As can be noticed from the figure, the 15 components of the array are of four different types:

• The first 12 components yield the logical block numbers corresponding to the first 12 blocks of the file, that is, to the blocks having file block numbers from 0 to 11.
• The component at index 12 contains the logical block number of a block that represents a second-order array of logical block numbers. They correspond to the file block numbers ranging from 12 to $b/4+11$, where b is the filesystem's block size (each logical block number is stored in 4 bytes, so we divide by 4 in the formula). Therefore, the kernel must look in this component for a pointer to a block, then look in that block for another pointer to the ultimate block that contains the file contents.
• The component at index 13 contains the logical block number of a block containing a second-order array of logical block numbers; in turn, the entries of this second-order array point to third-order arrays, which store the logical block numbers corresponding to the file block numbers ranging from $b/4+12$ to $(b/4)^2+(b/4)+11$.
• Finally, the component at index 14 makes use of triple indirection: the fourth-order arrays store the logical block numbers corresponding to the file block numbers ranging from $(b/4)^2+(b/4)+12$ to $(b/4)^3+(b/4)^2+(b/4)+11$.
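The four ranges listed above translate directly into a lookup routine. The sketch below is a user-space model rather than the kernel's code: the function name block_path and the two constants are illustrative assumptions. It returns the chain of array indexes the kernel would have to follow, starting with the index into i_block.

```python
N_DIRECT = 12    # direct components of i_block (indexes 0 to 11)
ADDR_SIZE = 4    # bytes taken by one logical block number


def block_path(n, block_size):
    """Return the list of array indexes leading to file block number n.

    The first element indexes i_block; each following element indexes
    the array stored in the block reached by the previous step.
    """
    p = block_size // ADDR_SIZE        # entries per indirect block, b/4
    if n < 0:
        raise ValueError("negative file block number")
    if n < N_DIRECT:
        return [n]                                     # direct mapping
    n -= N_DIRECT
    if n < p:
        return [12, n]                                 # simple indirection
    n -= p
    if n < p * p:
        return [13, n // p, n % p]                     # double indirection
    n -= p * p
    if n < p ** 3:
        return [14, n // (p * p), (n // p) % p, n % p]  # triple indirection
    raise ValueError("file block number beyond triple indirection")
```

With 1024-byte blocks (b/4 = 256), block_path(267, 1024) yields [12, 255], the last block reachable through simple indirection, while block_path(268, 1024) yields [13, 0, 0], the first one that needs double indirection. The length of the returned list is the number of block numbers the kernel must read before reaching the data.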


Figure 17-5. Data structures used to address the file's data blocks

In Figure 17-5, the number inside a block represents the corresponding file block number. The arrows, which represent logical block numbers stored in array components, show how the kernel finds its way to reach the block that contains the actual contents of the file.

Notice how this mechanism favors small files. If the file does not require more than 12 data blocks, any data can be retrieved in two disk accesses: one to read a component in the i_block array of the disk inode and the other one to read the requested data block. For larger files, however, three or even four consecutive disk accesses may be needed in order to access the required block. In practice, this is a worst-case estimate, since dentry, buffer, and page caches contribute significantly to reduce the number of real disk accesses.

Notice also how the block size of the filesystem affects the addressing mechanism, since a larger block size allows the Ext2 to store more logical block numbers inside a single block. Table 17-10 shows the upper limit placed on a file's size for each block size and each addressing mode. For instance, if the block size is 1024 bytes and the file contains up to 268 kilobytes of data, the first 12 KB of a file can be accessed through direct mapping, and the remaining 13 through 268 KB can be addressed through simple indirection. With 4096-byte blocks, double indirection is sufficient to address a file of 2 GB (the maximum allowed by the Ext2 filesystem on 32-bit architecture).

Table 17-10. File Size Upper Limits for Data Block Addressing

Block Size    Direct    1-Indirect    2-Indirect    3-Indirect
1024          12 KB     268 KB        64.26 MB      2 GB
2048          24 KB     1.02 MB       513.02 MB     2 GB
4096          48 KB     4.04 MB       2 GB          -

17.6.4 File Holes

A file hole is a portion of a regular file that contains null characters and is not stored in any data block on disk. Holes are a long-standing feature of Unix files. For instance, the following Unix command creates a file in which the first bytes are a hole:


$ echo -n "X" | dd of=/tmp/hole bs=1024 seek=6

Now, /tmp/hole has 6145 characters (6144 null characters plus an X character), yet the file occupies just one data block on disk.

File holes were introduced to avoid wasting disk space. They are used extensively by database applications and, more generally, by all applications that perform hashing on files.

The Ext2 implementation of file holes is based on dynamic data block allocation: a block is actually assigned to a file only when the process needs to write data into it. The i_size field of each inode defines the size of the file as seen by the program, including the hole, while the i_blocks field stores the number of data blocks effectively assigned to the file (in units of 512 bytes).

In our earlier example of the dd command, suppose the /tmp/hole file was created on an Ext2 partition having blocks of size 4096. The i_size field of the corresponding disk inode stores the number 6145, while the i_blocks field stores the number 8 (because each 4096-byte block includes eight 512-byte blocks). The second element of the i_block array (corresponding to the block having file block number 1) stores the logical block number of the allocated block, while all other elements in the array are null (see Figure 17-6).

Figure 17-6. A file with an initial hole

17.6.5 Allocating a Data Block

When the kernel has to allocate a new block to hold data for an Ext2 regular file, it invokes the ext2_getblk( ) function. In turn, this function handles the data structures already described in Section 17.6.3 and invokes when necessary the ext2_alloc_block( ) function to actually search for a free block in the Ext2 partition.

In order to reduce file fragmentation, the Ext2 filesystem tries to get a new block for a file near the last block already allocated for the file. Failing that, the filesystem searches for a new block in the block group that includes the file's inode. As a last resort, the free block is taken from one of the other block groups.

The Ext2 filesystem uses preallocation of data blocks. The file does not get just the requested block, but rather a group of up to eight adjacent blocks. The i_prealloc_count field in the ext2_inode_info structure stores the number of data blocks preallocated to some file that are still unused, and the i_prealloc_block field stores the logical block number of the next preallocated block to be used. Any preallocated blocks that remain unused are freed when the file is closed, when it is truncated, or when a write operation is not sequential with respect to the write operation that triggered the block preallocation.

The ext2_alloc_block( ) function receives as parameters a pointer to an inode object and a goal. The goal is a logical block number that represents the preferred position of the new block. The ext2_getblk( ) function sets the goal parameter according to the following heuristic:

1. If the block being allocated and the previously allocated one have consecutive file block numbers, the goal is the logical block number of the previous block plus 1; it makes sense that consecutive blocks as seen by a program should be adjacent on disk.
2. If the first rule does not apply and at least one block has been previously allocated to the file, the goal is the logical block number of one of these blocks. More precisely, it is the logical block number of the already allocated block that precedes in the file the block to be allocated.
3. If the preceding rules do not apply, the goal is the logical block number of the first block (not necessarily free) in the block group that contains the file's inode.

The ext2_alloc_block( ) function checks whether the goal refers to one of the preallocated blocks of the file. If so, it allocates the corresponding block and returns its logical block number; otherwise, the function discards all remaining preallocated blocks and invokes ext2_new_block( ).

This latter function searches for a free block inside the Ext2 partition with the following strategy:

1. If the preferred block passed to ext2_alloc_block( ) (the goal) is free, allocates it.
2. If the goal is busy, checks whether one of the next 64 blocks after the preferred block is free.
3. If no free block has been found in the near vicinity of the preferred block, considers all block groups, starting from the one including the goal. For each block group:
   a. Looks for a group of at least eight adjacent free blocks.
   b. If no such group is found, looks for a single free block.

The search ends as soon as a free block is found. Before terminating, the ext2_new_block( ) function also tries to preallocate up to eight free blocks adjacent to the free block found and sets the i_prealloc_block and i_prealloc_count fields of the disk inode to the proper block location and number of blocks.

17.6.6 Releasing a Data Block

When a process deletes a file or truncates it to length 0, all its data blocks must be reclaimed. This is done by ext2_truncate( ), which receives the address of the file's inode object as its parameter. The function essentially scans the disk inode's i_block array to locate all data blocks, partitioning them into physically adjacent groups. Each such group is then released by invoking ext2_free_blocks( ).
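The partitioning into physically adjacent groups can be pictured with a short user-space sketch. It is a model of the idea only, not the kernel's code: the function name adjacent_runs is hypothetical, and the choice to skip null entries (holes) without ending the current group is an assumption of this sketch.

```python
def adjacent_runs(blocks):
    """Group logical block numbers into (first_block, count) runs.

    `blocks` lists the logical block numbers in file order; a value
    of 0 marks a hole and is skipped.
    """
    runs = []
    start = count = 0
    for b in blocks:
        if b == 0:
            continue                     # hole: nothing to release
        if count and b == start + count:
            count += 1                   # extends the current run
        else:
            if count:
                runs.append((start, count))
            start, count = b, 1          # begins a new run
    if count:
        runs.append((start, count))
    return runs


# Each pair would become one call releasing `count` adjacent blocks.
print(adjacent_runs([100, 101, 102, 0, 103, 200, 201, 50]))
```

Releasing whole runs rather than single blocks keeps the number of calls small when a file is laid out contiguously, which is the common case the allocator works to achieve.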


The ext2_free_blocks( ) function releases a group of one or more adjacent data blocks. Besides its use by ext2_truncate( ), the function is invoked mainly when discarding the preallocated blocks of a file (see the earlier Section 17.6.5). Its parameters are:

inode
    Address of the inode object that describes the file
block
    Logical block number of the first block to be released
count
    Number of adjacent blocks to be released

The function invokes lock_super( ) to get exclusive access to the filesystem's superblock, then performs the following actions for each block to be released:

1. Gets the block bitmap of the block group including the block to be released
2. Clears the bit in the block bitmap corresponding to the block to be released and marks the buffer containing the bitmap as dirty
3. Increments the bg_free_blocks_count field in the block group descriptor and marks the corresponding buffer as dirty
4. Increments the s_free_blocks_count field of the disk superblock, marks the corresponding buffer as dirty, and sets the s_dirt flag of the superblock object
5. If the filesystem has been mounted with the MS_SYNCHRONOUS flag set, invokes ll_rw_block( ) and waits until the write operation on the bitmap's buffer terminates

Finally, the function invokes unlock_super( ) to release the superblock.

17.7 Reading and Writing an Ext2 Regular File

In Chapter 12 we described how the Virtual File System recognizes the type of file being accessed by a read( ) or write( ) system call and invokes the corresponding method of the proper file operation table. We now have all the needed tools to understand how a regular file is actually read or written in the Ext2 filesystem.

There's nothing more to say about read operations, however, because they have already been completely discussed. As shown in Table 17-9, the Ext2's read method is implemented by the generic_file_read( ) function, which is described in Section 15.1.1 in Chapter 15.

Let's concentrate then on Ext2's write method, which is implemented by the ext2_file_write( ) function. It acts on four parameters:

fd
    File descriptor of the file being written


buf
    Address of a memory area containing the data to be written
count
    Number of bytes to be written
ppos
    Pointer to a variable storing the file offset where data must be written

The function performs the following actions:

1. Removes any superuser privilege from the file (to guard against tampering with setuid programs, described in Chapter 19).
2. If the file has been opened with the O_APPEND flag set, sets the file offset where data must be written to the end of the file.
3. If the file has been opened in synchronous mode (O_SYNC flag set), sets the i_osync field in the ext2_inode_info structure of the disk inode to 1. This flag is tested when a data block is allocated for the file, so that the kernel can synchronously update the inode on disk as soon as it is modified.
4. As done before, computes from the file offset and the filesystem block size the file block number of the first byte to be written and the relative offset within the block (see the earlier Section 17.6.3).
5. For each block to be written, performs the following substeps:
   a. Invokes ext2_getblk( ) to get the data block on the disk, allocating it when necessary.
   b. If the block has to be partially rewritten and the buffer is not up-to-date, invokes ll_rw_block( ) and waits until the read operation terminates.
   c. Copies the bytes to be written into the block from the process address space to the buffer and marks the buffer as dirty.
   d. Invokes update_vm_cache( ) to synchronize the contents of the page cache with that of the buffer cache.
   e. If the file has been opened in synchronous mode, inserts the buffer in a local array. If the array becomes filled (it includes 32 elements), invokes ll_rw_block( ) to start the write operations and waits until they terminate.
6. If the file has been opened in synchronous mode, clears the i_osync flag of the disk inode; also invokes ll_rw_block( ) to start the write operation for any buffer still remaining in the local array and waits until I/O data transfers terminate.
7. Updates the i_size field of the inode object.
8. Sets the i_ctime and i_mtime fields of the inode object to xtime.tv_sec and marks the inode as dirty.
9. Updates the variable *ppos storing the file offset where the data has been written (it is usually the file pointer).
10. Returns the number of bytes written into the file.
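The arithmetic behind steps 4 and 5 can be sketched in plain C. The following is only an illustration under our own assumptions, not the code of ext2_file_write( ): the write_chunk structure and the split_write( ) function are hypothetical names. The sketch derives the file block number and the offset within the block from the file offset, then splits the request into one chunk per block, flagging the chunks that rewrite a block only partially (the case in which step 5b may have to read the block first).

```c
#include <stddef.h>

/* One piece of a write request, confined to a single file block
 * (hypothetical structure used only by this sketch). */
struct write_chunk {
    unsigned long file_block;   /* index of the block inside the file */
    size_t offset;              /* first byte touched inside the block */
    size_t len;                 /* number of bytes written in the block */
    int partial;                /* nonzero if the block is not fully overwritten */
};

/* Splits a write of `count` bytes starting at file offset `pos` into
 * per-block chunks. Stores at most `max` chunks in `out` and returns
 * how many were stored. */
size_t split_write(unsigned long long pos, size_t count, size_t block_size,
                   struct write_chunk *out, size_t max)
{
    size_t n = 0;

    if (block_size == 0)
        return 0;
    while (count > 0 && n < max) {
        size_t offset = (size_t)(pos % block_size);
        size_t len = block_size - offset;   /* room left in this block */

        if (len > count)
            len = count;
        out[n].file_block = (unsigned long)(pos / block_size);
        out[n].offset = offset;
        out[n].len = len;
        out[n].partial = (len < block_size);
        n++;
        pos += len;
        count -= len;
    }
    return n;
}
```

For instance, with a 1024-byte block size, writing 2000 bytes at offset 1500 touches file blocks 1, 2, and 3: the first and the last are rewritten partially, while block 2 is overwritten in full and thus never needs to be read from disk beforehand.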


17.8 Anticipating Linux 2.4

Using the generic_file_write( ) function to write in a regular Ext2 file has several beneficial effects. One of them is that the 2 GB limit on the file size is gone and very large files can be accessed even on 32-bit architectures. However, none of the features mentioned at the end of Section 17.1 has been included in Linux 2.4. They will likely come out with the new Ext3 filesystem currently being tested.


Chapter 18. Process Communication

This chapter explains how User Mode processes can synchronize themselves and exchange data. We have already covered a lot of synchronization topics in Chapter 11, but the actors there were kernel control paths, not User Mode programs. We are now ready, after having discussed I/O management and filesystems at length, to extend the discussion of synchronization to User Mode processes. These processes must rely on the kernel to synchronize themselves and to exchange data.

As we saw in Section 12.6.1 in Chapter 12, a crude form of synchronization among User Mode processes can be achieved by creating a (possibly empty) file and by making use of suitable VFS system calls to lock and unlock it. Similarly, data sharing among processes can be obtained by storing data in temporary files protected by locks. This approach is costly since it requires accesses to the disk filesystem. For that reason, all Unix kernels include a set of system calls that support process communication without interacting with the filesystem; furthermore, several wrapper functions have been developed and inserted in suitable libraries to expedite how processes issue their synchronization requests to the kernel.

As usual, application programmers have a variety of needs that call for different communication mechanisms. Here are the basic mechanisms that Unix systems, and Linux in particular, offer to allow interprocess communication:

Pipes and FIFOs (named pipes)
    Best suited to implement producer/consumer interactions among processes. Some processes fill the pipe with data while others extract data from the pipe.
Semaphores
    Represent, as the name implies, the User Mode version of the kernel semaphores discussed in Section 11.2.4 in Chapter 11.
Messages
    Allow processes to exchange messages (short blocks of data) in an asynchronous way. They can be thought of as signals carrying additional information.
Shared memory regions
    Best suited to implement interaction schemes in which processes must share large amounts of data in an efficient way.

This book does not cover another common communication mechanism, sockets. As stated in previous chapters, sockets were introduced initially to allow data communication between application programs and the network interface (see Section 13.2.1 in Chapter 13). They can also be used as a communication tool for processes located on the same host computer; the X Window System graphic interface, for instance, uses a socket to allow client programs to exchange data with the X server. We don't include them because they would require a long discussion of networking, which is beyond the scope of the book.


18.1 Pipes

Pipes are an interprocess communication mechanism that is provided in all flavors of Unix. A pipe is a one-way flow of data between processes: all data written by a process to the pipe is routed by the kernel to another process, which can thus read it.

In Unix command shells, pipes can be created by means of the | operator. For instance, the following statement instructs the shell to create two processes connected by a pipe:

    $ ls | more

The standard output of the first process, which executes the ls program, is redirected to the pipe; the second process, which executes the more program, reads its input from the pipe.

Note that the same results can also be obtained by issuing two commands such as the following:

    $ ls > temp
    $ more < temp

The first command redirects the output of ls into a regular file; then the second command forces more to read its input from the same file. Of course, using pipes instead of temporary files is usually more convenient since:

• The shell statement is much shorter and simpler.
• There is no need to create temporary regular files, which must be deleted later.

18.1.1 Using a Pipe

Pipes may be considered open files that have no corresponding image in the mounted filesystems. A new pipe can be created by means of the pipe( ) system call, which returns a pair of file descriptors. The process can read from the pipe by using the read( ) system call with the first file descriptor; likewise, it can write into the pipe by using the write( ) system call with the second file descriptor.

POSIX defines only half-duplex pipes, so even though the pipe( ) system call returns two file descriptors, each process must close one before using the other. If a two-way flow of data is required, the processes must use two different pipes by invoking pipe( ) twice.

Several Unix systems, such as System V Release 4, implement full-duplex pipes and allow both descriptors to be written into and read from. Linux adopts another approach: each pipe's file descriptors are still one-way, but it is not necessary to close one of them before using the other.

Let us resume the previous example: when the command shell interprets the ls | more statement, it essentially performs the following actions:

1. Invokes the pipe( ) system call; let us assume that pipe( ) returns the file descriptors 3 (the pipe's read channel) and 4 (the write channel).
2. Invokes the fork( ) system call twice.


3. Invokes the close( ) system call twice to release file descriptors 3 and 4.

The first child process, which must execute the ls program, performs the following operations:

1. Invokes dup2(4,1) to copy file descriptor 4 to file descriptor 1. From now on, file descriptor 1 refers to the pipe's write channel.
2. Invokes the close( ) system call twice to release file descriptors 3 and 4.
3. Invokes the execve( ) system call to execute the /bin/ls program (see Section 19.4 in Chapter 19). By default, such a program writes its output to the file having file descriptor 1 (the standard output), that is, it writes into the pipe.

The second child process must execute the more program; therefore, it performs the following operations:

1. Invokes dup2(3,0) to copy file descriptor 3 to file descriptor 0. From now on, file descriptor 0 refers to the pipe's read channel.
2. Invokes the close( ) system call twice to release file descriptors 3 and 4.
3. Invokes the execve( ) system call to execute /bin/more. By default, that program reads its input from the file having file descriptor 0 (the standard input); that is, it reads from the pipe.

In this simple example, the pipe is used by exactly two processes. Because of its implementation, though, a pipe can be used by an arbitrary number of processes. [1] Clearly, if two or more processes read or write the same pipe, they must explicitly synchronize their accesses by using file locking (see Section 12.6.1 in Chapter 12) or IPC semaphores (see Section 18.3.3 later in this chapter).

[1] Since most shells offer pipes that connect only two processes, applications requiring pipes used by more than two processes must be coded in a programming language such as C.

Many Unix systems provide, besides the pipe( ) system call, two wrapper functions named popen( ) and pclose( ) that handle all the dirty work usually done when using pipes. Once a pipe has been created by means of the popen( ) function, it can be used with the high-level I/O functions included in the C library (fprintf( ), fscanf( ), and so on).

In Linux, popen( ) and pclose( ) are included in the C library. The popen( ) function receives two parameters: filename, the pathname of an executable file, and type, a string specifying the direction of the data transfer. It returns the pointer to a FILE data structure. The popen( ) function essentially performs the following operations:

1. Creates a new pipe by making use of the pipe( ) system call
2. Forks a new process, which in turn executes the following operations:
   a. If type is r, duplicates the file descriptor associated with the pipe's write channel as file descriptor 1 (standard output); otherwise, if type is w, duplicates the file descriptor associated with the pipe's read channel as file descriptor 0 (standard input)
   b. Closes the file descriptors returned by pipe( )
   c. Invokes the execve( ) system call to execute the program specified by filename


3. If type is r, closes the file descriptor associated with the pipe's write channel; otherwise, if type is w, closes the file descriptor associated with the pipe's read channel
4. Returns the address of the FILE file pointer that refers to whichever file descriptor for the pipe is still open

After the popen( ) invocation, parent and child can exchange information through the pipe: the parent can read (if type is r) or write (if type is w) some data by using the FILE pointer returned by the function. The data is written to the standard output or read from the standard input, respectively, by the program executed by the child process.

The pclose( ) function, which receives the file pointer returned by popen( ) as its parameter, simply invokes the wait4( ) system call and waits for the termination of the process created by popen( ).

18.1.2 Pipe Data Structures

We now have to start thinking again at the system call level. Once a pipe has been created, a process uses the read( ) and write( ) VFS system calls to access it. Therefore, for each pipe, the kernel creates an inode object plus two file objects, one for reading and the other for writing. When a process wants to read from or write to the pipe, it must use the proper file descriptor.

When the inode object refers to a pipe, its u field consists of a pipe_inode_info structure shown in Table 18-1.

Table 18-1. The pipe_inode_info Structure

Type | Field | Description
char * | base | Address of kernel buffer
unsigned int | start | Read position in kernel buffer
unsigned int | lock | Locking flag for exclusive access
struct wait_queue * | wait | Pipe/FIFO wait queue
unsigned int | readers | Flag for (or number of) reading processes
unsigned int | writers | Flag for (or number of) writing processes
unsigned int | rd_openers | Used while opening a FIFO for reading
unsigned int | wr_openers | Used while opening a FIFO for writing

Besides one inode and two file objects, each pipe has its own pipe buffer, that is, a single page frame containing the data written into the pipe and yet to be read. The address of this page frame is stored in the base field of the pipe_inode_info structure. The i_size field of the inode object stores the number of bytes written into the pipe buffer that are yet to be read; in the following, we call that number the current pipe size.

The pipe buffer is accessed both by reading processes and by writing ones, so the kernel must keep track of two current positions in the buffer:

• The offset of the next byte to be read, which is stored in the start field of the pipe_inode_info structure


• The offset of the next byte to be written, which is derived from start and the pipe size

To avoid race conditions on the pipe's data structures, the kernel forbids concurrent accesses to the pipe buffer. In order to achieve this, it makes use of the lock field in the pipe_inode_info data structure. Unfortunately, the lock field is not sufficient. As we shall see, POSIX dictates that some pipe operations are atomic. Moreover, the POSIX standard allows the writing process to be suspended when the pipe is full, so that readers can empty the buffer (see Section 18.1.5 later in this chapter). These requirements are satisfied by using an additional i_atomic_write semaphore that can be found in the inode object: this semaphore keeps a process from starting a write operation while another writer has been suspended because the buffer is full.

18.1.3 Creating and Destroying a Pipe

A pipe is implemented as a set of VFS objects, which have no corresponding disk image. As we shall see from the following discussion, a pipe remains in the system as long as some process owns a file descriptor referring to it.

The pipe( ) system call is serviced by the sys_pipe( ) function, which in turn invokes the do_pipe( ) function. In order to create a new pipe, do_pipe( ) performs the following operations:

1. Allocates a file object and a file descriptor for the read channel of the pipe, sets the flag field of the file object to O_RDONLY, and initializes the f_op field with the address of the read_pipe_fops table.
2. Allocates a file object and a file descriptor for the write channel of the pipe, sets the flag field of the file object to O_WRONLY, and initializes the f_op field with the address of the write_pipe_fops table.
3. Invokes the get_pipe_inode( ) function, which allocates and initializes an inode object for the pipe. This function also allocates a page frame for the pipe buffer and stores its address in the base field of the pipe_inode_info structure.
4. Allocates a dentry object and uses it to link together the two file objects and the inode object (see Section 12.1.1 in Chapter 12).
5. Returns the two file descriptors to the User Mode process.

The process that issues a pipe( ) system call is initially the only process that can access the new pipe, both for reading and for writing. To represent that the pipe has actually both a reader and a writer, the readers and writers fields of the pipe_inode_info data structure are initialized to 1. In general, each of these two fields is set to 1 if and only if the corresponding pipe's file object is still opened by some process; the field is set to 0 if the corresponding file object has been released, since it is no longer accessed by any process.

Forking a new process does not increase the value of the readers and writers fields, so they never rise above 1; [2] however, it does increase the value of the usage counters of all file objects still used by the parent process (see Section 3.3.1 in Chapter 3). Thus, the objects will not be released even when the parent dies, and the pipe will stay open for use by the children.

[2] As we'll see, the readers and writers fields act as counters instead of flags when associated with FIFOs.
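The fact that a pipe survives a fork( ) and remains usable by the child can be observed from User Mode. The following sketch relies only on standard POSIX calls; the pipe_roundtrip( ) helper is a hypothetical name of our own, and error handling is kept to a minimum. The parent creates the pipe and forks, the child writes a message through the inherited write descriptor, and the parent reads until read( ) returns 0, which happens once every descriptor referring to the write channel has been closed.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Creates a pipe, forks a child that writes `msg` into it, and reads the
 * data back in the parent. Returns the number of bytes read, or -1. */
ssize_t pipe_roundtrip(const char *msg, char *out, size_t outsz)
{
    int fd[2];                  /* fd[0]: read channel, fd[1]: write channel */
    pid_t pid;
    size_t total = 0;
    ssize_t r = 0;

    if (pipe(fd) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: both descriptors were inherited; keep only the writer. */
        size_t len = strlen(msg);
        ssize_t w;

        close(fd[0]);
        w = write(fd[1], msg, len);
        _exit(w == (ssize_t)len ? 0 : 1);   /* exiting closes fd[1] */
    }

    /* Parent: close its own write descriptor, otherwise the read loop
     * below would never see end-of-file. */
    close(fd[1]);
    while (total < outsz &&
           (r = read(fd[0], out + total, outsz - total)) > 0)
        total += (size_t)r;
    close(fd[0]);
    waitpid(pid, NULL, 0);
    return r < 0 ? -1 : (ssize_t)total;
}
```

Calling pipe_roundtrip("hello", buf, sizeof(buf)) from a test program should fill buf with the five bytes written by the child. Note that the parent closes its copy of the write descriptor before reading; if it kept that copy open, the read loop would block instead of seeing end-of-file.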


Whenever a process invokes the close( ) system call on a file descriptor associated with a pipe, the kernel executes the fput( ) function on the corresponding file object, which decrements the usage counter. If the counter becomes 0, the function invokes the release method of the file operations (see Section 12.5.3 and Section 12.2.7 in Chapter 12).

Both the pipe_read_release( ) and the pipe_write_release( ) functions are used to implement the release method of the pipe's file objects. They set the readers and the writers fields, respectively, of the pipe_inode_info structure to 0. Each function then invokes the pipe_release( ) function. This function wakes up any processes sleeping in the pipe's wait queue so that they can recognize the change in the pipe state. Moreover, the function checks whether both the readers and writers fields are equal to 0; in this case, it releases the page frame containing the pipe buffer.

18.1.4 Reading from a Pipe

A process wishing to get data from a pipe issues a read( ) system call, specifying as its file descriptor the descriptor associated with the pipe's read channel. As described in Section 12.5.2 in Chapter 12, the kernel ends up invoking the read method found in the file operation table associated with the proper file object. In the case of a pipe, the entry for the read method in the read_pipe_fops table points to the pipe_read( ) function.

The pipe_read( ) function is quite involved, since the POSIX standard specifies several requirements for the pipe's read operations. Table 18-2 illustrates the expected behavior of a read( ) system call that requests n bytes from a pipe having a pipe size (number of bytes in the pipe buffer yet to be read) equal to p. Notice that the read operation can be nonblocking: in this case, it completes as soon as all available bytes (even none) have been copied into the user address space. [3] Notice also that the value 0 is returned by the read( ) system call only if the pipe is empty and no process is currently using the file object associated with the pipe's write channel.

[3] Nonblocking operations are usually requested by specifying the O_NONBLOCK flag in the open( ) system call. This method does not work for pipes, since they cannot be opened; a process can, however, require a nonblocking operation on a pipe by issuing a fcntl( ) system call on the corresponding file descriptor.

Table 18-2. Reading n Bytes from a Pipe

Pipe Size p | At Least One Writing Process, Blocking Read | At Least One Writing Process, Nonblocking Read | No Writing Process
p=0 | Wait for some data, copy it, and return its size. | Return -EAGAIN. | Return 0.
0


2. Checks the lock field of the pipe_inode_info data structure. If it is not null, another process is currently accessing the pipe; in this case, either suspends the current process or immediately terminates the system call, depending on the type of read operation (blocking or nonblocking).
3. Increments the lock field.
4. Copies the requested number of bytes (or the number of available bytes, if the buffer size is too small) from the pipe's buffer to the user address space.
5. Decrements the lock field.
6. Invokes wake_up_interruptible( ) to wake up all processes sleeping on the pipe's wait queue.
7. Returns the number of bytes copied into the user address space.

18.1.5 Writing into a Pipe

A process wishing to put data into a pipe issues a write( ) system call, specifying as its file descriptor the descriptor associated with the pipe's write channel. The kernel satisfies this request by invoking the write method of the proper file object; the corresponding entry in the write_pipe_fops table points to the pipe_write( ) function.

Table 18-3 illustrates the behavior, specified by the POSIX standard, of a write( ) system call that requested to write n bytes into a pipe having u unused bytes in its buffer. In particular, the standard requires that write operations involving a small number of bytes must be atomically executed. More precisely, if two or more processes are concurrently writing into a pipe, any write operation involving fewer than 4096 bytes (the pipe buffer size) must finish without being interleaved with write operations of other processes to the same pipe. However, write operations involving more than 4096 bytes may be nonatomic and may also force the calling process to sleep.

Table 18-3. Writing n Bytes to a Pipe

Available Buffer Space u | At Least One Reading Process, Blocking Write | At Least One Reading Process, Nonblocking Write | No Reading Process
u < n ≤ 4096 | Wait until n-u bytes are freed, copy n bytes, and return n. | Return -EAGAIN. | Send SIGPIPE signal and return -EPIPE.
n > 4096 | Copy n bytes (waiting when necessary) and return n. | If u>0, copy u bytes and return u, else return -EAGAIN. | Send SIGPIPE signal and return -EPIPE.
u ≥ n | Copy n bytes and return n. | Copy n bytes and return n. | Send SIGPIPE signal and return -EPIPE.

Moreover, any write operation to a pipe must fail if the pipe does not have a reading process (that is, if the readers field of the pipe's inode object has the value 0). In that case, the kernel sends a SIGPIPE signal to the writing process and terminates the write( ) system call with the -EPIPE error code, which usually leads to the familiar "Broken pipe" message.

The pipe_write( ) function performs the following operations:

1. Checks whether the pipe has at least one reading process. If not, sends a SIGPIPE signal to the current process and returns an -EPIPE value.


2. Releases the i_sem semaphore of the pipe's inode, which was acquired by the sys_write( ) function (see Section 12.5.2 in Chapter 12), and acquires the i_atomic_write semaphore of the same inode. [4]
3. Checks whether the number of bytes to be written is within the pipe's buffer size:
   a. If so, the write operation must be atomic. Therefore, checks whether the buffer has enough free space to store all bytes to be written.
   b. If the number of bytes is greater than the buffer size, the operation can start as long as there is any free space at all. Therefore, checks for at least 1 free byte.
4. If the buffer does not have enough free space and the write operation is blocking, inserts the current process into the pipe's wait queue and suspends it until some data is read from the pipe. Notice that the i_atomic_write semaphore is not released, so no other process can start a write operation on the buffer. If the write operation is nonblocking, returns the -EAGAIN error code.
5. Checks the lock field of the pipe_inode_info data structure. If it is not null, another process is currently reading the pipe, so either suspends the current process or immediately terminates the write depending on whether the write operation is blocking or nonblocking.
6. Increments the lock field.
7. Copies the requested number of bytes (or the number of free bytes if the pipe size is too small) from the user address space to the pipe's buffer.
8. If there are bytes yet to be written, goes to step 4.
9. After all requested data is written, decrements the lock field.
10. Invokes wake_up_interruptible( ) to wake up all processes sleeping on the pipe's wait queue.
11. Releases the i_atomic_write semaphore and acquires the i_sem semaphore (so that sys_write( ) can safely release the latter).
12. Returns the number of bytes written into the pipe's buffer.

[4] The i_sem semaphore prevents multiple processes from starting write operations on a file, and thus on the pipe. For some reason unknown to the authors, Linux prefers to make use of a specialized pipe semaphore.

18.2 FIFOs

Although pipes are a simple, flexible, and efficient communication mechanism, they have one main drawback, namely, that there is no way to open an already existing pipe. This makes it impossible for two arbitrary processes to share the same pipe, unless the pipe was created by a common ancestor process.

This drawback is substantial for many application programs. Consider, for instance, a database engine server, which continuously polls client processes wishing to issue some queries and which sends back to them the results of the database lookups. Each interaction between the server and a given client might be handled by a pipe. However, client processes are usually created on demand by a command shell when a user explicitly queries the database; server and client processes thus cannot easily share a pipe.

In order to address such limitations, Unix systems introduce a special file type called a named pipe or FIFO (which stands for "first in, first out": the first byte written into the special file is also the first byte that will be read). [5]

[5] Starting with System V Release 3, FIFOs are implemented as full-duplex (bidirectional) objects.
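
As a rough user-space illustration of the idea, the following C sketch creates a named pipe, writes a short message into it, and reads the message back through the same file descriptor. It assumes a Linux system and the POSIX mkfifo( ) library function (see Section 18.2.1); the helper name fifo_round_trip and the pathname passed to it are hypothetical, chosen only for this example. Opening the FIFO for both reading and writing returns at once on Linux, which lets the sketch stay within a single process.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * fifo_round_trip: hypothetical helper, not part of any library.
 * Creates a FIFO at `path`, writes `msg` into it, reads the data back
 * into `out` (NUL-terminated), and removes the FIFO again.
 * Returns the number of bytes read back, or -1 on any error.
 */
long fifo_round_trip(const char *path, const char *msg, char *out, size_t outlen)
{
    size_t len = strlen(msg);
    ssize_t n = -1;
    int fd;

    /* Keep the message nonempty, within the 4096-byte pipe buffer,
       and small enough to fit into `out`, so that nothing blocks. */
    if (len == 0 || len > 4096 || len >= outlen)
        return -1;

    /* Create the FIFO special file with rw------- permissions. */
    if (mkfifo(path, 0600) < 0)
        return -1;

    /* Read/write open: returns at once even with no other process. */
    fd = open(path, O_RDWR);
    if (fd >= 0) {
        if (write(fd, msg, len) == (ssize_t)len)
            n = read(fd, out, outlen - 1);
        if (n >= 0)
            out[n] = '\0';
        close(fd);
    }

    /* Remove the FIFO's name from the directory tree. */
    unlink(path);
    return (long)n;
}
```

In a real application the reader and the writer would normally be two unrelated processes that open the same pathname, one for reading and one for writing.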


FIFO files are similar to device files: they have a disk inode, but they do not make use of data blocks. Thanks to the disk inode, a FIFO can be accessed by any process, since the FIFO filename is included in the system's directory tree. Apart from having a filename, FIFOs are similar to unnamed pipes: they also include a kernel buffer to temporarily store the data exchanged by two or more processes. Since they make use of kernel buffers, FIFOs are much more efficient than temporary files.

Going back to the database example, the communication between server and clients may be easily established by using FIFOs instead of pipes. The server creates, at startup, a FIFO used by client programs to make their requests. Each client program creates, before establishing the connection, another FIFO to which the server program can write the answer to the query, and includes the FIFO's name in the initial request to the server.

18.2.1 Creating and Opening a FIFO

A process creates a FIFO by issuing a mknod( ) [6] system call (see Section 13.2.1 in Chapter 13), passing to it as parameters the pathname of the new FIFO and the value S_IFIFO (0x1000) logically ORed with the permission bit mask of the new file. POSIX introduces a system call named mkfifo( ) specifically to create a FIFO. This call is implemented in Linux, as in System V Release 4, as a C library function that invokes mknod( ).

[6] In fact, mknod( ) can be used to create nearly any kind of file: block and character device files, FIFOs, and even regular files (it cannot create directories or sockets, though).

Once created, a FIFO can be accessed through the usual open( ), read( ), write( ), and close( ) system calls, yet the VFS handles it in a special way because the FIFO inode and file operations are customized and do not depend on the filesystems in which the FIFO is stored.

The POSIX standard specifies the behavior of the open( ) system call on named pipes; the behavior depends essentially on the requested access type, on the kind of I/O operation (blocking or nonblocking), and on the presence of other processes accessing the FIFO.

A process may open a FIFO for reading, for writing, or for reading and writing. The file operations associated with the corresponding file object are set to special methods for these three cases.

When a process opens a FIFO, the VFS performs the same operations as it does for device files (see Section 13.2.2 in Chapter 13). The inode object associated with the opened FIFO is initialized by a filesystem-dependent read_inode superblock method.
This method always checks whether the inode on disk represents a FIFO:

    if ((inode->i_mode & 00170000) == S_IFIFO)
        init_fifo(inode);

The init_fifo( ) function sets the i_op field of the inode object to the address of the fifo_inode_operations table. The function also initializes to 0 all fields of the pipe_inode_info data structure stored inside the inode object (see Table 18-1).

The filp_open( ) function (invoked by sys_open( ), see Section 12.5.1 in Chapter 12) then fills the remaining fields of the inode object and initializes the f_op field of the new file object with the contents of the i_op->default_file_ops field of the inode object. As a consequence, the file operation table is set to def_fifo_fops. Then filp_open( ) invokes the open method from that table of operations, which is implemented in this specific case by the fifo_open( ) function.

The fifo_open( ) function examines the values of the readers and writers fields in the pipe_inode_info data structure. When referring to FIFOs, such fields store the number of reading and writing processes, respectively. If necessary, the function suspends the current process until a reader or a writer process accesses the FIFO: Table 18-4 illustrates the possible behaviors of fifo_open( ). Moreover, the function further determines specialized behavior for the set of file operations to be used by setting the f_op field of the file object to the address of some predefined tables shown in Table 18-5. Finally, the function checks whether the base field of the pipe_inode_info data structure is NULL; in this case, it gets a free page frame for the FIFO's kernel buffer and stores its address in base.

Table 18-4. Behavior of the fifo_open( ) Function

Access Type               Blocking             Nonblocking
------------------------  -------------------  -------------------
Read only, with writers   Successfully return  Successfully return
Read only, no writer      Wait for a writer    Successfully return
Write only, with readers  Successfully return  Successfully return
Write only, no reader     Wait for a reader    Return -ENXIO
Read/write                Successfully return  Successfully return

The FIFO's four specialized file operation tables differ mainly in the implementation of the read and write methods. If the access type allows read operations, the read method is implemented by the pipe_read( ) function. Otherwise, it is implemented by bad_pipe_r( ), which just returns an error code. Similarly, if the access type allows write operations, the write method is implemented by the pipe_write( ) function; otherwise, it is implemented by bad_pipe_w( ), which also returns an error code.

According to the POSIX standard, a process may open a FIFO successfully for reading in a nonblocking mode, even if the FIFO has no writers: in this case, the pipe_read( ) function cannot be used right away to implement the read method since it returns an -EAGAIN error code when it discovers that the pipe is empty and that there are no writers. The solution adopted consists of implementing the read method with an intermediate connect_read( ) function; if there are no writers, this function returns 0; otherwise, it sets the f_op field of the file object to read_fifo_fops and then invokes pipe_read( ).

Table 18-5. FIFO's File Operations

Access Type              File Operations       read Method      write Method
-----------------------  --------------------  ---------------  -------------
Read only, with writers  read_fifo_fops        pipe_read( )     bad_pipe_w( )
Read only, no writer     connecting_fifo_fops  connect_read( )  bad_pipe_w( )
Write only               write_fifo_fops       bad_pipe_r( )    pipe_write( )
Read/write               rdwr_fifo_fops        pipe_read( )     pipe_write( )

18.2.2 Reading from and Writing into a FIFO

The read( ) and write( ) system calls that refer to a FIFO, as well as to any other file type, are handled by the VFS through the read( ) and write( ) file object methods. If the operation is allowed, the corresponding entries in the file operation table point to the pipe_read( ) and pipe_write( ) functions (see Section 18.1.4 and Section 18.1.5).

The VFS thus handles reading and writing for FIFOs the same as for unnamed pipes. In contrast to unnamed pipes, however, the same file descriptor may be used both for reading and for writing a FIFO.

18.3 System V IPC

IPC is an abbreviation that stands for Interprocess Communication. It denotes a set of system calls that allows a User Mode process to:

• Synchronize itself with other processes by means of semaphores
• Send messages to other processes or receive messages from them
• Share a memory area with other processes

IPC was introduced in a development Unix variant called "Columbus Unix" and later adopted by AT&T's System III. It is now commonly found in most Unix systems, including Linux.

IPC data structures are created dynamically when a process requests an IPC resource (a semaphore, a message queue, or a shared memory segment). Each IPC resource is persistent: unless explicitly released by a process, it is kept in memory.
An IPC resource may be used by any process, including those that do not share the ancestor that created the resource.

Since a process may require several IPC resources of the same type, each new resource is identified by a 32-bit IPC key, which is similar to the file pathname in the system's directory tree. Each IPC resource also has a 32-bit IPC identifier, which is somewhat similar to the file descriptor associated with an open file. IPC identifiers are assigned to IPC resources by the kernel and are unique within the system, while IPC keys can be freely chosen by programmers.

When two or more processes wish to communicate through an IPC resource, they all refer to the IPC identifier of the resource.

18.3.1 Using an IPC Resource

IPC resources are created by invoking the semget( ), msgget( ), or shmget( ) functions, depending on whether the new resource is a semaphore, a message queue, or a shared memory segment.

The main objective of each of these three functions is to derive from the IPC key (passed as the first parameter) the corresponding IPC identifier, which will then be used by the process for accessing the resource. If there is no IPC resource already associated with the IPC key, a new resource is created. If everything goes right, the function returns a positive IPC identifier; otherwise, it returns one of the error codes illustrated in Table 18-6.
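
The key-to-identifier mapping described above can be observed from user space with semget( ). The following is a minimal sketch, assuming a Linux system with System V IPC enabled; the helper name key_maps_to_one_identifier is hypothetical, and the caller is assumed to pass a key that no other program is using. It relies on the IPC_CREAT and IPC_EXCL flags and on the IPC_RMID subcommand of semctl( ), which are discussed later in this section.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/*
 * key_maps_to_one_identifier: hypothetical helper for illustration.
 * Creates an IPC semaphore with one counter under `key`, looks it up
 * again through the same key, and finally removes it.
 * Returns 0 if the lookup yields the same identifier and a second
 * exclusive creation fails with EEXIST; returns -1 otherwise.
 */
int key_maps_to_one_identifier(key_t key)
{
    int id, same, dup, dup_errno;

    /* Create the resource; fail if the key is already in use. */
    id = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
    if (id < 0)
        return -1;

    /* A plain lookup with the same key returns the same identifier. */
    same = semget(key, 1, 0);

    /* An exclusive creation with an existing key must fail. */
    errno = 0;
    dup = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);
    dup_errno = errno;

    /* IPC resources are persistent, so release this one explicitly. */
    if (semctl(id, 0, IPC_RMID) < 0)
        return -1;

    return (same == id && dup == -1 && dup_errno == EEXIST) ? 0 : -1;
}
```

The same pattern applies to msgget( ) and shmget( ), whose first parameter is also the IPC key.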


Table 18-6. Error Codes Returned While Requesting an IPC Identifier

Error Code  Description
----------  ---------------------------------------------------------------------------------------
EACCES      Process does not have proper access rights.
EEXIST      Process tried to create an IPC resource with the same key as one that already exists.
EIDRM       The resource is marked so as to be deleted.
ENOENT      No IPC resource with the requested key exists and the process did not ask to create it.
ENOMEM      No more storage is left for an additional IPC resource.
ENOSPC      Maximum limit on the number of IPC resources has been exceeded.

Assume that two independent processes want to share a common IPC resource. This can be achieved in two possible ways:

• The processes agree on some fixed, predefined IPC key. This is the simplest case, and it works quite well for any complex application implemented by many processes. However, there's a chance that the same IPC key is adopted by another unrelated program. In this case, the IPC functions might be successfully invoked and yet return the IPC identifier of the wrong resource. [7]
• One process issues a semget( ), msgget( ), or shmget( ) function by specifying IPC_PRIVATE as its IPC key. A new IPC resource is thus allocated, and the process can either communicate its IPC identifier to the other process in the application [8] or fork the other process itself. This method ensures that the IPC resource cannot be accidentally used by other applications.

[7] The ftok( ) function attempts to create a new key from a file pathname and an 8-bit project identifier passed as parameters. It does not guarantee, however, a unique key number, since there is a small chance that it will return the same IPC key to two different applications using different pathnames and project identifiers.

[8] This implies, of course, the existence of another communication channel between the processes not based on IPC.

The last parameter of the semget( ), msgget( ), and shmget( ) functions can include two flags. IPC_CREAT specifies that the IPC resource must be created, if it does not already exist; IPC_EXCL specifies that the function must fail if the resource already exists and the IPC_CREAT flag is set.

Even if the process uses the IPC_CREAT and IPC_EXCL flags, there is no way to ensure exclusive access to an IPC resource, since other processes may always refer to the resource by using its IPC identifier.

In order to minimize the risk of incorrectly referencing the wrong resource, the kernel does not recycle IPC identifiers as soon as they become free. Instead, the IPC identifier assigned to a resource is almost always larger than the identifier assigned to the previously allocated resource of the same type.
(The only exception occurs when the 32-bit IPC identifier overflows.) Each IPC identifier is computed by combining a slot usage sequence number relative to the resource type, an arbitrary slot index for the allocated resource, and the value chosen in the kernel for the maximum number of allocatable resources. If we choose s to represent the slot usage sequence number, M to represent the maximum number of resources, and i to represent the slot index, where 0 ≤ i < M, then each IPC resource's identifier is computed as follows:

    IPC identifier = s × M + i


The slot usage sequence number s is initialized to 0 and is incremented by 1 at every resource deallocation. In two consecutive resource allocations, the slot index i can only increase; it can decrease only when a resource has been deallocated, but then the increased slot usage sequence number ensures that the new IPC identifier for the next allocated resource is larger than the previous one.

Each IPC resource is associated with an ipc_perm data structure, whose fields are shown in Table 18-7. The uid, gid, cuid, and cgid fields store the user and group identifiers of the resource's current owner and the user and group identifiers of the resource's creator, respectively. The mode bit mask includes six flags, which store the read and write access permissions for the resource's owner, the resource's group, and all other users. IPC access permissions are similar to file access permissions described in Section 1.5.5 in Chapter 1, except that there is no Execute permission flag.

Table 18-7. The Fields in the ipc_perm Structure

Type            Field  Description
--------------  -----  --------------------------
int             key    IPC key
unsigned short  uid    Owner user ID
unsigned short  gid    Owner group ID
unsigned short  cuid   Creator user ID
unsigned short  cgid   Creator group ID
unsigned short  mode   Permission bit mask
unsigned short  seq    Slot usage sequence number

The ipc_perm data structure also includes a key field, which contains the IPC key of the corresponding resource, and a seq field, which stores the slot usage sequence number s used to compute the IPC identifier of the resource.

The semctl( ), msgctl( ), and shmctl( ) functions may be used to handle IPC resources. The IPC_SET subcommand allows a process to change the owner's user and group identifiers and the permission bit mask in the ipc_perm data structure. The IPC_STAT and IPC_INFO subcommands retrieve some information concerning a resource. Finally, the IPC_RMID subcommand releases an IPC resource. Depending on the type of IPC resource, other specialized subcommands are also available. [9]

[9] Another IPC design flaw is that a User Mode process cannot atomically create and initialize an IPC resource, since these two operations are performed by two different IPC functions.

Once an IPC resource has been created, a process may act on the resource by means of a few specialized functions. A process may acquire or release an IPC semaphore by issuing the semop( ) function.
When a process wants to send or receive an IPC message, it uses the msgsnd( ) and msgrcv( ) functions, respectively. Finally, a process attaches and detaches a shared memory segment in its address space by means of the shmat( ) and shmdt( ) functions, respectively.

18.3.2 The ipc( ) System Call

All IPC functions must be implemented through suitable Linux system calls. Actually, in the Intel 80x86 architecture, there is just one IPC system call named ipc( ). When a process invokes an IPC function, let's say msgget( ), it really invokes a wrapper function in the C library, which in turn invokes the ipc( ) system call by passing to it all the parameters of msgget( ) plus a proper subcommand code, in this case MSGGET. The sys_ipc( ) service routine examines the subcommand code and invokes the kernel function that implements the requested service.

The ipc( ) "multiplexer" system call is a legacy from older Linux versions, which included the IPC code in a dynamic module (see Appendix B). It did not make much sense to reserve several system call entries in the system_call table for a kernel component that could be missing, so the kernel designers adopted the multiplexer approach.

Nowadays, System V IPC can no longer be compiled as a dynamic module, and there is no justification for using a single IPC system call. As a matter of fact, Linux provides one system call for each IPC function on Compaq's Alpha architecture.

18.3.3 IPC Semaphores

IPC semaphores are quite similar to the kernel semaphores introduced in Chapter 11: they are counters used to provide controlled access to shared data structures for multiple processes.

The semaphore value is positive if the protected resource is available, and negative or 0 if the protected resource is currently not available. A process that wants to access the resource decrements the semaphore value by 1. It is allowed to use the resource only if the old value was positive; otherwise, the process waits until the semaphore becomes positive.
When a process relinquishes a protected resource, it increments its semaphore value by 1; in doing so, any other process waiting for the semaphore is woken up. Actually, IPC semaphores are more complicated to handle than kernel semaphores for two main reasons:

• Each IPC semaphore is a set of one or more semaphore values, not just a single value as for kernel semaphores. This means that the same IPC resource can protect several independent shared data structures. The number of semaphore values in each IPC semaphore must be specified as a parameter of the semget( ) function when the resource is being allocated, but it cannot be greater than SEMMSL (usually 32). From now on, we'll refer to the counters inside an IPC semaphore as primitive semaphores.
• The IPC specification creates a fail-safe mechanism for situations in which a process dies without being able to undo the operations that it previously issued on a semaphore. When a process chooses to use this mechanism, the resulting operations are called undoable semaphore operations. When the process dies, all of its IPC semaphores can revert to the values they would have had if the process had never started its operations. This can help prevent deadlocks of other processes using the same semaphores.

First, we'll briefly sketch the typical steps performed by a process wishing to access one or more resources protected by an IPC semaphore.
The process:

1. Invokes the semget( ) wrapper function to get the IPC semaphore identifier, specifying as the parameter the IPC key of the IPC semaphore that protects the shared resources. If the process wants to create a new IPC semaphore, it also specifies the IPC_CREAT flag (or passes IPC_PRIVATE as the key) and the number of primitive semaphores required (see Section 18.3.1 earlier in this chapter).
2. Invokes the semop( ) wrapper function to test and decrement all primitive semaphore values involved. If all the tests succeed, the decrements are performed, the function terminates, and the process is allowed to access the protected resources. If some semaphores are in use, the process is usually suspended until some other process releases the resources. The function receives as parameters the IPC semaphore identifier, an array of numbers specifying the operations to be atomically performed on the primitive semaphores, and the number of such operations. Optionally, the process may specify the SEM_UNDO flag, which instructs the kernel to reverse the operations should the process exit without releasing the primitive semaphores.
3. When relinquishing the protected resources, invokes the semop( ) function again to atomically increment all primitive semaphores involved.
4. Optionally, invokes the semctl( ) wrapper function, specifying in its parameter the IPC_RMID flag to remove the IPC semaphore from the system.

Now we can discuss how the kernel implements IPC semaphores. The data structures involved are shown in Figure 18-1. A statically allocated semary array includes SEMMNI values (usually 128). Each element in the array can assume one of the following values:

• IPC_UNUSED (-1): no IPC resource refers to this slot.
• IPC_NOID (-2): the IPC resource is being allocated or destroyed.
• The address of a dynamically allocated memory area containing the IPC semaphore resource.

Figure 18-1. IPC semaphore data structures

The index number of the semary array represents the slot index i mentioned earlier. When a new IPC resource must be allocated, the kernel scans the array and uses the first array element (slot) containing the value IPC_UNUSED. The slot index can be easily derived from the IPC identifier by simply masking out its high-order bits (see Section 18.3.1).

The first locations of the memory area containing the IPC semaphore store a descriptor of type struct semid_ds, whose fields are shown in Table 18-8. All other locations in the memory area store several sem data structures, one for each primitive semaphore in the IPC semaphore resource. The sem_base field of the semid_ds structure points to the first sem structure in the memory area. The sem data structure includes only two fields:

semval
    Value of the semaphore's counter.

sempid
    PID of the last process that accessed the semaphore. This value can be queried by a process through the semctl( ) wrapper function.

Table 18-8. The Fields in the semid_ds Structure

Type                 Field             Description
-------------------  ----------------  ------------------------------
struct ipc_perm      sem_perm          ipc_perm data structure
long                 sem_otime         Timestamp of last semop( )
long                 sem_ctime         Timestamp of last change
struct sem *         sem_base          Pointer to first sem structure
struct sem_queue *   sem_pending       Pending operations
struct sem_queue **  sem_pending_last  Last pending operation
struct sem_undo *    undo              Undo requests
unsigned short       sem_nsems         Number of semaphores in array

18.3.3.1 Undoable semaphore operations

If a process aborts suddenly, it cannot undo the operations that it started (for instance, release the semaphores it reserved); so by declaring them undoable the process lets the kernel return the semaphores to a consistent state and allow other processes to proceed. Processes may require undoable operations by specifying the SEM_UNDO flag in the semop( ) function.

Information to help the kernel reverse the undoable operations performed by a given process on a given IPC semaphore resource is stored in a sem_undo data structure.
It essentially contains the IPC identifier of the semaphore and an array of integers representing the changes to the primitive semaphore's values caused by all undoable operations performed by the process.

A simple example can illustrate how such sem_undo elements are used. Consider a process using an IPC semaphore resource with four primitive semaphores and suppose that it invokes the semop( ) function to increment by 1 the first counter and decrement by 2 the second one. If it specifies the SEM_UNDO flag, the integer in the first array element in the sem_undo data structure is decremented by 1, the integer in the second element is incremented by 2, and the other two integers are left unchanged. Further undoable operations on the IPC semaphore performed by the same process change accordingly the integers stored in the sem_undo structure. When the process exits, any nonzero value in that array corresponds to one or more unbalanced operations on the corresponding primitive semaphore; the kernel reverses these operations, simply adding the nonzero value to the corresponding semaphore's counter. In other words, the changes made by the aborted process are backed out while the changes made by other processes are still reflected in the state of the semaphores.
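
The bookkeeping in this example can be mimicked with a few lines of C. This is only a user-space model under simplifying assumptions, not the kernel's code: the undo_sketch structure and both helper functions are hypothetical names, and blocking, permissions, and locking are ignored.

```c
#define NSEMS 4   /* four primitive semaphores, as in the example */

/* Hypothetical stand-in for the adjustment array kept in sem_undo. */
struct undo_sketch {
    int semadj[NSEMS];
};

/*
 * Models one undoable operation: the counter changes by `delta`,
 * and the opposite change is accumulated in the adjustment array.
 */
void undoable_op(int semval[NSEMS], struct undo_sketch *u, int idx, int delta)
{
    semval[idx] += delta;
    u->semadj[idx] -= delta;
}

/*
 * Models process exit: every adjustment is added back to the
 * corresponding counter, then cleared. Changes made by other
 * processes to the counters are left in place.
 */
void undo_at_exit(int semval[NSEMS], struct undo_sketch *u)
{
    int i;

    for (i = 0; i < NSEMS; i++) {
        semval[i] += u->semadj[i];
        u->semadj[i] = 0;
    }
}
```

Calling undoable_op( ) with +1 on the first counter and -2 on the second leaves the adjustments at -1 and +2, matching the text; undo_at_exit( ) then backs out exactly those two changes.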


For each process, the kernel keeps track of all semaphore resources handled with undoable operations, so that it can roll them back if the process unexpectedly exits. Furthermore, the kernel has to keep track, for each semaphore, of all its sem_undo structures, so that it can quickly access them whenever a process uses semctl( ) to force an explicit value into a primitive semaphore's counter or to destroy an IPC semaphore resource.

The kernel is able to handle these tasks efficiently thanks to two lists, which we denote as the per-process and the per-semaphore lists. The first one keeps track of all semaphores handled by a given process with undoable operations. The second one keeps track of all processes that are acting on a given semaphore with undoable operations. More precisely:

• The per-process list includes all sem_undo data structures corresponding to IPC semaphores on which the process has performed undoable operations. The semundo field of the process descriptor points to the first element of the list, while the proc_next field of each sem_undo data structure points to the next element in the list.
• The per-semaphore list includes all sem_undo data structures corresponding to the processes that performed undoable operations on it. The undo field of the semid_ds data structure points to the first element of the list, while the id_next field of each sem_undo data structure points to the next element in the list.

The per-process list is used when a process terminates. The sem_exit( ) function, which is invoked by do_exit( ), walks through the list and reverses the effect of any unbalanced operation for every IPC semaphore touched by the process. By contrast, the per-semaphore list is mainly used when a process invokes the semctl( ) function to force an explicit value into a primitive semaphore. The kernel sets the corresponding element to 0 in the arrays of all sem_undo data structures referring to that IPC semaphore resource, since it would no longer make any sense to reverse the effect of previous undoable operations performed on that primitive semaphore. Moreover, the per-semaphore list is also used when an IPC semaphore is destroyed; all related sem_undo data structures are invalidated by setting the semid field to -1. [10]

[10] Notice that they are just invalidated, and not freed, since it would be too costly to remove the data structures from the per-process lists of all processes.

18.3.3.2 The queue of pending requests

The kernel associates with each IPC semaphore a queue of pending requests to identify processes that are waiting on one of the semaphores in the array. The queue is a doubly linked list of sem_queue data structures, whose fields are shown in Table 18-9. The first and last pending requests in the queue are referenced, respectively, by the sem_pending and sem_pending_last fields of the semid_ds structure. This last field allows the list to be handled easily as a FIFO: new pending requests are added to the end of the list so that they will be serviced later. The most important fields of a pending request are nsops, which stores the number of primitive semaphores involved in the pending operation, and sops, which points to an array of sembuf structures describing each single semaphore operation. The sleeper field stores the address of the wait queue containing the sleeping process.
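A minimal sketch of such a FIFO follows. It assumes, based only on the pointer-to-pointer types that Tables 18-8 and 18-9 give for sem_pending_last and prev, that each back link stores the address of the pointer that refers to the element. The req and pending structures and the three functions are hypothetical names used for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for a sem_queue element. */
struct req {
    struct req  *next;   /* next pending request */
    struct req **prev;   /* address of the pointer that points to this one */
    int          pid;    /* process that issued the request */
};

/* Hypothetical stand-in for the two queue fields of semid_ds. */
struct pending {
    struct req  *first;  /* plays the role of sem_pending */
    struct req **last;   /* plays the role of sem_pending_last */
};

void pending_init(struct pending *p)
{
    p->first = NULL;
    p->last = &p->first;   /* an empty queue ends at its own head pointer */
}

/* New requests go to the tail in constant time. */
void pending_append(struct pending *p, struct req *q)
{
    q->next = NULL;
    q->prev = p->last;
    *p->last = q;
    p->last = &q->next;
}

/* Unlink a request from any position without walking the list. */
void pending_remove(struct pending *p, struct req *q)
{
    *q->prev = q->next;
    if (q->next != NULL)
        q->next->prev = q->prev;
    else
        p->last = q->prev;   /* q was the tail */
    q->prev = NULL;
}
```

With this layout, neither appending nor removing needs a special case for the head of the list, because the head pointer is treated like any other next field.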


Table 18-9. The Fields in the sem_queue Structure

Type                  Field    Description
struct sem_queue *    next     Pointer to next queue element
struct sem_queue **   prev     Pointer to previous queue element
struct wait_queue *   sleeper  Pointer to sleeping process wait queue
struct sem_undo *     undo     Pointer to sem_undo structure
int                   pid      Process identifier
int                   status   Completion status of operation
struct semid_ds *     sma      Pointer to IPC semaphore descriptor
struct sembuf *       sops     Pointer to array of pending operations
int                   nsops    Number of pending operations
int                   alter    Flag for altering operations

Figure 18-1 illustrates an IPC semaphore that has three pending requests. Two of them refer to undoable operations, so the undo field of the sem_queue data structure points to the corresponding sem_undo structure; the third pending request has a NULL undo field since the corresponding operation is not undoable.

18.3.4 IPC Messages

Processes can communicate with each other by means of IPC messages. Each message generated by a process is sent to an IPC message queue where it stays until another process reads it.

A message is composed of a fixed-size header and a variable-length text; it can be labeled with an integer value (the message type), which allows a process to selectively retrieve messages from its message queue. [11] Once a process has read a message from an IPC message queue, the kernel destroys it; therefore, only one process can receive a given message.

[11] As we'll see, the message queue is implemented by means of a linked list. Since messages can be retrieved in an order different from "first in, first out," the name "message queue" is not appropriate. However, new messages are always put at the end of the linked list.

In order to send a message, a process invokes the msgsnd( ) function, passing as parameters:

• The IPC identifier of the destination message queue
• The size of the message text
• The address of a User Mode buffer that contains the message type immediately followed by the message text

To retrieve a message, a process invokes the msgrcv( ) function, passing to it:

• The IPC identifier of the IPC message queue resource
• The pointer to a User Mode buffer to which the message type and message text should be copied
• The size of this buffer
• A value t that specifies what message should be retrieved

If the value t is 0, the first message in the queue is returned. If t is positive, the first message in the queue with its type equal to t is returned. Finally, if t is negative, the function returns the first message whose message type is the lowest value less than or equal to the absolute value of t.

The data structures associated with IPC message queues are shown in Figure 18-2. A statically allocated array msgque includes MSGMNI values (usually 128). Like the semary array, each element in a msgque array can assume the value IPC_UNUSED, IPC_NOID, or the address of an IPC message queue descriptor.

Figure 18-2. IPC message queue data structures

The message queue descriptor is a msqid_ds structure, whose fields are shown in Table 18-10. The most important fields are msg_first and msg_last, which point to the first and to the last message in the linked list, respectively. The rwait field points to a wait queue that includes all processes currently waiting for some message in the queue. Conversely, the wwait field points to a wait queue that includes all processes currently waiting for some free space in the queue so they can add a new message. The total size of the header and the text of all messages in the queue cannot exceed the value stored in the msg_qbytes field; the default maximum size is MSGMNB, that is, 16,384 bytes.

Table 18-10. The msqid_ds Structure

Type                  Field       Description
struct ipc_perm       msg_perm    ipc_perm data structure
struct msg *          msg_first   First message in queue
struct msg *          msg_last    Last message in queue
long                  msg_stime   Time of last msgsnd( )
long                  msg_rtime   Time of last msgrcv( )
long                  msg_ctime   Last change time
struct wait_queue *   wwait       Processes waiting for free space
struct wait_queue *   rwait       Processes waiting for messages
unsigned short        msg_cbytes  Current number of bytes in queue
unsigned short        msg_qnum    Number of messages in queue
unsigned short        msg_qbytes  Maximum number of bytes in queue
unsigned short        msg_lspid   PID of last msgsnd( )
unsigned short        msg_lrpid   PID of last msgrcv( )
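The rule that msgrcv( ) applies to its t parameter, described earlier in this section, can be sketched as a walk over the linked list that starts at msg_first. The msg_sketch structure and the select_msg( ) function are hypothetical and keep only what the rule needs; the sketch does not remove the message from the list or handle waiting.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified message header. */
struct msg_sketch {
    struct msg_sketch *next;   /* plays the role of msg_next */
    long               type;   /* plays the role of msg_type */
};

/* Return the message that a request with selector t would retrieve,
 * or NULL if no message in the list qualifies. */
struct msg_sketch *select_msg(struct msg_sketch *first, long t)
{
    struct msg_sketch *m, *best = NULL;
    long limit;

    if (t == 0)
        return first;                   /* first message of any type */

    if (t > 0) {
        for (m = first; m != NULL; m = m->next)
            if (m->type == t)
                return m;               /* first message of type t */
        return NULL;
    }

    /* t < 0: lowest type not exceeding |t|; the strict comparison keeps
     * the earliest message when several share that lowest type. */
    limit = (t == LONG_MIN) ? LONG_MAX : -t;
    for (m = first; m != NULL; m = m->next)
        if (m->type <= limit && (best == NULL || m->type < best->type))
            best = m;
    return best;
}
```

For a queue holding types 5, 2, 7, 2, 3 in that order, a selector of 0 picks the message of type 5, a selector of 7 picks the third message, and a selector of -3 picks the first of the two messages of type 2.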


Each message is placed into a dynamically allocated memory area. The beginning of this area stores the message header, which is a data structure of type msg; its fields are listed in Table 18-11. The message text is stored in the rest of the memory area. The msg_spot field of the message header contains the starting address of the message text, while the msg_ts field contains the length of the message text; this length cannot be longer than MSGMAX (usually 4056) bytes.

Table 18-11. The msg Structure

Type           Field      Description
struct msg *   msg_next   Next message in queue
long           msg_type   Message type
char *         msg_spot   Message text address
time_t         msg_stime  Time of msgsnd( )
short          msg_ts     Message text size

Finally, each message is linked to the next message in the queue through the msg_next field of its message header.

18.3.5 IPC Shared Memory

The most useful IPC mechanism is shared memory, which allows two or more processes to access some common data structures by placing them in a shared memory segment. Each process that wants to access the data structures included in a shared memory segment must add to its address space a new memory region (see Section 7.3 in Chapter 7), which maps the page frames associated with the shared memory segment. Such page frames can thus be easily handled by the kernel through demand paging (see Section 7.4.3 in Chapter 7).

As with semaphores and message queues, the shmget( ) function is invoked to get the IPC identifier of a shared memory segment, optionally creating it if it does not already exist.

The shmat( ) function is invoked to "attach" a shared memory segment to a process. It receives as its parameter the identifier of the IPC shared memory resource and tries to add a shared memory region to the address space of the calling process. The calling process can require a specific starting linear address for the memory region, but the address is usually unimportant, and each process accessing the shared memory segment can use a different address in its own address space. The process's page tables are left unchanged by shmat( ). We'll describe later what the kernel does when the process tries to access a page belonging to the new memory region.

The shmdt( ) function is invoked to "detach" a shared memory segment, specified by the starting linear address at which it was attached, that is, to remove the corresponding memory region from the process's address space. Recall that an IPC shared memory resource is persistent: even if no process is using it, the corresponding pages cannot be discarded, although they can be swapped out.

Figure 18-3 illustrates the main data structures used for implementing IPC shared memory. A statically allocated array shm_segs includes SHMMNI values (usually 128). Like the semary and msgque arrays, each element in shm_segs can have the value IPC_UNUSED or IPC_NOID or the address of an IPC shared memory segment descriptor.
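From User Mode, the three functions described above are typically combined as in the following sketch, which attaches one segment twice in the same process to show that two different addresses refer to the same data. The shm_roundtrip( ) name, the 4096-byte size, and the return-code convention are choices made for this example; note that the C library's shmdt( ) takes the attach address as its argument.

```c
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Returns 0 on success, 1 if the environment refuses System V shared
 * memory, and -1 if the two attachments do not show the same data. */
int shm_roundtrip(void)
{
    int rc = 1;
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id == -1)
        return 1;

    /* A NULL address lets the kernel choose where each region goes. */
    char *a = shmat(id, NULL, 0);
    char *b = shmat(id, NULL, 0);

    if (a != (char *)-1 && b != (char *)-1) {
        strcpy(a, "hello");                       /* write through one region */
        rc = (strcmp(b, "hello") == 0) ? 0 : -1;  /* read through the other */
    }

    if (a != (char *)-1)
        shmdt(a);
    if (b != (char *)-1)
        shmdt(b);

    /* The resource is persistent, so it must be removed explicitly. */
    shmctl(id, IPC_RMID, NULL);
    return rc;
}
```

In a real application the two attachments would normally belong to two different processes that obtained the same IPC identifier, for instance by calling shmget( ) with a shared key.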


Figure 18-3. IPC shared memory data structures

Each IPC shared memory segment descriptor is a shmid_kernel structure, whose fields are shown in Table 18-12. Some of the fields, which are accessible to User Mode processes, are included in a shmid_ds data structure named u inside the descriptor. Their contents can be accessed by means of the shmctl( ) function.

The u.shm_segsz and shm_npages fields store the size of the shared memory segment in bytes and in pages, respectively. Although User Mode processes can require a shared memory segment of any length, the length of the allocated segment is a multiple of the page size, since the kernel must map the segment with a memory region.

The shm_pages field points to an array that contains one element for each page of the segment. Each element stores a 32-bit value in the format of a Page Table entry (see Section 2.4.1 in Chapter 2). If a page frame is not currently allocated for the page, the element is 0. Otherwise, it is a regular Page Table entry containing the physical address of a page frame or a swapped-out page identifier.

Table 18-12. The Fields in the shmid_kernel Structure

Type                      Field         Description
struct ipc_perm           u.shm_perm    ipc_perm data structure
int                       u.shm_segsz   Size of shared memory region (bytes)
long                      u.shm_atime   Last attach time
long                      u.shm_dtime   Last detach time
long                      u.shm_ctime   Last change time
unsigned short            u.shm_cpid    PID of creator
unsigned short            u.shm_lpid    PID of last accessing process
unsigned short            u.shm_nattch  Number of current attaches
unsigned long             shm_npages    Size of shared memory region (pages)
unsigned long *           shm_pages     Pointer to array of page frame PTEs
struct vm_area_struct *   attaches      Pointer to VMA descriptor list

In the example illustrated in Figure 18-3, the segment is contained in five pages. Three of them have never been accessed, while the other two pages are stored in RAM.
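The relation between u.shm_segsz and shm_npages, and the three kinds of value a shm_pages element can hold, can be sketched as follows. The constants assume the i386 layout (4 KB pages, Present flag in bit 0 of a Page Table entry); the function and enumeration names are hypothetical.

```c
#define PAGE_SIZE_SKETCH 4096UL    /* assumption: i386 page size */
#define PTE_PRESENT      0x001UL   /* assumption: i386 Present flag */

enum shm_page_state {
    SHM_NEVER_ALLOCATED,   /* element is 0 */
    SHM_IN_RAM,            /* regular entry with Present flag set */
    SHM_SWAPPED_OUT        /* nonzero entry with Present flag clear */
};

/* Number of pages needed for a segment of segsz bytes, rounding up. */
unsigned long shm_pages_needed(unsigned long segsz)
{
    return segsz / PAGE_SIZE_SKETCH + (segsz % PAGE_SIZE_SKETCH != 0);
}

/* Classify one element of the array referenced by shm_pages. */
enum shm_page_state shm_entry_state(unsigned long entry)
{
    if (entry == 0)
        return SHM_NEVER_ALLOCATED;
    if (entry & PTE_PRESENT)
        return SHM_IN_RAM;
    return SHM_SWAPPED_OUT;
}
```

Under these assumptions a request for 20,000 bytes yields a five-page segment, like the one in the example above.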


The attaches field points to the first element of a doubly linked list that includes the vm_area_struct descriptors of all memory regions associated with the shared memory segment. The list is implemented by means of the vm_next_share and vm_pprev_share fields of the descriptors. The number of elements in the list is stored in the u.shm_nattch field. In Figure 18-3, the shared memory segment has been attached to the address space of two processes.

When mapping IPC shared memory segments, some fields of vm_area_struct descriptors have a special meaning:

vm_start and vm_end
    Delimit the linear address range of the memory region

vm_pte
    Stores the index of the shared memory segment in the shm_segs array

vm_ops
    Points to a table of memory region operations called shm_vm_ops

18.3.5.1 Demand paging for IPC shared memory segments

The pages added to a process by shmat( ) are dummy pages; the function adds a new memory region into a process's address space, but it doesn't modify the process's page tables. We can now explain how these pages become usable.

Because the shmat( ) function didn't modify the page tables, a "Page fault" occurs when a process tries to access a location of a shared memory segment. The corresponding exception handler determines that the faulty address is inside the process address space and that the corresponding Page Table entry is null; therefore, it invokes the do_no_page( ) function (see Section 7.4.3 in Chapter 7). In turn, this function checks whether the nopage method for the memory region is defined. If it is, the method is invoked, and the Page Table entry is set to the address returned from it (see Section 15.2.5 in Chapter 15).

Memory regions used for IPC shared memory always define the nopage method. It is implemented by the shm_nopage( ) function, which performs the following operations:

1. Extracts from the vm_pte field of the memory region descriptor the index of the shm_segs array corresponding to the shared memory segment.
2. Computes the logical page number inside the segment from the vm_start field of the memory region descriptor and the requested address.
3. Accesses the array referenced by the shm_pages field of the shmid_kernel descriptor of the segment and gets the entry corresponding to the page that includes the faulty address. Three cases are considered, depending on the value of the entry:
   o Null entry: no page frame was ever allocated to the page. In this case, allocates a new page frame and stores its Page Table entry in the shm_pages array.
   o Regular entry with Present flag set: the page is already stored in some page frame. Extracts its physical address from the entry in shm_pages.
   o Swapped-out page identifier: the page has been swapped out to disk. Allocates a new page frame, reads the page from disk, copies it into the page frame, and stores the new Page Table entry in the shm_pages array. Actually, the actions performed in this case correspond to the swap-in procedure for pages included in shared memory segments (described later).
4. Increments the usage counter of the page frame allocated or identified in the previous step.
5. Returns the physical address of the page frame.

The do_no_page( ) function sets the entry corresponding to the faulty address in the process's Page Table so that it points to the page frame returned by the method.

18.3.5.2 Swapping out pages of IPC shared memory segments

The kernel has to be careful when swapping out pages included in shared memory segments. Suppose that two processes P1 and P2 are accessing a page of a shared memory segment. Suppose also that the swap_out( ) function tries to free a page frame assigned to process P1 that is also shared with process P2 (see Section 16.5 in Chapter 16). According to the standard swap-out rules, the shared page should be copied to disk and then released, and a swapped-out page identifier should be written into the corresponding entry of P1's Page Table. However, this standard procedure doesn't work, because process P2 could try to access the page through its own page tables: since the corresponding Page Table entry still points to the released page frame, all sorts of data corruption could occur.

The try_to_swap_out( ) function (see Section 16.5.1 in Chapter 16) recognizes this special case by checking whether the memory region includes a swapout method. If it is defined, the page frame is not released to the Buddy system; its usage counter is simply decremented by 1, and the corresponding entry in P1's Page Table is cleared. The swapout method in shm_vm_ops is an empty function: the method must be non-null to let the kernel know the memory is shared, so it must point to some function, even if that function has nothing to do. P2 can safely access the page frame, since it still contains the page of the IPC shared memory.

Shared memory segments are persistent resources, like any IPC resource. This means that page frames of a shared memory segment no longer used by any process are still referenced by the shm_pages array. These page frames may be swapped out to disk by means of the shm_swap( ) function, which is periodically invoked by do_try_to_free_pages( ) (see Section 16.7.4 in Chapter 16). It iteratively scans all descriptors referenced by the shm_segs array and, for each descriptor, examines all page frames assigned to each segment. If it determines that the usage counter of some page frame is equal to 1, the corresponding page can be safely swapped out to disk. The swap-out procedure used is similar to that used for non-shared pages, except that the swapped-out page identifier is saved in the shm_pages array.

In conclusion, though shared memory pages are in process page tables, they are not handled by the swapping facility like other pages. The swapped-out page identifiers of such pages do not appear in the page table entries but in the shm_pages array. When a process attempts to address a swapped-out page, the null page table entry triggers a "Page fault" exception. The kernel retrieves the swapped-out page identifier in the shm_pages array and performs the swap-in.
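The counting rule behind this section can be modeled in a few lines. The sketch assumes that the shm_pages array itself accounts for one reference to every allocated page frame, which is what would make "usage counter equal to 1" mean "no process currently maps the page"; that reading, and all names below, are assumptions made for illustration rather than the kernel's actual code.

```c
#include <stdbool.h>

/* Hypothetical model of one page frame of a shared memory segment. */
struct frame_sketch {
    int count;   /* usage counter; 1 right after allocation (assumed) */
};

/* Step 4 of shm_nopage( ): a faulting process gains a mapping. */
void map_into_process(struct frame_sketch *f)
{
    f->count++;
}

/* try_to_swap_out( ) on a region that defines swapout: the process's
 * Page Table entry is cleared and the counter drops by 1, but the
 * frame itself is not released. */
void unmap_from_process(struct frame_sketch *f)
{
    f->count--;
}

/* shm_swap( ): the page may go to disk only when no process maps it. */
bool can_swap_to_disk(const struct frame_sketch *f)
{
    return f->count == 1;
}
```

With P1 and P2 both mapping the page the counter is 3; removing P1's mapping brings it to 2 and P2 keeps working on the same frame; only after P2's mapping is also removed does the counter reach 1 and the page become a candidate for shm_swap( ).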


18.4 Anticipating Linux 2.4

Static arrays used to represent semaphores and messages have been removed and replaced by dynamic data structures. Larger IPC messages can now be handled.

IPC shared memory regions are implemented in a different way: a new /proc filesystem, denoted as sysvipc, has been introduced. It currently includes only one directory called shm containing a virtual file for each IPC shared memory region.


Chapter 19. Program Execution

The concept of a "process," described in Chapter 3, was used in Unix from the beginning to represent the behavior of groups of running programs that compete for system resources. This final chapter focuses on the relationship between program and process. We'll specifically describe how the kernel sets up the execution context for a process according to the contents of the program file. While it may not seem like a big problem to load a bunch of instructions in memory and point the CPU to them, the kernel has to deal with flexibility in several areas:

Different executable formats
    Linux is distinguished by its ability to run binaries that were compiled for other operating systems.

Shared libraries
    Many executable files don't contain all the code required to run the program but expect the kernel to load in functions from a library at runtime.

Other information in the execution context
    This includes the command-line arguments and environment variables familiar to programmers.

A program is stored on disk as an executable file, which includes both the object code of the functions to be executed and the data on which such functions will act. Many functions of the program are service routines available to all programmers; their object code is included in special files called "libraries." Actually, the code of a library function may either be statically copied in the executable file (static libraries), or be linked to the process at run time (shared libraries, since their code can be shared by several independent processes).

When launching a program, the user may supply two kinds of information that affect the way it is executed: command-line arguments and environment variables. Command-line arguments are typed in by the user following the executable filename at the shell prompt. Environment variables, such as HOME and PATH, are inherited from the shell, but the users may modify the values of any such variables before they launch the program.

In Section 19.1 we explain what a program execution context is. In Section 19.2 we mention some of the executable formats supported by Linux and show how Linux can change its "personality" so as to execute programs compiled for other operating systems. Finally, in Section 19.4, we describe the system call that allows a process to start executing a new program.

19.1 Executable Files

Chapter 1 defined a process as an "execution context." By this we mean the collection of information needed to carry on a specific computation; it includes the pages accessed, the open files, the hardware register contents, and so on. An executable file is a regular file that describes how to initialize a new execution context, i.e., how to start a new computation.


<strong>Understanding</strong> the <strong>Linux</strong> <strong>Kernel</strong>Suppose a user wants to list the fil<strong>es</strong> in the current directory: he knows that this r<strong>es</strong>ult can b<strong>es</strong>imply achieved by typing the filename of the /bin/ls [1] external command at the shell prompt.<strong>The</strong> command shell forks a new proc<strong>es</strong>s, which in turn invok<strong>es</strong> an execve( ) system call (seeSection 19.4 later in this chapter), passing as one of its parameters a string including the fullpathname for the ls executable file, /bin/ls in this case. <strong>The</strong> sys_execve( ) service routinefinds the corr<strong>es</strong>ponding file, checks the executable format, and modifi<strong>es</strong> the execution contextof the current proc<strong>es</strong>s according to the information stored in it. As a r<strong>es</strong>ult, when the systemcall terminat<strong>es</strong>, the proc<strong>es</strong>s starts executing the code stored in the executable file, whichperforms the directory listing.[1] <strong>The</strong> pathnam<strong>es</strong> of executable fil<strong>es</strong> are not fixed in <strong>Linux</strong>; they depend on the distribution used. Several standard naming schem<strong>es</strong> such as FHS andFSSTND have been proposed for all Unix systems.When a proc<strong>es</strong>s starts running a new program, its execution context chang<strong>es</strong> drastically sincemost of the r<strong>es</strong>ourc<strong>es</strong> obtained during the proc<strong>es</strong>s's previous computations are discarded. 
In the preceding example, when the process starts executing /bin/ls, it replaces the shell's arguments with new ones passed as parameters in the execve( ) system call and acquires a new shell environment (see Section 19.1.2); all pages inherited from the parent (and shared with the Copy On Write mechanism) are released, so that the new computation starts with a fresh User Mode address space; even the privileges of the process could change (see Section 19.1.1). However, the process PID doesn't change, and the new computation inherits from the previous one all open file descriptors that have not been closed automatically while executing the execve( ) system call. [2]

[2] By default, a file already opened by a process stays open after issuing an execve( ) system call. However, the file will be automatically closed if the process has set the corresponding bit in the close_on_exec field of the files_struct structure (see Table 12-6 in Chapter 12); this is done by means of the fcntl( ) system call.

19.1.1 Process Credentials and Capabilities

Traditionally, Unix systems associate with each process some credentials, which bind the process to a specific user and a specific user group. Credentials are important on multiuser systems because they determine what each process can or cannot do, thus preserving both the integrity of each user's personal data and the stability of the system as a whole.

The use of credentials requires support both in the process data structure and in the resources being protected.
One obvious resource is a file. Thus, in the Ext2 filesystem, each file is owned by a specific user and is bound to some group of users. The owner of a file may decide what kind of operations are allowed on that file, distinguishing among herself, the file's user group, and all other users. When some process tries to access a file, the VFS always checks whether the access is legal, according to the permissions established by the file owner and the process credentials.

The process's credentials are stored in several fields of the process descriptor, listed in Table 19-1. These fields contain identifiers of users and user groups in the system, which are usually compared with the corresponding identifiers stored in the inodes of the files being accessed.
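The comparison between process credentials and the identifiers stored in a file's inode can be sketched as follows. This is a simplified illustration under stated assumptions, not the kernel's code: the structure and function names are hypothetical, supplementary groups are ignored, and a process whose file-access UID is 0 is assumed to pass every check.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Requested access, encoded like one rwx permission triplet. */
enum { WANT_EXEC = 1, WANT_WRITE = 2, WANT_READ = 4 };

/* Hypothetical subset of the ownership data kept in an inode. */
struct file_owner {
    uid_t  uid;    /* owner of the file */
    gid_t  gid;    /* group of the file */
    mode_t mode;   /* permission bits, e.g. 0640 */
};

/* Hypothetical subset of the credentials used for file access. */
struct proc_cred {
    uid_t fsuid;
    gid_t fsgid;
};

/* Return true if the process may perform all accesses in `want`. */
bool access_allowed(const struct proc_cred *c,
                    const struct file_owner *f, unsigned want)
{
    unsigned bits;

    if (c->fsuid == 0)
        return true;                    /* assumption: superuser bypass */

    if (c->fsuid == f->uid)
        bits = (f->mode >> 6) & 7;      /* owner triplet */
    else if (c->fsgid == f->gid)
        bits = (f->mode >> 3) & 7;      /* group triplet */
    else
        bits = f->mode & 7;             /* all other users */

    return (bits & want) == want;
}
```

With a file of mode 0640, for example, the owner may read and write it, a member of the file's group may only read it, and every other non-superuser process is denied.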


Table 19-1. Traditional Process Credentials

Name            Description
uid, gid        User and group real identifiers
euid, egid      User and group effective identifiers
fsuid, fsgid    User and group effective identifiers for file access
groups          Supplementary group identifiers
suid, sgid      User and group saved identifiers

A null UID (that is, a UID equal to 0) specifies the root superuser, while a null GID specifies the root super-group. The kernel always allows a process to do anything whenever the process credential concerned stores a null value. Therefore, process credentials can also be used for checking non-file-related operations, like those referring to system administration or hardware manipulation: if the UID stored in some process credential is null, the operation is allowed; otherwise, it is denied.

When a process is created, it always inherits the credentials of its parent. However, these credentials can be modified later, either when the process starts executing a new program or when it issues suitable system calls. Usually, the uid, euid, fsuid, and suid fields of a process contain the same value. However, when the process executes a setuid program, that is, an executable file whose setuid flag is on, the euid and fsuid fields are set to the identifier of the file's owner. Almost all checks involve one of these two fields: fsuid is used for file-related operations, while euid is used for all other operations.
Similar considerations apply to the gid, egid, fsgid, and sgid fields that refer to group identifiers.

As an illustration of how the fsuid field is used, consider the common situation when a user wants to change her password. All passwords are stored in a common file, but she cannot directly edit such a file because it is protected. Therefore, she invokes a system program named /usr/bin/passwd, which has the setuid flag set and whose owner is the superuser. When the process forked by the shell executes such a program, its euid and fsuid fields are set to 0, that is, to the UID of the superuser. Now the process can access the file, since, when the kernel performs the access control, it finds a 0 value in fsuid. Of course, the /usr/bin/passwd program does not allow the user to do anything but change her own password.

Unix's long history teaches the lesson that setuid programs are quite dangerous: malicious users could trigger some programming errors (bugs) in the code in such a way as to force setuid programs to perform operations that were never planned by the program's original designers. Often, the entire system's security can be compromised. In order to minimize such risks, Linux, like all modern Unix systems, allows processes to acquire setuid privileges only when necessary, and drop them when they are no longer needed. This feature may turn out to be useful when implementing user applications with several protection levels. The process descriptor includes an suid field, which stores the values of the effective identifiers (euid and fsuid) right after the execution of the setuid program.
The process can change the effective identifiers by means of the setuid( ), setresuid( ), setfsuid( ), and setreuid( ) system calls. [3]

[3] GID effective credentials can be changed by issuing the corresponding setgid( ), setresgid( ), setfsgid( ), and setregid( ) system calls.

Table 19-2 shows how these system calls affect the process's credentials. Be warned that, if the calling process does not already have superuser privileges, that is, if its euid field is not null, these system calls can be used only to set values already included in the process's credential fields. For instance, an average user process can force the value 500 into its fsuid field by invoking the setfsuid( ) system call, but only if one of the other credential fields already stores the same value of 500.

Table 19-2. Semantics of the System Calls that Set Process Credentials

Field    setuid(e)     setuid(e)     setresuid(u,e,s)   setreuid(u,e)   setfsuid(f)
         (euid = 0)    (euid ≠ 0)
uid      Set to e      Unchanged     Set to u           Set to u        Unchanged
euid     Set to e      Set to e      Set to e           Set to e        Unchanged
fsuid    Set to e      Set to e      Set to e           Set to e        Set to f
suid     Set to e      Unchanged     Set to s           Set to e        Unchanged

To understand the sometimes complex relationships among the four user ID fields, consider for a moment the effects of the setuid( ) system call. The actions are different depending on whether the calling process's euid field is set to 0 (that is, the process has superuser privileges) or to a normal UID.

If the euid field is null, the system call sets all credential fields of the calling process (uid, euid, fsuid, and suid) to the value of the parameter e. A superuser process can thus drop its privileges and become a process owned by a normal user.
This happens, for instance, when a user logs in: the system forks a new process with superuser privileges, but the process drops its privileges by invoking the setuid( ) system call and then starts executing the user's login shell program.

If the euid field is not null, the system call modifies only the value stored in euid and fsuid, leaving the other two fields unchanged. This allows a process executing a setuid program to have its effective privileges stored in euid and fsuid set alternately to uid (the process acts as the user who launched the executable file) and to suid (the process acts as the user who owns the executable file).

19.1.1.1 Process capabilities

Linux is moving toward another model of process credentials based on the notion of "capabilities." A capability is simply a flag that asserts whether the process is allowed to perform a specific operation or a specific class of operations. This model is different from the traditional "superuser versus normal user" model in which a process can either do everything or do nothing, depending on its effective UID. As illustrated in Table 19-3, several capabilities have already been included in the Linux kernel.
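The difference between the two models can be made concrete with a small sketch. It is only an illustration under stated assumptions: the type, the function names, and the bit numbers below are hypothetical and are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit numbers, chosen for illustration only. */
enum demo_cap {
    DEMO_CAP_KILL     = 0,
    DEMO_CAP_SYS_NICE = 1,
    DEMO_CAP_SYS_TIME = 2
};

typedef uint32_t demo_cap_set;   /* one bit per capability */

/* Traditional model: everything or nothing, decided by the effective UID. */
bool demo_traditional_allowed(unsigned euid)
{
    return euid == 0;
}

/* Capability model: an operation is allowed only if its own flag is set. */
bool demo_capable(demo_cap_set caps, enum demo_cap cap)
{
    return (caps & ((demo_cap_set)1 << cap)) != 0;
}
```

A process whose set contains only the DEMO_CAP_SYS_TIME bit passes the clock-related check and fails every other one, whereas in the traditional model an effective UID of 0 passes them all.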


Table 19-3. Linux Capabilities

Name                    Description
CAP_CHOWN               Ignore restrictions on file and group ownership changes.
CAP_DAC_OVERRIDE        Ignore file access permissions.
CAP_DAC_READ_SEARCH     Ignore file/directory read and search permissions.
CAP_FOWNER              Ignore restrictions on file ownership.
CAP_FSETID              Ignore restrictions on setuid and setgid flags.
CAP_KILL                Ignore restrictions on sending signals.
CAP_SETGID              Allow setgid flag manipulations.
CAP_SETUID              Allow setuid flag manipulations.
CAP_SETPCAP             Transfer/remove permitted capabilities to other processes.
CAP_LINUX_IMMUTABLE     Allow modification of append-only and immutable files.
CAP_NET_BIND_SERVICE    Allow binding to TCP/UDP sockets below 1024.
CAP_NET_BROADCAST       Allow network broadcasting and listening to multicast.
CAP_NET_ADMIN           Allow general networking administration.
CAP_NET_RAW             Allow use of RAW and PACKET sockets.
CAP_IPC_LOCK            Allow locking of pages and shared memory segments.
CAP_IPC_OWNER           Skip IPC ownership checks.
CAP_SYS_MODULE          Allow inserting and removing of kernel modules.
CAP_SYS_RAWIO           Allow access to I/O ports through ioperm( ) and iopl( ).
CAP_SYS_CHROOT          Allow use of chroot( ).
CAP_SYS_PTRACE          Allow use of ptrace( ) on any process.
CAP_SYS_PACCT           Allow configuration of process accounting.
CAP_SYS_ADMIN           Allow general system administration.
CAP_SYS_BOOT            Allow use of reboot( ).
CAP_SYS_NICE            Ignore restrictions on nice( ).
CAP_SYS_RESOURCE        Ignore restrictions on the usage of several resources.
CAP_SYS_TIME            Allow manipulation of system clock and real-time clock.
CAP_SYS_TTY_CONFIG      Allow configuration of tty devices.

The main advantage of capabilities is that, at any time, each program needs a limited number of them. Consequently, even if a malicious user discovers a way to exploit a buggy program, she can illegally perform only a limited number of operation types.

Assume, for instance, that a buggy program has only the CAP_SYS_TIME capability. In this case, the malicious user who discovers an exploitation of the bug can succeed only in illegally changing the real-time and the system clock. She won't be able to perform any other kind of privileged operations.

A process can explicitly get and set its capabilities by using, respectively, the capget( ) and capset( ) system calls. However, neither the VFS nor the Ext2 filesystem currently supports the capability model, so there is no way to associate an executable file with the set of capabilities that should be enforced when a process executes that file. Therefore, capabilities are useless for Linux 2.2 end users, although we can easily predict that the situation will change very soon.

In fact, the Linux kernel already takes capabilities into account. Let us consider, for instance, the nice( ) system call, which allows users to change the static priority of a process. In the traditional model, only the superuser can raise a priority: the kernel should thus check whether the euid field in the descriptor of the calling process is set to 0. However, the Linux kernel defines a capability called CAP_SYS_NICE, which corresponds exactly to this kind of operation. The kernel checks the value of this flag by invoking the capable( ) function and by passing the CAP_SYS_NICE value to it.

This approach works thanks to some "compatibility hacks" that have been added to the kernel code: each time a process sets the euid and fsuid fields to 0 (either by invoking one of the system calls listed in Table 19-2 or by executing a setuid program owned by the superuser), the kernel sets all process capabilities, so that all checks will succeed. Similarly, when the process resets the euid and fsuid fields to the real UID of the process owner, the kernel drops all capabilities.

19.1.2 Command-Line Arguments and Shell Environment

When a user types a command, the program loaded to satisfy the request may receive some command-line arguments from the shell. For example, when a user types the command:

    $ ls -l /usr/bin

in order to get a full listing of the files in the /usr/bin directory, the shell process creates a new process to execute the command. This new process loads the /bin/ls executable file. In doing so, most of the execution context inherited from the shell is lost, but the three separate arguments ls, -l, and /usr/bin are kept.
Generally, the new process may receive any number of arguments.

The conventions for passing the command-line arguments depend on the high-level language used. In the C language, the main( ) function of a program may receive as parameters an integer specifying how many arguments have been passed to the program and the address of an array of pointers to strings. The following prototype formalizes this standard:

    int main(int argc, char *argv[])

Going back to the previous example, when the /bin/ls program is invoked, argc has the value 3, argv[0] points to the ls string, argv[1] points to the -l string, and argv[2] points to the /usr/bin string. The end of the argv array is always marked by a null pointer, so argv[3] contains NULL.

A third optional parameter that may be passed in the C language to the main( ) function is the parameter containing environment variables. When the program uses it, main( ) must be declared as follows:

    int main(int argc, char *argv[], char *envp[])

The envp parameter points to an array of pointers to environment strings of the form:

    VAR_NAME=something

where VAR_NAME represents the name of an environment variable, while the substring following the = delimiter represents the actual value assigned to the variable. The end of the envp array is marked by a null pointer, like the argv array. Environment variables are used to customize the execution context of a process, to provide general information to a user or other processes, or to allow a process to keep some information across an execve( ) system call.

Command-line arguments and environment strings are placed on the User Mode stack, right before the return address (see Section 8.2.3 in Chapter 8). The bottom locations of the User Mode stack are illustrated in Figure 19-1. Notice that the environment variables are located near the bottom of the stack right after a null long integer.

Figure 19-1. The bottom locations of the User Mode stack

19.1.3 Libraries

Each high-level source code file is transformed through several steps into an object file, which contains the machine code of the assembly language instructions corresponding to the high-level instructions. An object file cannot be executed, since it does not contain the linear address that corresponds to each reference to a name of a global symbol external to the source code file, such as functions in libraries or other source code files of the same program. The assigning, or resolution, of such addresses is performed by the linker, which collects together all the object files of the program and constructs the executable file.
The linker also analyzes the library's functions used by the program and glues them into the executable file in a manner described later in this chapter.

Any program, even the most trivial one, makes use of C libraries. Consider, for instance, the following one-line C program:

    int main(void) { return 0; }

Although this program does not compute anything, a lot of work is needed to set up the execution environment (see Section 19.4 later in this chapter) and to kill the process when the program terminates (see Section 3.4 in Chapter 3). In particular, when the main( ) function terminates, the C compiler inserts an exit( ) system call in the object code.
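Conceptually, the code surrounding main( ) behaves as if it called main( ) and then passed the returned value to the exit routine; in a hosted C environment, returning from main( ) has the same effect as calling exit( ) with that value. The sketch below models only this idea. All names in it are hypothetical, and the exit routine is passed in as a function pointer so that the sketch can be exercised without ending the process.

```c
#include <stddef.h>

/* Hypothetical types standing in for main( ) and for the exit routine. */
typedef int  (*demo_main_fn)(int argc, char *argv[], char *envp[]);
typedef void (*demo_exit_fn)(int status);

/* Call the program's entry point, then hand its result to the exit routine. */
void demo_startup(demo_main_fn entry, demo_exit_fn terminate,
                  int argc, char *argv[], char *envp[])
{
    int status = entry(argc, argv, envp);
    terminate(status);
}

/* Stand-ins used to try the sketch: they record the status instead of exiting. */
static int demo_last_status = -1;

static void demo_record_exit(int status)
{
    demo_last_status = status;
}

static int demo_trivial_main(int argc, char *argv[], char *envp[])
{
    (void)argv;
    (void)envp;
    return argc;   /* use the argument count as the exit status */
}
```

Calling demo_startup( ) with demo_trivial_main and demo_record_exit leaves the value returned by the entry point in demo_last_status, which is the role the exit status plays when a real program terminates.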


We know from Chapter 8 that programs usually invoke system calls through wrapper routines in the C library. This holds for the C compiler too: besides including the code directly generated by compiling the program's statements, any executable file also includes some "glue" code to handle the interactions of the User Mode process with the kernel. Portions of such glue code are stored in the C library.

Many other libraries of functions, besides the C library, are included in Unix systems. A generic Linux system could easily have 50 different libraries. Just to mention a couple of them: the math library libm includes basic functions for floating point operations, while the X11 library libX11 collects together the basic low-level functions for the X11 Window System graphics interface.

All executable files in traditional Unix systems were based on static libraries. This means that the executable file produced by the linker includes not only the code of the original program but also the code of the library functions that the program refers to.

Static libraries have one big disadvantage: they eat lots of space on disk. Indeed, each statically linked executable file duplicates some portion of library code.

Modern Unix systems make use of shared libraries. The executable file does not contain the library object code, but only a reference to the library name.
When the program is loaded in memory for execution, a suitable program called the program interpreter takes care of analyzing the library names in the executable file, locating the library in the system's directory tree and making the requested code available to the executing process.

Shared libraries are especially convenient on systems that provide file memory mapping, since they reduce the amount of main memory requested for executing a program. When the program interpreter must link some shared library to a process, it does not copy the object code, but just performs a memory mapping of the relevant portion of the library file into the process's address space. This allows the page frames containing the machine code of the library to be shared among all processes that are using the same code.

Shared libraries also have some disadvantages. The startup time of a dynamically linked program is usually much longer than that of a statically linked one. Moreover, dynamically linked programs are not as portable as statically linked ones, since they may not execute properly in systems that include a different version of the same library.

A user may always require a program to be linked statically.
The GCC compiler offers, for instance, the -static option, which tells the linker to use the static libraries instead of the shared ones.

19.1.4 Program Segments and Process Memory Regions

The linear address space of a Unix program is traditionally partitioned, from a logical point of view, in several linear address intervals called segments: [4]

[4] The word "segment" has historical roots, since the first Unix systems implemented each linear address interval with a different segment register. Linux, however, does not rely on the segmentation mechanism of the Intel microprocessors to implement program segments.


Text segment
Includes the executable code

Data segment
Contains the initialized data, that is, the static variables and the global variables whose initial values are stored in the executable file (because the program must know their values at startup)

bss segment
Contains the uninitialized data, that is, all global variables whose initial values are not stored in the executable file (because the program sets the values before referencing them)

Stack segment
Contains the program stack, which includes the return addresses, parameters, and local variables of the functions being executed

Each mm_struct memory descriptor (see Section 7.2 in Chapter 7) includes some fields that identify the role of particular memory regions of the corresponding process:

start_code, end_code
Store the initial and final linear addresses of the memory region that includes the native code of the program, that is, the code in the executable file. Since the text segment includes shared libraries but the executable file does not, the memory region demarcated by these fields is a subset of the text segment.

start_data, end_data
Store the initial and final linear addresses of the memory region that includes the native initialized data of the program, as specified in the executable file. The fields identify a memory region that roughly corresponds to the data segment.
Actually, start_data should almost always be set to the address of the first page right after end_code, and thus the field is unused. The end_data field is used, though.

start_brk, brk
Store the initial and final linear addresses of the memory region that includes the dynamically allocated memory areas of the process (see Section 7.6 in Chapter 7). This memory region is sometimes called the heap.

start_stack
Stores the address right above that of main( )'s return address; as illustrated in Figure 19-1, higher addresses are reserved (recall that stacks grow toward lower addresses).


arg_start, arg_end
Store the initial and final addresses of the stack portion containing the command-line arguments.

env_start, env_end
Store the initial and final addresses of the stack portion containing the environment strings.

Notice that shared libraries and file memory mapping have made the classification of the process's address space based on program segments a bit obsolete, since each of the shared libraries is mapped into a different memory region from the ones discussed in the preceding list.

Now we'll describe, by means of a simple example, how the Linux kernel maps shared libraries into the process's address space. We assume as usual that the User Mode address space ranges from 0x00000000 to 0xbfffffff. We consider the /sbin/init program, which creates and monitors the activity of all the processes that implement the outer layers of the operating system (see Section 3.3.2 in Chapter 3). The memory regions of the corresponding init process are shown in Table 19-4 (such information can be obtained from the /proc/1/maps file). Notice that all regions listed are implemented by means of private memory mappings (the letter p in the Permissions column). This is not surprising: these memory regions exist only to provide data to a process; while executing instructions, a process may modify the contents of these memory regions but the files on disk associated with them stay unchanged. This is precisely how private memory mappings act.

Table 19-4. Memory Regions of the init Process

Address Range            Permissions   Mapped File
0x08048000-0x0804cfff    r-xp          /sbin/init at offset 0
0x0804d000-0x0804dfff    rw-p          /sbin/init at offset 0x4000
0x0804e000-0x0804efff    rwxp          Anonymous
0x40000000-0x40005fff    r-xp          /lib/ld-linux.so.1.9.9 at offset 0
0x40006000-0x40006fff    rw-p          /lib/ld-linux.so.1.9.9 at offset 0x5000
0x40007000-0x40007fff    rw-p          Anonymous
0x4000b000-0x40092fff    r-xp          /lib/libc.so.5.4.46 at offset 0
0x40093000-0x40098fff    rw-p          /lib/libc.so.5.4.46 at offset 0x87000
0x40099000-0x400cafff    rw-p          Anonymous
0xbfffd000-0xbfffffff    rwxp          Anonymous

The memory region starting from 0x8048000 is a memory mapping associated with the portion of the /sbin/init file ranging from byte 0 to byte 20479 (only the start and end of the region are shown in the /proc/1/maps file, but the region size can easily be derived from them: 0x0804cfff - 0x08048000 + 1 = 0x5000, that is, 20480 bytes). The permissions specify that the region is executable (it contains object code), read only (it's not writable, because the instructions don't change during a run), and private, so we can guess that the region maps the text segment of the program.

The memory region starting from 0x804d000 is a memory mapping associated with another portion of /sbin/init ranging from byte 16384 (corresponding to offset 0x4000 shown in Table 19-4) to 20479. Since the permissions specify that the private region may be written, we can conclude that it maps the data segment of the program.

The next one-page memory region starting from 0x0804e000 is anonymous, that is, it is not associated with any file. It is probably associated with the bss segment of init.

Similarly, the next three memory regions starting from 0x40000000, 0x40006000, and 0x40007000 correspond to the text segment, the data segment, and the bss segment, respectively, of the /lib/ld-linux.so.1.9.9 program, which actually is the program interpreter for the ELF shared libraries. The program interpreter is never executed alone: it is always memory-mapped inside the address space of a process executing another program.

On this system, the C library happens to be stored in the /lib/libc.so.5.4.46 file. The text segment, data segment, and bss segment of the C library are mapped into the next three memory regions starting from address 0x4000b000. Remember that page frames included in private regions can be shared among several processes with the Copy On Write mechanism, as long as they are not modified. Thus, since the text segment is read only, the page frames containing the executable code of the C library are shared among almost all currently executing processes (all except the statically linked ones).

Finally, the last anonymous memory region from 0xbfffd000 to 0xbfffffff is associated with the User Mode stack.
We have already explained in Section 7.4 in Chapter 7 how the stack is automatically expanded toward lower addresses whenever necessary.

19.1.5 Execution Tracing

Execution tracing is a technique that allows a program to monitor the execution of another program. The traced program can be executed step-by-step, until a signal is received, or until a system call is invoked. Execution tracing is widely used by debuggers, together with other techniques like the insertion of breakpoints in the debugged program and run-time access to its variables. As usual, we'll focus on how the kernel supports execution tracing rather than discussing how debuggers work.

In Linux, execution tracing is performed through the ptrace( ) system call, which can handle the commands listed in Table 19-5. Processes having the CAP_SYS_PTRACE capability flag set are allowed to trace any process in the system except init. Conversely, a process P with no CAP_SYS_PTRACE capability is allowed to trace only processes having the same owner as P. Moreover, a process cannot be traced by two processes at the same time.
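The three rules just stated can be summarized in a few lines of code. This is a simplified sketch under stated assumptions, not the kernel's implementation: the structure and the function are hypothetical, init is identified by PID 1, and init is assumed to be untraceable by every process, whether or not the tracer holds the capability.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a process for this sketch. */
struct demo_proc {
    int      pid;
    unsigned uid;              /* owner of the process */
    bool     cap_sys_ptrace;   /* holds the CAP_SYS_PTRACE capability */
    bool     traced;           /* already traced by some process */
};

/* Return true if `tracer` may start tracing `target`. */
bool demo_may_trace(const struct demo_proc *tracer,
                    const struct demo_proc *target)
{
    if (target->pid == 1)
        return false;          /* assumption: init is never traced */
    if (target->traced)
        return false;          /* at most one tracer at a time */
    if (tracer->cap_sys_ptrace)
        return true;           /* capability holder: any other process */
    return tracer->uid == target->uid;   /* otherwise: same owner only */
}
```

A debugger run by an ordinary user is therefore limited to that user's own processes, while a process holding the capability may attach to any process other than init that is not already being traced.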


Table 19-5. The ptrace Commands

Command             Description
PTRACE_TRACEME      Start execution tracing for the current process
PTRACE_ATTACH       Start execution tracing for another process
PTRACE_DETACH       Terminate execution tracing
PTRACE_KILL         Kill the traced process
PTRACE_CONT         Resume execution
PTRACE_SYSCALL      Resume execution until the next system call boundary
PTRACE_SINGLESTEP   Resume execution for a single assembly instruction
PTRACE_PEEKTEXT     Read a 32-bit value from the text segment
PTRACE_PEEKDATA     Read a 32-bit value from the data segment
PTRACE_POKETEXT     Write a 32-bit value into the text segment
PTRACE_POKEDATA     Write a 32-bit value into the data segment
PTRACE_PEEKUSR      Read the CPU's normal and debug registers
PTRACE_POKEUSR      Write the CPU's normal and debug registers
PTRACE_GETREGS      Read privileged CPU's registers
PTRACE_SETREGS      Write privileged CPU's registers
PTRACE_GETFPREGS    Read floating-point registers
PTRACE_SETFPREGS    Write floating-point registers

The ptrace( ) system call modifies the p_pptr field in the descriptor of the traced process so that it points to the tracing process; therefore, the tracing process becomes the effective parent of the traced one.
When execution tracing terminates, that is, when ptrace( ) is invoked with the PTRACE_DETACH command, the system call sets p_pptr to the value of p_opptr, thus restoring the original parent of the traced process (see Section 3.1.3 in Chapter 3).

Several monitored events can be associated with a traced program:

• End of execution of a single assembly instruction
• Entering a system call
• Exiting from a system call
• Receiving a signal

When a monitored event occurs, the traced program is stopped and a SIGCHLD signal is sent to its parent. When the parent wishes to resume the child's execution, it can use one of the PTRACE_CONT, PTRACE_SINGLESTEP, and PTRACE_SYSCALL commands, depending on the kind of event it wants to monitor.

The PTRACE_CONT command just resumes execution: the child will execute until it receives another signal. This kind of tracing is implemented by means of the PF_PTRACED flag in the process descriptor, which is checked by the do_signal( ) function (see Section 9.3 in Chapter 9).

The PTRACE_SINGLESTEP command forces the child process to execute the next assembly language instruction, then stops it again. This kind of tracing is implemented on Intel-based machines by means of the TF trap flag in the eflags register: when it is on, a "Debug" exception is raised right after any assembly language instruction. The corresponding exception handler just clears the flag, forces the current process to stop, and sends a SIGCHLD signal to its parent. Notice that setting the TF flag is not a privileged operation, thus User Mode processes can force single-step execution even without the ptrace( ) system call. The kernel checks the PF_DTRACE flag in the process descriptor to keep track of whether the child process is being single-stepped through ptrace( ).

The PTRACE_SYSCALL command causes the traced process to resume execution until a system call is invoked. The process is stopped twice, the first time when the system call starts, and the second time when the system call terminates. This kind of tracing is implemented by means of the PF_TRACESYS flag in the process descriptor, which is checked in the system_call( ) assembly language function (see Section 8.2.2 in Chapter 8).

A process can also be traced using some debugging features of the Intel Pentium processors. For example, the parent could set the values of the dr0, ..., dr7 debug registers for the child by using the PTRACE_POKEUSR command. When a monitored event occurs, the CPU raises the "Debug" exception; the exception handler can then suspend the traced process and send the SIGCHLD signal to the parent.

19.2 Executable Formats

The official Linux executable format is named ELF (Executable and Linking Format): it was developed by Unix System Laboratories and is quite popular in the Unix world.
Several well-known Unix operating systems such as System V Release 4 and Sun's Solaris 2 have adopted ELF as their main executable format.

Older Linux versions supported another format named a.out (Assembler OUTput Format); actually, there were several versions of that format floating around the Unix world. It is little used nowadays, since ELF is much more practical.

Linux supports many other different formats for executable files; in this way, it can run programs compiled for other operating systems like MS-DOS EXE programs, or Unix BSD's COFF executables. A few executable formats, like Java or bash scripts, are platform independent.

An executable format is described by an object of type linux_binfmt, which essentially provides three methods:

load_binary
    Sets up a new execution environment for the current process by reading the information stored in an executable file.

load_shlib
    Used to dynamically bind a shared library to an already running process; it is activated by the uselib( ) system call.

core_dump
    Stores the execution context of the current process in a file named core. This file, whose format depends on the type of executable of the program being executed, is usually created when a process receives a signal whose default action is "dump" (see Section 9.1.1 in Chapter 9).

All linux_binfmt objects are included in a singly linked list, and the address of the first element is stored in the formats variable. Elements can be inserted and removed in the list by invoking the register_binfmt( ) and unregister_binfmt( ) functions. The register_binfmt( ) function is executed during system startup for each executable format compiled into the kernel. This function is also executed when a module implementing a new executable format is being loaded, while the unregister_binfmt( ) function is invoked when the module is unloaded.

The last element in the formats list is always an object describing the executable format for interpreted scripts. This format defines only the load_binary method. The corresponding do_load_script( ) function checks whether the executable file starts with the #! pair of characters. If so, it interprets the rest of the first line as the pathname of another executable file and tries to execute it by passing the name of the script file as a parameter. [5]

[5] It is possible to execute a script file even if it doesn't start with the #! characters, as long as the file is written in the language recognized by the user's shell. In this case, however, the script is interpreted by the shell on which the user types the command, and thus the kernel is not directly involved.

Linux allows users to register their own custom executable formats.
Each such format may be recognized either by means of a magic number stored in the first 128 bytes of the file, or by a filename extension that identifies the file type. As an example, MS-DOS extensions consist of three characters separated from the filename by a dot: the .exe extension identifies executable programs while the .bat extension identifies shell scripts.

Each custom format is associated with an interpreter program, which is automatically invoked by the kernel with the original custom executable filename as a parameter. The mechanism is similar to the script's format, but it's more powerful since it doesn't impose any restrictions on the custom format. To register a new format, the user writes into the /proc/sys/fs/binfmt_misc/register file a string having the following format:

    :name:type:offset:string:mask:interpreter:

where each field has the following meaning:

name
    An identifier for the new format

type
    The type of recognition (M for magic number, E for extension)

offset
    The starting offset of the magic number inside the file

string
    The byte sequence to be matched either in the magic number or in the extension

mask
    String to mask out some bits in string

interpreter
    The full pathname of the program interpreter

As an example, the following command performed by the superuser will enable the kernel to recognize the Microsoft Windows executable format:

    $ echo ':DOSWin:M:0:MZ:0xff:/usr/local/bin/wine:' > \
    /proc/sys/fs/binfmt_misc/register

A Windows executable file has the MZ magic number in the first two bytes, and it will be executed by the /usr/local/bin/wine program interpreter.

19.3 Execution Domains

As mentioned in Chapter 1, a neat feature of Linux is its ability to execute files compiled for other operating systems. Of course, this is possible only if the files include machine code for the same computer architecture on which the kernel is running. Two kinds of support are offered for these "foreign" programs:

• Emulated execution: necessary to execute programs that include system calls that are not POSIX-compliant
• Native execution: valid for programs whose system calls are totally POSIX-compliant

Microsoft MS-DOS and Windows programs are emulated: they cannot be natively executed, since they include APIs that are not recognized by Linux. An emulator like DOSemu or Wine (which appeared in the example at the end of the previous section) is invoked to translate each API call into an emulating wrapper function call, which in turn makes use of the existing Linux system calls.
Since emulators are mostly implemented as User Mode applications, we don't discuss them further.

On the other hand, POSIX-compliant programs compiled on operating systems other than Linux can be executed without too much trouble, since POSIX operating systems offer similar APIs. (Actually, the APIs should be identical, although this is not always the case.) Minor differences that the kernel must iron out usually refer to how system calls are invoked or how the various signals are numbered. This information is stored in execution domain descriptors of type exec_domain.

A process specifies its execution domain by setting the personality field of its descriptor and by storing the address of the corresponding exec_domain data structure in the exec_domain field. A process can change its personality by issuing a suitable system call named personality( ); typical values assumed by the system call's parameter are listed in Table 19-6. The C library does not include a corresponding wrapper routine, because programmers are not expected to directly change the personality of their programs. Instead, the personality( ) system call should be issued by the glue code that sets up the execution context of the process (see Section 19.4).

Table 19-6. Main Personalities Supported by the Linux Kernel

Personality     Operating System
PER_LINUX       Standard execution domain
PER_SVR4        System V Release 4
PER_SVR3        System V Release 3
PER_SCOSVR3     SCO Unix version 3.2
PER_WYSEV386    Unix System V/386 Release 3.2.1
PER_ISCR4       Interactive Unix
PER_BSD         BSD Unix
PER_XENIX       Xenix
PER_IRIX32      SGI Irix-5 32 bit
PER_IRIXN32     SGI Irix-6 32 bit
PER_IRIX64      SGI Irix-6 64 bit

19.4 The exec-like Functions

Unix systems provide a family of functions that replace the execution context of a process with a new context described by an executable file. The names of such functions start with the prefix exec followed by one or two letters; therefore, a generic function in the family is usually referred to as an exec-like function.

The exec-like functions are listed in Table 19-7; they differ in how the parameters are interpreted.

Table 19-7. The exec-like Functions

Function Name   PATH Search   Command-Line Arguments   Environment Array
execl( )        No            List                     No
execlp( )       Yes           List                     No
execle( )       No            List                     Yes
execv( )        No            Array                    No
execvp( )       Yes           Array                    No
execve( )       No            Array                    Yes

The first parameter of each function denotes the pathname of the file to be executed. The pathname can be absolute or relative to the process's current directory.
Moreover, if the name does not include any / characters, the execlp( ) and execvp( ) functions search for the executable file in all directories specified by the PATH environment variable.

Besides the first parameter, the execl( ), execlp( ), and execle( ) functions include a variable number of additional parameters. Each points to a string describing a command-line argument for the new program; as the l character in the function names suggests, the parameters are organized in a list terminated by a NULL value. Usually, the first command-line argument duplicates the executable filename. Conversely, the execv( ), execvp( ), and execve( ) functions specify the command-line arguments with a single parameter: as the v character in the function names suggests, the parameter is the address of a vector of pointers to command-line argument strings. The last component of the array must store the NULL value.

The execle( ) and execve( ) functions receive as their last parameter the address of an array of pointers to environment strings; as usual, the last component of the array must be NULL. The other functions may access the environment for the new program from the external environ global variable, which is defined in the C library.

All exec-like functions, with the exception of execve( ), are wrapper routines defined in the C library and make use of execve( ), which is the only system call offered by Linux to deal with program execution.

The sys_execve( ) service routine receives the following parameters:

• The address of the executable file pathname (in the User Mode address space).
• The address of a NULL-terminated array (in the User Mode address space) of pointers to strings (again in the User Mode address space); each string represents a command-line argument.
• The address of a NULL-terminated array (in the User Mode address space) of pointers to strings (again in the User Mode address space); each string represents an environment variable in the NAME=value format.

The function copies the executable file pathname into a newly allocated page frame.
It then invokes the do_execve( ) function, passing to it the pointers to the page frame, to the arrays of pointers, and to the location of the Kernel Mode stack where the User Mode register contents are saved. In turn, do_execve( ) performs the following operations:

1. Statically allocates a linux_binprm data structure, which will be filled with data concerning the new executable file.
2. Invokes open_namei( ) to get the dentry object, thus the file object and the inode object, associated with the executable file. On failure, returns the proper error code.
3. Invokes the prepare_binprm( ) function to fill the linux_binprm data structure. This function, in turn, performs the following operations:
    a. Checks whether the permissions of the file allow its execution; if not, returns an error code.
    b. Checks whether the file is being written (that is, whether the i_writecount field of the inode is not null): if so, returns an error code.
    c. Initializes the e_uid and e_gid fields of the linux_binprm structure, taking into account the values of the setuid and setgid flags of the executable file. These fields represent the effective user and group IDs, respectively. Also checks process capabilities (a compatibility hack explained in Section 19.1.1).
    d. Fills the buf field of the linux_binprm structure with the first 128 bytes of the executable file. These bytes include a magic number and other information suitable for recognizing the format of the executable file.
4. Copies the file pathname, command-line arguments, and environment strings into one or more newly allocated page frames. (Eventually, they will be assigned to the User Mode address space.)
5. Invokes the search_binary_handler( ) function, which scans the formats list and tries to apply the load_binary method of each element, passing to it the linux_binprm data structure. The scan of the formats list terminates as soon as a load_binary method succeeds in acknowledging the executable format of the file.
6. If the executable file format is not present in the formats list, releases all allocated page frames and returns the error code -ENOEXEC: Linux cannot recognize the executable file format.
7. Otherwise, returns the code obtained from the load_binary method associated with the executable format of the file.

The load_binary method corresponding to an executable file format performs the following operations (we assume that the executable file is stored on a filesystem that allows file memory mapping and that it requires one or more shared libraries):

1. Checks some magic numbers stored in the first 128 bytes of the file to identify the executable format. If the magic numbers don't match, returns the error code -ENOEXEC.
2. Reads the header of the executable file. This header describes the program's segments and the shared libraries requested.
3. Gets from the executable file the pathname of the program interpreter, which will be used to locate the shared libraries and map them into memory.
4. Gets the dentry object (as well as the inode object and the file object) of the program interpreter.
5. Checks the execution permissions of the program interpreter.
6. Copies the first 128 bytes of the program interpreter into the buf field of the linux_binprm structure.
7. Performs some consistency checks on the program interpreter type.
8. Invokes the flush_old_exec( ) function to release almost all resources used by the previous computation; in turn, this function performs the following operations:
    a. If the table of signal handlers is shared with other processes, allocates a new table and decrements the usage counter of the old one; this is done by invoking the make_private_signals( ) function.
    b. Updates the table of signal handlers by resetting each signal to its default action: this is done by invoking the release_old_signals( ) and flush_signal_handlers( ) functions.
    c. Invokes the exec_mmap( ) function to release the memory descriptor, all memory regions, and all page frames assigned to the process and to clean up the process's page tables.
    d. Sets the comm field of the process descriptor with the executable file pathname.
    e. Invokes the flush_thread( ) function to clear the values of the floating point registers and debug registers saved in the TSS segment.
    f. Invokes the flush_old_files( ) function to close all open files having the corresponding flag in the files->close_on_exec field of the process descriptor set (see Section 12.2.7 in Chapter 12). [6]

    [6] These flags can be read and modified by means of the fcntl( ) system call.

9. Now we have reached the point of no return: the function cannot restore the previous computation if something goes wrong.
10. Sets up the new personality of the process, that is, the personality field in the process descriptor.
11. Invokes the setup_arg_pages( ) function to allocate a new memory region descriptor for the process's User Mode stack and to insert that memory region into the process's address space. setup_arg_pages( ) also assigns the page frames containing the command-line arguments and the environment variable strings to the new memory region.
12. Invokes the do_mmap( ) function to create a new memory region that maps the text segment (that is, the code) of the executable file. The initial linear address of the memory region depends on the executable format, since the program's executable code is usually not relocatable. Therefore, the function assumes that the text segment will be loaded starting from some specific logical address offset (and thus, from some specified linear address). ELF programs are loaded starting from linear address 0x08048000.
13. Invokes the do_mmap( ) function to create a new memory region that maps the data segment of the executable file. Again, the initial linear address of the memory region depends on the executable format, since the executable code expects to find its variables at specified offsets (that is, at specified linear addresses). In an ELF program, the data segment is loaded right after the text segment.
14. Allocates additional memory regions for any other specialized segments of the executable file. Usually, there are none.
15. Invokes a function that loads the program interpreter. If the program interpreter is an ELF executable, the function is named load_elf_interp( ). In general, the function performs the operations in steps 11 through 13, but for the program interpreter instead of the file to be executed.
    The initial addresses of the memory regions that will include the text and data of the program interpreter are specified by the program interpreter itself; however, they are very high (usually above 0x40000000) in order to avoid collisions with the memory regions that map the text and data of the file to be executed (see the earlier Section 19.1.4).
16. Sets the exec_domain field in the process descriptor according to the personality of the new program.
17. Determines the new capabilities of the process.
18. Clears the PF_FORKNOEXEC flag in the process descriptor. This flag, which is set when a process is forked and cleared when it executes a new program, is required by the POSIX standard for process accounting.
19. Creates specific program interpreter tables and stores them on the User Mode stack, between the command-line arguments and the array of pointers to environment strings (see Figure 19-1).
20. Sets the values of the start_code, end_code, end_data, start_brk, brk, and start_stack fields of the process's memory descriptor.
21. Invokes the do_mmap( ) function to create a new anonymous memory region mapping the bss segment of the program. (When the process writes into a variable, it triggers demand paging, thus the allocation of a page frame.) The size of this memory region was computed when the executable program was linked. The initial linear address of the memory region must be specified, since the program's executable code is usually not relocatable. In an ELF program, the bss segment is loaded right after the data segment.
22. Invokes the start_thread( ) macro to modify the values of the User Mode registers eip and esp saved on the Kernel Mode stack, so that they point to the entry point of the program interpreter and to the top of the new User Mode stack, respectively.
23. If the process is being traced, sends the SIGTRAP signal to it.
24. Returns the value 0 (success).
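The scan of the formats list and the magic-number check that opens each load_binary method can be modeled in User Mode as follows. This is only a sketch under stated assumptions: struct fmt_sketch, formats_sketch, and the three functions are hypothetical stand-ins and are not the kernel's linux_binfmt code. The handlers only recognize a header and do not load anything. The ELF magic bytes (0x7f followed by "ELF") and the #! script prefix are the only format facts relied upon.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define HEADER_SIZE 128  /* bytes of the file examined, as in the text */

/* Hypothetical user-space stand-in for a linux_binfmt object. */
struct fmt_sketch {
    const char *name;
    /* Returns 0 if the header is recognized, -ENOEXEC otherwise. */
    int (*load_binary)(const unsigned char *header, size_t len);
    struct fmt_sketch *next;
};

static int load_elf_sketch(const unsigned char *header, size_t len)
{
    /* ELF files begin with the bytes 0x7f 'E' 'L' 'F'. */
    if (len >= 4 && memcmp(header, "\177ELF", 4) == 0)
        return 0;
    return -ENOEXEC;
}

static int load_script_sketch(const unsigned char *header, size_t len)
{
    /* Interpreted scripts begin with the "#!" pair of characters. */
    if (len >= 2 && header[0] == '#' && header[1] == '!')
        return 0;
    return -ENOEXEC;
}

/* The script format is kept last in the list, as the text describes. */
static struct fmt_sketch script_fmt = { "script", load_script_sketch, NULL };
static struct fmt_sketch elf_fmt = { "elf", load_elf_sketch, &script_fmt };
static struct fmt_sketch *formats_sketch = &elf_fmt;

/*
 * Scans the list and stops at the first handler that accepts the header.
 * Stores the handler's name in *name on success; returns -ENOEXEC if no
 * handler recognizes the format.
 */
int search_handler_sketch(const unsigned char *header, size_t len,
                          const char **name)
{
    struct fmt_sketch *f;

    if (len > HEADER_SIZE)
        len = HEADER_SIZE;
    for (f = formats_sketch; f != NULL; f = f->next) {
        if (f->load_binary(header, len) == 0) {
            *name = f->name;
            return 0;
        }
    }
    return -ENOEXEC;
}
```

Passing a buffer that starts with "#!/bin/sh" selects the script handler, while a buffer starting with "MZ" matches no handler and yields -ENOEXEC, mirroring step 6 of do_execve( ).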


When the execve( ) system call terminates and the calling process resumes its execution in User Mode, the execution context is dramatically changed: the code that invoked the system call no longer exists. In this sense, we could say that execve( ) never returns on success. Instead, a new program to be executed has been mapped in the address space of the process.

However, the new program cannot yet be executed, since the program interpreter must still take care of loading the shared libraries. [7]

[7] Things are much simpler if the executable file is statically linked, that is, if no shared library is requested. The load_binary method just maps the text, data, bss, and stack segments of the program into the process memory regions, and then sets the User Mode eip register to the entry point of the new program.

Although the program interpreter runs in User Mode, we'll briefly sketch out here how it operates. Its first job is to set up a basic execution context for itself, starting from the information stored by the kernel in the User Mode stack between the array of pointers to environment strings and arg_start. Then, the program interpreter must examine the program to be executed, in order to identify which shared libraries must be loaded and which functions in each shared library are actually requested. Next, the interpreter issues several mmap( ) system calls to create memory regions mapping the pages that will hold the library functions (text and data) actually used by the program.
Then, the interpreter updates all references to the symbols of the shared library, according to the linear addresses of the library's memory regions. Finally, the program interpreter terminates its execution by jumping to the main entry point of the program to be executed. From now on, the process will execute the code of the executable file and of the shared libraries.

As you may have noticed, executing a program is a complex activity that involves many facets of kernel design such as process abstraction, memory management, system calls, and filesystems. It is the kind of topic that makes you realize what a marvelous piece of work Linux is!

19.5 Anticipating Linux 2.4

Linux 2.4 adds a few more personalities: the new kernel is able to execute programs written for SunOS, Sun Solaris, and RISCOS operating systems. The implementation of the execve( ) system call is pretty much the same as in Linux 2.2, though.
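Seen from User Mode, the execve( ) interface discussed in Section 19.4 can be exercised with a few lines of C. The following is a minimal sketch, assuming that /bin/sh exists on the system. The helper name run_with_execve( ) and the PATH value placed in the environment array are illustrative choices, not something prescribed by the kernel.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: runs "/bin/sh -c <command>" in a child process by
 * calling execve( ) directly and returns the command's exit status, or -1
 * on error. /bin/sh is assumed to exist.
 */
int run_with_execve(const char *command)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* NULL-terminated argument vector; argv[0] repeats the file name. */
        char *const argv[] = { "sh", "-c", (char *)command, NULL };
        /* NULL-terminated environment array in NAME=value format. */
        char *const envp[] = { "PATH=/bin:/usr/bin", NULL };

        execve("/bin/sh", argv, envp);
        /* Reached only if execve( ) failed: on success it never returns. */
        _exit(127);
    }

    /* Parent: wait for the child and extract its exit status. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

For instance, run_with_execve("exit 3") is expected to return 3. The statement following execve( ) in the child runs only on failure, which is the User Mode counterpart of the remark that the system call never returns on success.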


Appendix A. System Startup

This appendix explains what happens right after users have switched on their computers, that is, how a Linux kernel image is copied into memory and executed. In short, we discuss how the kernel, and thus the whole system, is "bootstrapped."

Traditionally, the term bootstrap refers to a person who tries to stand up by pulling her own boots. In operating systems, the term denotes bringing at least a portion of the operating system into main memory and having the processor execute it. It also denotes the initialization of kernel data structures, the creation of some user processes, and the transfer of control to one of them.

Computer bootstrapping is a tedious, long task, since initially nearly every hardware device including the RAM is in a random, unpredictable state. Moreover, the bootstrap process is highly dependent on the computer architecture; as usual, we refer to IBM's PC architecture in this appendix.

A.1 Prehistoric Age: The BIOS

The moment after a computer is powered on, it is practically useless because the RAM chips contain random data and no operating system is running. To begin the boot, a special hardware circuit raises the logical value of the RESET pin of the CPU. After RESET is thus asserted, some registers of the processor (including cs and eip) are set to fixed values, and the code found at physical address 0xfffffff0 is executed. This address is mapped by the hardware to some read-only, persistent memory chip, a kind of memory often called ROM (Read-Only Memory).
The set of programs stored in ROM is traditionally called BIOS (Basic Input/Output System), since it includes several interrupt-driven low-level procedures used by some operating systems, including Microsoft's MS-DOS, to handle the hardware devices that make up the computer.

Once initialized, Linux does not make any use of BIOS but provides its own device driver for every hardware device on the computer. In fact, the BIOS procedures must be executed in Real Mode, while the kernel executes in Protected Mode (see Section 2.2 in Chapter 2), so they cannot share functions even if that would be beneficial.

BIOS uses Real Mode addresses because they are the only ones available when the computer is turned on. A Real Mode address is composed of a seg segment and an off offset; the corresponding physical address is given by seg*16+off. As a result, no Global Descriptor Table, Local Descriptor Table, or paging table is needed by the CPU addressing circuit to translate a logical address into a physical one. Clearly, the code that initializes the GDT, LDT, and paging tables must run in Real Mode.

Linux is forced to use BIOS in the bootstrapping phase, when it must retrieve the kernel image from disk or from some other external device. The BIOS bootstrap procedure essentially performs the following four operations:

1. Executes a series of tests on the computer hardware, in order to establish which devices are present and whether they are working properly.
This phase is often called POST (Power-On Self-Test). During this phase, several messages, such as the BIOS version banner, are displayed.
2. Initializes the hardware devices. This phase is crucial in modern PCI-based architectures, since it guarantees that all hardware devices operate without conflicts on the IRQ lines and I/O ports. At the end of this phase, a table of installed PCI devices is displayed.
3. Searches for an operating system to boot. Actually, depending on the BIOS setting, the procedure may try to access (in a predefined, customizable order) the first sector (boot sector) of any floppy disk, any hard disk, and any CD-ROM in the system.
4. As soon as a valid device is found, copies the contents of its first sector into RAM, starting from physical address 0x00007c00, then jumps to that address and executes the code just loaded.

The rest of this appendix takes you from the most primitive starting state to the full glory of a running Linux system.

A.2 Ancient Age: The Boot Loader

The boot loader is the program invoked by the BIOS to load the image of an operating system kernel into RAM. Let us briefly sketch how boot loaders work in IBM's PC architecture.

In order to boot from a floppy disk, the instructions stored in its first sector are loaded in RAM and executed; these instructions copy all the remaining sectors containing the kernel image into RAM.

Booting from a hard disk is done differently.
The first sector of the hard disk, named the Master Boot Record (MBR), includes the partition table [A] and a small program, which loads the first sector of the partition containing the operating system to be started. Some operating systems such as Microsoft Windows 98 identify this partition by means of an active flag included in the partition table; [B] following this approach, only the operating system whose kernel image is stored in the active partition can be booted. As we shall see later, Linux is more flexible since it replaces the rudimentary program included in the MBR with a sophisticated program called LILO that allows users to select the operating system to be booted.

[A] Each partition table entry typically includes the starting and ending sectors of a partition and the kind of operating system that handles it.

[B] The active flag may be set through programs like MS-DOS's FDISK.

A.2.1 Booting Linux from Floppy Disk

The only way to store a Linux kernel on a single floppy disk is to compress the kernel image. As we shall see, compression is done at compile time and decompression by the loader.

If the Linux kernel is loaded from a floppy disk, the boot loader is quite simple. It is coded in the arch/i386/boot/bootsect.S assembly language file. When a new kernel image is produced by compiling the kernel source, the executable code yielded by this assembly language file is placed at the beginning of the kernel image file. Thus, it is very easy to produce a bootable floppy containing the Linux kernel. The floppy can be created by copying the kernel image starting from the first sector of the disk. When the BIOS loads the first sector of the floppy disk, it actually copies the code of the boot loader.
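As a rough illustration of the Real Mode arithmetic introduced in Section A.1, the following Python sketch computes seg*16+off and shows that the boot sector address 0x00007c00 can be written as more than one seg:off pair. It is a model for the reader, not kernel code; the function name real_mode_to_physical and the particular seg:off pairs are our own choices.

```python
def real_mode_to_physical(seg: int, off: int) -> int:
    """Translate a Real Mode seg:off pair into a physical address.

    Hypothetical helper: it only applies the seg*16+off rule quoted in
    Section A.1 and checks that both inputs fit in 16 bits.
    """
    if not (0 <= seg <= 0xFFFF and 0 <= off <= 0xFFFF):
        raise ValueError("segment and offset are 16-bit values")
    return seg * 16 + off


# Two different seg:off pairs naming the address where the BIOS
# places the boot sector.
boot_a = real_mode_to_physical(0x07C0, 0x0000)
boot_b = real_mode_to_physical(0x0000, 0x7C00)
print(hex(boot_a), hex(boot_b))  # 0x7c00 0x7c00
```

Because the segment is multiplied by 16, many pairs map to the same physical address, which is why no descriptor table is needed for the translation.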


The boot loader, which is invoked by the BIOS by jumping to physical address 0x00007c00, performs the following operations:

1. Moves itself from address 0x00007c00 to address 0x00090000.
2. Sets up the Real Mode stack, from address 0x00003ff4. As usual, the stack will grow toward lower addresses.
3. Sets up the disk parameter table, used by the BIOS to handle the floppy device driver.
4. Invokes a BIOS procedure to display a "Loading" message.
5. Invokes a BIOS procedure to load the setup( ) code of the kernel image from the floppy disk and puts it in RAM starting from address 0x00090200.
6. Invokes a BIOS procedure to load the rest of the kernel image from the floppy disk and puts the image in RAM starting from either low address 0x00010000 (for small kernel images compiled with make zImage) or high address 0x00100000 (for big kernel images compiled with make bzImage). In the following discussion, we will say that the kernel image is "loaded low" or "loaded high" in RAM, respectively. Support for big kernel images was introduced quite recently: while it uses essentially the same booting scheme as the older one, it places data in different physical memory addresses to avoid problems with the ISA hole mentioned in Section 2.5.3 in Chapter 2.
7. Jumps to the setup( ) code.

A.2.2 Booting Linux from Hard Disk

In most cases, the Linux kernel is loaded from a hard disk, and a two-stage boot loader is required.
The most commonly used Linux boot loader on Intel systems is named LILO (LInux LOader); corresponding programs exist for other architectures. LILO may be installed either on the MBR, replacing the small program that loads the boot sector of the active partition, or in the boot sector of a (usually active) disk partition. In both cases, the final result is the same: when the loader is executed at boot time, the user may choose which operating system to load.

The LILO boot loader is broken into two parts, since otherwise it would be too large to fit into the MBR. The MBR or the partition boot sector includes a small boot loader, which is loaded into RAM starting from address 0x00007c00 by the BIOS. This small program moves itself to the address 0x0009a000, sets up the Real Mode stack (ranging from 0x0009b000 to 0x0009a200), and loads the second part of the LILO boot loader into RAM starting from address 0x0009b000. In turn, this latter program reads a map of available operating systems from disk and offers the user a prompt so she can choose one of them. Finally, after the user has chosen the kernel to be loaded (or let a time-out elapse so that LILO chooses a default), the boot loader may either copy the boot sector of the corresponding partition into RAM and execute it or directly copy the kernel image into RAM.

Assuming that a Linux kernel image must be booted, the LILO boot loader, which relies on BIOS routines, performs essentially the same operations as the boot loader integrated into the kernel image described in the previous section about floppy disks. The loader displays the "Loading Linux" message; then it copies the integrated boot loader of the kernel image to address 0x00090000, the setup( ) code to address 0x00090200, and the rest of the kernel image to address 0x00010000 or 0x00100000. Then it jumps to the setup( ) code.
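The addresses quoted in Sections A.2.1 and A.2.2 are easier to compare side by side. The Python sketch below records them as constants and derives two distances from them. The constant and function names are ours and purely illustrative; only the numeric addresses come from the text.

```python
# Physical addresses quoted in Sections A.2.1 and A.2.2.
BOOT_SECTOR = 0x00007C00   # where the BIOS places the first sector
BOOT_LOADER = 0x00090000   # integrated boot loader after relocation
SETUP_CODE = 0x00090200    # setup( ) code
KERNEL_LOW = 0x00010000    # rest of the image, "loaded low" (make zImage)
KERNEL_HIGH = 0x00100000   # rest of the image, "loaded high" (make bzImage)


def kernel_load_address(big_kernel: bool) -> int:
    """Return where the rest of the kernel image is placed in RAM."""
    return KERNEL_HIGH if big_kernel else KERNEL_LOW


def room_below_boot_loader() -> int:
    """Bytes between the low load address and the relocated boot loader."""
    return BOOT_LOADER - KERNEL_LOW


print(hex(kernel_load_address(big_kernel=False)))  # 0x10000
print(hex(kernel_load_address(big_kernel=True)))   # 0x100000
print(SETUP_CODE - BOOT_LOADER)                    # 512
print(room_below_boot_loader() // 1024)            # 512
```

The first distance, 512 bytes, matches the 0x200 offset at which setup( ) sits in the image file. The second, 512 KB, is simply the gap between two addresses given in the text; it suggests why an image that is loaded low has to be small, although the text does not state an exact size limit.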


A.3 Middle Ages: The setup( ) Function

The code of the setup( ) assembly language function is placed by the linker immediately after the integrated boot loader of the kernel, that is, at offset 0x200 of the kernel image file. The boot loader can thus easily locate the code and copy it into RAM starting from physical address 0x00090200.

The setup( ) function must initialize the hardware devices in the computer and set up the environment for the execution of the kernel program. Although the BIOS already initialized most hardware devices, Linux does not rely on it but reinitializes the devices in its own manner to enhance portability and robustness. setup( ) essentially performs the following operations:

1. Invokes a BIOS procedure to find out the amount of RAM available in the system.
2. Sets the keyboard repeat delay and rate. (When the user keeps a key pressed past a certain amount of time, the keyboard device sends the corresponding keycode over and over to the CPU.)
3. Initializes the video adapter card.
4. Reinitializes the disk controller and determines the hard disk parameters.
5. Checks for an IBM Micro Channel bus (MCA).
6. Checks for a PS/2 pointing device (bus mouse).
7. Checks for Advanced Power Management (APM) BIOS support.
8. If the kernel image was loaded low in RAM (at physical address 0x00010000), moves it to physical address 0x00001000. Conversely, if the kernel image was loaded high in RAM, does not move it. This step is necessary because, in order to be able to store the kernel image on a floppy disk and to save time while booting, the kernel image stored on disk is compressed, and the decompression routine needs some free space to use as a temporary buffer following the kernel image in RAM.
9. Sets up a provisional Interrupt Descriptor Table (IDT) and a provisional Global Descriptor Table (GDT).
10. Resets the floating point unit (FPU), if any.
11. Reprograms the Programmable Interrupt Controller (PIC) and maps the 16 hardware interrupts (IRQ lines) to the range of vectors from 32 to 47. The kernel must perform this step because the BIOS erroneously maps the hardware interrupts in the range from 0 to 15, which is already used for CPU exceptions (see Section 4.2.3 in Chapter 4).
12. Switches the CPU from Real Mode to Protected Mode by setting the PE bit in the cr0 control register. As explained in Section 2.5.5 in Chapter 2, the provisional kernel page tables contained in swapper_pg_dir and pg0 identically map the linear addresses to the same physical addresses. Therefore, the transition from Real Mode to Protected Mode goes smoothly.
13. Jumps to the startup_32( ) assembly language function.

A.4 Renaissance: The startup_32( ) Functions

There are two different startup_32( ) functions; the one we refer to here is coded in the arch/i386/boot/compressed/head.S file. After setup( ) terminates, the function has been moved either to physical address 0x00100000 or to physical address 0x00001000, depending on whether the kernel image was loaded high or low in RAM.

This function performs the following operations:


1. Initializes the segmentation registers and a provisional stack.
2. Fills the area of uninitialized data of the kernel identified by the _edata and _end symbols with zeros (see Section 2.5.3 in Chapter 2).
3. Invokes the decompress_kernel( ) function to decompress the kernel image. The "Uncompressing Linux..." message is displayed first. After the kernel image has been decompressed, the "Ok, booting the kernel." message is shown. If the kernel image was loaded low, the decompressed kernel is placed at physical address 0x00100000. Otherwise, if the kernel image was loaded high, the decompressed kernel is placed in a temporary buffer located after the compressed image. The decompressed image is then moved into its final position, which starts at physical address 0x00100000.
4. Jumps to physical address 0x00100000.

The decompressed kernel image begins with another startup_32( ) function included in the arch/i386/kernel/head.S file. Using the same name for both the functions does not create any problems (besides confusing our readers), since both functions are executed by jumping to their initial physical addresses.

The second startup_32( ) function essentially sets up the execution environment for the first Linux process (process 0). The function performs the following operations:

1. Initializes the segmentation registers with their final values.
2. Sets up the Kernel Mode stack for process 0 (see Section 3.3.2 in Chapter 3).
3. Invokes setup_idt( ) to fill the IDT with null interrupt handlers (see Section 4.4.2 in Chapter 4).
4. Puts the system parameters obtained from the BIOS and the parameters passed to the operating system into the first page frame (see Section 2.5.3 in Chapter 2).
5. Identifies the model of the processor.
6. Loads the gdtr and idtr registers with the addresses of the GDT and IDT tables.
7. Jumps to the start_kernel( ) function.

A.5 Modern Age: The start_kernel( ) Function

The start_kernel( ) function completes the initialization of the Linux kernel. Nearly every kernel component is initialized by this function; we mention just a few of them:

• The page tables are initialized by invoking the paging_init( ) function (see Section 2.5.5 in Chapter 2).
• The page descriptors are initialized by the mem_init( ) function (see Section 6.1 in Chapter 6).
• The final initialization of the IDT is performed by invoking trap_init( ) (see Section 4.5 in Chapter 4) and init_IRQ( ) (see Section 4.6.2 in Chapter 4).
• The slab allocator is initialized by the kmem_cache_init( ) and kmem_cache_sizes_init( ) functions (see Section 6.2.4 in Chapter 6).
• The system date and time are initialized by the time_init( ) function (see Section 5.1.1 in Chapter 5).
• The kernel thread for process 1 is created by invoking the kernel_thread( ) function. In turn, this kernel thread creates the other kernel threads and executes the /sbin/init program, as described in Section 3.3.2 in Chapter 3.
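The IDT is built in two phases: setup_idt( ) first fills it with null handlers, and the final handlers are installed later, with hardware interrupts arriving on vectors 32 to 47 as arranged by setup( ) in Section A.3. The Python sketch below models that sequence under some assumptions of ours: the table is taken to have the 256 entries of the 80x86 architecture, handlers are represented by placeholder strings, and the lowercase helper names are hypothetical rather than the kernel's own routines.

```python
IDT_ENTRIES = 256        # vector count on 80x86 (assumption of this model)
FIRST_IRQ_VECTOR = 32    # IRQ 0..15 are mapped to vectors 32..47
NR_IRQS = 16


def irq_to_vector(irq: int) -> int:
    """Map an IRQ line to the vector chosen when the PIC is reprogrammed."""
    if not 0 <= irq < NR_IRQS:
        raise ValueError("IRQ lines are numbered 0 to 15")
    return FIRST_IRQ_VECTOR + irq


def setup_idt_model() -> list:
    """Phase 1: every entry points to the same null handler."""
    return ["null_handler"] * IDT_ENTRIES


def init_irq_model(idt: list) -> None:
    """Phase 2 (IRQ part only): install one placeholder handler per IRQ."""
    for irq in range(NR_IRQS):
        idt[irq_to_vector(irq)] = f"irq{irq}_handler"


idt = setup_idt_model()
init_irq_model(idt)
print(irq_to_vector(0), irq_to_vector(15))  # 32 47
print(idt[31], idt[32], idt[48])
```

Keeping the IRQ vectors clear of the low-numbered entries is the point of the remapping: those entries remain available for CPU exceptions.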


Besides the "Linux version 2.2.14 . . . " message, which is displayed right after the beginning of start_kernel( ), many other messages are displayed in this last phase both by the init functions and by the kernel threads. At the end, the familiar login prompt appears on the console (or in the graphical screen if the X Window System is launched at startup), telling the user that the Linux kernel is up and running.


Appendix B. Modules

As stated in Chapter 1, modules are Linux's recipe for effectively achieving many of the theoretical advantages of microkernels without introducing performance penalties.

B.1 To Be (a Module) or Not to Be?

When system programmers want to add a new functionality to the Linux kernel, they are faced with an interesting dilemma: should they write the new code so that it will be compiled as a module, or should they statically link the new code to the kernel?

As a general rule, system programmers tend to implement new code as a module. Because modules can be linked on demand, as we see later, the kernel does not have to be bloated with hundreds of seldom-used programs. Nearly every higher-level component of the Linux kernel (filesystems, device drivers, executable formats, network layers, and so on) can be compiled as a module.

However, some Linux code must necessarily be linked statically, which means that either the corresponding component is included in the kernel, or it is not compiled at all. This happens typically when the component requires a modification to some data structure or function statically linked in the kernel.

As an example, suppose that the component has to introduce new fields into the process descriptor. Linking a module cannot change an already defined data structure like task_struct since, even if the module uses its modified version of the data structure, all statically linked code continues to see the old version: data corruption will easily occur. A partial solution to the problem consists of "statically" adding the new fields to the process descriptor, thus making them available to the kernel component, no matter how it has been linked. However, if the kernel component is never used, such extra fields replicated in every process descriptor are a waste of memory. If the new kernel component increases the size of the process descriptor a lot, one would get better system performance by adding the required fields in the data structure only if the component is statically linked to the kernel.

As a second example, consider a kernel component that has to replace statically linked code. It's pretty clear that no such component can be compiled as a module because the kernel cannot change the machine code already in RAM when linking the module. For instance, it is not possible to link a module that changes the way page frames are allocated, since the buddy system functions are always statically linked to the kernel.

The kernel has two key tasks to perform in managing modules. The first task is making sure the rest of the kernel can reach the module's global symbols, such as the entry point to its main function. A module must also know the addresses of symbols in the kernel and in other modules. So references are resolved once and for all when a module is linked. The second task consists of keeping track of the use of modules, so that no module is unloaded while another module or another part of the kernel is using it. A simple reference count keeps track of each module's usage.
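The first example above, a module that uses its own modified version of the process descriptor, can be made concrete with a toy layout. The Python sketch below uses the standard ctypes module to declare two tiny structures. They are illustrative stand-ins with a couple of fields each, not the real task_struct, and new_field is a hypothetical addition.

```python
import ctypes


class TaskOld(ctypes.Structure):
    """Layout compiled into the statically linked kernel (toy version)."""
    _fields_ = [("state", ctypes.c_long), ("pid", ctypes.c_int)]


class TaskNew(ctypes.Structure):
    """Layout the module was compiled against, with one extra field."""
    _fields_ = [("state", ctypes.c_long),
                ("new_field", ctypes.c_long),   # hypothetical new field
                ("pid", ctypes.c_int)]


# The kernel fills in a descriptor using the old layout.
task = TaskOld(state=0, pid=1234)

# The module reads the same bytes through the new layout. Zero padding
# stands in for whatever memory happens to follow the old structure.
padding = b"\x00" * (ctypes.sizeof(TaskNew) - ctypes.sizeof(TaskOld))
module_view = TaskNew.from_buffer_copy(bytes(task) + padding)

print(TaskOld.pid.offset, TaskNew.pid.offset)  # the offsets differ
print(task.pid, module_view.pid)               # the module misreads pid
```

Since the pid field sits at different offsets in the two layouts, the module reads a value the kernel never stored there, and a write through the new layout would damage whatever the kernel keeps at that position.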


B.2 Module Implementation

Modules are stored in the filesystem as ELF object files. The kernel considers only modules that have been loaded into RAM by the /sbin/insmod program (see Section B.3) and for each of them it allocates a memory area containing the following data:

• A module object
• A null-terminated string that represents the name of the module (all modules should have unique names)
• The code that implements the functions of the module

The module object describes a module; its fields are shown in Table B-1. A singly linked list collects all module objects, where the next field of each object points to the next element in the list. The first element of the list is addressed by the module_list variable. But actually, the first element of the list is always the same: it is named kernel_module and refers to a fictitious module representing the statically linked kernel code.

Table B-1. The module Object

Type                            Name            Description
unsigned long                   size_of_struct  Size of module object
struct module *                 next            Next list element
const char *                    name            Pointer to module name
unsigned long                   size            Module size
atomic_t                        uc.usecount     Module usage counter
unsigned long                   flags           Module flags
unsigned int                    nsyms           Number of exported symbols
unsigned int                    ndeps           Number of referenced modules
struct module_symbol *          syms            Table of exported symbols
struct module_ref *             deps            List of referenced modules
struct module_ref *             refs            List of referencing modules
int (*)(void)                   init            Initialization method
void (*)(void)                  cleanup         Cleanup method
struct exception_table_entry *  ex_table_start  Start of exception table
struct exception_table_entry *  ex_table_end    End of exception table

The total size of the memory area allocated for the module (including the module object and the module name) is contained in the size field.

As already mentioned in Section 8.2.6 in Chapter 8, each module has its own exception table. The table includes the addresses of the fixup code of the module, if any. The table is copied in RAM when the module is linked, and its starting and ending addresses are stored in the ex_table_start and ex_table_end fields of the module object.

B.2.1 Module Usage Counter

Each module has a usage counter, stored in the uc.usecount field of the corresponding module object. The counter is incremented when an operation involving the module's functions is started and decremented when the operation terminates. A module can be unlinked only if its usage counter is null.
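The list of module objects and the rule on the usage counter can be modelled in a few lines. The Python sketch below is a simplified model, not the kernel's C code: it keeps only three fields of the module object, the lowercase helper names are ours, and placing the new module right after kernel_module is an arbitrary choice of the sketch. The scenario anticipates the MS-DOS filesystem example that the text gives next.

```python
class Module:
    """Toy stand-in for the module object; only three fields are modelled."""

    def __init__(self, name: str):
        self.name = name
        self.next = None      # next element of the singly linked list
        self.usecount = 0     # models the uc.usecount field


def find_module_model(head, name: str):
    """Walk the list from head and return the module with that name, if any."""
    mod = head
    while mod is not None:
        if mod.name == name:
            return mod
        mod = mod.next
    return None


def can_unlink(mod: Module) -> bool:
    """A module can be unlinked only if its usage counter is null."""
    return mod.usecount == 0


# Build a two-element list: the fictitious kernel module and one real module.
kernel_module = Module("kernel_module")
msdos = Module("msdos")
kernel_module.next = msdos

mod = find_module_model(kernel_module, "msdos")
mod.usecount += 1                 # an MS-DOS floppy disk is mounted
while_mounted = can_unlink(mod)
mod.usecount -= 1                 # the floppy disk is unmounted
after_unmount = can_unlink(mod)
print(while_mounted, after_unmount)  # False True
```

A plain integer is enough for this single-threaded model; the real field is an atomic_t, as Table B-1 shows, because the counter may be updated concurrently.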


As an example, suppose that the MS-DOS filesystem layer has been compiled as a module and that the module has been linked at runtime. Initially, the module usage counter is null. If the user mounts an MS-DOS floppy disk, the module usage counter is incremented by 1. Conversely, when the user unmounts the floppy disk, the counter is decremented by 1.

B.2.2 Exporting Symbols

When linking a module, all references to global kernel symbols (variables and functions) in the module's object code must be replaced with suitable addresses. This operation, which is very similar to that performed by the linker while compiling a User Mode program (see Section 19.1.3 in Chapter 19), is delegated to the /sbin/insmod external program (described later in Section B.3).

A special table is used by the kernel to store the symbols that can be accessed by modules together with their corresponding addresses. This kernel symbol table is contained in the __ksymtab section of the kernel code segment, and its starting and ending addresses are identified by two symbols produced by the C compiler: __start___ksymtab and __stop___ksymtab. The EXPORT_SYMBOL macro, when used inside the statically linked kernel code, forces the C compiler to add a specified symbol to the table.

Only the kernel symbols actually used by some existing module are included in the table. Should a system programmer need, within some module, to access a kernel symbol that is not already exported, he can simply add the corresponding EXPORT_SYMBOL macro into the kernel/ksyms.c file of the Linux source code.

Linked modules can also export their own symbols, so that other modules can access them. The module symbol table is contained in the __ksymtab section of the module code segment. If the module source code includes the EXPORT_NO_SYMBOLS macro, no symbols from that module are added to the table. To export a subset of symbols from the module, the programmer must define the EXPORT_SYMTAB macro before including the include/linux/module.h header file. Then he may use the EXPORT_SYMBOL macro to export a specific symbol. If neither EXPORT_NO_SYMBOLS nor EXPORT_SYMTAB appears in the module source code, all global symbols of the module are exported.

The symbol table in the __ksymtab section is copied into a memory area when the module is linked, and the address of the area is stored in the syms field of the module object. The symbols exported by the statically linked kernel and all linked-in modules can be retrieved by reading the /proc/ksyms file or using the query_module( ) system call (described in Section B.3).

B.2.3 Module Dependency

A module (B) can refer to the symbols exported by another module (A); in this case, we say that B is loaded on top of A, or equivalently that A is used by B. In order to link module B, module A must have already been linked; otherwise, the references to the symbols exported by A cannot be properly linked in B. In short, there is a dependency between modules.

The deps field of the module object relative to B points to a list describing all modules that are used by B; in our example, A's module object would appear in that list. The ndeps field stores the number of modules used by B. Conversely, the refs field of A points to a list
describing all modules that are loaded on top of A (thus, B's module object will be included when it is loaded). The refs list must be updated dynamically whenever a module is loaded on top of A. In order to ensure that module A is not removed before B, A's usage counter is incremented for each module loaded on top of it.

Besides A and B there could be, of course, another module (C) loaded on top of B, and so on. Stacking modules is an effective way to modularize the kernel source code in order to speed up its development and improve its portability.

B.3 Linking and Unlinking Modules

A user can link a module into the running kernel by executing the /sbin/insmod external program. This program performs the following operations:

1. Reads from the command line the name of the module to be linked.
2. Locates the file containing the module's object code in the system directory tree. The file is usually placed in some subdirectory below /lib/modules.
3. Computes the size of the memory area needed to store the module code, its name, and the module object.
4. Invokes the create_module( ) system call, passing to it the name and size of the new module. The corresponding sys_create_module( ) service routine performs the following operations:
   a. Checks whether the user is allowed to link the module (the current process must have the CAP_SYS_MODULE capability). In any situation where one is adding functionality to a kernel, which has access to all data and processes on the system, security is a paramount concern.
   b. Invokes the find_module( ) function to scan the module_list list of module objects looking for a module with the specified name. If it is found, the module has already been linked, so the system call terminates.
   c. Invokes vmalloc( ) to allocate a memory area for the new module.
   d. Initializes the fields of the module object at the beginning of the memory area and copies the name of the module right below the object.
   e. Inserts the module object into the list pointed to by module_list.
   f. Returns the starting address of the memory area allocated to the module.
5. Invokes the query_module( ) system call with the QM_MODULES subcommand to get the names of all already linked modules.
6. Invokes the query_module( ) system call with the QM_SYMBOLS subcommand repeatedly, to get the kernel symbol table and the symbol tables of all modules that are already linked in.
7. Using the kernel symbol table, the module symbol tables, and the address returned by the create_module( ) system call, relocates the object code included in the module's file. This means replacing all occurrences of external and global symbols with the corresponding logical address offsets.
8. Allocates a memory area in the User Mode address space and loads it with a copy of the module object, the module's name, and the module's code relocated for the running kernel. The address fields of the object point to the relocated code. The init field is set to the relocated address of the module's init_module( ) function, if the module defines one. (Virtually all modules define a function of that name, which is invoked in the next step to perform any initialization required by the module.) Similarly, the cleanup field is set to the relocated address of the module's cleanup_module( ) function, if one is present.
9. Invokes the init_module( ) system call, passing to it the address of the User Mode memory area set up in the previous step. The sys_init_module( ) service routine performs the following operations:
   a. Checks whether the user is allowed to link the module (the current process must have the CAP_SYS_MODULE capability).
   b. Invokes find_module( ) to find the proper module object in the list to which module_list points.
   c. Overwrites the module object with the contents of the corresponding object in the User Mode memory area.
   d. Performs a series of sanity checks on the addresses in the module object.
   e. Copies the remaining part of the User Mode memory area into the memory area allocated to the module.
   f. Scans the module list and initializes the ndeps and deps fields of the module object.
   g. Sets the module usage counter to 1.
   h. If defined, executes the init method of the module to initialize the module's data structures properly. The method is usually implemented by the init_module( ) function defined inside the module.
   i. Sets the module usage counter to 0 and returns.
10. Releases the User Mode memory area and terminates.

In order to unlink a module, a user invokes the /sbin/rmmod external program, which performs the following operations:

1. From the command line, reads the name of the module to be unlinked.
2. Invokes the query_module( ) system call with the QM_MODULES subcommand to get the list of linked modules.
3. 
Invok<strong>es</strong> the query_module( ) system call with the QM_REFS subcommand severaltim<strong>es</strong>, to retrieve dependency information on the linked modul<strong>es</strong>. If some module islinked on top of the one to be removed, terminat<strong>es</strong>.4. Invok<strong>es</strong> the delete_module( ) system call, passing the module's name to it. <strong>The</strong>corr<strong>es</strong>ponding sys_delete_module( ) service routine performs th<strong>es</strong>e operations:a. Checks whether the user is allowed to remove the module (the current proc<strong>es</strong>smust have the CAP_SYS_MODULE capability).b. Invok<strong>es</strong> find_module( ) to find the corr<strong>es</strong>ponding module object in the list towhich module_list points.c. Checks whether both the refs field and the uc.usecount fields of the moduleobject are null; otherwise, returns an error code.d. If defined, invok<strong>es</strong> the cleanup method to perform the operations needed tocleanly shut down the module. <strong>The</strong> method is usually implemented by thecleanup_module( ) function defined inside the module.e. Scans the deps list of the module and remov<strong>es</strong> the module from the refs listof any element found.f. Remov<strong>es</strong> the module from the list to which module_list points.g. Invok<strong>es</strong> vfree( ) to release the memory area used by the module and returns(succ<strong>es</strong>s).530
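The list bookkeeping that sys_create_module( ) and sys_delete_module( ) perform can be sketched in user space. What follows is a minimal model under stated assumptions, not the kernel's code: struct toy_module and every toy_ function are hypothetical names invented for this illustration, calloc( ) and free( ) stand in for vmalloc( ) and vfree( ), the capability check is omitted, and the refs list is reduced to a simple counter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_NAME_MAX 32
#define TOY_MAX_DEPS 4

/* Hypothetical, heavily simplified stand-in for the kernel's module object. */
struct toy_module {
    char name[TOY_NAME_MAX];
    int usecount;                           /* models uc.usecount */
    int nrefs;                              /* how many modules are stacked on top of this one */
    int ndeps;                              /* how many modules this one uses */
    struct toy_module *deps[TOY_MAX_DEPS];  /* the modules this one uses */
    struct toy_module *next;
};

/* Head of the list; plays the role of module_list. */
static struct toy_module *toy_module_list;

/* Scan the list for a module with the given name (models find_module()). */
struct toy_module *toy_find_module(const char *name)
{
    struct toy_module *m;

    for (m = toy_module_list; m != NULL; m = m->next)
        if (strcmp(m->name, name) == 0)
            return m;
    return NULL;
}

/* Returns 0 on success, -1 if the name is too long, already linked,
 * or memory cannot be allocated. */
int toy_create_module(const char *name)
{
    struct toy_module *m;

    assert(name != NULL);
    if (strlen(name) >= TOY_NAME_MAX || toy_find_module(name) != NULL)
        return -1;
    m = calloc(1, sizeof(*m));              /* stand-in for vmalloc() */
    if (m == NULL)
        return -1;
    strcpy(m->name, name);
    m->next = toy_module_list;              /* insert at the head of the list */
    toy_module_list = m;
    return 0;
}

/* Record that module "user" is loaded on top of module "provider".
 * Returns 0 on success, -1 on any error. */
int toy_add_dep(const char *user, const char *provider)
{
    struct toy_module *u = toy_find_module(user);
    struct toy_module *p = toy_find_module(provider);

    if (u == NULL || p == NULL || u == p || u->ndeps == TOY_MAX_DEPS)
        return -1;
    u->deps[u->ndeps++] = p;
    p->nrefs++;
    return 0;
}

/* Returns 0 on success, -1 if the module is not linked,
 * -2 if it is still referenced or in use. */
int toy_delete_module(const char *name)
{
    struct toy_module **link, *m;
    int i;

    for (link = &toy_module_list; (m = *link) != NULL; link = &m->next)
        if (strcmp(m->name, name) == 0)
            break;
    if (m == NULL)
        return -1;
    if (m->nrefs != 0 || m->usecount != 0)
        return -2;
    for (i = 0; i < m->ndeps; i++)          /* drop this module from its providers' counts */
        m->deps[i]->nrefs--;
    *link = m->next;                        /* unlink from the list */
    free(m);                                /* stand-in for vfree() */
    return 0;
}
```

In this model, after creating fat and msdos and recording with toy_add_dep("msdos", "fat") that msdos is stacked on fat, toy_delete_module("fat") is refused with -2 until msdos has been removed. This mirrors the rule that a module cannot be unlinked while another module is linked on top of it.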


B.4 Linking Modules on Demand

A module can be automatically linked when the functionality it provides is requested and automatically removed afterward.

For instance, suppose that the MS-DOS filesystem has not been linked, either statically or dynamically. If a user tries to mount an MS-DOS filesystem, the mount( ) system call normally fails by returning an error code, since MS-DOS is not included in the file_systems list of registered filesystems. However, if support for automatic linking of modules has been specified when configuring the kernel, Linux makes an attempt to link the MS-DOS module, then scans the list of registered filesystems again. If the module was successfully linked, the mount( ) system call can continue its execution as if the MS-DOS filesystem were present from the beginning.

B.4.1 The modprobe Program

In order to automatically link a module, the kernel creates a kernel thread to execute the /sbin/modprobe external program, [A] which takes care of possible complications due to module dependencies. Dependencies were discussed earlier: a module may require one or more other modules, and these in turn may require still other modules. For instance, the MS-DOS module requires another module named fat containing some code common to all filesystems based on a File Allocation Table (FAT). Thus, if it is not already present, the fat module must also be automatically linked into the running kernel when the MS-DOS module is requested. Resolving dependencies and finding modules is a type of activity that's best done in User Mode, because it requires locating and accessing module object files in the filesystem.

[A] This is one of the few examples in which the kernel relies on an external program.

The /sbin/modprobe external program is similar to insmod, since it links in a module specified on the command line. However, modprobe also recursively links in all modules used by the module specified on the command line. For instance, if a user invokes modprobe to link the MS-DOS module, the program links the fat module, if necessary, followed by the MS-DOS module. Actually, modprobe just checks for module dependencies; the actual linking of each module is done by forking a new process and executing insmod.

How does modprobe know about module dependencies? Another external program named /sbin/depmod is executed at system startup. It looks at all the modules compiled for the running kernel, which are usually stored inside the /lib/modules directory. Then it writes all module dependencies to a file named modules.dep. The modprobe program can thus simply compare the information stored in the file with the list of linked modules produced by the query_module( ) system call.

B.4.2 The request_module( ) Function

In some cases, the kernel may invoke the request_module( ) function to attempt automatic linking for a module.

Consider again the case of a user trying to mount an MS-DOS filesystem: if the get_fs_type( ) function discovers that the filesystem is not registered, it invokes the request_module( ) function in the hope that MS-DOS has been compiled as a module.


If the request_module( ) function succeeds in linking the requested module, get_fs_type( ) can continue as if the module were always present. Of course, this does not always happen; in our example, the MS-DOS module might not have been compiled at all. In this case, get_fs_type( ) returns an error code.

The request_module( ) function receives the name of the module to be linked as its parameter. It invokes kernel_thread( ) to create a new kernel thread that executes the exec_modprobe( ) function, then it simply waits until that kernel thread terminates.

The exec_modprobe( ) function, in turn, also receives the name of the module to be linked as its parameter. It invokes the execve( ) system call and executes the /sbin/modprobe external program, [B] passing the module name to it. In turn, the modprobe program actually links the requested module, along with any that it depends on.

[B] The name and path of the program executed by exec_modprobe( ) can be customized by writing into the /proc/sys/kernel/modprobe file.

Each module that is automatically linked into the kernel has the MOD_AUTOCLEAN flag set in the flags field of its module object. This flag allows automatic unlinking of the module when it is no longer used.

In order to automatically unlink the module, a system process (like crond) periodically executes the rmmod external program, passing the -a option to it. The latter program executes the delete_module( ) system call with a NULL parameter. The corresponding service routine scans the list of module objects and removes all unused modules having the MOD_AUTOCLEAN flag set.
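The on-demand path described in this section, in which get_fs_type( ) falls back to request_module( ) and modprobe links a module's dependencies before the module itself, can be modeled as follows. This is a user-space sketch under stated assumptions: all toy_ names are hypothetical, the in-memory table merely stands in for the contents of modules.dep, no kernel thread or execve( ) call is involved, and the table is assumed to contain no dependency cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_DEP_MAX 3

/* One entry per module: its name and the modules it requires.
 * Unused slots of deps[] are NULL. */
struct toy_dep_entry {
    const char *name;
    const char *deps[TOY_DEP_MAX];
};

/* Hypothetical stand-in for the information stored in modules.dep. */
static const struct toy_dep_entry toy_modules_dep[] = {
    { "fat",   { NULL } },
    { "msdos", { "fat" } },
    { "vfat",  { "fat" } },
};

#define TOY_NMODS (sizeof(toy_modules_dep) / sizeof(toy_modules_dep[0]))

static int toy_linked[TOY_NMODS];             /* 1 if the module is linked */
static const char *toy_link_order[TOY_NMODS]; /* names in the order they were linked */
static size_t toy_nlinked;

/* Return the table index of a module, or -1 if it was never compiled. */
static int toy_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < TOY_NMODS; i++)
        if (strcmp(toy_modules_dep[i].name, name) == 0)
            return (int)i;
    return -1;
}

/* Model of modprobe: link every required module first, then the module
 * itself. Returns 0 on success, -1 if a module is unknown. */
int toy_modprobe(const char *name)
{
    int idx = toy_lookup(name);
    size_t i;

    if (idx < 0)
        return -1;
    if (toy_linked[idx])
        return 0;                             /* already linked: nothing to do */
    for (i = 0; i < TOY_DEP_MAX && toy_modules_dep[idx].deps[i] != NULL; i++)
        if (toy_modprobe(toy_modules_dep[idx].deps[i]) != 0)
            return -1;
    assert(toy_nlinked < TOY_NMODS);
    toy_linked[idx] = 1;                      /* stands in for running insmod */
    toy_link_order[toy_nlinked++] = toy_modules_dep[idx].name;
    return 0;
}

/* Model of the retry in get_fs_type(): if the filesystem is not
 * registered, request the module and check again.
 * Returns 0 if the filesystem is available, -1 otherwise. */
int toy_get_fs_type(const char *fsname)
{
    int idx = toy_lookup(fsname);

    if (idx >= 0 && toy_linked[idx])
        return 0;
    if (toy_modprobe(fsname) != 0)            /* stands in for request_module() */
        return -1;
    idx = toy_lookup(fsname);
    return (idx >= 0 && toy_linked[idx]) ? 0 : -1;
}
```

Calling toy_get_fs_type("msdos") on a fresh table links fat first and msdos second, while a name that is absent from the table yields -1, which corresponds to the case in which the module was not compiled at all.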


Appendix C. Source Code Structure

In order to help you find your way through the files of the source code, we briefly describe the organization of the kernel directory tree. As usual, all pathnames refer to the main directory of the Linux kernel, which is, in most Linux distributions, /usr/src/linux.

Linux source code for all supported architectures is contained in about 4500 C and Assembly files stored in about 270 subdirectories; it consists of about 2 million lines of code, which occupy more than 58 megabytes of disk space.

The following list illustrates the directory tree containing the Linux source code. Note that only the subdirectories related to the subject of this book have been expanded.

init              Kernel initialization code
kernel            Kernel core: processes, timing, program execution, signals, modules, ...
mm                Memory handling
arch              Platform-dependent code
—i386             IBM's PC architecture
——kernel          Kernel core
——mm              Memory management
——math-emu        Software emulator for floating point unit
——lib             Hardware-dependent utility functions
——boot            Bootstrapping
———compressed     Compressed kernel handling
———tools          Programs to build compressed kernel image
—alpha            Compaq's Alpha architecture
—s390             IBM's System/390 architecture
—sparc            Sun's SPARC architecture
—sparc64          Sun's Ultra-SPARC architecture
—mips             Silicon Graphics' MIPS architecture
—ppc              Motorola-IBM's PowerPC-based architectures
—m68k             Motorola's MC680x0-based architecture
—arm              Architectures based on ARM processor
fs                Filesystems
—proc             /proc virtual filesystem
—devpts           /dev/pts virtual filesystem
—ext2             Linux native Ext2 filesystem
—isofs            ISO9660 filesystem (CD-ROM)
—nfs              Network File System (NFS)
—nfsd             Integrated Network filesystem server
—fat              Common code for FAT-based filesystems
—msdos            Microsoft's MS-DOS filesystem
—vfat             Microsoft's Windows filesystem (VFAT)
—nls              Native Language Support
—ntfs             Microsoft's Windows NT filesystem
—smbfs            Microsoft's Windows Server Message Block (SMB) filesystem
—umsdos           UMSDOS filesystem
—minix            MINIX filesystem
—hpfs             IBM's OS/2 filesystem
—sysv             System V, SCO, Xenix, Coherent, and Version 7 filesystem
—ncpfs            Novell's Netware Core Protocol (NCP)
—ufs              Unix BSD, SunOs, FreeBSD, NetBSD, OpenBSD, and NeXTStep filesystem
—affs             Amiga's Fast File System (FFS)
—coda             Coda network filesystem
—hfs              Apple's Macintosh filesystem
—adfs             Acorn Disc Filing System
—efs              SGI IRIX's EFS filesystem
—qnx4             Filesystem for QNX 4 OS
—romfs            Small read-only filesystem
—autofs           Directory automounter support
—lockd            Remote file locking support
net               Networking code
ipc               System V's Interprocess Communication
drivers           Device drivers
—block            Block device drivers
——paride          Support for accessing IDE devices from parallel port
—scsi             SCSI device drivers
—char             Character device drivers
——joystick        Joysticks
——ftape           Tape-streaming devices
——hfmodem         Ham radio devices
—ip2              IntelliPort's multiport serial controllers
—net              Network card devices
—sound            Audio card devices
—video            Video card devices
—cdrom            Proprietary CD-ROM devices (neither ATAPI nor SCSI)
—isdn             ISDN devices
—ap1000           Fujitsu's AP1000 devices
—macintosh        Apple's Macintosh devices
—sgi              Silicon Graphics' devices
—fc4              Fibre Channel devices
—acorn            Acorn's devices
—misc             Miscellaneous devices
—pnp              Plug-and-play support
—usb              Universal Serial Bus (USB) support
—pci              PCI bus support
—sbus             Sun's SPARC SBus support
—nubus            Apple's Macintosh Nubus support
—zorro            Amiga's Zorro bus support
—dio              Hewlett-Packard's HP300 DIO bus support
—tc               DEC's TurboChannel support (not yet finished)
lib               General-purpose kernel functions
include           Header files (.h)
—linux            Kernel core
——lockd           Remote file locking
——nfsd            Integrated Network File Server
——sunrpc          Sun's Remote Procedure Call
——byteorder       Byte-swapping functions
——modules         Module support
—asm-generic      Platform-independent low-level header files
—asm-i386         IBM's PC architecture
—asm-alpha        Compaq's Alpha architecture
—asm-mips         Silicon Graphics' MIPS architecture
—asm-m68k         Motorola's MC680x0-based architecture
—asm-ppc          Motorola-IBM's PowerPC architecture
—asm-s390         IBM's System/390 architecture
—asm-sparc        Sun's SPARC architecture
—asm-sparc64      Sun's Ultra-SPARC architecture
—asm-arm          Architectures based on ARM processor
—net              Networking
—scsi             SCSI support
—video            Video card support
—config           Header files containing the macros that define the kernel configuration
scripts           External programs for building the kernel image
Documentation     Text files with general explanations and hints about kernel components


Colophon

Our look is the result of reader comments, our own experimentation, and feedback from distribution channels. Distinctive covers complement our distinctive approach to technical topics, breathing personality and life into potentially dry subjects.

The cover image of a man with a bubble is adapted from a 19th-century engraving from the Dover Pictorial Archive. Edie Freeman designed the cover. Emma Colby produced the cover with QuarkXPress 4.1, using the ITC Garamond Condensed font. David Futato designed the interior layout based on a series design by Alicia Cech.

Catherine Morris was the production editor, and Norma Emory was the copyeditor for Understanding the Linux Kernel. Clairemarie Fisher O'Leary was the proofreader. Jeff Holcomb, Claire Cloutier, and Catherine Morris provided quality control. Judy Hoer and Joe Wizda wrote the index. Linley Dolby, Rachel Wheeler, and Deborah Smith provided production support. The illustrations that appear in the book were produced by Robert Romano using Macromedia FreeHand 8 and Adobe Photoshop 5.
