The holy bible of SWEB - Institute of Applied Information Processing ...

The holy bible of SWEB 

SWEB-Crew 

September 30, 2008


Contents 

1 Install
1.1 some notes for installing sweb
1.1.1 Preparing sweb for building
1.1.2 building sweb
1.1.3 Important notes
1.2 running SWEB
1.2.1 bochs
1.2.2 bochsgdb
1.2.3 qemu
1.2.4 VMWare, Linux
1.2.5 VMWare, Windows
1.2.6 XEN
1.2.7 ’real’ hardware
1.3 Developers

2 GDB
2.1 General
2.2 Bochs
2.2.1 Building Bochs with “page fault” patch
2.3 Qemu
2.4 Using GDB
2.4.1 Custom GDB macros
2.5 Using DDD
2.5.1 DDD example session

3 VM, Protection and Paging
3.1 Notation
3.1.1 Memory Types
3.1.2 Units
3.2 Abstract
3.3 Basic Concepts
3.3.1 Protected Mode
3.3.2 Protected Mode Segmentation
3.3.3 Paging
3.4 Interface & Classes
3.4.1 ArchMemory
3.4.2 PageManager
3.4.3 Implementation
3.5 Sweb Boot Process
3.5.1 Boot Sequence
3.5.2 Sweb Global Descriptor
3.5.3 Sweb Boot-Time Paging
3.5.4 Sweb Post-Boot Paging Setup
3.6 Sweb Task Separation and Protection
3.6.1 Segment Descriptor Privilege Level / Task Privilege Level
3.6.2 User- / Supervisor- Page
3.6.3 LoadPageOnDemand
3.6.4 UserThread Termination
3.7 Extending Sweb
3.7.1 Using 4MiB Pages
3.7.2 Additional pages for the kernel
3.7.3 sharing Memory between linear address spaces
3.7.4 Cloning a Tasks linear address space
3.7.5 CopyOnWrite
3.7.6 Virtual Memory

4 Kernelspace Memory Management
4.1 General
4.2 Memory Organisation
4.3 Important Classes
4.3.1 KernelMemoryManager
4.3.2 MallocSegment
4.4 Used Algorithms
4.4.1 allocating memory
4.4.2 freeing memory
4.4.3 reallocating memory
4.5 Interface
4.5.1 Functions

5 Threads and Processes
5.1 Introduction
5.2 Details for x86
5.2.1 Example for saving the CPU registers
5.2.2 Example for reading back the CPU Registers
5.3 The FPU and other problems
5.4 Threads and Processes in SWEB
5.4.1 Kernel Threads
5.4.2 Userspace Processes
5.4.3 Switching to Userspace
5.5 Multiprocessor Machines
5.5.1 The FPU Trick
5.5.2 CPU Affinity
5.5.3 NUMA
5.6 Scheduling
5.6.1 Threads
5.6.2 The Scheduler



6 System Calls
6.1 Base Information about System Calls
6.1.1 Concept of System Calls
6.2 System Calls in the different Operating Systems
6.3 System calls in SWEB
6.3.1 Concept of System Calls in SWEB
6.3.2 How to implement System Calls in SWEB
6.3.3 A convenient interface

7 Character Devices
7.1 Introduction
7.2 Classes
7.3 Character Device Base
7.4 Keyboard
7.5 Serial Port (or how to write a character device driver in SWEB)
7.6 SerialManager
7.7 Terminal
7.7.1 Introduction
7.7.2 Escape codes
7.7.3 Implementation in SWEB
7.8 Console
7.8.1 Text Console
7.8.2 Frame Buffer Console

8 Block Devices
8.1 Introduction
8.1.1 what are block devices
8.1.2 hard drive layout
8.1.3 how can blocks be accessed (chs, lba etc.)
8.2 Implementation
8.2.1 How to implement new low-level block device drivers
8.2.2 How to use block devices
8.2.3 VirtualDevices
8.2.4 BDRequest

9 VFS, FS
9.1 Introduction
9.2 File system
9.2.1 Files
9.2.2 Inodes
9.2.3 Filesystems
9.2.4 Names
9.3 VFS data structures
9.3.1 Superblock Objects
9.3.2 Inode objects
9.3.3 Dentry objects
9.3.4 File objects
9.3.5 derivation-hierarchy
9.3.6 interaction between processes and vfs objects
9.4 RamFS - memory filesystem
9.4.1 simplification from the RamFS
9.4.2 file vs. directory
9.5 extension

10 Locking
10.1 Introduction
10.2 Locks
10.2.1 SpinLock
10.2.2 Mutex
10.3 Using Locks
10.4 Critical Sections in the Sweb-Kernel
10.4.1 KMM
10.4.2 PM
10.4.3 Loader
10.4.4 CV
10.4.5 Console and Terminal
10.4.6 Scheduler

11 Xen
11.1 Introduction
11.2 Architecture separation
11.2.1 Building environment
11.3 Xen
11.3.1 Overview
11.3.2 Architecture
11.3.3 Memory Layout
11.4 SWEB-Xen
11.4.1 Guest Section
11.4.2 Start Info and Shared Info Page
11.4.3 PageManager and KernelMemoryManager
11.4.4 Memory regions and modules
11.4.5 Memory Layout
11.4.6 PSE
11.4.7 Console
11.4.8 Callbacks, Events and Exceptions
11.4.9 Virtual CPUs
11.4.10 Block Devices
11.4.11 Boot Process
11.5 Work to do
11.6 Appendix: Install guide
11.6.1 General
11.6.2 Getting Xen
11.6.3 Dependencies
11.6.4 Prepare installation
11.6.5 Building Xen
11.6.6 Installing Xen & Xen tools
11.6.7 Configuring GRUB
11.6.8 SWEB under Xen
11.6.9 Running Xen
11.6.10 Alternative to real installation

12 Hacking Sweb
12.1 C++ specific issues
12.2 Dos and Donts




List of Figures 

3.1 Protected Mode Segment Descriptor ((c) Dr. Jim Plusquellic [2])
3.2 Example Mapping between linear and physical Memory
3.3 4k page address translation ((c) by Intel)
3.4 Page-Fault-Handling with possible CopyOnWrite
3.5 Example mapping using virtual memory.
3.6 Sweb Page-Fault-Handling and possible Virtual Memory Extension
4.1 linear address space
4.2 memory segments
4.3 MallocSegment
5.1 Saving the cpu registers in an interrupt handler
5.2 Reading back the cpu registers of a Thread
5.3 Initialisation of the Registers for a Kernel Thread
5.4 Initialisation of the Registers for a Userspace Thread
5.5 Reading back the cpu registers of a Thread and switch to Userspace
6.1 The implementation of the system calls
6.2 The definition list of the system calls
6.5 The implementation of the system call write
11.1 Xen memory layout
11.2 Page frame numbers: PFN and MFN
11.3 SWEB Xen memory layout
11.4 Event sources
11.5 Overview over the exception and event implementation



List of Tables 

3.1 Segment Descriptor Fields
3.3 Explanation of Fields in Page-Directory- and Page-Table-Entries
3.4 Control Registers and Paging
3.5 ArchMemory Interface
3.6 PageManager Interface
3.7 Sweb Global segment Descriptor Table
3.8 Sweb GDT, more readable
3.9 Sweb Linear Memory
4.1 KernelMemoryManager Interface
4.2 MallocSegment Interface
4.3 provided Functions
7.1 Character Device base class methods
7.2 Character Device input and output buffers
7.3 The Inode functions implemented by CharacterDevice
7.4 I/O ports for communicating with keyboard
7.9 Terminal class base
7.10 Terminal input class methods
7.11 Terminal output class methods
7.12 Console output functions



Chapter 1 

Install 

Basic guidelines for building and installing SWEB 

1.1 some notes for installing sweb 

SWEB runs on any Intel Pentium (R) or compatible platform with at least 16 megabytes of RAM, a VGA display, and a floppy or IDE hard disk drive for booting. It may run on other platforms as well, but this has not been tested yet.

You may not want to run this OS on ’real’ hardware while you are testing. Please keep in mind, however, that it should run not only in one particular emulator but also on real hardware. If you want to test it on real hardware, you have to install sweb on a hard drive.

1.1.1 Preparing sweb for building 

Sweb uses cmake to generate the makefiles, which can then be used with the standard make command. Before running cmake you can change some options that influence the makefile generation. For more information on cmake, run ’man cmake’ or see www.cmake.org. There is a file called MakeOptions in the sweb (source) directory. Open it in an editor of your choice and set the options according to your needs.

The default MakeOptions file looks similar to the following:

# Architecture to use: x86, xen
SET (ARCH x86)

# Added before the BIN_DIR_NAME (e.g. "../")
SET (BIN_DIR_PREFIX "../")

# Name of the Binary dir (e.g. sweb-bin)
SET (BIN_DIR_NAME "sweb-bin")

# Doxygen output directory
# Result must include escaping characters e.g. "..\\/sweb-docs"
SET (DOC_DIR "..\\/sweb-docs")

# Path to the bochs executable to use (patched bochs at different location)
SET (BOCHS_PATH "bochs")
#SET (BOCHS_PATH "/opt/sweb-bochs/bin/bochs")

# 'true' to enable verbose compile output
set(CMAKE_VERBOSE_MAKEFILE false)

The resulting binary directory is determined as follows: the path of the sweb source directory, followed by BIN_DIR_PREFIX and then BIN_DIR_NAME.

If your sweb sources are located at 

/home/user/sweb 

the resulting binary directory (using the above MakeOptions file) would be: 

/home/user/sweb-bin 
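The rule can be illustrated with a small C++17 sketch. It assumes that the binary directory is simply the source directory followed by BIN_DIR_PREFIX and BIN_DIR_NAME, which is what the example above suggests. The function name swebBinaryDir is hypothetical and is not part of SWEB or its build scripts.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

// Hypothetical helper (not part of SWEB): mimics how the binary directory
// appears to be derived from the source directory and the two MakeOptions
// settings BIN_DIR_PREFIX and BIN_DIR_NAME.
std::string swebBinaryDir(const std::string& source_dir,
                          const std::string& bin_dir_prefix,
                          const std::string& bin_dir_name)
{
  namespace fs = std::filesystem;
  assert(!source_dir.empty() && !bin_dir_name.empty());

  // Append prefix and name to the source directory ...
  const fs::path joined = fs::path(source_dir) / bin_dir_prefix / bin_dir_name;

  // ... and resolve "." and ".." purely lexically (no file system access).
  return joined.lexically_normal().generic_string();
}
```

With the default MakeOptions shown above, swebBinaryDir("/home/user/sweb", "../", "sweb-bin") yields /home/user/sweb-bin.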

Make sure that the directory which will contain your desired sweb binary directory (its parent directory) is writable.

Now it’s time to let cmake do its job. Navigate to your sweb (source) directory, type 

$ cmake . 

(don’t forget the dot) and check for errors in the command output. If no error occurred, the makefiles have been generated successfully.

1.1.2 building sweb 

Now you can type 

$ make 

to build sweb. 

You might have to install the following packages (Debian / Gentoo package names in parentheses):

• cmake >=2.4 (cmake / cmake) 

• libcom_err.so (comerr-dev / ??) 

• nasm (nasm / nasm) 

• g++ (g++ / ) 

• gdb (gdb / gdb) 

• emulator of your choice: 

bochs (bochs-wx, bochs-x, bochsbios, bochs, vgabios / bochs) 

qemu (qemu / qemu) 

VMWare (http://www.vmware.com) 

XEN 



1.1.3 Important notes 

Adding/removing source files 

When you add source files to your sweb sources (or remove them), don’t forget to run cmake from your sweb (source) directory again:

$ cmake . 

Moving/renaming source directory 

When you move your sweb source directory (e.g. /home/user/sweb -> /home/user/foo/bar/sweb), run the relocation script from the sweb directory and then run cmake again:

$ ./relocate_project.sh

$ cmake . 

1.2 running SWEB 

1.2.1 bochs 

This is probably the best emulator for debugging, although it may have some problems. We suggest using bochs version >= 2.2.1. Older versions seem to load grub without problems but fail when executing the kernel image. We supply a ready-to-use config file (bochsrc) and a patched grub to start it.

Start it with "make bochs".

The first time you run "make bochs", it internally runs "make bochsconfig", which generates a ready-to-use bochsrc file.

Configuring bochs 

The bochs configuration files are located in ./utils/bochs. There are two files of interest: bochsrc (which is generated by make config or make bochsconfig) and bochsrc.pattern, which is the template for bochsrc. When you need to change the bochs configuration, always edit bochsrc.pattern and run make config from the sweb directory; that way your changes do not get lost.

1.2.2 bochsgdb 

Almost the same as bochs but with gdb support. 

Start it with "make bochsgdb".

1.2.3 qemu 

Qemu 0.6.2 has been tested too. 

Start it with "make qemu".



1.2.4 VMWare, Linux 

After building, a ready-to-use config file (sweb.vmx) for booting SWEB in VMWare Workstation for Linux (>= 4.0.5) is available in the sweb-testing-bin directory. Older versions might work but have not been tested.

Start it with "make vmware".

1.2.5 VMWare, Windows 

Currently, compiling on Windows is not supported, but you can compile everything on Linux and use the result on Windows. Either mount the sweb-testing-bin directory on your Windows host (SMB, NFS, ...) and simply open the sweb.vmx file in VMWare, or copy all necessary files (SWEB.vmdk, SWEB-flat.vmdk, nvram, sweb.vmx) from the sweb-testing-bin directory to your Windows host and start it there. Due to the hard disk image file format, you have to use VMWare Workstation for Windows version 3.2 or above.

1.2.6 XEN 

Experimental; see the instructions in the Xen chapter.

1.2.7 ’real’ hardware 

To install sweb on a physical hard drive, you first have to create at least 2 partitions on it (3 if you need the third one for swapping). Then install grub onto the first partition of the hard drive (which has to be the active one) and copy sweb-bin/kernel.x and sweb/images/menu.lst.hda to the first partition (to the destinations /boot/kernel.x and /boot/grub/menu.lst, respectively). Finally, copy the user programs you want to run (they can be found in sweb-bin/userspace/) to the second partition, which has to be minix-formatted.

1.3 Developers 

For basic information, have a look at our homepage at http://sweb.sourceforge.net.

Known bugs:

• make doesn’t like arguments except
– "clean"
– "config"
– "mrproper"
– "bochs"
– "bochsgdb"
– "install"
– "qemu"
– "all"
– "info"
– "submit"

• Sometimes bochs requires "/usr/share/bochs/VGABIOS-lgpl-latest". If it is not already there, copy it using "cp /usr/share/vgabios/vgabios.bin /usr/share/bochs/VGABIOS-lgpl-latest" or something similar (bochs doesn’t seem to like symlinks).

• Use "make -i" with care: it may help by ignoring compile errors, but you should really know about the effects.




Chapter 2 

GDB 

2.1 General 

The GNU Project Debugger (GDB) lets you inspect the internal state of a program during its execution, which can be essential for finding non-obvious errors in the program source code. Fortunately, the bochs emulator provides a remote GDB stub, which makes it possible to debug operating systems rather comfortably through gdb, provided the source code is available.

2.2 Bochs 

In order to use the gdb stub, Bochs has to be configured by passing the argument --enable-gdb-stub to the configure script provided with the bochs sources. Note that this conflicts with bochs’ own debugger, so the built-in debugger cannot be used in such a build.

$ ./configure --enable-gdb-stub

Once Bochs has been configured, built and installed this way, it can be started with

$ make bochsgdb 

from the SWEB top directory. Bochs then awaits a gdb connection at localhost:1234. 

2.2.1 Building Bochs with “page fault” patch 

When you start debugging with gdb, you will notice that gdb stops on every page fault. This becomes very annoying as the userspace programs grow, so there is a patch to eliminate this behavior.

• Download bochs-2.3.7.tar.gz from http://bochs.sourceforge.net

• Extract with

$ tar xvzf bochs-2.3.7.tar.gz

• Navigate to the extracted directory and patch the source files

$ patch -p0 < /bochs_2.3.7_page_fault.patch

The patch file is located at ...sweb/utils/bochs

• Once the source files are patched, run the configure script. It is recommended to use an alternative prefix (installation location) so that you can switch between different bochs versions (e.g. the patched and the unpatched version)

$ ./configure --enable-gdb-stub --prefix=/opt/sweb-bochs

• Once the configuration is done, start compiling and installing

$ make
$ sudo make install

You need some common development packages such as gcc to complete the compile process. In addition, you need a package called xorg-dev (x-window-system-dev on some other systems).

• Compiling bochs yourself does not replace installing the vgabios and bochsbios packages

2.3 Qemu 

For using the gdb stub from qemu, start it with 

$ make qemugdb 

2.4 Using GDB 

With bochs or qemu started as described before, gdb can connect to it. There is a 

make target rungdb which preconfigures GDB and connects to localhost:1234. 

$ make rungdb
...
0x0000fff0 in ?? ()
You can now set trace and breakpoints.
Enter 'continue' (or 'c') when you're done.
(gdb)

Once bochs or qemu exits, you could quit GDB and run make rungdb again, but then all your temporary settings such as breakpoints are gone and you have to set them up again. There is another way to reconnect GDB to bochs or qemu: just run the following from inside GDB, and all your breakpoints and other settings remain active.

target remote localhost:1234

From here, gdb can be used as usual, except that it is not necessary to call run, as 

the simulation is already started but paused for command input. It can be resumed 

using the continue command. 


CHAPTER 2. GDB 

2.5. USING DDD 

The documentation for GDB can be found at http://www.gnu.org/software/gdb/documentation/. A very useful tutorial which covers the major features of GDB can be found at http://dirac.org/linux/gdb/; more tutorials can be found on the world wide web by using a search engine.

2.4.1 Custom GDB macros 

When running GDB as described above, some custom macros are installed automatically. GDB macros simplify the debugging process by providing new commands. The following is a list of the available custom commands.

• continue_on_pagefault: Tells GDB to ignore all page faults (from userspace and kernelspace). It only works with the patched bochs version.

• stop_on_pagefault: Tells GDB to stop on every page fault. If the page fault comes from kernelspace, the backtrace (to determine the origin of the page fault) is shown.

• btpf_enable_full: Also shows local variables and their values in the page fault backtrace.

• btpf_disable_full: Shows the short version of the page fault backtrace.

• backtrace_pagefault: Shows the backtrace of the page fault again. Only call this when the debugger has stopped because of a page fault.

To call one of the above macros, just enter its name at the gdb command prompt. To show the help for these custom macros, enter help user-defined.

2.5 Using DDD 

The Data Display Debugger (ddd) is a graphical front end to gdb. The gdb section above also applies to ddd. To run ddd, just enter

$ make runddd

and you will get a preconfigured ddd environment. In the lower area you can find the gdb command prompt, where you can enter gdb commands as described above.

2.5.1 DDD example session 

This example assumes that you have compiled bochs with the page fault patch, configured MakeOptions correctly to use this bochs, and installed all necessary tools.

Open two separate terminals, navigate to the sweb directory and run the following from Term1

$ make && make bochsgdb 

Now a bochs window should open, and Term1 should show something like

Waiting for gdb connection on port 1234

Now switch to Term2 and enter 

$ make runddd 

Now ddd should start and display the gdb command prompt in the lower area. Open main.cpp (File/Open Source...) and scroll down to startup. Set a breakpoint anywhere inside the startup method (right click in front of the line/Set Breakpoint). Once the breakpoint is set, your window should look similar to the following:

[Screenshot: ddd window with the breakpoint set]

Now you can enter continue (or c) in the gdb command window, and you will notice that bochs continues execution. Once the execution reaches your breakpoint, ddd shows a green arrow in front of the current line. You now have several options to continue: you can simply enter c to run to the next breakpoint, or use step, next, stepi, and so on. Take a look at the ddd command window.

[Screenshot: ddd command window]

To get more information about a command, enter help followed by its name (e.g. help step) in the gdb command window.




Chapter 3 

VM, Protection and Paging 

3.1 Notation 

3.1.1 Memory Types 

linear M. Intel terminology for linear addressed memory 

virtual M. Memory whose addresses do not map directly to a physical memory address, or do not map to physical memory at all.

physical M. describes short-access-time memory like Random Access Memory. 

swap M. describes long-access-time memory used to extend physical memory, usually 

hard disk memory. 

Definition taken from Chapter 3.3.2 of Intel Software Developer’s Manual [14]: 

When using the IA-32 architecture’s paging mechanism (paging enabled), 

linear address space is divided into pages which are mapped to virtual memory. 

The pages of virtual memory are then mapped as needed into physical 

memory. When an operating system or executive uses paging, the paging 

mechanism is transparent to an application program. All that the application 

sees is linear address space. 

3.1.2 Units 

The units used in this document conform to the IEC 60027-2 A.2 Standard [1].

KiB  2^10 Byte

MiB  2^20 Byte

GiB  2^30 Byte

3.2 Abstract 

The following document aims to give a detailed introduction to the x86 segmentation and paging mechanisms. It further aims to acquaint the reader with the use of these concepts in the Sweb operating system.


Figure 3.1: Protected Mode Segment Descriptor ((c) Dr. Jim Plusquellic [2]) 

3.3 Basic Concepts 

3.3.1 Protected Mode 

The IA32 architecture offers a feature set called "Protected Mode", which makes task separation possible. Using the Protected Mode capabilities, we can place restrictions on what a piece of code is allowed to do and in which part of the physical memory it may work. The two mechanisms making this possible are Protected Mode Segmentation and Paging. They will be explained in this chapter. Note that in order to switch to Protected Mode, at the very least the special registers GDTR and IDTR must have been loaded.

3.3.2 Protected Mode Segmentation 

Protected Mode Segmentation works differently than Real Mode Segmentation. In this 

document we consider only Protected Mode Segmentation. 

Basically it works by defining memory segments using so-called Segment Descriptors. They contain a Start Address in linear memory and a Limit to how much memory beyond the start address a program is allowed to access. The Segment Descriptor also defines restrictions on the kind of operations a program can execute on the segment. An access that violates the Limit or these restrictions triggers a Segment Violation. Even before switching to Protected Mode, some Segment Descriptors need to be defined in the so-called Global Descriptor Table (short GDT). Like any descriptor table, the GDT is an array of Segment Descriptors, each 8 bytes long. Obviously, a restricted program cannot be allowed to (re)write the GDT, and the protection mechanisms available should be employed to prevent this. The GDT should contain at least one reasonable Segment Descriptor beyond the first entry. The first entry, also called the Null Descriptor, has a special purpose and all its values are set to zero. Once a GDT has been set up, we can point the CPU to its location in memory by loading its linear address into the Global Descriptor Table Register (short GDTR) using the LGDT instruction. Figure 3.1 shows the fields of a Segment Descriptor. Table 3.1 explains their use.

Obviously, defining just one set of Descriptors for all programs is not going to help much with Task Separation, so conveniently the Local Descriptor Table (short LDT) comes to the rescue. The LDT, just like the GDT, is an array of Segment Descriptors, but contains only Descriptors specific to one program. Usually there is one LDT for each separate program. The running program can then choose between segments by writing the corresponding index into one of the segment registers (CS for Code Segment Register, DS for Data Segment Register, SS for Stack Segment Register).

Field     Bits     Use
Limit     20 bits  defines the amount of memory (in Limit-Units, see G), starting from the Base Address, which a program is allowed to access
Base      32 bits  the linear starting address of the Segment
P         1 bit    Descriptor Present Flag, if set to 0 the entry is ignored
DPL       2 bits   Descriptor Privilege Level
S         1 bit    Descriptor Type Flag: if cleared (0), this is a special system descriptor; if set (1), it is a code or data segment descriptor
Type      3 bits   defines the segment type (note: interpretation is different in system descriptors)
          Type=000 Data, read-only
          Type=001 Data, read/write
          Type=010 Data, read-only, expand-down (typically used for a Stack)
          Type=011 Data, read/write, expand-down (typically used for a Stack)
          Type=100 Code, execute-only
          Type=101 Code, execute/read
          Type=110 Code, execute-only, conforming
          Type=111 Code, execute/read, conforming
A         1 bit    set by the CPU, indicates if a Segment has been accessed
G         1 bit    Granularity: defines the size of one Limit-Unit
          G=0      Granularity is 1 Byte, segments are 1 Byte to 1 MiB in length
          G=1      Granularity is 4 KiB, segments are 4 KiB to 4 GiB in length
D         1 bit    If 1: operands and addresses are assumed to be 32 bit by default, 16 bit otherwise
U         1 bit    user defined bit

Table 3.1: Segment Descriptor Fields

Privilege Level 

Four privilege levels, numbered 0 to 3, exist on the IA32 platform. The privilege level determines how large a part of the instruction set a piece of code can use without triggering a Protection Fault. The highest privilege is level 0, which is usually held only by the operating system since everything is allowed here. In Sweb, the Kernel has Level 0 while user programs have Level 3. The Privilege Level is defined by setting the DPL field of the Segment Descriptor in which a program's code runs. When changing Segments (by loading a different Descriptor Index into a Segment Register), a program may only keep or give up privileges; it can never gain them this way. A program can, however, access privileged code through a Call Gate or by asking the Operating System to increase its Privilege Level.

3.3.3 Paging 

Paging is the other protection mechanism and, unlike Segmentation, was newly introduced with Protected Mode. (Of course, the principle did exist long before the IA32 platform adopted the concept.)

Paging works by dividing linear and physical memory into Pages and mapping Pages in linear memory to Pages in physical memory. A Page can be 4 KiB, 2 MiB or 4 MiB in size. It is also possible to mix 4 KiB and 4 MiB Pages. In Sweb we mainly concern ourselves with 4 KiB Pages. Physical Memory is the range of addresses which translate directly into a memory location on the hardware. A linear memory address, however, does not necessarily correspond to an address in physical hardware (see Figure 3.2). Rather, the CPU translates a linear address into a physical address according to the mapping that was set up by the operating system. Separation works by assigning each program its own linear address space ranging from 0 to 4 GiB. In Sweb, however, a user thread is not allowed to access an address beyond 2 GiB. Translation happens transparently and usually only the operating system knows what the mapping looks like. Should an unmapped linear address be referenced, the CPU raises a Page-Fault (a Fault is a type of Exception) and calls Sweb's Page-Fault-Handler (as defined in the IDT) to deal with the problem.

Page-Directory and Page-Tables 

The mapping of Linear Memory to Physical Memory is defined in a Page-Directory and its Page-Tables. A Page-Directory has 1024 entries. Each entry has 32 bits. A Page-Directory-Entry can either map a 4 MiB Page or hold the physical address of a Page-Table. (Actually it holds only the most significant 20 bits of the physical address, which equal the physical 4 KiB page number.) A Page-Table again has 1024 entries, each 32 bits in size. A Page-Table-Entry maps to a 4 KiB Page in physical memory. The structure of the Page-Directory-Entries and Page-Table-Entries is shown in the following listing of paging-definitions.h. Note that the most significant 20 bits in a Page-Directory-Entry can have different interpretations depending on the state of the use_4_m_pages bit. Sweb defines a union of the two structs so both interpretations can be used conveniently. The bits are further explained in Table 3.3.

Figure 3.2: Example Mapping between linear and physical Memory

#include "types.h"

#define PAGE_TABLE_ENTRIES 1024
#define PAGE_SIZE 4096
#define PAGE_INDEX_OFFSET_BITS 12

// Page-Directory-Entry that points to a Page-Table (4KiB pages)
struct page_directory_entry_4k_struct
{
  uint32 present                 : 1;
  uint32 writeable               : 1;
  uint32 user_access             : 1;
  uint32 write_through           : 1;
  uint32 cache_disabled          : 1;
  uint32 accessed                : 1;
  uint32 reserved                : 1;
  uint32 use_4_m_pages           : 1;
  uint32 global_page             : 1;
  uint32 avail_1                 : 1;
  uint32 avail_2                 : 1;
  uint32 avail_3                 : 1;
  uint32 page_table_base_address : 20;
} __attribute__((__packed__));

// Page-Directory-Entry that maps a 4MiB page directly
struct page_directory_entry_4m_struct
{
  uint32 present                 : 1;
  uint32 writeable               : 1;
  uint32 user_access             : 1;
  uint32 write_through           : 1;
  uint32 cache_disabled          : 1;
  uint32 accessed                : 1;
  uint32 dirty                   : 1;
  uint32 use_4_m_pages           : 1;
  uint32 global_page             : 1;
  uint32 avail_1                 : 1;
  uint32 avail_2                 : 1;
  uint32 avail_3                 : 1;
  uint32 pat                     : 1;
  uint32 reserved                : 9;
  uint32 page_base_address       : 10;
} __attribute__((__packed__));

// Both interpretations share the same 32 bits
union page_directory_entry_union
{
  struct page_directory_entry_4k_struct pde4k;
  struct page_directory_entry_4m_struct pde4m;
} __attribute__((__packed__));
typedef union page_directory_entry_union page_directory_entry;

// Page-Table-Entry that maps one 4KiB page
struct page_table_entry_struct
{
  uint32 present                 : 1;
  uint32 writeable               : 1;
  uint32 user_access             : 1;
  uint32 write_through           : 1;
  uint32 cache_disabled          : 1;
  uint32 accessed                : 1;
  uint32 dirty                   : 1;
  uint32 pat                     : 1;
  uint32 global_page             : 1;
  uint32 avail_1                 : 1;
  uint32 avail_2                 : 1;
  uint32 avail_3                 : 1;
  uint32 page_base_address       : 20;
} __attribute__((__packed__));
typedef struct page_table_entry_struct page_table_entry;

Note: A set flag in a Page-Directory-Entry applies to one 4MiB Page or all 1024 

Page-Table-Entries, even if that flag is not set in each Page-Table-Entry. 
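As a rough illustration of the two interpretations of a Page-Directory-Entry, the following sketch decodes a raw 32-bit entry with shifts and masks instead of bit-fields. It assumes that the first field of the structs in paging-definitions.h occupies the least significant bit, as gcc lays out bit-fields on x86. The names DecodedPde and decodePde are hypothetical and not part of Sweb.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type, not part of Sweb.
struct DecodedPde
{
  bool present;
  bool writeable;
  bool user_access;
  bool use_4_m_pages;
  uint32_t base;  // Page-Table page number (4KiB case) or 4MiB page number
};

// Decodes a raw Page-Directory-Entry.
// Assumed layout: bit 0 = present, bit 1 = writeable, bit 2 = user_access,
// bit 7 = use_4_m_pages, address bits at the top of the entry.
inline DecodedPde decodePde(uint32_t raw)
{
  DecodedPde d;
  d.present       = ((raw >> 0) & 1u) != 0;
  d.writeable     = ((raw >> 1) & 1u) != 0;
  d.user_access   = ((raw >> 2) & 1u) != 0;
  d.use_4_m_pages = ((raw >> 7) & 1u) != 0;
  // 4MiB entry: upper 10 bits; entry pointing to a Page-Table: upper 20 bits.
  d.base = d.use_4_m_pages ? (raw >> 22) : (raw >> 12);
  return d;
}
```

Under these assumptions, the value 0x00123007 describes a present, writeable, user-accessible entry pointing to the Page-Table in physical page 0x123, while 0x00400083 maps the 4MiB page number 1.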

Address Translation 

The following is a pseudo code example on how the CPU translates a linear address 

into a physical one. The example assumes 4KiB-Pages. Translation is also illustrated 

in Figure 3.3. 

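\begin{align*}
\mathit{Index}_{PageDirectory} &= \mathrm{div}\!\left(\frac{\mathit{linearAddr}}{\mathit{PageSize}},\ \mathit{PageTableEntries}\right) \\
\mathit{Index}_{PageTable} &= \mathrm{mod}\!\left(\frac{\mathit{linearAddr}}{\mathit{PageSize}},\ \mathit{PageTableEntries}\right) \\
\mathrm{32bit}\ {*}\mathit{PageDirectory} &= \mathrm{readRegister}(CR3) \\
\mathrm{32bit}\ {*}\mathit{PageTable} &= \mathit{PageDirectory}[\mathit{Index}_{PageDirectory}].\mathit{PageTableBase} \cdot \mathit{PageSize} \\
\mathit{physicalPage} &= \mathit{PageTable}[\mathit{Index}_{PageTable}].\mathit{PageBase} \\
\mathit{physicalAddr} &= \mathit{physicalPage} \cdot \mathit{PageSize} + \mathrm{mod}(\mathit{linearAddr},\ \mathit{PageSize})
\end{align*}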
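The index arithmetic of the translation can be tried out in ordinary user code. The following is only a sketch of the calculation for 4KiB pages; it does not read CR3 or any real Page-Table, and the names LinearParts, splitLinear and physicalAddr are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPageSize = 4096;          // bytes per 4KiB page
constexpr uint32_t kPageTableEntries = 1024;  // entries per Page-Directory or Page-Table

// Hypothetical helper type: the three parts of a linear address.
struct LinearParts
{
  uint32_t pde_index;  // index into the Page-Directory
  uint32_t pte_index;  // index into the Page-Table
  uint32_t offset;     // byte offset inside the page
};

inline LinearParts splitLinear(uint32_t linear_addr)
{
  const uint32_t linear_page = linear_addr / kPageSize;
  return LinearParts{linear_page / kPageTableEntries,
                     linear_page % kPageTableEntries,
                     linear_addr % kPageSize};
}

// Last step of the translation: combine the physical page number taken from
// the Page-Table-Entry with the offset inside the page.
inline uint32_t physicalAddr(uint32_t physical_page, uint32_t linear_addr)
{
  assert(physical_page < (1u << 20));  // page numbers are 20 bits wide
  return physical_page * kPageSize + linear_addr % kPageSize;
}
```

For example, the linear address 0x80101234 splits into Page-Directory index 512, Page-Table index 0x101 and offset 0x234. If the Page-Table-Entry holds the physical page number 0x101, the resulting physical address is 0x00101234.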

Control Register 0, 2, 3 and 4 

The CR registers are 32 bits wide and control the behaviour of the CPU. For a complete reference see [16]. Table 3.4 lists the bits in these registers that are used to control Paging.

CR2 holds the linear address whose access caused the last Page-Fault. The Stack of the Page-Fault-Handler holds additional information, e.g. whether code tried to write to a read-only page or read from an unmapped address.

CR3 holds the physical memory address of the currently active Page-Directory. However, only the most significant 20 bits are used by the CPU, the rest is assumed to be 0. The address of any Page-Directory must therefore start on a 4KiB Page, that is, on an address which is a multiple of 4096. The CPU internally caches some of the contents of the Page-Directory and Page-Tables in the TLB (Translation Look-aside Buffer). Therefore, if the Page-Directory or a Page-Table is changed, we need to reload CR3 to flush the TLB, even if the address of the Page-Directory did not change. Note that bit 4 of CR3 (PCD) does not disable the TLB; it only disables caching of the Page-Directory itself.



Field                    Bits      Description
present                  1         If set, the linear address range associated with this Entry is valid.
writeable                1         If set, writing to the linear address range associated with this Entry is possible. Note that unless the WP flag in CR0 is set, supervisor level code can still write to read-only pages.
user_access              1         If set, this page is user accessible, otherwise it is only supervisor accessible (which means only with privilege levels 0 to 2).
write_through            1         Switches between write-through and write-back caching on a per page basis.
cache_disabled           1         Controls caching on a per page basis.
accessed                 1         Set by the CPU if a linear address associated with this page has been accessed. Should be cleared by the operating system after every read of this bit to continuously track page access.
dirty                    1         Set by the CPU if a page has been written to since the bit was last cleared by the operating system.
use_4_m_pages            1         This bit only exists in the Page-Directory-Entry. If set, the most significant 10 bits of this Page-Directory-Entry are interpreted as the page number of a physical 4MiB Page. If unset, the most significant 20 bits are interpreted as the 4KiB physical page number of a Page-Table.
global_page              1         If set by the operating system, the CPU assumes this to be a shared page and handles entries in the TLB (Translation Look-aside Buffer) accordingly. Note that this bit has effect only in a Page-Table-Entry or in Page-Directory-Entries that point to 4MiB Pages.
page_table_base_address  20        The 20 most significant bits of the physical address of a Page-Table.
page_base_address        10 or 20  The 20 most significant bits of the start address of a physical 4KiB Page or the 10 most significant bits of the start address of a 4MiB Page.

Table 3.3: Explanation of Fields in Page-Directory- and Page-Table-Entries




Figure 3.3: 4k page address translation ((c) by Intel) 

Register  Bit  Name  Description
CR0       0    PE    enables Protected Mode
CR0       16   WP    If set, writing to read-only pages causes a Page-Fault even for Ring 0 Code (needed for CopyOnWrite)
CR0       31   PG    enables Paging
CR4       4    PSE   allows mixing of 4KiB and 4MiB pages
CR4       7    PGE   allows pages to be global, meaning they are not flushed from cache on task switch (see shared memory)

Table 3.4: Control Registers and Paging
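The bit positions of Table 3.4 translate directly into bit masks. The sketch below only manipulates plain 32-bit values; actually reading or writing CR0 and CR4 needs privileged instructions and is not shown. The constant and function names are illustrative and are not taken from the Sweb sources.

```cpp
#include <cassert>
#include <cstdint>

// Bit masks for the flags listed in Table 3.4 (names are illustrative).
constexpr uint32_t CR0_PE  = 1u << 0;   // Protected Mode enable
constexpr uint32_t CR0_WP  = 1u << 16;  // write protect also for Ring 0
constexpr uint32_t CR0_PG  = 1u << 31;  // Paging enable
constexpr uint32_t CR4_PSE = 1u << 4;   // allow 4MiB pages
constexpr uint32_t CR4_PGE = 1u << 7;   // allow global pages

// Returns true if a CR0 value has both Protected Mode and Paging enabled.
constexpr bool pagingActive(uint32_t cr0)
{
  return (cr0 & CR0_PE) != 0 && (cr0 & CR0_PG) != 0;
}

// Returns the given CR0 value with write protection for Ring 0 switched on.
constexpr uint32_t withWriteProtect(uint32_t cr0)
{
  return cr0 | CR0_WP;
}
```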



3.4 Interface & Classes 

3.4.1 ArchMemory 

The class ArchMemory does not get instantiated. Instead it is a collection of architecture dependent static methods related to Memory and Pages. Therefore these methods will be referred to as functions. An alternative would have been to use namespaces, but the static-class approach was taken. ArchMemory contains architecture dependent code and needs to be implemented separately for each platform. This document focuses on the x86 architecture.

Interface 

Function:    static void initNewPageDirectory(uint32 physical_page_to_use);
Description: Creates a new Page-Directory for a user process by copying the original kernel-only Page-Directory. The parameter physical_page_to_use specifies the physical page where the new Page-Directory should be placed.

Function:    static void mapPage(uint32 physical_page_directory_page, uint32 linear_page, uint32 physical_page, uint32 user_access, uint32 page_size=PAGE_SIZE);
Description: Maps a linear page to a physical page (pde and pte need to be set up first). physical_page_directory_page specifies the physical page where the Page-Directory to work on can be found. With user_access the User/Supervisor Flag can be set. page_size is optional and defaults to 4KiB pages, but you need to set it to 1024*4096 if you want to map a 4MiB page.

Function:    static void unmapPage(uint32 physical_page_directory_page, uint32 linear_page);
Description: Removes the mapping of a linear_page by marking its Page-Table-Entry as not valid. The parameter physical_page_directory_page specifies the physical page where the Page-Directory to work on can be found. linear_page identifies the mapping that will be invalidated.

Function:    static void freePageDirectory(uint32 physical_page_directory_page);
Description: Recursively removes a Page-Directory and the Page-Tables it points to, and marks all mapped pages as free in the PageManager. physical_page_directory_page holds the 20 most significant bits of the physical address at which the Page-Directory can be found.

Function:    static pointer get3GBAdressOfPPN(uint32 ppn, uint32 page_size=PAGE_SIZE);
Description: Sweb keeps a 1:1 mapping of physical memory that is present in every address space. This function takes a physical page number and returns a linear address that can be used to directly access the given page. Basically, the page number is multiplied by page_size and an offset of 3GiB is added. This is a very important method, since it is the only way to manipulate page directories or page tables. The parameter page_size is optional and defaults to 4KiB pages. It needs to be set to 1024*4096 when using a 4MiB page number.

Function:    static bool checkAddressValid(uint32 physical_page_directory_page, uint32 laddress_to_check);
Description: Checks if a given linear address is valid and therefore mapped to physical memory. physical_page_directory_page is the physical page where the Page-Directory whose mappings are to be checked can be found. The function returns true if a mapping of laddress_to_check exists in that Page-Directory, and false if the given linear address is unmapped and accessing it would result in a page fault.

Function:    static uint32 getPhysicalPageOfVirtualPageInKernelMapping(uint32 linear_page, uint32 *physical_page);
Description: Takes a linear_page number and searches through the kernel Page-Directory for the physical page number it refers to. The result is stored in the variable that physical_page points to. To access that physical page directly, get3GBAdressOfPPN(..) should be used. Returns 0 if the linear page does not map to any physical page, or the page size in bytes on success (4096 for 4KiB pages or 4096*1024 for 4MiB pages).

Table 3.5: ArchMemory Interface
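The arithmetic described for get3GBAdressOfPPN can be sketched as follows. This is not the Sweb implementation, which is not part of the listing in this chapter; the function name identAddressOfPPN and the constants are assumptions made for illustration, and the function returns a plain number instead of a pointer so that it can be run outside the kernel.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kPageSize = 4096;            // default page size in bytes
constexpr uint32_t kIdentOffset = 0xC0000000u;  // 3GiB

// Sketch of the calculation described for get3GBAdressOfPPN (hypothetical name).
inline uint32_t identAddressOfPPN(uint32_t ppn, uint32_t page_size = kPageSize)
{
  // Only the first 1GiB of physical memory fits between 3GiB and 4GiB.
  assert(static_cast<uint64_t>(ppn) * page_size < 0x40000000ull);
  return kIdentOffset + ppn * page_size;
}
```

With these assumptions, physical page 1 is reachable at linear address 0xC0001000, and the 4MiB page number 1 (page_size = 1024*4096) at 0xC0400000.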

Implementation 

#include "ArchMemory.h"
#include "kprintf.h"
#include "assert.h"

extern "C" uint32 kernel_page_directory_start;

void ArchMemory::initNewPageDirectory(uint32 physical_page_to_use)
{
  page_directory_entry *new_page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_to_use);
  ArchCommon::memcpy(
    (pointer) new_page_directory,
    (pointer) &kernel_page_directory_start,
    PAGE_SIZE);
  for (uint32 p = 0; p < 512; ++p) // we're concerned with the first two gig, the rest stays as is
  {
    new_page_directory[p].pde4k.present = 0;
  }
}

// frees a Page-Table and its Page-Directory-Entry if the table has become empty
void ArchMemory::checkAndRemovePTE(uint32 physical_page_directory_page,
                                   uint32 pde_lpn)
{
  page_directory_entry *page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_directory_page);
  page_table_entry *pte_base =
    (page_table_entry*) get3GBAdressOfPPN(page_directory[pde_lpn].pde4k.page_table_base_address);
  assert(page_directory[pde_lpn].pde4m.use_4_m_pages == 0);
  for (uint32 pte_lpn = 0; pte_lpn < PAGE_TABLE_ENTRIES; ++pte_lpn)
    if (pte_base[pte_lpn].present > 0)
      return; // not empty -> do nothing

  // else:
  page_directory[pde_lpn].pde4k.present = 0;
  PageManager::instance()->freePage(page_directory[pde_lpn].pde4k.page_table_base_address);
}

void ArchMemory::unmapPage(uint32 physical_page_directory_page,
                           uint32 linear_page)
{
  page_directory_entry *page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_directory_page);
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;

  if (page_directory[pde_lpn].pde4m.use_4_m_pages)
  {
    page_directory[pde_lpn].pde4m.present = 0;
    // PageManager manages Pages of size PAGE_SIZE only,
    // so we have to free this_page_size/PAGE_SIZE Pages
    for (uint32 p = 0; p < 1024; ++p)
      PageManager::instance()->freePage(page_directory[pde_lpn].pde4m.page_base_address*1024 + p);
  }
  else
  {
    page_table_entry *pte_base =
      (page_table_entry*) get3GBAdressOfPPN(page_directory[pde_lpn].pde4k.page_table_base_address);
    if (pte_base[pte_lpn].present)
    {
      pte_base[pte_lpn].present = 0;
      PageManager::instance()->freePage(pte_base[pte_lpn].page_base_address);
    }
    checkAndRemovePTE(physical_page_directory_page, pde_lpn);
  }
}

// installs an empty Page-Table in the given Page-Directory-Entry
void ArchMemory::insertPTE(uint32 physical_page_directory_page,
                           uint32 pde_lpn,
                           uint32 physical_page_table_page)
{
  page_directory_entry *page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_directory_page);
  ArchCommon::bzero(get3GBAdressOfPPN(physical_page_table_page), PAGE_SIZE);
  page_directory[pde_lpn].pde4k.present = 1;
  page_directory[pde_lpn].pde4k.writeable = 1;
  page_directory[pde_lpn].pde4k.use_4_m_pages = 0;
  page_directory[pde_lpn].pde4k.page_table_base_address = physical_page_table_page;
  page_directory[pde_lpn].pde4k.user_access = 1;
}

void ArchMemory::mapPage(uint32 physical_page_directory_page,
                         uint32 linear_page,
                         uint32 physical_page,
                         uint32 user_access,
                         uint32 page_size)
{
  kprintfd_nosleep("ArchMemory::mapPage: pys1 %x, pyhs2 %x\n",
                   physical_page_directory_page,
                   physical_page);
  page_directory_entry *page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_directory_page);
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;
  if (page_size == PAGE_SIZE)
  {
    if (page_directory[pde_lpn].pde4k.present == 0)
    {
      insertPTE(physical_page_directory_page, pde_lpn,
                PageManager::instance()->getFreePhysicalPage());
    }
    page_table_entry *pte_base =
      (page_table_entry*) get3GBAdressOfPPN(
        page_directory[pde_lpn].pde4k.page_table_base_address
      );
    pte_base[pte_lpn].present = 1;
    pte_base[pte_lpn].writeable = 1;
    pte_base[pte_lpn].user_access = user_access;
    pte_base[pte_lpn].page_base_address = physical_page;
  }
  else if ((page_size == PAGE_SIZE*1024) && (page_directory[pde_lpn].pde4m.present == 0))
  {
    page_directory[pde_lpn].pde4m.present = 1;
    page_directory[pde_lpn].pde4m.writeable = 1;
    page_directory[pde_lpn].pde4m.use_4_m_pages = 1;
    page_directory[pde_lpn].pde4m.page_base_address = physical_page;
    page_directory[pde_lpn].pde4m.user_access = user_access;
  }
  else
    assert(false);
}

// only free pte's < PAGE_TABLE_ENTRIES/2 because we do NOT
// want to free Kernel Pages
void ArchMemory::freePageDirectory(uint32 physical_page_directory_page)
{
  page_directory_entry *page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_directory_page);
  for (uint32 pde_lpn = 0; pde_lpn < PAGE_TABLE_ENTRIES/2; ++pde_lpn)
  {
    if (page_directory[pde_lpn].pde4k.present)
    {
      if (page_directory[pde_lpn].pde4m.use_4_m_pages)
      {
        page_directory[pde_lpn].pde4m.present = 0;
        for (uint32 p = 0; p < 1024; ++p)
          PageManager::instance()->freePage(
            page_directory[pde_lpn].pde4m.page_base_address*1024 + p
          );
      }
      else
      {
        page_table_entry *pte_base =
          (page_table_entry*) get3GBAdressOfPPN(
            page_directory[pde_lpn].pde4k.page_table_base_address
          );
        for (uint32 pte_lpn = 0; pte_lpn < PAGE_TABLE_ENTRIES; ++pte_lpn)
        {
          if (pte_base[pte_lpn].present)
          {
            pte_base[pte_lpn].present = 0;
            PageManager::instance()->freePage(pte_base[pte_lpn].page_base_address);
          }
        }
        page_directory[pde_lpn].pde4k.present = 0;
        PageManager::instance()->freePage(page_directory[pde_lpn].pde4k.page_table_base_address);
      }
    }
  }
  PageManager::instance()->freePage(physical_page_directory_page);
}

bool ArchMemory::checkAddressValid(uint32 physical_page_directory_page, uint32 laddress_to_check)
{
  page_directory_entry *page_directory =
    (page_directory_entry*) get3GBAdressOfPPN(physical_page_directory_page);
  uint32 linear_page = laddress_to_check / PAGE_SIZE;
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;
  if (page_directory[pde_lpn].pde4k.present)
  {
    if (page_directory[pde_lpn].pde4m.use_4_m_pages)
      return true;

    page_table_entry *pte_base =
      (page_table_entry*) get3GBAdressOfPPN(
        page_directory[pde_lpn].pde4k.page_table_base_address
      );
    if (pte_base[pte_lpn].present)
      return true;
    else
      return false;
  }
  else
    return false;
}

uint32 ArchMemory::getPhysicalPageOfVirtualPageInKernelMapping(uint32 linear_page,
                                                               uint32 *physical_page)
{
  page_directory_entry *page_directory =
    (page_directory_entry*) &kernel_page_directory_start;
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;
  if (page_directory[pde_lpn].pde4k.present) // the present bit is the same for 4k and 4m
  {
    if (page_directory[pde_lpn].pde4m.use_4_m_pages)
    {
      *physical_page = page_directory[pde_lpn].pde4m.page_base_address;
      return (PAGE_SIZE*1024U);
    }
    else
    {
      page_table_entry *pte_base =
        (page_table_entry*) get3GBAdressOfPPN(
          page_directory[pde_lpn].pde4k.page_table_base_address
        );
      if (pte_base[pte_lpn].present)
      {
        *physical_page = pte_base[pte_lpn].page_base_address;
        return PAGE_SIZE;
      }
      else
        return 0;
    }
  }
  else
    return 0;
}

3.4.2 PageManager 

The class PageManager is a singleton class. Sweb uses PageManager to keep track of which physical pages are free or used. The PageManager assumes all pages to be 4KiB in size. PageManager historically is created even before the KernelMemoryManager and loaded at the linear address &kernel_end_address. For each physical memory page the manager uses 8 bits of kernel memory. With a memory footprint of sizeof(PageManager) + number_of_physical_pages * sizeof(uint8) and the Sweb limit of at most 1 GiB of physical memory (262144 pages of 4KiB), PageManager can consume a maximum of 12 Byte + 256 KiB. Currently Sweb knows four page usage types: PAGE_RESERVED, PAGE_KERNEL, PAGE_USERSPACE, PAGE_FREE. Pages marked reserved can never be freed; they are memory areas used by the hardware or modules loaded by grub. The 4 remaining bits are free for future use.
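The size of the usage table follows directly from the numbers given above. The following compile-time sketch repeats the calculation; usageTableBytes is a made-up helper name, and the 12 bytes for the PageManager object itself are taken from the text rather than computed.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t KiB = 1024;
constexpr uint64_t GiB = 1024 * 1024 * 1024;

// Bytes needed for the usage table when one byte tracks one 4KiB page.
constexpr uint64_t usageTableBytes(uint64_t physical_memory_bytes)
{
  return (physical_memory_bytes / (4 * KiB)) * sizeof(uint8_t);
}

// 1GiB of RAM is 262144 pages, so the table needs 256KiB.
static_assert(usageTableBytes(1 * GiB) == 256 * KiB, "1GiB needs a 256KiB table");
// The theoretical maximum of 4GiB would need 1MiB.
static_assert(usageTableBytes(4 * GiB) == 1024 * KiB, "4GiB would need a 1MiB table");
```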

Interface 

Method:      static pointer createPageManager(pointer next_usable_address);
Description: Creates the singleton instance of PageManager. This method in turn calls the PageManager constructor using placement new.

Method:      static PageManager *instance();
Description: Returns a pointer to the singleton instance of PageManager. Example: PageManager::instance()->method();

Method:      uint32 getTotalNumPages() const;
Description: Returns the number of 4KiB Pages available to Sweb. That is the size of the first usable memory region as reported by grub, or a maximum of 1GiB, divided by the size of one page (4KiB).

Method:      uint32 getFreePhysicalPage(uint32 type);
Description: Returns the next free physical page number and marks that page as used. It takes one parameter "type", which should be either PAGE_USERSPACE, which is the default, or PAGE_KERNEL. Note that the type set in the PageManager does not influence the User/Supervisor Flag in the Page-Directory.

Method:      void freePage(uint32 page_number);
Description: Marks one physical page identified by its page_number as free, if it was marked as used in user or kernel context.

Method:      void startUsingSyncMechanism();
Description: To be called after initialising the KernelMemoryManager and before starting Interrupts, to ensure Mutual Exclusion.

Table 3.6: PageManager Interface
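To make the behaviour of getFreePhysicalPage and freePage more tangible, here is a simplified, single-threaded model of a page usage table that can be run in user space. It is only a sketch under several assumptions: the class name PageUsageModel is invented, the numeric values of the page types are arbitrary, there is no locking, and running out of pages returns 0 instead of stopping the kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative values; the real constants are defined by Sweb.
enum PageType : uint8_t { PAGE_FREE, PAGE_RESERVED, PAGE_KERNEL, PAGE_USERSPACE };

// Simplified model of the page usage table (no locking, no memory map parsing).
class PageUsageModel
{
 public:
  PageUsageModel(uint32_t number_of_pages, uint32_t lowest_unreserved_page)
      : table_(number_of_pages, PAGE_RESERVED), lowest_(lowest_unreserved_page)
  {
    // Everything below lowest_ stays reserved, the rest starts out free.
    for (uint32_t p = lowest_; p < number_of_pages; ++p)
      table_[p] = PAGE_FREE;
  }

  // Returns the first free page at or above lowest_ and marks it with type.
  // Returns 0 if type is not a valid usage type or if no page is free.
  uint32_t getFreePhysicalPage(PageType type = PAGE_USERSPACE)
  {
    if (type == PAGE_FREE || type == PAGE_RESERVED)
      return 0;
    for (uint32_t p = lowest_; p < table_.size(); ++p)
    {
      if (table_[p] == PAGE_FREE)
      {
        table_[p] = type;
        return p;
      }
    }
    return 0;
  }

  // Marks a page as free again; reserved or out-of-range pages are ignored.
  void freePage(uint32_t page_number)
  {
    if (page_number >= lowest_ && page_number < table_.size() &&
        table_[page_number] != PAGE_RESERVED)
      table_[page_number] = PAGE_FREE;
  }

 private:
  std::vector<PageType> table_;
  uint32_t lowest_;
};
```

With 2048 pages and the first 1024 reserved, the first two requests return pages 1024 and 1025. After freePage(1024), the next request returns 1024 again, because the search always starts at the lowest unreserved page.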

3.4.3 Implementation 

/**
 * @file PageManager.cpp
 */

#include "mm/PageManager.h"
#include "mm/new.h"
#include "paging-definitions.h"
#include "arch_panic.h"
#include "ArchCommon.h"
#include "ArchMemory.h"
#include "debug_bochs.h"
#include "console/kprintf.h"
#include "ArchInterrupts.h"
#include "assert.h"

PageManager* PageManager::instance_ = 0;

extern void* kernel_end_address;

void PageManager::createPageManager()
{
  if (instance_)
    return;

  instance_ = new PageManager();
}

PageManager::PageManager()
{
  number_of_pages_ = 0;
  lowest_unreserved_page_ = 256; //physical memory
  [...]
  uint32 highest_address_below_1gig=0, type=0, used_pages=0;

  //Determine Amount of RAM
  for (i = 0; i [...]
  number_of_pages_ = Min(number_of_pages_, 1024*256);

  //we use KMM now
  //note, the max size of our page_useage_table0 is 1MiB in theory (4GiB RAM) and
  //256KiB in practice (1GiB RAM) (calculated with sizeof(puttype)=1)
  //currently we have 4MiB of kernel memory (1024 mapped pages starting at linear addr.: 2GiB)
  //if our kernel image becomes too large, the following command might fail
  page_usage_table_ = new puttype[number_of_pages_];

  //since we have gaps in the memory maps we can not give out everything
  //first mark everything as reserved, just to be sure
  debug(PM, "Ctor: Initializing page_usage_table_ with all pages reserved\n");
  for (i = 0; i [...]

  //now mark as free, everything that might be useable
  for (i = 0; i [...] 1024*256) //because max 1 gig of memory, see above
    {
      continue;
    }

    for (k = Max(start_page, lowest_unreserved_page_); k [...]

  // will probably be 1024
  debug(PM, "Ctor: lowest_unreserved_page_=%d\n", lowest_unreserved_page_);
  prenew_assert(lowest_unreserved_page_ >= 1024);
  prenew_assert(lowest_unreserved_page_ < number_of_pages_);
  debug(PM, "Ctor done\n");
}

uint32 PageManager::getTotalNumPages() const
{
  return number_of_pages_;
}

//used by loader.cpp ArchMemory.cpp
uint32 PageManager::getFreePhysicalPage(uint32 type)
{
  if (type == PAGE_FREE || type == PAGE_RESERVED) //what a stupid thing that would be to do
    return 0;

  if (lock_)
    lock_->acquire();

  //first 1024 pages are the 4MiB for Kernel Space
  for (uint32 p = lowest_unreserved_page_; p < number_of_pages_; ++p)
  {
    if (page_usage_table_[p] == PAGE_FREE)
    {
      page_usage_table_[p] = type;
      if (lock_)
        lock_->release();
      return p;
    }
  }
  if (lock_)
    lock_->release();
  arch_panic((uint8*) "PageManager: Sorry, no more Pages Free !!!");
  return 0;
}

void PageManager::freePage(uint32 page_number)
{
  if (lock_)
    lock_->acquire();
  if (page_number >= lowest_unreserved_page_ && page_number < number_of_pages_
      && page_usage_table_[page_number] != PAGE_RESERVED)
    page_usage_table_[page_number] = PAGE_FREE;
  if (lock_)
    lock_->release();
}

3.5 Sweb Boot Process 

3.5.1 Boot Sequence 

Sweb uses a patched version of Grub to load itself into memory (patched only for 

vesa-mode, textual mode works without patching). Afterwards the following steps are 

executed in sequential order: 

1. load gdt
2. load LINEAR_DATA_SEL into all segment registers
3. zero out kernel BSS (uninitialized data section)
4. set up stack (which is in the BSS section)
5. call parseMultibootHeader()
6. call initialiseBootTimePaging()
7. enable PSE (extended PageSize 4k and 4m mixed)
8. enable Paging (PG bit = 1)
9. Paging Mode:
10. reload GDT with new address
11. load LINEAR_DATA_SEL into all segment registers
12. call removeBootTimeIdentMapping()
13. reload CR3
14. call startup()
    14.1. init KernelMemoryManager
    14.2. init PageManager
    14.3. init Console
    14.4. init Scheduler
    14.5. init kprintf_no_sleep
    14.6. init KernelThreadInfo
    14.7. init Interrupts
    14.8. add Threads
    14.9. enable synchronisation mechanisms in KMM and PM
    14.10. enable Interrupts: from here on, tasks are executed pseudo-parallel.
    14.11. yield away

For details, refer to the files init_boottime_pagetables.cpp, boot.s and main.cpp.

Index  Base  Limit   P    DPL  S    Type  A    G    D    X    U
       [32]  [20]    [1]  [2]  [1]  [3]   [1]  [1]  [1]  [1]  [1]
0      0     0       0    00   0    000   0    0    0    0    0
1      0     0       0    00   0    000   0    0    0    0    0
2      0     FFFFFh  1    00   1    001   0    1    1    0    0
3      0     FFFFFh  1    00   1    101   0    1    1    0    0
4      0     FFFFFh  1    11   1    001   0    1    1    0    0
5      0     FFFFFh  1    11   1    101   0    1    1    0    0

Table 3.7: Sweb Global segment Descriptor Table

3.5.2 Sweb Global Descriptor 

Sweb’s boot time GDT can be found in boot.s and Table 3.7 with a per-descriptor 

elaboration in Table 3.8. 




Index  Base  Limit  Priv.Lvl.  Type                Symbol
2      0     4GiB   0          Data, read/write    LINEAR_DATA_SEL
3      0     4GiB   0          Code, execute/read  LINEAR_CODE_SEL
4      0     4GiB   3          Data, read/write    LINEAR_USR_DATA_SEL
5      0     4GiB   3          Code, execute/read  LINEAR_USR_CODE_SEL

Table 3.8: Sweb GDT, more readable
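The fields of Table 3.7 can be packed into the 8 byte descriptor format that the CPU expects. The sketch below assumes the standard IA-32 descriptor layout (Limit bits 0 to 15 in the lowest two bytes, Base split into a 24 bit and an 8 bit part, and the flag bits in bytes 5 and 6). The names SegmentFields and encodeDescriptor are hypothetical; Sweb defines its descriptors directly in boot.s.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type mirroring the columns of Table 3.7.
struct SegmentFields
{
  uint32_t base;
  uint32_t limit;  // 20 bits
  unsigned p, dpl, s, type, a, g, d, x, u;
};

// Packs the fields into one 64 bit descriptor (standard IA-32 layout assumed).
inline uint64_t encodeDescriptor(const SegmentFields& f)
{
  uint64_t desc = 0;
  desc |= static_cast<uint64_t>(f.limit & 0xFFFFu);                // Limit 15..0
  desc |= static_cast<uint64_t>(f.base & 0xFFFFFFu) << 16;         // Base 23..0
  desc |= static_cast<uint64_t>(f.a & 1u) << 40;                   // Accessed
  desc |= static_cast<uint64_t>(f.type & 7u) << 41;                // Type
  desc |= static_cast<uint64_t>(f.s & 1u) << 44;                   // S
  desc |= static_cast<uint64_t>(f.dpl & 3u) << 45;                 // DPL
  desc |= static_cast<uint64_t>(f.p & 1u) << 47;                   // Present
  desc |= static_cast<uint64_t>((f.limit >> 16) & 0xFu) << 48;     // Limit 19..16
  desc |= static_cast<uint64_t>(f.u & 1u) << 52;                   // user defined
  desc |= static_cast<uint64_t>(f.x & 1u) << 53;                   // X
  desc |= static_cast<uint64_t>(f.d & 1u) << 54;                   // D
  desc |= static_cast<uint64_t>(f.g & 1u) << 55;                   // Granularity
  desc |= static_cast<uint64_t>((f.base >> 24) & 0xFFu) << 56;     // Base 31..24
  return desc;
}
```

Under these assumptions, entry 3 of Table 3.7 (kernel code) encodes to 0x00CF9A000000FFFF and entry 4 (user data) to 0x00CFF2000000FFFF.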

3.5.3 Sweb Boot-Time Paging 

As long as paging is disabled during the boot-phase, we are accessing physical memory 

directly. This makes it easy to get information about the location of grub loaded 

modules or access memory used by hardware. The problem in this phase stems from 

the fact that all kernel pointers are compiled with a 2GiB offset, since the kernel is 

meant to be accessed at 2GiB in linear memory. Physically though, the kernel starts 

executing at 1MiB. 

In order to run the kernel at 2GiB, an initial Page-Directory is set up so that kernel code 

is accessible through both its physical location as well as above 2GiB. After the 

end of the linear kernel location, mappings to free physical pages are added, which 

we can use as kernel memory. With this mapping in place, paging is enabled and the 

first call to above 2GiB is made. 

Directly calling a kernel function before paging is enabled is actually impossible:
we would jump 2GiB away from our code. Therefore, during boot time, an offset
is subtracted from the function addresses before calling them. The call to
void initialiseBootTimePaging(), which lacks this measure, works because gcc optimises the
function call by inlining the code.
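The address arithmetic behind this can be sketched as follows. This is an illustration under the assumption that the boot-time offset is exactly 2GiB, i.e. that a kernel address linked at 2GiB+1MiB sits at physical 1MiB; the names kBootLinkOffset, linearToPhysicalBoot and pageDirectoryIndex are made up for this sketch.

```cpp
#include <cstdint>

// Assumption: kernel linear address = physical address + 2GiB during boot.
constexpr std::uint32_t kBootLinkOffset = 0x80000000u;

// Hypothetical helper: the address at which a kernel symbol can be reached
// while paging is still disabled.
constexpr std::uint32_t linearToPhysicalBoot(std::uint32_t linear_address)
{
  return linear_address - kBootLinkOffset;
}

// Index of the Page-Directory entry that covers a linear address
// (each entry covers 4MiB, so the index is the top 10 address bits).
constexpr std::uint32_t pageDirectoryIndex(std::uint32_t linear_address)
{
  return linear_address >> 22;
}

static_assert(linearToPhysicalBoot(0x80100000u) == 0x00100000u, "2GiB+1MiB -> 1MiB");
static_assert(pageDirectoryIndex(0x80100000u) == 512, "kernel mapping");
static_assert(pageDirectoryIndex(0x00100000u) == 0, "identity mapping");
```

Under these assumptions the boot-time Page-Directory has to make the same physical memory reachable through entry 0 (identity mapping) and entry 512 (mapping at 2GiB).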

3.5.4 Sweb Post-Boot Paging Setup 

After the boot phase has finished, the linear memory of Sweb is partitioned as shown 

in Table 3.9. Because of page protection, the kernel can directly access user space 

code while access to kernel code is denied to the user program even though kernel and 

user program share the same linear address space. More exactly, every user program’s 

Page-Directory also contains the mapping of the kernel pages. The identity mapping 

(with a 3GiB offset) beyond 3GiB is used to communicate with hardware like the frame 

buffer. 



3.6 Sweb Task Separation and Protection 

3.6.1 Segment Descriptor Privilege Level / Task Privilege Level 

As shown in Table 3.8, Sweb uses only privilege level 0 for kernel code and 3 for user
code. User threads running user code start with DS and CS referring to GDT entries 4
and 5 respectively. Refer to low level task switching and syscalls for details on when
privilege levels are changed.

3.6.2 User- / Supervisor- Page 

Paging offers an additional mechanism concerning code privileges in the form of the
user/supervisor access flag in the Page-Directory. Pages where the user_access flag is
not set may only be accessed by code with privilege level 0, i.e. the kernel. In Sweb the
user_access flag is set only on mappings below 2GiB.

3.6.3 LoadPageOnDemand 

Instead of loading a new process completely into memory on process launch, only the
pages needed can be loaded one at a time. In Sweb, on start of a new user process, all
virtual pages are non-present. The first time each user page is accessed, a page fault
happens and the Page-Fault-Handler is called. On deciding that the page is in the user
thread’s linear address space below 2GiB, the user process’s instance of class Loader
is called. Loader then checks the ELF-Header and loads the code and data from the
executable that match the missing page’s linear address range.
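The address checks involved can be sketched as follows. This is not the Sweb Loader; the names PageRange, isUserAddress, pageContaining and overlap are hypothetical and only illustrate how a faulting address is turned into the page range that has to be filled from an ELF segment.

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::uint32_t kPageSize     = 4096;
constexpr std::uint32_t kUserSpaceEnd = 0x80000000u;  // 2GiB

// Half-open address range [begin, end).
struct PageRange
{
  std::uint32_t begin;
  std::uint32_t end;
};

// True if the faulting address belongs to user space (below 2GiB).
inline bool isUserAddress(std::uint32_t address)
{
  return address < kUserSpaceEnd;
}

// The 4KiB page that contains the faulting address.
inline PageRange pageContaining(std::uint32_t address)
{
  const std::uint32_t begin = address & ~(kPageSize - 1);
  return PageRange{begin, begin + kPageSize};
}

// Part of an ELF segment [segment_begin, segment_end) that falls into the
// page; the result is empty if begin >= end.
inline PageRange overlap(PageRange page, std::uint32_t segment_begin,
                         std::uint32_t segment_end)
{
  return PageRange{std::max(page.begin, segment_begin),
                   std::min(page.end, segment_end)};
}
```

For a fault at 0x08048123 and a segment spanning 0x08048100 to 0x0804A000, only the bytes from 0x08048100 up to 0x08049000 would have to be copied into the new page.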

It is further possible to remember whether a page has never changed since it was loaded
on demand. Then, instead of swapping it out, it could simply be freed because it
can be reloaded from the executable. This approach however introduces some issues in
conjunction with shared pages which are detailed in section 3.7.6.

3.6.4 UserThread Termination 

On termination of a user process, ArchMemory::freePageDirectory(..) is called. The 

function recursively frees all mapped pages below 2GiB. 

3.7 Extending Sweb 

3.7.1 Using 4MiB Pages 

In various operating systems, pages of 4MiB size are used to map kernel code because 

the IA32 maintains different TLBs for 4KiB and 4MiB pages. The advantage would 

be that the kernel has its own page cache that would remain unchanged on a task 

switch. In the beginning Sweb used one 4MiB kernel page. Later, it was more important 

to prevent and catch errors by marking parts (



ArchMemory functions always take page numbers which depend on the given page 

size as arguments, i.e. the 2nd 4MiB Page and the 2048th 4KiB Page would start at 

the same linear address. In order to map 4MiB pages in Sweb during runtime, the 

PageManager needs to be modified. Since the PageManager manages 4MiB pages as 

1024 consecutive 4KiB pages, it needs to be able to find 1024 consecutive free pages 

starting from a page number that is a multiple of 1024. When a free 4MiB page has 

been found, ArchMemory::map( , , 

, , PAGE_SIZE*1024); can be used to create 

a mapping. Unmapping is already supported. 
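The search for a free 4MiB page described above can be sketched like this. The PageManager's real bookkeeping is not shown in this chapter, so a std::vector<bool> stands in for it, and the function name findFree4MiBPage is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Returns the first 4KiB page number that is a multiple of 1024 and starts a
// run of 1024 free pages, or -1 if no such run exists.
// used[i] is true if 4KiB page i is in use (stand-in for the PageManager).
inline long findFree4MiBPage(const std::vector<bool>& used)
{
  const std::size_t kPagesPer4MiB = 1024;
  for (std::size_t start = 0; start + kPagesPer4MiB <= used.size();
       start += kPagesPer4MiB)
  {
    bool all_free = true;
    for (std::size_t i = start; i < start + kPagesPer4MiB; ++i)
    {
      if (used[i])
      {
        all_free = false;
        break;
      }
    }
    if (all_free)
      return static_cast<long>(start);
  }
  return -1;
}
```

Because the outer loop only visits multiples of 1024, the alignment requirement is met by construction; dividing the result by 1024 gives the 4MiB page number.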

3.7.2 Additional pages for the kernel 

Currently kernel memory is limited to 4MiB. Even though dynamically mapping pages 

beyond the 4MiB area of the kernel is possible, it is easier to make available more 

memory at boot time by changing the file init_boottime_pagetables.cpp. To do this, 

Page-Directory-Entry 513 (from 2GiB+4MiB to 2GiB+8MiB) needs to be present and 

point to a matching Page-Table which can then map up to 4MiB more memory. Since 

PageManager detects kernel pages automatically at initialisation time, all that remains 

is to modify the pointer end_address in main.cpp. 

Note however that 4MiB are often enough. The Kernel code itself uses about 

500KiB. PageManager usually needs less than 10KiB with 32MiB RAM and KMM uses 

about 4KiB per allocation. Additionally FiFos and RingBuffers use about 50KiB buffer 

space. The conclusion is that running out of kernel memory in Sweb is more likely 

the result of a bug than of a real shortage. 

3.7.3 sharing Memory between linear address spaces 

From what the reader of this document now knows about Paging, it is clear that 

sharing memory between different linear address spaces (and their programs) means 

sharing pages. Sweb does this already: kernel pages are shared between all 

user processes because their mapping exists in all page directories. So all that needs 

to be done is to map one linear memory page of different user processes to the same 

physical page. Finally, the global_page flag of the mapping needs to be set, in order to 

distinguish it from non-shared mappings and instruct the CPU to keep the mapping 

cached on a task switch. In order for the CPU to honour the global flag, the PGE flag 

in CR4 needs to be set. (see table 3.4) Furthermore mutual exclusion between user 

programs that share pages must not be forgotten. 

Care must be taken when a user process with shared pages terminates. The
PageManager must be changed so that pages are marked free only if all owning processes
have freed them.
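One way to track owners is a counter per physical page. The class below is a sketch of that idea only; SharedPageCounter and its methods are hypothetical and not part of the PageManager.

```cpp
#include <cstdint>
#include <map>

// Counts how many address spaces map a given physical page number.
class SharedPageCounter
{
public:
  // Called whenever a process maps the physical page.
  void addOwner(std::uint32_t physical_page)
  {
    ++owners_[physical_page];
  }

  // Called when a process unmaps the page or terminates. Returns true only
  // when the last owner is gone, i.e. when the page may be marked free.
  bool removeOwner(std::uint32_t physical_page)
  {
    auto it = owners_.find(physical_page);
    if (it == owners_.end())
      return false;  // page was not tracked
    if (--it->second > 0)
      return false;  // other owners remain
    owners_.erase(it);
    return true;
  }

private:
  std::map<std::uint32_t, unsigned> owners_;
};
```

With two owners of page 7, the first removeOwner(7) returns false and the second returns true.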

3.7.4 Cloning a Task’s linear address space 

Syscalls like clone() or fork() require the creation of a new user process whose linear
address space is identical to that of an existing user process. Unless we elegantly use
CopyOnWrite, this means manually copying every page to a new physical memory location
and creating a new Page-Directory that maps to the new locations. This could be
done from the old user thread’s linear memory context by referencing the new physical
locations through the identity mapping above 3GiB.



Figure 3.4: Page-Fault-Handling with possible CopyOnWrite 

3.7.5 CopyOnWrite 

A more efficient way would be not to copy all pages but to copy, on demand, only pages 

that get modified. Such a mechanism is called CopyOnWrite. We might implement 

CopyOnWrite in Sweb by first marking all user pages as read-only. When creating a 

new process, instead of copying the kernel page directory, we copy the one from the 

original user process. This way, we can save memory space as well as the time it takes 

to create a new user process. 

Because both user process pages are read-only, writing on them will cause a page 

fault. Flag WP in CR0 should be set, so this happens even on a write performed by 

kernel code. To copy pages on demand the page fault handler needs to be modified to 

recognise a shared page and copy that page to a new location where it can be written 

to. An example of how the page fault handler might work can be found in Figure 3.4. 

Care is obviously necessary in administering shared physical pages, i.e. a shared 

page can only be freed if all user processes using it have terminated. A possible 

solution might be to manage a reference counter for each physical page, however in 

combination with virtual memory this is not enough. 
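The copy step on a write fault can be modelled in user-space C++ as follows. This is a simulation of the idea, not kernel code: Frame, Mapping and handleWriteFault are hypothetical names, and std::shared_ptr merely plays the role of the per-page reference counter.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

using Frame = std::vector<std::uint8_t>;  // stand-in for one physical page

// One virtual page of one process.
struct Mapping
{
  std::shared_ptr<Frame> frame;  // use_count() models the reference counter
  bool writable;
};

// Sketch of what a page fault handler does on a write to a read-only page.
inline void handleWriteFault(Mapping& mapping)
{
  if (mapping.frame.use_count() > 1)
  {
    // The page is still shared: give the writer a private copy.
    mapping.frame = std::make_shared<Frame>(*mapping.frame);
  }
  // Either freshly copied or no longer shared: writing is allowed now.
  mapping.writable = true;
}
```

After a clone, parent and child share one frame; the first of them to write gets a copy, while the other later finds itself the sole owner and only needs its mapping made writable.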




3.7.6 Virtual Memory 

Virtual Memory Basics 

Physical Memory is expensive and often the memory available is not enough. Conveniently 

not all programs or even all parts of a program are needed in physical memory 

at the same time. Virtual Memory is the concept of swapping out some less used data 

and code from physical memory to a less expensive media with more capacity, like a 

hard disk device. In essence a part of the hard disk becomes a virtual extension of 

the physical memory; this part is sometimes referred to as swap or page file, space 

or partition. Paging already partitions the memory into equally sized pages which are 

easy to track and swap around. Figure 3.5 shows an example mapping using virtual 

memory on a hard disk as opposed to Figure 3.2. 

To implement Virtual Memory, the operating system must: 

• know which pages in the swap space are used, 

• keep track of where in physical or swap memory the data on a virtual page really 

is, 

• catch attempts to access virtual pages that are not in physical memory and 

• know which pages are not currently needed in physical memory and which processes 

they belong to. 

Sweb Virtual Memory 

Sweb uses the class PageManager to manage which pages of physical memory are 

in use or free. A class SwapManager might accomplish the same for a static swap 

space on the hard disk. The Page-Directory defines where a virtual page is in physical 

memory, something similar is needed for swap memory. Under certain conditions, one 

Page-Directory might hold both physical and swap page numbers, but a way to distinguish 

unloaded pages from swapped out pages is necessary. For example, unloaded 

pages might not only be non-present but have their page number set to a certain value. 

Knowing if a page is present in physical memory ties in with catching access attempts. 

If a virtual page is accessed and that page-entry’s present bit is not set, the 

CPU will cause a page fault. An extended Page-Fault-Handler (see Figure 3.6) might 

check if a missing page can be found in the swap space before trying to load it from 

the executable or deciding that an error has happened. The missing page can then be 

loaded into physical memory and after updating the Page-Table-Entry, the program 

can continue. 
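One possible encoding of the three page states in a 32-bit entry is sketched below. It assumes that the page number field occupies bits 31..12, that bit 0 is the present bit, and that the otherwise arbitrary value 0xFFFFF is reserved to mark unloaded pages; none of these names exist in Sweb.

```cpp
#include <cstdint>

enum class PageState { Present, SwappedOut, Unloaded };

constexpr std::uint32_t kPresentBit         = 0x1;
constexpr std::uint32_t kUnloadedPageNumber = 0xFFFFF;  // assumed reserved marker

// Page number field (bits 31..12): a physical page number if present,
// otherwise a swap page number or the unloaded marker.
constexpr std::uint32_t pageNumberOf(std::uint32_t entry)
{
  return entry >> 12;
}

constexpr PageState classify(std::uint32_t entry)
{
  return (entry & kPresentBit) ? PageState::Present
       : (pageNumberOf(entry) == kUnloadedPageNumber) ? PageState::Unloaded
       : PageState::SwappedOut;
}

// Non-present entry that remembers where the page lives in swap space.
constexpr std::uint32_t makeSwappedEntry(std::uint32_t swap_page)
{
  return swap_page << 12;
}

// Non-present entry for a page that still has to come from the executable.
constexpr std::uint32_t makeUnloadedEntry()
{
  return kUnloadedPageNumber << 12;
}
```

An extended Page-Fault-Handler could then switch on classify(entry) to decide between loading from swap space and loading from the executable.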

Figure 3.5: Example mapping using virtual memory.

Figure 3.6: Sweb Page-Fault-Handling and possible Virtual Memory Extension

Page Replacement Algorithm

When physical memory runs short, at least one page needs to be swapped out before
another can be loaded. Deciding which page to evict is the function of the Page Replacement
Algorithm and a matter of much consideration and research. For an introduction to and
explanation of several Page Replacement Algorithms see Chapters 4.4 and 4.5 in [17].

Key to most Page Replacement Algorithms are the .accessed and .dirty flags of
Page-Table-Entries (see Table 3.3.3). These flags are set by the CPU when a read or write
operation is performed. More precisely, on executing a read or write operation on a linear
address, the CPU must translate that address into a physical address by consulting
the Page-Directory and Page-Tables, and in the process it sets the appropriate flags.
By polling and clearing these flags, the operating system can know which pages are
frequently used or have been changed. Frequently used pages are bad candidates for
eviction from physical memory since they will probably be needed sooner than others.
Non-dirty pages, i.e. pages that have not been modified, might be reloaded from the
executable, or a copy might still exist in the swap space from an earlier page replacement
operation. Such pages can be freed immediately, saving the time of writing to a swap file.
Note that Page Replacement Algorithms always work on the whole of physical memory and
not only on the active Page-Directory.
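As an illustration of how the accessed flag can drive eviction, here is a sketch of the well-known second-chance (clock) algorithm. It is one possible choice and not something Sweb implements; FrameInfo and chooseVictim are hypothetical names, and the flags are assumed to have been copied from the Page-Table-Entries.

```cpp
#include <cstddef>
#include <vector>

// Per physical frame: copies of the flags polled from the page tables.
struct FrameInfo
{
  bool accessed;
  bool dirty;
};

// Second-chance selection: sweep the frames in a circle, clear the accessed
// flag of every frame that has it set, and pick the first frame found
// without it. `frames` must not be empty; `hand` is the sweep position and
// is kept between calls.
inline std::size_t chooseVictim(std::vector<FrameInfo>& frames, std::size_t& hand)
{
  for (;;)
  {
    const std::size_t current = hand;
    hand = (hand + 1) % frames.size();
    if (!frames[current].accessed)
      return current;
    frames[current].accessed = false;  // used recently: give it a second chance
  }
}
```

The loop ends after at most one full sweep plus one step, because every visited frame has its accessed flag cleared. The dirty flag of the chosen frame then tells whether the page has to be written to swap space before the frame is reused.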

Inverted Page Table 

When a virtual page is evicted from physical memory, all Page-Directory and Page-Table
entries pointing to that physical page need their present flag to be unset. The problem
of finding the virtual page numbers mapping to that physical page can be optimised
with an Inverted Page Table. The Inverted Page Table is a relation between a physical
page number and the virtual pages mapping to it. For each physical page number, it
needs to store at least:

• the virtual page number(s) mapping to it and 

• the Page-Directory or user program, the virtual page belongs to. 

To simplify matters, the data structure can assume that only one page per virtual address
space maps to one physical page. The table can also be used to hold information
about dirty and accessed bits, but care must be taken to keep this information up to
date. When sharing pages, the Inverted Page Table must be able to remember several
user programs, each with its virtual page number, per physical page. For further details
see Chapter 4.3.4 of [17].
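A minimal data structure along these lines could look as follows. It is a sketch with hypothetical names (VirtualOwner, InvertedPageTable) that uses an ordered map for brevity; a kernel implementation would more likely use a fixed array indexed by physical page number.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// One virtual page that maps to a physical page.
struct VirtualOwner
{
  std::uint32_t page_directory_id;  // identifies the owning address space
  std::uint32_t virtual_page;       // virtual page number in that space
};

class InvertedPageTable
{
public:
  // Records that (page directory, virtual page) now maps to physical_page.
  void addMapping(std::uint32_t physical_page, VirtualOwner owner)
  {
    table_[physical_page].push_back(owner);
  }

  // Returns every mapping whose present flag must be cleared when
  // physical_page is evicted, and forgets them.
  std::vector<VirtualOwner> evict(std::uint32_t physical_page)
  {
    std::vector<VirtualOwner> owners;
    auto it = table_.find(physical_page);
    if (it != table_.end())
    {
      owners = it->second;
      table_.erase(it);
    }
    return owners;
  }

private:
  std::map<std::uint32_t, std::vector<VirtualOwner>> table_;
};
```

Storing a vector of owners per physical page is what allows shared pages to be handled; without sharing, a single VirtualOwner per entry would suffice.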

Sharing Pages In Virtual Memory 

When implementing CopyOnWrite in a virtual memory environment, what applied to 

physical memory now applies to swap memory as well. With a good page replacement 

algorithm, a shared page will be accessed more often and therefore be less likely a 

candidate for eviction. Still, when a shared page is swapped out, not one but several 

Page-Directories need to be updated. Furthermore, since it proved very efficient to 

share a page in physical memory, there is no reason to abolish the concept in swap 

memory. An Inverted Swap Page Table is needed to hold information about each swap 

page and its owners. Otherwise, the operating system could not know which Page- 

Directories to update on swap-in. 

Sharing Unloaded Pages 

Handling CopyOnWrite and LoadPageOnDemand smartly with virtual memory is more 

complex. With on demand loading of pages from an executable, a virtual page can 

be in one of three places: Physical Memory, Swap Memory or the Executable. The 

question is if, additionally to sharing pages in physical and swap memory, unloaded 

pages can be shared as well. The following design issue arises: how to remember 

which processes have been cloned and still need an unloaded page? A table of pages 




that can be loaded from an executable in their initial state is necessary. Not unlike an 

Inverted Page Table, this table might hold the user processes that share an unloaded 

page and whose Page-Directories need to be updated. Update of this information 

would be necessary, every time a process is cloned and an executable’s page stops or 

starts being shared. 

If it is possible to know that a shared page has never been changed in physical 

memory and such a page is evicted from physical memory, swapping might be omitted. 

Instead the virtual page can be shared in the executable again. 

Of course, the issue can be avoided by simply not sharing unloaded pages. 




Chapter 4 

Kernelspace Memory 

Management 

4.1 General 

Figure 4.1: linear address space 

The kernel memory manager is initialized right after the virtual memory management
(VMM). Once the VMM initialization has finished, some linear address space starting at
2 GB is mapped for the kernel. At the front of this memory area lies the SWEB kernel
image. The kernel memory manager (KMM) is set in place behind the kernel image during
initialization. The kernel-space memory management works in the linear address space
provided by the VMM.


The KMM only works with linear addresses. This is why the implementation of the 

KMM is independent from the hardware architecture SWEB is running on. 

The header and source files containing the KMM and affiliated classes can be found 

in “common/source/mm/” and “common/include/mm”. 

4.2 Memory Organisation 

Figure 4.2: memory segments 

The memory managed by the KMM is handled as a doubly linked list of memory
segments, i.e. the memory segments are the nodes of this list.
To implement this type of memory management we designed an object called
“MallocSegment” which works as the header of a memory segment.
This header object holds a pointer to the next element of the list, one to the previous
element and some state information about the memory segment.
The “MallocSegment” object also provides some elementary set and get functions
which are used by the “KernelMemoryManager” for memory segment manipulation.



4.3 Important Classes 

4.3.1 KernelMemoryManager 

KernelMemoryManager provides public methods for allocation, reallocation and freeing
of memory. These three methods use some elementary memory segment
manipulating methods.

There are elementary segment manipulating methods provided for: 

• finding a free segment 

• freeing a segment 

• merging an unused segment with the following unused segment 

• shrinking a segment to a smaller size (and adding a new segment in the resulting free space) 

Interface 

Method: static uint32 createMemoryManager (pointer start_address, pointer end_address);

Description: Creates an instance of KernelMemoryManager. This method calls the
constructor of KernelMemoryManager using placement new.
start_address is the first free memory address.
end_address is the last address where memory is accessible (i.e. the end of the mapped
address space).

Method: pointer allocateMemory (size_t requested_size);

Description: allocateMemory searches the MallocSegment list for a free segment with
size >= requested_size and returns a pointer to the allocated memory, or 0 if not enough
memory is available. This method is used by new and kmalloc.
requested_size is the number of bytes to allocate.

Method: bool freeMemory (pointer virtual_address);

Description: freeMemory checks if the given address is pointing to an actual memory
segment and marks it as unused. If possible, freeMemory merges the segment with the
free segments around it.
freeMemory returns true if the segment was freed or false if the address was wrong.
virtual_address is the memory address which was originally returned by allocateMemory.

Method: pointer reallocateMemory (pointer virtual_address, size_t new_size);

Description: reallocateMemory resizes an already allocated memory segment or moves
the entire segment somewhere else.
reallocateMemory returns the (possibly altered) pointer to the resized memory segment
or 0 if an error occurs.
WARNING: therefore, code must assume that the memory address will change after
using reallocateMemory.
virtual_address is the address of the segment to resize. new_size is the new size of the
segment.
Note: The function acts like freeMemory if new_size is set to 0.

Method: void startUsingSyncMechanism ();

Description: startUsingSyncMechanism is called from startup() after the scheduler has
been created and just before the interrupts are turned on.

Table 4.1: KernelMemoryManager Interface 

4.3.2 MallocSegment 

Figure 4.3: MallocSegment 

The objects of the class MallocSegment are used as list nodes in the doubly linked
list in which the memory segments are stored.
MallocSegment holds not only the previous- and next-pointers but also a
marker and a size_flag_ variable.
The size_flag_ variable contains the size of the memory segment and a used flag which
marks the segment as used.

Every time the KMM searches the MallocSegment list or any operation is performed on
a memory segment, the KMM checks if the marker is where expected. So the integrity
of the memory segments is checked continuously, which is also a simple form of access
violation detection.
The size_flag_ variable is a private variable which is accessed by get- and set-methods.
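How a size and a used flag can share one variable is sketched below. The exact bit layout used by MallocSegment is not given in this chapter; this sketch assumes, purely for illustration, that the most significant bit holds the used flag and the remaining 31 bits hold the size. SizeFlag is a hypothetical class name.

```cpp
#include <cstdint>

// Assumed layout: bit 31 = used flag, bits 30..0 = size in bytes.
constexpr std::uint32_t kUsedBit  = 0x80000000u;
constexpr std::uint32_t kSizeMask = 0x7FFFFFFFu;

class SizeFlag
{
public:
  SizeFlag(std::uint32_t size, bool used)
      : size_flag_((size & kSizeMask) | (used ? kUsedBit : 0u))
  {
  }

  std::uint32_t getSize() const { return size_flag_ & kSizeMask; }
  bool getUsed() const { return (size_flag_ & kUsedBit) != 0; }

  // Changes the size but keeps the used flag.
  void setSize(std::uint32_t size)
  {
    size_flag_ = (size_flag_ & kUsedBit) | (size & kSizeMask);
  }

  // Changes the used flag but keeps the size.
  void setUsed(bool used)
  {
    size_flag_ = (size_flag_ & kSizeMask) | (used ? kUsedBit : 0u);
  }

private:
  std::uint32_t size_flag_;
};
```

Keeping the variable private and going through the accessors ensures that changing one of the two values never clobbers the other.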

Interface 

Method: MallocSegment (MallocSegment *prev, MallocSegment *next, size_t size, bool used);

Description: The constructor.
*prev is the pointer to the previous MallocSegment in the list.
*next is the pointer to the next MallocSegment in the list.
size is the size (in bytes) of memory the described memory segment provides. Note:
“this + sizeof(MallocSegment) + size” is usually the start of the next segment.
used describes if the segment is allocated or free.

Method: size_t getSize ();

Description: Returns the size of the segment in bytes.

Method: void setSize (size_t size);

Description: Sets the size of the segment in bytes.
size is the size to which the segment should be set.

Method: bool getUsed ();

Description: Returns true if the segment is allocated, false if it is unused.

Method: void setUsed (bool used);

Description: Sets the state of the segment as used (used=true) or free (used=false).

Table 4.2: MallocSegment Interface 

4.4 Used Algorithms 

4.4.1 allocating memory 

Allocating memory is done by the allocateMemory method in KernelMemoryManager.
The memory manager searches the memory segment list with a first fit algorithm
and shrinks the found segment to the requested size. In the memory area after the
shrunk segment a new segment is inserted into the list. The segment is not
shrunk if the space remaining behind it would be too small for a new segment. In
case the segment is shrunk (and a new segment is inserted behind it) the KMM tries to
merge the new segment with the following segment if possible.
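The first fit search with splitting can be modelled like this. The sketch works on a std::list instead of headers placed in raw memory, assumes a header size of 16 bytes, and leaves out the final merge step; Segment, kHeaderSize and allocateFirstFit are hypothetical names.

```cpp
#include <cstddef>
#include <iterator>
#include <list>

struct Segment
{
  std::size_t size;  // usable bytes, header not included
  bool used;
};

constexpr std::size_t kHeaderSize = 16;  // assumed size of a segment header

// Marks the first free segment that is large enough as used and returns it,
// or returns segments.end() if nothing fits. The segment is split only if the
// remainder can hold a header plus at least one byte.
inline std::list<Segment>::iterator allocateFirstFit(std::list<Segment>& segments,
                                                     std::size_t requested)
{
  for (auto it = segments.begin(); it != segments.end(); ++it)
  {
    if (it->used || it->size < requested)
      continue;
    const std::size_t rest = it->size - requested;
    if (rest > kHeaderSize)
    {
      it->size = requested;
      segments.insert(std::next(it), Segment{rest - kHeaderSize, false});
    }
    it->used = true;
    return it;
  }
  return segments.end();
}
```

Allocating 100 bytes from a single free segment of 1000 bytes leaves a used segment of 100 bytes followed by a free one of 884 bytes, since 16 bytes go to the new header.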

4.4.2 freeing memory 

Freeing memory is done by the freeMemory method in KernelMemoryManager. If the
previous and/or next segment is marked as unused, the unused segments are
merged into one larger segment which is marked as unused.
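The merge step can be sketched on the same kind of list model. As before this is an illustration with hypothetical names and an assumed header size of 16 bytes, not the KernelMemoryManager code.

```cpp
#include <cstddef>
#include <iterator>
#include <list>

struct Segment
{
  std::size_t size;  // usable bytes, header not included
  bool used;
};

constexpr std::size_t kHeaderSize = 16;  // assumed size of a segment header

// Marks the segment as unused and merges it with free neighbours. The header
// of every segment that disappears becomes usable space of the merged one.
inline void freeAndMerge(std::list<Segment>& segments,
                         std::list<Segment>::iterator it)
{
  it->used = false;

  auto next = std::next(it);
  if (next != segments.end() && !next->used)
  {
    it->size += kHeaderSize + next->size;
    segments.erase(next);
  }

  if (it != segments.begin())
  {
    auto prev = std::prev(it);
    if (!prev->used)
    {
      prev->size += kHeaderSize + it->size;
      segments.erase(it);
    }
  }
}
```

Freeing the middle segment of the sequence free 100, used 50, free 200 yields one free segment of 100 + 16 + 50 + 16 + 200 = 382 bytes.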



4.4.3 reallocating memory 

Reallocation of memory is also done in KernelMemoryManager.
The method is named reallocateMemory and takes an address and the requested size
as parameters. The kernel memory manager finds the MallocSegment which contains
the virtual address and checks the space after this segment. If there is enough free
space behind it, the kernel memory manager merges the segments into a bigger one and
returns the given pointer, which now points to the bigger segment. If there is not enough
space, the kernel memory manager searches for a segment which is large enough, allocates
the requested size of memory, moves the data from the old memory location to
the new one, frees the old segment and returns a new pointer.

4.5 Interface 

4.5.1 Functions 

Three functions for allocating and freeing memory are provided in the files “kmalloc.cpp”
and “kmalloc.h”. These functions call the memory management methods
in the kernel memory manager. They can be used like the well known functions
malloc, free and realloc.

Function: void* kmalloc (size_t size);

Description: allocates size bytes and returns a pointer to the allocated memory.

Function: void kfree(void * address);

Description: frees the memory space pointed to by address, which must have been
returned by a previous call to kmalloc() or krealloc().

Function: void* krealloc(void * address, size_t size);

Description: changes the size of the memory block pointed to by address to size bytes.

Table 4.3: provided Functions 

Provided operators are: 

• new 

• delete 

• new[] 

• delete[] 

These operators also call the memory management methods in the kernel memory
manager and can be used as usual. To use them include “new.h”.


Chapter 5 

Threads and Processes 

5.1 Introduction 

The kernel of sweb is multithreaded, so we’ll now discuss what is necessary for 

thread switching. In order to switch between two threads we first have to interrupt 

the currently running thread. Typically this will happen by either a timer interrupt 

or another interrupt (e.g. caused by yield()). Inside this interrupt handler we will 

have to save the current state of the CPU, which is called the thread context. Then we call 

Scheduler::schedule() where a runnable thread is selected. Before leaving the interrupt 

handler the new thread context has to be loaded. 

5.2 Details for x86 

In this section we will have a look at how the low level aspects of multithreading work 

on different platforms and different operating systems. We will also explain how SWEB 

implements threads and processes on the x86 platform. 

A word of warning though, all of the details described here are valid for 32 bit protected 

mode only. 

A typical x86 cpu only has a very limited amount of hardware registers: 

• The four general purpose registers EAX, EBX, ECX and EDX. 

• 2 stack registers called ESP and EBP. 

• 2 index registers called EDI and ESI. 

• 6 segment registers called CS, DS, ES, FS, GS and SS. 

• The instruction pointer register called EIP. 

• The flags register called EFLAGS. 

Besides these basic registers there are a number of special registers which control 

things like paging and interrupt handling. Additions and enhancements to the x86 

specifications (like mmx, sse or 3dnow) added additional cpu registers, mostly used 

by the fpu. 


But a simple thread which does not deal with special things will only use the registers 

listed above. Some threads (userspace processes) are not even allowed to use 

registers besides the ones listed above. 

5.2.1 Example for saving the CPU registers 

The assembly listing in figure 5.1 is an example for how to save the CPU registers 

in an interrupt handler. The function will first determine if the interrupt happened 

during usermode or kernel mode and then write the cpu registers into a special data 

structure. ”currentThreadInfo” is a pointer to this data structure. 

Right before calling this function all cpu registers have been pushed onto the stack 

by using a ”push all” cpu instruction. 

5.2.2 Example for reading back the CPU Registers 

The assembly listing in figure 5.2 is an example for how to read back the CPU registers. 

In this case we assume that we only want to switch to a different thread in kernel mode. 

Switching to a userspace thread / process is almost the same. 

Loading the CPU registers is mostly straightforward, however there is one special
thing to take care of. Simply reading each saved value and writing it into its register
will not work for some special registers because we could potentially run into race
conditions. On x86 the eflags register contains, among other things, the interrupt
enable flag. Since we do not want to accidentally enable interrupts while writing
back the CPU registers, we have to make sure that the last register we write
is the eflags register. This however is not possible with a simple approach, since
as soon as we write into the eip register the next instruction the CPU executes
depends on the new value of the eip register.

Luckily x86 provides a special instruction that takes care of this problem: the
“iret” instruction [15, Intel Manual]. This instruction reads back the eip, cs and
eflags registers from the stack and, when returning to a less privileged level, the esp
and ss registers as well. Since this is only one CPU instruction we have no problems.

5.3 The FPU and other problems 

The FPU registers are a special case on x86. The amount of data that has to be copied
on every thread switch to update the FPU state is huge compared to the general
purpose registers. Luckily the number of programs that actually use the FPU is small,
so almost all modern operating systems use a special trick to handle the FPU
registers. Instead of copying them on every thread switch, the system remembers which
thread last used the FPU, and upon a thread switch the FPU is switched to a special mode.
In this mode the FPU triggers a hardware exception on the next attempt to use it.
If such an exception is raised the operating system has to copy the data, but if only
one thread uses the FPU such an exception never needs to be raised and thus the
FPU registers never have to be copied. One restriction however arises from
this: kernel mode threads must not access the FPU. This is no problem since on all
sane operating systems this is not allowed anyway.
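The bookkeeping behind this trick can be simulated in plain C++. The model below is a sketch with hypothetical names (FpuState, Thread, LazyFpu); on real x86 hardware the "special mode" is the TS flag in CR0 and the exception is the device-not-available exception, neither of which is touched here.

```cpp
#include <array>
#include <cstdint>

// 27 dwords = 108 bytes, the size of the x87 state image in 32 bit mode.
struct FpuState
{
  std::array<std::uint32_t, 27> words{};
};

struct Thread
{
  FpuState saved_fpu;
};

class LazyFpu
{
public:
  // Called on every thread switch: nothing is copied, the trap is merely
  // armed unless the incoming thread already owns the FPU contents.
  void onThreadSwitch(Thread* next)
  {
    current_ = next;
    trap_armed_ = (next != owner_);
  }

  // Models the current thread executing an FPU instruction. If the trap is
  // armed, the exception handler swaps the FPU state. Requires that
  // onThreadSwitch() has been called before.
  void useFpu()
  {
    if (!trap_armed_)
      return;
    if (owner_ != nullptr)
      owner_->saved_fpu = hardware_;    // save state of the previous owner
    hardware_ = current_->saved_fpu;    // load state of the current thread
    owner_ = current_;
    trap_armed_ = false;
    ++copies_;
  }

  unsigned copies() const { return copies_; }

private:
  FpuState hardware_;
  Thread* owner_ = nullptr;
  Thread* current_ = nullptr;
  bool trap_armed_ = true;
  unsigned copies_ = 0;
};
```

If thread a uses the FPU, thread b runs without touching it and a runs again, the state is swapped only once; a second swap happens only when b actually executes an FPU instruction.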



global arch_saveThreadRegisters
arch_saveThreadRegisters:
        mov ebx, dword[currentThreadInfo]
        mov eax, dword[esp + 48]    ; get cs
        and eax, 0x03               ; check cpl is 3
        cmp eax, 0x03
        je from_user
from_kernel:
        mov eax, dword[esp + 44]    ; save eip
        mov dword[ebx], eax
        mov eax, dword[esp + 48]    ; save cs
        mov dword[ebx + 4], eax
        mov eax, dword[esp + 52]    ; save eflags
        mov dword[ebx + 8], eax
        mov eax, dword[esp + 40]    ; save eax
        mov dword[ebx + 12], eax
        mov eax, dword[esp + 36]    ; save ecx
        mov dword[ebx + 16], eax
        mov eax, dword[esp + 32]    ; save edx
        mov dword[ebx + 20], eax
        mov eax, dword[esp + 28]    ; save ebx
        mov dword[ebx + 24], eax
        mov eax, dword[esp + 24]    ; save esp
        add eax, 0xc
        mov dword[ebx + 28], eax
        mov eax, dword[esp + 20]    ; save ebp
        mov dword[ebx + 32], eax
        mov eax, dword[esp + 16]    ; save esi
        mov dword[ebx + 36], eax
        mov eax, dword[esp + 12]    ; save edi
        mov dword[ebx + 40], eax
        mov eax, [esp + 8]          ; save ds
        mov dword[ebx + 44], eax
        mov eax, [esp + 4]          ; save es
        mov dword[ebx + 48], eax
        ret
from_user:
        mov eax, dword[esp + 60]    ; save ss3

        mov eax, dword[esp + 56]    ; save esp3

        mov eax, dword[esp + 44]    ; save eip
        mov dword[ebx], eax
        mov eax, dword[esp + 48]    ; save cs
        mov dword[ebx + 4], eax
        mov eax, dword[esp + 52]    ; save eflags
        mov dword[ebx + 8], eax
        mov eax, dword[esp + 40]    ; save eax
        mov dword[ebx + 12], eax
        mov eax, dword[esp + 36]    ; save ecx
        mov dword[ebx + 16], eax
        mov eax, dword[esp + 32]    ; save edx
        mov dword[ebx + 20], eax
        mov eax, dword[esp + 28]    ; save ebx
        mov dword[ebx + 24], eax
        mov eax, dword[esp + 20]    ; save ebp
        mov dword[ebx + 32], eax
        mov eax, dword[esp + 16]    ; save esi
        mov dword[ebx + 36], eax
        mov eax, dword[esp + 12]    ; save edi
        mov dword[ebx + 40], eax
        mov eax, [esp + 8]          ; save ds
        mov dword[ebx + 44], eax
        mov eax, [esp + 4]          ; save es
        mov dword[ebx + 48], eax
        ret

Figure 5.1: Saving the CPU registers in an interrupt handler


arch_switchThreadKernelToKernel:

        mov ecx, dword[g_tss]       ; tss
        mov eax, dword[ebx + 68]    ; get esp0
        mov dword[ecx + 4], eax     ; restore esp0
        mov eax, dword[ebx + 12]    ; restore eax
        mov ecx, dword[ebx + 16]    ; restore ecx
        mov edx, dword[ebx + 20]    ; restore edx
        mov esp, dword[ebx + 28]    ; restore esp
        mov ebp, dword[ebx + 32]    ; restore ebp
        mov esi, dword[ebx + 36]    ; restore esi
        mov edi, dword[ebx + 40]    ; restore edi
        mov es, word[ebx + 48]      ; restore es
        mov ds, word[ebx + 44]      ; restore ds
        push dword[ebx + 8]         ; push eflags
        push dword[ebx + 4]         ; push cs
        push dword[ebx + 0]         ; push eip
        push dword[ebx + 24]
        pop ebx                     ; restore ebx
        iretd                       ; switch to next

Figure 5.2: Reading back the cpu registers of a Thread 

5.4 Threads and Processes in SWEB 

5.4.1 Kernel Threads 

A kernel thread in SWEB is created by deriving from the "Thread" class and implementing the virtual "Run" method. The constructor of the Thread class allocates a stack for the thread (typically 8 kilobytes) and a structure called "ArchThreadRegisters" in which all registers of the thread are stored. The thread registers are then initialised so that the thread executes the Run method the first time it runs. This is done by starting the thread in a C function called "ThreadStartHack", which in turn calls "currentThread->Run()". As an alternative, one could call the Run method directly by pushing a valid this pointer onto the stack. This, however, would make it impossible to catch a return from the Run method. Such a return would lead to strange, unreproducible results, since there is no well-defined return address to jump to. The Run method must therefore never return; if it does, the kernel raises a panic to prevent further damage.
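The start-up mechanism described above can be sketched in a few lines of C++. This is only an illustration that runs in an ordinary program: the class layout, the global currentThread, the exception used in place of a kernel panic and the helper startAndCheckPanic are all assumptions and not taken from the SWEB sources.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the SWEB Thread base class.
class Thread
{
public:
  virtual ~Thread() = default;
  virtual void Run() = 0;  // implemented by every concrete kernel thread
};

// Hypothetical stand-in for the kernel's "currently running thread" pointer.
Thread* currentThread = nullptr;

// Stand-in for a kernel panic; a real kernel would halt instead of throwing.
struct KernelPanic : std::runtime_error
{
  using std::runtime_error::runtime_error;
};

// The initial eip of a new thread points here, not at Run() itself.
// Run() therefore has a well-defined caller and a return can be detected.
void ThreadStartHack()
{
  currentThread->Run();
  throw KernelPanic("Run() returned");
}

// Example thread whose Run() returns, which a real kernel thread must not do.
class CountingThread : public Thread
{
public:
  int runs = 0;
  void Run() override { ++runs; }
};

// Starts the thread and reports whether the (simulated) panic was raised.
bool startAndCheckPanic(Thread& thread)
{
  currentThread = &thread;
  try
  {
    ThreadStartHack();
  }
  catch (const KernelPanic&)
  {
    return true;
  }
  return false;
}
```

Because the thread enters through ThreadStartHack, a returning Run method ends in a defined place instead of jumping to an arbitrary address.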

Register Initialisation 

Figure 5.3 shows how the CPU registers of a kernel thread are initialised. Interesting 

things to see: 

• All segment registers are set to use kernel level segments. 

• The page directory (cr3 register) is set to use the kernel page tables (this way all 

kernel threads share the same virtual memory area). 

• The protection level (dpl) is set to kernel mode which means that this thread can 

virtually do whatever it wants, without limitations. 

• The instruction pointer (eip) is set to the address of the start function. 



void ArchThreads::createThreadInfosKernelThread(ArchThreadInfo *&info,
    pointer start_function, pointer stack)
{
  info = (ArchThreadInfo*)new uint8[sizeof(ArchThreadInfo)];
  ArchCommon::bzero((pointer)info, sizeof(ArchThreadInfo));
  pointer pageDirectory = VIRTUAL_TO_PHYSICAL_BOOT(
      ((pointer)&kernel_page_directory_start));

  info->cs = KERNEL_CS;
  info->ds = KERNEL_DS;
  info->es = KERNEL_DS;
  info->ss = KERNEL_SS;
  info->eflags = 0x200;
  info->eax = 0;
  info->ecx = 0;
  info->edx = 0;
  info->ebx = 0;
  info->esi = 0;
  info->edi = 0;
  info->dpl = DPL_KERNEL;
  info->esp = stack;
  info->ebp = stack;
  info->eip = start_function;
  info->cr3 = pageDirectory;

  /* fpu (=fninit) */
  info->fpu[0] = 0xFFFF037F;
  info->fpu[1] = 0xFFFF0000;
  info->fpu[2] = 0xFFFFFFFF;
  info->fpu[3] = 0x00000000;
  info->fpu[4] = 0x00000000;
  info->fpu[5] = 0x00000000;
  info->fpu[6] = 0xFFFF0000;
  kprintfd_nosleep("ArchThreads::createThreadInfosKernelThread: values done\n");
}

Figure 5.3: Initialisation of the Registers for a Kernel Thread 

• The FPU is initialised by setting known good values into the fpu registers. 

5.4.2 Userspace Processes 

A userspace process is a little more complicated than a kernel thread. To create a userspace process, one first has to create an ordinary kernel thread. This thread then loads an executable by using the Loader and creates a second CPU register structure. After all this, the thread sets a special flag in its thread object, which makes the scheduler switch to the userspace part the next time it switches to this thread. So basically a userspace process is a kernel thread with an additional set of CPU registers and, of course, its own address space. This additional set of CPU registers is initialised differently than for a normal kernel thread, as we will see in the next section.
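The idea of "a kernel thread with an additional set of registers" can be pictured with the following sketch. The struct layout, the member names (including the flag switch_to_userspace) and the helper registersToRestore are assumptions made for illustration; the real SWEB classes contain considerably more state.

```cpp
#include <cassert>
#include <cstdint>

// Heavily reduced, hypothetical register set.
struct Registers
{
  std::uint32_t eip = 0;
  std::uint32_t esp = 0;
  std::uint32_t cr3 = 0;
};

struct Thread
{
  Registers kernel_registers;           // every thread has these
  Registers* user_registers = nullptr;  // only set for userspace processes
  bool switch_to_userspace = false;     // the "special flag" (name is assumed)
};

// Picks the register set that would be restored on the next switch to `thread`.
const Registers& registersToRestore(const Thread& thread)
{
  if (thread.switch_to_userspace && thread.user_registers != nullptr)
    return *thread.user_registers;
  return thread.kernel_registers;
}
```

Until the flag is set, the thread keeps running its kernel part, which is the phase in which it loads the executable and prepares the second register structure.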



Register Initialisation 

Figure 5.4 shows how the CPU registers of a userspace thread are initialised. Interesting 

things to see: 

• All segment registers are set to use userspace segments. Only the ss0 entry is 

set to use a kernel segment. 

• The page directory (cr3 register) is set to use the kernel page tables. This is to 

have a failsafe in case one forgets to create a PageDirectory for the userspace 

process. 

• The protection level (dpl) is set to userspace mode, which means that this thread can only access what the kernel allows it to access (for instance, memory above 2 GiB is not accessible).

• The instruction pointer (eip) is set to the address of the start function. The value 

is retrieved by the binary loader. 

• The FPU is initialised by setting known good values into the fpu registers. 

• The esp0 entry is set to point to a stack in the kernel area. This stack is used by the CPU during interrupt and syscall handling.

5.4.3 Switching to Userspace 

At the end of some interrupt handlers the scheduler switches to a different thread. The selected thread object is then checked to determine whether its userspace or its kernelspace part is to be executed next. Figure 5.2 shows how a task switch to a kernel thread happens; Figure 5.5 shows the switch to userspace.

The main differences between switching to userspace and switching to a normal kernel thread are:

• esp0 has to be set in the tss. This stack will be used if any exception or interrupt 

happens in user mode because the usermode stack can not be trusted to have a 

good value (imagine a pagefault because of a stack overflow in userspace). 

• iret also pops the esp register and the ss segment register since the userspace 

registers can’t be used for the last steps before the iret in kernel mode. 
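The second difference is visible in what gets pushed onto the stack before iretd. The sketch below models the stack frame as a list of values in push order; the struct and the function buildIretFrame are hypothetical helpers for illustration, while the push order itself follows Figures 5.2 and 5.5.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reduced register set; the names follow the comments in Figures 5.2 and 5.5.
struct Registers
{
  std::uint32_t eip, cs, eflags, esp, ss;
};

// Returns the values in the order they are pushed before iretd.
// iretd pops them in reverse order: eip, cs, eflags and, when it returns
// to a less privileged level, also esp and ss.
std::vector<std::uint32_t> buildIretFrame(const Registers& regs, bool to_userspace)
{
  std::vector<std::uint32_t> pushed;
  if (to_userspace)
  {
    pushed.push_back(regs.ss);
    pushed.push_back(regs.esp);
  }
  pushed.push_back(regs.eflags);
  pushed.push_back(regs.cs);
  pushed.push_back(regs.eip);
  return pushed;
}
```

A switch between kernel threads thus builds a frame of three values and loads esp with an ordinary mov, whereas a switch to userspace builds a frame of five values and lets iretd load esp and ss.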

5.5 Multiprocessor Machines 

5.5.1 The FPU Trick 

The FPU trick described previously no longer works on multiprocessor machines, since a thread could be rescheduled to run on a different CPU the next time it runs. So, unless one can guarantee CPU affinity for threads, the FPU state has to be saved and restored on every thread switch.

This little trick also led to a famous bug in the Linux kernel: http://lwn.net/Articles/89586/




void ArchThreads::createThreadInfosUserspaceThread(ArchThreadInfo *&info,
    pointer start_function, pointer user_stack, pointer kernel_stack)
{
  info = (ArchThreadInfo*)new uint8[sizeof(ArchThreadInfo)];
  ArchCommon::bzero((pointer)info, sizeof(ArchThreadInfo));
  pointer pageDirectory = VIRTUAL_TO_PHYSICAL_BOOT(
      ((pointer)&kernel_page_directory_start));

  info->cs = USER_CS;
  info->ds = USER_DS;
  info->es = USER_DS;
  info->ss = USER_SS;
  info->ss0 = KERNEL_SS;
  info->eflags = 0x200;
  info->eax = 0;
  info->ecx = 0;
  info->edx = 0;
  info->ebx = 0;
  info->esi = 0;
  info->edi = 0;
  info->dpl = DPL_USER;
  info->esp = user_stack;
  info->ebp = user_stack;
  info->esp0 = kernel_stack;
  info->eip = start_function;
  info->cr3 = pageDirectory;

  /* fpu (=fninit) */
  info->fpu[0] = 0xFFFF037F;
  info->fpu[1] = 0xFFFF0000;
  info->fpu[2] = 0xFFFFFFFF;
  info->fpu[3] = 0x00000000;
  info->fpu[4] = 0x00000000;
  info->fpu[5] = 0x00000000;
  info->fpu[6] = 0xFFFF0000;
  kprintfd_nosleep("ArchThreads::create: values done\n");

}

Figure 5.4: Initialisation of the Registers for a Userspace Thread 



arch_switchThreadToUserPageDirChange:

    mov ecx, dword [g_tss]       ; tss
    mov eax, dword [ebx + 68]    ; get esp0
    mov dword [ecx + 4], eax     ; restore esp0
    mov eax, dword [ebx + 76]    ; page directory
    mov cr3, eax                 ; change page directory
    mov eax, dword [ebx + 12]    ; restore eax
    mov ecx, dword [ebx + 16]    ; restore ecx
    mov edx, dword [ebx + 20]    ; restore edx
    mov ebp, dword [ebx + 32]    ; restore ebp
    mov esi, dword [ebx + 36]    ; restore esi
    mov edi, dword [ebx + 40]    ; restore edi
    mov es, word [ebx + 48]      ; restore es
    mov ds, word [ebx + 44]      ; restore ds
    push dword [ebx + 60]        ; push ss here, dpl lower
    push dword [ebx + 28]        ; push esp here, dpl lower
    push dword [ebx + 8]         ; push eflags
    push dword [ebx + 4]         ; push cs
    push dword [ebx + 0]         ; push eip
    push dword [ebx + 24]
    pop ebx                      ; restore ebx
    iretd                        ; switch to next

Figure 5.5: Reading back the cpu registers of a Thread and switch to Userspace 

5.5.2 CPU Affinity 

Another interesting problem that only arises on multiprocessor machines is that moving a thread from one CPU to another can hurt performance. All the data used by the thread that was already cached on the CPU it was moved from is not available in the cache of the CPU it moves to. This can decrease performance a lot.

5.5.3 NUMA 

NUMA, or Non-Uniform Memory Access, simply means that certain areas of memory are faster to access for a given CPU than others. This means that ideally a thread should be running on the CPU which has the best access to the memory it is using.

5.6 Scheduling 

5.6.1 Threads 

The base unit used in SWEB for scheduling is defined as a class Thread. This class 

includes all the necessary information for task switching like the state of a thread or 

a structure containing register values. 

Every new structure representing a task in the multiprogramming sense such as a 

userspace thread has to inherit from this base class. 




Thread states 

In SWEB a Thread object can be in one of three states, which affect its scheduling and management.

The default state is called 'Running' and marks the thread as runnable. Only runnable threads are considered during scheduling. 'Sleeping' threads are simply ignored by the scheduling mechanism and are not executed. Threads in the state 'ToBeDestroyed' have already completed their execution or have been forcibly killed. Thread objects with this state are destroyed by a cleanup routine called in the IdleThread, which is always on the scheduler list.

5.6.2 The Scheduler 

The central element of the scheduling in SWEB is the class Scheduler, implemented 

using a slightly modified Singleton pattern. This class manages a list containing all 

threads in the system and provides several methods for controlling the scheduling at 

run-time. 

Important Control Methods 

• void Scheduler::addNewThread(Thread *thread) adds a new Thread to the scheduling 

list. 

• void Scheduler::yield() manually calls a special interrupt which does a reschedule. 

• void Scheduler::sleep() sets the state of the currently executing thread to 'Sleeping' and calls yield(), thereby putting it to sleep.

• void Scheduler::sleepAndRelease(Mutex &lock) works like sleep(), but releases the given Mutex only after the thread state has been set. This avoids a race condition between setting the state and the call to yield(). The method is also overloaded to take a SpinLock as argument.

• void Scheduler::wake(Thread* thread_to_wake) wakes up the given thread by 

setting its state to ’Running’. 

The Scheduling Algorithm 

The scheduling of registered threads is done in the method uint32 Scheduler::schedule() in round-robin fashion. It takes a thread from the front of the thread list and rotates the list, so that the thread at the front is moved to the back. This is done by a special method, List::rotateBack(), which avoids memory allocation and deallocation, because neither may be used in this method. The reason is that a thread or process could be holding the kernel memory manager lock, which it would have to release before the scheduler could use dynamic memory management. But while the scheduler itself is waiting for the lock, the owning thread cannot be scheduled for execution, and thus the system would end up deadlocked.

The scheduling method is called from the timer and yield interrupt handlers. No accounting is done regarding the execution time of a thread on the CPU.
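The rotation scheme can be sketched as follows. This is a minimal illustration, not the SWEB implementation: the fixed-capacity ThreadList, its members and the free function schedule are assumptions, chosen so that picking the next thread never allocates or frees memory.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

enum class ThreadState { Running, Sleeping, ToBeDestroyed };

struct Thread
{
  int id;
  ThreadState state;
};

// Fixed-capacity list, so scheduling never touches the memory manager.
struct ThreadList
{
  std::array<Thread*, 8> slots{};
  std::size_t count = 0;

  // Moves the front element to the back; all other elements move up by one.
  void rotateBack()
  {
    if (count > 1)
      std::rotate(slots.begin(), slots.begin() + 1, slots.begin() + count);
  }
};

// Returns the next runnable thread, or nullptr if no thread is runnable.
Thread* schedule(ThreadList& list)
{
  for (std::size_t tried = 0; tried < list.count; ++tried)
  {
    Thread* candidate = list.slots[0];
    list.rotateBack();
    if (candidate->state == ThreadState::Running)
      return candidate;
  }
  return nullptr;
}
```

With the threads 1 (Running), 2 (Sleeping) and 3 (Running) in the list, repeated calls return 1, 3, 1, 3 and so on; the sleeping thread is rotated past without being selected.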



Cleaning up - The IdleThread

A special thread called IdleThread is put on the scheduling list on creation time. This 

thread executes a cleanup routine which removes dead thread objects recognizable by 

the state ’ToBeDestroyed’ and is never removed from the scheduling list. Therefore it 

has a regular chance for execution and it is ensured that there is always at least one 

thread ready to run even when the system is idle (hence the name of the thread). 
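A cleanup routine of this kind might look like the sketch below. It assumes, purely for illustration, that the thread list is a std::vector of owning pointers and that the function is called cleanupDeadThreads; neither is taken from the SWEB sources. Since it runs in the context of an ordinary thread and not inside the scheduler, it is free to release memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class ThreadState { Running, Sleeping, ToBeDestroyed };

struct Thread
{
  ThreadState state = ThreadState::Running;
};

// Deletes every thread marked ToBeDestroyed and returns how many were removed.
// In this sketch the list owns its threads, which were allocated with new.
std::size_t cleanupDeadThreads(std::vector<Thread*>& threads)
{
  std::size_t removed = 0;
  for (std::size_t i = 0; i < threads.size();)
  {
    if (threads[i]->state == ThreadState::ToBeDestroyed)
    {
      delete threads[i];
      threads.erase(threads.begin() + i);
      ++removed;
    }
    else
    {
      ++i;
    }
  }
  return removed;
}
```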


Chapter 6 

System Calls 

6.1 Base Information about System Calls 

System calls are implemented in every operating system. They are normally used for 

the communication between the kernel and the user space applications. 

Every standard single-CPU computer can execute only one instruction at a time. Therefore, if a user process is running in user mode and needs some information about the system, or needs a system service such as reading or writing data from a file, it has to execute a trap or system call instruction to transfer control to the operating system. The operating system reacts to this call by looking at the transferred parameters. It then carries out the system call and returns control to the instruction following the system call. In a way, a system call is like a procedure call, but the procedure itself is carried out by the kernel in kernel space and not by the application in user space.


6.1.1 Concept of System Calls 

Learning by example is the simplest way to understand how system calls work. This section shows the concept of system calls, using the UNIX operating system as an example. The figure shows the steps of the system call read in a UNIX system.

The system call read consists of a series of steps. First, before the library procedure read is called, the calling program pushes the parameters onto the stack. This corresponds to steps 1-3 in the diagram. The parameters are pushed in reverse order, for historical reasons related to old C and C++ compilers. The first and the third parameter are passed by value and the second parameter is passed by reference, meaning that the address of the buffer is passed and not the contents of the buffer.

The fourth step in the diagram is the actual call to the library procedure. These libraries are usually written in assembly language and optimised for speed. The next step shows how the system call number is put into a specific register. The operating system uses the number in this register to call the appropriate handler. In the sixth step a TRAP instruction switches the CPU from user mode into kernel mode and starts execution at a fixed address within the kernel. The kernel code examines the number in the aforementioned register and transfers control to the correct system call handler. The mapping from the system call number to the handler is done via a table of pointers to the system call handlers, indexed by system call number. In step 8 the system call handler performs its task, depending on the parameters passed by the system call. In step 9 the system call handler has completed its work, and control may be returned to the user space library procedure at the instruction following the TRAP instruction. In step 10 the procedure then returns to the user program in the same way ordinary procedure calls return.

Finally, in step 11, the job is finished and the user program has to clean up the stack, as it does after any procedure call.
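The table lookup performed by the kernel after the trap can be sketched like this. The handler signature, the handler bodies and the table contents are assumptions for illustration only; a real kernel has many more entries and real handlers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical handler signature: three arguments, one return value.
using SyscallHandler = std::int32_t (*)(std::uint32_t, std::uint32_t, std::uint32_t);

std::int32_t handleRead(std::uint32_t, std::uint32_t, std::uint32_t count)
{
  return static_cast<std::int32_t>(count);  // pretend everything was read
}

std::int32_t handleWrite(std::uint32_t, std::uint32_t, std::uint32_t size)
{
  return static_cast<std::int32_t>(size);  // pretend everything was written
}

// Table of handler pointers indexed by system call number.
// The numbers 3 (read) and 4 (write) are illustrative.
const SyscallHandler kSyscallTable[] = {nullptr, nullptr, nullptr, handleRead, handleWrite};
const std::size_t kSyscallCount = sizeof(kSyscallTable) / sizeof(kSyscallTable[0]);

// What the kernel does after the trap: look up the handler and call it.
std::int32_t dispatch(std::uint32_t number, std::uint32_t a, std::uint32_t b, std::uint32_t c)
{
  if (number >= kSyscallCount || kSyscallTable[number] == nullptr)
    return -1;  // unknown system call
  return kSyscallTable[number](a, b, c);
}
```

The bounds check matters because the system call number comes from user space and cannot be trusted.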

This was a simple example of how system calls work in UNIX. There are many system calls for different purposes. They are commonly divided into groups: process management, file management, directory management and miscellaneous system calls. These groups are not the focus of this document and will not be examined closely. If you want more information about them, refer to the book Modern Operating Systems by Tanenbaum. However, the system calls which are currently implemented in SWEB are thoroughly described in the sections to follow.

6.2 System Calls in the different Operating Systems 

The system calls use different structures in the different operating systems. UNIX, MINIX and Linux, as well as the SWEB operating system, share the same basic concept for system calls; only the details of the implementation differ. This type of compatibility is defined by the POSIX system call standard. The Windows family of operating systems, on the other hand, uses system calls in a different way.

The previous section explained how system calls work in the UNIX operating system. This section shows how the Windows operating system handles system calls. The details differ between the individual Windows versions.

A Windows program is normally event driven, which means that the main program waits for some event to happen and then calls a procedure to handle it. Events in Windows could be a mouse movement or a mouse button click. When the event occurs, the appropriate handler is called to process the event and update the screen as well as the internal program state. This means that Windows implements a message passing programming paradigm, as opposed to the UNIX system call approach.

In a way Windows has a system library, but it is unlike that of UNIX. UNIX has a library procedure for each system call. Microsoft, on the other hand, has defined a set of procedures, called the Win32 API (Application Program Interface), which programmers use to get services from the operating system. This interface is supposed to be supported on all versions of Windows since Windows 95.

The Win32 API has a large number of calls, more than a thousand. But as opposed to UNIX, it is impossible to see what really is a system call and what is a simple user space library call. Another big difference between UNIX and Windows is the Graphical User Interface (GUI). In UNIX the GUI runs in user space, meaning that only the system call write for writing on the screen and a few other minor ones run in the kernel. This provides better kernel stability and less userspace/kernel context switching. Additionally, this means that different graphical subsystems can be implemented without major changes in the kernel.

On the other hand, Windows implements within the Win32 API a huge number of calls for managing windows, text, fonts, menus and other features of the GUI. The graphics subsystem runs completely in the kernel, but not in all versions of Windows. On some Windows versions these Win32 API calls are mapped directly to system calls (KERNEL32), and on other versions they are just library calls (GDI32). In a nutshell, the Windows engineers implemented another layer between the user applications and the kernel. This layer gives more flexibility and provides the programmer with a stable interface that is independent of the Windows version. On the other hand, the GUI management is fixed and inflexible. It is hard to extend the GUI without extending the kernel.

To give a better overview of the differences between UNIX and Windows, here is a list of the most important system calls.

UNIX      Win32                  Description
fork      CreateProcess          Create a new process
waitpid   WaitForSingleObject    Can wait for a process to exit
execve    (none)                 CreateProcess = fork + execve
exit      ExitProcess            Terminate execution
open      CreateFile             Create a file or open an existing file
close     CloseHandle            Close a file
read      ReadFile               Read data from a file
write     WriteFile              Write data to a file
lseek     SetFilePointer         Move the file pointer
stat      GetFileAttributesEx    Get various file attributes
mkdir     CreateDirectory        Create a new directory
rmdir     RemoveDirectory        Remove an empty directory
link      (none)                 Win32 does not support links
unlink    DeleteFile             Destroy an existing file
mount     (none)                 Win32 does not support mount
umount    (none)                 Win32 does not support mount
chdir     SetCurrentDirectory    Change the current working directory
chmod     (none)                 Win32 does not support security (although NT does)
kill      (none)                 Win32 does not support signals
time      GetLocalTime           Get the current time

Table 1: The Win32 API calls which are comparable to the UNIX system calls.

This table is taken from the book 'Modern Operating Systems' by Tanenbaum. For more information please refer to this book.

6.3 System calls in SWEB 

SWEB is a simple operating system that tries to implement the POSIX system call standard. The system calls are implemented in a way similar to the Linux operating system.

6.3.1 Concept of System Calls in SWEB 

SWEB uses interrupt 0x80 for calling the system call handler in the kernel. The type of system call is specified by a predefined number which is stored in the EAX register on the x86 architecture. System call arguments are provided through the remaining general purpose registers. When the kernel has finished executing the call, it stores the return value in the EAX register. Figure 6.1 shows how system calls are issued in assembler.

BITS 32

section .text

global __syscall
__syscall:

    ; ok, first we have to do some little cleanup as every function should do
    push ebp            ; the base pointer goes onto the stack so we can restore it later
    mov ebp, esp

    ; ok, we have to save ALL registers we are now going to overwrite for
    ; our syscall

    push ebx
    push esi
    push edi

    mov eax, [ebp + 8]
    mov ebx, [ebp + 12]
    mov ecx, [ebp + 16]
    mov edx, [ebp + 20]
    mov esi, [ebp + 24]
    mov edi, [ebp + 28]

    int 0x80

    pop edi
    pop esi
    pop ebx
    mov esp, ebp
    pop ebp
    ret

Figure 6.1: The implementation of the system calls

Part of the system call implementation lives in the kernel, so the kernel knows which system calls exist. The definitions of the system calls are part of the kernel sources.

Figure 6.2 lists the system calls which are defined in the SWEB operating system. This list can be found in ~/sweb/common/include/kernel. These are the most important system calls of an operating system. Not all of them are needed yet and not all are used. The definition is shared by user space and kernel space.

The class Syscall (Figure 6.3) is only a container for the system call handler and its methods. Its header file is also located in ~/sweb/common/include/kernel. No instance of this class should be created, because syscallException is static so that it can be called directly. syscallException has to stay this way because InterruptUtils calls the system calls in this way.

#define fd_stdout 0
#define fd_stdin 1
#define fd_stderr 2

#define sc_restart 0
#define sc_exit 1
#define sc_fork 2
#define sc_read 3
#define sc_write 4
#define sc_open 5
#define sc_close 6
#define sc_waitpid 7
#define sc_creat 8
#define sc_link 9
#define sc_unlink 10
#define sc_execve 11

#define sc_lseek 19
#define sc_getpid 20
#define sc_mount 21
#define sc_umount 22

#define sc_nice 34

#define sc_kill 37

#define sc_rename 38
#define sc_mkdir 39
#define sc_rmdir 40
#define sc_dup 41
#define sc_pipe 42

#define sc_brk 45

#define sc_signal 48

#define sc_dup2 63

#define sc_reboot 88

#define sc_ipc 117

#define sc_clone 120

#define sc_flock 143
#define sc_msync 144

#define sc_vfork 190

Figure 6.2: The definition list of the system calls

class Syscall
{
public:

  static uint32 syscallException(uint32 syscall_number, uint32 arg1, uint32 arg2, uint32 arg3, uint32 arg4, ui

  static void exit(uint32 exit_code);

  static uint32 write(uint32 fd, pointer buffer, uint32 size);

  static uint32 read(uint32 fd, pointer buffer, uint32 count);

private:
};

#endif

Figure 6.3: The implementation of the system calls

syscallException takes at most six arguments from user space. It handles the system call and returns a single value to the architecture-specific interrupt handler. This return value is passed on to user space; on the x86 it is returned in the EAX register. The parameter syscall_number is the first argument from userspace and must be one of the system calls defined in the syscall-definition.h list.

The method exit is an example of handling a system call. The parameter exit_code is the exit code transmitted from userspace.

The methods write and read are further examples of how system calls are handled in SWEB. Only write is described here, because read is handled in a similar way. The parameter fd is a file descriptor as described in the syscall-definitions, i.e. fd_stdin, fd_stdout or fd_stderr.

The file Syscall.cpp, which is also part of the SWEB code, shows how the system calls are implemented. You can find this file in ~/sweb/common/source/kernel.

6.3.2 How to implement System Calls in SWEB 

Implementing system calls in SWEB is very easy. The numbers of the most important system calls are already defined in the file syscall-definitions, so the programmer does not have to add them to this list and can simply use the existing definitions. The code of the system call write (Figure 6.5) shows how a system call is implemented. Other system calls should be implemented in the same way.

6.3.3 A convenient interface 

As calling system functions manually does not exactly match the definition of a convenient interface, and most often some extra work has to be done on the side of the user program, system calls are executed through ordinary library functions available to userspace applications.

In SWEB a slim C library, which is automatically linked to executables by the make system, serves this purpose. Applications written in the C programming language can use the provided functions and therefore don't have to worry about the basics of system calls. This library can be found in the directory userspace/libc of the SWEB source tree.

uint32 Syscall::syscallException(uint32 syscall_number, uint32 arg1, uint32 arg2, uint32 arg3, uint32 a
{
  uint32 return_value = 0;

  kprintfd("Syscall %d called with arguments %d(=%x) %d(=%x) %d(=%x) %d(=%x) %d(=%x)\n", syscall_number

  switch (syscall_number)
  {
    case sc_exit:
      exit(arg1);
      break;
    case sc_write:
      return_value = write(arg1, arg2, arg3);
      break;
    case sc_read:
      return_value = read(arg1, arg2, arg3);
      break;
    default:
      kprintf("Syscall::syscall_exception: Unimplemented Syscall Number %d\n", syscall_number);
  }
  return return_value;
}

void Syscall::exit(uint32 exit_code)
{
  kprintfd("Syscall::EXIT: called, exit_code: %d\n", exit_code);
  currentThread->kill();
}

uint32 Syscall::write(uint32 fd, pointer buffer, uint32 size)
{
  //WARNING: this might fail if Kernel PageFaults are not handled
  assert(buffer < 2U*1024U*1024U*1024U);
  if (fd == fd_stdout) //stdout
  {
    kprintfd("Syscall::write: %B\n", (char*) buffer, size);
    kprint_buffer((char*) buffer, size);
  }
  return size;
}

uint32 Syscall::read(uint32 fd, pointer buffer, uint32 count)
{

  uint32 num_read = 0;
  if (fd == fd_stdin)
  {
    //this doesn't! terminate a string with \0, gotta do that yourself
    num_read = currentThread->getTerminal()->readLine((char*) buffer, count);
    kprintfd("Syscall::read: %B\n", (char*) buffer, num_read);
    for (uint32 c=0; c

uint32 Syscall::write(uint32 fd, pointer buffer, uint32 size)
{
  //WARNING: this might fail if Kernel PageFaults are not handled
  assert(buffer < 2U*1024U*1024U*1024U);
  if (fd == fd_stdout) //stdout
  {
    kprintfd("Syscall::write: %B\n", (char*) buffer, size);
    kprint_buffer((char*) buffer, size);
  }
  return size;
}

Figure 6.5: The implementation of the system call write

In order to make use of this enrichment in usability, an appropriate calling function 

should be provided in this library for every added system call. 
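A calling function of this kind could look like the following sketch. It is not the code from userspace/libc: the wrapper name userWrite, the type alias Word and the stub syscallStub, which replaces the assembly routine __syscall so that the example runs without a kernel, are all assumptions. Only the values of sc_write and fd_stdout are taken from the system call list.

```cpp
#include <cassert>
#include <cstdint>

using Word = std::uintptr_t;  // register-sized value; 32 bits on the x86 target

const Word sc_write = 4;   // value from the system call list
const Word fd_stdout = 0;  // value from the system call list

// Records the last request so the sketch can be checked without a kernel.
struct LastCall
{
  Word number, arg1, arg2, arg3;
};
LastCall last_call{};

// Stand-in for the assembly routine __syscall; the real routine loads the
// arguments into registers and executes int 0x80.
Word syscallStub(Word number, Word arg1, Word arg2, Word arg3, Word, Word)
{
  last_call = LastCall{number, arg1, arg2, arg3};
  return arg3;  // pretend the kernel wrote all bytes
}

// Hypothetical library wrapper: hides the system call number and the
// unused arguments from the application.
Word userWrite(Word fd, const char* buffer, Word size)
{
  return syscallStub(sc_write, fd, reinterpret_cast<Word>(buffer), size, 0, 0);
}
```

An application would then simply call userWrite(fd_stdout, "hi", 2) and never deal with system call numbers or registers itself.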




Chapter 7 

Character Devices 


Generally, the I/O devices connected to a computer can be divided into two major groups: character devices, through which data is transmitted one character at a time, and block devices, through which data is transmitted in the form of blocks. Block devices often implement buffered reading. The main difference between the two is that from a character device it is only possible to read the next available character, whereas a block device can deliver any block at any given time, meaning that seek operations are supported as well. It should be noted that this kind of division is found in the UNIX operating system and its derivatives. This approach is great for treating devices as files and brings the abstraction needed for hiding the hardware from the programmer/user. The drawback is that sometimes there are too many abstraction layers that do nothing but pass messages around, thus effectively slowing down the system. The Windows operating system, on the other hand, implements devices through API classes, meaning that for each device type there exists a defined API which must be implemented by a particular driver. This approach gives more flexibility and speed, but its disadvantage lies in the fact that the operating system must be extended each time a new device class appears. The aim of this chapter is to give a short introduction to and explanation of character devices and their implementation in the SWEB operating system.

7.2 Classes 

Some of the most common character devices that can be connected to the computer 

are: 

• Keyboards 

• Monitors 

• Serial ports 

• Mice, tablets, light pens and other pointing devices 


• USB devices 

• Printers 

• Linux even implements SCSI disks as character devices, where a certain structure 

is filled with parameters and written to the device and the result is then read 

in raw mode character by character from the device driver. 

The SWEB operating system has device drivers for keyboard, monitor and serial port. 

7.3 Character Device Base 

Every character device by definition can be viewed as a simple sequence of characters. 

Owing to that, all classes that represent character devices have one buffer for storing 

the data coming from the character device and another buffer for the output data. 

Those buffers, as well as methods to handle the read and write requests, have been 

put in a base class called CharacterDevice. The definition and implementation of this class can be found in common/include/drivers/chardev.h.

Function: CharacterDevice( char* name, Superblock* super_block = 0, uint32 inode_type = I_CHARDEVICE );
Description: Constructor of the CharacterDevice. The parameters super_block and inode_type are optional and are just passed to the Inode constructor and not used by the CharacterDevice class. The parameter name is the device name under which the character device will be added to the Device File System. The constructor copies the name into an internal member variable and allocates the input and output buffers. It adds this instance to the DeviceFSSuperblock as well.

Function: ~CharacterDevice();
Description: Destructor. Deletes the allocated buffers and deregisters the device from the Device File System.

Table 7.1: Character Device base class methods 

The data structure used for the buffers is a FIFO circular buffer where old characters
get overwritten if nobody has read them and the data is too large to fit into the rest of
the buffer. The FIFO buffer also implements the synchronization mechanisms between
reads and writes as well as between multiple reads.

Member:
    FiFo< uint32 > *_in_buffer;
Description:
    Circular buffer for storing input data after it has been read from the device.

Member:
    FiFo< uint32 > *_out_buffer;
Description:
    Circular buffer for storing the output data before it is written to the device.

Member:
    RingBuffer< uint32 > *_irq_buffer;
Description:
    Circular buffer for storing input data from an IRQ handler. The data cannot be
    stored directly within the handler because of the synchronisation mechanisms
    implemented in the FiFo buffers.

Member:
    static const uint32 CD_BUFFER_SIZE;
Description:
    The size of the buffers is set to 1k by default.

Table 7.2: Character Device input and output buffers

The class CharacterDevice also implements the Inode interface (see VFS Chapter), allowing
it to be added to the Device file system. Any class that inherits the CharacterDevice
class will be automatically added to the Device file system under
/dev/name_of_the_device. This means that any character device can be opened as a file
through the Virtual File System (although no seeking operation is supported).

Function:
    virtual int32 readData( int32 offset, int32 size, char *buffer );
Description:
    Reads size bytes from the character device input buffer. The parameter buffer
    is a pointer to a preallocated buffer which must have enough space to store the
    data. The offset parameter is kept for compatibility with the Inode interface.
    If offset is not equal to zero, an error is returned. This is a blocking
    function, meaning that if the input buffer is empty, the calling thread will be
    put to sleep until a character is available. If the function is successful, the
    number of bytes actually read is returned. Otherwise a negative integer value
    is returned.

Function:
    virtual int32 writeData( int32 offset, int32 size, const char *buffer );
Description:
    Writes size bytes to the character device output buffer. The offset parameter
    is kept for compatibility with the Inode interface. If offset is not equal to
    zero, an error is returned. If the output buffer is full, the calling thread is
    blocked until there is enough space in the output buffer to store the data. If
    the function is successful, the number of bytes actually written is returned.
    Otherwise a negative integer value is returned.

Function:
    int32 mknod( Dentry *dentry );
    int32 mkfile( Dentry *dentry );
    int32 create( Dentry *dentry );
Description:
    These three functions basically do the same thing. They link the current
    instance of the character device with the directory entry dentry.

Function:
    File* link( uint32 flag );
Description:
    Creates a new file and links the file with the current instance of the
    character device.

Function:
    int32 unlink( File* file );
Description:
    As the name suggests, unlinks the file from the current instance of the
    character device and deletes the file pointer.

Table 7.3: The Inode functions implemented by CharacterDevice

When writing a device driver for a character device, the simplest approach is to
inherit from the CharacterDevice class and override its methods processInBuffer() and
processOutBuffer(). The processOutBuffer() method should implement writing the characters
in the output buffer to the character device. The processInBuffer() method should
be overridden only if some monitoring and/or filtering of the input buffer is needed.
Next, the IRQ handler should be written; it receives the characters from the device
and adds them to the input buffer. The CharacterDevice class implements the
Thread interface, allowing it to run as a separate thread in kernel space. During
kernel initialization this thread should be added to the Scheduler. See the Serial
Port section of this chapter for an example of how to write a CharacterDevice driver.

7.4 Keyboard 

The keyboard can be connected through: 

• legacy AT keyboard connector, 

• PS/2 keyboard connector, 

• USB port 

From the driver programmer's point of view, there is no difference between the first
two connectors, and keyboards attached through either of them are supported by the SWEB
keyboard driver. USB keyboards running in legacy mode are supported as well,
but there are some known problems, such as controlling the keyboard LEDs.

Each keyboard contains a processor which monitors the state of the keyboard key matrix
and, when a change is detected, sends the data to the keyboard controller, which is
usually located on the computer's motherboard. The keyboard processor communicates with
the keyboard controller using an IBM communications protocol developed in the early
1980s. The controller receives the data from the keyboard processor and stores it in
its internal buffer. The operating system communicates with the keyboard controller
through three 8-bit registers.



Register                     Port         Read/Write
Controller Input Buffer      0x60, 0x64   Write
Controller Output Buffer     0x60         Read
Controller Status Register   0x64         Read

Table 7.4: I/O ports for communicating with keyboard

When a character is available in the keyboard controller buffer, IRQ 1 is
raised and the interrupt handler routine should read the character from the buffer. If
the character is not read, the IRQ will not be raised when the next character arrives.
The keyboard sends scancodes rather than the actual characters, and these scancodes
must be converted. One key can send a one-, two- or four-byte scancode.

The SWEB implementation of the keyboard is a simple character device that requires only
the input buffer. Its driver is implemented in the KeyboardManager class, which can be
found in arch_keyboard_manager.h and arch_keyboard_manager.cpp. The main input part of
the keyboard driver is the interrupt handler. This handler is called each time there is
a character in the keyboard controller buffer. In the interrupt routine, the keyboard
is disabled and the scancodes are read from the keyboard controller buffer. One-byte
scancodes are saved directly to the keyboard buffer, whereas two- and four-byte
scancodes are converted to one byte in the IRQ handler before they are stored.
The one-byte scancodes lie in the range from 0x00 to 0x60 and there are fewer than
32 different two- and four-byte scancodes. Owing to that, the two- and four-byte
scancodes can be remapped to the range between 0x60 and 0x80. The range
between 0x80 and 0xFF, that is, every value with the most significant bit set, is
intentionally left free for extending the driver to handle key release events.
Currently these events are completely ignored by SWEB, with the exception of the
SHIFT, CTRL and ALT modifiers.

The data structure used to store the scancodes is the RingBuffer. The RingBuffer is a
circular buffer structure of a certain size. When the buffer is full and no data was
read, the oldest entry is overwritten with the new data. That means that if the
Terminal thread is blocked and unable to read from the keyboard driver buffer, data
loss may occur. The size of the keyboard RingBuffer is set to 1024 bytes, meaning that
1k of scancodes can be remembered by the driver. This should be more than enough, when
taking into consideration that this section has less than 1024 characters.
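The overwrite-on-full behaviour described above can be sketched as follows. This is a minimal, single-threaded illustration and not the SWEB RingBuffer itself: the class name OverwritingRing and its put()/get() interface are assumptions made for this example, and no interrupt-safe synchronisation is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a scancode ring buffer (not the SWEB class).
class OverwritingRing
{
public:
  explicit OverwritingRing(std::size_t capacity)
    : data_(capacity), head_(0), count_(0)
  {
    assert(capacity > 0);
  }

  // Store one scancode; when the buffer is full the oldest entry is lost.
  void put(std::uint8_t value)
  {
    const std::size_t tail = (head_ + count_) % data_.size();
    data_[tail] = value;
    if (count_ == data_.size())
      head_ = (head_ + 1) % data_.size(); // full: the oldest entry was overwritten
    else
      ++count_;
  }

  // Fetch the oldest stored scancode; returns false when the buffer is empty.
  bool get(std::uint8_t &value)
  {
    if (count_ == 0)
      return false;
    value = data_[head_];
    head_ = (head_ + 1) % data_.size();
    --count_;
    return true;
  }

  std::size_t size() const { return count_; }

private:
  std::vector<std::uint8_t> data_;
  std::size_t head_;  // index of the oldest entry
  std::size_t count_; // number of valid entries
};
```

With a capacity of 3, storing the values 1, 2, 3 and 4 leaves 2, 3 and 4 in the buffer; the first value is silently dropped, which is exactly the data loss the text warns about when the reader is blocked.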

Function:
    void serviceIRQ();
Description:
    This method is the actual IRQ handler. It receives the scancode, performs the
    necessary conversions, modifies the keyboard status and stores the scancode in
    the internal buffer.

Function:
    void modifyKeyboardStatus( uint8 scancode );
Description:
    If the parameter scancode contains one of the modifier key scancodes (SHIFT,
    CTRL, ALT, NUM, CAPS or SCROLL), the internal keyboard status variable is
    modified. This function is called from the interrupt handler for each scancode
    that is received from the keyboard. The method is aware of the key release
    events as well. If the keyboard status gets modified, the setLEDs method is
    called.

Function:
    void setLEDs( void );
Description:
    The setLEDs method checks if the current keyboard NUM, CAPS or SCROLL status
    has changed and in that case updates the keyboard LED lights accordingly. Most
    modern keyboards keep track of the keyboard LED lights, but most older
    keyboards do not.

Function:
    void kb_wait();
Description:
    This method checks the state of the keyboard controller and waits until it is
    ready to accept commands or until a timeout occurs.

Function:
    void send_cmd( uint8 cmd, uint8 port = 0x64 );
Description:
    The method used to send a command to the keyboard controller. If the port is
    not given, the default port 0x64 is used. Before sending a command, the
    kb_wait() method is called.
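How a status word can follow press and release events of the modifier keys is sketched below. The sketch assumes raw make codes from PC scancode set 1 (a release is the make code with bit 7 set); the flag names and the function updateStatus() are hypothetical and do not reproduce SWEB's modifyKeyboardStatus(), which works on the driver's own remapped codes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status flags; SWEB's internal representation may differ.
enum KbdStatus : std::uint32_t
{
  KBD_SHIFT = 1u << 0,
  KBD_CTRL  = 1u << 1,
  KBD_ALT   = 1u << 2,
  KBD_CAPS  = 1u << 3
};

// Returns the updated status word for one incoming scancode.
std::uint32_t updateStatus(std::uint32_t status, std::uint8_t scancode)
{
  const bool released = (scancode & 0x80) != 0; // bit 7 marks a key release
  const std::uint8_t make = scancode & 0x7F;

  std::uint32_t held_flag = 0;
  switch (make)
  {
    case 0x2A: // left shift
    case 0x36: // right shift
      held_flag = KBD_SHIFT;
      break;
    case 0x1D: // ctrl
      held_flag = KBD_CTRL;
      break;
    case 0x38: // alt
      held_flag = KBD_ALT;
      break;
    case 0x3A: // caps lock toggles on press and ignores the release
      return released ? status : (status ^ KBD_CAPS);
    default:
      return status; // not a modifier key
  }
  // Held modifiers are set on press and cleared on release.
  return released ? (status & ~held_flag) : (status | held_flag);
}
```

A driver would call such a function for every received scancode and refresh the LEDs only when the returned status differs from the previous one.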

On the other side, the terminals need the keys that were pressed and have no use
for scancodes. The method getKeyFromKbd() can be used to read the next character
from the keyboard. This method is non-blocking: if no key is available, the
function simply returns boolean false. Note that this behavior is likely to
change in the future.

Function:
    bool getKeyFromKbd( uint32 &key );
Description:
    Method for reading the next character from the keyboard. This method checks the
    internal keyboard buffer and, if there is a scancode in the buffer, it is
    converted and stored in the key parameter. The method returns true if there is
    any scancode available and false otherwise.

Function:
    uint32 convertScancode( uint8 scancode );
Description:
    The scancode is converted according to the conversion table and the
    corresponding character is returned to the caller. The conversion table is
    stored in the STANDARD_KEYMAP internal variable. The conversion method is the
    standard table lookup.

Function:
    bool isCaps(); bool isNum(); bool isScroll();
    bool isShift(); bool isCtrl();
    bool isAlt(); bool isAltGr();
Description:
    This set of methods can test the status of the keyboard modifiers. The methods
    return true if the modifier is on, false otherwise.
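A table lookup of this kind can be sketched as follows. The table below is a hypothetical excerpt with a handful of make codes from PC scancode set 1; it is not SWEB's STANDARD_KEYMAP, and the function name convertScancodeSketch() is an assumption for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Builds a tiny, illustrative keymap indexed by scancode.
// Entries that are not filled in stay 0, meaning "no printable character".
const std::array<std::uint32_t, 128> &sketchKeymap()
{
  static const std::array<std::uint32_t, 128> table = [] {
    std::array<std::uint32_t, 128> t{}; // zero-initialised
    t[0x02] = '1';
    t[0x1C] = '\n'; // enter
    t[0x1E] = 'a';
    t[0x2E] = 'c';
    t[0x30] = 'b';
    t[0x39] = ' ';
    return t;
  }();
  return table;
}

// Converts a one-byte scancode by indexing the table; release codes
// (bit 7 set) carry no character in this sketch.
std::uint32_t convertScancodeSketch(std::uint8_t scancode)
{
  if (scancode >= 0x80)
    return 0;
  return sketchKeymap()[scancode];
}
```

Because the conversion is a single array access, exchanging the table is enough to support a different keyboard layout.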

The following changes should be implemented in the SWEB keyboard driver:

• The CharacterDevice interface should be implemented, thus enabling applications
to get the raw scancodes through the VFS read method.

• The convertScancode() method should be moved to the Terminal class, thus enabling
individual key mappings for each Terminal to be implemented.

7.5 Serial Port (or how to write a character device driver in SWEB)

This section gives the reader a simple tutorial on how to write a character device
driver for the SWEB operating system. The tutorial assumes that the SWEB source code
has been downloaded and that the system is set up to compile it. Additionally, the
tutorial is written with the “x86” architecture in mind, but the basic outline does
not differ much for other architectures. To avoid interference from the actual SWEB
serial port driver, the following lines must be removed (commented out) from the
actual source code:

In file /common/kernel/main.cpp:

664: Scheduler::instance()->addNewThread(
665:   new SerialThread( "SerialTestThread" )
666: );

First of all, some background information on serial ports is provided.
The serial port is an I/O device that enables the computer to communicate with other
devices that support the RS-232 communication standard. As its name suggests, the
data is transmitted over the serial line as a sequence of bits, so every byte of data
that needs to be transmitted must be converted into a series of bits. This
work is performed by the UART microcontroller. UART stands for Universal Asynchronous
Receiver / Transmitter. The 8250 series, which includes the 16450, 16550,
16650 and 16750 UARTs, is the type most commonly found in PCs. The
later versions of the UART contain FIFO buffers as well. The UART has several registers
which can be used to set the serial port communication parameters and to send and
receive data. Some of these registers are:

Base Address   Read/Write   Abbr.   Register Name
+0             Write        THB     Transmitter Holding Buffer
+0             Read         RBR     Receiver Buffer
+2             Write        FCR     FIFO Control Register
+3             Read/Write   LCR     Line Control Register
+4             Read/Write   MCR     Modem Control Register

The base address and the IRQ of the UART depend on the communication port. Most
PCs have four communication ports with the following addresses and IRQ numbers:

Name Address IRQ 

COM 1 3F8 4 

COM 2 2F8 3 

COM 3 3E8 4 

COM 4 2E8 3 
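The two tables above can be combined into a small lookup that yields the absolute I/O port of a register. This is only a sketch: the offsets for IER, IIR, LSR and MSR are the usual 8250/16550 layout and are not taken from the SWEB sources, and the names UartRegister, ComPort and registerPort() are assumptions for this example.

```cpp
#include <cassert>
#include <cstdint>

// Register offsets relative to the UART base port (usual 8250/16550 layout).
// The abbreviations follow the ones used in this chapter.
enum UartRegister : std::uint16_t
{
  THB = 0, // write: transmitter holding buffer
  RBR = 0, // read: receiver buffer
  IER = 1, // interrupt enable register
  IIR = 2, // read: interrupt identification register
  FCR = 2, // write: FIFO control register
  LCR = 3, // line control register
  MCR = 4, // modem control register
  LSR = 5, // line status register
  MSR = 6  // modem status register
};

struct ComPort
{
  const char *name;
  std::uint16_t base;
  std::uint8_t irq;
};

// The four conventional PC communication ports.
static const ComPort COM_PORTS[4] = {
  {"COM1", 0x3F8, 4},
  {"COM2", 0x2F8, 3},
  {"COM3", 0x3E8, 4},
  {"COM4", 0x2E8, 3}
};

// Absolute I/O port of a register on the given COM port (index 0..3).
std::uint16_t registerPort(unsigned com_index, UartRegister reg)
{
  assert(com_index < 4);
  return static_cast<std::uint16_t>(COM_PORTS[com_index].base + reg);
}
```

For example, the Line Control Register of COM 1 is reached at 0x3F8 + 3 = 0x3FB.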

The device driver is implemented in the class SerialPort. The header file should
be placed in arch/arch/include/arch_serial_tutorial.h and the implementation file should
be placed in arch/arch/source/arch_serial_tutorial.cpp. The class inherits from the
CharacterDevice class, thus inheriting the input and output buffers as well. The class
SerialPort requires the following member variables:

• base port

• IRQ

The constructor of this class accepts the following parameters:

• name of the device to be passed to the device file system ( char * ),

• base port of the serial port ( uint16 ),

• IRQ of the serial port ( uint8 )

The name parameter is passed to the CharacterDevice constructor and automatically
registered with the device file system. The constructor also calls the enableIRQ()
function (defined in 8259.h) with the IRQ parameter. This tells the PIC to pass
the interrupt through to the processor. Finally, the parameters base port and IRQ are
stored in member variables for future use. The reading and writing of UART registers is
done via the readUART() and writeUART() methods respectively. Some additional
enumerations for the communication parameters and constants for the UART registers are
defined in this class as well. At the end of the header file, a global variable for the
COM1 serial port is created for tutorial purposes. For the device detection see the
SerialManager section of this chapter.



#include "types.h"
#include "ports.h"
#include "8259.h"    // required for enableIRQ()
#include "chardev.h" // CharacterDevice base class

class SerialPort : public CharacterDevice {

public:
  uint16 base_port; ///< The base IO port
  uint8 uart_type;  ///< Type of the connected UART
  uint8 irq_num;    ///< IRQ number

  /// \brief The baudrate for serial communication
  typedef enum _br {
    BR_9600,
    BR_14400,
    BR_19200,
    BR_38400,
    BR_55600,
    BR_115200
  } BAUD_RATE_E;
  /// \brief The parity for serial communication
  typedef enum _par {
    ODD_PARITY,
    EVEN_PARITY,
    NO_PARITY
  } PARITY_E;
  /// \brief Number of data bits used in serial communication
  typedef enum _db {
    DATA_7,
    DATA_8
  } DATA_BITS_E;
  /// \brief Number of stop bits used in serial communication
  typedef enum _sb {
    STOP_ONE,
    STOP_TWO,
    STOP_ONEANDHALF
  } STOP_BITS_E;

  static uint32 THB; ///< UART Transmitter holding buffer
  static uint32 RBR; ///< UART Receiver buffer register
  static uint32 IER; ///< UART Interrupt Enable Register
  static uint32 IIR; ///< UART Interrupt Identification Register
  static uint32 FCR; ///< UART FIFO Control Register
  static uint32 LCR; ///< UART Line Control Register
  static uint32 MCR; ///< UART Modem Control Register
  static uint32 LSR; ///< UART Line Status Register
  static uint32 MSR; ///< UART Modem Status Register

  SerialPort( char *name, uint16 baseport, uint8 irqnum ) : CharacterDevice( name ) {
    base_port = baseport;
    irq_num = irqnum;
    enableIRQ( irq_num );
  };
  ~SerialPort() { };

  /// Writes one byte to the UART register at offset reg
  void writeUART( uint32 reg, uint8 what ) {
    outportb( this->base_port + reg, what );
  };
  /// Reads one byte from the UART register at offset reg
  uint8 readUART( uint32 reg ) {
    return inportb( this->base_port + reg );
  };

  uint32 setupPort( BAUD_RATE_E baud_rate, DATA_BITS_E data_bits, STOP_BITS_E stop_bits, PARITY_E parity );

  void serviceIRQ();
  void processOutBuffer( void );
  void processInBuffer( void );
};

// Global variable
SerialPort *sp_tutorial = new SerialPort( "com1", 0x3F8, 4 );

The configuration of the UART is placed in the setupPort() method. The parameters of
this method are:

• port speed

• data bits

• stop bits

• parity

uint32 SerialPort::setupPort( BAUD_RATE_E baud_rate, DATA_BITS_E data_bits, STOP_BITS_E stop_bits, PARITY_E parity )
{
  writeUART( IER, 0x00 ); // turn off interrupts

  // divisor latch value: 115200 divided by the baud rate
  uint8 divisor = 0x0C;
  switch ( baud_rate )
  {
    case BR_14400:
      divisor = 0x08;
      break;
    case BR_19200:
      divisor = 0x06;
      break;
    case BR_38400:
      divisor = 0x03;
      break;
    case BR_55600:
      divisor = 0x02; // yields 57600 baud, the nearest standard rate
      break;
    case BR_115200:
      divisor = 0x01;
      break;
    default:
    case BR_9600:
      divisor = 0x0C;
      break;
  }

  writeUART( LCR, 0x80 );    // activate DL
  writeUART( 0, divisor );   // DL low byte
  writeUART( IER, 0x00 );    // DL high byte

  uint8 data_bit_reg = 0x03;
  switch ( data_bits )
  {
    case DATA_8:
      data_bit_reg = 0x03;
      break;
    case DATA_7:
      data_bit_reg = 0x02;
      break;
  }

  uint8 par = 0x00;
  switch ( parity )
  {
    case NO_PARITY:
      par = 0x00;
      break;
    case EVEN_PARITY:
      par = 0x18;
      break;
    case ODD_PARITY:
      par = 0x08;
      break;
  }

  uint8 stopb = 0x00;
  switch ( stop_bits )
  {
    case STOP_ONE:
      stopb = 0x00;
      break;
    case STOP_TWO:
    case STOP_ONEANDHALF:
      stopb = 0x04;
      break;
  }

  writeUART( LCR, data_bit_reg | par | stopb ); // deact DL
  writeUART( FCR, 0xC7 );
  writeUART( MCR, 0x0B );
  writeUART( IER, 0x0F ); // turn on interrupts

  return 0;
}
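The two values that the configuration routine writes, the baud rate divisor and the Line Control Register byte, follow simple rules that can be expressed directly. The sketch below assumes the usual PC UART clocking, where a divisor of 1 corresponds to 115200 baud; the names uartDivisor(), lineControl() and Parity are hypothetical and not part of SWEB.

```cpp
#include <cassert>
#include <cstdint>

// Divisor latch value for a PC UART; baud must be non-zero.
std::uint16_t uartDivisor(std::uint32_t baud)
{
  assert(baud != 0);
  return static_cast<std::uint16_t>(115200u / baud);
}

enum class Parity { None, Odd, Even };

// Composes the Line Control Register byte:
// bits 0-1 word length, bit 2 stop bits, bits 3-4 parity.
// Bit 7 (divisor latch access) stays cleared.
std::uint8_t lineControl(unsigned data_bits, bool two_stop_bits, Parity parity)
{
  assert(data_bits >= 5 && data_bits <= 8);
  std::uint8_t value = static_cast<std::uint8_t>(data_bits - 5u); // 5..8 bits -> 0..3
  if (two_stop_bits)
    value |= 0x04;
  if (parity == Parity::Odd)
    value |= 0x08;
  else if (parity == Parity::Even)
    value |= 0x18;
  return value;
}
```

The common "8N1" setting (8 data bits, no parity, one stop bit) therefore results in the byte 0x03, and 9600 baud in the divisor 0x0C.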

When data arrives over the serial port, an IRQ is raised. The class SerialPort
implements a serviceIRQ() method that simply receives the data and stores it in the
input buffer.

void SerialPort::serviceIRQ()
{
  uint8 int_id_reg = readUART( IIR );

  if ( int_id_reg & 0x01 )
    return; // it is not my IRQ or IRQ is handled

  uint8 int_id = ( int_id_reg & 0x06 ) >> 1;

  switch ( int_id )
  {
    case 0: // Modem status changed
      break;
    case 1: // Output buffer is empty
      break;
    case 2: // Data is available
      int_id = readUART( RBR );
      _in_buffer->put( int_id );
      break;
    case 3: // Line status changed
      break;
    default: // This will never be executed
      break;
  }

  return;
}
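The decoding step at the start of the handler can be isolated into a small function, which makes the meaning of the individual bits easier to see. This is a sketch with hypothetical names (UartEvent, decodeIir()); it only interprets a register value and does not touch any hardware.

```cpp
#include <cassert>
#include <cstdint>

enum class UartEvent
{
  None,             // no interrupt pending on this UART
  ModemStatus,      // modem status changed
  TransmitterEmpty, // output buffer is empty
  DataAvailable,    // received data is waiting
  LineStatus        // line status changed
};

// Decodes the Interrupt Identification Register of an 8250-style UART.
UartEvent decodeIir(std::uint8_t iir)
{
  if (iir & 0x01) // bit 0 set: no interrupt pending
    return UartEvent::None;

  switch ((iir & 0x06) >> 1) // bits 1-2 identify the interrupt source
  {
    case 0:
      return UartEvent::ModemStatus;
    case 1:
      return UartEvent::TransmitterEmpty;
    case 2:
      return UartEvent::DataAvailable;
    default: // only the value 3 is left
      return UartEvent::LineStatus;
  }
}
```

Because IRQ 4 is shared between COM 1 and COM 3, the "no interrupt pending" result is what lets a handler skip a port that did not raise the interrupt.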

After writing the interrupt handler, the next step is to register it
with the SWEB interrupt utilities. The file arch/x86/source/InterruptUtils.cpp contains
the class of the same name that manages all interrupt functions in SWEB. The
handlers for IRQ 3 and IRQ 4 are already present and only the lines for calling the
tutorial driver must be added. First of all, #include "arch_serial_tutorial.h" must be
added for access to the global variable, and after that, in the function
extern "C" void irqHandler_4(), the interrupt handler must be called with
sp_tutorial->serviceIRQ();.

At this point the character device driver can be tested. In the testing thread add a
variable char ch; and an endless loop that calls sp_tutorial->readData(0, 1, &ch);
followed by kprintfd( "%c", ch );. The driver is now capable of receiving data from the
serial port. Written data, however, still only ends up in the output buffer and is not
actually written to the UART. The method processOutBuffer() implements this by getting
the characters from the output buffer and writing them to the serial port. The method
processInBuffer() must be overridden as well, since the default behavior is to delete
the characters from the buffer.

void SerialPort::processOutBuffer( void )
{
  if ( !_out_buffer )
    return;
  uint32 jiffies = 0;

  // wait until the transmitter is empty (LSR bit 6), give up after 50000 polls
  while ( !( readUART( LSR ) & 0x40 ) )
  {
    if ( ++jiffies >= 50000 ) // TIMEOUT
      return;
  }
  writeUART( THB, _out_buffer->get() );
}

void SerialPort::processInBuffer( void )
{
  return;
}

Additionally, the serial port thread must be registered with the scheduler. This
ensures that the methods actually get called. The thread should be registered somewhere
at the beginning of the testing thread with: Scheduler::instance()->addNewThread(
sp_tutorial );. The basic serial port support is now complete. The only thing left to do
is to test this driver by extending the test routine to do a little bit of writing. The
test routine could also be extended to include writing and reading through the VFS.
The actual implementation of the serial port driver in SWEB differs slightly from the
one in this tutorial, but the difference is only structural; the functionality remains
the same.

7.6 SerialManager 

The SerialManager class implements the collection of SerialPort instances as well as
the doDetection method used to populate this collection. The driver instances are
stored in a simple array. This type of storage wastes some resources, but enables
fast and simple access. Considering that the search through the collection
happens in the interrupt handler, speed is the much more important factor.




7.7 Terminal 

7.7.1 Introduction 

Terminals are the basic input/output units of any computer. In the early days,
computer terminals consisted of printers as the output devices coupled with card
readers as the input devices. These early terminals were slow, but sufficient for the
slow computers of that time. With the advances in computer technology, computers
became faster and more interactivity was required from the terminal. So
the printers were replaced with video screens and the card readers with keyboards. This
combination is today widely known as the computer terminal. UNIX and its
derivatives like Linux have introduced the concept of virtual terminals. These are
device independent input-output primitives that implement common terminal functions
such as clearing the screen, scrolling, writing to a specific position etc. In this
way, there is an additional level of abstraction before the information reaches the
actual I/O device. In SWEB this device driver is implemented in the Console class.

7.7.2 Escape codes 

Many computer terminals and terminal emulators support color and cursor control
through a system of escape sequences. One such standard is commonly referred to as
ANSI Color. Several terminal specifications are based on the ANSI color standard,
including VT100. The VT100 terminals use the escape character followed by "[" as the
control sequence introducer. For the entire list of all possible ANSI terminal escape
codes you can consult the Internet. In SWEB there is currently no escape sequence
interpreter (no terminal emulation).
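To make the format of such sequences concrete, the following sketch builds the colour-setting sequences an ANSI terminal understands: the escape character (0x1B), "[", a numeric code and the final letter "m". The helper names sgr() and coloured() are hypothetical; since SWEB has no escape sequence interpreter, this only illustrates what a terminal emulator would have to parse.

```cpp
#include <cassert>
#include <string>

// Builds an ANSI "Select Graphic Rendition" sequence: ESC [ <code> m.
std::string sgr(int code)
{
  return "\x1b[" + std::to_string(code) + "m";
}

// Wraps text so that it is shown in the given foreground colour
// (codes 30..37) and the attributes are reset (code 0) afterwards.
std::string coloured(const std::string &text, int fg_code)
{
  return sgr(fg_code) + text + sgr(0);
}
```

Code 31 selects a red foreground, so coloured("error", 31) produces the bytes ESC [ 3 1 m, the word itself, and ESC [ 0 m.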

7.7.3 Implementation in SWEB 

The virtual terminal concept is implemented in SWEB as well. The SWEB terminal is a
virtual device that has input-output capabilities, a character buffer and the
aforementioned common functions. Every terminal can be attached to a console driver
that does the actual input-output. This console can be a serial port, a printer and a
card reader, or the classic keyboard and monitor combination. For the complete list of
implemented consoles, see the next section of this chapter. The definition of the
terminal class can be found in common/include/console/Terminal.h and its
implementation in common/source/console/Terminal.cpp.

Function:
    Terminal( char *name, Console *console, uint32 num_columns, uint32 num_rows );
Description:
    Constructor; creates the terminal with the given parameters. Name is the name
    under which the terminal will be registered with the Virtual File System;
    console is the parent console which provides the output capabilities to this
    terminal; rows and columns give the maximum coordinates of the output terminal
    buffer.

Function:
    ~Terminal();
Description:
    Standard destructor. Simply deallocates the buffers.

Table 7.9: Terminal class base

When a terminal is created, SWEB automatically associates two streams with it. The
first is the input stream, which reads characters from the keyboard and stores them in
the internal buffer. The second is the output stream, which writes the characters
to the parent console. The functionality of the input stream is implemented by the
following functions:

Function:
    uint32 read();
Description:
    Returns a single character from the input buffer.

Function:
    uint32 readLine( char *buffer, uint32 size_of_buffer );
Description:
    Reads a stream of characters into the preallocated buffer of the given size.
    The characters are read until a carriage return or line feed arrives or until
    the preallocated buffer is full. The number of characters actually read is
    returned. This function blocks the calling thread by effectively sending it to
    sleep until a character arrives from the keyboard.

Function:
    uint32 readLineNoBlock( char *buffer, uint32 size_of_buffer );
Description:
    The non-blocking version of the above function. It only reads the characters
    from the internal buffer and returns when the buffer is empty, without waiting
    for keys from the keyboard.

Function:
    void clearBuffer();
Description:
    Clears the input buffer. This function is automatically called when the
    terminal is opened by the VFS.

Function:
    void putInBuffer( uint32 key );
Description:
    Places the character given by the parameter in the input buffer. Usable for
    implementing pipes. This function is called by the keyboard driver (precisely,
    the parent console) to place the incoming keys in the active terminal buffer.

Function:
    void backspace( void );
Description:
    This function deletes the last key from the keyboard buffer and also deletes
    the last displayed character from the screen. It should only be called by the
    keyboard driver as a response to the backspace key event.

Table 7.10: Terminal input class methods
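The termination rules of the line-reading functions can be illustrated with a non-blocking sketch over a plain queue. The function readLineSketch() and the use of std::deque as the input buffer are assumptions for this example, as is the choice not to store the terminating carriage return or line feed; the text above does not state how SWEB treats the terminator.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Hypothetical non-blocking line reader over an input queue of key codes.
// Copies characters into buffer until a carriage return or line feed is
// consumed, the queue runs empty, or the buffer is full.
// Returns the number of characters stored (the terminator is not stored).
std::uint32_t readLineSketch(std::deque<std::uint32_t> &input,
                             char *buffer, std::uint32_t size_of_buffer)
{
  std::uint32_t count = 0;
  while (count < size_of_buffer && !input.empty())
  {
    const std::uint32_t key = input.front();
    input.pop_front();
    if (key == '\r' || key == '\n')
      break; // end of line reached
    buffer[count++] = static_cast<char>(key);
  }
  return count;
}
```

A blocking variant would differ only in the empty-queue case: instead of returning, it would put the calling thread to sleep until the next key arrives.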

The terminal output stream is a rectangular grid arranged in rows and columns where
each grid intersection, or character cell, can contain a character. Each character has
its own foreground color and each character cell has its own background color. The
functions manipulating the terminal output buffer are synchronised, meaning that
multiple threads cannot modify the output buffer at the same time. The terminal output
buffer must not be modified directly, but only through the exposed output functions.


Function:
    void write( char character );
Description:
    Writes a single character at the current cursor position with the current
    colour and paints the cell with the current background colour. The cursor is
    moved to the next position and the terminal is scrolled if the row has been
    filled.

Function:
    void writeString( char const *string );
Description:
    Writes a null terminated string to the terminal.

Function:
    void writeBuffer( char const *buffer, size_t len );
Description:
    Like above, only the buffer does not have to be null terminated; instead, an
    additional parameter giving the size of the buffer is passed to the function.

Function:
    virtual int32 writeData( int32 offset, int32 size, const char *buffer );
Description:
    The Inode interface wrapper function. The offset parameter must be zero or
    nothing will be written.

Function:
    void setForegroundColor( color );
    void setBackgroundColor( color );
Description:
    The functions for setting the parameters which will be used as the next
    character colour and the cell background colour.

Table 7.11: Terminal output class methods

The terminal class should be extended in a way to enable different keyboard mappings. 

The keyboard driver is currently in charge of converting raw scancodes, but that is 

really a job for the terminals. 

7.8 Console 

The console is a class that provides the output interface between the actual hardware
devices and the terminals. Every console has a common set of features that must be
implemented in order to satisfy the requirements of the terminal interface. All text
written to the console is written to an area owned by the console called the console
window. The console window is an attribute of the console, and is organized in a
similar manner to the terminal's output buffer. The console window is never larger
than the terminal buffer, and can be moved to view different
areas of the underlying terminal buffer. If the terminal buffer is larger than the
console window, the console should provide some way to scroll through the rest of the
text. A cursor indicates the screen buffer position where text is currently read or
written. The origin for character cell coordinates in the screen buffer is the upper
left corner, and the position of the cursor and the console window are measured
relative to that origin. Use zero-based indexes to specify positions; that is, specify
the topmost row as row 0, and the leftmost column as column 0. The maximum value for
the row and column indexes depends on the hardware in question.

The Console in SWEB also serves as an extension to the keyboard driver. Every console
represents a kernel Thread object. In its Run method the console polls the keyboard
driver. The special function keys (like terminal switch or scrolling) are handled
directly by the console and the normal letter (or number) keys are dispatched to the
active terminal.

The following output functions are defined in the Console base class : 

Function:
    virtual void consoleClearScreen();
Description:
    Clears the screen. All characters are deleted and the background color of all
    cells is set to black.

Function:
    virtual uint32 consoleSetCharacter( uint32 const &row, uint32 const &column,
                                        uint8 const &character, uint8 const &state );
Description:
    Writes the character to the specified coordinates with the specified attribute.

Function:
    virtual uint32 consoleGetNumRows() const;
    virtual uint32 consoleGetNumColumns() const;
Description:
    Returns the dimensions of the console.

Function:
    virtual void consoleScrollUp();
Description:
    When writing to the console, the cursor moves to the beginning of the next row
    when it reaches the end of the current row. This function is called when the
    cursor advances beyond the last row. It scrolls up the rows so that the last
    row is free for writing.

Function:
    virtual void consoleSetForegroundColor( FOREGROUNDCOLORS const &color );
    virtual void consoleSetBackgroundColor( BACKGROUNDCOLORS const &color );
Description:
    Sets the text attributes to be used for the next write to the console.

Table 7.12: Console output functions

7.8.1 Text Console 

The text console represents the simple text mode, which can display 80 columns by 25 lines of high resolution text characters with 16 colors. The display data is held in a 4 kilobyte video memory buffer located at address 0xB8000 (or at 0xC00B8000 once paging is set up). Two bytes of the buffer are used for each character: the first byte holds the character itself and the second byte holds the attributes of the character (color, blinking etc.). When creating a text console, a parameter can be passed to the constructor to set the number of terminals created for the console. The text console implements all of the output functions in the table above.
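The two-bytes-per-character layout can be sketched as follows. This is only an illustration: the function names are hypothetical and not part of SWEB, and the attribute layout (bits 0-3 foreground, bits 4-6 background, bit 7 blink) is the usual VGA convention, which the text above does not spell out.

```cpp
#include <cstdint>

// Text-mode geometry as described above: 80 columns, 2 bytes per cell.
constexpr uint32_t kTextColumns = 80;

// Byte offset of the cell (row, column) from the start of the video buffer.
constexpr uint32_t cellOffset(uint32_t row, uint32_t column)
{
  return (row * kTextColumns + column) * 2;
}

// Attribute byte in the usual VGA layout (an assumption, see the lead-in):
// bits 0-3 foreground color, bits 4-6 background color, bit 7 blink.
constexpr uint8_t makeAttribute(uint8_t foreground, uint8_t background, bool blink)
{
  return static_cast<uint8_t>((foreground & 0x0F) | ((background & 0x07) << 4) |
                              (blink ? 0x80 : 0x00));
}

// Writes one cell into a buffer that stands in for the memory at 0xB8000.
inline void putCell(uint8_t* video, uint32_t row, uint32_t column,
                    uint8_t character, uint8_t attribute)
{
  video[cellOffset(row, column)] = character;      // first byte: the character
  video[cellOffset(row, column) + 1] = attribute;  // second byte: its attributes
}
```

The last cell of the screen (row 24, column 79) starts at byte 3998, so the whole screen needs 4000 bytes, which fits into the 4 kilobyte buffer.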

7.8.2 Frame Buffer Console 

The Frame Buffer Console (FBC) is a console that uses graphical modes to emulate text-mode consoles. The normal VGA text consoles are limited to 512 characters, so the Frame Buffer Console is the only way to support Unicode characters. For each pixel 16 bits are reserved in the frame buffer. The size of the buffer as well as the number of rows and columns available depend on the resolution used. SWEB currently implements the 8x16 and sun8x16 fonts for the Frame Buffer Console. The resolution of the console can be chosen by setting the vbematch parameter in the GRUB boot loader.
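How the resolution determines the console dimensions can be sketched as below. The helper names are hypothetical, and the sketch assumes a linear frame buffer without padding at the end of a pixel row; SWEB's own implementation may differ.

```cpp
#include <cstdint>

// Glyph size of the 8x16 fonts and the pixel size mentioned above.
constexpr uint32_t kGlyphWidth = 8;
constexpr uint32_t kGlyphHeight = 16;
constexpr uint32_t kBytesPerPixel = 2;  // 16 bits per pixel

// Number of text columns and rows that fit into the given resolution.
constexpr uint32_t textColumns(uint32_t x_res) { return x_res / kGlyphWidth; }
constexpr uint32_t textRows(uint32_t y_res) { return y_res / kGlyphHeight; }

// Size of the frame buffer in bytes (assumes no padding per pixel row).
constexpr uint32_t frameBufferBytes(uint32_t x_res, uint32_t y_res)
{
  return x_res * y_res * kBytesPerPixel;
}

// Byte offset of the top-left pixel of the character cell (row, column).
constexpr uint32_t glyphOrigin(uint32_t x_res, uint32_t row, uint32_t column)
{
  return (row * kGlyphHeight * x_res + column * kGlyphWidth) * kBytesPerPixel;
}
```

At 1024x768, for example, this gives 128 columns by 48 rows and a frame buffer of 1572864 bytes.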


Chapter 8 

Block Devices 


bla bla bla 

8.1.1 what are block devices 

A block device stores data in blocks of a fixed size. Common block sizes are multiples of 512 bytes, up to 32768 bytes. Each of those blocks can be accessed independently by using some form of unique id, so seek operations are possible on block devices.

The commonly known opposite of block devices are character devices. They cannot be accessed directly or independently, and seek operations are not possible. Keep in mind that not all devices fit into this classification scheme. Take the system clock, for example: it has no blocks that could be addressed separately, and it does not generate or accept any character streams.

The most commonly known block devices are disks, floppies, CD-ROMs, tapes and many others. Disks, for instance, usually have 512-byte blocks. For your convenience it would be a lot easier to access each and every byte on the whole hard drive individually, but there are some reasons why this is definitely not the best way to do it. Let us assume that we want to access one byte on the hard drive. One reason is that the index of every single byte would have to be transferred to the hard drive / controller, which would lead to a severe number of disk operations when you want to read a larger amount of data. Another reason is that you usually read larger amounts of data from disks anyway.
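Since only whole blocks can be transferred, a byte position has to be translated into a block number plus an offset inside that block. A minimal sketch of this arithmetic follows; the names are hypothetical and not taken from SWEB.

```cpp
#include <cstdint>

// Position of a byte on a block device: the block it lives in and its
// offset inside that block.
struct BlockPosition
{
  uint64_t block;
  uint32_t offset;
};

constexpr BlockPosition locateByte(uint64_t byte_index, uint32_t block_size)
{
  return BlockPosition{byte_index / block_size,
                       static_cast<uint32_t>(byte_index % block_size)};
}

// Number of blocks that have to be read to get `length` bytes starting at
// `byte_index`. Even a single byte costs one whole block.
constexpr uint64_t blocksSpanned(uint64_t byte_index, uint64_t length, uint32_t block_size)
{
  return length == 0
             ? 0
             : (byte_index + length - 1) / block_size - byte_index / block_size + 1;
}
```

With 512-byte blocks, byte 1000 lies in block 1 at offset 488, and reading 100 bytes from there touches two blocks.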

8.1.2 hard drive layout 

Hard disks still consist of several platters mounted on a spindle that rotates very fast. Disk read/write heads float above each surface of each platter.

These platters are divided into tracks and the tracks are divided into sectors. If every track had the same number of sectors, the sectors on a track in the middle of the platter would be much smaller than those on the edge of the platter, so you would be wasting space on the outside of the platter.

Therefore most hard drives divide the ’longer’ outer tracks into more sectors than the ’shorter’ inner tracks. But how would you know how many sectors a track consists of when addressing a sector? Actually you don’t know. The hard drive itself simulates a constant number of sectors per track.



8.1.3 how can blocks be accessed (chs, lba etc.) 

http://www.toolsthatwork.com/bbdocs/diskstructures.htm 

A long time ago disks were accessed using cylinder-head-sector (CHS) notation, which is based on the physical layout of the hard drive. Here are the basics of the notation: a cylinder is the set of tracks at the same position on every platter of the platter stack, a head is one physical head on one side of a platter, and a sector is one part of a track. The CHS number range is really limited since only 10/8/6 bits are available for addressing cylinders, heads and sectors. Therefore the hard drive limit is around 8 GB (1024*255*63*512 Byte). Note that cylinder and head values range from 0 to the maximum, while sector values range from 1 to the maximum. Later on these values became virtual, emulated by the hard drive itself in order to use more space, for example by simulating 16 heads with just 4 physically installed. CHS is still found in some parts of the operating system and in the partition table on the disk. Therefore you still have to use CHS.

// LBA:  logical block address of the block
// CYL:  value of the cylinder CHS coordinate
// HPC:  number of heads per cylinder for the disk
// HEAD: value of the head CHS coordinate
// SPT:  number of sectors per track for the disk
// SECT: value of the sector CHS coordinate (sectors are numbered from 1)
// TEMP: buffer to hold a temporary value
// This equation is used very often by operating systems such as DOS
// (or SWEB) to calculate the CHS values it needs to send to the disk
// controller or INT13h in order to read or write data.
uint32 LBA = start_sector;
uint32 cyls = LBA / (HPC * SPT);
uint32 TEMP = LBA % (HPC * SPT);
uint32 head = TEMP / SPT;
uint32 sect = TEMP % SPT + 1;
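The same equations can be wrapped into a pair of functions to check that the conversion is reversible. This is a self-contained sketch with hypothetical names; it uses the standard uint32_t type in place of SWEB's uint32.

```cpp
#include <cstdint>

// A CHS coordinate. Cylinders and heads count from 0, sectors from 1.
struct Chs
{
  uint32_t cylinder;
  uint32_t head;
  uint32_t sector;
};

// hpc: heads per cylinder, spt: sectors per track (both as reported by the disk).
constexpr Chs lbaToChs(uint32_t lba, uint32_t hpc, uint32_t spt)
{
  return Chs{lba / (hpc * spt),          // full cylinders before the block
             (lba % (hpc * spt)) / spt,  // full tracks inside the cylinder
             (lba % (hpc * spt)) % spt + 1};
}

// Inverse direction: CHS coordinate back to the logical block address.
constexpr uint32_t chsToLba(Chs chs, uint32_t hpc, uint32_t spt)
{
  return (chs.cylinder * hpc + chs.head) * spt + (chs.sector - 1);
}
```

For a disk reporting 16 heads and 63 sectors per track, block 5000 maps to cylinder 4, head 15, sector 24, and converting that coordinate back yields 5000 again.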

Nowadays nearly everything is done by LBA (logical block addressing). This is a very simple addressing scheme that counts blocks from 0 to the maximum number of blocks. Translation, where necessary (since old hard drives do not support LBA), is done by the OS. LBA is limited to 28 or 48 bit block numbers, resulting in addressable disk sizes of 128 GiB and 128 PiB respectively, assuming the common block size of 512 bytes. This seems to be the only usable way to handle large hard drives. (If you have 3 minutes left, visit
http://www.hitachigst.com/hdd/research/recording_head/pr/PerpendicularAnimation.html
for a nice animation on upcoming hard drive storage technology.)

http://en.wikipedia.org/wiki/Hard_disk

8.2 Implementation 

The basic concepts of block devices are implemented in SWEB, and a sample ATA/IDE driver is already registered and usable. The main "mastermind" class of all the block device code is BDManager, found in arch_bd_manager.h/.cpp. This class is implemented as a singleton so that it can manage all requests on block devices from kernel space. This global object holds a list of virtual devices (BDVirtualDevice). The BDManager can therefore be seen as a global device list that handles the initialisation and management of virtual devices. Virtual devices are objects that each represent a block device or a special sub-device (partition etc.); they hold the main information on the block device and handle requests on it. The device driver itself has to be registered in the BDManager and creates the virtual devices.

8.2.1 How to implement new low-level block device drivers 

• Derive from BDDriver and implement all functions. You may want to derive from bdio too, to have basic functions for port I/O available.

• Add your driver to the list found in BDManager::doDeviceDetection and do your device detection in blocking mode in this section.

• Build and initialize virtual devices and register them with BDManager::addVirtualDevice( BDVirtualDevice* dev )

• Wait for requests...
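The overall shape of such a driver can be sketched as follows. Note that this is a stand-alone illustration: BlockDriverSketch and its functions are hypothetical stand-ins, the real BDDriver in SWEB may declare a different set of functions, and the memory-backed "device" only serves to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the driver base class (not SWEB's BDDriver).
class BlockDriverSketch
{
public:
  virtual ~BlockDriverSketch() {}
  virtual uint32_t getBlockSize() const = 0;
  virtual uint32_t getNumBlocks() const = 0;
  virtual bool readBlocks(uint32_t start, uint32_t count, void* buffer) = 0;
  virtual bool writeBlocks(uint32_t start, uint32_t count, const void* buffer) = 0;
};

// A driver whose "device" is ordinary memory.
class MemoryBlockDriver : public BlockDriverSketch
{
public:
  MemoryBlockDriver(uint32_t block_size, uint32_t num_blocks)
      : block_size_(block_size), num_blocks_(num_blocks),
        storage_(static_cast<std::size_t>(block_size) * num_blocks, 0)
  {
    assert(block_size > 0);
  }

  uint32_t getBlockSize() const override { return block_size_; }
  uint32_t getNumBlocks() const override { return num_blocks_; }

  bool readBlocks(uint32_t start, uint32_t count, void* buffer) override
  {
    if (!inBounds(start, count)) return false;
    std::memcpy(buffer, storage_.data() + toBytes(start), toBytes(count));
    return true;
  }

  bool writeBlocks(uint32_t start, uint32_t count, const void* buffer) override
  {
    if (!inBounds(start, count)) return false;
    std::memcpy(storage_.data() + toBytes(start), buffer, toBytes(count));
    return true;
  }

private:
  // Rejects empty requests and requests that leave the device.
  bool inBounds(uint32_t start, uint32_t count) const
  {
    return count > 0 && start < num_blocks_ && count <= num_blocks_ - start;
  }

  std::size_t toBytes(uint32_t blocks) const
  {
    return static_cast<std::size_t>(blocks) * block_size_;
  }

  uint32_t block_size_;
  uint32_t num_blocks_;
  std::vector<unsigned char> storage_;
};
```

A real driver would talk to hardware in readBlocks and writeBlocks and would additionally have to be added to the detection list and register its virtual devices, as described in the steps above.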

8.2.2 How to use block devices 

In short:

• include arch_bd_manager.h

• get the instance of BDManager

• perhaps it’s a good idea to check the device numbering [uint32 BDManager->getNumberOfDevices()]

• get "your" device [my_device = BDManager->getDeviceByNumber(uint32 dev_num)]

• do checks! (is my buffer large enough?, correct id?, is start_block + num_block in bounds?, ...)

• build a new instance of BDRequest (just to be on the safe side)

• hand over the request to "your" device [my_device->addRequest(BDRequest * command)]

• check whether an error occurred or the request is still in progress [BDRequest->getStatus()]

• don’t forget to delete the BDRequest and your buffer after using them.

Some hints on using block devices:

• NEVER swap any part of the block device architecture. Just imagine you swap your disk driver. How can you get it back to RAM?

• Think of allocating RAM while swapping.

• Sometimes the disk driver needs some RAM to...

• Keep in mind that a disk driver might need IRQs for communication with the device.

• Be sure that the initialisation of (slow) block devices is finished before accessing any of them.

Example 

//Actually we know the virtual device number we want to use: "3" (swap)
//We figured this out earlier.
uint32 dev_nr = 3;

//What is the block size on this device?
//Well, let's have a look.
BDRequest * bd_bs = new BDRequest(dev_nr, BDRequest::BD_GET_BLK_SIZE);
BDManager::getInstance()->getDeviceByNumber(dev_nr)->addRequest ( bd_bs );
//this is a blocking request. It just references data already in memory.
if (bd_bs->getStatus() != BDRequest::BD_DONE)
{
  // something went wrong... EXIT!
}
uint32 block_size = bd_bs->getResult();

//We have to check how many blocks are available on this device:
BDRequest * bd_bc = new BDRequest(dev_nr, BDRequest::BD_GET_NUM_BLOCKS);
BDManager::getInstance()->getDeviceByNumber(dev_nr)->addRequest ( bd_bc );
//this is a blocking request. It just references data already in memory.
if (bd_bc->getStatus() != BDRequest::BD_DONE)
{
  // something went wrong... EXIT!
}
uint32 block_count = bd_bc->getResult();

//We assume that we want to read only one block (number 234)
uint32 blocks2read = 1;
uint32 offset = 233; // since numbering starts with 0
if (offset + blocks2read > block_count)
{
  //the request is out of bounds: do this little check!
}

//allocate some buffer or point somewhere in memory.
char *my_buffer = (char *) kmalloc( blocks2read*block_size*sizeof(uint8) );
//build the command
BDRequest * bd = new BDRequest(dev_nr, BDRequest::BD_READ, offset, blocks2read, my_buffer);
//and send it.
BDManager::getInstance()->getDeviceByNumber(dev_nr)->addRequest ( bd );

uint32 jiffies = 0;
//actually we don't know if this request is blocking or not. Just to be
//on the safe side, check if the output is valid by now.
while( bd->getStatus() == BDRequest::BD_QUEUED && jiffies++ < 50000 );

if( bd->getStatus() != BDRequest::BD_DONE )
{
  //We should definitely have a closer look at the status by now.
  //It may happen that the request is still in the queue or had an error.
}

//By now the requested data should be in my buffer.
//Print the buffer...
for(uint32 i = 0; i < blocks2read*block_size; i++ )
  kprintfd( "%2X%c", *(my_buffer+i), (i+1)%8 ? ' ' : '\n' );

//don't forget to clean up!
kfree( my_buffer ); delete bd; delete bd_bs; delete bd_bc;
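The "do checks!" step deserves a closer look: a plain comparison such as offset + blocks2read > block_count can wrap around for very large offsets and then accept an invalid request. The following sketch shows one overflow-safe way to write the checks; the helper names are hypothetical and not part of SWEB.

```cpp
#include <cstdint>

// True if the blocks [start_block, start_block + num_blocks) all exist on a
// device with block_count blocks. Written without an addition so that it
// cannot wrap around.
constexpr bool blockRangeValid(uint32_t start_block, uint32_t num_blocks, uint32_t block_count)
{
  return num_blocks > 0 && start_block < block_count &&
         num_blocks <= block_count - start_block;
}

// True if a buffer of buffer_bytes bytes can hold num_blocks blocks.
// The product is computed in 64 bits to avoid overflow.
constexpr bool bufferLargeEnough(uint64_t buffer_bytes, uint32_t num_blocks, uint32_t block_size)
{
  return buffer_bytes >= static_cast<uint64_t>(num_blocks) * block_size;
}
```

With 32-bit arithmetic, a start block of 0xFFFFFFFF plus two blocks wraps around to 1 and would slip through the simple comparison, while blockRangeValid rejects it.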

8.2.3 VirtualDevices 

These objects represent physical devices, or partitions on physical devices as soon as a valid partition table is found. You can use the device numbers or the device names to find the device you are looking for. The numbers start with 0 and are assigned in the order of device detection/registration. In our example:

• The whole first ATA drive (0, "idea")

• The first partition on the first ATA drive (1 = minix, kernel files, "idea0")

• The second partition on the first ATA drive (2 = minix, user programs, "idea1")

• The third partition on the first ATA drive (3 = swap, not formatted, "idea2")

For example, the first IDE drive in your system will get the first number, the first partition on the first drive the second, the second partition on the first drive the third, the second drive the fourth, the first partition on the second drive the fifth, ...

8.2.4 BDRequest 

This is the way to send a command to any block device. Create an instance with:

BDRequest(dev_id, cmd, start_block = 0, num_block = 0, buffer = 0)

dev_id: (uint32) the device ID you want to talk to

cmd: (BD_CMD) what should the device do?

    basic commands:
        BD_READ
        BD_WRITE
        BD_GET_BLK_SIZE
        BD_GET_NUM_BLOCKS

    some extensions:
        BD_REINIT
        BD_DEINIT
        BD_SEND_RAW_CMD

start_block: (uint32) where should the operation start?

num_block: (uint32) how many blocks?

buffer: (void *) some buffer to read / write data. This buffer is neither allocated nor checked anywhere in the block device framework. It is simply written or read without taking care of anything.

getStatus() returns:

    BD_QUEUED
    BD_DONE
    BD_ERROR

Please keep in mind that other commands or status codes might exist and will be used by device drivers not implemented yet.

getResult() returns the requested metadata (e.g. the number of blocks on this device).

Some other functions do exist, but you should really know what you are doing before using them. Really bad sync problems can arise! For demonstration, test and initialisation purposes (reading the partition table etc.) two functions have been implemented: BDVirtualDevice::readData and BDVirtualDevice::writeData. You may want to have a look at them as a reference.




Chapter 9 

VFS, FS 


This document describes how the VFS (virtual file system) maintains the files in the file systems that it supports.

The files in a file system are collections of data. A file system holds not only the data that is contained within its files but also the structure of the file system. It holds all of the information that users and processes see as files, directories, soft links, file protection information and so on.

9.2 File system 

Each VFS object is stored by a suitable data structure, which includes both the object 

attributes and the object methods. The VFS may dynamically modify the methods of 

the object and, hence, it may install specialized behavior for the object. The following 

sections explain the VFS objects and their interrelationships in detail. 

The basic objects known to the FS layer are files, file systems, inodes, and names for 

inodes. 

9.2.1 Files 

Files are things that can be read from or written to. They can also be mapped into memory, and sometimes a list of file names can be read from them. [3] They map very closely to the file descriptor. Files are represented by the class File.

9.2.2 Inodes 

An inode represents a basic object within a filesystem. It can be a regular file, a directory, a symbolic link, or a few other things. [3] The VFS does not make a strong distinction between different sorts of objects, but leaves it to the actual file-system implementation to provide appropriate behaviours. Each inode is represented by the class Inode.

It may seem that Files and Inodes are very similar. They are, but there are some important differences. One thing to note is that there are some things that have inodes but never have files. A good example of this is a directory. Also, a file has state information that an inode does not have, particularly a position, which indicates where in the file the next read or write will be performed.

9.2.3 Filesystems 

A filesystem is a collection of inodes with one distinguished inode known as the ROOT. Other inodes are accessed by starting at the root and looking up a file name to get to another inode. A filesystem has a number of characteristics which apply uniformly to all inodes within the filesystem. Some of these are flags, such as the READ-ONLY flag. Each filesystem is represented by the class Superblock. Each file points to the superblock of the filesystem on which it is open, as well as to the corresponding inode.

There is a strong correlation between superblocks (and hence filesystems) and device numbers. Each filesystem must (appear to) have a unique device on which the filesystem resides. Some filesystems (such as ramfs) are marked as not needing a real device. As well as knowing about filesystems, the VFS knows about different filesystem types. Each type of filesystem is represented by the class FileSystemType.

9.2.4 Names 

All inodes within a filesystem are accessed by name. The VFS uses a name-to-inode lookup process to find a particular inode; this may be expensive for some filesystems, but it is easy to implement. The structure of a filesystem is a tree. Each node in the tree corresponds to an inode in a given directory with a given name, and is represented by the class Dentry. The dentries act as an intermediary between Files and Inodes. Each file points to the dentry that it has open. Each dentry points to the inode that it references.
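The chain File, Dentry, Inode and the reason why the position lives in the file rather than in the inode can be sketched as follows. The structs are strongly simplified stand-ins with hypothetical names, not the actual SWEB classes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Simplified stand-ins for the classes Inode, Dentry and File.
struct InodeSketch
{
  uint32_t size;  // size of the file content in bytes
};

struct DentrySketch
{
  std::string name;
  InodeSketch* inode;  // each dentry points to the inode it references
};

struct FileSketch
{
  DentrySketch* dentry;  // each file points to the dentry it has open
  uint32_t position;     // per-file state that the inode does not have

  InodeSketch* inode() const { return dentry->inode; }
};

// Moves the position forward as a read or write would do, clamped to the
// file size. Returns the number of bytes actually advanced.
inline uint32_t advance(FileSketch& file, uint32_t bytes)
{
  assert(file.position <= file.inode()->size);
  const uint32_t remaining = file.inode()->size - file.position;
  const uint32_t step = bytes < remaining ? bytes : remaining;
  file.position += step;
  return step;
}
```

Two FileSketch objects that share one dentry reach the same inode but keep independent positions, which is exactly the state information mentioned in the section on inodes.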

9.3 VFS data structures 

9.3.1 Superblock Objects

The superblock object stores information concerning a mounted filesystem. For disk-based filesystems, this object usually corresponds to a filesystem control block stored on disk. A superblock object consists of a Superblock class whose fields are described in Table 1.

Type                  Field           Description
class FilesystemType  s_type_         Pointer to the filesystem-type
uint32                s_dev_          Device identifier
uint64                s_magic_        Filesystem magic number
uint32                s_flags_        Mount flags
class Dentry          s_root_         Dentry object of the ROOT-directory from the filesystem
class Dentry          mounted_over_   Dentry object of the mounted point from the common filesystem
class PointList       dirty_inodes_   the modified (dirty) inode-list
class PointList       used_inodes_    the used inode-list
class PointList       all_inodes_     the all inode-list
class PointList       s_files_        the file-descriptor-list

Table 1: The fields of the superblock object

The type of the filesystem is represented by s_type_. s_dev_ and s_magic_ are used to identify the block device (e.g. ext2 uses a block device to store its data). s_flags_ holds the permissions (writable/readable) of the filesystem (e.g. MS_RDONLY: reading is permitted, but no modification, not even an indirect one). s_root_ is the ROOT-directory of the filesystem; every filesystem has its own ROOT-directory. The mount command attaches this ROOT to the mount point (mounted_over_) in the common filesystem (the tree of the VFS). The superblock also holds the lists dirty_inodes_ (the modified inodes), used_inodes_ (only the inodes corresponding to opened files), all_inodes_ and the file descriptor list (s_files_), which stores the pointers to the opened files.

The operations are defined in the class. Let’s briefly describe the important superblock operations:

createInode(dentry, type): Creates a new inode object of the given inode type in the superblock and connects it to the dentry passed as parameter.

createFd(inode, flag): Creates a file with the given flag for the given inode. It returns a file descriptor number on success.

removeFd(inode, fd): Removes the corresponding file descriptor from the given inode.

readInode(inode): Fills the fields of the inode object whose address is passed as the parameter with the data on disk.

write_inode(inode): Updates a filesystem inode with the contents of the inode object; the inode object (parameter) identifies the filesystem inode on the disk.

delete_inode(inode): Removes the data blocks containing the file and the VFS inode.

put_inode(inode): Releases the inode object whose address is passed as the parameter. Releasing an object does not necessarily mean freeing memory, since other processes may still use that object.

The fields corresponding to unimplemented methods are set to NULL.

9.3.2 Inode objects 

All information needed by the filesystem to handle a file is included in a data structure called an inode. [18] A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. For disk-based filesystems, this object usually corresponds to a file control block stored on disk. An inode object consists of an Inode class whose fields are described in Table 2.


Type              Field            Description
class Dentry      i_dentry_        Pointer to the dentry class
class PointList   i_dentry_link_   the dentry-list from the inode (only used for symbolic link)
class PointList   i_files_         the opened file objects from the inode
uint32            i_nlink_         the number of the linked file of the inode
class Superblock  i_superblock_    pointer to the superblock object
uint32            i_size_          the size of the file
uint32            i_type_          the type of the inode
uint32            i_state_         inode state flags

Table 2: The fields of the inode object

Each inode object always appears in one of the following circular doubly linked lists (PointList):

• The list of valid unused inodes, typically those mirroring valid disk inodes and not currently used by any process. The elements are referenced by the unused_inodes_ variable of the superblock class.

• The list of in-use inodes, typically those mirroring valid disk inodes and used by some process. These inodes are not dirty. The elements are referenced by the used_inodes_ variable of the superblock class.

• The list of dirty inodes. The elements are referenced by the dirty_inodes_ field of the corresponding superblock object.

The inode object is connected to a directory entry (i_dentry_). If the inode is a file inode, it may have many file objects: all opened files of this inode are stored in i_files_, so the same file can be opened repeatedly. The member variable i_type_ determines the type of the inode, a file inode or a directory inode. The inode operations are:

Inode(superblock, type): Constructs a new inode. The inode must know its superblock and its mode (which decides whether it is a file or a directory).

[D]create(dentry): Creates a new disk inode for a regular file associated with a dentry object in some directory.

lookup(name): Searches a directory for an inode corresponding to the filename included in a dentry object.

[F]link(flag): Creates a file object with the given flag and links it to the inode.

[F]unlink(file): Unlinks the file object from the inode and removes the file object.

[D]mkdir(dentry): Creates a new inode for a directory associated with a dentry object in some directory.

rm(): Removes the dentry.

[D]rmdir(): Removes the directory whose name is included in the corresponding dentry.

[D]symlink(dentry): Creates a new inode for a symbolic link associated with a dentry object in some directory.

followLink(inode, dir): Translates a symbolic link specified by an inode object; if the symbolic link is a relative pathname, the lookup operation starts from the specified directory.

[F]readData(offset, size, buffer): Reads size bytes from the data, starting at position (data_ + offset), into the buffer.

[F]writeData(offset, size, buffer): Writes size bytes from the buffer into the data, starting at position (data_ + offset).

[D]: only used for directories; [F]: only used for files

9.3.3 Dentry objects 

The VFS considers each directory a file that contains a list of files and other directories. The directory entry is transformed by the VFS into a dentry object based on the Dentry class, whose fields are described in Table 3. The system creates a dentry object for every component of a pathname that a process looks up; the dentry object associates the component with its corresponding inode. For example, when looking up the /tmp/test pathname, the system creates a dentry object for the / root directory, a second dentry object for the tmp entry of the root directory, and a third dentry object for the test entry of the /tmp directory.

The dentry object stores information about the linking of a directory entry with the corresponding file. Each disk-based filesystem stores this information in its own particular way on disk.


Type             Field        Description
class Inode      d_inode_     Pointer to the inode class
class Dentry     d_parent_    Pointer to the parent dentry
class PointList  d_child_     the child dentry-list from the dentry
class PointList  d_subdirs_   For directories, list of dentry objects of subdirectories
char             d_name_      the name of dentry

Table 3: The fields of the dentry object

The dentry only identifies the inode it references and knows where its parent and its children are located; the dentry stores the pathname in the class. The relevant operations are as follows:

Dentry(name): Constructs a new dentry with the given name.

Dentry(parent): Constructs a new dentry from the parent dentry.

childRemove(dentry): Removes the given dentry from the child-dentry-list.

childInsert(dentry): Inserts the given dentry into the child-dentry-list.

checkName(name): Compares the name with the names of all dentries in the child-dentry-list.
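The /tmp/test lookup described above can be sketched with a small dentry tree. This is an illustration only: DentryNode, findChild, insertChild and walkPath are hypothetical names that loosely mirror checkName, childInsert and the path walker, and the sketch ignores inodes, mount points and "." / ".." entries.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

struct DentryNode
{
  std::string name;
  DentryNode* parent;
  std::vector<std::unique_ptr<DentryNode>> children;

  DentryNode(const std::string& n, DentryNode* p) : name(n), parent(p) {}

  // Compares the name with the names of all children (compare checkName).
  DentryNode* findChild(const std::string& child_name) const
  {
    for (const auto& child : children)
      if (child->name == child_name) return child.get();
    return nullptr;
  }

  // Appends a new child dentry (compare childInsert).
  DentryNode* insertChild(const std::string& child_name)
  {
    children.push_back(std::unique_ptr<DentryNode>(new DentryNode(child_name, this)));
    return children.back().get();
  }
};

// Resolves an absolute path one component at a time, starting at the root.
// Returns nullptr if a component does not exist.
inline DentryNode* walkPath(DentryNode* root, const std::string& path)
{
  assert(root != nullptr);
  DentryNode* current = root;
  std::size_t pos = 0;
  while (current != nullptr && pos < path.size())
  {
    if (path[pos] == '/') { ++pos; continue; }  // skip separators
    std::size_t end = path.find('/', pos);
    if (end == std::string::npos) end = path.size();
    current = current->findChild(path.substr(pos, end - pos));
    pos = end;
  }
  return current;
}
```

Walking "/tmp/test" visits three dentries: the root, its child tmp, and tmp's child test.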

9.3.4 File objects 

A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and consists of the File class, whose fields are described in Table 4. Notice that file objects have no corresponding image on disk, and hence no dirty field is included in the File class to specify that the file object has been modified; i.e. the information only exists in memory during the period when a process accesses the file.


Type              Field           Description
class             own             the own information of the file
uint32            uid             the user-identifier of the file
uint32            gid             the group-identifier of the file
uint32            version         the version number
class Superblock  f_superblock_   pointer to the superblock object
class Inode       f_inode_        the inode associated to the file
class Dentry      f_dentry_       pointer to the dentry object
uint32            flag_           flags specified when opening the file

Table 4: The fields of the file object

The main information stored by a file object is the file pointer, i.e. the current position in the file from which the next operation will take place. [18] Since several processes may access the same file concurrently, the file pointer cannot be kept in the inode object.

There are several lists of "in use" file objects, each assigned to a superblock. Each superblock object stores in its s_files_ field the dummy first element of a list of file descriptors, which comprise the file objects; thus, file objects of files belonging to different filesystems are included in different lists.

Each filesystem includes its own set of file operations (defined in the file class) that perform such activities as reading and writing a file. When a process opens the file, the VFS initializes the new file object with the address stored by the inode so that further calls to file operations can use these functions. Let’s briefly describe the file operations:

File(inode, dentry): Constructs a new file from the given inode and dentry.

read(buffer, size, offset): Reads size bytes from the file, starting at position + offset, into the buffer.

write(buffer, size, offset): Writes size bytes from the buffer into the file, starting at position + offset.

File Descriptor 

The file descriptor object contains an unsigned integer number and a file object. The PointList stores object pointers, and the index of an element changes after operations on the list (e.g. remove). The index therefore cannot serve as a stable identifier, which is the reason for the existence of the File Descriptor object.
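Why a separate number is needed can be sketched as follows. The classes are hypothetical stand-ins (a std::vector replaces the PointList), so this shows the idea rather than SWEB's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct OpenFileSketch
{
  uint32_t flags;
};

// Pairs a stable number with a file object.
struct FileDescriptorSketch
{
  uint32_t fd;
  OpenFileSketch* file;
};

class DescriptorListSketch
{
public:
  // Registers a file and returns its descriptor number.
  uint32_t add(OpenFileSketch* file)
  {
    assert(file != nullptr);
    FileDescriptorSketch entry = {next_fd_, file};
    entries_.push_back(entry);
    return next_fd_++;
  }

  // Current position of the descriptor in the list, or -1 if unknown.
  int indexOf(uint32_t fd) const
  {
    for (std::size_t i = 0; i < entries_.size(); ++i)
      if (entries_[i].fd == fd) return static_cast<int>(i);
    return -1;
  }

  OpenFileSketch* find(uint32_t fd) const
  {
    const int index = indexOf(fd);
    return index < 0 ? nullptr : entries_[static_cast<std::size_t>(index)].file;
  }

  bool remove(uint32_t fd)
  {
    const int index = indexOf(fd);
    if (index < 0) return false;
    entries_.erase(entries_.begin() + index);
    return true;
  }

private:
  std::vector<FileDescriptorSketch> entries_;
  uint32_t next_fd_ = 0;
};
```

After the first descriptor is removed, the list positions of all later entries shift by one, but looking them up by their descriptor number still returns the same file objects.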

9.3.5 derivation-hierarchy 

The figure illustrates the derivation diagram. A concrete filesystem is derived from the four basic classes; the VirtualFileType determines the type of the filesystem (e.g. ramfs, ext2, ...). The derived filesystem overrides the four basic classes and creates its own classes for the concrete type.




9.3.6 interaction between processes and vfs objects

The figure illustrates with a simple example how processes interact with files. Three different processes have opened the same file. Each of the three processes uses its own file object (the file descriptor), and the system (path_walker) finds the correct dentry in the ROOT-tree. The three dentries point at the same inode object, which identifies the superblock object and, together with the latter, the common disk file.



9.4 RamFS - memory filesystem 

The RamFS is dynamically allocated (its size varies as needed, taken from memory) and is a real filesystem, not a fake device on which to put a filesystem. The filesystem only uses the space needed to store the files and directories on it. The data resides in memory, so all data is lost when the program exits.

9.4.1 simplification from the RamFS 

• The class RamFsSuperblock uses the member list all_inodes_ to store the used and modified inodes of the filesystem. Since the filesystem lives in memory, there is no difference between a used inode and a modified inode, and s_dev_ and s_magic_ are unused. The createInode method creates a RamFsInode and connects it to the dentry. s_files_ stores all file descriptors, and the inodes corresponding to the opened files of the superblock object are kept in used_inodes_.

• The class RamFsInode holds a reference to its Dentry (i_dentry_, the name and the position in the root-tree) and to its RamFsSuperblock (i_superblock_, the controller of the inode). A file can be opened repeatedly with the READONLY-flag; the inode stores the open files in the PointList i_files_. The ramfs-inode object uses the member variable data_ to store the content of the file, and the operations writeData and readData work on it. The create, mknod and mkdir operators are equivalent to each other. symlink, readlink and followLink are not implemented.

• The Dentry only knows the position of its inode and stores the name (pathname) information through the tree structure.

• The class RamFsFile provides access to the inode from the outside layer. The own information (uid, gid, version) is not implemented in this version. The vfs_syscall functions open and close set up the file descriptor (file object) in the s_files_ list of the RamFsSuperblock. write and read transfer data between the buffer and the data_ of the RamFsInode.
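How readData and writeData can work on an in-memory data_ member is sketched below. This is not the RamFsInode code: the class name is hypothetical, a std::vector replaces the raw data_ pointer, and the choice to reject writes that would leave a gap behind the current end of the data is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class RamInodeSketch
{
public:
  // Copies size bytes from buffer to data_ + offset and grows the data if
  // needed. Returns the number of bytes written, or -1 if offset lies
  // beyond the current end of the data.
  int32_t writeData(uint32_t offset, uint32_t size, const char* buffer)
  {
    assert(buffer != nullptr || size == 0);
    if (offset > data_.size()) return -1;
    const std::size_t end = static_cast<std::size_t>(offset) + size;
    if (end > data_.size()) data_.resize(end);
    if (size > 0) std::memcpy(data_.data() + offset, buffer, size);
    return static_cast<int32_t>(size);
  }

  // Copies up to size bytes from data_ + offset into buffer. Returns the
  // number of bytes read, or -1 if offset lies beyond the end of the data.
  int32_t readData(uint32_t offset, uint32_t size, char* buffer) const
  {
    assert(buffer != nullptr || size == 0);
    if (offset > data_.size()) return -1;
    const std::size_t available = data_.size() - offset;
    const std::size_t count = size < available ? size : available;
    if (count > 0) std::memcpy(buffer, data_.data() + offset, count);
    return static_cast<int32_t>(count);
  }

  std::size_t size() const { return data_.size(); }

private:
  std::vector<char> data_;  // the file content lives in memory only
};
```

A read that asks for more bytes than are stored simply returns the shorter count, so callers have to check the return value.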

9.4.2 file vs. directory 

The difference between a file and a directory lies in the member variable data_ of the inode object. A file can be read from and written to, and a process manipulates it through a file descriptor (the integer number identifies the file within the opened-file-list of the superblock object). A directory does not use data_ (data_ = 0) and is only an inode object. All operations on a directory go through the inode object.

A file inode can be associated with many file objects (link/unlink-operator). These file objects are stored in the PointList of the inode if they are connected to an inode of type I_FILE; a directory inode has no file objects.




directory - inode & dentry 

The operator mkdir creates a new directory. The workflow in the VFS is as follows:

1. The vfs_syscall uses the path_walker to find the corresponding inode.

2. A new dentry is created in the correct directory.

3. The vfs_syscall hands control over to the (RamFs-)Superblock by calling the method createInode of the superblock object.

4. The RamFsSuperblock creates a new inode with type I_DIR.

5. The inode is connected to the dentry (mkdir).

6. createInode returns the created inode object.

file - inode & dentry 

The operator open first checks whether the inode object exists; if the inode does not exist, it creates one. Then it creates a new file, connects it to the inode object and returns a file descriptor number. The workflow is as follows:

1. The vfs_syscall uses the path_walker to find the corresponding inode.

2. A new dentry is created in the correct directory.

3. The vfs_syscall hands control over to the (RamFs-)Superblock by calling the method createInode of the superblock object.

4. The RamFsSuperblock creates a new inode with type I_FILE.

5. The inode is connected to the dentry (mkfile).

6. createInode returns the created inode object.

7. The vfs_syscall calls the method createFd of the superblock object.

8. A file descriptor (file object) is created and inserted into the s_files_ list.

9. The inode is connected to the file (link).

10. createFd returns the file descriptor.

Steps 2-6 only occur if the file does not exist before the vfs_syscall open.

9.5 extension

• symbolic link: allows directory-to-directory links.

• static Ramdisk: uses the Ramdisk to store the data with a fixed size.

• dynamic Ramdisk: uses the Ramdisk to store the data with an increasing size.

• ext2 FS: uses the block device to store the data.

• pipfs, BFS, ...: [4]


Chapter 10 

Locking 


Locking is used in the sweb-kernel for handling critical sections. A critical section is 

a piece of code which operates with resources (data, files, databases, printers, etc.) 

shared between two or more threads or processes which must not be concurrently 

accessed by more than one thread. The mechanism of assuring that only one thread 

executes a particular critical section at a particular time is called mutual exclusion. 

For example, look at the following code: 

    i = i + 1;

Imagine that i is a global variable (or shared in another way) and that two threads
are executing this instruction. Each thread can be interrupted at any time. Each
thread may execute the following steps:

    1  reg = i;
    2  reg++;
    3  i = reg;

Each of the threads increments the value, so the value should have been incremented
twice after execution. Now imagine that Thread A is interrupted between line 1 and
2 of execution, and Thread B is scheduled next. After Thread B has done its work
Thread A is rescheduled. Thread A still holds the old value of i in its register, so it
overwrites the result of Thread B: the value of i increases by just 1, because each
thread "sees" the same initial value of i. Such an event is called a race condition.
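The interleaving described above can be replayed deterministically. The following sketch does not use real threads; it models each thread by its private register (the names reg_a, reg_b and lostUpdateDemo are made up for this illustration) and executes the three steps in exactly the order from the text.

```cpp
#include <cassert>

// Deterministic replay of the interleaving described above. Each "thread"
// is modelled only by its private register; no real threads are involved.
int lostUpdateDemo()
{
  int i = 0;        // the shared variable

  int reg_a = i;    // Thread A, line 1: reg = i   (A reads 0)
  // Thread A is interrupted here and Thread B runs to completion.
  int reg_b = i;    // Thread B, line 1            (B also reads 0)
  reg_b++;          // Thread B, line 2
  i = reg_b;        // Thread B, line 3
  assert(i == 1);   // B's increment is visible at this point

  // Thread A is rescheduled and continues with its stale register.
  reg_a++;          // Thread A, line 2
  i = reg_a;        // Thread A, line 3: overwrites B's result with 1

  return i;         // 1, although two increments were executed
}
```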

For more information and examples see chapter 2.3 in [17] and [5]. You really 

should be aware of this issue (not only for the operating-systems-course), because 

race-conditions often are the reason for strange program behavior and may not be 

discovered for a long time. 

You should be aware that mutual exclusion can also be achieved by disabling
interrupts, but the better way is to use locks. The scheduler itself disables scheduling
while working with the thread-list, which is a similar procedure, but this should
remain an exception; it is done only because the list cannot be protected otherwise.


10.2 Locks 

Locks in Sweb are objects which provide the methods acquire() and release(). 

These methods should be called before and after a critical section and assure mutual 

exclusion. 

10.2.1 SpinLock 

The SpinLock provides a kind of "busy-waiting". A thread that does not get instant
access to the critical section yields in a loop until it does get access. This is
much better than wasting CPU time by only polling the lock-variable. So, how does it work?

The class SpinLock has a member variable named nosleep_mutex_ which has value 

0 if the lock is free and 1 if the lock is acquired. Acquiring the lock is a bit tricky, 

because it leads to a race condition if you read the lock-variable and set it to 1 if it is 

0. If you are interrupted between observing the value and setting it, another thread 

may read the same value (e.g. =0, unlocked) and then both threads enter the critical 

region simultaneously, which is not the right behavior. See the implementation: 

    SpinLock::SpinLock()
    {
      nosleep_mutex_ = 0;   // 0 = free, 1 = acquired
    }

    void SpinLock::acquire()
    {
      // testSetLock returns the old value: non-zero means the lock was already held
      while (ArchThreads::testSetLock(nosleep_mutex_, 1))
      {
        if (unlikely(ArchInterrupts::testIFSet() == false))
          kpanict((uint8*) "SpinLock::acquire: with IF=0 ! Now we're dead !!!\n"
                           "Maybe you used new/delete in irq/int-Handler context ?\n");
        // SpinLock: Simplest of Locks, do the next best thing to busy waiting
        Scheduler::instance()->yield();
      }
    }

    void SpinLock::release()
    {
      nosleep_mutex_ = 0;
    }

This race condition is avoided by a special instruction named TSL (TestSetLock;
the assembly instruction used in sweb is XCHG on Intel x86 architectures). It is a
single, atomic CPU instruction that sets the lock-variable and stores its old value
in a register. The method ArchThreads::testSetLock returns the old value. If it
is 1, the lock-variable was already set (writing 1 again changes nothing), and we know
that the lock is being held by another thread. If it is 0, we got the lock and no longer
need to yield. After returning from acquire() a thread can execute the critical region
related to the lock.

The release()-method simply sets nosleep_mutex_ to 0.

The mutex-implementation in chapter 2.3.6 of [17] is very similar to the SpinLock in 

sweb. 
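The same test-and-set idea can be tried out in user space. The sketch below is an analogue under stated assumptions, not the SWEB implementation: std::atomic<int>::exchange stands in for the XCHG-based ArchThreads::testSetLock, std::this_thread::yield stands in for Scheduler::yield, and the class name YieldingSpinLock is made up.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Stand-in for ArchThreads::testSetLock(): atomically store 'new_value' and
// return the value the lock variable held before.
int testSetLock(std::atomic<int> &lock, int new_value)
{
  return lock.exchange(new_value);
}

// Hypothetical user-space analogue of the SpinLock, not the SWEB class.
class YieldingSpinLock
{
public:
  void acquire()
  {
    // Old value 1: somebody else holds the lock, so give up the CPU and retry.
    while (testSetLock(lock_, 1) == 1)
      std::this_thread::yield();
  }

  void release()
  {
    assert(lock_.load() == 1);   // releasing a free lock would be a bug
    lock_.store(0);
  }

  bool isFree() const { return lock_.load() == 0; }

private:
  std::atomic<int> lock_{0};     // 0 = free, 1 = held
};
```

Calling testSetLock twice on a free lock variable returns 0 the first time and 1 the second time, which is exactly the distinction acquire() relies on.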



10.2.2 Mutex 

The Mutex-class improves on the busy-waiting method by putting threads to sleep while
they are waiting for the lock. If a thread doesn't get access to the critical section
it is put to sleep. It is also put on a sleepers-list, which remembers all threads
currently sleeping on the Mutex-object. When another thread leaves the critical section
by calling release(), a thread on the sleepers-list is selected and set to running state.
The woken thread then tries to enter the critical section again. For a correct
implementation, a SpinLock protects the sleepers-list, and the
Scheduler::sleepAndRelease()-method atomically puts the thread to sleep and releases
the spinlock. A detailed explanation of possible race conditions is given in the file
Mutex.h.


    Mutex::Mutex()
    {
      mutex_ = 0;
    }

    void Mutex::acquire()
    {
      spinlock_.acquire();
      while (ArchThreads::testSetLock(mutex_, 1))
      {
        // mutex is held by someone else: go to sleep on the sleepers-list
        sleepers_.pushBack(currentThread);
        Scheduler::instance()->sleepAndRelease(spinlock_);
        // woken up: take the spinlock again before retesting mutex_
        spinlock_.acquire();
      }
      spinlock_.release();
      held_by_ = currentThread;
    }

    void Mutex::release()
    {
      spinlock_.acquire();
      mutex_ = 0;
      held_by_ = 0;
      if (!sleepers_.empty())
      {
        // wake the thread that has been waiting longest
        Thread *thread = sleepers_.front();
        sleepers_.popFront();
        Scheduler::instance()->wake(thread);
      }
      spinlock_.release();
    }

Because this mechanism uses a list, it involves the KernelMemoryManager. Therefore 

the KMM itself cannot use a Mutex as locking mechanism (it uses a SpinLock instead). 

The Mutex is the standard locking mechanism in sweb, so whenever you need a lock 

use it. 



10.3 Using Locks 

Using locks is easy: call acquire() before the critical section and release() after
it. In sweb you can also use the class MutexLock. Its constructor takes
a Mutex-object and calls acquire(); its destructor calls release(). This is
useful if a function has several exit points (for example after catching errors) and
you would otherwise have to call release() before each return-instruction. A
stack-allocated MutexLock-object automatically releases the mutex when the method
returns. See the example in Loader::readHeaders().
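The guard idea can be sketched independently of the kernel. The classes below are hypothetical stand-ins (ToyMutex only records whether it is held, ToyMutexLock follows the constructor/destructor pattern described for MutexLock), and parseHeader is an invented function with several exit points.

```cpp
#include <cassert>

// Hypothetical stand-in for a mutex: it only records whether it is held.
class ToyMutex
{
public:
  void acquire() { assert(!held_); held_ = true; }
  void release() { assert(held_); held_ = false; }
  bool isFree() const { return !held_; }

private:
  bool held_ = false;
};

// Guard in the style of MutexLock: acquire in the constructor,
// release in the destructor.
class ToyMutexLock
{
public:
  explicit ToyMutexLock(ToyMutex &m) : mutex_(m) { mutex_.acquire(); }
  ~ToyMutexLock() { mutex_.release(); }
  ToyMutexLock(const ToyMutexLock &) = delete;
  ToyMutexLock &operator=(const ToyMutexLock &) = delete;

private:
  ToyMutex &mutex_;
};

// Three exit points, none of which needs an explicit release():
// the guard's destructor runs on every return path.
int parseHeader(ToyMutex &file_lock, int magic, int size)
{
  ToyMutexLock guard(file_lock);
  if (magic != 0x7f)
    return -1;   // wrong magic number
  if (size <= 0)
    return -2;   // empty header
  return 0;
}
```

Whichever return statement is taken, the lock is free again after parseHeader() returns.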

Using locks can easily lead to deadlocks if you make a mistake. It’s also not that easy 

to recognize the code snippets which are race-condition-prone. We will discuss critical 

sections in the sweb-kernel later. 

A few hints: 

• don’t acquire a lock twice, this will lead to a deadlock. Remember this when you 

are calling methods inside a critical region. 

• don't use Scheduler::wake() and ::sleep() on a thread which uses mutexes
(if you wake a thread sleeping on a mutex, the thread will be put on the sleepers-list
again). Use condition-variables instead if you want to block a running thread.

• all synchronization primitives in sweb (SpinLock, Mutex, CV) shouldn't be used
while interrupts are disabled. The reason is that no task switch is possible (also no
yielding!), so nobody can release a lock that is already acquired. A thread that
doesn't get the lock, i.e. doesn't get access to the critical section, will be
deadlocked.

• You should avoid nesting locks, but if you do nest them, you have to follow a simple
rule: always acquire the locks in the same order in every thread, because otherwise
your code can deadlock. The following example demonstrates why:

Thread A executes: 

    1  lock_a.acquire();
    2  lock_b.acquire();
    3  ...
    4  lock_b.release();
    5  lock_a.release();

Thread B executes: 

    1  lock_b.acquire();
    2  lock_a.acquire();
    3  ...
    4  lock_a.release();
    5  lock_b.release();

Now imagine thread A has already acquired lock_a, but not lock_b (interrupted
between line 1 and 2). Thread B acquires lock_b and then tries to acquire lock_a.
Thread B will wait forever for lock_a, and thread A will wait forever for lock_b:
we have created a deadlock.
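One common way to follow the ordering rule is to derive the order from something fixed, for example the addresses of the lock objects. The following user-space sketch illustrates that idea with std::mutex; it is an assumption about how one might enforce the rule, not something the sweb sources are known to do, and the helper names inLockOrder and withBothLocks are made up.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Returns the two locks in a fixed global order (here: by address), so every
// caller acquires them in the same sequence regardless of argument order.
std::pair<std::mutex *, std::mutex *> inLockOrder(std::mutex &x, std::mutex &y)
{
  // std::less provides a total order on pointers to unrelated objects.
  if (std::less<std::mutex *>()(&y, &x))
    return {&y, &x};
  return {&x, &y};
}

// Runs 'work' with both locks held, acquiring them in the fixed order.
template <typename Work>
void withBothLocks(std::mutex &x, std::mutex &y, Work work)
{
  assert(&x != &y);   // acquiring the same lock twice would deadlock
  std::pair<std::mutex *, std::mutex *> order = inLockOrder(x, y);
  std::lock_guard<std::mutex> first(*order.first);
  std::lock_guard<std::mutex> second(*order.second);
  work();
}
```

With this helper, a caller passing (lock_a, lock_b) and a caller passing (lock_b, lock_a) both end up acquiring the locks in the same sequence, so the interleaving from the example can no longer block both threads.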




10.4 Critical Sections in the Sweb-Kernel 

10.4.1 KMM 

The allocateMemory(), reallocateMemory() and freeMemory() methods each use
the spinlock-member lock_ to protect the list of memory-segments. The KMM does not
always use the lock, because it is also involved in the startup-routine before
interrupts are enabled, where using the lock is not necessary.

10.4.2 PM 

The PageManager uses a Mutex-object in the methods getFreePhysicalPage() and 

freePage() to protect the map of physical pages. 

10.4.3 Loader 

The Loader uses the mutex file_lock_ to avoid race conditions between setting the
file-offset and reading the file in the methods readHeaders() and
loadOnePageSafeButSlow(). This is necessary only if you implement multithreaded user
processes, because then the threads share the same Loader and therefore the same
file-descriptor.

10.4.4 CV 

The condition-variable is another synchronization primitive provided in sweb. It adds
the ability to block threads (i.e. put them to sleep) if they cannot continue execution
and to unblock them once a condition is met. To block a thread, the method wait() is
provided: the calling thread is put to sleep and stored on a sleepers-list. When the
condition is met, someone calls signal(), which wakes one thread, or broadcast(),
which wakes all threads. Condition-variables can only be used in a critical section
protected with a Mutex-object. For examples of use see chapter 2.3.7 of [17]. Just
imagine that the monitor-procedures are surrounded by a mutex. See also the uses in
sweb: common/kernel/MountMinix.h;.cpp and common/ipc/FiFo.h;.cpp.

Note that the CVs introduced in Tanenbaum chapter 2.3.7 [17] are used in a monitor,
and only one thread or process can be active in the monitor at a particular time. A
monitor can be simulated with a mutex-locked code section, as is done in sweb
and, for example, in the POSIX multithreading library. You should also note that in
sweb the signaled thread does not run immediately (Mesa semantics), so the condition
may no longer hold by the time it runs. The correct use of CV-wait() is therefore
generally while(foo) condition.wait(); instead of if(foo) condition.wait();.
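The while-loop pattern looks the same with the C++ standard library, whose condition variables also require the predicate to be rechecked after waking up. The sketch below is a user-space illustration, not SWEB code; Mailbox and sumThroughMailbox are invented names, and std::mutex and std::condition_variable stand in for Mutex and Condition.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical mailbox protected by one mutex and one condition variable.
class Mailbox
{
public:
  void put(int value)
  {
    std::lock_guard<std::mutex> guard(lock_);
    items_.push(value);
    not_empty_.notify_one();     // corresponds to signal()
  }

  int take()
  {
    std::unique_lock<std::mutex> guard(lock_);
    while (items_.empty())       // while, not if: recheck after every wake-up
      not_empty_.wait(guard);    // releases the mutex while sleeping
    int value = items_.front();
    items_.pop();
    return value;
  }

private:
  std::mutex lock_;
  std::condition_variable not_empty_;
  std::queue<int> items_;
};

// One consumer thread takes 'count' items that the caller puts in.
int sumThroughMailbox(int count)
{
  assert(count >= 0);
  Mailbox box;
  int sum = 0;
  std::thread consumer([&]() {
    for (int n = 0; n < count; ++n)
      sum += box.take();
  });
  for (int n = 1; n <= count; ++n)
    box.put(n);
  consumer.join();               // after join() it is safe to read sum
  return sum;
}
```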


    Condition::Condition(Mutex *lock)
    {
      sleepers_ = new List();
      lock_ = lock;
    }

    Condition::~Condition()
    {
      delete sleepers_;
    }

    void Condition::wait()
    {
      // list is protected, because we assume the lock is being held
      assert(lock_->isHeldBy(currentThread));
      assert(ArchInterrupts::testIFSet());
      sleepers_->pushBack(currentThread);
      Scheduler::instance()->sleepAndRelease(*lock_);
      // woken up: re-enter the critical section before returning
      lock_->acquire();
    }

    void Condition::signal()
    {
      if (!lock_->isHeldBy(currentThread))
        return;

      Thread *thread = 0;
      if (!sleepers_->empty())
      {
        thread = sleepers_->front();
        if (thread->state_ == Sleeping)
        {
          Scheduler::instance()->wake(thread);
          sleepers_->popFront();
        }
      }
    }

    void Condition::broadcast()
    {
      if (!lock_->isHeldBy(currentThread))
        return;

      Thread *thread;
      List tmp_threads;
      while (!sleepers_->empty())
      {
        thread = sleepers_->front();
        sleepers_->popFront();
        if (thread->state_ == Sleeping)
          Scheduler::instance()->wake(thread);
        else
          tmp_threads.pushBack(thread);   // not asleep yet: keep it on the list
      }
      while (!tmp_threads.empty())
      {
        sleepers_->pushBack(tmp_threads.front());
        tmp_threads.popFront();
      }
    }

The sleepers-list need not be protected separately, because when calling wait()
or signal() the thread already has to be in a critical section protected with the
same lock that was passed to the Condition-constructor. In wait() we also have to
release the lock (atomically with going to sleep), so that some other thread can
enter the critical region. Only then can something happen that satisfies the
condition and leads to signal() being called. The last instruction in wait()
acquires the lock again, because the thread was in the critical region before calling
wait() and should also be in it when leaving wait(). The signal()-method simply
wakes the next thread on the list; broadcast() wakes all threads on the list.

10.4.5 Console and Terminal 

The Terminal uses a lock to protect the characters and states on the terminal and 

some other members. A Console uses two locks, one for drawing (protecting writes 

to the display, named console_lock_) and one for protecting the active terminal 

(set_active_lock_). The locks are nested in various ways to assure mutual exclusion. 

In detail: 

• Terminal-methods generally lock either the terminal itself only (if setting some 

members) or the terminal and the console (when writing characters to terminal 

and console). 

• The console itself uses the set_active_lock_ to protect access to the active terminal
(Console::getActiveTerminal(), Console::setActiveTerminal()).
Because the terminal itself also needs to know whether it is the active one,
it provides the methods Terminal::setAsActiveTerminal() and
Terminal::unSetAsActiveTerminal(). Those are of course protected by the terminal-lock.
When the active terminal is set, the console-lock is also used
in order to update the console-contents with the contents of the terminal.

Therefore nested locks are acquired in the following orders:

• Several terminal-methods:

    terminal_lock.acquire();
    console_lock.acquire();
    console_lock.release();
    terminal_lock.release();

• Console::setActiveTerminal()

    set_active_lock.acquire();

    terminal_a_lock.acquire();
    terminal_a_lock.release();

    terminal_b_lock.acquire();
    console_lock.acquire();
    console_lock.release();
    terminal_b_lock.release();

    set_active_lock.release();

Deadlocks can occur if threads that already hold locks request others which they
might never get. So, let's look at the possible cases:

• A thread holds terminal_x_lock and waits for console_lock. The console_lock is
always the last lock to be acquired, so it will be released sooner or later.

• A thread holds set_active_lock and requests terminal_x_lock. This is also no problem,
because the holder of a terminal_lock releases it at the latest once it has obtained
the console_lock and done its work, which is independent of set_active_lock.

10.4.6 Scheduler 

The Scheduler disables scheduling in order to protect the thread-list (it then
cannot be interrupted by other threads, which assures mutual exclusion). Protecting
the list with a mutex or spinlock is not possible, because the lock would also have
to be used in the Scheduler::schedule() method. This method is called from interrupt
context with interrupts disabled, where yielding is not possible (apart from the fact
that after a hypothetical yield we would end up in Scheduler::schedule() again ;-)).
So the method searches for a new thread to run only if scheduling is enabled;
otherwise the list is probably being modified, and currentThread has to continue
execution (and will hopefully enable scheduling again).

Before modifying the thread-list, the private Scheduler-method lockScheduling()
should be called, which actually blocks the Scheduler. When adding or removing
threads (those list-methods use new/delete) you also have to be sure
that the KernelMemoryManager-lock is free. Otherwise you will deadlock when you use
new/delete, because scheduling is disabled (yielding is not possible in this case).
Of course, this condition can't be checked before you call lockScheduling(). Imagine
you are interrupted between SpinLock::isFree() and Scheduler::lockScheduling():
you can no longer be sure that the SpinLock is free. SpinLock::isFree() (and
Mutex::isFree()) should only be called while scheduling or interrupts are disabled,
because only in these cases is the result trustworthy and the call meaningful.

So, we have to check the KMM-lock after locking scheduling. And if it’s not free 

we have to unlock scheduling, yield away, lock scheduling, check KMM-lock, etc. 

until the KMM-lock is free and scheduling is disabled. This is done by the private 

Scheduler-method waitForFreeKMMLock(). A few code snippets from the Scheduler: 

    void Scheduler::addNewThread(Thread *thread)
    {
      lockScheduling();
      debug(SCHEDULER, "addNewThread: %x %d:%s\n", thread, thread->getPID(), thread->getName());
      waitForFreeKMMLock();   // pushFront uses new, so the KMM-lock must be free
      threads_.pushFront(thread);
      unlockScheduling();
    }

    uint32 Scheduler::schedule()
    {
      if (testLock() || block_scheduling_extern_ > 0)
      {
        debug(SCHEDULER, "schedule: currently blocked\n");
        return 0;
      }

      // [select a thread ready to run, update currentThread and currentThreadInfo]
    }

    void Scheduler::lockScheduling()   // not as severe as stopping Interrupts
    {
      if (unlikely(ArchThreads::testSetLock(block_scheduling_, 1)))
        arch_panic((uint8*) "FATAL ERROR: Scheduler::*: block_scheduling_ was set !! "
                            "How the Hell did the program flow get here then ?\n");
    }

    void Scheduler::unlockScheduling()
    {
      block_scheduling_ = 0;
    }

    bool Scheduler::testLock()
    {
      return (block_scheduling_ > 0);
    }

    void Scheduler::waitForFreeKMMLock()
    {
      if (block_scheduling_ == 0)
        arch_panic((uint8*) "FATAL ERROR: Scheduler::waitForFreeKMMLock: "
                            "This is meant to be used while Scheduler is locked\n");

      // check the KMM-lock only while scheduling is locked; otherwise back off and retry
      while (!KernelMemoryManager::instance()->isKMMLockFree())
      {
        unlockScheduling();
        yield();
        lockScheduling();
      }
    }




Chapter 11 

Xen 


The following chapter describes the experimental port of SWEB to Xen version 3.x.
The intention was to create a fully functional SWEB version as a Xen guest domain.
See the last section for what is missing. The gaps are partly due to the fact that the
Xen implementation started later, when a large portion of SWEB was already done.
Major changes to the structure were no longer possible due to time constraints,
as the code was soon to be used for exercises in an operating systems lecture
course. Therefore integration was harder and the architecture separation is
somewhat weakened. The first attempts were made under Xen 2.0.6. Xen 3.0 was released
in December 2005. As it incorporated many bug fixes, the 2.0.x code was largely
discarded and the implementation was redone for Xen 3.0.

11.2 Architecture separation 

SWEB was started as a student project, intended for the exercises accompanying an
operating systems lecture in which NACHOS [6] was used. One of the problems of
NACHOS is that it contains a lot of ifdef macros spread all over the code.
This obscures the code for students, so the intention in SWEB was to have as few
ifdefs as possible. The main decision to reach this goal was to create a common
interface for the architecture specific code. It can be found in arch/common/include.
These classes only contain static methods and do not rely on any further includes.
Therefore it is possible to write an architecture specific implementation without any
modifications to the common code. This is achieved by the symlink arch/arch/, which
points to arch/Xen or arch/x86 depending on which build is done.

After a short review of the Xen documentation it was clear that some areas would have
to be implemented differently. The following architecture specific headers existed from
the beginning:

• ArchCommon.h 

• ArchInterrupts.h 


• ArchMemory.h 

• ArchThreads.h 

The following architecture specific headers were introduced quite late in the development. 

Unfortunately these require additional includes and therefore do not follow the 

idea of a clean architecture separation strictly. 

• arch_keyboard_manager.h 

• arch_bd_manager.h 

Note that one also finds XenConsole.h there. This file is specific to SWEB
Xen but was introduced later. The clean way would have been to remove
common/include/console.h and introduce arch/common/include/ArchConsole.h
to replace it. This was not done because the source was already in use for
an exercise: no major modifications were to be made anymore, so that any patches
that might become necessary could still be created and tested in the testing tree.
When the code is refactored, this should be changed to follow the architecture
separation explained above.

11.2.1 Building environment 

To compile SWEB Xen one has to call make ARCH=Xen. Before this, sweb-testing-bin
should be removed completely to ensure that all dependencies are updated and everything
is recompiled. The results are sweb_Xen.elf, sweb_Xen.gz and sweb_Xen.x. The image
used to start a Xen guest domain is sweb_Xen.gz. See the installation
guide for instructions on how to boot this image. Also note that under arch a link
to the Xen source tree is created automatically. This way the Xen specific
implementation of the common architecture interface headers is compiled.

There is one pitfall in the Makefile for anyone who intends to change it. When the
linker is called, head.o must be the first object file in the argument list. To achieve
this it is deleted from $(OBJECTDIR)/sauhaufen and given explicitly as an argument to
the linker. If you change this, the link process will fail. Therefore, if head.S is
renamed, the Makefile must be adapted too.

11.3 Xen 

11.3.1 Overview 

Xen is a solution to run multiple operating systems on one host. Xen itself is called a 

hypervisor: 

A hypervisor in computing is a scheme which allows multiple operating systems 

to run, unmodified, on a host computer at the same time. The term is 

an extension of the earlier term supervisor, which was commonly applied to 

operating system kernels in that era. [7] 



Xen differs from this definition in one aspect: the operating systems must be modified
to run under Xen. With additional processor support it is possible to run unmodified
guests. Currently code for the new Vanderpool [8] architecture by Intel exists. AMD will
provide a similar extension called Pacifica [9].

To achieve the goal of multiple concurrently running operating systems the
paravirtualisation approach is used. This means that, unlike a full PC simulator, Xen
only provides virtualisation of the core parts. These are basically CPU and memory.
Additionally network interface cards, console and block devices are supported. Xen is
therefore especially suitable for network server virtualisation.

The Xen hypervisor deals only with CPU and memory. The rest is provided by a privileged
operating system called domain 0 (dom0 for short). Xen is not usable without a
dom0. Dom0 contains all the drivers for hardware access and provides services
such as virtual network interfaces to the other OS instances. These are called guest
domains.

11.3.2 Architecture 

The Xen hypervisor is mainly responsible for ensuring strict partitioning between the
guest domains. It also carries out the privileged work on behalf of the domains. All
hardware handling is done by driver domains, the most important of which is dom0.
Driver domains provide an interface that lets other domains share a single resource
such as the network card. This is also called a back end/front end architecture: dom0
runs the back end with access to the real hardware, and the guest domain only
implements the front end, which is a driver that uses the exposed interface.

Xen also allows running guest domains to be transferred to other machines over the
network or suspended to disk. This is almost transparent to the guest domain as long as
the new machine provides the same services.

11.3.3 Memory Layout 

Xen reserves part of the virtual address space for itself. Currently this is the top
64 MB on the x86 32-bit architecture. The start is defined in the Xen public headers
(arch-x86_32.h) as HYPERVISOR_VIRT_START. As explained above, Xen runs in
ring 0 and guest domains run in ring 1. As paging alone would not protect the Xen
memory from the guest OS, segmentation is used. While the guest is running, the top
memory used by Xen lies outside the configured segment limit. This can be thought of as
a memory hole. Software that is not aware of this fact may use the grow-down feature of
segments. This can lead to a performance decrease as Xen has to catch such accesses.
The /lib/tls libraries under Linux in particular exhibit the worst case behavior.
Therefore Xen Linux gives a warning during the boot process if these libraries are
not disabled. Read http://wiki.xensource.com/xenwiki/XenSegments for more
details. It is also possible to install adapted versions of the libraries (see
http://wiki.xensource.com/xenwiki/XenSpecificGlibc).

Figure 11.1 shows the general memory layout under Xen as constructed by the Linux
domain builder. The builder was originally developed for Linux, as it was the first
operating system supported, but it is suitable for other systems too. Other domain
builders can create slightly different layouts. The Xen team intends to keep the number
of domain builders small [10]. It is preferred to adapt the guest domains as necessary.
Care must be taken as the layout might change with major updates.

Figure 11.1: Xen memory layout

Pseudo Physical Memory vs. Machine Memory 

Xen provides an abstraction of the machine memory.

• machine memory: The real physical memory built into the machine. Its frames
are identified by machine frame numbers (MFN).

• pseudo physical memory: The abstraction layer for the guest operating system.
Its frames are identified by page frame numbers (PFN).

The PFNs form a contiguous memory region whose size is the memory amount specified
for the guest domain. The PFNs are specific to one domain: each domain has its own
PFN numbering, starting at frame number 0.

The MFNs cover the whole real machine memory consecutively, starting from address
0x00000000. An MFN corresponds to the physical address of a machine page frame.
The frame and page size is 4 KiB. An MFN refers to the same memory location in every
domain.

Figure 11.2 illustrates PFNs and MFNs. The guest OS can maintain the illusion of
contiguous memory, whereas in reality its frames might be scattered all over the real
machine memory. The scattering itself is no problem, as virtual memory management
already provides such an abstraction and the OS is aware of it. The problem is that
most OS kernels internally expect contiguous physical memory when managing pages.
With PFNs the OS can keep using its original code for managing contiguous memory.
When manipulating page tables, however, a guest OS must use MFNs, as only these
correspond to real physical memory addresses. Therefore, before updating or
manipulating a page table entry, the OS must translate the PFN to the
corresponding MFN.
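The translation step can be sketched as a table lookup followed by building the entry. This is a simplified illustration for 32-bit non-PAE entries under stated assumptions: pfn_to_mfn is a hypothetical per-domain table (index = PFN, value = MFN), makePte is an invented helper, and the actual page table update through the hypervisor is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

const unsigned PAGE_SHIFT = 12;            // 4 KiB frames
const std::uint32_t PTE_PRESENT = 0x1;     // x86 page table entry flag bits
const std::uint32_t PTE_WRITABLE = 0x2;

// Hypothetical helper: builds a page table entry for guest frame 'pfn'.
// 'pfn_to_mfn' stands for a per-domain translation table
// (index = PFN, value = MFN).
std::uint32_t makePte(const std::vector<std::uint32_t> &pfn_to_mfn,
                      std::uint32_t pfn, std::uint32_t flags)
{
  assert(pfn < pfn_to_mfn.size());
  std::uint32_t mfn = pfn_to_mfn[pfn];     // translate before building the entry
  return (mfn << PAGE_SHIFT) | flags;      // the entry holds the machine address
}
```

The guest keeps counting its frames from 0 upwards (PFNs), while the values written into page table entries come from the table and may be scattered arbitrarily.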




11.4 SWEB- Xen 

Figure 11.2: Page frame numbers: PFN and MFN 

The following sections cover different aspects of the implementation of SWEB for Xen. 

The port is called SWEB Xen to differentiate it from the native SWEB implementation. 

11.4.1 Guest Section 

A guest domain must have a section __xen_guest which consists of an ASCII string. 

In SWEB Xen the section looks like this: 

    .section __xen_guest
        .ascii "GUEST_OS=linux,GUEST_VER=2.6"
        .ascii ",XEN_VER=xen-3.0"
        .ascii ",VIRT_BASE=0x80000000"
        .ascii ",PAE=no"
        .ascii ",LOADER=generic"
        .byte 0

XEN_VER is important: it must be the version string of the Xen version in use.
If it differs from the running Xen version, the domain builder will not load the
guest domain. The second important value is VIRT_BASE, which must be the virtual
address to which the kernel image should be loaded. The same value must
be used in the SWEB source files for address calculations and in the linker script
(arch/xen/utils/kernel-ld-script.ld).
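The section content is a comma-separated list of KEY=VALUE pairs. The sketch below shows how such a string can be taken apart; it only illustrates the format and makes no claim about how the Xen domain builder parses it. The function names guestSectionValue and virtBase are made up.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Returns the value of 'key' in a "K1=V1,K2=V2,..." string, or "" if absent.
std::string guestSectionValue(const std::string &section, const std::string &key)
{
  std::string::size_type pos = 0;
  while (pos < section.size())
  {
    std::string::size_type end = section.find(',', pos);
    if (end == std::string::npos)
      end = section.size();
    std::string entry = section.substr(pos, end - pos);
    std::string::size_type eq = entry.find('=');
    if (eq != std::string::npos && entry.substr(0, eq) == key)
      return entry.substr(eq + 1);
    pos = end + 1;                          // continue after the comma
  }
  return "";
}

// Reads VIRT_BASE as a number; returns 0 if the key is missing.
unsigned long virtBase(const std::string &section)
{
  std::string value = guestSectionValue(section, "VIRT_BASE");
  assert(value.empty() || value.size() > 2);   // expect a "0x..." literal
  // Base 0 lets strtoul accept the "0x" prefix.
  return std::strtoul(value.c_str(), nullptr, 0);
}
```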

11.4.2 Start Info and Shared Info Page 

As a boot parameter, a pointer to the start info page is passed to the guest domain
in the register esi. This pointer is already a valid virtual address in the guest
address space. The structure used, start_info, is defined in the public Xen header
xen.h:

    typedef struct start_info {
        /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */
        char magic[32];             /* "xen-<version>-<platform>". */
        unsigned long nr_pages;     /* Total pages allocated to this domain. */
        unsigned long shared_info;  /* MACHINE address of shared info struct. */
        uint32_t flags;             /* SIF_xxx flags. */
        unsigned long store_mfn;    /* MACHINE page number of shared page. */
        uint32_t store_evtchn;      /* Event channel for store communication. */
        unsigned long console_mfn;  /* MACHINE address of console page. */
        uint32_t console_evtchn;    /* Event channel for console messages. */
        /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */
        unsigned long pt_base;      /* VIRTUAL address of page directory. */
        unsigned long nr_pt_frames; /* Number of bootstrap p.t. frames. */
        unsigned long mfn_list;     /* VIRTUAL address of page-frame list. */
        unsigned long mod_start;    /* VIRTUAL address of pre-loaded module. */
        unsigned long mod_len;      /* Size (bytes) of pre-loaded module. */
        int8_t cmd_line[MAX_GUEST_CMDLINE];
    } start_info_t;

Without this information a further setup of the guest domain would be impossible. A 

detailed explanation of the structure fields is skipped here as they will be explained 

when used. 

There is also an additional shared info page. The start_info struct contains the
machine address of this page (field shared_info). As it is not certain whether or where
this page is mapped by the domain builder, a proper mapping is set up explicitly. For
this a dummy page is created in head.S. Since the domain builder sets up mappings
for the whole kernel image, the dummy page has a valid virtual address from the start.
We can therefore remap this virtual address to the shared_info page and access it
before the virtual memory setup is complete.

    typedef struct shared_info {
        vcpu_info_t vcpu_info[MAX_VIRT_CPUS];

        /*
         * A domain can create "event channels" on which it can send and receive
         * asynchronous event notifications. There are three classes of event that
         * are delivered by this mechanism:
         *  1. Bi-directional inter- and intra-domain connections. Domains must
         *     arrange out-of-band to set up a connection (usually by allocating
         *     an unbound 'listener' port and advertising that via a storage service
         *     such as xenstore).
         *  2. Physical interrupts. A domain with suitable hardware-access
         *     privileges can bind an event-channel port to a physical interrupt
         *     source.
         *  3. Virtual interrupts ('events'). A domain can bind an event-channel
         *     port to a virtual interrupt source, such as the virtual-timer
         *     device or the emergency console.
         *
         * Event channels are addressed by a "port index". Each channel is
         * associated with two bits of information:
         *  1. PENDING -- notifies the domain that there is a pending notification
         *     to be processed. This bit is cleared by the guest.
         *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING
         *     will cause an asynchronous upcall to be scheduled. This bit is only
         *     updated by the guest. It is read-only within Xen. If a channel
         *     becomes pending while the channel is masked then the 'edge' is lost
         *     (i.e., when the channel is unmasked, the guest must manually handle
         *     pending notifications as no upcall will be scheduled by Xen).
         *
         * To expedite scanning of pending notifications, any 0->1 pending
         * transition on an unmasked channel causes a corresponding bit in a
         * per-vcpu selector word to be set. Each bit in the selector covers a
         * 'C long' in the PENDING bitfield array.
         */
        unsigned long evtchn_pending[sizeof(unsigned long) * 8];
        unsigned long evtchn_mask[sizeof(unsigned long) * 8];

        /*
         * Wallclock time: updated only by control software. Guests should base
         * their gettimeofday() syscall on this wallclock-base value.
         */
        uint32_t wc_version;      /* Version counter: see vcpu_time_info_t. */
        uint32_t wc_sec;          /* Secs  00:00:00 UTC, Jan 1, 1970. */
        uint32_t wc_nsec;         /* Nsecs 00:00:00 UTC, Jan 1, 1970. */

        arch_shared_info_t arch;

    } shared_info_t;

This page is used mainly for event channel notifications. See Callbacks, Events and
Exceptions (11.4.8) for more details on event handling.

11.4.3 PageManager and KernelMemoryManager 

The PageManager and the KernelMemoryManager classes expect to be placed after the
kernel end. Therefore the following conditions must be met:

• the start of free memory after the kernel must be known 

• there must be enough free memory after the kernel 

• the free memory must be mapped in the page table (accessible by virtual addresses) 

SWEB expects that 4MB of memory are available for the kernel and that this memory is
mapped correctly. These 4MB are considered kernel memory and will only be used by
kernel code. So two things have to be done in SWEB Xen:

• check and correct the 4MB mapping if necessary 

• calculate the start of free memory 

Under native SWEB the symbol kernel_end_address (see
arch/arch/utils/kernel-ld-script.ld for the definition) is identical to the
start of the free memory after the kernel. Under Xen this is not true; see 11.1 for
the general memory layout. Therefore a pair of static methods was introduced in
ArchCommon to get the free kernel memory area.

To get the start of free memory, call ArchCommon::getFreeKernelMemoryStart. When
booting natively the returned pointer is the same as kernel_end_address, while under
Xen a higher address is returned.

The second method is ArchCommon::getFreeKernelMemoryEnd. Under Xen the correct
addresses are calculated during the boot process in start_xen_init().

11.4.4 Memory regions and modules 

On a native boot SWEB uses the multiboot definition [11] filled in by the GRUB
bootloader. This structure passes on the usable memory regions and the module
addresses. They are parsed during boot, and the information needed later is stored in
the structure multiboot_remainder in x86/source/ArchCommon.cpp. The following static
methods provide access to this information:

• ArchCommon::getNumUsableMemoryRegions 

• ArchCommon::getUsableMemoryRegion 

• ArchCommon::getNumModules 

• ArchCommon::getModuleStartAddress 

• ArchCommon::getModuleEndAddress 

Under Xen the information needed by the above methods is provided partly by the
start info page (see 11.4.2). The remaining parts are calculated during the boot
process in start_xen_init.

11.4.5 Memory Layout 

Compare this to the SWEB linear memory layout in chapter VM (3.9).

Note that the memory layout described in the Xen public header xen.h differs from the
layout actually built by the Linux domain builder (xc_linux_build.c). Figure 11.3
(SWEB Xen memory layout) shows the virtual memory layout of the kernel, which contains
the following items:

• kernel image 

The following parts are all set up by the domain builder: 

• MFN list: Contains the numbers of all machine frames belonging to the guest 

domain. 

• start info: page containing the start info structure 

• store: page for the Xenstore 

• console: page for the Xen console 

• PT_base: boot page directory 




Figure 11.3: SWEB Xen memory layout 

• PT_frames: boot page tables 

The following part is constructed during the SWEB Xen boot process:

• additional page tables: needed to finally set up the virtual memory for the kernel 

The last parts are the same as under native SWEB. 

11.4.6 PSE 

SWEB uses PSE (page size extension). This enables a more efficient identity mapping
starting at 3GB (see 3.9). It is also used to map the frame buffer of the graphics
card. See also the section “Using 4MiB Pages” in chapter VM for more information on
the support of 4MB pages in SWEB. Under Xen this mode of operation is not possible.
The interface provided by Xen for guest domains is only aware of a single fixed page
size. One problem is the pseudo-physical memory abstraction (see 11.3.3). To support
4MB pages it would be necessary to get a contiguous 4MB block of real machine memory
starting at a 4MB boundary. As the memory is managed in 4KiB pages this is not
easily possible. This could change in future versions of Xen, as a redesign of the
memory handling is currently being discussed in order to support architectures which
differ significantly from the 32-bit x86 architecture.
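The contiguity requirement can be made concrete with a small check. The following is a minimal sketch and not SWEB code: it assumes a pfn-to-mfn table similar to the MFN list provided by the domain builder, and the helper name canUse4MBPage is made up for illustration. A 4MB page covers 1024 frames of 4KiB, so the machine frames behind 1024 consecutive pseudo-physical frames would have to be consecutive, and the first one would have to be a multiple of 1024.

```cpp
#include <cstddef>
#include <vector>

// Number of 4KiB frames covered by one 4MB page (4MB / 4KiB).
constexpr std::size_t FRAMES_PER_4MB = 1024;

// Hypothetical helper, for illustration only: returns true if the
// pseudo-physical frames [first_pfn, first_pfn + 1024) are backed by
// contiguous machine frames that start on a 4MB boundary.
// pfn_to_mfn stands in for the MFN list provided by the domain builder.
bool canUse4MBPage(const std::vector<unsigned long>& pfn_to_mfn,
                   std::size_t first_pfn)
{
  if (first_pfn % FRAMES_PER_4MB != 0)
    return false;  // pseudo-physical range is not 4MB aligned
  if (first_pfn + FRAMES_PER_4MB > pfn_to_mfn.size())
    return false;  // range exceeds the memory described by the table

  const unsigned long first_mfn = pfn_to_mfn[first_pfn];
  if (first_mfn % FRAMES_PER_4MB != 0)
    return false;  // machine block does not start on a 4MB boundary

  for (std::size_t i = 1; i < FRAMES_PER_4MB; ++i)
    if (pfn_to_mfn[first_pfn + i] != first_mfn + i)
      return false;  // machine frames are not contiguous
  return true;
}
```

With machine frames handed out individually by the hypervisor, such a check would be expected to fail for most ranges, which is why the 4MB mappings have to be replaced by mappings with 4KiB pages.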



11.4.7 Console 

Under native SWEB there are two possible consoles: TextConsole and
FrameBufferConsole. Direct frame buffer access is not allowed under Xen
at the moment. Therefore FrameBufferConsole must not be created under
Xen. This is easily achieved, as its creation depends on the return value of
ArchCommon::haveVESAConsole. Unfortunately the design of the console part was not
too clear from the beginning. At the moment a TextConsole is created if there is no
VESAConsole. The problem is that the TextConsole also writes directly to the graphics
card memory through the pointer returned by ArchCommon::getFBPtr.

Xen provides a virtual console to every domain. It is implemented with a shared
memory page which transfers the data to and from the console. Data transfer is
asynchronous. Several batches of data (up to the maximum buffer size) can be placed in
the page. They will not be read until an event is sent over the console event channel.
It is important to understand that the Xen hypervisor does not handle the console
itself. It only passes on the sent events. Instead dom0 must run a service which
provides the console handling. This is currently a serial console. To access the
console in dom0, enter the command xm console domid in a terminal. This connects the
virtual Xen console to the current tty. To close the connection, press Ctrl+].
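The "place data in the page, then notify" pattern can be sketched as follows. This is a simplified illustration under stated assumptions: the structure ConsoleRing, its field names and its size are made up for this example and are not copied from the Xen headers, and the function writeToRing is hypothetical. The notification itself (sending an event over the console event channel) is not shown.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed layout of the output half of the shared console page.
// Illustrative only; size and field names are not taken from Xen.
struct ConsoleRing
{
  static constexpr std::uint32_t SIZE = 2048;  // must be a power of two
  char          out[SIZE];
  std::uint32_t out_cons = 0;  // advanced by the back end (reader)
  std::uint32_t out_prod = 0;  // advanced by the guest (writer)
};

// Copies up to len bytes into the ring and returns how many were stored.
// Nothing is read by the back end until the guest sends an event on the
// console event channel afterwards.
std::size_t writeToRing(ConsoleRing& ring, const char* data, std::size_t len)
{
  std::size_t stored = 0;
  while (stored < len && (ring.out_prod - ring.out_cons) < ConsoleRing::SIZE)
  {
    ring.out[ring.out_prod % ConsoleRing::SIZE] = data[stored];
    ++ring.out_prod;
    ++stored;
  }
  return stored;
}
```

Because the indices only ever grow and the size is a power of two, the difference out_prod - out_cons gives the fill level even after the counters wrap around. A real implementation would additionally need memory barriers, since the other side of the ring runs in a different domain.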


For Xen 3.0 an experimental patch exists which aims to provide a frame buffer
emulation [12]. This patch was not available at the beginning of the port, but a
similar idea was considered during development. It would be possible to set up part
of the memory as a frame buffer and return the pointer to it. TextConsole could then
write directly to the buffer. To actually write the data to the console provided by
Xen, some kind of driver would have to be notified of changes in this buffer and then
update the Xen console accordingly. This approach has two drawbacks: it would slow
down console output further, and, more importantly, buffer updates are hard to detect
reliably. So this idea was discarded.

The decision was made to implement a XenConsole which is derived from Console.
The clean way would be to make Console an arch common header (e.g. ArchConsole)
and derive a TextConsole in each arch tree. Then main.cpp would simply create a
TextConsole which automatically fits the chosen architecture. Unfortunately it was
already too late for this. So at the moment an ifdef in main.cpp decides which
console is built. This should be fixed at the next major update to the code base.

Each console handles several terminals (common/console/Terminal.h). These provide
a buffer for the screen output and one for input. Instead of directly accessing the
hardware they use the methods provided by the Console. So no changes were needed
in the terminals.

The console also handles the input it gets from the keyboard manager by calling
getKeyFromKbd and passes it on to the active terminal. Under Xen the console is
fed with input data by the console event handler.




11.4.8 Callbacks, Events and Exceptions 

Under Xen interrupts are not directly available to guest domains. They are replaced
by events. Exceptions are available but are evaluated by the hypervisor first.

For a software exception two parts are important: the caller and the callee. The
caller uses hypercalls, which are similar to system calls. For a list of the available
hypercalls see the Xen 3.0 interface documentation [13]. Basically a hypercall is a
trap (software exception). To select which hypercall should be called, the
corresponding number is written into the EAX register. Arguments are passed in EBX,
ECX, EDX, ESI and EDI. After the call has finished EAX contains the return value, if
there is any. See arch/xen/include/hypervisor.h for the assembler stubs and the Xen
public header xen.h for the list of hypercalls and their numbers.

When a hypercall is made the hypervisor is invoked. Depending on the hypercall it
performs the necessary operations itself or delegates the handling of the hypercall
to dom0 or another domain.

Figure 11.4: Event sources 

For more detailed information about traps, faults and exceptions, on which the Xen
hypercalls, events and exceptions are based, read the Intel developer manual,
Volume 3: System Programming Guide [16], especially chapters 13 and 14.

Events There are three possible sources for events (see figure 11.4):

• Virtual interrupts (like the timer Xen provides)

• Physical interrupts (if the domain has the necessary rights)

• Events from inter-domain communication



An example of an inter-domain event is a write to the console. To notify the console
back end that new data is available for output, SWEB Xen sends an event with the
hypercall HYPERVISOR_event_channel_op. The hypervisor then notifies the console
handling domain. At the moment this is always dom0.

Exceptions Exceptions are not the result of a hypercall, but they may arise during
the execution of a hypercall. A good example is the page fault exception raised by
the processor when an address translation fails. Xen receives all exceptions first.
Depending on the cause it handles them itself or passes them on to the currently
running domain.


Two callbacks have to be registered with the hypervisor using
HYPERVISOR_set_callbacks: failsafe_callback and hypervisor_callback. The failsafe
callback is only called when a fault occurs during the execution of the
hypervisor_callback. The segment selectors provided when registering the callbacks
must have ring privilege level 1. FLAT_KERNEL_CS from the Xen public headers is
correct and can be used. arch/xen/source/entry.S contains the needed assembler stubs
for both callbacks. It also contains the stubs used for the trap table, which is
defined in xen_traps.c and registered during initialisation. The virtual exceptions
are implemented there.

When Xen wants to notify us of an event the hypervisor callback is used. Care is
taken that only one event is processed at a time (see the source code comments for
details). In the shared info page there are two bit masks per virtual CPU. One of
them tells the guest which events are pending. The other one is used, as with
interrupts, to mask events so that Xen will not issue another callback while the
current one is being processed.
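How a guest could walk the pending events can be sketched from the comment in the shared_info_t listing in 11.4.2 (PENDING bits, MASK bits and a selector word in which each bit covers one long of the PENDING array). The following is a minimal sketch and not the SWEB Xen implementation: EventState is a reduced stand-in for the real structures, and dispatchEvents and EventHandler are names invented for this example.

```cpp
#include <cstddef>

constexpr std::size_t BITS_PER_LONG = sizeof(unsigned long) * 8;

// Reduced stand-in for the event channel fields of the shared info page
// and the selector word; illustrative, not the real Xen structures.
struct EventState
{
  unsigned long pending[BITS_PER_LONG] = {};
  unsigned long mask[BITS_PER_LONG] = {};
  unsigned long selector = 0;  // bit i set: pending[i] needs scanning
};

using EventHandler = void (*)(std::size_t port);

// Calls handler for every pending, unmasked port and clears its PENDING
// bit. Returns the number of events handled.
std::size_t dispatchEvents(EventState& s, EventHandler handler)
{
  std::size_t handled = 0;
  const unsigned long sel = s.selector;
  s.selector = 0;
  for (std::size_t word = 0; word < BITS_PER_LONG; ++word)
  {
    if (!(sel & (1UL << word)))
      continue;  // the selector says nothing is pending in this word
    const unsigned long active = s.pending[word] & ~s.mask[word];
    for (std::size_t bit = 0; bit < BITS_PER_LONG; ++bit)
    {
      if (!(active & (1UL << bit)))
        continue;
      s.pending[word] &= ~(1UL << bit);  // PENDING is cleared by the guest
      handler(word * BITS_PER_LONG + bit);
      ++handled;
    }
  }
  return handled;
}
```

Masked ports keep their PENDING bit, so they can still be handled manually after unmasking. Since Xen updates these words concurrently, real code has to use atomic operations for reading and clearing them; the plain assignments above are only acceptable in a single-threaded illustration.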

For virtual exceptions the callback is not used. The assembler stubs in entry.S are
called directly. Each of them calls an appropriate function in xen_traps.c. Most of
them are just dummy debug implementations at the moment.

A fully implemented example of event handling is the console input. For compatibility
with the rest of SWEB the class is called KeyboardManager and is located in
arch_keyboard_manager.cpp. It contains a buffer to which the input from the
console page is copied. If an event is received while the buffer is full, the data in
the console page is discarded. XenConsole gets the data by calling getKeyFromKbd.
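The "discard when full" behaviour can be illustrated with a small fixed-size buffer. This is a sketch only: the class KeyBuffer, its capacity and its method names are hypothetical and do not reproduce the actual KeyboardManager code.

```cpp
#include <cstddef>

// Hypothetical fixed-size input buffer, loosely modelled on the behaviour
// described above; the real KeyboardManager may differ.
class KeyBuffer
{
public:
  static constexpr std::size_t CAPACITY = 256;

  // Stores one input byte. Returns false and drops the byte if the
  // buffer is already full.
  bool put(char c)
  {
    if (count_ == CAPACITY)
      return false;
    data_[(head_ + count_) % CAPACITY] = c;
    ++count_;
    return true;
  }

  // Fetches the oldest byte. Returns false if the buffer is empty.
  bool get(char& c)
  {
    if (count_ == 0)
      return false;
    c = data_[head_];
    head_ = (head_ + 1) % CAPACITY;
    --count_;
    return true;
  }

private:
  char        data_[CAPACITY];
  std::size_t head_ = 0;   // index of the oldest stored byte
  std::size_t count_ = 0;  // number of stored bytes
};
```

In this picture the event handler would call put for every byte found in the console page, while the console would call get when it asks for the next key.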

11.4.9 Virtual CPUs 

Xen can provide several virtual CPUs (VCPUs for short) to one domain. The number is
not necessarily the same as the number of physical CPUs. Currently SWEB is not aware
of multiple CPUs at all. This is also true for SWEB Xen, which only handles VCPU 0.
This is relevant, for example, when handling events, because they are bound to one
VCPU. The configuration file used to start the SWEB Xen domain should also specify
only one VCPU.




Figure 11.5: Overview over the exception and event implementation 

11.4.10 Block Devices 

Xen includes the emulation of block devices. This is achieved by the
back-end/front-end driver mechanism. Dom0 provides the back-end driver which accesses
the real block device. This can be a physical disk, a physical partition, a network
block device or a single file. Further sources can be used as a block device when
support for them is integrated into the Dom0 code. A guest domain which wants to use
a block device must implement a front-end driver which communicates with the back-end
driver over event channels and shared memory.

Currently there is no implementation of a block device front-end driver in SWEB Xen.
It should be possible to implement one using the existing SWEB block device
infrastructure. Some modifications may be needed.

11.4.11 Boot Process 

The following steps are done during the boot process. 



1. execution starts at _start (arch/xen/source/head.S)

(a) call start_xen_init (arch/xen/source/xen_startup.c)

2. execution of start_xen_init:

(a) copy start_info to a globally accessible area

(b) map shared_info

(c) set up hypervisor_callback, failsafe_callback

(d) initialise pfn to mfn mapping

(e) check and/or create 4MiB kernel mapping

(f) create identity mapping at 3GB

(g) calculate kernel end

(h) calculate free memory

(i) initialise traps

(j) initialise events

(k) activate event delivery

(l) call startup(), which hands over to the normal boot process of SWEB (common/source/main.cpp)

3. execution of startup (continuing the normal SWEB boot process)

11.5 Work to do 

As stated in the introduction the port of SWEB to Xen is not completely finished.
In particular, task switching (threading) is missing. The following short list also
includes optional tasks which are not strictly needed to run SWEB under Xen.

• Task switching (threading). Xen supports this with some hypercalls.

• Correct the partly weakened architecture separation.

• Find a real debugging solution. Note that since Xen 3.0 xen/tools/debugger
exists, which should be a good starting point.

• Xenstore: At the moment there is no support for the Xenstore in SWEB Xen at
all. As the current development of Xen seems to make more use of the Xenstore,
support for it should be added.

• Virtual block devices: As SWEB already has block device support, it would be a
nice task to add support for the virtual block devices provided by Xen. Note that
a redesign of arch_bd_manager.h might be necessary for this.

• Virtual network card: Xen also offers virtual network cards. SWEB does not
know anything about network cards at the moment. Therefore an architecture
independent interface is needed first; implementations for real hardware and for
the virtual cards provided by Xen can then be added.




11.6 Appendix: Install guide 

11.6.1 General 

The SWEB port was done on a checkout of the xen-3.0-testing repository. 

This checkout is available for download. 

11.6.2 Getting Xen 

Information on how to get a fresh checkout of the source repository can be found at
http://www.xensource.com/

To get the Xen-3.0 testing snapshot we used for the SWEB port, download
the tarball from
http://www.sbox.tugraz.at/home/r/rotho/sweb/xen-3.0-testing-snapshot.tar.gz.

The Linux command “wget http://www.sbox.tugraz.at/home/r/rotho/sweb/xen-3.0-testing-snapshot.tar.gz”
will do that.

11.6.3 Dependencies 

To build Xen, XenLinux and the documentation that comes with the source, you will
need the following libraries/packages:

• make 

• libncurses5-dev 

• libncurses5 

• gcc 

• libc6-dev 

• zlib1g-dev 

• python 

• python-dev 

• python-twisted 

• bridge-utils 

• iproute 

• libcurl3 

• libcurl3-dev 

• bzip2 



• module-init-tools 

• latex 

• latex2html 

• transfig 

• tgif 

11.6.4 Prepare installation 

First the downloaded tarball must be extracted.

Usually this is done with:

“tar -xzf xen-3.0-testing-snapshot.tar.gz”.

Change into the extracted xen source directory (usually “xen-3.0-testing”).

Although debugging is not used at the moment you should enable it right away. At
the very least this leads to more and better debug messages.

Open the file “xen/Rules.mk” with your favourite editor and find the following lines:

verbose ?= n
debug ?= n
perfc ?= n
perfc_arrays?= n
domu_debug ?= n
crash_debug ?= n

Change these lines to:

verbose ?= y
debug ?= y
perfc ?= n
perfc_arrays?= n
domu_debug ?= y
crash_debug ?= y

Enter the xen source directory and run “make clean”.

11.6.5 Building Xen 

The next step is to build the Xen hypervisor. Enter the xen source directory and type
“make world”. This will:

• build xen 

• build the xen tools 

• download 2.6 linux source from kernel.org 

• patch the mentioned linux source 

• build a privileged linux kernel 

• build a guest linux kernel 




Building customized domains 

In some cases it will be necessary to build a customised XenLinux kernel. 

To do so change into the subdirectory of the patched Linux kernel with
“cd linux-2.6.12-xen0”.

Enter “make ARCH=xen menuconfig” and alter the configuration to your needs. 

Then change into the xen source directory and enter “make kernels”. 

11.6.6 Installing Xen & Xen tools 

Note: Root privileges are required to install the Xen kernel, the Xen tools and the
guest kernels. Become root (e.g. with “su”) or use sudo.

To install the xen build and the guest domains type “make install” in the xen directory.

This will: 

• build and install the Xen hypervisor 

• build and install the control tools 

• build and install guest kernels 

• build and install user documentation 

11.6.7 Configuring GRUB 

After building and installing xen, the boot loader has to be configured. 

To do so, modify the GRUB configuration file (usually “menu.lst” or “grub.conf”) which
can be found in “/boot” or “/boot/grub”.

You have to add an entry like this one: 

title Xen 3.0-testing / XenLinux 2.6.12.6-xen0
root (hd0,0)
kernel /boot/xen-3.0.0.gz dom0_mem=270000
module /boot/vmlinuz-2.6-xen0 root=/dev/hda1 ro console=tty0

“title” is a GRUB command. It starts a new boot entry. The name of the entry is set
to the rest of the line, starting with the first non-space character.

The kernel line tells GRUB where the Xen kernel can be found and which boot
parameters should be passed to it.

The dom0_mem=270000 option tells Xen that 270000KB of memory should be allocated
for domain 0.
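To get a feeling for this value it helps to convert it to MiB. The following worked example is only a sketch: it assumes that KB here means 1024 bytes and that the result is rounded to the nearest whole MiB; the helper name kibToMibRounded is made up for this example.

```cpp
// Converts a size in KiB (the unit assumed for dom0_mem here) to MiB,
// rounded to the nearest whole MiB.
constexpr unsigned long kibToMibRounded(unsigned long kib)
{
  return (kib + 512) / 1024;
}

// 270000 / 1024 = 263.67..., which rounds to 264.
static_assert(kibToMibRounded(270000) == 264, "270000 KiB is about 264 MiB");
```

Under these assumptions the value corresponds to about 264 MiB, which matches the memory size shown for Domain-0 in the xm list output in 11.6.9.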

“root” is a GRUB command which sets the current root device. This line will not be
necessary in most cases. Read the GRUB documentation to learn more.

The module line contains the location of the XenLinux kernel which Xen starts as
domain 0 and the kernel parameters which should be passed to the XenLinux kernel.
Those kernel parameters are standard Linux boot parameters.

11.6.8 SWEB under Xen 

We assume in this section, that: 

• there is already a SWEB source tree including the “arch/xen” subdirectory 

• SWEB has been compiled for x86 architecture successfully before 

Building SWEB for Xen 

Remove the x86 build of SWEB manually.

If you don’t know in which directory your binary build is located, enter the SWEB
source directory (in most cases it’s “sweb-stable”) and enter “rm `pwd`-bin/* -rf”.

Change to the SWEB source directory and enter “make ARCH=xen”.

Note: If you use your own Xen checkout you have to copy the Xen public headers from
/xen/include/public/* to /arch/xen/include/xen3-public/. Don’t forget to include the
subdirectories.

The domain config file 

Before you can run SWEB as a guest domain, you must create a configuration file. 

At the moment a very simple configuration file is sufficient. If the Xen port is
developed further (e.g. network device support is added) more options may have to be
configured.

A minimal configuration file which fits our needs would look similar to this:

name = "SWEB"
kernel = "/home/user/sweb-testing-bin/sweb_xen.gz"
memory = 64

The “name” entry sets the name of the domain. This name has to be a unique
identifier of the domain.

The “kernel” entry contains the location of the kernel image which should be loaded.

The “memory” entry sets the domain memory size in MiB.

64 MiB of memory should be enough. If SWEB shows strange behaviour or crashes
without reason, try a higher value.

Save the configuration file as “/etc/xen/sweb.cfg”.

11.6.9 Running Xen 

Running dom0 

To run the installed Xen version reboot the system and choose the option “Xen
3.0-testing / XenLinux 2.6.12.6-xen0” from the GRUB boot menu.

Then you can see XenLinux booting, which looks very similar to a conventional Linux
boot.

After the system has booted you should be able to log in and work on your system
as usual.

If the system hangs or you cannot log in for some other reason, reboot the system,
choose your normal system from the GRUB boot menu and read the Xen documentation.

After login start the xend control daemon by typing “xend start”. You have to be root
to do this.

Running a guest domain 

Before you try to run SWEB as a guest you should first verify that your Xen setup is
complete and working. The best way to do this is to create a guest domain with the
domU kernel built during the installation. For Debian the most convenient way is to
use the package “xen-tools”. It includes some scripts for building and managing guest
domain images. See the documentation of the package for details. The most important
script is “xen-create-image”. After you have successfully created a guest domain image
and the configuration file for it, use the command “xm create conffile -c” to start it
and open a console to the newly started domain. If everything works you can go on to
SWEB.

If the creation of the guest domain fails you have to fix this first. Under
“/var/log/” you find the following log files from Xen:

• xend.log 

• xend-debug.log 

• xen-hotplug.log 

Check these for any errors. Also verify your network setup is correct. 



Running SWEB as guest 

To work with the Xen management user interface (xm) you need root privileges.

Now you can boot the SWEB guest domain with “xm create sweb.cfg”.

“xm list” now gives output similar to this:

# xm list
Name        ID  Mem(MiB)  VCPUs  State   Time(s)
Domain-0     0       264      1  r-----    107.9
SWEB         1        64      1  ------     11.5

“xm console SWEB” opens a console to SWEB.

The shortcut CTRL + ] escapes the Xen console.

“xm destroy SWEB” destroys the SWEB guest domain.

“xm create sweb.cfg -c” creates a SWEB guest domain and opens a console to SWEB
without delay.

Type “xm help” or “man xm” for more information.

11.6.10 Alternative to real installation 

If for some reason Xen is not running reliably on your machine there is an alternative.
You need quite some free disk space and a powerful machine (especially processor and
memory). It is possible to run Xen completely under the qemu emulator. Only a short
outline is given here. You will need to read the qemu documentation carefully.

• Install and configure qemu 

• Create a big disk image 

• Install Linux (e.g. Debian with the net install CD) into this image 

• Setup a complete build environment within the image (start the image with qemu 

and then work within the simulation) 

• Install Xen within the image 

• Test the Xen install with a guest image 

• Get SWEB into your image, build and start it there 

Note that most of the time you work inside a complete disk image running under qemu.
So you have a complete Linux install with a complete Xen install which is run by
qemu. This is quite slow: qemu runs under your regular system as a user process, and
within this user process Xen is running together with the dom0 in which you work and
start additional guest domains. But it has advantages. As you installed everything
within a disk image it can be easily deleted. Also you install Xen in a freshly
installed Linux environment, which saves you from modifying your real setup. It also
might be easier to get the network setup done correctly. The main problem is to find
a way to synchronise the SWEB code in the image with the real one outside (e.g. share
it over CVS or use a shared disk with qemu). Read the qemu documentation and choose
the way which is best for you.




Chapter 12 

Hacking Sweb 

This chapter lists some dos and donts for hacking around in sweb.

12.1 C++ specific issues

Because we are working in a freestanding environment there are several C++ features
which cannot be used in sweb.

Exceptions Exceptions cannot be used in sweb because the functionality required
for stack unwinding is not present in sweb (at least not to an extent sufficient to
satisfy the requirements of libstdc++).

RTTI Depending on your compiler RTTI might work, at least partially. Since exceptions
are not possible it is highly unlikely that your compiler will ever be able to
implement it in a sane way. Thus, don’t use it.

Static Objects Static objects can be used, but note that constructors for
static objects will never be called. The memory these objects use will be initialised
to 0 though. So it is much better to wrap static objects in a singleton (but only if
your singleton allocates using new!).
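A singleton of this kind could look like the following minimal sketch. The class name LogBuffer and its members are invented for this example and do not refer to an existing sweb class. The only static data is a plain pointer, which does not need a constructor and is correct when zero-initialised; the object itself is created with new on first use, so its constructor really runs.

```cpp
// Hypothetical kernel object; name and members are made up for this example.
class LogBuffer
{
public:
  // Returns the single instance and creates it on first use.
  static LogBuffer* instance()
  {
    if (!instance_)
      instance_ = new LogBuffer();
    return instance_;
  }

  int capacity() const { return capacity_; }

private:
  LogBuffer() : capacity_(128) {}

  // Plain pointer: zero-initialised memory is already a valid "not yet
  // created" state, so no static constructor is required.
  static LogBuffer* instance_;
  int capacity_;
};

LogBuffer* LogBuffer::instance_ = 0;
```

Note that this sketch is not thread-safe. The first call to instance() should happen before several threads can reach it, or it has to be protected by a lock.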

Inheritance and Pure Virtual Methods Inheritance works without any limitations.
Multiple inheritance should also work, although it was never tested in sweb. One
special thing to take care of is pure virtual methods. If for some reason a pure
virtual method is called, this results in a kernel panic and not just in the
offending thread being killed.

Floating Point Numbers Using floating point math is a really big “no no” in the
kernel. While the FPU is initialised and handled correctly during thread switches, we
currently do not handle any FPU exceptions. Also floating point math is very slow
compared to integer math and there is no good reason to use it in the kernel anyway.


12.2 Dos and Donts 

Interrupts Never ever turn the interrupts off unless you really know what you are
doing. Whilst it is common practice to turn off the interrupts frequently in kernel
mode when working with nachos, there is really no good reason for doing it in sweb. If
you want to make sure that only your thread has access to a resource at a given point
in time, use a lock.

Allocating Memory Memory should only ever be allocated if it is ok for your thread
to be suspended. Since memory allocation can block (i.e. yield the thread) one must
never allocate memory inside an interrupt handler or when interrupts are turned off.
There are exceptions to this rule however: the pagefault and the syscall handlers
both run with interrupts turned on, so it is ok to allocate memory there.

Another important thing to note is that a new or malloc may fail, so always test
whether your allocation succeeded.
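Such a test could look like the following sketch. It is written for a hosted C++ environment, where plain new throws on failure; the nothrow form is used here to mimic an allocator that returns a null pointer instead, which is an assumption about how the kernel allocator reports failure. The function name allocateBuffer is made up for this example.

```cpp
#include <cstddef>
#include <new>

// Allocates a zeroed buffer of len bytes, or returns a null pointer if the
// allocation fails. The caller must check the result before using it.
char* allocateBuffer(std::size_t len)
{
  char* buffer = new (std::nothrow) char[len];
  if (!buffer)
    return 0;  // allocation failed: report it instead of dereferencing
  for (std::size_t i = 0; i < len; ++i)
    buffer[i] = 0;
  return buffer;
}
```

A caller would test the returned pointer, handle the failure (for example by returning an error code from the syscall) and release the buffer with delete[] once it is no longer needed.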

Locks Taking a lock is only allowed outside of interrupt context. Never try to get
a lock with interrupts turned off.

kprintf Both kprintf and kprintfd may take locks and allocate memory. Thus it is
not allowed to use them when interrupts are turned off. For printing out information
in these cases there are the functions kprintfd_nosleep and kprintf_nosleep. These can
be used safely but, on the other hand, cannot guarantee that your messages get printed
at all!

Threads To implement a thread one has to override the virtual Run method. However,
one must never ever return from this Run method.

Stack usage On a typical Linux machine it is ok to use several megabytes of stack
memory. In sweb however one only has a total of 8 kilobytes of stack. It is thus a
wise idea to limit the depth of recursive functions and not to allocate anything on
the stack besides some simple automatic variables. It is certainly a very bad idea to
allocate buffers and the like on the stack. Currently there is no mechanism to kill a
thread once a stack overflow happens, so one should pay special attention to stack
usage.


Bibliography 

[1] Wikipedia:Binary_prefix (http://en.wikipedia.org/wiki/Binary_prefix). 

[2] UMBC EECS Department Course in Systems Design and Programming by Dr. Jim
Plusquellic.

[3] The Linux Virtual File-System Layer. 

[4] Virtual Filesystem. 

[5] Wikipedia:Race_condition (http://en.wikipedia.org/wiki/Race_condition). 

[6] NACHOS (http://www.cs.washington.edu/homes/tom/nachos). 

[7] Wikipedia: Hypervisor (http://en.wikipedia.org/wiki/Hypervisor). 

[8] INTEL Vanderpool (http://cache-www.intel.com/cd/00/00/19/76/197666_197666.pdf). 

[9] AMD Pacifica (http://developer.amd.com/assets/WinHEC2005_Pacifica_Virtualization.pdf). 

[10] Statement to the ReactOS domain builder (http://lists.xensource.com/archives/html/xendevel/2005-06/msg00074.html). 

[11] Grub Multiboot Definition (http://www.gnu.org/software/grub/manual/multiboot). 

[12] Experimental Xen frame buffer (http://wiki.xensource.com/xenwiki/VirtualFramebuffer). 

[13] Xen 3.0 for x86 Interface Manual (http://www.cl.cam.ac.uk/Research/SRG/netos/xen/readmes/interf 

[14] IA-32 Intel® Architecture Software Developer’s Manual, Volume 1: Basic Architecture.
Intel Corporation.

[15] IA-32 Intel® Architecture Software Developer’s Manual, Volume 2A: Instruction Set
Reference, A-M.

[16] IA-32 Intel® Architecture Software Developer’s Manual, Volume 3: System Programming
Guide. Intel Corporation.

[17] Modern Operating Systems. Prentice-Hall, 2001. 

[18] Understanding the Linux Kernel, 2nd Edition, chapter 12. 2002. 
