
The holy bible of SWEB

SWEB-Crew

September 30, 2008




Contents

1 Install
1.1 some notes for installing sweb
1.1.1 Preparing sweb for building
1.1.2 building sweb
1.1.3 Important notes
1.2 running SWEB
1.2.1 bochs
1.2.2 bochsgdb
1.2.3 qemu
1.2.4 VMWare, Linux
1.2.5 VMWare, Windows
1.2.6 XEN
1.2.7 ’real’ hardware
1.3 Developers

2 GDB
2.1 General
2.2 Bochs
2.2.1 Building Bochs with “page fault” patch
2.3 Qemu
2.4 Using GDB
2.4.1 Custom GDB macros
2.5 Using DDD
2.5.1 DDD example session

3 VM, Protection and Paging
3.1 Notation
3.1.1 Memory Types
3.1.2 Units
3.2 Abstract
3.3 Basic Concepts
3.3.1 Protected Mode
3.3.2 Protected Mode Segmentation
3.3.3 Paging
3.4 Interface & Classes
3.4.1 ArchMemory
3.4.2 PageManager

3.4.3 Implementation
3.5 Sweb Boot Process
3.5.1 Boot Sequence
3.5.2 Sweb Global Descriptor
3.5.3 Sweb Boot-Time Paging
3.5.4 Sweb Post-Boot Paging Setup
3.6 Sweb Task Separation and Protection
3.6.1 Segment Descriptor Privilege Level / Task Privilege Level
3.6.2 User- / Supervisor- Page
3.6.3 LoadPageOnDemand
3.6.4 UserThread Termination
3.7 Extending Sweb
3.7.1 Using 4MiB Pages
3.7.2 Additional pages for the kernel
3.7.3 sharing Memory between linear address spaces
3.7.4 Cloning a Task’s linear address space
3.7.5 CopyOnWrite
3.7.6 Virtual Memory

4 Kernelspace Memory Management
4.1 General
4.2 Memory Organisation
4.3 Important Classes
4.3.1 KernelMemoryManager
4.3.2 MallocSegment
4.4 Used Algorithms
4.4.1 allocating memory
4.4.2 freeing memory
4.4.3 reallocating memory
4.5 Interface
4.5.1 Functions

5 Threads and Processes
5.1 Introduction
5.2 Details for x86
5.2.1 Example for saving the CPU registers
5.2.2 Example for reading back the CPU Registers
5.3 The FPU and other problems
5.4 Threads and Processes in SWEB
5.4.1 Kernel Threads
5.4.2 Userspace Processes
5.4.3 Switching to Userspace
5.5 Multiprocessor Machines
5.5.1 The FPU Trick
5.5.2 CPU Affinity
5.5.3 NUMA
5.6 Scheduling
5.6.1 Threads
5.6.2 The Scheduler


6 System Calls
6.1 Base Information about System Calls
6.1.1 Concept of System Calls
6.2 System Calls in the different Operating Systems
6.3 System calls in SWEB
6.3.1 Concept of System Calls in SWEB
6.3.2 How to implement System Calls in SWEB
6.3.3 A convenient interface

7 Character Devices
7.1 Introduction
7.2 Classes
7.3 Character Device Base
7.4 Keyboard
7.5 Serial Port (or how to write a character device driver in SWEB)
7.6 SerialManager
7.7 Terminal
7.7.1 Introduction
7.7.2 Escape codes
7.7.3 Implementation in SWEB
7.8 Console
7.8.1 Text Console
7.8.2 Frame Buffer Console

8 Block Devices
8.1 Introduction
8.1.1 what are block devices
8.1.2 hard drive layout
8.1.3 how can blocks be accessed (chs, lba etc.)
8.2 Implementation
8.2.1 How to implement new low-level block device drivers
8.2.2 How to use block devices
8.2.3 VirtualDevices
8.2.4 BDRequest

9 VFS, FS
9.1 Introduction
9.2 File system
9.2.1 Files
9.2.2 Inodes
9.2.3 Filesystems
9.2.4 Names
9.3 VFS data structures
9.3.1 Superblock Objects
9.3.2 Inode objects
9.3.3 Dentry objects
9.3.4 File objects
9.3.5 derivation-hierarchy
9.3.6 interaction between processes and vfs objects


9.4 RamFS - memory filesystem
9.4.1 simplification from the RamFS
9.4.2 file vs. directory
9.5 extension

10 Locking
10.1 Introduction
10.2 Locks
10.2.1 SpinLock
10.2.2 Mutex
10.3 Using Locks
10.4 Critical Sections in the Sweb-Kernel
10.4.1 KMM
10.4.2 PM
10.4.3 Loader
10.4.4 CV
10.4.5 Console and Terminal
10.4.6 Scheduler

11 Xen
11.1 Introduction
11.2 Architecture separation
11.2.1 Building environment
11.3 Xen
11.3.1 Overview
11.3.2 Architecture
11.3.3 Memory Layout
11.4 SWEB-Xen
11.4.1 Guest Section
11.4.2 Start Info and Shared Info Page
11.4.3 PageManager and KernelMemoryManager
11.4.4 Memory regions and modules
11.4.5 Memory Layout
11.4.6 PSE
11.4.7 Console
11.4.8 Callbacks, Events and Exceptions
11.4.9 Virtual CPUs
11.4.10 Block Devices
11.4.11 Boot Process
11.5 Work to do
11.6 Appendix: Install guide
11.6.1 General
11.6.2 Getting Xen
11.6.3 Dependencies
11.6.4 Prepare installation
11.6.5 Building Xen
11.6.6 Installing Xen & Xen tools
11.6.7 Configuring GRUB
11.6.8 SWEB under Xen


11.6.9 Running Xen
11.6.10 Alternative to real installation

12 Hacking Sweb
12.1 C++ specific issues
12.2 Dos and Don’ts



List of Figures

3.1 Protected Mode Segment Descriptor ((c) Dr. Jim Plusquellic [2])
3.2 Example Mapping between linear and physical Memory
3.3 4k page address translation ((c) by Intel)
3.4 Page-Fault-Handling with possible CopyOnWrite
3.5 Example mapping using virtual memory
3.6 Sweb Page-Fault-Handling and possible Virtual Memory Extension
4.1 linear address space
4.2 memory segments
4.3 MallocSegment
5.1 Saving the cpu registers in an interrupt handler
5.2 Reading back the cpu registers of a Thread
5.3 Initialisation of the Registers for a Kernel Thread
5.4 Initialisation of the Registers for a Userspace Thread
5.5 Reading back the cpu registers of a Thread and switch to Userspace
6.1 The implementation of the system calls
6.2 The definition list of the system calls
6.3 The implementation of the system calls
6.4 The implementation of the system calls
6.5 The implementation of the system call write
11.1 Xen memory layout
11.2 Page frame numbers: PFN and MFN
11.3 SWEB Xen memory layout
11.4 Event sources
11.5 Overview over the exception and event implementation



List of Tables

3.1 Segment Descriptor Fields
3.3 Explanation of Fields in Page-Directory- and Page-Table-Entries
3.4 Control Registers and Paging
3.5 ArchMemory Interface
3.6 PageManager Interface
3.7 Sweb Global segment Descriptor Table
3.8 Sweb GDT, more readable
3.9 Sweb Linear Memory
4.1 KernelMemoryManager Interface
4.2 MallocSegment Interface
4.3 provided Functions
7.1 Character Device base class methods
7.2 Character Device input and output buffers
7.3 The Inode functions implemented by CharacterDevice
7.4 I/O ports for communicating with keyboard
7.9 Terminal class base
7.10 Terminal input class methods
7.11 Terminal output class methods
7.12 Console output functions



Chapter 1

Install

Basic guidelines for building and installing SWEB

1.1 some notes for installing sweb

First of all, sweb runs on any Intel Pentium (R) or compatible platform with at least 16 MB of RAM, a VGA display, and a floppy or IDE hard disk drive for booting. SWEB may run on other platforms too, but this has not been tested yet.

You may not want to run this OS on ’real’ hardware while testing. But please keep in mind that this OS should run not only in one particular emulator but also on real hardware. To test it on real hardware, you have to install sweb on a hard drive.

1.1.1 Preparing sweb for building

Sweb uses cmake to generate the makefiles, which can then be used with the standard make command. Before running cmake you can change some options that influence the makefile generation. For more information on cmake, run ’man cmake’ or look at www.cmake.org. There is a file called MakeOptions in the sweb (source) directory. Open it in an editor of your choice and set the options according to your needs.

The default MakeOptions file looks similar to the following:

# Architecture to use: x86, xen
SET (ARCH x86)

# Added before the BIN_DIR_NAME (e.g. "../")
SET (BIN_DIR_PREFIX "../")

# Name of the Binary dir (e.g. sweb-bin)
SET (BIN_DIR_NAME "sweb-bin")

# Doxygen output directory
# Result must include escaping characters e.g. "..\\/sweb-docs"
SET (DOC_DIR "..\\/sweb-docs")

# Path to the bochs executable to use (patched bochs at different location)
SET (BOCHS_PATH "bochs")
#SET (BOCHS_PATH "/opt/sweb-bochs/bin/bochs")

#'true' to enable verbose compile output
set(CMAKE_VERBOSE_MAKEFILE false)

The resulting binary directory is determined as follows:

(sweb source directory)/BIN_DIR_PREFIX/BIN_DIR_NAME

If your sweb sources are located at

/home/user/sweb

the resulting binary directory (using the above MakeOptions file) would be:

/home/user/sweb-bin

Make sure that the parent directory of your desired sweb binary directory is writable.
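The path rule above can be sketched in shell. This is only an illustration under one assumption: that the binary directory is the source directory joined with BIN_DIR_PREFIX and BIN_DIR_NAME, which agrees with the /home/user/sweb example. The function names normalize_path and sweb_bin_dir are hypothetical and not part of sweb.

```sh
# Collapse "//", "." and ".." in an absolute path by pure string handling,
# without touching the filesystem. The body runs in a subshell, so the
# changes to IFS and globbing do not leak into the caller.
normalize_path() (
    set -f
    IFS=/
    out=""
    for part in $1; do
        case "$part" in
            ''|.) ;;                      # empty or "." component: skip
            ..)   out="${out%/*}" ;;      # "..": drop the last component
            *)    out="$out/$part" ;;
        esac
    done
    printf '%s\n' "${out:-/}"
)

# Assumed rule: (source dir)/BIN_DIR_PREFIX/BIN_DIR_NAME
sweb_bin_dir() {
    normalize_path "$1/$2/$3"
}

sweb_bin_dir /home/user/sweb "../" "sweb-bin"   # prints /home/user/sweb-bin
```

Working on the path string alone means the check also works before the binary directory has been created.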

Now it’s time to let cmake do its job. Navigate to your sweb (source) directory, type

$ cmake .

(don’t forget the dot) and check the command output for errors. If no error occurred, the makefiles have been generated successfully.

1.1.2 building sweb

Now you can type

$ make

to build sweb.
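The two steps can be combined so that make only runs when cmake succeeded. The following is a minimal sketch; the function name configure_and_build is made up for this example, and it assumes cmake and make are on the PATH.

```sh
# Configure and build in one step, stopping as soon as a step fails.
# The body runs in a subshell so the caller's working directory is unchanged.
configure_and_build() (
    cd "$1" || exit 1
    if ! cmake .; then
        echo "cmake failed, makefiles were not generated" >&2
        exit 1
    fi
    make
)

# Usage (the path is an example): configure_and_build /home/user/sweb
```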

You might have to install the following packages (Debian / Gentoo package names in parentheses):

• cmake >=2.4 (cmake / cmake)
• libcom_err.so (comerr-dev / ??)
• nasm (nasm / nasm)
• g++ (g++ / )
• gdb (gdb / gdb)
• emulator of your choice:
  – bochs (bochs-wx, bochs-x, bochsbios, bochs, vgabios / bochs)
  – qemu (qemu / qemu)
  – VMWare (http://www.vmware.com)
  – XEN
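A quick way to see which of the tools listed above are still missing is to look each one up on the PATH. This is a small sketch, not part of sweb: the helper name missing_tools is made up, and the executable names in the usage line are assumptions about how the packages install their binaries. libcom_err.so is a library and is not covered by this check.

```sh
# Print every tool named on the command line that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: missing_tools cmake make nasm g++ gdb bochs
```

An empty output means every listed tool was found.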


1.1.3 Important notes

Adding/removing source files

When you add source files to your sweb sources (or remove them), don’t forget to run cmake from your sweb (source) directory again:

$ cmake .

Moving/renaming source directory

When you move your sweb source directory (e.g. /home/user/sweb -> /home/user/foo/bar/sweb), run the relocation script from the sweb directory, then run cmake again:

$ ./relocate_project.sh

$ cmake .

1.2 running SWEB

1.2.1 bochs

This might be the best emulator for debugging, although it might have some problems. We suggest using bochs version >= 2.2.1. Older versions seem to load grub without problems but fail to execute the kernel image. We supply a ready-to-use config file (bochsrc) and a patched grub to start it.

Start with "make bochs".

The first time you run "make bochs", it internally runs "make bochsconfig", which generates a ready-to-use bochsrc file.

Configuring bochs

The bochs configuration files are located at ./utils/bochs. There are two files of interest: bochsrc (which is generated by make config or make bochsconfig) and bochsrc.pattern, which is the template for bochsrc. So when you need to edit the bochs configuration, always edit bochsrc.pattern and run make config from the sweb directory; that way your changes don’t get lost.

1.2.2 bochsgdb

Almost the same as bochs but with gdb support.

Start with "make bochsgdb".

1.2.3 qemu

Qemu 0.6.2 has been tested too.

Start with "make qemu".


1.2.4 VMWare, Linux

After building, there is a ready-made config file (sweb.vmx) for booting SWEB in VMWare Workstation for Linux (>=4.0.5) in the sweb-testing-bin directory. Older versions might work but have not been tested.

Start with "make vmware".

1.2.5 VMWare, Windows

Currently we do not support compiling on Windows, but you can compile everything on Linux and use it on Windows. You can mount the sweb-testing-bin directory on your Windows host (SMB, NFS, ...) and simply open the sweb.vmx file in VMWare, or you can copy all necessary files (SWEB.vmdk, SWEB-flat.vmdk, nvram, sweb.vmx) from the sweb-testing-bin directory to your Windows host and start it. Due to the hard disk image file format you have to use VMWare Workstation for Windows version 3.2 or above.

1.2.6 XEN

Experimental; see the instructions in the Xen chapter.

1.2.7 ’real’ hardware

To install sweb on a physical hard drive, you first have to create at least 2 partitions on it (3 if you need the third one for swapping). Then you must install grub onto the first partition of the hard drive (which has to be the active one) and copy sweb-bin/kernel.x and sweb/images/menu.lst.hda to the first partition (to the destinations /boot/kernel.x and /boot/grub/menu.lst). Then copy the user programs you want to run (which can be found in sweb-bin/userspace/) to the second partition, which has to be minix-formatted.
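The copy steps described above can be sketched as a small shell function. It is a sketch under stated assumptions, not an official installer: the function name install_sweb_files is made up, both partitions are assumed to be created, formatted and mounted already, and grub is assumed to be installed on the first one.

```sh
# Copy the kernel, the grub menu and selected user programs to two
# already mounted partitions.
#   $1 = sweb source dir     $2 = sweb binary dir
#   $3 = mount point of the first (boot) partition
#   $4 = mount point of the second (minix-formatted) partition
#   remaining arguments = names of the user programs to copy
install_sweb_files() {
    [ "$#" -ge 4 ] || return 1
    local src_dir="$1" bin_dir="$2" boot_mnt="$3" user_mnt="$4" prog
    shift 4
    mkdir -p "$boot_mnt/boot/grub" || return 1
    cp "$bin_dir/kernel.x" "$boot_mnt/boot/kernel.x" || return 1
    cp "$src_dir/images/menu.lst.hda" "$boot_mnt/boot/grub/menu.lst" || return 1
    for prog in "$@"; do
        cp "$bin_dir/userspace/$prog" "$user_mnt/" || return 1
    done
}
```

The mount points are passed in as arguments because the guide does not prescribe where the partitions are mounted.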

1.3 Developers

For basic information you should have a look at our homepage (http://sweb.sourceforge.net).

Known bugs:

• make doesn’t like arguments except
  – "clean"
  – "config"
  – "mrproper"
  – "bochs"
  – "bochsgdb"
  – "install"
  – "qemu"
  – "all"
  – "info"
  – "submit"
• Sometimes bochs requires "/usr/share/bochs/VGABIOS-lgpl-latest". If it is not already there, copy it using "cp /usr/share/vgabios/vgabios.bin /usr/share/bochs/VGABIOS-lgpl-latest" or something similar (bochs doesn’t seem to like symlinks).
• Use "make -i" with care: it may help by ignoring compile errors, but you should really know about the effects.
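The VGABIOS workaround from the list above can be wrapped in a helper that only copies when the target is missing. This is a sketch; the name copy_if_missing is made up, and the paths in the usage line are the ones quoted in the note and may differ on your system.

```sh
# Copy SRC to DST only when DST does not exist yet. A real copy is made
# (no symlink), since bochs reportedly does not like symlinks here.
copy_if_missing() {
    [ -e "$2" ] && return 0
    cp "$1" "$2"
}

# Usage with the paths from the note above (may need root privileges):
# copy_if_missing /usr/share/vgabios/vgabios.bin /usr/share/bochs/VGABIOS-lgpl-latest
```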



Chapter 2

GDB

2.1 General

The GNU Project Debugger provides a way to view internal information about programs during their execution, which can be essential for finding non-obvious errors in the program source code. Fortunately, the bochs emulator provides a remote GDB stub, which can be used to debug operating systems in a rather comfortable way through gdb, provided the source code is available.

2.2 Bochs

In order to use the gdb stub, Bochs has to be configured by passing the argument --enable-gdb-stub to the configure script provided with the bochs sources. Note that this conflicts with bochs’ own debugger, so that debugger cannot be used in such a build.

$ ./configure --enable-gdb-stub

Once Bochs has been configured, built and installed this way, it can be started with

$ make bochsgdb

from the SWEB top directory. Bochs then awaits a gdb connection at localhost:1234.

2.2.1 Building Bochs with “page fault” patch

When you start debugging with gdb you will notice that gdb stops on every page fault. This becomes very annoying as the userspace programs grow, so there is a patch to eliminate this behavior.

• Download bochs-2.3.7.tar.gz from http://bochs.sourceforge.net

• Extract with

$ tar xvzf bochs-2.3.7.tar.gz

• Navigate to the extracted directory and patch the source files

$ patch -p0 < /bochs_2.3.7_page_fault.patch

The patch file is located at ...sweb/utils/bochs

• Once the source files are patched we can run the configure script. It is recommended to use an alternative prefix (installation location) to be able to switch between different bochs versions (e.g. the patched and the unpatched version)

$ ./configure --enable-gdb-stub --prefix=/opt/sweb-bochs

• Once the configuration is done, start compiling and installing

$ make

$ sudo make install

Needless to say, you need some default dev packages like gcc to successfully complete the compile process. In addition, you need a package called xorg-dev, or x-window-system-dev on some other systems.

• Compiling bochs yourself does not replace installing the packages vgabios and bochsbios
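The build steps above can be chained so that each one only runs if the previous one succeeded. The following is a sketch, not a script shipped with sweb: the function name build_patched_bochs is made up, the tarball is assumed to be downloaded already, and the location of the patch file is passed in as an absolute path because this guide only gives it as ...sweb/utils/bochs.

```sh
# Extract, patch, configure, build and install bochs.
#   $1 = path to bochs-2.3.7.tar.gz
#   $2 = absolute path to the page fault patch file
#   $3 = installation prefix, e.g. /opt/sweb-bochs
# The body runs in a subshell, so the caller's directory is unchanged.
# The archive is extracted into the current directory.
build_patched_bochs() (
    tarball=$1; patch_file=$2; prefix=$3
    tar xzf "$tarball" &&
    cd "$(basename "$tarball" .tar.gz)" &&
    patch -p0 < "$patch_file" &&
    ./configure --enable-gdb-stub --prefix="$prefix" &&
    make &&
    sudo make install
)
```

Chaining with && rather than relying on "set -e" keeps the stop-on-first-error behavior even when the function is called inside an if or an || list.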

2.3 Qemu<br />

For using the gdb stub from qemu, start it with<br />

$ make qemugdb<br />

2.4 Using GDB<br />

With bochs or qemu started as described before, gdb can connect to it. There is a<br />

make target rungdb which preconfigures GDB and connects to localhost:1234.<br />

$ make rungdb<br />

.<br />

.<br />

.<br />

0x0000fff0 in ?? ()<br />

You can now set trace and breakpoints.<br />

Enter 'continue' (or 'c') when you're done.<br />

(gdb)<br />

Once bochs or qemu exits you could quit GDB and run make rungdb again, but then<br />

all your temporary settings like breakpoints are gone and you have to set them up<br />

again. There is another way to reconnect GDB to bochs or qemu: just run the<br />

following from inside GDB and all your breakpoints and other settings remain active.<br />

target remote localhost:1234<br />

From here, gdb can be used as usual, except that it is not necessary to call run, as<br />

the simulation is already started but paused for command input. It can be resumed<br />

using the continue command.<br />


The documentation for GDB can be found at http://www.gnu.org/software/gdb/documentation/.<br />

A very useful tutorial which covers the major features of GDB<br />

can be found at http://dirac.org/linux/gdb/. More tutorials can be found on the<br />

web using a search engine.<br />

2.4.1 Custom GDB macros<br />

When running GDB as described above, some custom macros are installed automatically.<br />

GDB macros simplify the debugging process by providing new commands.<br />

The following is a list of the available custom commands.<br />

• continue_on_pagefault: Tells GDB to ignore all page faults (from userspace<br />

and kernelspace). It only works with the patched bochs version.<br />

• stop_on_pagefault: Tells GDB to stop on every page fault. If the page fault<br />

comes from kernelspace, the backtrace (to determine the origin of the page fault)<br />

is shown.<br />

• btpf_enable_full: Also shows local variables and their values in the page<br />

fault backtrace.<br />

• btpf_disable_full: Shows the short version of the page fault backtrace.<br />

• backtrace_pagefault: Shows the backtrace of the page fault again. Only call<br />

this when the debugger has stopped because of a page fault.<br />

To call one of the above macros just enter its name at the gdb command prompt. To<br />

show the help for these custom macros enter help user-defined.<br />

2.5 Using DDD<br />

The data display debugger (ddd) is a graphical front end to gdb. The gdb section above<br />

also applies to ddd. To run ddd just enter<br />

$ make runddd<br />

and you will get a preconfigured ddd environment. In the lower area you can find the<br />

gdb command prompt where you can enter gdb commands as described above.<br />

2.5.1 DDD example session<br />

This example assumes that you have compiled bochs with the page fault patch, configured<br />

MakeOptions correctly to use this bochs and that all necessary tools are<br />

installed.<br />

Open two separate terminals, navigate to the sweb directory and run the following<br />

from Term1<br />

$ make && make bochsgdb<br />

Now a bochs window should open up and you should see something like<br />

Waiting for gdb connection on port 1234<br />


on Term1<br />

Now switch to Term2 and enter<br />

$ make runddd<br />

Now ddd should start and display the gdb command prompt in the lower area. Next,<br />

open main.cpp (File/Open Source...) and scroll down to startup. Set a<br />

breakpoint anywhere inside the startup method (right click in front of the line/Set<br />

Breakpoint). Once the breakpoint is set, your window should look similar to the<br />

following:<br />

Now you can enter continue (or c) in the gdb command window and you will notice<br />

that bochs continues execution. Once the execution reaches your breakpoint, ddd<br />

will show a green arrow in front of the current line. Now you have several options to<br />

continue. You can simply enter c to execute to the next breakpoint, or use step, next, stepi, and<br />

so on. Take a look at the ddd command window.<br />


To get more information about the different commands enter help (e.g.<br />

help step) in the gdb command window.<br />


Chapter 3<br />

VM, Protection and Paging<br />

3.1 Notation<br />

3.1.1 Memory Types<br />

linear M. Intel terminology for linearly addressed memory.<br />

virtual M. Memory whose addresses don't map directly to a physical memory address or<br />

to physical memory at all.<br />

physical M. describes short-access-time memory like Random Access Memory.<br />

swap M. describes long-access-time memory used to extend physical memory, usually<br />

hard disk memory.<br />

Definition taken from Chapter 3.3.2 of the Intel Software Developer's Manual [14]:<br />

When using the IA-32 architecture’s paging mechanism (paging enabled),<br />

linear address space is divided into pages which are mapped to virtual memory.<br />

The pages of virtual memory are then mapped as needed into physical<br />

memory. When an operating system or executive uses paging, the paging<br />

mechanism is transparent to an application program. All that the application<br />

sees is linear address space.<br />

3.1.2 Units<br />

The units used in this document conform to the IEC 60027-2 A.2 Standard [1].<br />

KiB 2^10 Byte<br />

MiB 2^20 Byte<br />

GiB 2^30 Byte<br />

3.2 Abstract<br />

The following document aims to give a detailed introduction to the x86 segmentation and<br />

paging mechanisms. It further aims to acquaint the reader with the use of these<br />

concepts in the Sweb operating system.<br />


Figure 3.1: Protected Mode Segment Descriptor ((c) Dr. Jim Plusquellic [2])<br />

3.3 Basic Concepts<br />

3.3.1 Protected Mode<br />

The IA32 architecture offers a feature set called "Protected Mode", which makes task<br />

separation possible. Using the protected mode capabilities, we can place restrictions<br />

on what a piece of code is allowed to do and in which part of the physical memory<br />

it may work. The two mechanisms making this possible are Protected Mode<br />

Segmentation and Paging. They will be explained in this chapter. Note that in order to<br />

switch to Protected Mode, at the very least the special registers GDTR and IDTR must<br />

have been loaded.<br />

3.3.2 Protected Mode Segmentation<br />

Protected Mode Segmentation works differently from Real Mode Segmentation. In this<br />

document we consider only Protected Mode Segmentation.<br />

It works by defining memory segments using so-called Segment Descriptors.<br />

They contain a Start Address in linear memory and a Limit on how much memory<br />

beyond the start address a program is allowed to access. The segment descriptor also<br />

defines restrictions on the kind of operations a program can execute. An access that<br />

violates these restrictions triggers a Segment Violation. Even before switching to Protected Mode, some Segment<br />

Descriptors need to be defined in the so-called Global Descriptor Table (GDT for short).<br />

Like any descriptor table, the GDT is an array of Segment Descriptors, each 8 bytes long.<br />

Obviously, a restricted program cannot be allowed to (re)write the GDT, and the<br />

available protection mechanisms should be employed to prevent this. The GDT should<br />

contain at least one reasonable Segment Descriptor beyond the first entry. The first<br />

entry, also called the Null Descriptor, has a special purpose and all its values are set to<br />

zero. Once a GDT has been set up, we can point the CPU to its location in memory by<br />

loading its linear address into the Global Descriptor Table Register (GDTR for short) using<br />

the LGDT instruction. Figure 3.1 shows the fields of a Segment Descriptor. Table 3.1<br />

explains their use.<br />
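The way the descriptor fields share the 8 bytes can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not Sweb code: the helper name makeSegmentDescriptor and its parameters are hypothetical, and the bit positions follow the Intel descriptor layout.<br />

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Sweb): packs Base, Limit and the flag
// bits into the 8-byte protected-mode segment descriptor format.
// access holds P, DPL, S, Type and A (descriptor bits 40..47),
// flags holds G, D and the user defined bit (descriptor bits 52..55).
std::uint64_t makeSegmentDescriptor(std::uint32_t base, std::uint32_t limit,
                                    std::uint8_t access, std::uint8_t flags)
{
  assert(limit <= 0xFFFFF);  // Limit is a 20-bit field
  assert(flags <= 0xF);      // only four flag bits exist

  std::uint64_t d = 0;
  d |= static_cast<std::uint64_t>(limit & 0xFFFF);             // Limit bits 15..0
  d |= static_cast<std::uint64_t>(base & 0xFFFFFF) << 16;      // Base bits 23..0
  d |= static_cast<std::uint64_t>(access) << 40;               // P, DPL, S, Type, A
  d |= static_cast<std::uint64_t>((limit >> 16) & 0xF) << 48;  // Limit bits 19..16
  d |= static_cast<std::uint64_t>(flags) << 52;                // G, D, user bit
  d |= static_cast<std::uint64_t>((base >> 24) & 0xFF) << 56;  // Base bits 31..24
  return d;
}
```

For example, a flat ring-0 code segment (Base 0, Limit 0xFFFFF, access byte 0x9A for present/ring 0/code/execute-read, flags 0xC for 4 KiB granularity and 32 bit code) yields the descriptor value 0x00CF9A000000FFFF.<br />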

Obviously, defining just one set of Descriptors for all programs is not going to help<br />

much with Task Separation, so conveniently the Local Descriptor Table (LDT for short)<br />

comes to the rescue. The LDT, just like the GDT, is an array of Segment Descriptors,<br />


Field Bits Use<br />
Limit 20 bits defines the amount of memory (in Limit-Units, see G), starting from the Base Address, which a program is allowed to access<br />
Base 32 bits the linear starting address of the Segment<br />
P 1 bit Descriptor Present Flag, if set to 0 the entry is ignored<br />
DPL 2 bits Descriptor Privilege Level<br />
S 1 bit Descriptor Type Flag: if clear, this is a special system descriptor; if set, it describes a code or data segment<br />
Type 3 bits defines the segment type (note: interpretation is different in system descriptors)<br />
Type=000 Data, read-only<br />
Type=001 Data, read/write<br />
Type=010 Stack (expand-down data), read-only<br />
Type=011 Stack (expand-down data), read/write<br />
Type=100 Code, execute-only<br />
Type=101 Code, execute/read<br />
Type=110 Code, execute-only, conforming<br />
Type=111 Code, execute/read, conforming<br />
A 1 bit set by the CPU, indicates if a Segment has been accessed<br />
G 1 bit Granularity: defines the size of one Limit-Unit<br />
G=0 Granularity is 1 Byte, segments are 1 Byte to 1 MiB in length<br />
G=1 Granularity is 4 KiB, segments are 4 KiB to 4 GiB in length<br />
D 1 bit If 1: all instructions are assumed to be 32 bit, 16 bit otherwise.<br />
U 1 bit user defined bit<br />

Table 3.1: Segment Descriptor Fields<br />


but contains only Descriptors specific to one program. Usually there is one LDT for<br />

each separate program. The running program can then choose between segments<br />

by writing the corresponding index into one of the segment registers (CS for Code<br />

Segment Register, DS for Data Segment Register, SS for Stack Segment Register).<br />
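The value written into a segment register is not the bare table index. On IA32 it is a 16-bit selector that also says which table is meant (GDT or LDT) and carries a requested privilege level. The following minimal sketch shows that encoding; the function name encodeSelector is an assumption made for this example and does not appear in Sweb.<br />

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: builds a 16-bit segment selector.
// bits 15..3: index into the descriptor table
// bit  2    : table indicator (0 = GDT, 1 = LDT)
// bits 1..0 : requested privilege level (0..3)
constexpr std::uint16_t encodeSelector(std::uint16_t index, bool use_ldt,
                                       std::uint8_t rpl)
{
  assert(index < 8192);  // 13 index bits
  assert(rpl <= 3);      // 2 privilege bits
  return static_cast<std::uint16_t>((index << 3) | (use_ldt ? 4 : 0) | rpl);
}
```

With this encoding, GDT entry 1 at privilege level 0 becomes selector 0x08, and GDT entry 3 at privilege level 3 becomes 0x1B.<br />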

Privilege Level<br />

Four privilege levels, numbered 0 to 3, exist on the IA32 platform. The privilege<br />

level determines how large a part of the instruction set a piece of code can use without<br />

triggering a Protection Fault. The highest privilege level is 0; it is usually held only by the<br />

operating system, since everything is allowed here. In Sweb, the Kernel has Level<br />

0 while user programs have Level 3. The Privilege Level is defined by setting the<br />

DPL field of the Segment Descriptor in which a program's code runs. When changing<br />

Segments (by loading a different Descriptor Index into a Segment Register) a program<br />

may only decrease or hold its present privilege level. A program can, however, access<br />

privileged code through a Call Gate or by asking the Operating System to increase its<br />

Privilege Level.<br />

3.3.3 Paging<br />

Paging is the other protection mechanism and, unlike Segmentation, was newly<br />

introduced with Protected Mode. (Of course, the principle existed long before the<br />

IA32 platform adopted the concept.)<br />

Paging works by dividing the physical memory into Pages and<br />

mapping Pages in linear memory to Pages in physical memory. A Page can be 4 KiB,<br />

2 MiB or 4 MiB in size. It is also possible to mix 4 KiB and 4 MiB Pages. In Sweb we<br />

mainly concern ourselves with 4 KiB Pages. Physical Memory is the range of addresses<br />

which translate directly into a memory location in the hardware. A linear memory<br />

address, however, does not necessarily correspond to an address in physical hardware<br />

(see Figure 3.2). Rather, the CPU translates a linear address into a physical address<br />

according to the mapping that was set up by the operating system. Separation works<br />

by assigning each program its own linear address space ranging from 0 to 4 GiB.<br />

In Sweb, however, a user thread is not allowed to access an address beyond 2 GiB.<br />

Translation happens transparently and usually only the operating system knows what<br />

the mapping looks like. Should an unmapped linear address be referenced, the CPU<br />

raises a Page-Fault (a Fault is a type of Exception) and calls Sweb's Page-Fault-Handler<br />

(as defined in the IDT) to deal with the problem.<br />

Page-Directory and Page-Tables<br />

The Mapping of Linear Memory to Physical Memory is defined in a Page-Directory<br />

and its Page-Tables. A Page-Directory has 1024 entries. Each Entry has 32 bits.<br />

A Page-Directory-Entry can either map a 4MiB Page or hold the physical address<br />

of a Page-Table. (Actually it holds only the most significant 20 bits of the physical<br />

address, which equal the physical 4KiB page number). A Page-Table again has 1024<br />

Entries, each 32 bits in size. A Page-Table-Entry maps to a 4KiB Page in physical<br />

memory. The structure of the Page-Directory-Entries and Page-Table-Entries is shown<br />

in the following listing of paging-definitions.h. Note that the most significant 20 bits<br />

in a Page-Directory-Entry can have different interpretations depending on the state<br />


Figure 3.2: Example Mapping between linear and physical Memory<br />


of the use_4_m_pages bit. Sweb defines a union of the two structs so both<br />

interpretations can be used conveniently. The bits are further explained in Table 3.3.<br />

#include "types.h"<br />
<br />
#define PAGE_TABLE_ENTRIES 1024<br />
#define PAGE_SIZE 4096<br />
#define PAGE_INDEX_OFFSET_BITS 12<br />
<br />
// Page-Directory-Entry that points to a Page-Table (4KiB Pages)<br />
struct page_directory_entry_4k_struct<br />
{<br />
  uint32 present : 1;<br />
  uint32 writeable : 1;<br />
  uint32 user_access : 1;<br />
  uint32 write_through : 1;<br />
  uint32 cache_disabled : 1;<br />
  uint32 accessed : 1;<br />
  uint32 reserved : 1;<br />
  uint32 use_4_m_pages : 1;<br />
  uint32 global_page : 1;<br />
  uint32 avail_1 : 1;<br />
  uint32 avail_2 : 1;<br />
  uint32 avail_3 : 1;<br />
  uint32 page_table_base_address : 20;<br />
} __attribute__((__packed__));<br />
<br />
// Page-Directory-Entry that maps a 4MiB Page directly<br />
struct page_directory_entry_4m_struct<br />
{<br />
  uint32 present : 1;<br />
  uint32 writeable : 1;<br />
  uint32 user_access : 1;<br />
  uint32 write_through : 1;<br />
  uint32 cache_disabled : 1;<br />
  uint32 accessed : 1;<br />
  uint32 dirty : 1;<br />
  uint32 use_4_m_pages : 1;<br />
  uint32 global_page : 1;<br />
  uint32 avail_1 : 1;<br />
  uint32 avail_2 : 1;<br />
  uint32 avail_3 : 1;<br />
  uint32 pat : 1;<br />
  uint32 reserved : 9;<br />
  uint32 page_base_address : 10;<br />
} __attribute__((__packed__));<br />
<br />
// Both interpretations of the same 32 bit Page-Directory-Entry<br />
union page_directory_entry_union<br />
{<br />
  struct page_directory_entry_4k_struct pde4k;<br />
  struct page_directory_entry_4m_struct pde4m;<br />
} __attribute__((__packed__));<br />
typedef union page_directory_entry_union page_directory_entry;<br />
<br />
// Page-Table-Entry that maps a 4KiB Page<br />
struct page_table_entry_struct<br />
{<br />
  uint32 present : 1;<br />
  uint32 writeable : 1;<br />
  uint32 user_access : 1;<br />
  uint32 write_through : 1;<br />
  uint32 cache_disabled : 1;<br />
  uint32 accessed : 1;<br />
  uint32 dirty : 1;<br />
  uint32 pat : 1;<br />
  uint32 global_page : 1;<br />
  uint32 avail_1 : 1;<br />
  uint32 avail_2 : 1;<br />
  uint32 avail_3 : 1;<br />
  uint32 page_base_address : 20;<br />
} __attribute__((__packed__));<br />
typedef struct page_table_entry_struct page_table_entry;<br />

Note: A set flag in a Page-Directory-Entry applies to one 4MiB Page or all 1024<br />

Page-Table-Entries, even if that flag is not set in each Page-Table-Entry.<br />
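The bit-field structs above rely on the compiler placing the first field in the least significant bit. As an illustration of the same layout, the following sketch decodes a raw 32 bit entry with shifts and masks instead. The function names are assumptions made for this example and are not part of Sweb.<br />

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers: decode raw 32-bit Page-Directory- or
// Page-Table-Entries without relying on bit-field layout.
constexpr bool entryPresent(std::uint32_t e)    { return (e & 0x1) != 0; }
constexpr bool entryWriteable(std::uint32_t e)  { return (e & 0x2) != 0; }
constexpr bool entryUserAccess(std::uint32_t e) { return (e & 0x4) != 0; }
constexpr bool pdeUses4MPages(std::uint32_t pde) { return (pde & 0x80) != 0; }

// page_table_base_address / page_base_address of a 4KiB mapping:
// the most significant 20 bits.
constexpr std::uint32_t entryBase20(std::uint32_t e) { return e >> 12; }

// page_base_address of a 4MiB mapping: the most significant 10 bits.
constexpr std::uint32_t pde4mBase10(std::uint32_t pde)
{
  assert(pdeUses4MPages(pde));  // only meaningful for 4MiB entries
  return pde >> 22;
}
```

For the raw value 0x00123007 this reports a present, writeable, user accessible entry whose 20 bit base is 0x123.<br />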

Address Translation<br />

The following is a pseudo code example on how the CPU translates a linear address<br />

into a physical one. The example assumes 4KiB-Pages. Translation is also illustrated<br />

in Figure 3.3.<br />

Index_PageDirectory = div(linearAddr / PageSize, PageTableEntries)<br />
Index_PageTable = mod(linearAddr / PageSize, PageTableEntries)<br />
32bit *PageDirectory = readRegister(CR3)<br />
32bit *PageTable = PageDirectory[Index_PageDirectory].PageTableBase · PageSize<br />
physicalPage = PageTable[Index_PageTable].PageBase<br />
physicalAddr = physicalPage · PageSize + mod(linearAddr, PageSize)<br />
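The translation steps can also be tried out on a simulated memory. The sketch below is not Sweb code and not how the hardware is implemented; it only mirrors the pseudo code for 4KiB-Pages. Physical memory is modelled as a vector of 32 bit words, and the names PhysicalMemory, translate and makeExampleMemory are assumptions made for this example.<br />

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

constexpr std::uint32_t PAGE_TABLE_ENTRIES = 1024;
constexpr std::uint32_t PAGE_SIZE = 4096;

// Simulated physical memory, one element per 32-bit word (assumption).
using PhysicalMemory = std::vector<std::uint32_t>;

// Walks Page-Directory and Page-Table for 4KiB-Pages.
// Returns the physical address, or nothing where the CPU would page fault.
std::optional<std::uint32_t> translate(const PhysicalMemory& mem,
                                       std::uint32_t cr3,
                                       std::uint32_t linear_addr)
{
  const std::uint32_t page = linear_addr / PAGE_SIZE;
  const std::uint32_t pd_index = page / PAGE_TABLE_ENTRIES;
  const std::uint32_t pt_index = page % PAGE_TABLE_ENTRIES;

  const std::uint32_t pd_base = cr3 & 0xFFFFF000u;  // byte address of the Page-Directory
  assert(pd_base / 4 + pd_index < mem.size());
  const std::uint32_t pde = mem[pd_base / 4 + pd_index];
  if ((pde & 0x1) == 0)
    return std::nullopt;                            // Page-Table not present

  const std::uint32_t pt_base = pde & 0xFFFFF000u;  // PageTableBase * PageSize
  assert(pt_base / 4 + pt_index < mem.size());
  const std::uint32_t pte = mem[pt_base / 4 + pt_index];
  if ((pte & 0x1) == 0)
    return std::nullopt;                            // Page not present

  return (pte & 0xFFFFF000u) + linear_addr % PAGE_SIZE;
}

// Two frames: frame 0 is the Page-Directory, frame 1 a Page-Table.
// Linear page 1026 (directory index 1, table index 2) maps to frame 5.
PhysicalMemory makeExampleMemory()
{
  PhysicalMemory mem(2 * PAGE_TABLE_ENTRIES, 0);
  mem[1] = 0x00001000u | 0x3u;                       // PDE 1 -> frame 1, present + writeable
  mem[PAGE_TABLE_ENTRIES + 2] = 0x00005000u | 0x3u;  // PTE 2 -> frame 5
  return mem;
}
```

With this example memory and CR3 = 0, the linear address 0x402123 translates to the physical address 0x5123, while the unmapped address 0x1000 yields no result.<br />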

Control Register 0, 2, 3 and 4<br />

The CR registers are 32 bits wide and control the behaviour of the CPU. For a complete<br />

reference see [16]. Table 3.4 lists the bits in these registers that are used to control<br />

Paging.<br />

CR2 holds the linear address whose access caused the last Page-Fault.<br />

The Stack of the Page-Fault-Handler holds additional information, e.g. whether code tried to<br />

write to a read-only page or read from an unmapped address.<br />
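The additional information arrives as an error code that the CPU pushes onto the handler's stack. As a minimal sketch, its three lowest bits can be decoded as follows. The struct and function names are assumptions made for this example; the bit meanings follow the Intel manuals.<br />

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded view of the Page-Fault error code.
struct PageFaultInfo
{
  bool page_present;  // bit 0: 1 = protection violation, 0 = page not present
  bool write_access;  // bit 1: 1 = write, 0 = read
  bool user_mode;     // bit 2: 1 = fault happened in user mode
};

constexpr PageFaultInfo decodePageFaultErrorCode(std::uint32_t error_code)
{
  return PageFaultInfo{(error_code & 0x1) != 0,
                       (error_code & 0x2) != 0,
                       (error_code & 0x4) != 0};
}

// Usage sketch: a user mode write to a present, read-only page.
inline void examplePageFaultDecode()
{
  const PageFaultInfo info = decodePageFaultErrorCode(0x7);
  assert(info.page_present && info.write_access && info.user_mode);
}
```

An error code of 0x4, by contrast, describes a user mode read from an unmapped address.<br />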

CR3 holds the physical memory address of the currently active Page-Directory;<br />

however, only the most significant 20 bits are used by the CPU, the rest is assumed<br />

to be 0. The address of any Page-Directory must therefore start on a 4KiB Page boundary,<br />

i.e. on an address which is a multiple of 4096. The CPU internally caches some<br />

of the contents of the Page-Directory and Page-Tables in the TLB (Translation Look-aside Buffer). Therefore, if the Page-Directory or<br />

a Page-Table is changed, we need to reload CR3 to flush the TLB, even if the address<br />

of the Page-Directory did not change. Note that bit 4 of CR3 (PCD) only disables caching<br />

of the Page-Directory itself in the CPU's memory caches; it does not disable the TLB.<br />


Field Bits Description<br />
present 1 If set, the linear address range associated with this Entry is valid.<br />
writeable 1 If set, writing to the linear address range associated with this entry is possible. Note that unless the WP flag in CR0 is set, supervisor level code can still write to read-only pages.<br />
user_access 1 If set, this page is user accessible, otherwise it is only supervisor accessible (which means only with privilege levels 0 to 2).<br />
write_through 1 Switches between write-through and write-back caching on a per page basis.<br />
cache_disabled 1 Controls caching on a per page basis.<br />
accessed 1 Set by the CPU if a linear address associated with this page has been accessed. Should be cleared by the operating system after every read of this bit to continuously track page access.<br />
dirty 1 Set by the CPU if a page has been written to since the bit was last cleared by the operating system.<br />
use_4_m_pages 1 This bit only exists in the Page-Directory-Entry. If set, the most significant 10 bits of this Page-Directory-Entry are interpreted as the page number of a physical 4MiB Page. If unset, the most significant 20 bits are interpreted as the 4KiB physical page number of a Page-Table.<br />
global_page 1 If set by the operating system, the CPU assumes this to be a shared page and handles entries in the TLB (Translation Look-aside Buffer) accordingly. Note that this bit has an effect only in a Page-Table-Entry or in Page-Directory-Entries that point to 4MiB Pages.<br />
page_table_base_address 20 the 20 most significant bits of the physical address of a Page-Table<br />
page_base_address 10 or 20 the 20 most significant bits of the start address of a physical 4KiB-Page or the 10 most significant bits of the start address of a 4MiB-Page<br />

Table 3.3: Explanation <strong>of</strong> Fields in Page-Directory- and Page-Table-Entries<br />


Figure 3.3: 4k page address translation ((c) by Intel)<br />

Register Bit Name Description<br />
CR0 0 PE enables Protected Mode<br />
CR0 16 WP If set, writing to read-only pages causes a Page-Fault even for Ring 0 Code (needed for CopyOnWrite)<br />
CR0 31 PG enables Paging<br />
CR4 4 PSE allows mixing of 4KiB and 4MiB pages<br />
CR4 7 PGE allows pages to be global, meaning they are not flushed from cache on task switch. (see shared memory)<br />

Table 3.4: Control Registers and Paging<br />
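The bit positions from Table 3.4 can be written down as constants. The sketch below is only an illustration with hypothetical names; it does not read or write the real registers, which requires privileged instructions.<br />

```cpp
#include <cassert>
#include <cstdint>

// Bit masks for the control register bits listed in Table 3.4
// (constant names are assumptions for this example).
constexpr std::uint32_t CR0_PE  = 1u << 0;   // Protected Mode enable
constexpr std::uint32_t CR0_WP  = 1u << 16;  // write protect for Ring 0
constexpr std::uint32_t CR0_PG  = 1u << 31;  // Paging enable
constexpr std::uint32_t CR4_PSE = 1u << 4;   // 4MiB pages allowed
constexpr std::uint32_t CR4_PGE = 1u << 7;   // global pages allowed

// Paging is only active if Protected Mode is enabled as well.
constexpr bool pagingEnabled(std::uint32_t cr0)
{
  return (cr0 & CR0_PE) != 0 && (cr0 & CR0_PG) != 0;
}

// Usage sketch: a CR0 value with PE, WP and PG set.
inline void exampleControlRegisters()
{
  const std::uint32_t cr0 = CR0_PE | CR0_WP | CR0_PG;
  assert(pagingEnabled(cr0));
}
```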


3.4 Interface & Classes<br />

3.4.1 ArchMemory<br />

The class ArchMemory does not get instantiated. Instead it is a collection of architecture<br />

dependent static methods related to Memory and Pages. Therefore these methods<br />

will be referred to as functions. An alternative would have been to use namespaces,<br />

but a class with static methods was chosen instead. ArchMemory contains architecture dependent code and<br />

needs to be implemented separately for each platform. This document focuses on the<br />

x86 architecture.<br />

Interface<br />

Function: static void initNewPageDirectory(uint32 physical_page_to_use);<br />
Description: creates a new Page-Directory for a user process by copying the original kernel-only Page-Directory. The parameter physical_page_to_use specifies where the new Page-Directory should be placed.<br />

Function: static void mapPage(uint32 physical_page_directory_page, uint32 linear_page, uint32 physical_page, uint32 user_access, uint32 page_size=PAGE_SIZE);<br />
Description: maps a linear page to a physical page (pde and pte need to be set up first). physical_page_directory_page specifies the physical page where the Page-Directory to work on can be found. With user_access the User/Supervisor Flag can be set. page_size is optional and defaults to 4KiB pages, but you need to set it to 1024*4096 if you want to map a 4MiB page.<br />

Function: static void unmapPage(uint32 physical_page_directory_page, uint32 linear_page);<br />
Description: removes the mapping of a linear_page by marking its Page-Table-Entry as not present. The parameter physical_page_directory_page specifies the physical page where the Page-Directory to work on can be found. linear_page identifies the mapping that will be invalidated.<br />

Function: static void freePageDirectory(uint32 physical_page_directory_page);<br />
Description: recursively removes a Page-Directory and the Page-Tables it points to, and marks all mapped pages as free in the PageManager. physical_page_directory_page holds the 20 most significant bits of the physical address at which the Page-Directory can be found.<br />

Function: static pointer get3GBAdressOfPPN(uint32 ppn, uint32 page_size=PAGE_SIZE);<br />
Description: Because address translation is active everywhere, Sweb provides a 1:1 mapping which allows us to directly access a given physical page. The function takes a physical page number and returns a linear address that can be used to directly access the given page. Basically, the page number is multiplied by page_size and an offset of 3GiB is added. This is a very important method: it is the only way to manipulate page directories or page tables. The parameter page_size is optional and defaults to 4KiB pages. It needs to be set to 1024*4096 when using a 4MiB page number.<br />

Function: static bool checkAddressValid(uint32 physical_page_directory_page, uint32 laddress_to_check);<br />
Description: Checks if a given linear address is valid and therefore mapped to physical memory. physical_page_directory_page is the physical page where the Page-Directory whose mappings are to be checked can be found. The function returns true if a mapping of laddress_to_check exists in physical_page_directory_page and false if the given linear address is unmapped and accessing it would result in a page fault.<br />

Function: static uint32 getPhysicalPageOfVirtualPageInKernelMapping(uint32 linear_page, uint32 *physical_page);<br />
Description: Takes a linear_page number and searches through the Page-Directory for the physical_page number it refers to. The result is stored in the variable that physical_page points to. To access that physical page directly, get3GBAdressOfPPN(..) should be used. Returns 0 if the linear page does not map to any physical page, or the page size in bytes on success (4096 for 4KiB pages or 4096*1024 for 4MiB pages).<br />

Table 3.5: ArchMemory Interface<br />
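The arithmetic described for get3GBAdressOfPPN, and the way a linear page number is split into a Page-Directory index and a Page-Table index, can be sketched as follows. This is a stand-alone illustration based only on the descriptions in Table 3.5, not the Sweb implementation; all names in it are assumptions made for this example.<br />

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t PAGE_SIZE = 4096;
constexpr std::uint32_t PAGE_TABLE_ENTRIES = 1024;
// Start of the 1:1 mapping of physical memory, 3GiB as stated in Table 3.5.
constexpr std::uint32_t IDENT_MAPPING_OFFSET = 3u * 1024u * 1024u * 1024u;

// Hypothetical counterpart of get3GBAdressOfPPN: linear address through
// which the physical page ppn can be accessed directly.
constexpr std::uint32_t identAddressOfPPN(std::uint32_t ppn,
                                          std::uint32_t page_size = PAGE_SIZE)
{
  assert(page_size == PAGE_SIZE || page_size == PAGE_SIZE * PAGE_TABLE_ENTRIES);
  return IDENT_MAPPING_OFFSET + ppn * page_size;
}

// Index of the Page-Directory-Entry responsible for a linear page.
constexpr std::uint32_t pdeIndex(std::uint32_t linear_page)
{
  return linear_page / PAGE_TABLE_ENTRIES;
}

// Index of the Page-Table-Entry inside that Page-Table.
constexpr std::uint32_t pteIndex(std::uint32_t linear_page)
{
  return linear_page % PAGE_TABLE_ENTRIES;
}
```

Physical page 1 is thus reachable at linear address 0xC0001000, and linear page 1026 is handled by Page-Directory-Entry 1 and Page-Table-Entry 2.<br />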

Implementation<br />

#include "ArchMemory.h"<br />
#include "kprintf.h"<br />
#include "assert.h"<br />
<br />
extern "C" uint32 kernel_page_directory_start;<br />
<br />
void ArchMemory::initNewPageDirectory(uint32 physical_page_to_use)<br />
{<br />
  page_directory_entry *new_page_directory =<br />
    (page_directory_entry *) get3GBAdressOfPPN(physical_page_to_use);<br />
  ArchCommon::memcpy(<br />
    (pointer) new_page_directory,<br />
    (pointer) &kernel_page_directory_start,<br />
    PAGE_SIZE);<br />
  for (uint32 p = 0; p < 512; ++p) // we're concerned with first two gig, rest stays as is<br />
  {<br />
    new_page_directory[p].pde4k.present = 0;<br />
  }<br />
}<br />
<br />

void ArchMemory::checkAndRemovePTE(uint32 physical_page_directory_page,<br />
                                   uint32 pde_lpn)<br />
{<br />
  page_directory_entry *page_directory =<br />
    (page_directory_entry *) get3GBAdressOfPPN(physical_page_directory_page);<br />
  page_table_entry *pte_base =<br />
    (page_table_entry *) get3GBAdressOfPPN(page_directory[pde_lpn].pde4k.page_table_base_address);<br />
  assert(page_directory[pde_lpn].pde4m.use_4_m_pages == 0);<br />
  for (uint32 pte_lpn = 0; pte_lpn < PAGE_TABLE_ENTRIES; ++pte_lpn)<br />
    if (pte_base[pte_lpn].present > 0)<br />
      return; // not empty -> do nothing<br />
<br />
  // else:<br />
  page_directory[pde_lpn].pde4k.present = 0;<br />
  PageManager::instance()->freePage(page_directory[pde_lpn].pde4k.page_table_base_address);<br />
}<br />
<br />

void ArchMemory::unmapPage(uint32 physical_page_directory_page,<br />
                           uint32 linear_page)<br />
{<br />
  page_directory_entry *page_directory =<br />
    (page_directory_entry *) get3GBAdressOfPPN(physical_page_directory_page);<br />
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;<br />
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;<br />
<br />
  if (page_directory[pde_lpn].pde4m.use_4_m_pages)<br />
  {<br />
    page_directory[pde_lpn].pde4m.present = 0;<br />
    // PageManager manages Pages of size PAGE_SIZE only,<br />
    // so we have to free this_page_size/PAGE_SIZE Pages<br />
    for (uint32 p = 0; p < 1024; ++p) // 4MiB / PAGE_SIZE = 1024 Pages<br />
      PageManager::instance()->freePage(page_directory[pde_lpn].pde4m.page_base_address*1024+p);<br />
  }<br />
  else // a 4MiB entry has no Page-Table to walk<br />
  {<br />
    page_table_entry *pte_base =<br />
      (page_table_entry *) get3GBAdressOfPPN(page_directory[pde_lpn].pde4k.page_table_base_address);<br />
    if (pte_base[pte_lpn].present)<br />
    {<br />
      pte_base[pte_lpn].present = 0;<br />
      PageManager::instance()->freePage(pte_base[pte_lpn].page_base_address);<br />
    }<br />
    checkAndRemovePTE(physical_page_directory_page, pde_lpn);<br />
  }<br />
}<br />
<br />

void ArchMemory::insertPTE(uint32 physical_page_directory_page,<br />
                           uint32 pde_lpn,<br />
                           uint32 physical_page_table_page)<br />
{<br />
  page_directory_entry *page_directory =<br />
    (page_directory_entry *) get3GBAdressOfPPN(physical_page_directory_page);<br />
  ArchCommon::bzero(get3GBAdressOfPPN(physical_page_table_page), PAGE_SIZE);<br />
  page_directory[pde_lpn].pde4k.present = 1;<br />
  page_directory[pde_lpn].pde4k.writeable = 1;<br />
  page_directory[pde_lpn].pde4k.use_4_m_pages = 0;<br />
  page_directory[pde_lpn].pde4k.page_table_base_address = physical_page_table_page;<br />
  page_directory[pde_lpn].pde4k.user_access = 1;<br />
}<br />
<br />

void ArchMemory::mapPage(uint32 physical_page_directory_page,<br />
                         uint32 linear_page,<br />
                         uint32 physical_page,<br />
                         uint32 user_access,<br />
                         uint32 page_size)<br />
{<br />
  kprintfd_nosleep("ArchMemory::mapPage: pys1 %x, pyhs2 %x\n",<br />
                   physical_page_directory_page,<br />
                   physical_page);<br />
  page_directory_entry *page_directory =<br />
    (page_directory_entry *) get3GBAdressOfPPN(physical_page_directory_page);<br />
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;<br />
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;<br />
  if (page_size == PAGE_SIZE)<br />
  {<br />
    if (page_directory[pde_lpn].pde4k.present == 0)<br />
    {<br />
      insertPTE(physical_page_directory_page, pde_lpn,<br />
                PageManager::instance()->getFreePhysicalPage());<br />
    }<br />
    page_table_entry *pte_base =<br />
      (page_table_entry *) get3GBAdressOfPPN(<br />
        page_directory[pde_lpn].pde4k.page_table_base_address<br />
      );<br />
    pte_base[pte_lpn].present = 1;<br />
    pte_base[pte_lpn].writeable = 1;<br />
    pte_base[pte_lpn].user_access = user_access;<br />
    pte_base[pte_lpn].page_base_address = physical_page;<br />
  }<br />
  else if ((page_size == PAGE_SIZE*1024) && (page_directory[pde_lpn].pde4m.present == 0))<br />
  {<br />
    page_directory[pde_lpn].pde4m.present = 1;<br />
    page_directory[pde_lpn].pde4m.writeable = 1;<br />
    page_directory[pde_lpn].pde4m.use_4_m_pages = 1;<br />
    page_directory[pde_lpn].pde4m.page_base_address = physical_page;<br />
    page_directory[pde_lpn].pde4m.user_access = user_access;<br />
  }<br />
  else<br />
    assert(false);<br />
}<br />
<br />

// only free pte's < PAGE_TABLE_ENTRIES/2 because we do NOT
// want to free Kernel Pages
void ArchMemory::freePageDirectory(uint32 physical_page_directory_page)
{
  page_directory_entry *page_directory =
      (page_directory_entry *) get3GBAdressOfPPN(physical_page_directory_page);
  for (uint32 pde_lpn = 0; pde_lpn < PAGE_TABLE_ENTRIES / 2; ++pde_lpn)
  {
    if (page_directory[pde_lpn].pde4k.present)
    {
      if (page_directory[pde_lpn].pde4m.use_4_m_pages)
      {
        page_directory[pde_lpn].pde4m.present = 0;
        // a 4MiB page covers 1024 consecutive 4KiB pages; the loop header is
        // garbled in the source listing and its bound is inferred from that
        for (uint32 p = 0; p < 1024; ++p)
          PageManager::instance()->freePage(
              page_directory[pde_lpn].pde4m.page_base_address * 1024 + p);
      }
      else
      {
        page_table_entry *pte_base =
            (page_table_entry *) get3GBAdressOfPPN(
                page_directory[pde_lpn].pde4k.page_table_base_address);
        for (uint32 pte_lpn = 0; pte_lpn < PAGE_TABLE_ENTRIES; ++pte_lpn)
        {
          if (pte_base[pte_lpn].present)
          {
            pte_base[pte_lpn].present = 0;
            PageManager::instance()->freePage(pte_base[pte_lpn].page_base_address);
          }
        }
        page_directory[pde_lpn].pde4k.present = 0;
        PageManager::instance()->freePage(page_directory[pde_lpn].pde4k.page_table_base_address);
      }
    }
  }
  PageManager::instance()->freePage(physical_page_directory_page);
}

bool ArchMemory::checkAddressValid(uint32 physical_page_directory_page, uint32 laddress_to_check)
{
  page_directory_entry *page_directory =
      (page_directory_entry *) get3GBAdressOfPPN(physical_page_directory_page);
  uint32 linear_page = laddress_to_check / PAGE_SIZE;
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;
  if (page_directory[pde_lpn].pde4k.present)
  {
    if (page_directory[pde_lpn].pde4m.use_4_m_pages)
      return true;

    page_table_entry *pte_base =
        (page_table_entry *) get3GBAdressOfPPN(
            page_directory[pde_lpn].pde4k.page_table_base_address);
    if (pte_base[pte_lpn].present)
      return true;
    else
      return false;
  }
  else
    return false;
}

// the second parameter line is missing in the source listing; its name and
// pointer type are inferred from the assignments to *physical_page below
uint32 ArchMemory::getPhysicalPageOfVirtualPageInKernelMapping(uint32 linear_page,
                                                               uint32 *physical_page)
{
  page_directory_entry *page_directory =
      (page_directory_entry *) &kernel_page_directory_start;
  uint32 pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  uint32 pte_lpn = linear_page % PAGE_TABLE_ENTRIES;
  if (page_directory[pde_lpn].pde4k.present) //the present bit is the same for 4k and 4m
  {
    if (page_directory[pde_lpn].pde4m.use_4_m_pages)
    {
      *physical_page = page_directory[pde_lpn].pde4m.page_base_address;
      return (PAGE_SIZE * 1024U);
    }
    else
    {
      page_table_entry *pte_base =
          (page_table_entry *) get3GBAdressOfPPN(
              page_directory[pde_lpn].pde4k.page_table_base_address);
      if (pte_base[pte_lpn].present)
      {
        *physical_page = pte_base[pte_lpn].page_base_address;
        return PAGE_SIZE;
      }
      else
        return 0;
    }
  }
  else
    return 0;
}
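The functions above repeatedly split a linear page number into a Page-Directory index (pde_lpn) and a Page-Table index (pte_lpn). The following is a minimal standalone sketch of that arithmetic, assuming 4KiB pages and 1024 entries per table. The struct PageIndices and the helper splitLinearAddress are hypothetical names introduced only for illustration; they are not part of ArchMemory.

```cpp
#include <cstdint>

// Constants for 32-bit x86 paging with 4KiB pages; the names mirror the listing.
constexpr std::uint32_t PAGE_SIZE = 4096;
constexpr std::uint32_t PAGE_TABLE_ENTRIES = 1024;

// Hypothetical result type, not part of Sweb.
struct PageIndices
{
  std::uint32_t pde_lpn; // index into the Page-Directory
  std::uint32_t pte_lpn; // index into the selected Page-Table
  std::uint32_t offset;  // byte offset inside the 4KiB page
};

// Hypothetical helper: performs the same split as checkAddressValid().
inline PageIndices splitLinearAddress(std::uint32_t laddress)
{
  const std::uint32_t linear_page = laddress / PAGE_SIZE;
  PageIndices result;
  result.pde_lpn = linear_page / PAGE_TABLE_ENTRIES;
  result.pte_lpn = linear_page % PAGE_TABLE_ENTRIES;
  result.offset = laddress % PAGE_SIZE;
  return result;
}
```

With these constants, the linear address 2GiB (0x80000000) falls into Page-Directory-Entry 512, which is why freePageDirectory() only walks entries below PAGE_TABLE_ENTRIES/2.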

3.4.2 PageManager<br />

The class PageManager is a singleton class. Sweb uses PageManager to track
which physical pages are free and which are in use. The PageManager assumes all pages to be
4KiB in size. Historically, PageManager is created even before the KernelMemoryManager
and placed at the linear address &kernel_end_address. For each physical
memory page the manager uses 8 bits of kernel memory. With a memory footprint of
sizeof(PageManager) + number_of_physical_pages * sizeof(uint8) and the Sweb limit of at most
1GiB of physical memory (262144 pages of 4KiB), PageManager can consume a maximum of
12 Byte + 256KiB, i.e. slightly more than 256KiB.

Currently Sweb knows four page usage types: PAGE_RESERVED, PAGE_KERNEL,
PAGE_USERSPACE and PAGE_FREE. Pages marked reserved can never be freed; they
are memory areas used by the hardware or by modules loaded by grub. The 4 remaining
bits are free for future use.

Interface

Method: static pointer createPageManager(pointer next_usable_address);
Description: creates the singleton instance of PageManager. This method in turn calls the PageManager ctor using placement new.

Method: static PageManager *instance();
Description: returns a pointer to the singleton instance of PageManager. Example: PageManager::instance()->method();

Method: uint32 getTotalNumPages() const;
Description: returns the number of 4KiB pages available to Sweb. That is the size of the first usable memory region as reported by grub or a maximum of 1GiB, divided by the size of one page (4KiB).

Method: uint32 getFreePhysicalPage(uint32 type);
Description: returns the next free physical page number and marks that page as used. It takes one parameter "type", which should be either PAGE_USERSPACE, which is the default, or PAGE_KERNEL. Note that the type set in the PageManager does not influence the User/Supervisor flag in the Page-Directory.

Method: void freePage(uint32 page_number);
Description: marks one physical page identified by its page_number as free, if it was marked as used in user or kernel context.

Method: void startUsingSyncMechanism();
Description: to be called after initialising the KernelMemoryManager and before starting Interrupts, to ensure mutual exclusion.

Table 3.6: PageManager Interface
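The bookkeeping behind this interface can be sketched in a few lines. The following is a simplified stand-in, not the Sweb implementation: it leaves out the singleton, the lock and the parsing of the memory map. The class name PageUsageTable, the method tableBytes() and the numeric values of the usage types are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Usage types named in the text; the numeric values are an assumption.
enum PageUsage : std::uint8_t
{
  PAGE_RESERVED = 0,
  PAGE_KERNEL = 1,
  PAGE_USERSPACE = 2,
  PAGE_FREE = 3
};

// Hypothetical, simplified stand-in for PageManager: one byte per physical page.
class PageUsageTable
{
public:
  PageUsageTable(std::uint32_t number_of_pages, std::uint32_t lowest_unreserved_page)
      : table_(number_of_pages, PAGE_RESERVED),
        lowest_unreserved_page_(lowest_unreserved_page)
  {
    // everything above the reserved low area starts out free in this sketch
    for (std::uint32_t p = lowest_unreserved_page_; p < number_of_pages; ++p)
      table_[p] = PAGE_FREE;
  }

  // Returns the first free page number and marks it with `type`, or 0 if none is left.
  std::uint32_t getFreePhysicalPage(PageUsage type = PAGE_USERSPACE)
  {
    if (type == PAGE_FREE || type == PAGE_RESERVED)
      return 0;
    for (std::uint32_t p = lowest_unreserved_page_; p < table_.size(); ++p)
    {
      if (table_[p] == PAGE_FREE)
      {
        table_[p] = type;
        return p;
      }
    }
    return 0;
  }

  // Marks a page as free unless it is reserved or outside the managed range.
  void freePage(std::uint32_t page_number)
  {
    if (page_number >= lowest_unreserved_page_ && page_number < table_.size() &&
        table_[page_number] != PAGE_RESERVED)
      table_[page_number] = PAGE_FREE;
  }

  // Size of the usage table in bytes (one byte per page).
  std::size_t tableBytes() const { return table_.size() * sizeof(std::uint8_t); }

private:
  std::vector<std::uint8_t> table_;
  std::uint32_t lowest_unreserved_page_;
};
```

For 1GiB of RAM (262144 pages) tableBytes() yields 256KiB, which matches the footprint estimate given above.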

3.4.3 Implementation<br />

/**
 * @file PageManager.cpp
 */

#include "mm/PageManager.h"
#include "mm/new.h"
#include "paging-definitions.h"
#include "arch_panic.h"
#include "ArchCommon.h"
#include "ArchMemory.h"
#include "debug_bochs.h"
#include "console/kprintf.h"
#include "ArchInterrupts.h"
#include "assert.h"

PageManager* PageManager::instance_ = 0;

extern void* kernel_end_address;

void PageManager::createPageManager()
{
  if (instance_)
    return;

  instance_ = new PageManager();
}

PageManager::PageManager()
{
  number_of_pages_ = 0;
  lowest_unreserved_page_ = 256; //physical memory
  // [...]
  uint32 highest_address_below_1gig = 0, type = 0, used_pages = 0;

  //Determine Amount of RAM
  for (i = 0; i
  // [...]
  number_of_pages_ = Min(number_of_pages_, 1024 * 256);

  //we use KMM now
  //note, the max size of our page_useage_table0 is 1MiB in theory (4GiB RAM) and
  //256KiB in practice (1GiB RAM) (calculated with sizeof(puttype)=1)
  //currently we have 4MiB of kernel memory (1024 mapped pages starting at linear addr.: 2GiB)
  //if our kernel image becomes too large, the following command might fail
  page_usage_table_ = new puttype[number_of_pages_];

  //since we have gaps in the memory maps we can not give out everything
  //first mark everything as reserved, just to be sure
  debug(PM, "Ctor: Initializing page_usage_table_ with all pages reserved\n");
  for (i = 0; i
  // [...]

  //now mark as free, everything that might be useable
  for (i = 0; i
  // [...]
    1024 * 256) //because max 1 gig of memory, see above
    {
      continue;
    }

    for (k = Max(start_page, lowest_unreserved_page_); k
  // [...]

  //will probably be 1024
  debug(PM, "Ctor: lowest_unreserved_page_=%d\n", lowest_unreserved_page_);
  prenew_assert(lowest_unreserved_page_ >= 1024);
  prenew_assert(lowest_unreserved_page_ < number_of_pages_);
  debug(PM, "Ctor done\n");
}

uint32 PageManager::getTotalNumPages() const
{
  return number_of_pages_;
}

//used by loader.cpp ArchMemory.cpp
uint32 PageManager::getFreePhysicalPage(uint32 type)
{
  if (type == PAGE_FREE || type == PAGE_RESERVED) //what a stupid thing that would be to do
    return 0;

  if (lock_)
    lock_->acquire();

  //first 1024 pages are the 4MiB for Kernel Space
  //the loop header and the first lines of its body are garbled in the source
  //listing; they are reconstructed here from freePage() and the interface description
  for (uint32 p = lowest_unreserved_page_; p < number_of_pages_; ++p)
  {
    if (page_usage_table_[p] == PAGE_FREE)
    {
      page_usage_table_[p] = type;
      if (lock_)
        lock_->release();
      return p;
    }
  }
  if (lock_)
    lock_->release();
  arch_panic((uint8 *) "PageManager: Sorry, no more Pages Free !!!");
  return 0;
}

void PageManager::freePage(uint32 page_number)
{
  if (lock_)
    lock_->acquire();
  if (page_number >= lowest_unreserved_page_ && page_number < number_of_pages_
      && page_usage_table_[page_number] != PAGE_RESERVED)
    page_usage_table_[page_number] = PAGE_FREE;
  if (lock_)
    lock_->release();
}

3.5 Sweb Boot Process<br />

3.5.1 Boot Sequence<br />

Sweb uses a patched version of Grub to load itself into memory (the patch is only needed for
VESA mode; text mode works without patching). Afterwards the following steps are
executed in sequential order:

1. load gdt
2. load LINEAR_DATA_SEL into all segment registers
3. zero out kernel BSS (uninitialized data section)
4. set up stack (which is in the BSS section)
5. call parseMultibootHeader()
6. call initialiseBootTimePaging()
7. enable PSE (extended PageSize 4k and 4m mixed)
8. enable Paging (PG bit = 1)
9. Paging Mode:
10. reload GDT with new address
11. load LINEAR_DATA_SEL into all segment registers
12. call removeBootTimeIdentMapping()
13. reload CR3
14. call startup()
14.1. init KernelMemoryManager
14.2. init PageManager
14.3. init Console
14.4. init Scheduler
14.5. init kprintf_no_sleep
14.6. init KernelThreadInfo
14.7. init Interrupts
14.8. add Threads
14.9. enable synchronisation mechanisms in KMM and PM
14.10. enable Interrupts: from here on, tasks are executed pseudo-parallel.
14.11. yield away

For details, refer to the files init_boottime_pagetables.cpp, boot.s and main.cpp.

Index  Base  Limit   P    DPL  S    Type  A    G    D    X    U
       [24]  [20]    [1]  [2]  [1]  [3]   [1]  [1]  [1]  [1]  [1]
0      0     0       0    00   0    000   0    0    0    0    0
1      0     0       0    00   0    000   0    0    0    0    0
2      0     FFFFFh  1    00   1    001   0    1    1    0    0
3      0     FFFFFh  1    00   1    101   0    1    1    0    0
4      0     FFFFFh  1    11   1    001   0    1    1    0    0
5      0     FFFFFh  1    11   1    101   0    1    1    0    0

Table 3.7: Sweb Global segment Descriptor Table

3.5.2 Sweb Global Descriptor<br />

Sweb’s boot time GDT can be found in boot.s and Table 3.7 with a per-descriptor<br />

elaboration in Table 3.8.<br />


Index Base Limit Priv.Lvl. Type Symbol<br />

2 0 4GiB 0 Data, read/write LINEAR_DATA_SEL<br />

3 0 4GiB 0 Code, execute/read LINEAR_CODE_SEL<br />

4 0 4GiB 3 Data, read/write LINEAR_USR_DATA_SEL<br />

5 0 4GiB 3 Code, execute/read LINEAR_USR_CODE_SEL<br />

Table 3.8: Sweb GDT, more readable<br />
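Two details connect Table 3.7 to Table 3.8. On IA32 a segment selector loaded into a segment register is the descriptor index shifted left by three bits, combined with the table indicator and the requested privilege level. A 20-bit limit of FFFFFh with the granularity bit G set is counted in 4KiB units and therefore spans 4GiB. The sketch below works through both calculations. The function names makeSelector and segmentSizeBytes are hypothetical, and the concrete selector constants used in boot.s are not shown in this document.

```cpp
#include <cstdint>

// Builds a GDT selector (table indicator = 0) from a descriptor index and a
// requested privilege level, following the IA32 selector layout.
constexpr std::uint16_t makeSelector(std::uint16_t gdt_index, std::uint16_t rpl)
{
  return static_cast<std::uint16_t>((gdt_index << 3) | (rpl & 3));
}

// Size of a segment in bytes for a 20-bit limit field. With the granularity
// bit set the limit is counted in 4KiB units, otherwise in bytes.
constexpr std::uint64_t segmentSizeBytes(std::uint32_t limit, bool granularity)
{
  return granularity ? (static_cast<std::uint64_t>(limit) + 1) * 4096
                     : static_cast<std::uint64_t>(limit) + 1;
}
```

Under these rules descriptor 2 with privilege level 0 gives the selector 10h, and descriptor 5 with privilege level 3 gives 2Bh.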

3.5.3 Sweb Boot-Time Paging<br />

As long as paging is disabled during the boot phase, we are accessing physical memory
directly. This makes it easy to get information about the location of grub-loaded
modules or to access memory used by hardware. The problem in this phase stems from
the fact that all kernel pointers are compiled with a 2GiB offset, since the kernel is
meant to be accessed at 2GiB in linear memory. Physically though, the kernel starts
executing at 1MiB.

In order to run the kernel at 2GiB, an initial Page-Directory is set up so that kernel code
is accessible both through its physical location and above 2GiB. After the
end of the linear kernel location, mappings to free physical pages are added, which
we can use as kernel memory. With this mapping in place, paging is enabled and the
first call to an address above 2GiB is made.

Actually, directly calling a kernel function before paging is enabled is impossible:
we would jump 2GiB away from our code. Therefore, during boot time, an offset
is subtracted from the function addresses before calling them. The call to void initialiseBootTimePaging(),
which lacks this measure, works because gcc optimises the
function call by inlining the code.

3.5.4 Sweb Post-Boot Paging Setup<br />

After the boot phase has finished, the linear memory of Sweb is partitioned as shown
in Table 3.9. Because of page protection, the kernel can directly access user space
code while access to kernel code is denied to the user program, even though kernel and
user program share the same linear address space. More exactly, every user program's
Page-Directory also contains the mapping of the kernel pages. The identity mapping
(with a 3GiB offset) beyond 3GiB is used to communicate with hardware like the frame
buffer.



3.6 Sweb Task Separation and Protection<br />

3.6.1 Segment Descriptor Privilege Level / Task Privilege Level<br />

As shown in Table 3.8, Sweb uses only privilege level 0 for kernel code and 3 for user
code. User threads running user code start with CS and DS referring to descriptors 5
(LINEAR_USR_CODE_SEL) and 4 (LINEAR_USR_DATA_SEL) respectively.
Refer to low-level task switching and syscalls for details on when privilege levels are changed.

3.6.2 User- / Supervisor- Page<br />

Paging offers an additional mechanism concerning code privileges in the form of the user/supervisor access
flag in the Page-Directory. Pages where the user_access flag is not set may only
be accessed by code with privilege level 0, i.e. the kernel. In Sweb the user_access flag
is set only on mappings below 2GiB.

3.6.3 LoadPageOnDemand<br />

Instead of loading a new process completely into memory on process launch, only the
pages needed can be loaded, one at a time. In Sweb, on start of a new user process, all
virtual pages are non-present. The first time each user page is accessed, a page fault
happens and the Page-Fault-Handler is called. On deciding that the page is in the user
thread's linear address space below 2GiB, the user process's instance of class Loader
is called. Loader then checks the ELF header and loads the code and data from the
executable that match the missing page's linear address range.

It is further possible to remember if a page has never changed since it was loaded
on demand. Then, instead of swapping it out, it could simply be freed because it
can be reloaded from the executable. This approach however introduces some issues in
conjunction with shared pages, which are detailed in section 3.7.6.

3.6.4 UserThread Termination<br />

On termination of a user process, ArchMemory::freePageDirectory(..) is called. The
function recursively frees all mapped pages below 2GiB.

3.7 Extending Sweb<br />

3.7.1 Using 4MiB Pages<br />

In various operating systems, pages of 4MiB size are used to map kernel code because
the IA32 maintains different TLBs for 4KiB and 4MiB pages. The advantage would
be that the kernel has its own page cache that would remain unchanged on a task
switch. In the beginning Sweb used one 4MiB kernel page. Later, it was more important
to prevent and catch errors by marking parts (

ArchMemory functions always take page numbers which depend on the given page
size as arguments, i.e. the 2nd 4MiB Page and the 2048th 4KiB Page would start at
the same linear address. In order to map 4MiB pages in Sweb during runtime, the
PageManager needs to be modified. Since the PageManager manages 4MiB pages as
1024 consecutive 4KiB pages, it needs to be able to find 1024 consecutive free pages
starting from a page number that is a multiple of 1024. When a free 4MiB page has
been found, ArchMemory::map( , ,
, , PAGE_SIZE*1024); can be used to create
a mapping. Unmapping is already supported.
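The search for a free 4MiB page described above can be sketched as a scan over aligned blocks of 1024 entries. The function findFree4MPage and the use of a std::vector<bool> as a stand-in for the page usage table are assumptions made for this example; the real PageManager stores a usage type per page and would also need to hold its lock during the scan.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t PAGES_PER_4M = 1024; // 4MiB / 4KiB

// Hypothetical helper: returns the number of the first free 4MiB page, i.e. the
// index of the first 1024-aligned run of 1024 free 4KiB pages, or -1 if none exists.
inline std::int64_t findFree4MPage(const std::vector<bool>& is_free)
{
  for (std::size_t start = 0; start + PAGES_PER_4M <= is_free.size(); start += PAGES_PER_4M)
  {
    bool all_free = true;
    for (std::size_t p = start; p < start + PAGES_PER_4M; ++p)
    {
      if (!is_free[p])
      {
        all_free = false;
        break;
      }
    }
    if (all_free)
      return static_cast<std::int64_t>(start / PAGES_PER_4M);
  }
  return -1;
}
```

Stepping in units of 1024 enforces the alignment requirement directly, so a run of 1024 free pages that straddles a 4MiB boundary is never reported.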

3.7.2 Additional pages for the kernel<br />

Currently kernel memory is limited to 4MiB. Even though dynamically mapping pages
beyond the 4MiB area of the kernel is possible, it is easier to make more
memory available at boot time by changing the file init_boottime_pagetables.cpp. To do this,
Page-Directory-Entry 513 (from 2GiB+4MiB to 2GiB+8MiB) needs to be present and
point to a matching Page-Table, which can then map up to 4MiB more memory. Since
PageManager detects kernel pages automatically at initialisation time, all that remains
is to modify the pointer end_address in main.cpp.

Note however that 4MiB is often enough. The kernel code itself uses about
500KiB. PageManager usually needs less than 10KiB with 32MiB RAM and KMM uses
about 4KiB per allocation. Additionally FiFos and RingBuffers use about 50KiB of buffer
space. The conclusion is that running out of kernel memory in Sweb is more likely
the result of a bug than of an actual shortage.

3.7.3 Sharing Memory between linear address spaces

From what the reader of this document now knows about paging, it is clear that
sharing memory between different linear address spaces (and their programs) means
sharing pages. Sweb does this already: kernel pages are shared between all
user processes because their mapping exists in all page directories. So all that needs
to be done is to map one linear memory page of different user processes to the same
physical page. Finally, the global_page flag of the mapping needs to be set, in order to
distinguish it from non-shared mappings and instruct the CPU to keep the mapping
cached on a task switch. In order for the CPU to honour the global flag, the PGE flag
in CR4 needs to be set (see table 3.4). Furthermore, mutual exclusion between user
programs that share pages must not be forgotten.

Care must be taken when a user process with shared pages terminates. The
PageManager must be changed so that a page is marked free only if all owning processes
have freed it.
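One way to meet the last requirement is a per-page owner count. The sketch below is only an illustration under that assumption: the class SharedPageCounter and its methods are hypothetical and do not exist in Sweb, and locking is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical owner counter: one counter per physical page.
class SharedPageCounter
{
public:
  explicit SharedPageCounter(std::size_t number_of_pages) : owners_(number_of_pages, 0) {}

  // Called whenever a process maps the physical page.
  void addOwner(std::uint32_t page) { ++owners_[page]; }

  // Called whenever a process unmaps the page or terminates.
  // Returns true only when the last owner is gone and the page may be marked free.
  bool removeOwner(std::uint32_t page)
  {
    if (owners_[page] == 0)
      return false; // page was not owned at all, nothing to free
    --owners_[page];
    return owners_[page] == 0;
  }

  std::uint32_t ownerCount(std::uint32_t page) const { return owners_[page]; }

private:
  std::vector<std::uint32_t> owners_;
};
```

With such a counter, freePage() would only mark a page as PAGE_FREE when removeOwner() returns true.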

3.7.4 Cloning a Task's linear address space

Syscalls like clone() or fork() require the creation of a new user process whose linear
address space is identical to that of an existing user process. Unless we elegantly use
CopyOnWrite, this means manually copying every page to a new physical memory location
and creating a new Page-Directory that maps to the new locations. This could be
done from the old user thread's linear memory context by referencing the new physical
locations through the identity mapping above 3GiB.

Figure 3.4: Page-Fault-Handling with possible CopyOnWrite<br />

3.7.5 CopyOnWrite<br />

A more efficient way would be not to copy all pages but, on demand, to copy only the pages
that get modified. Such a mechanism is called CopyOnWrite. We might implement
CopyOnWrite in Sweb by first marking all user pages as read-only. When creating a
new process, instead of copying the kernel page directory, we copy the one from the
original user process. This way, we can save memory space as well as the time it takes
to create a new user process.

Because the pages of both user processes are read-only, writing to them will cause a page
fault. Flag WP in CR0 should be set, so this happens even on a write performed by
kernel code. To copy pages on demand the page fault handler needs to be modified to
recognise a shared page and copy that page to a new location where it can be written
to. An example of how the page fault handler might work can be found in Figure 3.4.

Obvious care is necessary in administrating shared physical pages, i.e. a shared
page can only be freed if all user processes using it have terminated. A possible
solution might be to manage a reference counter for each physical page; however, in
combination with virtual memory this is not enough.
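The mechanism can be tried out in a small user-space simulation. The sketch below is not kernel code and does not follow Figure 3.4 in detail. It models physical frames as byte vectors and an address space as a map from virtual page to frame, and it treats a write to a read-only mapping as the page fault. All names (Mapping, PhysicalMemory, cloneCopyOnWrite, write) are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct Mapping
{
  std::uint32_t frame; // physical frame number
  bool writeable;      // models the writeable bit of the Page-Table-Entry
};

using AddressSpace = std::map<std::uint32_t, Mapping>; // virtual page -> mapping

struct PhysicalMemory
{
  std::vector<std::vector<std::uint8_t>> frames; // contents of each frame
  std::vector<std::uint32_t> ref_count;          // number of owners per frame

  std::uint32_t allocate()
  {
    frames.push_back(std::vector<std::uint8_t>(4096));
    ref_count.push_back(1);
    return static_cast<std::uint32_t>(frames.size() - 1);
  }
};

// Clones an address space without copying any frame: both copies share all
// frames and every mapping becomes read-only.
inline AddressSpace cloneCopyOnWrite(AddressSpace& parent, PhysicalMemory& mem)
{
  for (auto& entry : parent)
  {
    entry.second.writeable = false;
    ++mem.ref_count[entry.second.frame];
  }
  return parent; // the child starts as a copy of the parent's mappings
}

// Writes one byte; a write to a read-only mapping plays the role of the page fault.
inline void write(AddressSpace& as, PhysicalMemory& mem, std::uint32_t vpage,
                  std::size_t offset, std::uint8_t value)
{
  Mapping& m = as.at(vpage);
  if (!m.writeable)
  {
    if (mem.ref_count[m.frame] > 1)
    {
      // still shared: copy the frame and let this address space use the copy
      const std::uint32_t old_frame = m.frame;
      const std::uint32_t copy = mem.allocate();
      mem.frames[copy] = mem.frames[old_frame];
      --mem.ref_count[old_frame];
      m.frame = copy;
    }
    m.writeable = true; // sole owner now, writing is allowed again
  }
  mem.frames[m.frame][offset] = value;
}
```

Note that the last remaining owner of a frame does not copy it again; the fault handler only restores write access. This is the case a real implementation would detect through the reference counter mentioned above.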


3.7.6 Virtual Memory<br />

Virtual Memory Basics<br />

Physical memory is expensive and often the memory available is not enough. Conveniently,
not all programs or even all parts of a program are needed in physical memory
at the same time. Virtual Memory is the concept of swapping out some less used data
and code from physical memory to a less expensive medium with more capacity, like a
hard disk device. In essence a part of the hard disk becomes a virtual extension of
the physical memory; this part is sometimes referred to as swap or page file, space
or partition. Paging already partitions the memory into equally sized pages which are
easy to track and swap around. Figure 3.5 shows an example mapping using virtual
memory on a hard disk as opposed to Figure 3.2.

To implement Virtual Memory, the operating system must:<br />

• know which pages in the swap space are used,

• keep track of where in physical or swap memory the data on a virtual page really is,

• catch attempts to access virtual pages that are not in physical memory and

• know which pages are not currently needed in physical memory and which processes they belong to.

Sweb Virtual Memory<br />

Sweb uses the class PageManager to manage which pages of physical memory are
in use or free. A class SwapManager might accomplish the same for a static swap
space on the hard disk. The Page-Directory defines where a virtual page is in physical
memory; something similar is needed for swap memory. Under certain conditions, one
Page-Directory might hold both physical and swap page numbers, but a way to distinguish
unloaded pages from swapped out pages is necessary. For example, unloaded
pages might not only be non-present but also have their page number set to a certain value.

Knowing if a page is present in physical memory ties in with catching access attempts.
If a virtual page is accessed and that page entry's present bit is not set, the
CPU will cause a page fault. An extended Page-Fault-Handler (see Figure 3.6) might
check if a missing page can be found in the swap space before trying to load it from
the executable or deciding that an error has happened. The missing page can then be
loaded into physical memory and, after updating the Page-Table-Entry, the program
can continue.
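The suggestion of marking unloaded pages with a special page number can be sketched as follows. The marker value FFFFFh, the struct Entry and the function locate are assumptions for this example. A real implementation would have to make sure the marker can never collide with a valid swap page number.

```cpp
#include <cstdint>

// Assumption: the highest 20-bit page number is reserved as "never loaded".
constexpr std::uint32_t UNLOADED_MARKER = 0xFFFFF;

// Reduced model of a Page-Table-Entry: only the fields needed here.
struct Entry
{
  bool present;
  std::uint32_t page_number; // physical page if present, else swap page or marker
};

enum class PageLocation
{
  PhysicalMemory,
  SwapSpace,
  Executable
};

// Decides where the Page-Fault-Handler has to look for the page contents.
inline PageLocation locate(const Entry& entry)
{
  if (entry.present)
    return PageLocation::PhysicalMemory;
  if (entry.page_number == UNLOADED_MARKER)
    return PageLocation::Executable; // not loaded yet, ask the Loader
  return PageLocation::SwapSpace;    // page_number is a swap page number
}
```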

Page Replacement Algorithm<br />

When physical memory runs short, at least one page needs to be swapped out for
another to be loaded. Deciding which page to evict is the function of the Page Replacement
Algorithm and a matter of much consideration and research. For an introduction to and
explanation of several Page Replacement Algorithms see Chapters 4.4 and 4.5 in [17].

Key to most Page Replacement Algorithms are the .accessed and .dirty flags of Page-
Table-Entries (see table 3.3.3). These flags are set by the CPU when a read or write
operation is performed. More exactly, on executing a read or write operation on a linear
address, the CPU must translate that address into a physical address by consulting
the Page-Directory and Page-Tables, and in the process sets the appropriate flags.

Figure 3.5: Example mapping using virtual memory.

Figure 3.6: Sweb Page-Fault-Handling and possible Virtual Memory Extension

By polling and clearing these flags, the operating system can know which pages are
frequently used or have been changed. Frequently used pages are bad candidates for
eviction from physical memory since they will probably be needed sooner than others.
Non-dirty pages, i.e. pages that have not been modified, might be reloaded from the executable, or
a copy might still exist in the swap space from an earlier page replacement operation.
Such pages can be freed immediately, saving the time of writing to a swap file. Note
that Page Replacement Algorithms always work on the whole of physical memory and
not only on the active Page-Directory.
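As one concrete instance of polling and clearing the accessed flag, the following sketch implements the well-known clock (second chance) algorithm. It is an illustration only, not something Sweb provides. The struct Frame stands in for the per-page flags that a real kernel would read from the Page-Table-Entries, and chooseVictim and needsWriteBack are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for the flags of the Page-Table-Entry mapping a physical page.
struct Frame
{
  bool accessed;
  bool dirty;
};

// Clock algorithm: advances the hand until a frame with a cleared accessed flag
// is found. Frames that were accessed get a second chance and have the flag cleared.
// Precondition: frames is not empty.
inline std::size_t chooseVictim(std::vector<Frame>& frames, std::size_t& hand)
{
  for (;;)
  {
    const std::size_t current = hand;
    hand = (hand + 1) % frames.size();
    if (!frames[current].accessed)
      return current;
    frames[current].accessed = false;
  }
}

// Only modified pages have to be written to the swap space before eviction.
inline bool needsWriteBack(const Frame& frame)
{
  return frame.dirty;
}
```

The loop terminates after at most two sweeps, because the first sweep clears every accessed flag it passes.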

Inverted Page Table<br />

When a virtual page is evicted from physical memory, all Page-Directories and Page-
Tables pointing to that physical page need their present flag to be unset. The problem
of finding the virtual page numbers mapping to that physical page can be optimised
with an Inverted Page Table. The Inverted Page Table is a relation between a physical
page number and the virtual pages mapping to it. For each physical page number, it
needs to store at least:

• the virtual page number(s) mapping to it and

• the Page-Directory or user program the virtual page belongs to.

To simplify matters, the data structure can assume that only one page per virtual address
space maps to one physical page. The table can also be used to hold information
about dirty and accessed bits, but care must be taken to keep this information up to
date. When sharing pages, the Inverted Page Table must be able to remember, per
physical page, several user programs together with their respective virtual page numbers.
For further details see Chapter 4.3.4 of [17].
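A minimal data structure along these lines might look as follows. The class InvertedPageTable, the struct VirtualOwner and the choice of identifying an address space by the physical page number of its Page-Directory are assumptions made for this sketch. A kernel implementation would also need locking and a more compact representation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One (address space, virtual page) pair that maps to a physical page.
struct VirtualOwner
{
  std::uint32_t page_directory_page; // identifies the address space
  std::uint32_t virtual_page;
};

// Hypothetical Inverted Page Table: for each physical page, the list of owners.
class InvertedPageTable
{
public:
  explicit InvertedPageTable(std::size_t number_of_physical_pages)
      : owners_(number_of_physical_pages) {}

  void addMapping(std::uint32_t physical_page, VirtualOwner owner)
  {
    owners_[physical_page].push_back(owner);
  }

  // Removes the first matching owner entry, if there is one.
  void removeMapping(std::uint32_t physical_page, VirtualOwner owner)
  {
    std::vector<VirtualOwner>& list = owners_[physical_page];
    for (std::size_t i = 0; i < list.size(); ++i)
    {
      if (list[i].page_directory_page == owner.page_directory_page &&
          list[i].virtual_page == owner.virtual_page)
      {
        list.erase(list.begin() + static_cast<std::ptrdiff_t>(i));
        return;
      }
    }
  }

  // All mappings that need their present flag cleared when the page is evicted.
  const std::vector<VirtualOwner>& ownersOf(std::uint32_t physical_page) const
  {
    return owners_[physical_page];
  }

private:
  std::vector<std::vector<VirtualOwner>> owners_;
};
```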

Sharing Pages In Virtual Memory<br />

When implementing CopyOnWrite in a virtual memory environment, what applied to
physical memory now applies to swap memory as well. With a good page replacement
algorithm, a shared page will be accessed more often and therefore be a less likely
candidate for eviction. Still, when a shared page is swapped out, not one but several
Page-Directories need to be updated. Furthermore, since it proved very efficient to
share a page in physical memory, there is no reason to abolish the concept in swap
memory. An Inverted Swap Page Table is needed to hold information about each swap
page and its owners. Otherwise, the operating system could not know which Page-
Directories to update on swap-in.

Sharing Unloaded Pages<br />

Handling CopyOnWrite and LoadPageOnDemand smartly with virtual memory is more
complex. With on-demand loading of pages from an executable, a virtual page can
be in one of three places: physical memory, swap memory or the executable. The
question is whether, in addition to sharing pages in physical and swap memory, unloaded
pages can be shared as well. The following design issue arises: how to remember
which processes have been cloned and need a still unloaded page? A table of pages
that can be loaded from an executable in their initial state is necessary. Not unlike an
Inverted Page Table, this table might hold the user processes that share an unloaded
page and whose Page-Directories need to be updated. This information
would have to be updated every time a process is cloned and an executable's page stops or
starts being shared.

If it is possible to know that a shared page has never been changed in physical
memory and such a page is evicted from physical memory, swapping might be omitted.
Instead the virtual page can be shared in the executable again.

Of course, the issue can be avoided by simply not sharing unloaded pages.



Chapter 4<br />

Kernelspace Memory Management

4.1 General<br />

Figure 4.1: linear address space<br />

The kernel memory manager is initialized right after the virtual memory management (VMM).<br />

After the VMM initialization finishes, some linear address space starting at 2 GB is mapped for the kernel.<br />

At the front of this memory area lies the SWEB kernel image. The kernel memory manager (KMM) is put in place behind the kernel image during initialization.<br />

The kernel-space memory management works in the linear address space provided by the VMM.<br />


The KMM only works with linear addresses. This is why the implementation of the KMM is independent of the hardware architecture SWEB is running on.<br />

The header and source files containing the KMM and affiliated classes can be found in “common/source/mm/” and “common/include/mm”.<br />

4.2 Memory Organisation<br />

Figure 4.2: memory segments<br />

The memory managed by the KMM is handled as a doubly linked list of memory segments, i.e. each memory segment is a node of this list.<br />

To implement this type of memory management we designed an object called “MallocSegment”, which serves as the header of a memory segment.<br />

This header object holds a pointer to the next element of the list, a pointer to the previous element and some state information about the memory segment.<br />

The “MallocSegment” object also provides some elementary set and get functions, which are used by the “KernelMemoryManager” for memory segment manipulation.<br />


4.3 Important Classes<br />

4.3.1 KernelMemoryManager<br />

KernelMemoryManager provides public methods for allocating, reallocating and freeing memory. These three methods use some elementary methods for manipulating memory segments.<br />

Elementary segment manipulation methods are provided for:<br />

• finding a free segment<br />

• freeing a segment<br />

• merging an unused segment with the following unused segment<br />

• shrinking a segment to a smaller size (and adding a new segment in the resulting free space)<br />

Interface<br />

Method: static uint32 createMemoryManager (pointer start_address, pointer end_address)<br />
Description: Creates an instance of KernelMemoryManager. This method calls the constructor of KernelMemoryManager using placement new. start_address is the first free memory address. end_address is the last address where memory is accessible (i.e. the end of the mapped address space).<br />

Method: pointer allocateMemory (size_t requested_size);<br />
Description: allocateMemory searches the MallocSegment list for a free segment with size >= requested_size and returns a pointer to the allocated memory, or 0 if not enough memory is available. This method is used by new and kmalloc. requested_size is the number of bytes to allocate.<br />

Method: bool freeMemory (pointer virtual_address);<br />
Description: freeMemory checks whether the given address points to an actual memory segment and marks it as unused. If possible, freeMemory merges the segment with the free segments around it. freeMemory returns true if the segment was freed, or false if the address was wrong. virtual_address is the memory address which was originally returned by allocateMemory.<br />

Method: pointer reallocateMemory (pointer virtual_address, size_t new_size);<br />
Description: reallocateMemory resizes an already allocated memory segment or moves the entire segment somewhere else. It returns the (possibly altered) pointer to the resized memory segment, or 0 if an error occurs. WARNING: code must therefore assume that the memory address will change after using reallocateMemory. virtual_address is the address of the segment to resize. new_size is the new size of the segment. Note: The function acts like freeMemory if new_size is set to 0.<br />

Method: void startUsingSyncMechanism ();<br />
Description: startUsingSyncMechanism is called from startup() after the scheduler has been created and just before the interrupts are turned on.<br />

Table 4.1: KernelMemoryManager Interface<br />
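The use of placement new mentioned for createMemoryManager can be illustrated with a small sketch. It assumes, purely for illustration, that the manager object is constructed at the very start of the region it manages; DemoMemoryManager and createDemoManager are invented names and do not mirror the real class.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>  // placement new

// Hypothetical stand-in for the manager; it only remembers the managed range.
class DemoMemoryManager {
public:
    DemoMemoryManager(std::uintptr_t first_free, std::uintptr_t end)
        : first_free_(first_free), end_(end) {}
    std::size_t managedBytes() const { return end_ - first_free_; }

private:
    std::uintptr_t first_free_;
    std::uintptr_t end_;
};

// Construct the manager inside the memory range it is going to manage.
// No heap is needed: placement new only runs the constructor at 'start'.
// 'start' must be suitably aligned for DemoMemoryManager.
DemoMemoryManager* createDemoManager(unsigned char* start, unsigned char* end) {
    std::uintptr_t after_manager =
        reinterpret_cast<std::uintptr_t>(start) + sizeof(DemoMemoryManager);
    return new (start) DemoMemoryManager(after_manager,
                                         reinterpret_cast<std::uintptr_t>(end));
}
```

Placement new is needed here because no allocator exists yet at the time the allocator itself is created.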

4.3.2 MallocSegment<br />

Figure 4.3: MallocSegment<br />

Objects of the class MallocSegment are used as nodes of the doubly linked list in which the memory segments are stored.<br />

Besides the previous and next pointers, MallocSegment also holds a marker and a size_flag_ variable.<br />

The size_flag_ variable contains the size of the memory segment and a used flag, which marks the segment as used.<br />

Every time the KMM searches the MallocSegment list or performs any operation on a memory segment, it checks whether the marker is where expected. The integrity of the memory segments is thus checked continuously, which is also a simple form of access violation detection.<br />

The size_flag_ variable is private and is accessed through get and set methods.<br />

Interface<br />

Method: MallocSegment (MallocSegment *prev, MallocSegment *next, size_t size, bool used);<br />
Description: The constructor. *prev is the pointer to the previous MallocSegment in the list. *next is the pointer to the next MallocSegment in the list. size is the size (in bytes) of the memory that the described memory segment provides. Note: “this + sizeof(MallocSegment) + size” is usually the start of the next segment. used describes whether the segment is allocated or free.<br />

Method: size_t getSize ();<br />
Description: Returns the size of the segment in bytes.<br />

Method: void setSize (size_t size);<br />
Description: Sets the size of the segment in bytes. size is the size to which the segment should be set.<br />

Method: bool getUsed ();<br />
Description: Returns true if the segment is allocated, false if it is unused.<br />

Method: void setUsed (bool used);<br />
Description: Sets the state of the segment as used (used=true) or free (used=false).<br />

Table 4.2: MallocSegment Interface<br />
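One possible way to keep the size and the used flag in a single size_flag_ variable is sketched below. The bit layout (used flag in the most significant bit), the marker value and the class name DemoSegment are assumptions made for this example; the text above does not state how SWEB actually encodes them.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of a segment header. The marker value and the bit layout of
// size_flag_ are assumptions made for this example.
class DemoSegment {
public:
    DemoSegment(DemoSegment* prev, DemoSegment* next, std::size_t size, bool used)
        : marker_(kMarker), next_(next), prev_(prev), size_flag_(0) {
        setSize(size);
        setUsed(used);
    }

    // The size occupies all bits except the most significant one.
    std::size_t getSize() const { return size_flag_ & ~kUsedBit; }
    void setSize(std::size_t size) {
        size_flag_ = (size_flag_ & kUsedBit) | (size & ~kUsedBit);
    }

    // The most significant bit stores the used flag.
    bool getUsed() const { return (size_flag_ & kUsedBit) != 0; }
    void setUsed(bool used) {
        size_flag_ = used ? (size_flag_ | kUsedBit) : (size_flag_ & ~kUsedBit);
    }

    // Integrity check: a clobbered marker hints at an out-of-bounds write.
    bool markerOk() const { return marker_ == kMarker; }

    std::uint32_t marker_;
    DemoSegment* next_;
    DemoSegment* prev_;

private:
    static const std::uint32_t kMarker = 0xDEADDEADu;
    static const std::size_t kUsedBit =
        static_cast<std::size_t>(1) << (sizeof(std::size_t) * 8 - 1);
    std::size_t size_flag_;
};
```

Packing both values into one word keeps the header small, which matters because every allocation pays for one header.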

4.4 Used Algorithms<br />

4.4.1 allocating memory<br />

Allocating memory is done by the allocateMemory method in KernelMemoryManager. The memory manager searches the memory segment list with a first-fit algorithm and shrinks the segment it finds to the requested size. A new segment is inserted into the list in the memory area after the shrunk segment. The segment is not shrunk if the space behind it would be too small for a new segment. If the segment is shrunk (and a new segment is inserted behind it), the KMM tries to merge the new segment with the following segment.<br />

4.4.2 freeing memory<br />

Freeing memory is done by the freeMemory method in KernelMemoryManager. If the previous and/or next segment is marked as unused, the unused segments are merged into one big segment, which is marked as unused.<br />
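The allocation and freeing steps of sections 4.4.1 and 4.4.2 can be sketched as a small first-fit heap over a caller-supplied buffer. This is a simplified illustration under stated assumptions, not the KernelMemoryManager itself: Seg and DemoHeap are invented names, the header has no marker, and sizes are rounded up so that headers stay aligned.

```cpp
#include <cstddef>
#include <new>

struct Seg {
    Seg* prev;
    Seg* next;
    std::size_t size;  // payload bytes following this header
    bool used;
};

class DemoHeap {
public:
    // 'memory' must be suitably aligned for Seg and larger than sizeof(Seg).
    DemoHeap(void* memory, std::size_t bytes)
        : first_(new (memory) Seg{nullptr, nullptr, bytes - sizeof(Seg), false}) {}

    void* allocate(std::size_t requested) {
        requested = roundUp(requested);
        for (Seg* s = first_; s != nullptr; s = s->next) {  // first fit
            if (s->used || s->size < requested) continue;
            split(s, requested);
            s->used = true;
            return payload(s);
        }
        return nullptr;  // not enough memory
    }

    bool release(void* address) {
        for (Seg* s = first_; s != nullptr; s = s->next) {
            if (payload(s) != address || !s->used) continue;
            s->used = false;
            mergeWithNext(s);                        // absorb a free successor
            if (s->prev != nullptr && !s->prev->used)
                mergeWithNext(s->prev);              // free predecessor absorbs s
            return true;
        }
        return false;  // address was not returned by allocate()
    }

    std::size_t segmentCount() const {
        std::size_t n = 0;
        for (Seg* s = first_; s != nullptr; s = s->next) ++n;
        return n;
    }

private:
    static std::size_t roundUp(std::size_t n) {
        const std::size_t a = alignof(Seg);
        return (n + a - 1) / a * a;
    }
    static void* payload(Seg* s) {
        return reinterpret_cast<unsigned char*>(s) + sizeof(Seg);
    }

    // Shrink s to 'size' and insert a new free segment behind it, unless the
    // remainder is too small for a header plus one aligned unit of payload.
    static void split(Seg* s, std::size_t size) {
        if (s->size - size < sizeof(Seg) + alignof(Seg)) return;
        void* where = static_cast<unsigned char*>(payload(s)) + size;
        Seg* rest = new (where) Seg{s, s->next, s->size - size - sizeof(Seg), false};
        if (s->next != nullptr) s->next->prev = rest;
        s->next = rest;
        s->size = size;
        mergeWithNext(rest);  // try to merge the new segment with its successor
    }

    // Merge s with the following segment if that one exists and is unused.
    static void mergeWithNext(Seg* s) {
        Seg* n = s->next;
        if (n == nullptr || n->used) return;
        s->size += sizeof(Seg) + n->size;
        s->next = n->next;
        if (n->next != nullptr) n->next->prev = s;
    }

    Seg* first_;
};
```

Because free neighbours are merged on every release, the list never contains two adjacent unused segments, which keeps first-fit searches short.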


4.4.3 reallocating memory<br />

Reallocation of memory is also done in KernelMemoryManager.<br />

The method is named reallocateMemory and takes an address and the requested size as parameters. The kernel memory manager finds the MallocSegment which contains the virtual address and checks the space after this segment. If there is enough free space behind it, the kernel memory manager merges the segments into a bigger one and returns the given pointer, which now points to the bigger segment. If there is not enough space, the kernel memory manager searches for a segment which is large enough, allocates the requested amount of memory, moves the data from the old memory location to the new one, frees the old segment and returns a new pointer.<br />

4.5 Interface<br />

4.5.1 Functions<br />

Three functions for allocating and freeing memory are provided in the files “kmalloc.cpp” and “kmalloc.h”. These functions call the memory management methods of the kernel memory manager. They can be used like the well-known functions malloc, free and realloc.<br />

Function: void* kmalloc (size_t size);<br />
Description: allocates size bytes and returns a pointer to the allocated memory.<br />

Function: void kfree(void * address);<br />
Description: frees the memory space pointed to by address, which must have been returned by a previous call to kmalloc() or krealloc().<br />

Function: void* krealloc(void * address, size_t size);<br />
Description: changes the size of the memory block pointed to by address to size bytes.<br />

Table 4.3: provided Functions<br />

Provided operators are:<br />

• new<br />

• delete<br />

• new[]<br />

• delete[]<br />

These operators also call the memory management methods of the kernel memory manager and can be used as usual. To use them, include “new.h”.<br />
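A typical caller-side pattern for the three functions is sketched below. So that the example runs outside the kernel, it uses stand-ins built on the C standard library; demo_kmalloc, demo_kfree, demo_krealloc and appendString are hypothetical names, and inside SWEB the functions from “kmalloc.h” would take their place.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Stand-ins so the example runs outside the kernel (assumption: the real
// functions can be used like malloc, free and realloc, as stated above).
void* demo_kmalloc(std::size_t size) { return std::malloc(size); }
void demo_kfree(void* address) { std::free(address); }
void* demo_krealloc(void* address, std::size_t size) {
    return std::realloc(address, size);
}

// Append 'extra' to a heap-allocated C string, growing the buffer.
// Returns the (possibly moved) buffer, or 0 on failure. With the
// std::realloc stand-in the old buffer stays valid on failure; whether
// krealloc behaves the same way is not stated in this chapter.
char* appendString(char* buffer, const char* extra) {
    std::size_t old_len = std::strlen(buffer);
    std::size_t new_size = old_len + std::strlen(extra) + 1;
    // Keep the old pointer until the call has succeeded.
    char* grown = static_cast<char*>(demo_krealloc(buffer, new_size));
    if (grown == 0) return 0;
    std::strcpy(grown + old_len, extra);
    return grown;
}
```

The important point is that the caller continues with the returned pointer only, since reallocation may move the data to a different segment.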



Chapter 5<br />

Threads and Processes<br />

5.1 Introduction<br />

The kernel of SWEB is multithreaded, so we will now discuss what is necessary for thread switching. In order to switch between two threads we first have to interrupt the currently running thread. Typically this happens through either a timer interrupt or another interrupt (e.g. one caused by yield()). Inside the interrupt handler we have to save the current state of the CPU, which is called the thread context. Then we call Scheduler::schedule(), where a runnable thread is selected. Before leaving the interrupt handler, the context of the new thread has to be loaded.<br />

5.2 Details for x86<br />

In this section we will have a look at how the low-level aspects of multithreading work on different platforms and different operating systems. We will also explain how SWEB implements threads and processes on the x86 platform.<br />

A word of warning though: all of the details described here are valid for 32-bit protected mode only.<br />

A typical x86 CPU only has a very limited number of hardware registers:<br />

• The four general purpose registers EAX, EBX, ECX and EDX.<br />

• 2 stack registers called ESP and EBP.<br />

• 2 index registers called EDI and ESI.<br />

• 4 segment registers called CS, ES, DS and SS (FS and GS exist as well, but are not used in the listings below).<br />

• The instruction pointer register called EIP.<br />

• The flags register called EFLAGS.<br />

Besides these basic registers there are a number <strong>of</strong> special registers which control<br />

things like paging and interrupt handling. Additions and enhancements to the x86<br />

specifications (like mmx, sse or 3dnow) added additional cpu registers, mostly used<br />

by the fpu.<br />


But a simple thread which does not deal with special things will only use the registers<br />

listed above. Some threads (userspace processes) are not even allowed to use<br />

registers besides the ones listed above.<br />

5.2.1 Example for saving the CPU registers<br />

The assembly listing in Figure 5.1 is an example of how to save the CPU registers in an interrupt handler. The function first determines whether the interrupt happened in user mode or kernel mode and then writes the CPU registers into a special data structure. “currentThreadInfo” is a pointer to this data structure.<br />

Right before this function is called, all CPU registers have been pushed onto the stack by using a “push all” CPU instruction.<br />
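The layout of the data structure can be reconstructed from the offsets used in the assembly listings of this chapter. The sketch below is such a reconstruction, not the real declaration: DemoThreadInfo and cameFromUserMode are invented names, and the fields called unknown_* stand for offsets that the listings never touch.

```cpp
#include <cstddef>
#include <cstdint>

// Layout inferred from the offsets in Figures 5.1, 5.2 and 5.5.
struct DemoThreadInfo {
    std::uint32_t eip;         //  0
    std::uint32_t cs;          //  4
    std::uint32_t eflags;      //  8
    std::uint32_t eax;         // 12
    std::uint32_t ecx;         // 16
    std::uint32_t edx;         // 20
    std::uint32_t ebx;         // 24
    std::uint32_t esp;         // 28
    std::uint32_t ebp;         // 32
    std::uint32_t esi;         // 36
    std::uint32_t edi;         // 40
    std::uint32_t ds;          // 44
    std::uint32_t es;          // 48
    std::uint32_t unknown_52;  // not used in the listings
    std::uint32_t unknown_56;  // not used in the listings
    std::uint32_t ss;          // 60
    std::uint32_t unknown_64;  // not used in the listings
    std::uint32_t esp0;        // 68
    std::uint32_t unknown_72;  // not used in the listings
    std::uint32_t cr3;         // 76
};

static_assert(offsetof(DemoThreadInfo, eflags) == 8, "eflags offset");
static_assert(offsetof(DemoThreadInfo, esp) == 28, "esp offset");
static_assert(offsetof(DemoThreadInfo, es) == 48, "es offset");
static_assert(offsetof(DemoThreadInfo, ss) == 60, "ss offset");
static_assert(offsetof(DemoThreadInfo, cr3) == 76, "cr3 offset");

// Same test as the first instructions of arch_saveThreadRegisters: the two
// lowest bits of the saved cs hold the privilege level; 3 means user mode.
inline bool cameFromUserMode(std::uint32_t saved_cs) {
    return (saved_cs & 0x03) == 0x03;
}
```

The static_assert lines document the dependency between the C++ structure and the hard-coded offsets in the assembly code; if one changes, the other has to follow.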

5.2.2 Example for reading back the CPU Registers<br />

The assembly listing in Figure 5.2 is an example of how to read back the CPU registers. In this case we assume that we only want to switch to a different thread in kernel mode. Switching to a userspace thread / process is almost the same.<br />

Loading the CPU registers is mostly straightforward; however, there is one special thing to take care of. Simply reading a register’s saved value and putting it into the register will not work for some special registers, because we could run into race conditions. On x86 the eflags register contains, among other things, the interrupt enable flag. Since we do not want to accidentally enable interrupts while writing back the CPU registers, we have to make sure that the last register we write is the eflags register. This, however, is not possible with a simple approach, since as soon as we write into the eip register, the next instruction the CPU executes depends on the new value of the eip register.<br />

Luckily x86 provides a special instruction that takes care of this problem: the “iret” instruction [15, Intel Manual]. This instruction reads back the eip, cs and eflags registers from the stack and, when returning to userspace, the esp and ss registers as well. Since this is only one CPU instruction, we have no problems.<br />

5.3 <strong>The</strong> FPU and other problems<br />

The FPU registers are a special case on x86. The amount of data that has to be copied on every thread switch to update the FPU status is huge compared to the general purpose registers. Luckily the number of programs that actually use the FPU is small, so almost all modern operating systems use a special trick to handle the FPU registers. Instead of copying them on every thread switch, one remembers which thread last used the FPU, and upon a thread switch the FPU is switched to a special mode. In this mode the FPU triggers a hardware exception on the next attempt to use it. If such an exception is raised, the operating system has to copy the data; but if only one thread uses the FPU, such an exception never needs to be raised and thus the FPU registers never have to be copied. One restriction arises from this, however: kernel mode threads must not access the FPU. This is no problem, since on all sane operating systems this is not allowed anyway.<br />
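The trick can be simulated in plain C++ to make the bookkeeping visible. The sketch below is a model only: DemoCpu, DemoTask and DemoFpuState are invented names, the "trap" is an ordinary function call, and no real FPU control bits are touched.

```cpp
#include <cstddef>

// Simulation of lazy FPU switching. All names are invented for this sketch.
struct DemoFpuState {
    double regs[8];
};

struct DemoTask {
    DemoTask() : saved_fpu() {}
    DemoFpuState saved_fpu;  // per-task copy, only updated when needed
};

class DemoCpu {
public:
    DemoCpu()
        : fpu_(), fpu_owner_(nullptr), current_(nullptr),
          trap_armed_(false), copies_(0) {}

    // Thread switch: do NOT copy the FPU registers, just arm the trap
    // unless the next task already owns the FPU contents.
    void switchTo(DemoTask* next) {
        current_ = next;
        trap_armed_ = (fpu_owner_ != next);
    }

    // Every FPU access of the current task goes through here.
    double& fpuRegister(std::size_t index) {
        if (trap_armed_) handleFpuTrap();
        return fpu_.regs[index];
    }

    // Number of times the FPU state actually had to be copied.
    std::size_t copies() const { return copies_; }

private:
    // Models the exception handler: save the old owner's state, load ours.
    void handleFpuTrap() {
        if (fpu_owner_ != nullptr) fpu_owner_->saved_fpu = fpu_;
        fpu_ = current_->saved_fpu;
        fpu_owner_ = current_;
        trap_armed_ = false;
        ++copies_;
    }

    DemoFpuState fpu_;     // the single hardware register set
    DemoTask* fpu_owner_;  // task whose values are currently in fpu_
    DemoTask* current_;
    bool trap_armed_;
    std::size_t copies_;
};
```

As long as only one task ever touches the FPU, copies() stays at 1 no matter how often the tasks are switched, which is exactly the saving described above.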


global arch_saveThreadRegisters<br />
arch_saveThreadRegisters:<br />
    mov ebx, dword [currentThreadInfo]<br />
    mov eax, dword [esp + 48]   ; get cs<br />
    and eax, 0x03               ; check cpl is 3<br />
    cmp eax, 0x03<br />
    je from_user<br />
from_kernel:<br />
    mov eax, dword [esp + 44]   ; save eip<br />
    mov dword [ebx], eax<br />
    mov eax, dword [esp + 48]   ; save cs<br />
    mov dword [ebx + 4], eax<br />
    mov eax, dword [esp + 52]   ; save eflags<br />
    mov dword [ebx + 8], eax<br />
    mov eax, dword [esp + 40]   ; save eax<br />
    mov dword [ebx + 12], eax<br />
    mov eax, dword [esp + 36]   ; save ecx<br />
    mov dword [ebx + 16], eax<br />
    mov eax, dword [esp + 32]   ; save edx<br />
    mov dword [ebx + 20], eax<br />
    mov eax, dword [esp + 28]   ; save ebx<br />
    mov dword [ebx + 24], eax<br />
    mov eax, dword [esp + 24]   ; save esp<br />
    add eax, 0xc<br />
    mov dword [ebx + 28], eax<br />
    mov eax, dword [esp + 20]   ; save ebp<br />
    mov dword [ebx + 32], eax<br />
    mov eax, dword [esp + 16]   ; save esi<br />
    mov dword [ebx + 36], eax<br />
    mov eax, dword [esp + 12]   ; save edi<br />
    mov dword [ebx + 40], eax<br />
    mov eax, [esp + 8]          ; save ds<br />
    mov dword [ebx + 44], eax<br />
    mov eax, [esp + 4]          ; save es<br />
    mov dword [ebx + 48], eax<br />
    ret<br />
from_user:<br />
    mov eax, dword [esp + 60]   ; save ss3<br />
    mov dword [ebx + 60], eax<br />
    mov eax, dword [esp + 56]   ; save esp3<br />
    mov dword [ebx + 28], eax<br />
    mov eax, dword [esp + 44]   ; save eip<br />
    mov dword [ebx], eax<br />
    mov eax, dword [esp + 48]   ; save cs<br />
    mov dword [ebx + 4], eax<br />
    mov eax, dword [esp + 52]   ; save eflags<br />
    mov dword [ebx + 8], eax<br />
    mov eax, dword [esp + 40]   ; save eax<br />
    mov dword [ebx + 12], eax<br />
    mov eax, dword [esp + 36]   ; save ecx<br />
    mov dword [ebx + 16], eax<br />
    mov eax, dword [esp + 32]   ; save edx<br />
    mov dword [ebx + 20], eax<br />
    mov eax, dword [esp + 28]   ; save ebx<br />
    mov dword [ebx + 24], eax<br />
    mov eax, dword [esp + 20]   ; save ebp<br />
    mov dword [ebx + 32], eax<br />
    mov eax, dword [esp + 16]   ; save esi<br />
    mov dword [ebx + 36], eax<br />
    mov eax, dword [esp + 12]   ; save edi<br />
    mov dword [ebx + 40], eax<br />
    mov eax, [esp + 8]          ; save ds<br />
    mov dword [ebx + 44], eax<br />
    mov eax, [esp + 4]          ; save es<br />
    mov dword [ebx + 48], eax<br />
    ret<br />

Figure 5.1: Saving the CPU registers in an interrupt handler



arch_switchThreadKernelToKernel:<br />
    mov ebx, dword [currentThreadInfo]<br />
    mov ecx, dword [g_tss]      ; tss<br />
    mov eax, dword [ebx + 68]   ; get esp0<br />
    mov dword [ecx + 4], eax    ; restore esp0<br />
    mov eax, dword [ebx + 12]   ; restore eax<br />
    mov ecx, dword [ebx + 16]   ; restore ecx<br />
    mov edx, dword [ebx + 20]   ; restore edx<br />
    mov esp, dword [ebx + 28]   ; restore esp<br />
    mov ebp, dword [ebx + 32]   ; restore ebp<br />
    mov esi, dword [ebx + 36]   ; restore esi<br />
    mov edi, dword [ebx + 40]   ; restore edi<br />
    mov es, word [ebx + 48]     ; restore es<br />
    mov ds, word [ebx + 44]     ; restore ds<br />
    push dword [ebx + 8]        ; push eflags<br />
    push dword [ebx + 4]        ; push cs<br />
    push dword [ebx + 0]        ; push eip<br />
    push dword [ebx + 24]<br />
    pop ebx                     ; restore ebx<br />
    iretd                       ; switch to next<br />

Figure 5.2: Reading back the cpu registers <strong>of</strong> a Thread<br />

5.4 Threads and Processes in <strong>SWEB</strong><br />

5.4.1 Kernel Threads<br />

A kernel thread in SWEB can be created by deriving from the “Thread” class and implementing the virtual “Run” method. The constructor of the Thread class then allocates a stack for the thread (typically 8 kilobytes) and a structure called “ArchThreadRegisters”, in which all the registers of the thread are stored. Then the thread registers are initialised so that the thread executes the Run method the first time it runs. This is done by calling a C function called “ThreadStartHack”, which in turn calls “currentThread->Run()”. As an alternative one could also call the Run method directly by pushing a valid this pointer onto the stack. This, however, would have the side effect of not being able to catch a return from the Run method. A return from the Run method would then lead to strange, irreproducible results, since there is no well-defined return address to jump to after the return. The Run method must never return; if it does, the kernel raises a panic to prevent further damage.<br />
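The idea of a common start function that calls Run() and catches its return can be sketched as follows. DemoThread, demoThreadStart, demo_current_thread and CounterThread are invented for this example; the real Thread class and ThreadStartHack are not shown in this chapter, and instead of a kernel panic the sketch merely records that Run() returned.

```cpp
// Minimal stand-in for a thread base class; not the SWEB class itself.
class DemoThread {
public:
    virtual ~DemoThread() {}
    virtual void Run() = 0;  // implemented by every concrete thread
};

// Thread that is about to run; in a kernel the scheduler would set this.
DemoThread* demo_current_thread = nullptr;
bool demo_run_returned = false;

// Well-defined entry point for every new thread: it calls Run() and
// catches the case that Run() returns. A kernel would panic here; the
// sketch only records the event.
void demoThreadStart() {
    demo_current_thread->Run();
    demo_run_returned = true;
}

// Example of a concrete thread created by deriving and implementing Run().
class CounterThread : public DemoThread {
public:
    CounterThread() : count(0) {}
    virtual void Run() {
        for (int i = 0; i < 3; ++i) ++count;
    }
    int count;
};
```

Because the initial instruction pointer refers to the start function rather than to Run() itself, a returning Run() lands in known code instead of jumping to an undefined address.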

Register Initialisation<br />

Figure 5.3 shows how the CPU registers of a kernel thread are initialised. Interesting things to see:<br />

• All segment registers are set to use kernel level segments.<br />

• The page directory (cr3 register) is set to use the kernel page tables (this way all kernel threads share the same virtual memory area).<br />

• The protection level (dpl) is set to kernel mode, which means that this thread can do virtually whatever it wants, without limitations.<br />

• The instruction pointer (eip) is set to the address of the start function.<br />


void ArchThreads::createThreadInfosKernelThread(ArchThreadInfo *&info,<br />
    pointer start_function, pointer stack)<br />
{<br />
  info = (ArchThreadInfo *)new uint8[sizeof(ArchThreadInfo)];<br />
  ArchCommon::bzero((pointer)info, sizeof(ArchThreadInfo));<br />
  pointer pageDirectory = VIRTUAL_TO_PHYSICAL_BOOT(<br />
      ((pointer)&kernel_page_directory_start));<br />
<br />
  info->cs = KERNEL_CS;<br />
  info->ds = KERNEL_DS;<br />
  info->es = KERNEL_DS;<br />
  info->ss = KERNEL_SS;<br />
  info->eflags = 0x200;<br />
  info->eax = 0;<br />
  info->ecx = 0;<br />
  info->edx = 0;<br />
  info->ebx = 0;<br />
  info->esi = 0;<br />
  info->edi = 0;<br />
  info->dpl = DPL_KERNEL;<br />
  info->esp = stack;<br />
  info->ebp = stack;<br />
  info->eip = start_function;<br />
  info->cr3 = pageDirectory;<br />
<br />
  /* fpu (=fninit) */<br />
  info->fpu[0] = 0xFFFF037F;<br />
  info->fpu[1] = 0xFFFF0000;<br />
  info->fpu[2] = 0xFFFFFFFF;<br />
  info->fpu[3] = 0x00000000;<br />
  info->fpu[4] = 0x00000000;<br />
  info->fpu[5] = 0x00000000;<br />
  info->fpu[6] = 0xFFFF0000;<br />
  kprintfd_nosleep("ArchThreads::createThreadInfosKernelThread: values done\n");<br />
}<br />

Figure 5.3: Initialisation of the Registers for a Kernel Thread<br />

• The FPU is initialised by setting known good values into the FPU registers.<br />
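The value 0x200 written to eflags in the listing corresponds to bit 9 of EFLAGS, the interrupt enable flag, so a new thread starts with interrupts enabled. A tiny sketch with invented helper names makes this explicit:

```cpp
#include <cstdint>

// Bit 9 of EFLAGS is the interrupt enable flag (IF).
const std::uint32_t kEflagsInterruptEnable = 1u << 9;  // == 0x200

// Returns true if the given EFLAGS value has interrupts enabled.
inline bool interruptsEnabled(std::uint32_t eflags) {
    return (eflags & kEflagsInterruptEnable) != 0;
}
```

All other flag bits are left at zero in the initial value.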

5.4.2 Userspace Processes<br />

A userspace process is a little more complicated than a kernel thread. To create a userspace process, one first has to create an ordinary kernel thread. This thread then has to load an executable by using the Loader. Then the thread has to create a second CPU register structure. After all this, the thread can set a special flag in its thread object to make the scheduler switch to the process the next time it switches to this thread. So basically a userspace process is a kernel thread with an additional set of CPU registers and, of course, its own address space. This additional set of CPU registers is initialised differently than for a normal kernel thread, as we will see in the next section.<br />


Register Initialisation<br />

Figure 5.4 shows how the CPU registers of a userspace thread are initialised. Interesting things to see:<br />

• All segment registers are set to use userspace segments. Only the ss0 entry is set to use a kernel segment.<br />

• The page directory (cr3 register) is set to use the kernel page tables. This serves as a failsafe in case one forgets to create a PageDirectory for the userspace process.<br />

• The protection level (dpl) is set to userspace mode, which means that this thread can only access what the kernel allows it to access (for instance, memory above 2 GB will not be accessible).<br />

• The instruction pointer (eip) is set to the address of the start function. The value is retrieved by the binary loader.<br />

• The FPU is initialised by setting known good values into the FPU registers.<br />

• The esp0 entry is set to point to a stack in the kernel area. This stack will be used by the CPU during interrupt handling and syscall handling.<br />

5.4.3 Switching to Userspace<br />

At the end of some interrupt handlers the scheduler will switch to a different thread. The selected thread object is then checked to determine whether its userspace or its kernelspace part is to be executed next. Figure 5.2 shows how a task switch to a kernel thread happens.<br />

Main differences between switching to userspace and switching to a normal kernel thread:<br />

• esp0 has to be set in the tss. This stack will be used if any exception or interrupt happens in user mode, because the user mode stack cannot be trusted to have a good value (imagine a page fault because of a stack overflow in userspace).<br />

• iret also pops the esp register and the ss segment register, since the userspace registers can’t be used for the last steps before the iret in kernel mode.<br />
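The difference in what is pushed before the iret can be modelled with a short sketch. It follows the push order of the assembly listings in this chapter (ss and esp first for userspace, then eflags, cs and eip); DemoContext and buildIretFrame are invented names, and the selector values used when calling it are arbitrary examples.

```cpp
#include <cstdint>
#include <vector>

struct DemoContext {
    std::uint32_t eip, cs, eflags, esp, ss;
};

// Returns the dwords in the order they are pushed before iret.
// The last element is therefore the one iret pops first (eip).
std::vector<std::uint32_t> buildIretFrame(const DemoContext& ctx,
                                          bool to_userspace) {
    std::vector<std::uint32_t> pushed;
    if (to_userspace) {
        // Privilege level changes: iret additionally pops esp and ss.
        pushed.push_back(ctx.ss);
        pushed.push_back(ctx.esp);
    }
    pushed.push_back(ctx.eflags);
    pushed.push_back(ctx.cs);
    pushed.push_back(ctx.eip);
    return pushed;
}
```

A switch to a kernel thread thus needs three dwords on the stack, a switch to userspace five.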

5.5 Multiprocessor Machines<br />

5.5.1 <strong>The</strong> FPU Trick<br />

The FPU trick described previously no longer works on multiprocessor machines, since a thread could be rescheduled to run on a different CPU the next time it runs. So, unless one can guarantee CPU affinity for threads, the FPU state has to be saved and restored on every thread switch.<br />

This little trick also led to a famous bug in the Linux kernel: http://lwn.net/Articles/89586/


void ArchThreads::createThreadInfosUserspaceThread(ArchThreadInfo *&info,<br />
    pointer start_function, pointer user_stack, pointer kernel_stack)<br />
{<br />
  info = (ArchThreadInfo *)new uint8[sizeof(ArchThreadInfo)];<br />
  ArchCommon::bzero((pointer)info, sizeof(ArchThreadInfo));<br />
  pointer pageDirectory = VIRTUAL_TO_PHYSICAL_BOOT(<br />
      ((pointer)&kernel_page_directory_start));<br />
<br />
  info->cs = USER_CS;<br />
  info->ds = USER_DS;<br />
  info->es = USER_DS;<br />
  info->ss = USER_SS;<br />
  info->ss0 = KERNEL_SS;<br />
  info->eflags = 0x200;<br />
  info->eax = 0;<br />
  info->ecx = 0;<br />
  info->edx = 0;<br />
  info->ebx = 0;<br />
  info->esi = 0;<br />
  info->edi = 0;<br />
  info->dpl = DPL_USER;<br />
  info->esp = user_stack;<br />
  info->ebp = user_stack;<br />
  info->esp0 = kernel_stack;<br />
  info->eip = start_function;<br />
  info->cr3 = pageDirectory;<br />
<br />
  /* fpu (=fninit) */<br />
  info->fpu[0] = 0xFFFF037F;<br />
  info->fpu[1] = 0xFFFF0000;<br />
  info->fpu[2] = 0xFFFFFFFF;<br />
  info->fpu[3] = 0x00000000;<br />
  info->fpu[4] = 0x00000000;<br />
  info->fpu[5] = 0x00000000;<br />
  info->fpu[6] = 0xFFFF0000;<br />
  kprintfd_nosleep("ArchThreads::create: values done\n");<br />
}<br />

Figure 5.4: Initialisation of the Registers for a Userspace Thread<br />


arch_switchThreadToUserPageDirChange:<br />
    mov ebx, dword [currentThreadInfo]<br />
    mov ecx, dword [g_tss]      ; tss<br />
    mov eax, dword [ebx + 68]   ; get esp0<br />
    mov dword [ecx + 4], eax    ; restore esp0<br />
    mov eax, dword [ebx + 76]   ; page directory<br />
    mov cr3, eax                ; change page directory<br />
    mov eax, dword [ebx + 12]   ; restore eax<br />
    mov ecx, dword [ebx + 16]   ; restore ecx<br />
    mov edx, dword [ebx + 20]   ; restore edx<br />
    mov ebp, dword [ebx + 32]   ; restore ebp<br />
    mov esi, dword [ebx + 36]   ; restore esi<br />
    mov edi, dword [ebx + 40]   ; restore edi<br />
    mov es, word [ebx + 48]     ; restore es<br />
    mov ds, word [ebx + 44]     ; restore ds<br />
    push dword [ebx + 60]       ; push ss (needed here because the dpl is lower)<br />
    push dword [ebx + 28]       ; push esp (needed here because the dpl is lower)<br />
    push dword [ebx + 8]        ; push eflags<br />
    push dword [ebx + 4]        ; push cs<br />
    push dword [ebx + 0]        ; push eip<br />
    push dword [ebx + 24]<br />
    pop ebx                     ; restore ebx<br />
    iretd                       ; switch to next<br />

Figure 5.5: Reading back the CPU registers of a Thread and switching to Userspace<br />

5.5.2 CPU Affinity<br />

Another interesting problem that only arises on multiprocessor machines is that moving a thread from one CPU to another can hurt performance. All the data used by the thread that was already cached in the CPU it was moved from is not available in the cache of the CPU it moves to. This can decrease performance a lot.<br />

5.5.3 NUMA<br />

NUMA, or Non-Uniform Memory Access, simply means that certain areas of memory are faster to access for a given CPU than others. Ideally, therefore, a thread should run on the CPU which has the best access to the memory it is using.<br />

5.6 Scheduling<br />

5.6.1 Threads<br />

The base unit used for scheduling in SWEB is the class Thread. This class contains<br />
all the information necessary for task switching, such as the state of a thread and<br />
a structure containing its register values.<br />

Every new structure representing a task in the multiprogramming sense, such as a<br />
userspace thread, has to inherit from this base class.<br />


Thread states<br />

In SWEB a Thread object can be in one of three states, which determine how it is<br />
scheduled and managed.<br />

The default state is called 'Running' and marks the thread as runnable. Only<br />
runnable threads are considered during scheduling. 'Sleeping' threads are simply<br />
ignored by the scheduling mechanism and are not executed. Threads in the<br />
state 'ToBeDestroyed' have already completed their execution or have been forcibly<br />
killed. Thread objects in this state are destroyed by a cleanup routine called in the<br />
IdleThread, which is always on the scheduler list.<br />

5.6.2 The Scheduler<br />

The central element of scheduling in SWEB is the class Scheduler, implemented<br />
using a slightly modified Singleton pattern. This class manages a list containing all<br />
threads in the system and provides several methods for controlling the scheduling at<br />
run-time.<br />

Important Control Methods<br />

• void Scheduler::addNewThread(Thread *thread) adds a new Thread to the scheduling<br />
list.<br />

• void Scheduler::yield() manually triggers a special interrupt which performs a reschedule.<br />

• void Scheduler::sleep() sets the state of the currently executing thread to 'Sleeping'<br />
and calls yield(), thereby letting it sleep.<br />

• void Scheduler::sleepAndRelease(Mutex &lock) sets the state of the current thread to<br />
'Sleeping' and only then releases the given Mutex. This avoids a race condition between<br />
setting the state and the call to yield(). The method is also overloaded to take a SpinLock<br />
as argument.<br />

• void Scheduler::wake(Thread* thread_to_wake) wakes up the given thread by<br />
setting its state to 'Running'.<br />

The Scheduling Algorithm<br />

The registered threads are scheduled in round-robin fashion by the method<br />
uint32 Scheduler::schedule(). It takes a thread from the front of the thread list and rotates<br />
the list, so that the thread at the front is moved to the back. This is done by<br />
a special method List::rotateBack(), which avoids memory allocation and deletion, because<br />
neither should be used in this method. The reason is that<br />
a thread or process could be holding the kernel memory manager lock, which would<br />
have to be released before dynamic memory management can be used. As long as the scheduler<br />
itself is waiting for the lock, the owning thread cannot be scheduled for execution and<br />
therefore cannot release it, so the system would end up deadlocked.<br />

The scheduling method is called from the timer and yield interrupt handlers. No<br />
accounting is done regarding the execution time of a thread on the CPU.<br />
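
The rotation described above can be sketched in a few lines of C++. This is only a minimal<br />
illustration under stated assumptions, not the SWEB implementation: the names ThreadSketch and<br />
RoundRobinSketch and the fixed-capacity array are hypothetical and chosen for this example. Only the<br />
three thread states and the idea of rotating the list without allocating memory come from the text.<br />

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration; not taken from the SWEB sources.
enum class ThreadState { Running, Sleeping, ToBeDestroyed };

struct ThreadSketch {
  int id;
  ThreadState state;
};

// Fixed-capacity ring of thread pointers. Rotating it never allocates or
// frees memory, which is the property the text requires of schedule().
class RoundRobinSketch {
 public:
  static constexpr std::size_t kCapacity = 16;

  // Appends a thread at the back of the list; fails if the ring is full.
  bool addNewThread(ThreadSketch* thread) {
    if (count_ == kCapacity) return false;
    threads_[(front_ + count_) % kCapacity] = thread;
    ++count_;
    return true;
  }

  // Takes the thread at the front, rotates it to the back and returns it
  // if it is runnable. Threads in any other state are skipped.
  // Returns nullptr if no runnable thread exists.
  ThreadSketch* schedule() {
    for (std::size_t tried = 0; tried < count_; ++tried) {
      ThreadSketch* candidate = threads_[front_];
      rotateBack();
      if (candidate->state == ThreadState::Running) return candidate;
    }
    return nullptr;
  }

 private:
  // Moves the front element to the back by shifting the window of the ring.
  void rotateBack() {
    threads_[(front_ + count_) % kCapacity] = threads_[front_];
    front_ = (front_ + 1) % kCapacity;
  }

  ThreadSketch* threads_[kCapacity] = {};
  std::size_t front_ = 0;
  std::size_t count_ = 0;
};
```

Because the ring has a fixed size, the sketch trades flexibility for the guarantee that schedule()<br />
never touches the memory manager. A real kernel list would need a different way to achieve the same guarantee.<br />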


Cleaning up - The IdleThread<br />

A special thread called IdleThread is put on the scheduling list when the list is created. This<br />
thread executes a cleanup routine which removes dead thread objects, recognizable by<br />
the state 'ToBeDestroyed'. The IdleThread itself is never removed from the scheduling list.<br />
It therefore gets a regular chance to execute, and it ensures that there is always at least one<br />
thread ready to run even when the system is idle (hence the name of the thread).<br />
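
The cleanup routine can be pictured as a simple sweep over the thread list. The following is a<br />
minimal sketch under stated assumptions: DemoThread, DemoState and cleanupDeadThreads are hypothetical<br />
names, and std::vector stands in for the kernel's own list class. Unlike schedule(), this routine runs<br />
in an ordinary thread, so freeing memory is acceptable here.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; not taken from the SWEB sources.
enum class DemoState { Running, Sleeping, ToBeDestroyed };

struct DemoThread {
  int id;
  DemoState state;
};

// Removes every thread marked ToBeDestroyed from the list and frees it.
// Returns the number of threads that were destroyed.
std::size_t cleanupDeadThreads(std::vector<DemoThread*>& threads) {
  std::size_t destroyed = 0;
  for (std::size_t i = 0; i < threads.size();) {
    if (threads[i]->state == DemoState::ToBeDestroyed) {
      delete threads[i];
      threads.erase(threads.begin() + static_cast<std::ptrdiff_t>(i));
      ++destroyed;
    } else {
      ++i;  // only advance when nothing was erased at this position
    }
  }
  return destroyed;
}
```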



Chapter 6<br />

System Calls<br />

6.1 Base Information about System Calls<br />

System calls are implemented in every operating system. They are normally used for<br />
communication between the kernel and user space applications.<br />

Every standard single-CPU computer can execute only one instruction at a time.<br />
Therefore, if a process running in user mode needs some information<br />
about the system or needs a system service, such as reading data from or writing data to a file, it has<br />
to execute a trap or system call instruction to transfer control to the operating system.<br />
The operating system reacts to this call by looking at the transferred parameters. Then<br />
it carries out the system call and returns control to the instruction<br />
following the system call. In a way, a system call is like a procedure call, but the<br />
procedure itself is carried out by the kernel in kernel space and not by the application<br />
in user space.<br />


6.1.1 Concept of System Calls<br />

Learning by example is the simplest way to understand how system calls work.<br />
This section therefore shows the concept of system calls using the UNIX operating<br />
system as an example. The figure shows the steps of the system call read in a<br />
UNIX system.<br />

The system call read consists of a series of steps. First, before the library<br />
procedure read is called, the calling program pushes the parameters onto the stack.<br />
This corresponds to steps 1-3 in the diagram. The parameters are<br />
pushed in reverse order for historical reasons related to<br />
old C and C++ compilers. The first and the third parameter are passed by value and<br />
the second parameter is passed by reference, meaning that the address of the buffer is<br />
passed and not the contents of the buffer.<br />

The fourth step in the diagram is the actual call to the library procedure. Such<br />
libraries are usually written in assembly language and optimised for speed. In the next<br />
step the system call number is put into a specific register.<br />
The operating system uses this number to call the appropriate handler.<br />
In the sixth step a TRAP instruction switches the CPU from user<br />
mode into kernel mode and starts execution at a fixed address within the kernel.<br />
The kernel code then examines the number in the aforementioned register and<br />
transfers control to the correct system call handler. The mapping from the system<br />
call number to the handler is done via a table of pointers to the system call handlers,<br />
indexed by system call number. In step 8 the system call handler performs its<br />
task, depending on the parameters passed by the system call. In step 9<br />
the system call handler has completed its work and control may be returned to the user<br />
space library procedure at the instruction following the TRAP instruction. In step<br />
10 the procedure returns to the user program in the same way ordinary procedure calls<br />
return.<br />

Finally, in step 11 the job is finished and the user program has to clean up the<br />
stack, as it does after any procedure call.<br />
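
The table lookup mentioned above can be sketched as follows. This is a minimal illustration under<br />
stated assumptions, not code from UNIX or SWEB: the handler names, the three-argument signature and<br />
the system call numbers 0 to 2 are hypothetical and chosen only for this example.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// A handler receives up to three arguments and returns a result or -1.
typedef std::int32_t (*SyscallHandler)(std::uint32_t, std::uint32_t, std::uint32_t);

// Hypothetical handlers that only pretend to do the work.
static std::int32_t sketchRead(std::uint32_t fd, std::uint32_t buffer, std::uint32_t count) {
  (void)fd;
  (void)buffer;
  return static_cast<std::int32_t>(count);  // pretend all bytes were read
}

static std::int32_t sketchGetpid(std::uint32_t, std::uint32_t, std::uint32_t) {
  return 42;  // arbitrary process id for the example
}

// Table of pointers to handlers, indexed by (hypothetical) system call number.
// Number 1 is deliberately left unimplemented.
static const SyscallHandler kHandlerTable[] = {sketchRead, nullptr, sketchGetpid};

// What the kernel does after the trap: validate the number, then jump
// through the table to the matching handler.
std::int32_t dispatchSyscall(std::uint32_t number, std::uint32_t a1,
                             std::uint32_t a2, std::uint32_t a3) {
  const std::size_t table_size = sizeof(kHandlerTable) / sizeof(kHandlerTable[0]);
  if (number >= table_size || kHandlerTable[number] == nullptr) return -1;
  return kHandlerTable[number](a1, a2, a3);
}
```

The bounds check matters because the number comes from user space and must never be trusted as an index.<br />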

This was a simple example of how system calls work in UNIX. UNIX offers a large number of system<br />
calls, which can be divided into groups: process management, file management,<br />
directory management and miscellaneous system calls. These groups are not the focus<br />
of this document and will not be examined closely. If you want more information<br />
about them, refer to the book Modern Operating Systems by Tanenbaum. The<br />
system calls which are currently implemented in SWEB, however, are thoroughly described<br />
in the following sections.<br />

6.2 System Calls in the different Operating Systems<br />

System calls are structured differently in different operating systems. UNIX,<br />
MINIX and Linux, as well as the SWEB operating system, share the same basic concept<br />
for system calls; only the details of the implementation differ. This kind of compatibility<br />
is defined by the POSIX system call standard. The family of<br />
Windows operating systems, on the other hand, uses system calls in a different way.<br />

The previous section explained how system calls work in the UNIX operating system.<br />
This section shows how the Windows operating system<br />
handles system calls. The details differ between the individual Windows versions.<br />

A Windows program is normally event driven, which means that the main program<br />
waits for some event to happen and then calls a procedure<br />
to handle it. An event in Windows could be a mouse movement or a mouse<br />
button click. When the event occurs, the appropriate handler is called to process the<br />
event and to update the screen as well as the internal program state. This means that<br />
Windows implements a message passing programming paradigm, as opposed to the UNIX<br />
system call model.<br />

Windows has a system library as well, but it is unlike that of UNIX. UNIX has a library<br />
procedure for each system call. Microsoft, on the other hand, has defined a set of<br />
procedures called the Win32 API (Application Program Interface), which programmers<br />
use to get services from the operating system. This interface is supposed to be supported on<br />
all versions of Windows since Windows 95.<br />

The Win32 API has a large number of calls, more than a thousand. In contrast to UNIX, it is<br />
impossible to see which of them really is a system call and which is a simple user space library<br />
call. Another big difference between UNIX and Windows is the Graphical User Interface<br />
(GUI). In UNIX the GUI runs in user space, meaning that only the system call<br />
write for writing to the screen and a few other minor ones run in the kernel. This provides<br />
better kernel stability and less userspace/kernel context switching. Additionally,<br />
it means that different graphical subsystems can be implemented without major<br />
changes to the kernel.<br />

Windows, on the other hand, implements within the Win32 API a huge number of calls<br />
for managing windows, text, fonts, menus and other features of the GUI. The graphics<br />
subsystem runs completely in the kernel, but not in all versions of Windows. On<br />
some Windows versions these Win32 API calls are mapped directly to system calls (KERNEL32),<br />
and on other versions they are just library calls (GDI32). In a nutshell, the Windows<br />
engineers implemented another layer between the user applications and the<br />
kernel. This layer gives more flexibility and provides the programmer with a stable<br />
interface that is independent of the Windows version. On the other hand, the GUI<br />
management is fixed and inflexible: it is hard to extend the GUI without extending<br />
the kernel.<br />

To give a better overview of the differences between UNIX and Windows, here is a list of the most<br />
important system calls.<br />

UNIX | Win32 | Description<br />
fork | CreateProcess | Create a new process<br />
waitpid | WaitForSingleObject | Can wait for a process to exit<br />
execve | (none) | CreateProcess = fork + execve<br />
exit | ExitProcess | Terminate execution<br />
open | CreateFile | Create a file or open an existing file<br />
close | CloseHandle | Close a file<br />
read | ReadFile | Read data from a file<br />
write | WriteFile | Write data to a file<br />
lseek | SetFilePointer | Move the file pointer<br />
stat | GetFileAttributesEx | Get various file attributes<br />
mkdir | CreateDirectory | Create a new directory<br />
rmdir | RemoveDirectory | Remove an empty directory<br />
link | (none) | Win32 does not support links<br />
unlink | DeleteFile | Destroy an existing file<br />
mount | (none) | Win32 does not support mount<br />
umount | (none) | Win32 does not support mount<br />
chdir | SetCurrentDirectory | Change the current working directory<br />
chmod | (none) | Win32 does not support security (although NT does)<br />
kill | (none) | Win32 does not support signals<br />
time | GetLocalTime | Get the current time<br />

Table 1: The Win32 API calls which are comparable to the UNIX system calls.<br />
This table is taken from the book 'Modern Operating Systems' by Tanenbaum.<br />
For more information please refer to this book.<br />

6.3 System calls in SWEB<br />

SWEB is a simple operating system that tries to implement the POSIX system call<br />
standard. The system calls are implemented in a similar way to those of the Linux operating<br />
system.<br />

6.3.1 Concept of System Calls in SWEB<br />

SWEB uses interrupt 0x80 to call the system call handler in the kernel. On the x86<br />
architecture the type of system call is specified by a predefined number which is stored in the<br />
EAX register. The system call arguments are passed in the remaining<br />
general purpose registers. When the kernel has finished executing the call,<br />
it stores the return value in the EAX register. Figure 6.1 shows how this is<br />
implemented in assembler.<br />

BITS 32<br />

section .text<br />

global __syscall<br />
__syscall:<br />

    ; first we have to do some housekeeping, as every function should<br />
    push ebp              ; the base pointer goes onto the stack so we can restore it later<br />
    mov ebp, esp<br />

    ; save the registers we are now going to overwrite for our syscall<br />
    ; and which the caller expects to be preserved<br />
    push ebx<br />
    push esi<br />
    push edi<br />

    mov eax, [ebp + 8]    ; system call number<br />
    mov ebx, [ebp + 12]   ; remaining registers carry the arguments<br />
    mov ecx, [ebp + 16]<br />
    mov edx, [ebp + 20]<br />
    mov esi, [ebp + 24]<br />
    mov edi, [ebp + 28]<br />

    int 0x80              ; enter the kernel; the return value comes back in eax<br />

    pop edi<br />
    pop esi<br />
    pop ebx<br />
    mov esp, ebp<br />
    pop ebp<br />
    ret<br />

Figure 6.1: The implementation of the system calls<br />

In detail, a part of each system call is implemented in the kernel, so the kernel has to know<br />
which system calls exist. The definitions of the system calls are therefore located in the kernel sources.<br />

Figure 6.2 lists the system calls which are defined in the operating system<br />
SWEB. The list is located in ~/sweb/common/include/kernel. These are the most important system<br />
calls of an operating system. Not all of them are needed yet and not all of them are<br />
used. The definitions are shared by user space and kernel space.<br />

#define fd_stdout 0<br />
#define fd_stdin 1<br />
#define fd_stderr 2<br />

#define sc_restart 0<br />
#define sc_exit 1<br />
#define sc_fork 2<br />
#define sc_read 3<br />
#define sc_write 4<br />
#define sc_open 5<br />
#define sc_close 6<br />
#define sc_waitpid 7<br />
#define sc_creat 8<br />
#define sc_link 9<br />
#define sc_unlink 10<br />
#define sc_execve 11<br />

#define sc_lseek 19<br />
#define sc_getpid 20<br />
#define sc_mount 21<br />
#define sc_umount 22<br />

#define sc_nice 34<br />

#define sc_kill 37<br />

#define sc_rename 38<br />
#define sc_mkdir 39<br />
#define sc_rmdir 40<br />
#define sc_dup 41<br />
#define sc_pipe 42<br />

#define sc_brk 45<br />

#define sc_signal 48<br />

#define sc_dup2 63<br />

#define sc_reboot 88<br />

#define sc_ipc 117<br />

#define sc_clone 120<br />

#define sc_flock 143<br />
#define sc_msync 144<br />

#define sc_vfork 190<br />

Figure 6.2: The definition list of the system calls<br />

The class Syscall is only a container for the system call handler and its methods.<br />
It is also declared in the kernel, in ~/sweb/common/include/kernel (see Figure 6.3). This<br />
class is not meant to be instantiated; syscallException is<br />
static so that it can be called directly. syscallException has to stay this way because the<br />
InterruptUtils call the system calls in this manner.<br />

class Syscall<br />
{<br />
public:<br />

  static uint32 syscallException(uint32 syscall_number, uint32 arg1, uint32 arg2, uint32 arg3, uint32 arg4, ui<br />

  static void exit(uint32 exit_code);<br />

  static uint32 write(uint32 fd, pointer buffer, uint32 size);<br />

  static uint32 read(uint32 fd, pointer buffer, uint32 count);<br />

private:<br />
};<br />

#endif<br />

Figure 6.3: The implementation of the system calls<br />

syscallException can take at most six arguments from user space. It<br />
handles the system call and returns a single value to the architecture-specific<br />
interrupt handler. This return value is intended for user space; on the x86 it is<br />
returned in the register EAX. The parameter syscall_number is the first<br />
argument from userspace and has to be one of the system calls defined in the<br />
syscall-definition.h list.<br />

The method exit is an example of handling a system call. The parameter exit_code is<br />
the exit code transmitted from userspace.<br />

The methods write and read are further examples of how system calls are handled in<br />
SWEB. Only the method write is described here, because<br />
read is handled in a similar way. The parameter fd is a file descriptor as defined<br />
in the syscall-definitions: fd_stdin, fd_stdout or fd_stderr.<br />

The file Syscall.cpp, which can be found in ~/sweb/common/source/kernel, contains this part of the<br />
SWEB code and shows how the system calls are implemented.<br />

6.3.2 How to implement System Calls in SWEB<br />

Implementing system calls in SWEB is straightforward. The most important system calls<br />
are already defined in the file syscall-definitions, so the programmer does not have to add<br />
them to this list and can simply use the defined numbers.<br />

Figure 6.5 shows how the system call write is implemented. Other<br />
system calls should be implemented in the same way.<br />

uint32 Syscall::syscallException(uint32 syscall_number, uint32 arg1, uint32 arg2, uint32 arg3, uint32 a<br />
{<br />
  uint32 return_value = 0;<br />

  kprintfd("Syscall %d called with arguments %d(=%x) %d(=%x) %d(=%x) %d(=%x) %d(=%x)\n", syscall_number<br />

  switch (syscall_number)<br />
  {<br />
    case sc_exit:<br />
      exit(arg1);<br />
      break;<br />
    case sc_write:<br />
      return_value = write(arg1, arg2, arg3);<br />
      break;<br />
    case sc_read:<br />
      return_value = read(arg1, arg2, arg3);<br />
      break;<br />
    default:<br />
      kprintf("Syscall::syscall_exception: Unimplemented Syscall Number %d\n", syscall_number);<br />
  }<br />
  return return_value;<br />
}<br />

void Syscall::exit(uint32 exit_code)<br />
{<br />
  kprintfd("Syscall::EXIT: called, exit_code: %d\n", exit_code);<br />
  currentThread->kill();<br />
}<br />

uint32 Syscall::write(uint32 fd, pointer buffer, uint32 size)<br />
{<br />
  // WARNING: this might fail if Kernel PageFaults are not handled<br />
  assert(buffer < 2U*1024U*1024U*1024U);<br />
  if (fd == fd_stdout) // stdout<br />
  {<br />
    kprintfd("Syscall::write: %B\n", (char*) buffer, size);<br />
    kprint_buffer((char*) buffer, size);<br />
  }<br />
  return size;<br />
}<br />

uint32 Syscall::read(uint32 fd, pointer buffer, uint32 count)<br />
{<br />
  assert(buffer < 2U*1024U*1024U*1024U);<br />
  uint32 num_read = 0;<br />
  if (fd == fd_stdin)<br />
  {<br />
    // this doesn't! terminate a string with \0, gotta do that yourself<br />
    num_read = currentThread->getTerminal()->readLine((char*) buffer, count);<br />
    kprintfd("Syscall::read: %B\n", (char*) buffer, num_read);<br />
    for (uint32 c=0; c<br />

uint32 Syscall::write(uint32 fd, pointer buffer, uint32 size)<br />
{<br />
  // WARNING: this might fail if Kernel PageFaults are not handled<br />
  assert(buffer < 2U*1024U*1024U*1024U);<br />
  if (fd == fd_stdout) // stdout<br />
  {<br />
    kprintfd("Syscall::write: %B\n", (char*) buffer, size);<br />
    kprint_buffer((char*) buffer, size);<br />
  }<br />
  return size;<br />
}<br />

Figure 6.5: The implementation of the system call write<br />

6.3.3 A convenient interface<br />

Calling system functions manually does not exactly match the definition of a convenient<br />
interface, and most often some extra work has to be done on the side of the<br />
user program. System calls are therefore executed by ordinary library functions available to<br />
userspace applications.<br />

In SWEB a slim C library, which is automatically linked to executables by the make<br />
system, serves this purpose. Applications written in the C programming language<br />
can use the provided functions and therefore do not have to worry about the basics of<br />
system calls. This library can be found in the directory userspace/libc of the SWEB<br />
source tree.<br />

To benefit from this gain in usability, an appropriate calling function<br />
should be provided in this library for every added system call.<br />
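
Such a calling function can be very small. The following is a minimal sketch under stated<br />
assumptions, not the code from userspace/libc: writeSketch and syscallStub are hypothetical names, and<br />
syscallStub is a host-side stand-in for the assembly routine __syscall so that the example can be<br />
compiled and run outside of SWEB. The arguments are widened to std::uintptr_t so that pointers also fit<br />
on a 64-bit host, whereas SWEB itself uses uint32. The values of sc_write and fd_stdout are those<br />
listed in Figure 6.2.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

typedef std::uintptr_t Word;  // assumption: wide enough to carry a pointer

const Word sc_write = 4;   // system call number as listed in Figure 6.2
const Word fd_stdout = 0;  // value as listed in Figure 6.2

// Collects everything "written" so the example has an observable effect.
std::string g_captured_output;

// Hypothetical stand-in for __syscall: the number and up to five arguments
// go in, a single return value comes back.
Word syscallStub(Word number, Word a1, Word a2, Word a3, Word a4, Word a5) {
  (void)a4;
  (void)a5;
  if (number == sc_write && a1 == fd_stdout) {
    g_captured_output.append(reinterpret_cast<const char*>(a2), a3);
    return a3;  // report the number of bytes written
  }
  return 0;
}

// Library-style wrapper: hides the system call number and the argument
// packing from the application programmer.
std::size_t writeSketch(int fd, const void* buffer, std::size_t size) {
  return syscallStub(sc_write, static_cast<Word>(fd),
                     reinterpret_cast<Word>(buffer), size, 0, 0);
}
```

An application would then simply call writeSketch(0, "hello", 5) and never deal with registers or<br />
interrupt numbers itself.<br />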



Chapter 7<br />

Character Devices<br />

7.1 Introduction<br />

Generally, the I/O devices connected to a computer can be divided into two major<br />
groups: character devices, through which data is transmitted one character at a time,<br />
and block devices, through which data is transmitted in the form of blocks. Block<br />
devices often implement buffered reading. The main difference between the two<br />
is that a character device only allows reading<br />
the next available character, whereas a block device can deliver any block<br />
at any given time, meaning that seek operations are supported as well. It<br />
should be noted that this kind of division is found in the UNIX operating system<br />
and its derivatives. This approach is well suited for treating devices as files and provides<br />
the abstraction needed to hide the hardware from the programmer or user.<br />
The drawback is that sometimes there are too many abstraction layers that do nothing<br />
but pass messages around, which slows the system down. The Windows operating system,<br />
on the other hand, implements devices through API classes, meaning<br />
that for each device type there is a defined API which must be implemented by a<br />
particular driver. This approach gives more flexibility and speed, but its disadvantage<br />
is that the operating system must be extended each time a new device<br />
class appears. The aim of this chapter is to give a short introduction to<br />
character devices and to explain their implementation in the SWEB operating system.<br />

7.2 Classes<br />

Some of the most common character devices that can be connected to a computer<br />
are:<br />

• Keyboards<br />

• Monitors<br />

• Serial ports<br />

• Mice, tablets, light pens and other pointing devices<br />

• USB devices<br />

• Printers<br />

• Linux even implements SCSI disks as character devices: a certain structure<br />
is filled with parameters and written to the device, and the result is then read<br />
in raw mode, character by character, from the device driver.<br />

The SWEB operating system has device drivers for the keyboard, the monitor and the serial port.<br />

7.3 Character Device Base<br />

By definition, every character device can be viewed as a simple sequence of characters.<br />
For this reason, all classes that represent character devices have one buffer for storing<br />
the data coming from the character device and another buffer for the output data.<br />
Those buffers, as well as the methods that handle read and write requests, have been<br />
put into a base class called CharacterDevice. The definition and implementation of this<br />
class can be found in common/include/drivers/chardev.h.<br />

Function | Description<br />
CharacterDevice( char* name, Superblock* super_block = 0, uint32 inode_type = I_CHARDEVICE ); | Constructor of the CharacterDevice. The parameters super_block and inode_type are optional; they are just passed to the Inode constructor and are not used by the CharacterDevice class. The parameter name is the device name under which the character device will be added to the Device File System. The constructor copies the name into an internal member variable and allocates the input and output buffers. It also adds this instance to the DeviceFSSuperblock.<br />
~CharacterDevice(); | Destructor. Deletes the allocated buffers and deregisters the device from the Device File System.<br />

Table 7.1: Character Device base class methods<br />

The data structure used for the buffers is a circular FIFO buffer, in which old characters<br />
get overwritten if nobody has read them and the new data is too large to fit into the rest of<br />
the buffer. The FIFO buffer also implements the synchronization mechanisms between<br />
reads and writes as well as between multiple reads.<br />
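
The overwrite behaviour can be sketched as follows. This is a minimal illustration under stated<br />
assumptions, not the SWEB FiFo class: the name OverwritingRingSketch and its interface are<br />
hypothetical, and the synchronization between readers and writers that the real buffer provides is<br />
deliberately left out.<br />

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical circular FIFO: when it is full, the oldest unread element
// is overwritten by the newest one. No locking is done in this sketch.
template <typename T, std::size_t Capacity>
class OverwritingRingSketch {
 public:
  void put(const T& value) {
    // When the buffer is full this index equals head_, i.e. the oldest slot.
    data_[(head_ + count_) % Capacity] = value;
    if (count_ == Capacity) {
      head_ = (head_ + 1) % Capacity;  // the oldest element was overwritten
    } else {
      ++count_;
    }
  }

  // Copies the oldest element into out; returns false if the buffer is empty.
  bool get(T& out) {
    if (count_ == 0) return false;
    out = data_[head_];
    head_ = (head_ + 1) % Capacity;
    --count_;
    return true;
  }

  std::size_t size() const { return count_; }

 private:
  T data_[Capacity] = {};
  std::size_t head_ = 0;   // index of the oldest element
  std::size_t count_ = 0;  // number of stored elements
};
```

Putting the values 1, 2, 3 and 4 into a buffer of capacity 3 drops the 1, so a reader then sees 2, 3 and 4<br />
in that order.<br />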

Member | Description<br />
FiFo< uint32 > *_in_buffer; | Circular buffer for storing input data after it has been read from the device.<br />
FiFo< uint32 > *_out_buffer; | Circular buffer for storing the output data before it is written to the device.<br />
RingBuffer< uint32 > *_irq_buffer; | Circular buffer for storing input data from an IRQ handler. The data cannot be stored directly within the handler because of the synchronisation mechanisms implemented in the FiFo buffers.<br />
static const uint32 CD_BUFFER_SIZE; | The size of the buffers, set to 1k by default.<br />

Table 7.2: Character Device input and output buffers<br />

The class CharacterDevice also implements the Inode interface (see the VFS chapter), which<br />
allows it to be added to the Device file system. Any class that inherits from the CharacterDevice<br />
class is automatically added to the Device file system under<br />
/dev/name_of_the_device. This means that any character device can be opened as a file<br />
through the Virtual File System (although no seek operation is supported).<br />

Function: virtual int32 readData(int32 offset, int32 size, char *buffer);<br />
Description: Reads size bytes from the character device input buffer. The parameter buffer is a pointer to a preallocated buffer which must have enough space to store the data. The offset parameter exists only for compatibility with the Inode interface; if offset is not zero, an error is returned. This is a blocking function: if the input buffer is empty, the calling thread is put to sleep until a character is available. If the function is successful, the number of bytes actually read is returned. Otherwise a negative integer value is returned.<br />

Function: virtual int32 writeData(int32 offset, int32 size, const char *buffer);<br />
Description: Writes size bytes to the character device output buffer. The offset parameter exists only for compatibility with the Inode interface; if offset is not zero, an error is returned. If the output buffer is full, the calling thread is blocked until there is enough space in the output buffer to store the data. If the function is successful, the number of bytes actually written is returned. Otherwise a negative integer value is returned.<br />

Function: int32 mknod(Dentry *dentry); int32 mkfile(Dentry *dentry); int32 create(Dentry *dentry);<br />
Description: These three functions basically do the same thing. They link the current instance of the character device with the dentry directory entry.<br />

Function: File* link(uint32 flag);<br />
Description: Creates a new file and links the file with the current instance of the character device.<br />

Function: int32 unlink(File* file);<br />
Description: As the name suggests, unlinks the file from the current instance of the character device and deletes the file pointer.<br />

Table 7.3: The Inode functions implemented by CharacterDevice
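The readData() and writeData() contract described in Table 7.3 can be illustrated with a small user-space mock. This is only a sketch: MockCharDevice, feedInput(), the std::deque buffers and the error value -1 are assumptions made for illustration, and where the real driver would block the calling thread, the mock simply returns the number of bytes it could read.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical stand-in for a character device; not the SWEB class.
class MockCharDevice {
public:
    // Returns the number of bytes read, or a negative value on error.
    int32_t readData(int32_t offset, int32_t size, char *buffer) {
        if (offset != 0 || size < 0 || buffer == nullptr)
            return -1;                  // seeking is not supported
        int32_t count = 0;
        while (count < size && !in_buffer_.empty()) {
            buffer[count++] = in_buffer_.front();
            in_buffer_.pop_front();
        }
        return count;                   // a real driver would sleep instead of returning 0
    }

    // Returns the number of bytes written, or a negative value on error.
    int32_t writeData(int32_t offset, int32_t size, const char *buffer) {
        if (offset != 0 || size < 0 || buffer == nullptr)
            return -1;
        for (int32_t i = 0; i < size; ++i)
            out_buffer_.push_back(buffer[i]);
        return size;
    }

    // Test helper: simulates a character delivered by an IRQ handler.
    void feedInput(char c) { in_buffer_.push_back(c); }

private:
    std::deque<char> in_buffer_;
    std::deque<char> out_buffer_;
};
```

A caller therefore always passes 0 as the offset and checks the return value for a negative result before using the data.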

When writing a device driver for a character device, the simplest approach is to<br />
inherit from the CharacterDevice class and override its methods processInBuffer() and processOutBuffer().<br />
The processOutBuffer() method should write the characters<br />
in the output buffer to the character device. The processInBuffer() method needs to<br />
be overridden only if the input buffer requires monitoring and/or filtering.<br />
Next, an IRQ handler must be written that receives the characters from<br />
the device and adds them to the input buffer. The CharacterDevice class implements the<br />
Thread interface, allowing it to run as a separate thread in kernel space. During<br />
kernel initialization this thread should be added to the Scheduler. See the Serial<br />
Port section of this chapter for an example of how to write a CharacterDevice driver.<br />

7.4 Keyboard<br />

<strong>The</strong> keyboard can be connected through:<br />

• legacy AT keyboard connector,<br />

• PS/2 keyboard connector,<br />

• USB port<br />

From the driver programmer's point of view, there is no difference between the first two<br />
connectors. Keyboards connected this way are supported by the SWEB<br />
keyboard driver. USB keyboards running in legacy mode are supported as well,<br />
but there are some known problems, such as controlling the keyboard LEDs. Each keyboard<br />
contains a processor which monitors the state of the key matrix and,<br />
when a change is detected, sends the data to the keyboard controller, which is usually<br />
located on the computer's motherboard. The keyboard processor communicates with<br />
the keyboard controller using an IBM communications protocol developed in the early<br />
1980s. The controller receives the data from the keyboard processor and stores it in<br />
its internal buffer. The operating system communicates with the keyboard controller<br />
through three 8-bit registers.<br />


Register Port Read/Write<br />
Controller Input Buffer 0x60, 0x64 Write<br />
Controller Output Buffer 0x60 Read<br />
Controller Status Register 0x64 Read<br />

Table 7.4: I/O ports for communicating with keyboard<br />

When a character is available in the keyboard controller buffer, IRQ 1 is<br />
raised and the interrupt handler routine should read the character from the buffer. If<br />
the character is not read, the IRQ will not be raised when the next character arrives.<br />
The keyboard sends scancodes rather than the actual characters, and these scancodes<br />
must be converted. A key can send a one-, two- or four-byte scancode.<br />

The SWEB implementation of the keyboard is a simple character device that requires only the<br />
input buffer. Its driver is implemented in the KeyboardManager class, which can be found<br />
in arch_keyboard_manager.h and arch_keyboard_manager.cpp. The main input part of<br />
the keyboard driver is the interrupt handler, which is called each time there is<br />
a character in the keyboard controller buffer. In the interrupt routine, the keyboard<br />
is disabled and the scancodes are read from the keyboard controller buffer. One-byte<br />
scancodes are saved directly to the keyboard buffer, whereas two- and four-byte<br />
scancodes are converted to one byte in the IRQ handler before they are stored.<br />
The one-byte scancodes lie in the range from 0x00 to 0x60, and there are fewer than<br />
32 different two- and four-byte scancodes. The two- and four-byte<br />
scancodes can therefore be remapped to the range between 0x60 and 0x80. The<br />
range from 0x80 to 0xFF (the highest bit) is intentionally reserved for extending the driver to handle<br />
key release events. Currently these events are completely ignored by SWEB, with the<br />
exception of the SHIFT, CTRL and ALT modifiers.<br />

The data structure used to store the<br />
scancodes is the RingBuffer, a circular buffer of a fixed<br />
size. When the buffer is full and no data has been read, the oldest entry is overwritten with<br />
the new data. This means that if the Terminal thread is blocked and unable to read from<br />
the keyboard driver buffer, data loss may occur. The size of the keyboard RingBuffer<br />
is set to 1024 bytes, meaning that 1k of scancodes can be remembered by the<br />
driver. This should be more than enough, considering that this<br />
section has fewer than 1024 characters.<br />
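The overwrite-on-full behaviour described above can be sketched as follows. This is not the SWEB RingBuffer class; the name OverwritingRing, its template parameter and its put()/get() signatures are assumptions chosen for a self-contained illustration, and no locking is shown.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative overwrite-on-full ring buffer; not the SWEB RingBuffer class.
template <std::size_t N>
class OverwritingRing {
public:
    // Stores a value; when the buffer is full the oldest entry is lost.
    void put(uint8_t value) {
        data_[(head_ + count_) % N] = value;
        if (count_ == N)
            head_ = (head_ + 1) % N;   // full: the oldest slot was just overwritten
        else
            ++count_;
    }

    // Fetches the oldest entry; returns false if the buffer is empty.
    bool get(uint8_t &value) {
        if (count_ == 0)
            return false;
        value = data_[head_];
        head_ = (head_ + 1) % N;
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }

private:
    uint8_t data_[N] = {};
    std::size_t head_ = 0;    // index of the oldest entry
    std::size_t count_ = 0;   // number of valid entries
};
```

With a capacity of three, putting the values 1, 2, 3 and 4 leaves 2, 3 and 4 in the buffer: the first scancode is silently dropped, which is exactly the data loss the text warns about.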

Function: void serviceIRQ();<br />
Description: This method is the actual IRQ handler. It receives the scancode, performs the necessary conversions, modifies the keyboard status and stores the scancode in the internal buffer.<br />

Function: void modifyKeyboardStatus(uint8 scancode);<br />
Description: If the parameter scancode contains one of the modifier key scancodes (SHIFT, CTRL, ALT, NUM, CAPS or SCROLL), the internal keyboard status variable is modified. This function is called from the interrupt handler for each scancode that is received from the keyboard. The method is aware of key release events as well. If the keyboard status is modified, the setLEDs method is called.<br />

Function: void setLEDs(void);<br />
Description: Checks whether the current NUM, CAPS or SCROLL status has changed and, if so, updates the keyboard LEDs accordingly. Most modern keyboards keep track of the LEDs themselves, but most older keyboards do not.<br />

Function: void kb_wait();<br />
Description: Checks the state of the keyboard controller and waits until it is ready to accept commands or until a timeout occurs.<br />

Function: void send_cmd(uint8 cmd, uint8 port = 0x64);<br />
Description: Sends a command to the keyboard controller. If the port is not given, the default port 0x64 is used. Before sending a command, the kb_wait() method is called.<br />
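The wait-then-send pattern of kb_wait() and send_cmd() can be sketched like this. The port accesses are injected as callbacks so the example runs outside a kernel; waitForController(), sendCommand(), the poll limit of 10000 and the decision to skip the write after a timeout are all assumptions, not the SWEB implementation. The only hardware fact relied on is that bit 1 of the status register at port 0x64 stays set while the controller input buffer is full.

```cpp
#include <cstdint>
#include <functional>

// Bit 1 of the keyboard controller status register (port 0x64) is set
// while the controller input buffer is still full.
constexpr uint8_t KBD_STATUS_INPUT_FULL = 0x02;

// Polls the status register until the controller can accept a command.
// read_status stands in for reading port 0x64; max_polls is an arbitrary,
// hypothetical timeout. Returns true if the controller became ready.
bool waitForController(const std::function<uint8_t()> &read_status,
                       uint32_t max_polls = 10000) {
    for (uint32_t i = 0; i < max_polls; ++i) {
        if ((read_status() & KBD_STATUS_INPUT_FULL) == 0)
            return true;
    }
    return false;  // timeout
}

// Sends a command only if the controller became ready in time.
// write_port stands in for writing one byte to an I/O port.
bool sendCommand(const std::function<uint8_t()> &read_status,
                 const std::function<void(uint16_t, uint8_t)> &write_port,
                 uint8_t cmd, uint16_t port = 0x64) {
    if (!waitForController(read_status))
        return false;
    write_port(port, cmd);
    return true;
}
```

Bounding the loop matters in an interrupt-driven driver: an unresponsive controller would otherwise hang the kernel inside the wait.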

The terminals, on the other hand, need the keys that were pressed and have no use<br />
for scancodes. The method getKeyFromKbd() can be used to read the next character<br />
from the keyboard. This method is non-blocking: if no key is available,<br />
the function simply returns boolean false. Note that this behavior is likely to<br />
change in the future.<br />

Function: bool getKeyFromKbd(uint32 &key);<br />
Description: Method for reading the next character from the keyboard. This method checks the internal keyboard buffer and, if there is a scancode in the buffer, converts it and stores it in the key parameter. The method returns true if there is any scancode available and false otherwise.<br />

Function: uint32 convertScancode(uint8 scancode);<br />
Description: The scancode is converted according to the conversion table and the corresponding character is returned to the caller. The conversion table is stored in the STANDARD_KEYMAP internal variable. The conversion method is a standard table lookup.<br />

Function: bool isCaps(); bool isNum(); bool isScroll(); bool isShift(); bool isCtrl(); bool isAlt(); bool isAltGr();<br />
Description: This set of methods tests the status of the keyboard modifiers. The methods return true if the modifier is on, false otherwise.<br />
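The table lookup performed by convertScancode() can be sketched as follows. DEMO_KEYMAP is a tiny illustrative fragment and not the contents of STANDARD_KEYMAP; KeymapEntry, convertScancodeDemo() and the explicit shift_active parameter are hypothetical names introduced for the example. The four scancodes are make codes from scancode set 1.

```cpp
#include <cstdint>

// One row of an illustrative keymap: scancode plus plain and shifted character.
struct KeymapEntry {
    uint8_t scancode;
    char plain;
    char shifted;
};

// Tiny demo fragment; NOT the contents of STANDARD_KEYMAP.
constexpr KeymapEntry DEMO_KEYMAP[] = {
    {0x02, '1', '!'},
    {0x1E, 'a', 'A'},
    {0x30, 'b', 'B'},
    {0x39, ' ', ' '},
};

// Searches the table for the scancode and returns the matching character,
// or 0 if the scancode is not mapped.
uint32_t convertScancodeDemo(uint8_t scancode, bool shift_active) {
    for (const KeymapEntry &entry : DEMO_KEYMAP) {
        if (entry.scancode == scancode)
            return static_cast<uint32_t>(shift_active ? entry.shifted : entry.plain);
    }
    return 0;
}
```

In the driver the modifier state would come from methods such as isShift() rather than from a parameter; passing it in keeps the sketch free of global state.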

The following changes shall be implemented in the SWEB keyboard driver:<br />

• The CharacterDevice interface should be implemented, thus enabling applications to<br />
get the raw scancodes through the VFS read method.<br />

• The convertScancode() method should be moved to the Terminal class, thus enabling<br />
individual key mappings for each Terminal.<br />

7.5 Serial Port (or how to write a character device driver in SWEB)<br />

This section gives the reader a simple tutorial on how to write a character device<br />
driver for the SWEB operating system. The tutorial assumes that the SWEB source code has been<br />
downloaded and that the system is set up to compile it. The tutorial is<br />
written with the “x86” architecture in mind, but the basic outline does not differ much<br />
for other architectures. To avoid interference from the actual SWEB serial port driver,<br />
the following lines must be removed (commented out) from the actual source code:<br />

In file /common/kernel/main.cpp:<br />

664: Scheduler::instance()->addNewThread(<br />

665: new SerialThread( "SerialTestThread" )

666: );<br />

First, some background information on serial ports.<br />
The serial port is an I/O device that enables the computer to communicate with other<br />
devices that support the RS-232 communication standard. As its name suggests, the<br />
data on the serial line is transmitted as a sequence of bits, so every byte of data that<br />
needs to be transmitted must be converted into a series of bits. This<br />
work is performed by the UART, which stands for Universal Asynchronous<br />
Receiver/Transmitter. The 8250 series, which includes the 16450, 16550,<br />
16650 and 16750 UARTs, is the type most commonly found in PCs. The<br />
later versions of the UART contain FIFO buffers as well. The UART has several registers<br />
which can be used to set the serial port communication parameters and to send and<br />
receive data. Some of these registers are:<br />

Base Address Read/Write Abbr. Register Name<br />
+0 Write THB Transmitter Holding Buffer<br />
+0 Read RBR Receiver Buffer<br />
+2 Write FCR FIFO Control Register<br />
+3 Read/Write LCR Line Control Register<br />
+4 Read/Write MCR Modem Control Register<br />

The base address and the IRQ of the UART depend on the communication port. In<br />
most PCs there are 4 communication ports with the following I/O addresses (hexadecimal) and IRQ<br />
numbers.<br />

Name Address IRQ<br />
COM 1 0x3F8 4<br />
COM 2 0x2F8 3<br />
COM 3 0x3E8 4<br />
COM 4 0x2E8 3<br />

To implement the device driver, a class named SerialPort is used. The declaration (header) file should<br />
be placed in arch/arch/include/arch_serial_tutorial.h and the definition (source) file should<br />
be placed in arch/arch/source/arch_serial_tutorial.cpp. The class inherits from the CharacterDevice<br />
class, thus inheriting the input and output buffers as well. The class<br />
SerialPort requires the following member variables:<br />

• base port<br />

• IRQ<br />

The constructor of this class accepts the following parameters:<br />

• name of the device to be passed to the device file system ( char * ),<br />

• base port of the serial port ( uint16 ),<br />

• IRQ of the serial port ( uint8 )<br />

The name parameter is passed to the CharacterDevice constructor and automatically registered<br />
with the device file system. The constructor also calls the enableIRQ() function<br />
(defined in 8259.h) with the IRQ parameter. This tells the PIC to pass<br />
the interrupt through to the processor. Finally, the parameters base port and IRQ are stored<br />
in member variables for future use. Writing and reading of UART registers is<br />
done via the writeUART() and readUART() methods respectively. Some additional enumerations<br />
for the communication parameters and constants for the UART registers are<br />
defined in this class as well. At the end of the header file, a global variable for the COM1<br />
serial port is created for tutorial purposes. For device detection see the SerialManager<br />
section of this chapter.<br />

#include "types.h"
#include "ports.h"
#include "8259.h" // required for enableIRQ()
// The header that declares CharacterDevice must be included as well.

class SerialPort : public CharacterDevice {

public:
  uint16 base_port; ///< The base I/O port
  uint8 uart_type;  ///< Type of the connected UART
  uint8 irq_num;    ///< IRQ number

  /// \brief The baud rate for serial communication
  typedef enum _br {
    BR_9600,
    BR_14400,
    BR_19200,
    BR_38400,
    BR_55600, // configured with divisor 0x02, which corresponds to 57600 baud
    BR_115200
  } BAUD_RATE_E;
  /// \brief The parity for serial communication
  typedef enum _par {
    ODD_PARITY,
    EVEN_PARITY,
    NO_PARITY
  } PARITY_E;
  /// \brief Number of data bits used in serial communication
  typedef enum _db {
    DATA_7,
    DATA_8
  } DATA_BITS_E;
  /// \brief Number of stop bits used in serial communication
  typedef enum _sb {
    STOP_ONE,
    STOP_TWO,
    STOP_ONEANDHALF
  } STOP_BITS_E;

  static uint32 THB; ///< UART Transmitter Holding Buffer
  static uint32 RBR; ///< UART Receiver Buffer Register
  static uint32 IER; ///< UART Interrupt Enable Register
  static uint32 IIR; ///< UART Interrupt Identification Register
  static uint32 FCR; ///< UART FIFO Control Register
  static uint32 LCR; ///< UART Line Control Register
  static uint32 MCR; ///< UART Modem Control Register
  static uint32 LSR; ///< UART Line Status Register
  static uint32 MSR; ///< UART Modem Status Register

  SerialPort(char *name, uint16 baseport, uint8 irqnum) : CharacterDevice(name) {
    base_port = baseport;
    irq_num = irqnum;
    enableIRQ(irq_num);
  }
  ~SerialPort() {}

  // Write one byte to the UART register at offset reg from the base port.
  void writeUART(uint32 reg, uint8 what) {
    outportb(this->base_port + reg, what);
  }
  // Read one byte from the UART register at offset reg from the base port.
  uint8 readUART(uint32 reg) {
    return inportb(this->base_port + reg);
  }

  uint32 setupPort(BAUD_RATE_E, DATA_BITS_E, STOP_BITS_E, PARITY_E);

  // Implemented later in this tutorial.
  void serviceIRQ();
  void processOutBuffer(void);
  void processInBuffer(void);
};

// Global variable
SerialPort *sp_tutorial = new SerialPort("com1", 0x3F8, 4);

The configuration of the UART is placed in the setupPort() method. The parameters of<br />
this method are:<br />

• port speed<br />

• data bits<br />

• stop bits<br />

• parity<br />

uint32 SerialPort::setupPort(BAUD_RATE_E baud_rate, DATA_BITS_E data_bits, STOP_BITS_E stop_bits, PARITY_E parity)
{
  writeUART(IER, 0x00); // turn off interrupts

  // The divisor is 115200 divided by the desired baud rate.
  uint8 divisor = 0x0C;
  switch (baud_rate)
  {
    case BR_14400:
      divisor = 0x08;
      break;
    case BR_19200:
      divisor = 0x06;
      break;
    case BR_38400:
      divisor = 0x03;
      break;
    case BR_55600: // divisor 0x02 yields 57600 baud
      divisor = 0x02;
      break;
    case BR_115200:
      divisor = 0x01;
      break;
    default:
    case BR_9600:
      divisor = 0x0C;
      break;
  }

  writeUART(LCR, 0x80);  // activate the divisor latch (DL)
  writeUART(0, divisor); // DL low byte (offset +0 while the latch is active)
  writeUART(IER, 0x00);  // DL high byte (offset +1 while the latch is active)

  uint8 data_bit_reg = 0x03;
  switch (data_bits)
  {
    case DATA_8:
      data_bit_reg = 0x03;
      break;
    case DATA_7:
      data_bit_reg = 0x02;
      break;
  }

  uint8 par = 0x00;
  switch (parity)
  {
    case NO_PARITY:
      par = 0x00;
      break;
    case EVEN_PARITY:
      par = 0x18;
      break;
    case ODD_PARITY:
      par = 0x08;
      break;
  }

  uint8 stopb = 0x00;
  switch (stop_bits)
  {
    case STOP_ONE:
      stopb = 0x00;
      break;
    case STOP_TWO:
    case STOP_ONEANDHALF:
      stopb = 0x04;
      break;
  }

  writeUART(LCR, data_bit_reg | par | stopb); // also deactivates the DL
  writeUART(FCR, 0xC7); // enable and clear the FIFOs
  writeUART(MCR, 0x0B); // DTR, RTS and OUT2
  writeUART(IER, 0x0F); // turn on interrupts

  return 0;
}
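The constants used in the listing above follow from two simple rules of the 8250-family UART: the divisor is 115200 divided by the desired baud rate, and the Line Control Register is a bit field. The sketch below makes both explicit. The helper names baudDivisor() and lineControl() and the Parity enum are hypothetical and do not appear in SWEB.

```cpp
#include <cstdint>

// With the standard PC UART clock the maximum rate is 115200 baud;
// the divisor latch holds that maximum divided by the desired rate.
constexpr uint32_t UART_MAX_BAUD = 115200;

constexpr uint16_t baudDivisor(uint32_t baud) {
    return static_cast<uint16_t>(UART_MAX_BAUD / baud);
}

// Line Control Register layout: bits 0-1 word length (5 to 8 data bits),
// bit 2 extra stop bit, bit 3 parity enable, bit 4 even parity,
// bit 7 divisor latch access (left cleared here).
enum class Parity { None, Odd, Even };

constexpr uint8_t lineControl(uint8_t data_bits, bool extra_stop_bit, Parity parity) {
    return static_cast<uint8_t>(
        ((data_bits - 5) & 0x03) |
        (extra_stop_bit ? 0x04 : 0x00) |
        (parity == Parity::None ? 0x00 : parity == Parity::Odd ? 0x08 : 0x18));
}
```

For example, 9600 baud gives the divisor 12 (0x0C), and 8 data bits, one stop bit and no parity give the LCR value 0x03.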

When data arrives over the serial port, an IRQ is raised. The class SerialPort<br />
implements the serviceIRQ() method, which simply receives the data and stores it in the<br />
input buffer.<br />

void SerialPort::serviceIRQ()
{
  uint8 int_id_reg = readUART(IIR);

  if (int_id_reg & 0x01)
    return; // it is not my IRQ or the IRQ has already been handled

  // bits 1-2 of the IIR identify the cause of the interrupt
  uint8 int_id = (int_id_reg & 0x06) >> 1;

  switch (int_id)
  {
    case 0: // Modem status changed
      break;
    case 1: // Output buffer is empty
      break;
    case 2: // Data is available
      int_id = readUART(RBR); // int_id is reused to hold the received byte
      _in_buffer->put(int_id);
      break;
    case 3: // Line status changed
      break;
    default: // This will never be executed
      break;
  }
}
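The decoding of the Interrupt Identification Register used in serviceIRQ() can be isolated into a small pure function, which makes it easy to check without hardware. This is a sketch only; UartEvent and decodeIIR() are hypothetical names that are not part of SWEB.

```cpp
#include <cstdint>

// Possible causes reported by the Interrupt Identification Register.
enum class UartEvent {
    None,              // bit 0 set: no interrupt pending on this UART
    ModemStatus,       // ID 0
    TransmitterEmpty,  // ID 1
    DataAvailable,     // ID 2
    LineStatus         // ID 3
};

// Bit 0 is the "no interrupt pending" flag, bits 1-2 carry the cause.
// The upper bits (FIFO status on later UARTs) are ignored.
UartEvent decodeIIR(uint8_t iir) {
    if (iir & 0x01)
        return UartEvent::None;
    switch ((iir & 0x06) >> 1) {
        case 0:  return UartEvent::ModemStatus;
        case 1:  return UartEvent::TransmitterEmpty;
        case 2:  return UartEvent::DataAvailable;
        default: return UartEvent::LineStatus;
    }
}
```

Checking bit 0 first matters because COM 1 and COM 3 share IRQ 4: a handler may be invoked although its own UART has nothing to report.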

After writing the interrupt handler, the next step is to register it<br />
with the SWEB interrupt utilities. The file arch/x86/source/InterruptUtils.cpp contains<br />
the class of the same name, which manages all interrupt functions in SWEB. The<br />
handlers for IRQ 3 and IRQ 4 are already present, so only the lines for calling the tutorial<br />
driver must be added. First, #include "arch_serial_tutorial.h" must be added<br />
for access to the global variable; after that, in the function extern "C" void irqHandler_4(),<br />
the interrupt handler must be called with sp_tutorial->serviceIRQ();.<br />

At this point the character device driver can be tested. In the testing thread, add a variable<br />
char ch; and an endless loop that calls sp_tutorial->readData(0, 1, &ch); followed<br />
by kprintfd( "%c", ch );. The driver is now capable of receiving data from the serial<br />
port.<br />

Written data, however, still ends up in the output buffer and is not actually written to the<br />
UART. The method processOutBuffer() implements this by getting the characters from<br />
the output buffer and writing them to the serial port. The method processInBuffer()<br />
must be overridden as well, since the default behavior is to delete the characters from<br />
the buffer.<br />

void SerialPort::processOutBuffer(void)
{
  if (!_out_buffer)
    return;
  uint32 jiffies = 0;

  // Wait until the transmitter is empty (LSR bit 6) or the poll limit is reached.
  while (!(readUART(LSR) & 0x40) && jiffies++ < 50000);
  if (jiffies > 50000) // TIMEOUT: the counter ends at 50001 when the limit is hit
    return;
  writeUART(THB, _out_buffer->get());
}

void SerialPort::processInBuffer(void)
{
  // Intentionally empty: keeps the received characters in the input buffer.
  return;
}

Additionally, the serial port thread must be registered with the scheduler. This ensures<br />
that the buffer processing methods actually get called. The thread should be registered somewhere<br />
at the beginning of the testing thread with: Scheduler::instance()->addNewThread(<br />
sp_tutorial );. The basic serial port support is now complete. The only thing left to do<br />
is to test the driver by extending the test routine to do a little bit of writing. The test<br />
routine could also be extended to include writing and reading through the VFS.<br />
The actual implementation of the serial port driver in SWEB differs slightly from the<br />
one in this tutorial, but the difference is only structural; the functionality remains the<br />
same.<br />

7.6 SerialManager<br />

The SerialManager class implements the collection of SerialPort instances as well as<br />
the doDetection method used to populate this collection. The driver instances are<br />
stored in a simple array. This type of storage wastes some memory, but enables<br />
fast and simple access. Considering that the search through the collection<br />
happens in the interrupt handler, speed is the much more important factor.<br />


7.7 Terminal<br />

7.7.1 Introduction<br />

Terminals are the basic input/output units of any computer. In the early days,<br />
computer terminals consisted of printers as the output devices coupled with card<br />
readers as the input devices. These early terminals were slow, but sufficient for the<br />
slow computers of that time. With the advances in computer technology,<br />
computers became faster and more interactivity was required from the terminal. So<br />
the printers were replaced with video screens and the card readers with keyboards. This<br />
combination is today widely known as the computer terminal. UNIX and its<br />
derivatives such as Linux have introduced the concept of virtual terminals. These are<br />
device-independent input/output primitives that implement common terminal functions such<br />
as clearing the screen, scrolling, writing to a specific position etc. In this way, there is an<br />
additional level of abstraction before the information reaches the actual I/O device.<br />
In SWEB, this device driver is implemented in the Console class.<br />

7.7.2 Escape codes<br />

Many computer terminals and terminal emulators support color and cursor control<br />
through a system of escape sequences. One such standard is commonly referred to as<br />
ANSI Color. Several terminal specifications are based on the ANSI color standard, including<br />
VT100. The VT100 terminals use the escape character followed by "[" as the control sequence<br />
introducer. For the complete list of ANSI terminal escape codes you can consult<br />
the Internet. In SWEB there is currently no escape sequence interpreter (no terminal<br />
emulation).<br />

7.7.3 Implementation in <strong>SWEB</strong><br />

The virtual terminal concept is implemented in SWEB as well. The SWEB terminal is a<br />
virtual device that has input/output capabilities, a character buffer and the aforementioned<br />
common functions. Every terminal can be attached to a console driver that does the<br />
actual input/output. This console can be a serial port, a printer and a card reader, or<br />
the classic keyboard and monitor combination. For the complete list of implemented<br />
consoles, see the next section of this chapter. The definition of the terminal class<br />
can be found in common/include/console/Terminal.h and its implementation in<br />
common/source/console/Terminal.cpp.<br />

Function: Terminal(char *name, Console *console, uint32 num_columns, uint32 num_rows);<br />
Description: Constructor; creates the terminal with the given parameters. name is the name under which the terminal will be registered with the Virtual File System; console is the parent console which provides the output capabilities to this terminal; num_rows and num_columns give the maximum coordinates of the terminal output buffer.<br />

Function: ~Terminal();<br />
Description: Standard destructor. Simply deallocates the buffers.<br />

Table 7.9: Terminal class base

When a terminal is created, SWEB automatically associates two streams with it. The first<br />
is the input stream, which reads characters from the keyboard and stores them in the<br />
internal buffer. The second is the output stream, which writes the characters<br />
to the parent console. The functionality of the input stream is implemented by the<br />
following functions:<br />

Function: uint32 read();<br />
Description: Returns a single character from the input buffer.<br />

Function: uint32 readLine(char *buffer, uint32 size_of_buffer);<br />
Description: Reads a stream of characters into the preallocated buffer of the given size. Characters are read until a carriage return or line feed arrives or until the preallocated buffer is full. The number of characters actually read is returned. This function blocks the calling thread, effectively sending it to sleep until a character arrives from the keyboard.<br />

Function: uint32 readLineNoBlock(char *buffer, uint32 size_of_buffer);<br />
Description: The non-blocking version of the above function. It only reads the characters from the internal buffer and returns when the buffer is empty, without waiting for keys from the keyboard.<br />

Function: void clearBuffer();<br />
Description: Clears the input buffer. This function is automatically called when the terminal is opened by the VFS.<br />

Function: void putInBuffer(uint32 key);<br />
Description: Places the character given by the parameter in the input buffer. Usable for implementing pipes. This function is called by the keyboard driver (more precisely, the parent console) to place the incoming keys in the active terminal buffer.<br />

Function: void backspace(void);<br />
Description: Deletes the last key from the keyboard buffer and also deletes the last displayed character from the screen. It should only be called by the keyboard driver in response to the backspace key event.<br />

Table 7.10: Terminal input class methods
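The termination rules of readLineNoBlock() in Table 7.10 (stop at a carriage return or line feed, at a full destination buffer, or at an empty input buffer) can be sketched with an in-memory queue. readLineDemo() and its std::deque input are hypothetical, and whether the real function stores the terminating character in the buffer is not stated in the text; the sketch assumes that it does.

```cpp
#include <cstdint>
#include <deque>

// Non-blocking line read over an in-memory queue (illustrative only).
// Copies characters until CR or LF has been copied, the destination is
// full, or the input queue is empty. Returns the number of characters copied.
uint32_t readLineDemo(std::deque<char> &input, char *buffer, uint32_t size_of_buffer) {
    uint32_t count = 0;
    while (count < size_of_buffer && !input.empty()) {
        char c = input.front();
        input.pop_front();
        buffer[count++] = c;       // assumption: the terminator is stored too
        if (c == '\r' || c == '\n')
            break;
    }
    return count;
}
```

The blocking readLine() would differ only in the empty-queue case, where it would put the calling thread to sleep instead of returning.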

The terminal output stream is a rectangular grid arranged in rows and columns, where<br />
each grid intersection, or character cell, can contain a character. Each character has<br />
its own foreground color and each character cell has its own background color. The<br />
functions manipulating the terminal output buffer are synchronised, meaning that<br />
multiple threads cannot modify the output buffer at the same time. The terminal output<br />
buffer must not be modified directly, but only through the exposed output functions.<br />


Function: void write(char character);<br />
Description: Writes a single character at the current cursor position with the current colour and paints the cell with the current background colour. The cursor is moved to the next position and the terminal is scrolled if the row has been filled.<br />

Function: void writeString(char const *string);<br />
Description: Writes a null-terminated string to the terminal.<br />

Function: void writeBuffer(char const *buffer, size_t len);<br />
Description: Like the above, except that the buffer does not have to be null-terminated; instead an additional parameter giving the size of the buffer is passed to the function.<br />

Function: virtual int32 writeData(int32 offset, int32 size, const char *buffer);<br />
Description: The Inode interface wrapper function. The offset parameter must be zero or nothing will be written.<br />

Function: void setForegroundColor( color ); void setBackgroundColor( color );<br />
Description: The functions for setting the parameters which will be used as the next character colour and the cell background colour.<br />

Table 7.11: Terminal output class methods

The terminal class should be extended to enable different keyboard mappings. The keyboard driver is currently in charge of converting raw scancodes, but that is really a job for the terminals.<br />

7.8 Console<br />

The console is a class that provides the output interface between the actual hardware devices and the terminals. Every console has a common set of features that must be implemented in order to satisfy the requirements of the terminal interface. All text written to the console is written to an area owned by the console called the console window. The console window is an attribute of the console and is organized in a similar manner to the terminal’s output buffer. The console window is always smaller than or equal to the size of the terminal buffer, and can be moved to view different areas of the underlying terminal buffer. If the terminal buffer is larger than the console window, the console should provide some way to scroll through the rest of the text.<br />

A cursor indicates the screen buffer position where text is currently read or written. The origin for character cell coordinates in the screen buffer is the upper left corner, and the positions of the cursor and the console window are measured relative to that origin. Use zero-based indexes to specify positions; that is, specify the topmost row as row 0 and the leftmost column as column 0. The maximum value for the row and column indexes depends on the hardware in question.<br />

The Console in SWEB also serves as the extension to the keyboard driver. Every console runs as a kernel Thread object. In its Run method the console polls the keyboard driver. The special function keys (like terminal switch or scrolling) are handled directly by the console, and the normal letter (or number) keys are dispatched to the active terminal.<br />

The following output functions are defined in the Console base class:<br />

Function | Description<br />
virtual void consoleClearScreen(); | Clears the screen. All characters are deleted and the background color of all cells is set to black.<br />
virtual uint32 consoleSetCharacter(uint32 const &row, uint32 const &column, uint8 const &character, uint8 const &state); | Writes the character to the specified coordinates with the specified attribute.<br />
virtual uint32 consoleGetNumRows() const; virtual uint32 consoleGetNumColumns() const; | Returns the dimensions of the console.<br />
virtual void consoleScrollUp(); | When writing to the console, the cursor moves to the beginning of the next row when it reaches the end of the current row. This function is called when the cursor advances beyond the last row. It scrolls up the rows so that the last row is free for writing.<br />
virtual void consoleSetForegroundColor(FOREGROUNDCOLORS const &color); virtual void consoleSetBackgroundColor(BACKGROUNDCOLORS const &color); | Sets the text attributes to be used for the next write to the console.<br />
Table 7.12: Console output functions<br />

7.8.1 Text Console<br />

The text console represents the simple text mode, which can display 80 columns by 25 lines of high resolution text characters with 16 colors. The display data is held in a 4 kilobyte video memory buffer located at 0xB8000 (or at 0xC00B8000 once paging is set up). Two bytes of the buffer are used for each character: the first byte holds the character itself and the second byte holds the attributes of the character (color, blinking etc.). When creating a text console, a parameter can be passed to the constructor to set the number of terminals created for the console. The text console implements all of the output functions in the table above.<br />
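The two-bytes-per-cell layout described above can be sketched as follows. This is a minimal illustration and not SWEB code: the names `cellOffset`, `makeAttribute` and `putCell` are made up for this example, and the attribute bit layout is the usual VGA text-mode one (bits 0-3 foreground, bits 4-6 background, bit 7 blink), which is assumed here rather than taken from the SWEB sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Geometry of the text mode described above.
constexpr std::size_t kColumns = 80;
constexpr std::size_t kRows = 25;

// Byte offset of the character byte of cell (row, column). The attribute
// byte follows at offset + 1. Both indexes are zero-based.
constexpr std::size_t cellOffset(std::size_t row, std::size_t column)
{
  return (row * kColumns + column) * 2;
}

// Assumed VGA attribute layout: bits 0-3 foreground, bits 4-6 background,
// bit 7 blink.
constexpr std::uint8_t makeAttribute(std::uint8_t foreground, std::uint8_t background, bool blink)
{
  return static_cast<std::uint8_t>((foreground & 0x0F) | ((background & 0x07) << 4) | (blink ? 0x80 : 0x00));
}

// Writes one character into a buffer laid out like text-mode video memory.
// In a kernel, buffer would point at the mapped video memory; here it can be
// any array of at least kRows * kColumns * 2 bytes.
inline void putCell(std::uint8_t* buffer, std::size_t row, std::size_t column,
                    char character, std::uint8_t attribute)
{
  assert(buffer != nullptr && row < kRows && column < kColumns);
  const std::size_t offset = cellOffset(row, column);
  buffer[offset] = static_cast<std::uint8_t>(character);
  buffer[offset + 1] = attribute;
}
```

With this layout the last cell of the screen (row 24, column 79) starts at byte offset 3998, so the whole screen needs 4000 bytes, which fits the 4 kilobyte buffer mentioned above.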

7.8.2 Frame Buffer Console<br />

The Frame Buffer Console (FBC) is a console that uses graphical modes to emulate text-mode consoles. The normal VGA text consoles are limited to 512 characters, so the Frame Buffer Console is the only way to support Unicode characters. 16 bits are reserved in the frame buffer for each pixel. The size of the buffer as well as the number of rows and columns available depend on the resolution used. SWEB currently has the 8x16 and sun8x16 fonts implemented for the Frame Buffer Console. The resolution of the console can be chosen by setting the vbematch parameter in the GRUB boot loader.<br />
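How the resolution determines the number of rows and columns and the buffer size can be sketched like this. The struct and function names are hypothetical, and the sketch assumes a tightly packed frame buffer (no padding at the end of a scan line), which real video modes do not always guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Glyph size of the 8x16 fonts mentioned above.
const std::uint32_t kGlyphWidth = 8;
const std::uint32_t kGlyphHeight = 16;
const std::uint32_t kBytesPerPixel = 2; // 16 bits per pixel

struct TextGeometry
{
  std::uint32_t columns;      // text columns that fit on the screen
  std::uint32_t rows;         // text rows that fit on the screen
  std::uint32_t buffer_bytes; // frame buffer size, assuming no line padding
};

// Derives the text grid and buffer size from a resolution in pixels.
inline TextGeometry textGeometry(std::uint32_t width_px, std::uint32_t height_px)
{
  assert(width_px >= kGlyphWidth && height_px >= kGlyphHeight);
  TextGeometry geometry;
  geometry.columns = width_px / kGlyphWidth;
  geometry.rows = height_px / kGlyphHeight;
  geometry.buffer_bytes = width_px * height_px * kBytesPerPixel;
  return geometry;
}
```

For a 1024x768 mode this gives 128 columns, 48 rows and a 1572864 byte buffer.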



Chapter 8<br />

Block Devices<br />

8.1 Introduction<br />

bla bla bla<br />

8.1.1 what are block devices<br />

A block device stores data in blocks of a fixed size. Commonly these block sizes are multiples of 512 bytes, up to 32768 bytes. Each of those blocks can be accessed independently by using some form of unique id, so seek operations are possible on block devices.<br />

The commonly known opposite of block devices are character devices. Their data cannot be accessed directly or independently, and seek operations are not possible. Keep in mind that not all devices fit into this classification scheme. Take the system clock, for example: it has no blocks that could be addressed separately, and it does not generate or accept any character streams.<br />

The most commonly known block devices are disks, floppies, CD-ROMs, tapes and many others. Disks, for instance, usually have 512 byte blocks. Sure, it would be a lot more convenient to access each and every byte on the whole hard drive individually, but there are some reasons why this is definitely not the best way to do it. Let us assume that we want to access one byte on the hard drive. One reason is that the index of every single byte would have to be transferred to the hard drive / controller, which would lead to a huge number of disk operations in case you want to read a larger amount of data. Another reason is that you usually read larger amounts of data from disks anyway.<br />

8.1.2 hard drive layout<br />

Hard disks still consist of several platters mounted on a spindle that rotates very fast. Disk read/write heads float above each surface of each platter.<br />

These platters are divided into tracks and the tracks are divided into sectors. If every track had the same number of sectors, the sectors on a track near the middle of the platter would be much smaller than those at the edge of the platter, and you would be wasting space on the outside of the platter.<br />

Therefore most hard drives divide the ’longer’ outer tracks into more sectors than the ’shorter’ inner tracks. But how would you know how many sectors a track consists of when addressing a sector? Actually you don’t know. The hard drive itself simulates a constant number of sectors per track.<br />


8.1.3 how can blocks be accessed (chs, lba etc.)<br />

http://www.toolsthatwork.com/bbdocs/diskstructures.htm<br />

A long time ago disks were accessed using cylinder-head-sector (CHS) notation, which is based on the physical layout of the hard drive. Here are the basics of the notation: the set of tracks at the same position on every platter of the stack is called a cylinder, each head is one physical head on one side of a platter, and a sector is one part of a track. The CHS number range is really limited since there are only 10/8/6 bits available for addressing cylinders, heads and sectors. Therefore the hard drive size limit is around 8 GB (1023*255*63*512 bytes). Note that the cylinder and head values range from 0 to the maximum, while sectors range from 1 to the maximum. Later on in history these values became virtual, emulated by the hard drive itself in order to use more space, for example by simulating 16 heads with just 4 physically installed. CHS is still found in some parts of the operating system and in the partition table on the disk. Therefore you still have to use CHS.<br />

// LBA:  logical block address of the block<br />
// CYL:  value of the cylinder CHS coordinate<br />
// HPC:  number of heads per cylinder for the disk<br />
// HEAD: value of the head CHS coordinate<br />
// SPT:  number of sectors per track for the disk<br />
// SECT: value of the sector CHS coordinate<br />
// TEMP: buffer to hold a temporary value<br />
// This equation is used very often by operating systems such as DOS<br />
// (or SWEB) to calculate the CHS values it needs to send to the disk<br />
// controller or INT13h in order to read or write data.<br />
uint32 LBA = start_sector;<br />
uint32 cyls = LBA / (HPC * SPT);<br />
uint32 TEMP = LBA % (HPC * SPT);<br />
uint32 head = TEMP / SPT;<br />
uint32 sect = TEMP % SPT + 1; // sectors are numbered starting at 1<br />
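The fragment above can be wrapped into a self-contained pair of functions. This is only a sketch: the `Chs` struct and the names `lbaToChs` and `chsToLba` are chosen for this example and are not taken from the SWEB sources. The reverse direction is included to show that the two formulas are inverses of each other.

```cpp
#include <cassert>
#include <cstdint>

// One CHS coordinate triple. Cylinders and heads start at 0, sectors at 1.
struct Chs
{
  std::uint32_t cylinder;
  std::uint32_t head;
  std::uint32_t sector;
};

// Converts a logical block address into CHS coordinates for a disk with
// hpc heads per cylinder and spt sectors per track.
inline Chs lbaToChs(std::uint32_t lba, std::uint32_t hpc, std::uint32_t spt)
{
  assert(hpc > 0 && spt > 0);
  const std::uint32_t temp = lba % (hpc * spt);
  Chs chs;
  chs.cylinder = lba / (hpc * spt);
  chs.head = temp / spt;
  chs.sector = temp % spt + 1;
  return chs;
}

// Inverse mapping: CHS coordinates back to the logical block address.
inline std::uint32_t chsToLba(const Chs& chs, std::uint32_t hpc, std::uint32_t spt)
{
  assert(chs.sector >= 1 && chs.head < hpc && chs.sector <= spt);
  return (chs.cylinder * hpc + chs.head) * spt + (chs.sector - 1);
}
```

For a disk that reports 16 heads and 63 sectors per track, block 1234 maps to cylinder 1, head 3, sector 38, and converting that triple back yields 1234 again.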

Nowadays nearly everything is done by LBA (logical block addressing). This is a very simple addressing scheme that counts blocks from 0 to the maximum number of blocks. Translation, where necessary (since old hard drives do not support LBA), is done by the OS. LBA addressing is limited to 28 or 48 bit block numbers, resulting in addressable disk sizes of 128 GiB and 128 PiB respectively, assuming the common block size of 512 bytes. This seems to be the only usable way to handle large hard drives. (If you have 3 minutes left, visit<br />
http://www.hitachigst.com/hdd/research/recording_head/pr/PerpendicularAnimation.html<br />
for a nice animation on upcoming hard drive storage technology.)<br />
http://en.wikipedia.org/wiki/Hard_disk<br />
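The size limits quoted in this section follow directly from the number of address bits and the block size. The following sketch reproduces the arithmetic; the function names are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes addressable with address_bits bits of block number and
// the given block size.
inline std::uint64_t lbaCapacityBytes(unsigned address_bits, std::uint64_t block_size)
{
  assert(address_bits < 64);
  return (static_cast<std::uint64_t>(1) << address_bits) * block_size;
}

// Number of bytes addressable with the given CHS geometry.
inline std::uint64_t chsCapacityBytes(std::uint64_t cylinders, std::uint64_t heads,
                                      std::uint64_t sectors_per_track, std::uint64_t block_size)
{
  return cylinders * heads * sectors_per_track * block_size;
}
```

With 512 byte blocks, 28 bits give 2^37 bytes (128 GiB) and 48 bits give 2^57 bytes (128 PiB). The CHS product 1023*255*63*512 is 8414461440 bytes, which is the "around 8 GB" limit mentioned earlier.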

8.2 Implementation<br />

The basic concepts of block devices are implemented in SWEB, and a sample ATA/IDE driver is already registered and usable. The main ”mastermind” class of all the block device stuff is BDManager, found in arch_bd_manager.h and arch_bd_manager.cpp. This class is implemented as a singleton in order to manage all requests on block devices from kernel space. This global object holds a list of virtual devices (BDVirtualDevice). The BDManager can therefore be seen as a global device list and a tool for initialising and handling virtual devices. Virtual devices are objects that each represent a block device or a special sub-device (partition etc.); they hold the main information on the block device and handle requests on it. The device driver itself has to be registered in the BDManager and creates the virtual devices.<br />

8.2.1 How to implement new low-level block device drivers<br />

• Derive from BDDriver and implement all functions. You may want to derive from bdio too, to have basic functions for port I/O available.<br />
• Add your driver to the list found in BDManager::doDeviceDetection and do your device detection in blocking mode in this section.<br />
• Build and initialize virtual devices and register them with BDManager::addVirtualDevice( BDVirtualDevice* dev ).<br />
• Wait for requests...<br />

8.2.2 How to use block devices<br />

In short:<br />
• include arch_bd_manager.h<br />
• get the instance of BDManager<br />
• perhaps it’s a good idea to check the device numbering [uint32 BDManager->getNumberOfDevices()]<br />
• get ”your” device [my_device BDManager->getDeviceByNumber(uint32 dev_num)]<br />
• do checks! (is my buffer large enough?, correct id?, is start_block+num_block in bounds?, ...)<br />
• build a new instance of BDRequest (just to be on the safe side)<br />
• hand over the request to ”your” device [my_device->addRequest(BDRequest * command)]<br />
• check whether an error occurred or the request is still in progress... [BDRequest->getStatus()]<br />
• don’t forget to delete the BDRequest and your buffer after using them.<br />
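The "do checks!" step in the list above could look like the following helper. It is a hypothetical sketch, not part of the SWEB block device framework: the name `requestIsValid` and its parameters are assumptions made for this example. The comparisons are written so that neither start_block + num_block nor num_block * block_size can overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of SWEB: returns true if a request for
// num_block blocks starting at start_block stays inside a device with
// block_count blocks and fits into a buffer of buffer_size bytes.
inline bool requestIsValid(std::uint32_t start_block, std::uint32_t num_block,
                           std::uint32_t block_count, std::uint32_t block_size,
                           std::size_t buffer_size)
{
  if (num_block == 0 || block_size == 0)
    return false;
  // Written as a subtraction so that start_block + num_block cannot overflow.
  if (start_block >= block_count || num_block > block_count - start_block)
    return false;
  // 64-bit product so that num_block * block_size cannot overflow either.
  const std::uint64_t bytes_needed = static_cast<std::uint64_t>(num_block) * block_size;
  return bytes_needed <= buffer_size;
}
```

A caller would run this check with the values obtained from BD_GET_NUM_BLOCKS and BD_GET_BLK_SIZE before building the actual read or write request.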

Some hints on using block devices:<br />
• NEVER swap out any part of the block device architecture. Just imagine you swap out your disk driver. How can you get it back into RAM?<br />
• Think about RAM allocation while swapping.<br />
• Sometimes the disk driver needs some RAM to...<br />
• Keep in mind that a disk driver might need IRQs for communication with the device.<br />
• Be sure that the initialisation of (slow) block devices is finished before accessing any of them.<br />

Example<br />

//Actually we know the virtual device number we want to use: "3" (swap)<br />
//We figured this out earlier.<br />
uint32 dev_nr = 3;<br />
//What is the block size on this device?<br />
//Well, let's have a look.<br />
BDRequest * bd_bs = new BDRequest(dev_nr, BDRequest::BD_GET_BLK_SIZE);<br />
BDManager::getInstance()->getDeviceByNumber(dev_nr)->addRequest ( bd_bs );<br />
//this is a blocking request. It just references data already in memory.<br />
if (bd_bs->getStatus() != BDRequest::BD_DONE)<br />
{<br />
  // something went wrong... EXIT!<br />
}<br />
uint32 block_size = bd_bs->getResult();<br />
//We have to check how many blocks are available on this device:<br />
BDRequest * bd_bc = new BDRequest(dev_nr, BDRequest::BD_GET_NUM_BLOCKS);<br />
BDManager::getInstance()->getDeviceByNumber(dev_nr)->addRequest ( bd_bc );<br />
//this is a blocking request. It just references data already in memory.<br />
//Note: check the status of bd_bc here, not of bd_bs.<br />
if (bd_bc->getStatus() != BDRequest::BD_DONE)<br />
{<br />
  // something went wrong... EXIT!<br />
}<br />
uint32 block_count = bd_bc->getResult();<br />
//We assume that we want to read only one block (Number 234)<br />
uint32 blocks2read = 1;<br />
uint32 offset = 233; // since numbering starts with 0<br />
if (offset + blocks2read > block_count)<br />
{<br />
  //do this little check! The request would run past the end of the device.<br />
}<br />
//allocate some buffer or point somewhere in memory.<br />
char *my_buffer = (char *) kmalloc( blocks2read*block_size*sizeof(uint8) );<br />
//build the command<br />
BDRequest * bd = new BDRequest(dev_nr, BDRequest::BD_READ, offset, blocks2read, my_buffer);<br />
//and send it.<br />
BDManager::getInstance()->getDeviceByNumber(dev_nr)->addRequest ( bd );<br />
uint32 jiffies = 0;<br />
//actually we don't know if this request is blocking or not. Just to be<br />
//on the safe side, check if the output is valid by now.<br />
while( bd->getStatus() == BDRequest::BD_QUEUED && jiffies++ < 50000 );<br />
if( bd->getStatus() != BDRequest::BD_DONE )<br />
{<br />
  //We should definitely have a closer look at the status by now.<br />
  //It may happen that the request is still in the queue or had an error.<br />
}<br />
//By now the requested data should be in my buffer.<br />
//Print the buffer...<br />
for(uint32 i = 0; i < blocks2read*block_size; i++ )<br />
  kprintfd( "%2X%c", *(my_buffer+i), (i+1)%8 ? ' ' : '\n' );<br />
//don't forget to clean up!<br />
kfree( my_buffer ); delete bd; delete bd_bs; delete bd_bc;<br />

8.2.3 VirtualDevices<br />

These objects represent physical devices, or partitions on physical devices as soon as a valid partition table is found. You can use the device numbers or the device names to find the device you are looking for. The numbers start with 0 and are assigned in the sequence of device detection/registration. In our example:<br />
• The whole first ATA drive (0, ”idea”)<br />
• The first partition on the first ATA drive (1 = minix, kernel files, ”idea0”)<br />
• The second partition on the first ATA drive (2 = minix, user programs, ”idea1”)<br />
• The third partition on the first ATA drive (3 = swap, not formatted, ”idea2”)<br />
For example, the first IDE drive in your system will get the first number, the first partition on the first drive the second, the second partition on the first drive the third, the second drive the fourth, the first partition on the second drive the fifth, ...<br />

8.2.4 BDRequest<br />

This is the way to send a command to any block device. Create an instance with:<br />
BDRequest(dev_id, cmd, start_block = 0, num_block = 0, buffer = 0)<br />
dev_id: (uint32) the device ID you want to talk to<br />
cmd: (BD_CMD) what should the device do?<br />
basic commands:<br />
BD_READ<br />
BD_WRITE<br />
BD_GET_BLK_SIZE<br />
BD_GET_NUM_BLOCKS<br />
some extensions:<br />
BD_REINIT<br />
BD_DEINIT<br />
BD_SEND_RAW_CMD<br />
start_block: (uint32) where should the operation start?<br />
num_block: (uint32) how many blocks?<br />
buffer: (void *) some buffer to read / write data. This buffer is neither allocated nor checked anywhere in the block device framework. It is simply written or read without taking care of anything.<br />
getStatus() returns:<br />
BD_QUEUED<br />
BD_DONE<br />
BD_ERROR<br />
Please keep in mind that other commands or status codes might exist and will be used by device drivers not implemented yet.<br />
getResult() returns the requested metadata (e.g. the number of blocks on this device).<br />
Some other functions do exist, but you should really know what you are doing before using them. Really bad sync problems can arise! For demonstration, test and initialisation purposes (reading the partition table etc.) two functions have been implemented: BDVirtualDevice::readData and BDVirtualDevice::writeData. You may want to have a look at them as a reference.<br />



Chapter 9<br />

VFS, FS<br />

9.1 Introduction<br />

This document describes how the VFS (virtual file system) maintains the files in the file systems that it supports.<br />

The files in a file system are collections of data. A file system holds not only the data contained within its files but also the structure of the file system. It holds all of the information that users and processes see as files, directories, soft links, file protection information and so on.<br />

9.2 File system<br />

Each VFS object is stored by a suitable data structure, which includes both the object<br />

attributes and the object methods. <strong>The</strong> VFS may dynamically modify the methods <strong>of</strong><br />

the object and, hence, it may install specialized behavior for the object. <strong>The</strong> following<br />

sections explain the VFS objects and their interrelationships in detail.<br />

<strong>The</strong> basic objects known to the FS layer are files, file systems, inodes, and names for<br />

inodes.<br />

9.2.1 Files<br />

Files are things that can be read from or written to. They can also be mapped into memory and sometimes a list of file names can be read from them. [3] They map very closely to the file descriptor. Files are represented by the class File.<br />

9.2.2 Inodes<br />

An inode represents a basic object within a filesystem. It can be a regular file, a directory, a symbolic link, or a few other things. [3] The VFS does not make a strong distinction between different sorts of objects, but leaves it to the actual file-system implementation to provide appropriate behaviours. Each inode is represented by the class Inode.<br />

It may seem that Files and Inodes are very similar. They are, but there are some important differences. One thing to note is that there are some things that have inodes but never have files. A good example of this is a directory. Also, a file has state information that an inode does not have, particularly a position, which indicates where in the file the next read or write will be performed.<br />
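The difference in state can be illustrated with a small model in which two open files share one inode but keep independent positions. `SimpleInode` and `SimpleFile` are hypothetical, heavily simplified stand-ins and do not mirror the real SWEB Inode and File classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for an inode: the contents, shared by all open files.
struct SimpleInode
{
  std::string data;
};

// Hypothetical stand-in for a file: a reference to the inode plus a position.
struct SimpleFile
{
  SimpleInode* inode;
  std::size_t position; // per-open-file state that the inode does not have

  explicit SimpleFile(SimpleInode* i) : inode(i), position(0) {}

  // Reads up to size bytes at the current position and advances it.
  // Returns the number of bytes copied into buffer.
  std::size_t read(char* buffer, std::size_t size)
  {
    assert(buffer != nullptr);
    if (position >= inode->data.size())
      return 0;
    const std::size_t available = inode->data.size() - position;
    const std::size_t count = size < available ? size : available;
    std::memcpy(buffer, inode->data.data() + position, count);
    position += count;
    return count;
  }
};
```

Reading five bytes through one SimpleFile moves only that file's position; a second SimpleFile opened on the same SimpleInode still starts reading at offset 0.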

9.2.3 Filesystems<br />

A filesystem is a collection of inodes with one distinguished inode known as the ROOT. Other inodes are accessed by starting at the root and looking up a file name to get to another inode. A filesystem has a number of characteristics which apply uniformly to all inodes within the filesystem. Some of these are flags such as the READ-ONLY flag. Each filesystem is represented by the class Superblock. Each file points to the superblock that it has open, and also to the corresponding inode.<br />

There is a strong correlation between superblocks (and hence filesystems) and device numbers. Each filesystem must (appear to) have a unique device on which the filesystem resides. Some filesystems (such as ramfs) are marked as not needing a real device. As well as knowing about filesystems, the VFS knows about different filesystem types. Each type of filesystem is represented by the class FileSystemType.<br />

9.2.4 Names<br />

All inodes within a filesystem are accessed by name. The VFS uses a name-to-inode lookup process to find a certain inode; this may be expensive for some filesystems, but it is easy to implement. The structure of a filesystem is a tree. Each node in the tree corresponds to an inode in a given directory with a given name. Each node in the tree is represented by the class Dentry. The dentries act as an intermediary between Files and Inodes. Each file points to the dentry that it has open. Each dentry points to the inode that it references.<br />
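The name-to-inode lookup over the dentry tree can be sketched as a walk from the root, one path component at a time. The types `TreeInode` and `TreeDentry` and the function `lookupPath` are hypothetical and only model the relationships described above (dentry has a name, a parent, children and points to an inode); the real SWEB classes differ.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical, simplified inode: just an identifying number.
struct TreeInode
{
  int number;
};

// Hypothetical, simplified dentry: a named node of the tree pointing to
// its inode, its parent and its children.
struct TreeDentry
{
  std::string name;
  TreeInode* inode;
  TreeDentry* parent;
  std::map<std::string, TreeDentry*> children;
};

// Resolves an absolute path such as "/tmp/test" by walking the tree from
// root. Returns nullptr if a component does not exist.
inline TreeDentry* lookupPath(TreeDentry* root, const std::string& path)
{
  assert(root != nullptr);
  TreeDentry* current = root;
  std::size_t pos = 0;
  while (pos < path.size())
  {
    if (path[pos] == '/') // skip separators
    {
      ++pos;
      continue;
    }
    std::size_t end = path.find('/', pos);
    if (end == std::string::npos)
      end = path.size();
    const std::string component = path.substr(pos, end - pos);
    std::map<std::string, TreeDentry*>::iterator it = current->children.find(component);
    if (it == current->children.end())
      return nullptr;
    current = it->second;
    pos = end;
  }
  return current;
}
```

The inode of the resolved name is then reached through the returned dentry, which matches the statement that each dentry points to the inode it references.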

9.3 VFS data structures<br />

9.3.1 Superblock Objects<br />

The superblock object stores information concerning a mounted filesystem. For disk-based filesystems, this object usually corresponds to a filesystem control block stored on disk. A superblock object consists of a Superblock class whose fields are described in Table 1.<br />


Type | Field | Description<br />
class FilesystemType | s_type_ | Pointer to the filesystem-type<br />
uint32 | s_dev_ | Device identifier<br />
uint64 | s_magic_ | Filesystem magic number<br />
uint32 | s_flags_ | Mount flags<br />
class Dentry | s_root_ | Dentry object of the ROOT-directory from the filesystem<br />
class Dentry | mounted_over_ | Dentry object of the mounted point from the common filesystem<br />
class PointList | dirty_inodes_ | the modified (dirty) inode-list<br />
class PointList | used_inodes_ | the used inode-list<br />
class PointList | all_inode_ | the all inode-list<br />
class PointList | s_files_ | the file-descriptor-list<br />
Table 1: The fields of the superblock object<br />

The type of the filesystem is given by s_type_. s_dev_ and s_magic_ are used to identify the block device (e.g. ext2 uses a block device to store its data). s_flags_ holds the access permissions (writable/readable) of the filesystem (e.g. MS_RDONLY: only reading is permitted, and no indirect modification). s_root_ is the ROOT-directory of the filesystem; every filesystem has its own ROOT-directory, and the mount command attaches this ROOT-pointer to the mount point (mounted_over_) in the common filesystem (the tree of the VFS). The superblock also holds the lists dirty_inodes_ (the modified inodes), used_inodes_ (only used for the inodes corresponding to opened files), all_inodes_, and the file descriptor list (s_files_), which stores the pointers to the opened files.<br />

The operations are defined in the class. Let’s briefly describe the important superblock operations:<br />
createInode(dentry, type): Creates a new inode object of the given inode type in the superblock and connects it to the dentry passed as parameter.<br />
createFd(inode, flag): Creates a file with the given flag from the given inode. It returns a file descriptor number on success.<br />
removeFd(inode, fd): Removes the corresponding file descriptor from the given inode.<br />
readInode(inode): Fills the fields of the inode object whose address is passed as the parameter from the data on disk.<br />
write_inode(inode): Updates a filesystem inode with the contents of the inode object; the inode object (parameter) identifies the filesystem inode on the disk.<br />
delete_inode(inode): Removes the data blocks containing the file and the VFS inode.<br />
put_inode(inode): Releases the inode object whose address is passed as the parameter. Releasing an object does not necessarily mean freeing memory, since other processes may still use that object.<br />
The fields corresponding to unimplemented methods are set to NULL.<br />

9.3.2 Inode objects<br />

All information needed by the filesystem to handle a file is included in a data structure called an inode. [18] A filename is a casually assigned label that can be changed, but the inode is unique to the file and remains the same as long as the file exists. For disk-based filesystems, this object usually corresponds to a file control block stored on disk. An inode object consists of an Inode class whose fields are described in Table 2.<br />

Type | Field | Description<br />
class Dentry | i_dentry_ | Pointer to the dentry class<br />
class PointList | i_dentry_link_ | the dentry-list from the inode (only used for symbolic link)<br />
class PointList | i_files_ | the opened file objects from the inode<br />
uint32 | i_nlink_ | the number of the linked file of the inode<br />
class Superblock | i_superblock_ | pointer to the superblock object<br />
uint32 | i_size_ | the size of the file<br />
uint32 | i_type_ | the type of the inode<br />
uint32 | i_state_ | inode state flags<br />
Table 2: The fields of the inode object<br />

Each inode object always appears in one of the following circular doubly linked lists (PointList):<br />
• The list of valid unused inodes, typically those mirroring valid disk inodes and not currently used by any process. The list is referenced by the unused_inodes_ variable of the superblock class.<br />
• The list of in-use inodes, typically those mirroring valid disk inodes and used by some process. These inodes are not dirty. The list is referenced by the used_inodes_ variable of the superblock class.<br />
• The list of dirty inodes. The list is referenced by the dirty_inodes_ field of the corresponding superblock object.<br />

The inode object is connected to a directory entry (i_dentry_) and, if the inode is a file inode, may have many file objects: all opened files of the inode are stored in i_files_, so the same file can be opened repeatedly. The member variable i_type_ determines the type of the inode: a file inode or a directory inode. Here are the inode operations:<br />

Inode(superblock, type): Constructs a new inode; the inode must know its superblock and its type (which decides whether it is a file or a directory).<br />
[D]create(dentry): Creates a new disk inode for a regular file associated with a dentry object in some directory.<br />
lookup(name): Searches a directory for an inode corresponding to the filename included in a dentry object.<br />
[F]link(flag): Creates a file object with the given flag and links it to the inode.<br />
[F]unlink(file): Unlinks the file object from the inode and removes the file object.<br />
[D]mkdir(dentry): Creates a new inode for a directory associated with a dentry object in some directory.<br />
rm(): Removes the dentry.<br />
[D]rmdir(): Removes the directory whose name is included in the corresponding dentry.<br />
[D]symlink(dentry): Creates a new inode for a symbolic link associated with a dentry object in some directory.<br />
followLink(inode,dir): Translates a symbolic link specified by an inode object; if the symbolic link is a relative pathname, the lookup operation starts from the specified directory.<br />
[F]readData(offset, size, buffer): Reads size bytes from the data starting at position (data_ + offset) into the buffer.<br />
[F]writeData(offset, size, buffer): Writes size bytes from the buffer into the data starting at position (data_ + offset).<br />
[D]: only used for directories; [F]: only used for files<br />

9.3.3 Dentry objects<br />

The VFS considers each directory a file that contains a list of files and other directories. The directory entry is transformed by the VFS into a dentry object based on the Dentry class, whose fields are described in Table 3. The system creates a dentry object for every component of a pathname that a process looks up; the dentry object associates the component with its corresponding inode. For example, when looking up the /tmp/test pathname, the system creates a dentry object for the / root directory, a second dentry object for the tmp entry of the root directory, and a third dentry object for the test entry of the /tmp directory.

The dentry object stores information about the linking of a directory entry with the corresponding file. Each disk-based filesystem stores this information in its own particular way on disk.
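
The /tmp/test lookup described above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not SWEB's path_walker: the names DentrySketch, childLookupOrCreate and lookupPath are hypothetical, and standard library containers stand in for the kernel's own list classes.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the Dentry class (not SWEB's real one).
struct DentrySketch
{
  std::string name;
  DentrySketch* parent;
  std::vector<std::unique_ptr<DentrySketch>> children;

  DentrySketch(const std::string& n, DentrySketch* p) : name(n), parent(p) {}

  // Return the child with the given name, creating it if it does not exist yet.
  DentrySketch* childLookupOrCreate(const std::string& child_name)
  {
    for (auto& child : children)
      if (child->name == child_name)
        return child.get();
    children.push_back(
        std::unique_ptr<DentrySketch>(new DentrySketch(child_name, this)));
    return children.back().get();
  }
};

// Walk an absolute pathname component by component, starting at the root
// dentry, and return the dentry of the last component.
DentrySketch* lookupPath(DentrySketch& root, const std::string& path)
{
  DentrySketch* current = &root;
  std::size_t pos = 0;
  while (pos < path.size())
  {
    std::size_t next = path.find('/', pos);
    if (next == std::string::npos)
      next = path.size();
    if (next > pos) // skip empty components such as the leading '/'
      current = current->childLookupOrCreate(path.substr(pos, next - pos));
    pos = next + 1;
  }
  return current;
}
```

Looking up the same pathname a second time returns the dentries created by the first walk, so each component is represented by exactly one dentry.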

Type             Field        Description
class Inode      d_inode_     Pointer to the inode object
class Dentry     d_parent_    Pointer to the parent dentry
class PointList  d_child_     List of the child dentries of this dentry
class PointList  d_subdirs_   For directories, list of dentry objects of subdirectories
char             d_name_      The name of the dentry

Table 3: The fields of the dentry object

The dentry only identifies the inode it refers to and knows where its parent and its children are located; the dentry itself stores the pathname. The relevant operations are as follows:

Dentry(name): Constructs a new dentry from the given name.

Dentry(parent): Constructs a new dentry from the parent dentry.

childRemove(dentry): Removes the given dentry from the child-dentry-list.

childInsert(dentry): Inserts the given dentry into the child-dentry-list.

checkName(name): Compares the name with the names of all dentries in the child-dentry-list.

9.3.4 File objects<br />

A file object describes how a process interacts with a file it has opened. The object is created when the file is opened and is based on the File class, whose fields are described in Table 4. Notice that file objects have no corresponding image on disk, and hence no dirty field is included in the File class to specify that the file object has been modified; i.e. the information only exists in memory during the period when a process accesses a file.

Type              Field          Description
class             own            Ownership information of the file
uint32            uid            User identifier of the file
uint32            gid            Group identifier of the file
uint32            version        Version number
class Superblock  f_superblock_  Pointer to the superblock object
class Inode       f_inode_       The inode associated with the file
class Dentry      f_dentry_      Pointer to the dentry object
uint32            flag_          Flags specified when opening the file

Table 4: The fields of the file object

The main information stored by a file object is the file pointer - the current position in the file from which the next operation will take place. [18] Since several processes may access the same file concurrently, the file pointer cannot be kept in the inode object.

The "in use" file objects are collected in several lists, each assigned to a superblock. Each superblock object stores in its s_files_ field the dummy first element of a list of file descriptors, which contain the file objects; thus, file objects of files belonging to different filesystems are included in different lists.

Each filesystem includes its own set of file operations (defined in the File class) that perform such activities as reading and writing a file. When a process opens a file, the VFS initializes the new file object with the address stored by the inode so that further calls to file operations can use these functions. Let's briefly describe the file operations:

File(inode, dentry): Constructs a new file from the given inode and dentry.

read(buffer, size, offset): Reads size bytes from the file, starting at position + offset, into the buffer.

write(buffer, size, offset): Writes size bytes from the buffer into the file, starting at position + offset.

File Descriptor<br />

The file descriptor object contains an unsigned integer number and a file object. The PointList object stores object pointers, and the index of an element changes after list operations (e.g. remove). That is the reason for the existence of the File Descriptor object.
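
A short sketch may make the argument concrete. It assumes a much simplified model: std::vector plays the role of the PointList, and FileSketch, FileDescriptorSketch and OpenFileList are hypothetical names, not SWEB classes. The point is that a list index shifts when an earlier element is removed, while the number stored in the descriptor stays the same.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins; SWEB's real File and FileDescriptor classes differ.
struct FileSketch
{
  std::string name;
};

struct FileDescriptorSketch
{
  std::uint32_t fd; // stable number handed to the process
  FileSketch file;  // the file object it belongs to
};

// std::vector plays the role of the PointList here.
struct OpenFileList
{
  std::vector<FileDescriptorSketch> entries;
  std::uint32_t next_fd = 0;

  std::uint32_t add(const std::string& name)
  {
    entries.push_back(FileDescriptorSketch{next_fd, FileSketch{name}});
    return next_fd++;
  }

  void remove(std::uint32_t fd)
  {
    for (std::size_t i = 0; i < entries.size(); ++i)
      if (entries[i].fd == fd)
      {
        entries.erase(entries.begin() + i); // later entries move one index down
        return;
      }
  }

  // Search by descriptor number, never by list index.
  const FileSketch* find(std::uint32_t fd) const
  {
    for (const auto& entry : entries)
      if (entry.fd == fd)
        return &entry.file;
    return nullptr;
  }
};
```

After the first file is removed, the second one sits at index 0, but it can still be found under the descriptor number that was handed out when it was opened.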

9.3.5 derivation-hierarchy<br />

The figure illustrates the derivation diagram: a concrete filesystem is derived from the four basic classes, and the VirtualFileType determines the type of the filesystem (e.g. ramfs, ext2, ...). The derived filesystem overrides the four basic classes and creates its own classes for the concrete type.


9.3.6 interaction between processes and vfs objects

The figure illustrates with a simple example how processes interact with files. Three different processes have opened the same file. Each of the three processes uses its own file object (the file descriptor), and the system (path_walker) finds the correct dentry in the ROOT-tree. The three dentries point to the same inode object, which identifies the superblock object and, together with the latter, the common disk file.


9.4 RamFS - memory filesystem<br />

The RamFS is allocated dynamically (its size varies as needed, using memory) and is a real filesystem, not a fake device on which to put a filesystem. The filesystem uses only the space needed to store the files and directories on it. The data is located in memory, so all data is removed when the program exits.

9.4.1 simplifications of the RamFS

• The class RamFsSuperblock uses the member list all_inodes_ to store the used and modified inodes of the filesystem. Because the filesystem is memory-based, there is no difference between used and modified inodes, and s_dev_ and s_magic_ are unused. The createInode method creates a RamFsInode and connects it to the dentry. s_files_ stores all file descriptors, and the corresponding inodes of the superblock's open files are located in used_inodes_.

• The class RamFsInode holds references to its Dentry (i_dentry_, the name and the position in the root-tree) and its RamFsSuperblock (i_superblock_, the controller of the inode). The same file can be opened repeatedly with the READONLY-flag; the inode stores the open files in the PointList i_files_. The ramfs-inode object uses the member variable data_ to store the content of the file, and the operations writeData and readData work on it. The create, mknod and mkdir operations are equivalent to each other. symlink, readlink and followLink are not implemented.

• The Dentry only knows the position of its inode and stores the name (pathname) information through the tree structure.

• The class RamFsFile accesses the inode from the outer layer. The ownership information (uid, gid, version) is not implemented in this version. The vfs_syscall open and close set up the file descriptor (file object) in the s_files_ list of the RamFsSuperblock. write and read transfer data between the buffer and the data_ of the RamFsInode.
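
How readData and writeData might operate on data_ is sketched below. This is a guess at the general shape only, under stated assumptions: RamInodeSketch is a hypothetical class, std::vector is used as a growable buffer, and the real RamFsInode may manage its memory quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch of a memory-backed inode; not the real RamFsInode.
class RamInodeSketch
{
public:
  // Copy up to size bytes starting at data_ + offset into buffer.
  // Returns the number of bytes actually copied.
  std::size_t readData(std::size_t offset, std::size_t size, char* buffer) const
  {
    if (size == 0 || offset >= data_.size())
      return 0;
    std::size_t count = data_.size() - offset;
    if (count > size)
      count = size;
    std::memcpy(buffer, data_.data() + offset, count);
    return count;
  }

  // Copy size bytes from buffer to data_ + offset, growing data_ if needed.
  std::size_t writeData(std::size_t offset, std::size_t size, const char* buffer)
  {
    if (size == 0)
      return 0;
    if (offset + size > data_.size())
      data_.resize(offset + size, 0);
    std::memcpy(data_.data() + offset, buffer, size);
    return size;
  }

  std::size_t size() const { return data_.size(); }

private:
  std::vector<char> data_; // file content lives only in memory
};
```

Because the content lives in an ordinary in-memory buffer, there is nothing to write back, which matches the remark that used and modified inodes need not be distinguished.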

9.4.2 file vs. directory<br />

The difference between a file and a directory lies in the member variable data_ of the inode object. A file can be read from and written to, and a process manipulates it through a file descriptor (the integer number identifies the position of the file in the opened-file-list of the superblock object). A directory does not use data_ (data_ = 0) and is only an inode object. All operations on a directory act on the inode object.

A file can be associated with many file objects (link/unlink-operations). These file objects are stored in the PointList of the inode if they are connected to an inode of type I_FILE; a directory has no file objects.


directory - inode & dentry<br />

The mkdir operation creates a new directory. The workflow in the VFS is as follows:

1. The vfs_syscall uses the path_walker to find the corresponding inode.

2. Create a new dentry in the correct directory.

3. The vfs_syscall turns control over to the (RamFs-)Superblock by calling the method createInode of the superblock object.

4. The RamFsSuperblock creates a new inode with type I_DIR.

5. Connect the inode to the dentry (mkdir).

6. createInode returns the created inode object.

file - inode & dentry<br />

The open operation first checks whether the inode object exists; if the inode does not exist, it creates one. Then it creates a new file, connects it to the inode object and returns a file-descriptor number. The workflow is as follows:

1. The vfs_syscall uses the path_walker to find the corresponding inode.

2. Create a new dentry in the correct directory.

3. The vfs_syscall turns control over to the (RamFs-)Superblock by calling the method createInode of the superblock object.

4. The RamFsSuperblock creates a new inode with type I_FILE.

5. Connect the inode to the dentry (mkfile).

6. createInode returns the created inode object.

7. The vfs_syscall calls the method createFd of the superblock object.

8. Create a file descriptor (file object) and insert it into the s_files_ list.

9. Connect the inode to the file (link).

10. createFd returns the file descriptor.

Steps 2-6 occur only if the file does not exist before the vfs_syscall open.
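
The ten steps can be condensed into a small model. The sketch below is an assumption-laden illustration, not the SWEB implementation: InodeSketch, SuperblockSketch and openSketch are hypothetical names, a std::map keyed by pathname replaces the dentry tree and the path_walker, and file objects are reduced to their descriptor numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

enum InodeTypeSketch { I_FILE_SKETCH, I_DIR_SKETCH };

// Hypothetical stand-in for an inode.
struct InodeSketch
{
  InodeTypeSketch type;
  std::vector<std::uint32_t> open_fds; // plays the role of i_files_
};

// Hypothetical stand-in for the superblock.
struct SuperblockSketch
{
  std::map<std::string, InodeSketch> inodes; // pathname -> inode (replaces the dentry tree)
  std::vector<std::uint32_t> files;          // plays the role of s_files_
  std::uint32_t next_fd = 0;

  // Steps 3-6: create an inode of the given type and attach it to the path.
  InodeSketch* createInode(const std::string& path, InodeTypeSketch type)
  {
    InodeSketch& inode = inodes[path];
    inode.type = type;
    return &inode;
  }

  // Steps 7-10: create a descriptor, store it and link it to the inode.
  std::uint32_t createFd(InodeSketch* inode)
  {
    std::uint32_t fd = next_fd++;
    files.push_back(fd);
    inode->open_fds.push_back(fd);
    return fd;
  }
};

std::uint32_t openSketch(SuperblockSketch& sb, const std::string& path)
{
  auto found = sb.inodes.find(path); // step 1: look up the inode
  InodeSketch* inode = (found != sb.inodes.end())
                           ? &found->second
                           : sb.createInode(path, I_FILE_SKETCH); // steps 2-6
  return sb.createFd(inode); // steps 7-10
}
```

Opening the same path twice creates the inode only once but yields two distinct descriptors, both linked to that inode.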

9.5 extension

• symbolic links: allow directory-to-directory links.

• static Ramdisk: use a Ramdisk of fixed size to store the data.

• dynamic Ramdisk: use a Ramdisk of growing size to store the data.

• ext2 FS: use a block device to store the data.

• pipefs, BFS, ...: [4]



Chapter 10<br />

Locking<br />

10.1 Introduction<br />

Locking is used in the sweb-kernel for handling critical sections. A critical section is a piece of code which operates with resources (data, files, databases, printers, etc.) shared between two or more threads or processes which must not be concurrently accessed by more than one thread. The mechanism of assuring that only one thread executes a particular critical section at a particular time is called mutual exclusion. For example, look at the following code:

i = i + 1;

Imagine that i is a global variable (or shared in another way) and there are two threads which are executing this instruction. Each thread can be interrupted at any time. Each thread may execute the following steps:

1 reg = i;
2 reg++;
3 i = reg;

Each of the threads increments the value, so the value should have been incremented twice after execution. Now imagine that Thread A is interrupted between line 1 and 2 of execution, and Thread B is scheduled next. After Thread B has done its work, Thread A is rescheduled. As you see, the value of i increases just by 1, because each thread "sees" the same initial value of i. Such an event is called a race condition.
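
The lost update can be reproduced without real threads by choosing the interleaving by hand. The following sketch does exactly that; ThreadSketch, runInterleaved and runSequential are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>

// Simulates the three-step increment for two threads on one shared variable.
// The interleaving is chosen by hand, so the outcome is deterministic.
struct ThreadSketch
{
  int reg = 0;
  void load(const int& i) { reg = i; }  // line 1: reg = i;
  void increment() { ++reg; }           // line 2: reg++;
  void store(int& i) const { i = reg; } // line 3: i = reg;
};

// Thread A is interrupted after line 1, thread B runs completely,
// then thread A resumes with its stale register value.
int runInterleaved(int i)
{
  ThreadSketch a, b;
  a.load(i);
  b.load(i);
  b.increment();
  b.store(i);
  a.increment();
  a.store(i);
  return i;
}

// Each thread runs its three steps without interruption.
int runSequential(int i)
{
  ThreadSketch a, b;
  a.load(i);
  a.increment();
  a.store(i);
  b.load(i);
  b.increment();
  b.store(i);
  return i;
}
```

Starting from 0, the interleaved run ends with 1 and the sequential run with 2, which is the difference the paragraph above describes.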

For more information and examples see chapter 2.3 in [17] and [5]. You really should be aware of this issue (not only for the operating-systems-course), because race conditions are often the reason for strange program behavior and may not be discovered for a long time.

You should be aware of the possibility of disabling interrupts to achieve mutual exclusion, but the better way is using locks. The scheduler itself disables scheduling while working with the thread-list, which is a similar procedure, but this should be an exception since it can't protect the list otherwise.


10.2 Locks<br />

Locks in Sweb are objects which provide the methods acquire() and release(). These methods should be called before and after a critical section and assure mutual exclusion.

10.2.1 SpinLock<br />

The SpinLock provides a kind of "busy-waiting". A thread that does not get instant access to the critical section yields in a loop until it does get access. This is much better than wasting CPU time by only polling the lock-variable. So, how does it work?

The class SpinLock has a member variable named nosleep_mutex_ which has value 0 if the lock is free and 1 if the lock is acquired. Acquiring the lock is a bit tricky, because it leads to a race condition if you read the lock-variable and set it to 1 if it is 0. If you are interrupted between observing the value and setting it, another thread may read the same value (e.g. 0, unlocked) and then both threads enter the critical region simultaneously, which is not the right behavior. See the implementation:

SpinLock::SpinLock()
{
  nosleep_mutex_ = 0;
}

void SpinLock::acquire()
{
  // testSetLock atomically sets the lock-variable to 1 and returns its old value
  while (ArchThreads::testSetLock(nosleep_mutex_, 1))
  {
    if (unlikely(ArchInterrupts::testIFSet() == false))
      kpanict((uint8*) "SpinLock::acquire: with IF=0 ! Now we're dead !!!\n"
                       "Maybe you used new/delete in irq/int-Handler context ?\n");
    // SpinLock: Simplest of Locks, do the next best thing to busy waiting
    Scheduler::instance()->yield();
  }
}

void SpinLock::release()
{
  nosleep_mutex_ = 0;
}

This race condition is avoided by a special instruction named TSL (TestSetLock; the assembly instruction used in sweb is XCHG on Intel x86 architectures). It is a single, atomic CPU instruction that sets the lock-variable and stores its old value in a register. The method ArchThreads::testSetLock returns the old value. If it is 1, the lock-variable was already set, so we haven't changed it, and we know that the lock is being held by another thread. If it is 0, we got the lock and need not yield any more. After a return from acquire() a thread can execute the critical region related to the lock. The release()-method simply sets nosleep_mutex_ to 0.

The mutex-implementation in chapter 2.3.6 of [17] is very similar to the SpinLock in sweb.
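
The test-and-set idea can be tried out in user space with std::atomic, whose exchange() atomically stores a new value and returns the old one. The sketch below is an analogue under that assumption, not the sweb code: testSetLockSketch and SpinLockSketch are hypothetical names, and tryAcquire() makes a single attempt instead of yielding in a loop.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical user-space analogue of ArchThreads::testSetLock:
// atomically store new_value and return the value that was there before.
inline int testSetLockSketch(std::atomic<int>& lock, int new_value)
{
  return lock.exchange(new_value);
}

class SpinLockSketch
{
public:
  // Returns true if the lock was free (old value 0) and is now held by the caller.
  bool tryAcquire() { return testSetLockSketch(lock_, 1) == 0; }

  void release() { lock_.store(0); }

private:
  std::atomic<int> lock_{0}; // 0 = free, 1 = acquired
};
```

A second tryAcquire() on a held lock writes 1 over 1 and returns false, which corresponds to the case "old value is 1" discussed above.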


10.2.2 Mutex<br />

The Mutex-class improves on the busy-waiting-method by putting threads to sleep while they are waiting for the lock. If a thread doesn't get access to the critical section it is put to sleep. It is also put on a sleepers-list to remember all threads currently sleeping on the Mutex-object. If another thread leaves the critical section by calling release(), a thread on the sleepers-list will be selected and set to running-state. Then the woken thread tries to enter the critical section again. For a correct implementation, a SpinLock is used to protect the sleepers-list, and the Scheduler::sleepAndRelease()-method is used for atomically going to sleep and releasing the spinlock. A detailed explanation of possible race conditions is given in the file Mutex.h.

Implementation<br />

Mutex::Mutex()
{
  mutex_ = 0;
}

void Mutex::acquire()
{
  spinlock_.acquire();
  while (ArchThreads::testSetLock(mutex_, 1))
  {
    // lock is taken: remember this thread and go to sleep
    sleepers_.pushBack(currentThread);
    Scheduler::instance()->sleepAndRelease(spinlock_);
    spinlock_.acquire();
  }
  spinlock_.release();
  held_by_ = currentThread;
}

void Mutex::release()
{
  spinlock_.acquire();
  mutex_ = 0;
  held_by_ = 0;
  if (!sleepers_.empty())
  {
    // wake the thread that has been waiting longest
    Thread *thread = sleepers_.front();
    sleepers_.popFront();
    Scheduler::instance()->wake(thread);
  }
  spinlock_.release();
}

Because this mechanism uses a list, it involves the KernelMemoryManager. Therefore the KMM itself cannot use a Mutex as locking mechanism (it uses a SpinLock instead). The Mutex is the standard locking mechanism in sweb, so whenever you need a lock, use it.


10.3 Using Locks<br />

Using locks is easy: before the critical section you have to call acquire() and after it release(). In sweb you can also use the class MutexLock. Its constructor takes a Mutex-object and calls acquire(). The destructor calls release(). This can be useful if you have to leave a function in several places (maybe after catching errors) and would otherwise have to call release() before each return statement. You can create a stack-allocated MutexLock-object instead, which will automatically release the mutex when the method returns. See the example in Loader::readHeaders().
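
The constructor/destructor pairing can be illustrated with a small stand-alone sketch. It is written under assumptions: MutexSketch is a hypothetical mutex that only tracks its state and never blocks, and MutexLockSketch and parseHeaderSketch are likewise invented for this example; only the pattern corresponds to sweb's MutexLock.

```cpp
#include <cassert>

// Hypothetical minimal mutex that only tracks its state; it does not block.
struct MutexSketch
{
  bool held = false;
  void acquire() { assert(!held); held = true; }
  void release() { assert(held); held = false; }
};

// Scope guard in the spirit of MutexLock: acquire in the constructor,
// release in the destructor.
class MutexLockSketch
{
public:
  explicit MutexLockSketch(MutexSketch& mutex) : mutex_(mutex) { mutex_.acquire(); }
  ~MutexLockSketch() { mutex_.release(); }
  MutexLockSketch(const MutexLockSketch&) = delete;
  MutexLockSketch& operator=(const MutexLockSketch&) = delete;

private:
  MutexSketch& mutex_;
};

// Every return path releases the mutex without an explicit release() call.
int parseHeaderSketch(MutexSketch& mutex, int value)
{
  MutexLockSketch guard(mutex);
  if (value < 0)
    return -1; // early return on error, guard still releases the mutex
  return value * 2;
}
```

Whichever return statement is taken, the guard's destructor runs when the function is left, so the mutex is free again afterwards.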

Using locks can easily lead to deadlocks if you make a mistake. It’s also not that easy<br />

to recognize the code snippets which are race-condition-prone. We will discuss critical<br />

sections in the sweb-kernel later.<br />

A few hints:<br />

• don't acquire a lock twice, this will lead to a deadlock. Remember this when you are calling methods inside a critical region.

• don't use Scheduler::wake() and ::sleep() on a thread which uses mutexes (if you wake a thread sleeping on a mutex, the thread will be put on the sleepers-list again). Use condition-variables instead if you want to block a running thread.

• all synchronization primitives in sweb (SpinLock, Mutex, CV) shouldn't be used if interrupts are disabled. The reason is that no task switch is possible (also no yielding!) and nobody can release the lock if it is already acquired. A thread that doesn't get the lock, i.e. doesn't get access to the critical section, will be deadlocked.

• You should avoid nesting locks, but if you do it, you have to follow a simple rule: don't acquire locks in different orders, because otherwise your code can produce deadlocks. The following example demonstrates why:

Thread A executes:

1 lock_a.acquire();
2 lock_b.acquire();
3 ...
4 lock_b.release();
5 lock_a.release();

Thread B executes:

1 lock_b.acquire();
2 lock_a.acquire();
3 ...
4 lock_a.release();
5 lock_b.release();

Now imagine thread A has already acquired lock_a, but not lock_b (interrupted between line 1 and 2). Thread B acquires lock_b and then tries to acquire lock_a. Obviously thread B will wait forever for lock_a and thread A will also wait forever for lock_b. We have created a deadlock, similar to the situation in the following picture:


10.4 Critical Sections in the Sweb-Kernel<br />

10.4.1 KMM<br />

The allocateMemory(), reallocateMemory() and freeMemory() methods each use the spinlock-member lock_ to protect the list of memory-segments. The KMM does not always use the lock, because it is also involved in the startup-routine before interrupts are enabled, where using the lock is not necessary.

10.4.2 PM<br />

The PageManager uses a Mutex-object in the methods getFreePhysicalPage() and freePage() to protect the map of physical pages.

10.4.3 Loader<br />

The Loader uses the mutex file_lock_ to avoid race conditions in setting the file-offset and reading the file in the methods readHeaders() and loadOnePageSafeButSlow(). This is necessary only if you implement multithreaded userprocesses, because then the threads share the same Loader and therefore the same file-descriptor.

10.4.4 CV<br />

The condition-variable is another synchronization primitive provided in sweb. It adds the ability to block threads (i.e. put them to sleep) if they cannot continue execution and to unblock them when a condition is met. To block a thread, the method wait() is provided. The calling thread is put to sleep and stored on a sleepers-list. When some condition is met, someone calls signal(), which wakes one thread, or broadcast(), which wakes all threads. Condition-variables can only be used in a critical section protected with a Mutex-object. For examples of use see chapter 2.3.7 of [17]. Just imagine that the monitor-procedures are surrounded by a mutex. See also uses in sweb: common/kernel/MountMinix.h;.cpp and common/ipc/FiFo.h;.cpp.

Note that the CVs introduced in Tanenbaum chapter 2.3.7 [17] are used in a monitor, and only one thread or process can be active in the monitor at a particular time. A monitor can be simulated with a mutex-locked code section, as it is done in sweb and, for example, in the posix-multithreading-library. You should also note that in sweb we don't let the signaled thread run immediately (Mesa-Semantics). So the correct use of CV-wait() is generally while(foo) condition.wait(); instead of if(foo) condition.wait();.
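
The while-loop rule applies equally to the C++ standard library, which is used here as a stand-in for sweb's Condition and Mutex classes. The sketch is therefore an analogue, not sweb code: QueueSketch is a hypothetical class, std::condition_variable::notify_one() plays the role of signal(), and wait() releases the mutex while sleeping and re-acquires it before returning.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

// Hypothetical queue guarded by a mutex and a condition variable.
class QueueSketch
{
public:
  void put(int value)
  {
    std::unique_lock<std::mutex> guard(lock_);
    items_.push(value);
    not_empty_.notify_one(); // comparable to Condition::signal()
  }

  int get()
  {
    std::unique_lock<std::mutex> guard(lock_);
    // Re-check the condition after every wake-up: another thread may have
    // taken the item between the signal and the moment this thread runs.
    while (items_.empty())
      not_empty_.wait(guard);
    int value = items_.front();
    items_.pop();
    return value;
  }

private:
  std::mutex lock_;
  std::condition_variable not_empty_;
  std::queue<int> items_;
};
```

With an if instead of the while, a woken consumer could call front() on a queue that a faster consumer has already emptied.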

Implementation<br />

Condition::Condition(Mutex *lock)
{
  sleepers_ = new List<Thread*>();
  lock_ = lock;
}

Condition::~Condition()
{
  delete sleepers_;
}

void Condition::wait()
{
  // list is protected, because we assume the lock is being held
  assert(lock_->isHeldBy(currentThread));
  assert(ArchInterrupts::testIFSet());
  sleepers_->pushBack(currentThread);
  Scheduler::instance()->sleepAndRelease(*lock_);
  lock_->acquire();
}

void Condition::signal()
{
  if (!lock_->isHeldBy(currentThread))
    return;
  assert(ArchInterrupts::testIFSet());
  Thread *thread = 0;
  if (!sleepers_->empty())
  {
    thread = sleepers_->front();
    if (thread->state_ == Sleeping)
    {
      Scheduler::instance()->wake(thread);
      sleepers_->popFront();
    }
  }
}

void Condition::broadcast()
{
  if (!lock_->isHeldBy(currentThread))
    return;
  assert(ArchInterrupts::testIFSet());
  Thread *thread;
  List<Thread*> tmp_threads;
  while (!sleepers_->empty())
  {
    thread = sleepers_->front();
    sleepers_->popFront();
    if (thread->state_ == Sleeping)
      Scheduler::instance()->wake(thread);
    else
      tmp_threads.pushBack(thread); // not asleep yet, keep it on the list
  }
  while (!tmp_threads.empty())
  {
    sleepers_->pushBack(tmp_threads.front());
    tmp_threads.popFront();
  }
}

The sleepers-list need not be protected separately, because when calling wait() or signal() the thread already has to be in a critical section protected with the same lock as provided in the Condition-constructor. In wait() we also have to release the lock (atomically with going to sleep). This is done so that some other thread can enter the critical region: it is the only way anything can happen that satisfies the condition and leads to signal() being called. The last instruction in wait() is of course acquiring the lock again, because the thread was in the critical region before calling wait() and should also be in there when leaving wait(). The signal()-method simply wakes the next thread on the list; broadcast() wakes all threads on the list.

10.4.5 Console and Terminal<br />

The Terminal uses a lock to protect the characters and states on the terminal and some other members. A Console uses two locks, one for drawing (protecting writes to the display, named console_lock_) and one for protecting the active terminal (set_active_lock_). The locks are nested in various ways to assure mutual exclusion. In detail:

• Terminal-methods generally lock either the terminal itself only (if setting some members) or the terminal and the console (when writing characters to terminal and console).

• The console itself uses the set_active_lock_ to protect access to the active terminal (Console::getActiveTerminal(), Console::setActiveTerminal()). Because the terminal itself also needs to know if it is the active one or not, it provides the methods Terminal::setAsActiveTerminal() and Terminal::unSetAsActiveTerminal(). Those are protected by the terminal-lock, of course. When the active terminal is set, the console-lock is also used in order to update the console-contents with the contents of the terminal.

Therefore nested locks are acquired in the following orders:

• Several terminal-methods:

  terminal_lock.acquire();
  console_lock.acquire();
  console_lock.release();
  terminal_lock.release();

• Console::setActiveTerminal()

  set_active_lock.acquire();

  terminal_a_lock.acquire();
  terminal_a_lock.release();

  terminal_b_lock.acquire();
  console_lock.acquire();
  console_lock.release();
  terminal_b_lock.release();

  set_active_lock.release();

Deadlocks can occur if threads are already holding locks and requesting others which they possibly don't get. So, let's look at possible events:

• A thread holds terminal_x_lock and waits for console_lock. The console_lock is always the last lock to be acquired, so it will be released sooner or later.

• A thread holds set_active_lock and requests terminal_x_lock. This is also no problem, because a terminal_lock will be released at the latest when its holder gets the console_lock and can do its work, which is independent of set_active_lock.
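
A fixed acquisition order, as used for the terminal and console locks, can also be enforced mechanically. The sketch below shows one common way of doing so, ordering two locks by their address; this is an illustration only and not something sweb is known to do. LockSketch is a hypothetical lock that merely records the order of acquire() calls, and acquireBoth is an invented helper.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical lock that records the order in which locks are acquired.
struct LockSketch
{
  int id;
  std::vector<int>* log;
  void acquire() { log->push_back(id); }
  void release() {}
};

// Acquire two locks in one global order (here: by address), no matter
// in which order the caller names them.
void acquireBoth(LockSketch& first, LockSketch& second)
{
  if (std::less<LockSketch*>()(&second, &first))
  {
    second.acquire();
    first.acquire();
  }
  else
  {
    first.acquire();
    second.acquire();
  }
}
```

Two threads calling acquireBoth(a, b) and acquireBoth(b, a) then request the locks in the same order, so the circular wait from the Thread A / Thread B example in section 10.3 cannot arise between them.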

10.4.6 Scheduler<br />

The Scheduler disables scheduling in order to protect the thread-list (then it obviously cannot get interrupted by other threads, which assures mutual exclusion). The reason for this is that protecting it with a mutex or spinlock is not possible, because the lock would also have to be used in the Scheduler::schedule() method. But this method is called from interrupt context with interrupts disabled, when yielding is not possible (apart from the fact that after a hypothetical yield we would end up in Scheduler::schedule() again ;-)). So the method simply searches for a new thread to run only if scheduling is enabled; otherwise the list is probably being modified and currentThread has to continue with execution (which hopefully enables scheduling again).

Before modifying the thread-list, the private Scheduler-method lockScheduling() should be called, which actually blocks the Scheduler. When adding or removing threads (those list-methods use new/delete) you also have to be sure that the KernelMemoryManager-lock is free. Otherwise you will be deadlocked if you use new/delete, because scheduling is disabled (yielding is not possible in this case). Of course, this condition can't be checked before you call lockScheduling(). Imagine you're interrupted between SpinLock::isFree() and Scheduler::lockScheduling() - you cannot be sure that the SpinLock is still free. SpinLock::isFree() (and Mutex::isFree()) should only be called if scheduling or interrupts are disabled, because only in these cases is the result trustworthy and the call makes sense at all.

So we have to check the KMM lock after locking scheduling. If it is not free,<br />
we have to unlock scheduling, yield, lock scheduling again and re-check the KMM lock,<br />
repeating this until the KMM lock is free while scheduling is disabled. This is done by the<br />
private Scheduler method waitForFreeKMMLock(). A few code snippets from the Scheduler:<br />

void Scheduler::addNewThread(Thread *thread)<br />
{<br />
  lockScheduling();<br />
  debug(SCHEDULER, "addNewThread: %x %d:%s\n", thread, thread->getPID(), thread->getName());<br />
  waitForFreeKMMLock();<br />
  threads_.pushFront(thread);<br />
  unlockScheduling();<br />
}<br />
<br />
uint32 Scheduler::schedule()<br />
{<br />
  if (testLock() || block_scheduling_extern_ > 0)<br />
  {<br />
    debug(SCHEDULER, "schedule: currently blocked\n");<br />
    return 0;<br />
  }<br />
<br />
  // [select a thread ready to run, update currentThread and currentThreadInfo]<br />
}<br />
<br />
void Scheduler::lockScheduling() // not as severe as stopping interrupts<br />
{<br />
  if (unlikely(ArchThreads::testSetLock(block_scheduling_, 1)))<br />
    arch_panic((uint8*) "FATAL ERROR: Scheduler::*: block_scheduling_ was set!! "<br />
                        "How the Hell did the program flow get here then?\n");<br />
}<br />
<br />
void Scheduler::unlockScheduling()<br />
{<br />
  block_scheduling_ = 0;<br />
}<br />
<br />
bool Scheduler::testLock()<br />
{<br />
  return (block_scheduling_ > 0);<br />
}<br />
<br />
void Scheduler::waitForFreeKMMLock()<br />
{<br />
  if (block_scheduling_ == 0)<br />
    arch_panic((uint8*) "FATAL ERROR: Scheduler::waitForFreeKMMLock: "<br />
                        "This is meant to be used while Scheduler is locked\n");<br />
<br />
  while (!KernelMemoryManager::instance()->isKMMLockFree())<br />
  {<br />
    unlockScheduling();<br />
    yield();<br />
    lockScheduling();<br />
  }<br />
}<br />


Chapter 11<br />

Xen<br />

11.1 Introduction<br />

The following chapter describes the experimental port of SWEB to Xen version 3.x.<br />
The intention was to create a fully functional SWEB version as a Xen guest domain.<br />
See the last section for what is missing. The gaps are partly due to the fact that the<br />
Xen implementation started late, when a great portion of SWEB was already done.<br />
Major changes to the structure were no longer possible due to time constraints,<br />
as the code was soon to be used for exercises in an operating systems lecture<br />
course. Therefore integration was harder and the architecture separation is<br />
somewhat weakened. The first attempts were made under Xen 2.0.6. Xen 3.0 was released<br />
in December 2005. As it incorporated many bug fixes, the 2.0.x code was largely<br />
abandoned and the implementation was redone for Xen 3.0.<br />

11.2 Architecture separation<br />

SWEB was started as a student project and was intended to be used for exercises<br />
accompanying an operating systems lecture where NACHOS [6] was used. One of the<br />
problems of NACHOS is that it contains a lot of ifdef macros spread all over the code.<br />
This obscures the code for students, so the intention in SWEB was to have as few<br />
ifdefs as possible. The main decision to reach this goal was to create a common interface<br />
for the architecture specific code. It can be found in arch/common/include.<br />
These classes only contain static methods and do not rely on any further includes.<br />
Therefore it is possible to write an architecture specific implementation without any<br />
modifications to the common code. This is achieved by the symlink arch/arch/, which<br />
points to arch/Xen or arch/x86 depending on which build is done.<br />
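The idea can be sketched as follows. This is not taken from the SWEB sources: the class<br />
ArchExample, its methods and the helper bytesForPages are hypothetical and only show the<br />
shape of the scheme. The class declaration plays the role of a header in<br />
arch/common/include, while the method bodies play the role of one architecture tree.<br />

```cpp
#include <cstdint>

// --- Role of a common interface header (hypothetical ArchExample.h) ---
// Only static methods, no further includes beyond basic types.
class ArchExample
{
public:
  static std::uint32_t pageSize();
  static const char* name();
};

// --- Role of one architecture tree (e.g. the x86 implementation) ---
// A Xen tree would provide its own definitions of the same methods; the
// build selects exactly one tree, so no ifdef is needed here.
std::uint32_t ArchExample::pageSize() { return 4096; }
const char* ArchExample::name() { return "x86"; }

// --- Role of common code: compiles unchanged against either tree ---
std::uint32_t bytesForPages(std::uint32_t pages)
{
  return pages * ArchExample::pageSize();
}
```

The common code only ever sees the declaration, so switching the architecture is a<br />
matter of which implementation file the build links in.<br />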

After a short review of the Xen documentation it was clear that some areas would have<br />
to be implemented differently. The following architecture specific headers existed from<br />
the beginning:<br />

• ArchCommon.h<br />

• ArchInterrupts.h<br />

• ArchMemory.h<br />

• ArchThreads.h<br />

The following architecture specific headers were introduced quite late in the development.<br />
Unfortunately these require additional includes and therefore do not strictly follow the<br />
idea of a clean architecture separation.<br />

• arch_keyboard_manager.h<br />

• arch_bd_manager.h<br />

Note that one also finds XenConsole.h there. This file is specific to SWEB<br />
Xen but was introduced later. The clean way would have been to remove<br />
common/include/console.h and introduce arch/common/include/ArchConsole.h<br />
to replace it. This did not happen because the source was already in use for<br />
an exercise. No major modifications were to be made any more, so that patches which<br />
might become necessary could still be created and tested in the testing tree. When the<br />
code is refactored, this should be changed to follow the architecture separation<br />
explained above.<br />

11.2.1 Building environment<br />

To compile SWEB Xen one has to call make ARCH=Xen. Before doing so, sweb-testing-bin<br />
should be removed completely to ensure that all dependencies are updated and everything<br />
is recompiled. The results are sweb_Xen.elf, sweb_Xen.gz and sweb_Xen.x. The image<br />
used to start a Xen guest domain is sweb_Xen.gz. See the installation<br />
guide ?? for instructions on how to boot this image. Also note that a link pointing to<br />
the Xen source tree is created automatically under arch. This way the Xen specific<br />
implementation of the common architecture interface headers is compiled.<br />

There is one pitfall for anyone who intends to change the Makefile. When calling<br />
the linker, head.o must be the first object file in the argument list. To achieve this it<br />
is removed from $(OBJECTDIR)/sauhaufen and given explicitly as an argument to the<br />
linker. If you change this, the link process will fail. Therefore, if head.S is renamed,<br />
the Makefile must be adapted too.<br />

11.3 Xen<br />

11.3.1 Overview<br />

Xen is a solution for running multiple operating systems on one host. Xen itself is called a<br />
hypervisor:<br />

A hypervisor in computing is a scheme which allows multiple operating systems<br />
to run, unmodified, on a host computer at the same time. The term is<br />
an extension of the earlier term supervisor, which was commonly applied to<br />
operating system kernels in that era. [7]<br />

Xen differs from this definition in one aspect: the operating systems must be modified<br />
to run under Xen. With additional processor support it is possible to run unmodified<br />
guests. Code for the new Vanderpool [8] architecture by Intel currently exists. AMD will<br />
provide a similar extension called Pacifica [9].<br />

To achieve the goal of multiple concurrently running operating systems, the<br />
paravirtualisation approach is used. This means that, unlike a full PC simulator, Xen only<br />
virtualises the core parts. These are basically CPU and memory. Additionally,<br />
network interface cards, console and block devices are supported. Xen is therefore<br />
especially suitable for network server virtualisation.<br />

The Xen hypervisor itself deals only with CPU and memory. The rest is provided by a<br />
privileged operating system called domain 0 (dom0 for short). Xen is not usable without a<br />
dom0. Dom0 contains all the drivers for hardware access and provides services<br />
such as virtual network interfaces to the other OS instances. These are called guest domains.<br />

11.3.2 Architecture<br />

The Xen hypervisor is mainly responsible for ensuring strict partitioning between the<br />
guest domains. It also carries out privileged work on behalf of the domains. All<br />
hardware handling is done by driver domains, the most important of which is dom0. Driver<br />
domains provide an interface that lets other domains share a single resource such as<br />
the network card. This is also called a back end/front end architecture: dom0<br />
runs the back end with access to the real hardware, and the guest domain only<br />
implements the front end, which is a driver that uses the exposed interface.<br />

Xen also allows running guest domains to be transferred to other machines over the network<br />
or suspended to disk. This is almost transparent to the guest domain as long as<br />
the new machine provides the same services.<br />

11.3.3 Memory Layout<br />

Xen reserves part of the virtual address space for itself. Currently this is the top<br />
64 MB on the x86 32-bit architecture. The start is defined in the Xen public headers<br />
(arch-x86_32.h) as HYPERVISOR_VIRT_START. As explained above, Xen runs in<br />
ring 0 and guest domains run in ring 1. As paging alone would not protect the Xen<br />
memory from the guest OS, segmentation is used. While the guest is running, the top<br />
memory used by Xen lies outside the configured segment limit. This can be thought of as<br />
a memory hole. Software not aware of this fact may use the grow-down feature of<br />
segments. This can lead to a performance decrease, as Xen has to catch these accesses.<br />
The /lib/tls libraries under Linux in particular exhibit the worst-case behavior.<br />
Therefore Xen Linux gives a warning during the boot process if these libraries are<br />
not disabled. Read http://wiki.xensource.com/xenwiki/XenSegments for more<br />
details. It is also possible to install adapted versions of the libraries, see<br />
http://wiki.xensource.com/xenwiki/XenSpecificGlibc.<br />

Figure 11.1 shows the general memory layout under Xen as constructed by the Linux<br />
domain builder. The builder was originally developed for Linux, as it was the first<br />
operating system supported, but it is suitable for other systems too. Other domain<br />
builders can create slightly different layouts. The Xen team intends to keep the number<br />
of domain builders small [10]. It is preferred to adapt the guest domains as necessary.<br />
Care must be taken, as the layout might change with major updates.<br />

Figure 11.1: Xen memory layout<br />

Pseudo Physical Memory vs. Machine Memory<br />

Xen provides an abstraction of the machine memory.<br />

• machine memory: The real physical memory built into the machine. Its frames<br />
are identified by machine frame numbers (MFN).<br />

• pseudo physical memory: The abstraction layer for the guest operating system.<br />
Its frames are identified by page frame numbers (PFN).<br />

The PFNs form a contiguous memory region whose size is the amount of memory specified<br />
for the guest domain. The PFNs are specific to one domain: each domain has its own PFN<br />
numbering, starting at frame number 0.<br />

The MFNs cover the whole real machine memory consecutively, starting from address<br />
0x00000000. An MFN corresponds to the physical address of the machine page frame.<br />
The frame and page size is 4 KiB. An MFN refers to the same memory location in every<br />
domain.<br />

Figure 11.2 gives an illustration of PFNs and MFNs. One clearly sees that the<br />
guest OS can maintain the illusion of contiguous memory, whereas in reality it might<br />
be scattered all over the real machine memory. In itself this is no problem, as virtual<br />
memory management already provides such an abstraction and the OS is aware of it. The<br />
problem is that most OS kernels internally expect contiguous physical memory when<br />
managing the pages. As long as it works with PFNs, the OS can keep using its original code<br />
for managing contiguous memory. When manipulating page tables, however, a guest OS must<br />
use the MFN, as only this corresponds to a real physical memory address. Therefore, before<br />
updating or manipulating a page table entry, the OS must translate the PFN to the<br />
corresponding MFN.<br />
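A minimal sketch of this translation step, assuming a simple per-domain lookup table<br />
indexed by PFN. The names PhysToMachine, pfnToMfn and makePte are invented for this<br />
example and are not SWEB or Xen identifiers. The sketch only computes the value of a 32-bit<br />
(non-PAE) page table entry and says nothing about how the entry is written.<br />

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t PAGE_SHIFT   = 12;   // 4 KiB frames
constexpr std::uint32_t PTE_PRESENT  = 0x1;  // x86 "present" bit
constexpr std::uint32_t PTE_WRITABLE = 0x2;  // x86 "read/write" bit

// Hypothetical per-domain table: index = PFN, value = MFN.
struct PhysToMachine
{
  std::vector<std::uint32_t> mfn_of_pfn;

  std::uint32_t pfnToMfn(std::uint32_t pfn) const
  {
    assert(pfn < mfn_of_pfn.size());  // PFNs run from 0 to nr_pages - 1
    return mfn_of_pfn[pfn];
  }
};

// Builds the value of a page table entry for a guest frame.
// The entry must contain the machine frame, not the pseudo physical one.
std::uint32_t makePte(const PhysToMachine& p2m, std::uint32_t pfn,
                      std::uint32_t flags)
{
  return (p2m.pfnToMfn(pfn) << PAGE_SHIFT) | flags;
}
```

For example, if guest frame 1 is backed by machine frame 0x42, the entry for a present,<br />
writable page is 0x00042003, regardless of the PFN the guest uses internally.<br />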

Figure 11.2: Page frame numbers: PFN and MFN<br />

11.4 SWEB-Xen<br />

The following sections cover different aspects of the implementation of SWEB for Xen.<br />
The port is called SWEB Xen to differentiate it from the native SWEB implementation.<br />

11.4.1 Guest Section<br />

A guest domain must have a section __xen_guest which consists of an ASCII string.<br />
In SWEB Xen the section looks like this:<br />

.section __xen_guest<br />
  .ascii "GUEST_OS=linux,GUEST_VER=2.6"<br />
  .ascii ",XEN_VER=xen-3.0"<br />
  .ascii ",VIRT_BASE=0x80000000"<br />
  .ascii ",PAE=no"<br />
  .ascii ",LOADER=generic"<br />
  .byte 0<br />

The first important value is XEN_VER, which must be the version string of the Xen version<br />
in use. If it differs from the running Xen version, the domain builder will not load the<br />
guest domain. The second important value is VIRT_BASE, which must be the virtual<br />
address to which the kernel image should be loaded. The same value must<br />
be used in the SWEB source files for address calculations and in the linker script<br />
(arch/xen/utils/kernel-ld-script.ld).<br />

11.4.2 Start Info and Shared Info Page<br />

As a boot parameter, a pointer to the start info page is passed on to the guest domain<br />
in the register esi. This pointer is already a valid virtual address in the guest address<br />
space. The structure start_info that is used is defined in the public Xen header xen.h:<br />

typedef struct start_info {<br />
    /* THE FOLLOWING ARE FILLED IN BOTH ON INITIAL BOOT AND ON RESUME. */<br />
    char magic[32];              /* "xen--". */<br />
    unsigned long nr_pages;      /* Total pages allocated to this domain. */<br />
    unsigned long shared_info;   /* MACHINE address of shared info struct. */<br />
    uint32_t flags;              /* SIF_xxx flags. */<br />
    unsigned long store_mfn;     /* MACHINE page number of shared page. */<br />
    uint32_t store_evtchn;       /* Event channel for store communication. */<br />
    unsigned long console_mfn;   /* MACHINE address of console page. */<br />
    uint32_t console_evtchn;     /* Event channel for console messages. */<br />
    /* THE FOLLOWING ARE ONLY FILLED IN ON INITIAL BOOT (NOT RESUME). */<br />
    unsigned long pt_base;       /* VIRTUAL address of page directory. */<br />
    unsigned long nr_pt_frames;  /* Number of bootstrap p.t. frames. */<br />
    unsigned long mfn_list;      /* VIRTUAL address of page-frame list. */<br />
    unsigned long mod_start;     /* VIRTUAL address of pre-loaded module. */<br />
    unsigned long mod_len;       /* Size (bytes) of pre-loaded module. */<br />
    int8_t cmd_line[MAX_GUEST_CMDLINE];<br />
} start_info_t;<br />

Without this information a further setup of the guest domain would be impossible. A<br />
detailed explanation of the structure fields is skipped here, as each field is explained<br />
where it is used.<br />
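As a rough illustration of how a guest might sanity-check the page it was handed, the<br />
sketch below uses a reduced copy of the structure containing only its first two fields.<br />
The names StartInfoHead, hasXenMagic and domainMemoryBytes are hypothetical, and the check<br />
only assumes that the magic string begins with "xen-", as the comment in the listing<br />
suggests.<br />

```cpp
#include <cstring>

// Reduced stand-in for start_info: only the two leading fields of the
// listing above, so that the sketch compiles without the Xen headers.
struct StartInfoHead
{
  char magic[32];
  unsigned long nr_pages;
};

// Returns true if the page plausibly is a start info page.
// Assumption: the magic string starts with "xen-".
bool hasXenMagic(const StartInfoHead* si)
{
  return si != nullptr && std::strncmp(si->magic, "xen-", 4) == 0;
}

// Memory assigned to the domain in bytes, assuming 4 KiB pages.
unsigned long long domainMemoryBytes(const StartInfoHead* si)
{
  return static_cast<unsigned long long>(si->nr_pages) * 4096ULL;
}
```

A domain started with 8192 pages would thus report 32 MiB of pseudo physical memory.<br />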

There is also an additional shared info page. Its machine address can be found in the<br />
start_info struct. As it is not certain whether or where this page is mapped by the domain<br />
builder, SWEB Xen sets up a proper mapping itself. For this purpose a dummy page is created<br />
in head.S. This yields a virtual address that is valid from the start, as the domain<br />
builder sets up mappings for the whole kernel image. So we can remap this virtual address<br />
to the shared_info page and can access it before the virtual memory setup is complete.<br />

typedef struct shared_info {<br />
    vcpu_info_t vcpu_info[MAX_VIRT_CPUS];<br />
<br />
    /*<br />
     * A domain can create "event channels" on which it can send and receive<br />
     * asynchronous event notifications. There are three classes of event that<br />
     * are delivered by this mechanism:<br />
     *  1. Bi-directional inter- and intra-domain connections. Domains must<br />
     *     arrange out-of-band to set up a connection (usually by allocating<br />
     *     an unbound 'listener' port and advertising that via a storage service<br />
     *     such as xenstore).<br />
     *  2. Physical interrupts. A domain with suitable hardware-access<br />
     *     privileges can bind an event-channel port to a physical interrupt<br />
     *     source.<br />
     *  3. Virtual interrupts ('events'). A domain can bind an event-channel<br />
     *     port to a virtual interrupt source, such as the virtual-timer<br />
     *     device or the emergency console.<br />
     *<br />
     * Event channels are addressed by a "port index". Each channel is<br />
     * associated with two bits of information:<br />
     *  1. PENDING -- notifies the domain that there is a pending notification<br />
     *     to be processed. This bit is cleared by the guest.<br />
     *  2. MASK -- if this bit is clear then a 0->1 transition of PENDING<br />
     *     will cause an asynchronous upcall to be scheduled. This bit is only<br />
     *     updated by the guest. It is read-only within Xen. If a channel<br />
     *     becomes pending while the channel is masked then the 'edge' is lost<br />
     *     (i.e., when the channel is unmasked, the guest must manually handle<br />
     *     pending notifications as no upcall will be scheduled by Xen).<br />
     *<br />
     * To expedite scanning of pending notifications, any 0->1 pending<br />
     * transition on an unmasked channel causes a corresponding bit in a<br />
     * per-vcpu selector word to be set. Each bit in the selector covers a<br />
     * 'C long' in the PENDING bitfield array.<br />
     */<br />
    unsigned long evtchn_pending[sizeof(unsigned long) * 8];<br />
    unsigned long evtchn_mask[sizeof(unsigned long) * 8];<br />
<br />
    /*<br />
     * Wallclock time: updated only by control software. Guests should base<br />
     * their gettimeofday() syscall on this wallclock-base value.<br />
     */<br />
    uint32_t wc_version;  /* Version counter: see vcpu_time_info_t. */<br />
    uint32_t wc_sec;      /* Secs 00:00:00 UTC, Jan 1, 1970. */<br />
    uint32_t wc_nsec;     /* Nsecs 00:00:00 UTC, Jan 1, 1970. */<br />
<br />
    arch_shared_info_t arch;<br />
<br />
} shared_info_t;<br />

This page is used especially for the event channel notifications. See Callbacks, Events<br />
and Exceptions (11.4.8) for more details on event handling.<br />
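The PENDING and MASK bits described in the comment of shared_info can be scanned roughly<br />
as sketched below. The struct EventBits and the function collectPendingPorts are<br />
hypothetical stand-ins, not the SWEB Xen event code. The sketch uses plain loads and<br />
stores, whereas a real guest would need atomic bit operations because Xen updates the<br />
PENDING bits concurrently.<br />

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t BITS_PER_LONG = sizeof(unsigned long) * 8;
// Same array length as evtchn_pending / evtchn_mask in shared_info.
constexpr std::size_t NUM_WORDS = sizeof(unsigned long) * 8;

// Reduced stand-in for the two bit arrays of shared_info.
struct EventBits
{
  unsigned long pending[NUM_WORDS] = {};
  unsigned long mask[NUM_WORDS] = {};
};

// Returns the port indices that are pending and not masked.
// PENDING is cleared for each returned port, as this is the guest's job.
// Masked ports stay pending and are left for later handling.
std::vector<std::size_t> collectPendingPorts(EventBits& bits)
{
  std::vector<std::size_t> ports;
  for (std::size_t word = 0; word < NUM_WORDS; ++word)
  {
    const unsigned long active = bits.pending[word] & ~bits.mask[word];
    for (std::size_t bit = 0; bit < BITS_PER_LONG; ++bit)
    {
      const unsigned long flag = 1UL << bit;
      if (active & flag)
      {
        bits.pending[word] &= ~flag;
        ports.push_back(word * BITS_PER_LONG + bit);
      }
    }
  }
  return ports;
}
```

The per-vcpu selector word mentioned in the comment would allow the outer loop to skip<br />
words without pending bits. It is left out here to keep the sketch short.<br />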

11.4.3 PageManager and KernelMemoryManager<br />

The PageManager and the KernelMemoryManager classes expect to be placed after the<br />
kernel end. Therefore the following conditions must be met:<br />

• the start of free memory after the kernel must be known<br />

• there must be enough free memory after the kernel<br />

• the free memory must be mapped in the page table (accessible by virtual addresses)<br />

SWEB expects that 4 MB of memory are available for the kernel and that they are mapped<br />
correctly. These 4 MB are considered kernel memory and will only be used by kernel<br />
code. So there are two things which have to be done in SWEB Xen:<br />

• check and correct the 4 MB mapping if necessary<br />

• calculate the start of free memory<br />

Under native SWEB the symbol kernel_end_address (see<br />
arch/arch/utils/kernel-ld-script.ld for the definition) is identical to the<br />
start of the free memory after the kernel. Under Xen this is not true. See figure 11.1 for<br />
the general memory layout. Therefore a pair of static methods was introduced in<br />
ArchCommon to get the free kernel memory area.<br />

To get the start of free memory one must call<br />
ArchCommon::getFreeKernelMemoryStart. When booting natively the returned<br />
pointer will be the same as kernel_end_address, while under Xen it will be a<br />
higher address.<br />

The second call is ArchCommon::getFreeKernelMemoryEnd. The correct addresses are<br />
calculated during the boot process in start_xen_init().<br />

11.4.4 Memory regions and modules<br />

On a native boot SWEB uses the multiboot structure [11] filled in by the Grub bootloader.<br />
This structure passes on the usable memory regions and the module addresses.<br />
These are parsed during the boot, and the information needed later is stored in<br />
the structure multiboot_remainder in x86/source/ArchCommon.cpp. To get the<br />
information the following static methods exist:<br />

• ArchCommon::getNumUsableMemoryRegions<br />

• ArchCommon::getUsableMemoryRegion<br />

• ArchCommon::getNumModules<br />

• ArchCommon::getModuleStartAddress<br />

• ArchCommon::getModuleEndAddress<br />

Under Xen the information needed for the above methods is provided partly by the<br />
start info page (see 11.4.2). The remaining parts are calculated during the boot process<br />
in xen_start_init.<br />

11.4.5 Memory Layout<br />

Compare this to the SWEB linear memory layout in chapter VM (3.9).<br />

Note that the memory layout described in the Xen public header xen.h differs from the<br />
layout actually built by the Linux domain builder (xc_linux_build.c). Figure 11.3<br />
(SWEB Xen memory layout) shows the virtual memory layout of the kernel, which contains<br />
the following items:<br />

• kernel image<br />

<strong>The</strong> following parts are all set up by the domain builder:<br />

• MFN list: Contains the numbers <strong>of</strong> all machine frames belonging to the guest<br />

domain.<br />

• start info: page containing the start info structure<br />

• store: page for the Xenstore<br />

• console: page for the Xen console<br />

• PT_base: boot page directory<br />

• PT_frames: boot page tables<br />

Figure 11.3: SWEB Xen memory layout<br />

The following part is constructed during the SWEB Xen boot process:<br />

• additional page tables: needed to finally set up the virtual memory for the kernel<br />

The last parts are the same as under native SWEB.<br />

11.4.6 PSE<br />

SWEB uses PSE (page size extension). This enables a more efficient identity mapping<br />
starting at 3 GB (see 3.9). It is also used to map the frame buffer of the graphics<br />
card. Also see section “Using 4MiB Pages” in chapter VM for more information on the<br />
support of 4 MB pages in SWEB. Under Xen this mode of operation is not possible.<br />
The interface provided by Xen for guest domains is only aware of a single fixed page<br />
size. One problem is the pseudo physical memory abstraction (see 11.3.3). To support 4 MB<br />
pages it would be necessary to get a contiguous 4 MB block of real machine memory<br />
starting at a 4 MB boundary. As the memory is managed in 4 KB pages, this is not<br />
easily possible. This could change in future versions of Xen, as a redesign<br />
of the memory handling is currently being discussed in order to support architectures<br />
which are significantly different from the x86 32-bit architecture.<br />
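To make the requirement concrete: a 4 MB page covers 1024 frames of 4 KB each, so the<br />
1024 guest frames would have to be backed by consecutive machine frames whose first MFN is<br />
a multiple of 1024. The check below is a sketch under that assumption. The function<br />
canUse4MbPage is hypothetical and simply walks a list laid out like the MFN list (index =<br />
PFN, value = MFN).<br />

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t FRAMES_PER_4MB = 1024;  // 4 MB / 4 KB

// Returns true if the 1024 guest frames starting at first_pfn are backed by
// machine-contiguous memory that begins on a 4 MB boundary.
bool canUse4MbPage(const std::vector<std::uint32_t>& mfn_list,
                   std::size_t first_pfn)
{
  if (first_pfn + FRAMES_PER_4MB > mfn_list.size())
    return false;  // not enough frames left in the domain

  const std::uint32_t first_mfn = mfn_list[first_pfn];
  if (first_mfn % FRAMES_PER_4MB != 0)
    return false;  // machine memory is not aligned to 4 MB

  for (std::size_t i = 1; i < FRAMES_PER_4MB; ++i)
    if (mfn_list[first_pfn + i] != first_mfn + i)
      return false;  // machine frames are scattered

  return true;
}
```

Since the machine frames of a domain may be scattered all over the machine memory, a check<br />
of this kind can fail for many or all candidate blocks.<br />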

11.4.7 Console<br />

Under native SWEB there are two possible consoles: TextConsole and<br />
FrameBufferConsole. Direct frame buffer access is not allowed under Xen<br />
at the moment. Therefore FrameBufferConsole must not be created under<br />
Xen. This is easily achieved, as its creation depends on the return value of<br />
ArchCommon::haveVESAConsole. Unfortunately the design of the console<br />
part was not very clear from the beginning. So at the moment, if there is no VESAConsole,<br />
a TextConsole is created. The problem is that the TextConsole also writes directly<br />
to the graphics card memory through the pointer returned by ArchCommon::getFBPtr.<br />

Xen provides a virtual console to every domain. This is accomplished by a shared<br />
memory page which transfers the data to and from the console. Data transfer is<br />
asynchronous. Several batches of data (up to the maximum buffer size) can be placed in<br />
the page. They will not be read until an event is sent over the console event channel.<br />

It is important to understand that the Xen hypervisor does not handle the console itself.<br />
It only passes on the sent events. Instead, dom0 must run a service which provides the<br />
console handling. This is currently a serial console. To access the console in dom0<br />
one has to enter the command xm console domid in a terminal. This connects the<br />
virtual Xen console to the current tty. To close the connection Ctrl+] must be entered.<br />
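The asynchronous transfer through the shared page can be pictured as a ring buffer with<br />
a producer index and a consumer index. The sketch below is a simplified model of the<br />
output direction only. The struct ConsoleRing, its size and the two functions are<br />
assumptions made for this example, the actual page layout is defined by the Xen public<br />
headers, and the memory barriers a real guest needs are left out.<br />

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical, simplified output ring living in the shared console page.
struct ConsoleRing
{
  static constexpr std::size_t SIZE = 2048;  // assumed buffer size
  char out[SIZE] = {};
  std::uint32_t out_cons = 0;  // advanced by the back end (reader)
  std::uint32_t out_prod = 0;  // advanced by the guest (writer)
};

// Guest side: queues as much of text as fits and returns the number of
// bytes queued. Nothing is displayed yet; the back end only reads the data
// after it has been notified through the console event channel.
std::size_t queueOutput(ConsoleRing& ring, const std::string& text)
{
  std::size_t queued = 0;
  while (queued < text.size() &&
         ring.out_prod - ring.out_cons < ConsoleRing::SIZE)
  {
    ring.out[ring.out_prod % ConsoleRing::SIZE] = text[queued++];
    ++ring.out_prod;
  }
  return queued;
}

// Back-end side of the model: drains everything queued so far.
std::string drainOutput(ConsoleRing& ring)
{
  std::string result;
  while (ring.out_cons != ring.out_prod)
  {
    result += ring.out[ring.out_cons % ConsoleRing::SIZE];
    ++ring.out_cons;
  }
  return result;
}
```

Several calls to queueOutput can precede a single notification, which corresponds to the<br />
batches of data mentioned above.<br />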

Implementation<br />

For Xen 3.0 an experimental patch exists which aims to provide a frame buffer emulation<br />
[12]. This patch was not available at the beginning, but a similar idea was considered<br />
during development. It would be possible to set up part of the memory as a<br />
frame buffer and return the pointer to it. TextConsole could then write directly to the<br />
buffer. To actually write the data to the console provided by Xen, some kind of driver<br />
would have to be notified of changes in this buffer and then update the Xen console<br />
accordingly. This approach has two drawbacks: it would slow down console output<br />
further, and, as the major problem, buffer updates are hard to detect reliably. So this<br />
idea was discarded.<br />

The decision was made to implement a XenConsole which is derived from Console.<br />
The clean way would be to make Console an arch common header (e.g. ArchConsole)<br />
and derive a TextConsole in each arch tree. Then main.cpp would simply create a<br />
TextConsole, which would automatically fit the chosen architecture. Unfortunately<br />
it was already too late for this. So at the moment an ifdef in main.cpp decides which<br />
console is built. This should be fixed at the next major update of the code base.<br />

Each console handles several terminals (common/console/Terminal.h). These provide<br />
a buffer for the screen output and one for input. Instead of directly accessing the<br />
hardware they use the methods provided by the Console. So no changes were needed<br />
in the terminals.<br />

The console also handles the input it gets from the keyboard manager by calling<br />
getKeyFromKbd and passes it on to the active terminal. Under Xen the console is<br />
fed with input data by the console event handler.<br />

11.4.8 Callbacks, Events and Exceptions<br />

Under Xen, interrupts are not directly available to guest domains. They are replaced<br />
by events. Exceptions are available but are evaluated by the hypervisor first.<br />

For a software exception two parts are important: the caller and the callee. The caller<br />
uses hypercalls, which are similar to system calls. For a list of the available hypercalls<br />
see the Xen 3.0 interface documentation [13]. Basically a hypercall is a trap<br />
(software exception). To select which hypercall is invoked, the corresponding<br />
number is written into the EAX register. Arguments are passed in EBX, ECX,<br />
EDX, ESI and EDI. After the call has finished, EAX contains the return value, if there is<br />
one. See arch/xen/include/hypervisor.h for the assembler stubs and the Xen public<br />
header xen.h for the list of hypercalls and their numbers.<br />

When a hypercall is issued the hypervisor is invoked. Depending on the hypercall it<br />
performs the necessary operations itself or delegates the handling of the hypercall to<br />
dom0 or another domain.<br />

Figure 11.4: Event sources<br />

For more detailed information about traps, faults and exceptions, on which the Xen hypercalls, events and exceptions are based, read the Intel developer manual Volume 3, Architecture and Programming Manual [16], especially chapters 13 and 14.<br />

Events There are three possible sources for events (see figure 11.5):<br />
• Virtual interrupts (like the timer Xen provides)<br />
• Physical interrupts (if the domain has the necessary rights)<br />
• Events from inter-domain communication<br />


An example of an inter-domain event is a write to the console. To notify the console back end that new data is available for output, SWEB Xen sends an event with the hypercall HYPERVISOR_event_channel_op. The hypervisor then makes a call to the console handling domain, which is at the moment always dom0.<br />

Exceptions Exceptions are not the result of a hypercall, but they may arise during the execution of a hypercall. A good example is the page fault exception raised by the processor when an address translation fails. Xen receives all exceptions first. Depending on the cause it handles them itself or passes them on to the currently running domain.<br />

Implementation<br />

Two callbacks have to be registered at the hypervisor with HYPERVISOR_set_callbacks. These are failsafe_callback and hypervisor_callback. The failsafe callback will only be called when a fault occurs during the execution of the hypervisor_callback. The segment selectors provided when registering the callbacks must have ring privilege level 1. FLAT_KERNEL_CS from the Xen public headers is correct and can be used. arch/xen/source/entry.S contains the needed assembler stubs for both callbacks. It also contains the stubs used for the trap table, which is contained in xen_traps.c and registered during initialisation. There the virtual exceptions are implemented.<br />

When Xen wants to notify us of an event the hypervisor callback is used. Care is taken that only one event is processed at a time (see the source code comments for details). In the shared info page there are two bit masks per virtual CPU. With one of them the guest finds out which events are pending. As with interrupts, the other one is used to mask events so that Xen will not do another callback while the current one is being processed.<br />
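The interplay of a pending mask and a delivery mask can be sketched as follows. This is a simplified illustration, not the real layout of the shared info page: DemoEventState and collectDeliverableEvents are hypothetical names, and a single 32-bit word per mask is an assumption made to keep the example short.<br />

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the two event bit masks.
struct DemoEventState {
  std::uint32_t pending = 0;  // bit n set: event n is waiting
  std::uint32_t mask = 0;     // bit n set: event n must not be delivered
};

// Collects all pending, unmasked events and clears their pending bits.
// Masked events stay pending until they are unmasked.
inline std::vector<int> collectDeliverableEvents(DemoEventState& state) {
  std::vector<int> ports;
  const std::uint32_t ready = state.pending & ~state.mask;
  for (int port = 0; port < 32; ++port) {
    if (ready & (1u << port)) {
      ports.push_back(port);
      state.pending &= ~(1u << port);
    }
  }
  return ports;
}
```

With pending set to 0xB (events 0, 1 and 3) and mask set to 0x2 (event 1), the function returns events 0 and 3 and leaves event 1 pending.<br />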

For virtual exceptions the callback is not used. The assembler stubs in entry.S are called directly. Each of them calls an appropriate function in xen_traps.c. Most of them are just dummy debug implementations at the moment.<br />

A fully implemented example of event handling is the console input. For compatibility with the rest of SWEB the class is called KeyboardManager, in arch_keyboard_manager.cpp. It contains a buffer into which the input from the console page is copied. If an event is received while the buffer is full, the data in the console page is discarded. XenConsole gets the data by calling getKeyFromKbd.<br />
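The drop-when-full behaviour of such an input buffer can be sketched with a small ring buffer. DemoKeyBuffer, its capacity and its method names are hypothetical and chosen for this example only; the actual KeyboardManager in arch_keyboard_manager.cpp may be organised differently.<br />

```cpp
#include <cstddef>

// Hypothetical fixed-size key buffer; name and capacity are illustrative.
class DemoKeyBuffer {
 public:
  static constexpr std::size_t kCapacity = 4;

  // Stores one key; returns false (and drops the key) if the buffer is full.
  bool put(char key) {
    if (count_ == kCapacity) return false;
    data_[(head_ + count_) % kCapacity] = key;
    ++count_;
    return true;
  }

  // Fetches the oldest key; returns false if the buffer is empty.
  bool get(char& key) {
    if (count_ == 0) return false;
    key = data_[head_];
    head_ = (head_ + 1) % kCapacity;
    --count_;
    return true;
  }

 private:
  char data_[kCapacity] = {};
  std::size_t head_ = 0;   // index of the oldest stored key
  std::size_t count_ = 0;  // number of stored keys
};
```

The event handler would call put() for every character found in the console page, while the console side would call get() from its equivalent of getKeyFromKbd.<br />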

11.4.9 Virtual CPUs<br />

Xen can provide several virtual CPUs (VCPUs for short) to one domain. The number is not necessarily the same as the number of physical CPUs. Currently SWEB is not aware of multiple CPUs at all. This is also true for SWEB Xen, which only handles VCPU 0. This is relevant, for example, when handling events, because they are bound to one VCPU. The configuration file used to start the SWEB Xen domain should therefore specify only one VCPU.<br />


Figure 11.5: Overview over the exception and event implementation<br />

11.4.10 Block Devices<br />

Xen includes the emulation of block devices. This is achieved by the back-end/front-end driver mechanism. Dom0 provides the back-end driver, which accesses the real block device. This can be a physical disk, a physical partition, a network block device or a single file. Further sources which can be used as a block device are possible when integrated into the Dom0 code. A guest domain which wants to use a block device must implement a front-end driver which communicates with the back-end driver over event channels and shared memory.<br />

Currently there is no implementation of a block device front-end driver in SWEB Xen. It should be possible to implement one using the existing SWEB block device infrastructure, although some modifications may be needed.<br />

11.4.11 Boot Process<br />

The following steps are done during the boot process.<br />


1. execution starts at _start: (arch/xen/source/head.S)<br />
(a) call start_xen_init (arch/xen/source/xen_startup.c)<br />
2. execution of start_xen_init:<br />
(a) copy start_info to a globally accessible area<br />
(b) map shared_info<br />
(c) set up hypervisor_callback, failsafe_callback<br />
(d) initialise pfn to mfn mapping<br />
(e) check and/or create 4MiB kernel mapping<br />
(f) create identity mapping at 3GB<br />
(g) calculate kernel end<br />
(h) calculate free memory<br />
(i) initialise traps<br />
(j) initialise events<br />
(k) activate event delivery<br />
(l) call startup(), this hands over to the normal boot process of SWEB (common/source/main.cpp)<br />
3. execution of startup (continuing the normal SWEB boot process)<br />

11.5 Work to do<br />

As stated in the introduction, the port of SWEB to Xen is not completely finished. In particular, task switching (threading) is missing. The following short list also includes optional tasks that are not strictly needed to run SWEB under Xen.<br />

• Task switching (threading). Xen supports this with some hypercalls.<br />

• Correct the partly weakened architecture separation<br />

• Find a real debugging solution. Note that since Xen 3.0 xen/tools/debugger<br />

exists which should be a good starting point.<br />

• Xenstore: At the moment there is no support in SWEB Xen for the Xenstore at all. As the current development of Xen seems to make more use of the Xenstore, support for it should be added.<br />
• Virtual block devices: As SWEB already has block device support it would be a nice task to add support for the virtual block devices provided by Xen. Note that for this a redesign of the arch_bd_manager.h might be necessary.<br />
• Virtual network cards: Xen also offers virtual network cards. SWEB does not know anything about network cards at the moment. Therefore an architecture independent interface is needed first; implementations for real hardware and for the virtual cards provided by Xen can then be done.<br />


11.6 Appendix: Install guide<br />

11.6.1 General<br />

The SWEB port was done on a checkout of the xen-3.0-testing repository. This checkout is available for download.<br />

11.6.2 Getting Xen<br />

Information on how to get a fresh checkout of the source repository can be found at http://www.xensource.com/<br />
To get the Xen-3.0 testing snapshot we used for the SWEB port, download the tarball from http://www.sbox.tugraz.at/home/r/rotho/sweb/xen-3.0-testing-snapshot.tar.gz.<br />
The Linux command “wget http://www.sbox.tugraz.at/home/r/rotho/sweb/xen-3.0-testing-snapshot.tar.gz” will do that.<br />

11.6.3 Dependencies<br />

To build Xen, XenLinux and the documentation that comes with the source, you will need the following libraries/packages:<br />

• make<br />

• libncurses5-dev<br />

• libncurses5<br />

• gcc<br />

• libc6-dev<br />

• zlib1g-dev<br />

• python<br />

• python-dev<br />

• python-twisted<br />

• bridge-utils<br />

• iproute<br />

• libcurl3<br />

• libcurl3-dev<br />

• bzip2<br />


• module-init-tools<br />

• latex<br />

• latex2html<br />

• transfig<br />

• tgif<br />

11.6.4 Prepare installation<br />

First the downloaded tarball must be extracted.<br />
Usually this is done with:<br />
“tar -xzf xen-3.0-testing-snapshot.tar.gz”.<br />
Change into the extracted xen source directory (usually “xen-3.0-testing”).<br />
Although debugging is not used at the moment you should enable it right away. At the very least this leads to more and better debug messages.<br />
Open the file “xen/Rules.mk” with your favourite editor and find the following lines:<br />

verbose ?= n<br />
debug ?= n<br />
perfc ?= n<br />
perfc_arrays?= n<br />
domu_debug ?= n<br />
crash_debug ?= n<br />

Change these lines to:<br />
verbose ?= y<br />
debug ?= y<br />
perfc ?= n<br />
perfc_arrays?= n<br />
domu_debug ?= y<br />
crash_debug ?= y<br />

Enter the xen source directory and run “make clean”.<br />

11.6.5 Building Xen<br />

The next step is to build the Xen hypervisor. Enter the xen source directory and type “make world”. This will:<br />

• build xen<br />

• build the xen tools<br />

• download 2.6 linux source from kernel.org<br />

• patch the mentioned linux source<br />

• build a privileged linux kernel<br />

• build a guest linux kernel<br />


Building customized domains<br />

In some cases it will be necessary to build a customised XenLinux kernel.<br />
To do so change into the subdirectory of the patched Linux kernel with “cd linux-2.6.12-xen0”.<br />
Enter “make ARCH=xen menuconfig” and alter the configuration to your needs.<br />
Then change into the xen source directory and enter “make kernels”.<br />

11.6.6 Installing Xen & Xen tools<br />

Note: To install the Xen kernel, the Xen tools and the guest kernels, root privileges are required.<br />
It is necessary to become root (e.g. with “su”) or to use sudo.<br />
To install the xen build and the guest domains type “make install” in the xen directory.<br />

This will:<br />

• build and install the Xen hypervisor<br />

• build and install the control tools<br />

• build and install guest kernels<br />

• build and install user documentation<br />

11.6.7 Configuring GRUB<br />

After building and installing xen, the boot loader has to be configured.<br />

To do so, modify the GRUB configuration file (usually “menu.lst” or “grub.conf”) which can be found in “/boot” or “/boot/grub”.<br />

You have to add an entry like this one:<br />

title Xen 3.0-testing / XenLinux 2.6.12.6-xen0<br />
root (hd0,0)<br />
kernel /boot/xen-3.0.0.gz dom0_mem=270000<br />
module /boot/vmlinuz-2.6-xen0 root=/dev/hda1 ro console=tty0<br />

“title” is a GRUB command. It starts a new boot entry. The name of the entry is set to the contents of the rest of the line, starting with the first non-space character.<br />
The kernel line tells GRUB where the Xen kernel can be found and which boot parameters should be passed to the Xen kernel.<br />
The dom0_mem=270000 option tells Xen that 270000 KB of memory should be allocated for domain 0.<br />

“root” is a GRUB command which sets the current root device.<br />

This line will not be necessary in most cases. Read the GRUB documentation to learn<br />


more.<br />

The module line contains the location of the XenLinux kernel which Xen starts as domain 0 and the kernel parameters which should be passed to the XenLinux kernel. Those kernel parameters are standard Linux boot parameters.<br />
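As a quick sanity check of the dom0_mem value used in the example entry, the amount can be converted to MiB. The sketch below assumes that one KB here means 1024 bytes; the helper name kibToNearestMib is hypothetical. The result is consistent with the 264 MiB that “xm list” reports for Domain-0 in the sample output in section 11.6.9.<br />

```cpp
#include <cstdint>

// Converts a size in KiB to MiB, rounded to the nearest whole MiB.
// Assumption for this sketch: 1 KB in dom0_mem means 1024 bytes.
constexpr std::uint64_t kibToNearestMib(std::uint64_t kib) {
  return (kib + 512) / 1024;
}

// 270000 KiB is about 263.7 MiB, i.e. 264 MiB after rounding.
static_assert(kibToNearestMib(270000) == 264, "dom0_mem example");
```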

11.6.8 SWEB under Xen<br />
We assume in this section that:<br />
• there is already a SWEB source tree including the “arch/xen” subdirectory<br />
• SWEB has been compiled successfully for the x86 architecture before<br />

Building SWEB for Xen<br />
Remove the x86 build of SWEB manually.<br />
If you don’t know in which directory your binary build is, enter the SWEB source directory (in most cases it’s “sweb-stable”) and enter “rm -rf `pwd`-bin/*”.<br />
Change to the SWEB source directory and enter “make ARCH=xen”.<br />
Note:<br />
If you use your own Xen checkout you have to copy the Xen public headers from /xen/include/public/* to /arch/xen/include/xen3-public/.<br />
Don’t forget to include the subdirectories.<br />

The domain config file<br />
Before you can run SWEB as a guest domain, you must create a configuration file.<br />
At the moment a very simple configuration file will be sufficient. If the Xen port is developed further (e.g. network device support added) more options may have to be configured.<br />
A minimal configuration file which fits our needs would look similar to this:<br />

name = "SWEB"<br />
kernel = "/home/user/sweb-testing-bin/sweb_xen.gz"<br />
memory = 64<br />

The “name” entry sets the name of the domain. This domain name has to be a unique identifier of the domain.<br />
The “kernel” entry contains the location of the kernel image which should be loaded.<br />


The “memory” entry sets the domain memory size given in MiB.<br />
64 MiB of memory should be enough. If SWEB shows strange behaviour or crashes without reason, try a higher value.<br />
Save the configuration file as “/etc/xen/sweb.cfg”.<br />

11.6.9 Running Xen<br />

Running dom0<br />

To run the installed Xen version, reboot the system and choose the option “Xen 3.0-testing / XenLinux 2.6.12.6-xen0” from the GRUB boot menu.<br />
You can then watch XenLinux booting, which looks very similar to a conventional Linux boot.<br />
After the system has booted you should be able to log in and work on your system as usual.<br />
If the system hangs or you cannot log in for some other reason, reboot the system, choose your normal system from the GRUB boot menu and read the Xen documentation.<br />
After login, start the xend control daemon by typing “xend start”. You have to be root to do this.<br />

Running a guest domain<br />

Before you try to run SWEB as a guest you should first verify that your Xen setup is complete and working. It is therefore best to create a guest domain with the domU kernel that was built earlier during the install. For Debian the most convenient way is to use the package “xen-tools”. It includes some scripts for building and managing guest domain images. See the documentation of the package for details. The most important script is “xen-create-image”. After you have successfully created a guest domain image and the configuration file for it, use the command “xm create conffile -c” to start it and open a console to the newly started domain. If everything works you can go on to SWEB.<br />
If the creation of the guest domain fails you have to fix this first. Under “/var/log/” you find the following log files from Xen:<br />

• xend.log<br />

• xend-debug.log<br />

• xen-hotplug.log<br />

Check these for any errors. Also verify your network setup is correct.<br />


Running <strong>SWEB</strong> as guest<br />

To work with the Xen management user interface (xm) you need root privileges.<br />
Now you can boot the SWEB guest domain with “xm create sweb.cfg”.<br />
“xm list” now gives an output similar to this:<br />

# xm list<br />
Name ID Mem(MiB) VCPUs State Time(s)<br />
Domain-0 0 264 1 r----- 107.9<br />
SWEB 1 64 1 ------ 11.5<br />

“xm console SWEB” opens a console to SWEB.<br />
The shortcut CTRL + ] escapes the Xen console.<br />
“xm destroy SWEB” destroys the SWEB guest domain.<br />
“xm create sweb.cfg -c” creates a SWEB guest domain and opens a console to SWEB without delay.<br />

Type “xm help” or “man xm” for more information.<br />

11.6.10 Alternative to real installation<br />

If for some reason Xen is not running reliably on your machine there is an alternative. You need quite some free disk space and a powerful machine (especially processor and memory). It is possible to run Xen completely under the qemu emulator. Only a short outline is given here; you will need to read the qemu documentation carefully.<br />

• Install and configure qemu<br />

• Create a big disk image<br />

• Install Linux (e.g. Debian with the net install CD) into this image<br />

• Setup a complete build environment within the image (start the image with qemu<br />

and then work within the simulation)<br />

• Install Xen within the image<br />

• Test the Xen install with a guest image<br />

• Get SWEB into your image, build and start it there<br />

Note that most of the time you work inside a complete disk image running under qemu. So you have a complete Linux install with a complete Xen install, all run by qemu. This is quite slow: qemu runs under your regular system as a user process, and within this user process Xen is running, together with the dom0 in which you work and start additional guest domains. But it has advantages. As you installed everything within a disk image, it can easily be deleted. You also install Xen in a freshly installed Linux environment, which saves you from modifying your real setup. It might also be easier to get the network setup done correctly. The main problem is to find a way to<br />


synchronise the SWEB code in the image with the real one outside (e.g. share it over CVS or use a shared disk with qemu). Read the qemu documentation and choose the way which is best for you.<br />


Chapter 12<br />

Hacking Sweb<br />

This chapter is intended to list some dos and don’ts when hacking around in sweb.<br />

12.1 C++ specific issues<br />

Because we are working in a freestanding environment there are several C++ features which cannot be used in sweb.<br />

Exceptions Exceptions cannot be used in sweb because the functionality required for stack unwinding is not present in sweb (at least not sufficiently to satisfy the requirements of libstdc++).<br />

RTTI Depending on your compiler RTTI might work, at least partially. Since no exceptions are possible it’s highly unlikely that your compiler will ever be able to implement it in a sane way. Thus, don’t use it.<br />

Static Objects Static objects can be used, but note that constructors for static objects will never be called. The memory these objects use will be initialised to 0, though. So it’s much better to wrap static objects in a singleton (but only if your singleton allocates using new!).<br />
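A minimal sketch of such a singleton is shown below. The class name DemoRegistry and its members are hypothetical. The only static data is a plain pointer, which needs no constructor and is fine being zero-initialised; the object itself is created with new on first use, so its constructor really runs. The sketch is not thread safe, so the first call to instance() would have to happen before several threads can reach it, or be protected by a lock.<br />

```cpp
#include <new>

// Hypothetical class used only for illustration.
class DemoRegistry {
 public:
  // Returns the single instance, creating it with new on first use.
  // Returns nullptr if the allocation fails.
  static DemoRegistry* instance() {
    if (instance_ == nullptr) {
      instance_ = new (std::nothrow) DemoRegistry();
    }
    return instance_;
  }

  int magic() const { return magic_; }

 private:
  DemoRegistry() : magic_(42) {}  // would be skipped for a plain static object

  int magic_;
  static DemoRegistry* instance_;  // plain pointer: needs no constructor
};

DemoRegistry* DemoRegistry::instance_ = nullptr;
```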

Inheritance and Pure Virtual Methods Inheritance works without any limitations. Multiple inheritance should also work, although it was never tested in sweb. One special thing to take care of is pure virtual methods. If for some reason a pure virtual method is called, this will result in a kernel panic and not just in the offending thread being killed.<br />

Floating Point Numbers Using floating point math is a really big “no no” in the kernel. While the FPU is initialised and handled correctly during thread switches, we currently do not handle any FPU exceptions. Also, floating point math is very slow compared to integer math and there is no good reason to use it in the kernel anyway.<br />


12.2 Dos and Donts<br />

Interrupts Never ever turn the interrupts off unless you really know what you are doing. Whilst it’s common practice to turn off the interrupts frequently in kernel mode when working with nachos, there is really no good reason for doing it in sweb. If you want to make sure that only your thread has access to a resource at a given point in time, use a lock.<br />
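The pattern can be sketched as follows. std::mutex is used here only as a stand-in so that the example compiles in an ordinary user-space program; inside sweb the kernel's own lock class would take its place. DemoCounter is a hypothetical name.<br />

```cpp
#include <mutex>

// Hypothetical shared resource protected by a lock instead of by
// switching interrupts off.
class DemoCounter {
 public:
  void increment() {
    std::lock_guard<std::mutex> guard(lock_);  // acquired here, released at scope exit
    ++value_;
  }

  int value() {
    std::lock_guard<std::mutex> guard(lock_);
    return value_;
  }

 private:
  std::mutex lock_;  // stand-in for the kernel's own lock type
  int value_ = 0;
};
```

Only threads that actually touch the counter are delayed while the lock is held; everything else, including interrupt handling, keeps running.<br />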

Allocating Memory Memory should only ever be allocated if it is ok for your thread to be suspended. Since memory allocation can block (i.e. yield the thread), one must never allocate memory inside an interrupt handler or when interrupts are turned off. There are exceptions to this rule, however: the pagefault and the syscall handlers both run with interrupts turned on, so it is ok to allocate memory there.<br />
Another important thing to note is that a new or malloc may fail, so always test whether your allocation succeeded.<br />
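A checked allocation might look like the sketch below. In a hosted C++ program a plain new throws on failure, so the example uses new (std::nothrow) to obtain the null-on-failure behaviour that the text relies on; allocateBuffer is a hypothetical helper name.<br />

```cpp
#include <cstddef>
#include <new>

// Allocates a zero-filled buffer; returns nullptr if the allocation fails.
inline char* allocateBuffer(std::size_t size) {
  char* buffer = new (std::nothrow) char[size];
  if (buffer == nullptr) {
    return nullptr;  // report the failure instead of dereferencing null
  }
  for (std::size_t i = 0; i < size; ++i) {
    buffer[i] = 0;
  }
  return buffer;
}
```

The caller has to test the returned pointer as well and release the buffer with delete[] when it is no longer needed.<br />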

Locks Taking a lock is only allowed outside of interrupt context. Never try to get a lock with interrupts turned off.<br />

kprintf Both kprintf and kprintfd may take locks and allocate memory. Thus it is not allowed to use them when interrupts are turned off. For printing out information in these cases there are the functions kprintfd_nosleep and kprintf_nosleep. These can be used safely, but on the other hand they can’t guarantee that your messages get printed at all!<br />

Threads To implement a thread one has to override the virtual Run Method. One<br />

however must never ever return from this Run Method.<br />

Stack usage On a typical Linux machine it is ok to use several megabytes of stack memory. In sweb, however, one only has a total of 8 kilobytes of stack. It is thus a wise idea to limit the depth of recursive functions and not to allocate anything on the stack besides some simple automatic variables. It is certainly a very bad idea to allocate buffers and the like on the stack. Currently there is no mechanism to kill a thread once a stack overflow happens, so one should pay special attention to stack usage.<br />
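One way to limit recursion depth is to replace a recursive traversal with a loop. The sketch below contrasts the two for a linked list; DemoNode and both function names are hypothetical. The recursive version needs one stack frame per list element, whereas the iterative version uses a constant amount of stack regardless of the list length.<br />

```cpp
// Hypothetical singly linked list node.
struct DemoNode {
  int value;
  DemoNode* next;
};

// Recursive version: stack use grows with the length of the list.
inline int sumRecursive(const DemoNode* node) {
  return node == nullptr ? 0 : node->value + sumRecursive(node->next);
}

// Iterative version: constant stack use regardless of list length.
inline int sumIterative(const DemoNode* node) {
  int sum = 0;
  for (; node != nullptr; node = node->next) {
    sum += node->value;
  }
  return sum;
}
```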



Bibliography<br />

[1] Wikipedia:Binary_prefix (http://en.wikipedia.org/wiki/Binary_prefix).<br />

[2] UMBC EECS Department Course in Systems Design and Programming by Dr. Jim Plusquellic.<br />

[3] The Linux Virtual File-System Layer.<br />

[4] Virtual Filesystem.<br />

[5] Wikipedia:Race_condition (http://en.wikipedia.org/wiki/Race_condition).<br />

[6] NACHOS (http://www.cs.washington.edu/homes/tom/nachos).<br />

[7] Wikipedia: Hypervisor (http://en.wikipedia.org/wiki/Hypervisor).<br />

[8] INTEL Vanderpool (http://cache-www.intel.com/cd/00/00/19/76/197666_197666.pdf).<br />

[9] AMD Pacifica (http://developer.amd.com/assets/WinHEC2005_Pacifica_Virtualization.pdf).<br />

[10] Statement to the ReactOS domain builder (http://lists.xensource.com/archives/html/xendevel/2005-06/msg00074.html).<br />

[11] Grub Multiboot Definition (http://www.gnu.org/software/grub/manual/multiboot).<br />

[12] Experimental Xen frame buffer (http://wiki.xensource.com/xenwiki/VirtualFramebuffer).<br />

[13] Xen 3.0 for x86 Interface Manual (http://www.cl.cam.ac.uk/Research/SRG/netos/xen/readmes/interf<br />

[14] IA-32 Intel® Architecture Software Developer’s Manual, Volume 1: Basic Architecture. Intel Corporation.<br />
[15] IA-32 Intel® Architecture Software Developer’s Manual, Volume 2A: Instruction Set Reference, A-M.<br />
[16] IA-32 Intel® Architecture Software Developer’s Manual, Volume 3: System Programming Guide. Intel Corporation.<br />

[17] Modern Operating Systems. Prentice-Hall, 2001.<br />

[18] Understanding the Linux Kernel, 2nd Edition, chapter 12. 2002.<br />
