Extreme High Performance Computing or Why Microkernels Suck

Christoph Lameter, Ph.D.
christoph@lameter.com
Technical Lead Linux Kernel Software
Silicon Graphics, Inc.

Ottawa Linux Symposium
2007-06-30
Introduction

- Short intro to High Performance Computing
- How far does Linux currently scale?
- Conceptual comparison: microkernel vs. monolithic OS (Linux)
- Fundamental scaling problems of a microkernel-based architecture
- Monolithic kernels are also modular
- Why Linux scales so well and adapts to ever larger and more complex machines
- Current issues
- Conclusion: the microkernel is an idea taken to unhealthy extremes.
Applications of High Performance Computing

- Solve complex, computationally expensive problems
- Scientific research
  - Physics (quantum mechanics, nuclear phenomena)
  - Cosmology
  - Space
  - Biology (gene analysis, viruses, bacteria, etc.)
- Simulations
  - Weather (hurricanes)
  - Study of molecules and new substances
- Complex data analysis
- 3D design
  - Interactive modeling (e.g. car design, aircraft design)
  - Structural analysis
[Figure: Dark matter halo simulation for the Milky Way]

[Figure: Black hole simulation]

[Figure: Carbon nanotube-polymer composite material]

[Figure: Forecast of Hurricane Katrina]

[Figure: Airflow simulations]
High Performance Computer Architectures

- Supercomputer
  - Single memory space
  - NUMA architecture: memory nodes / distant memory
  - Challenge: scaling the operating system
- Cluster
  - Multiple memory spaces
  - Networked commodity servers
  - Network communication is critical for performance
  - Challenge: redesigning applications for a cluster
- Mainframe
  - Single uniform memory space with multiple processors
  - Scalable I/O subsystem
  - Mainly targeted at I/O transactions
  - Reliable and maintainable (24x7 availability)
[Figure: NASA Columbia supercomputer with 10,240 processors]
Current Maximum Scaling of a Single Linux Kernel

- This is not a cluster
  - Single address space
  - Processes communicate using shared memory
- Currently deployed configurations
  - A single kernel boots 1024 processors
  - 8 terabytes of main memory
  - 10 GB/s I/O throughput
- Known working configurations
  - 4096 processors
  - 256 TB of memory
- Next-generation platform
  - 16384 processors
  - 4-8 petabytes of memory (1 PB = 2^50 bytes)
Monolithic Kernel vs. Microkernel

[Diagram: on the left, a monolithic OS sits as a single layer between the
application and the hardware; on the right, a microkernel system in which the
application's requests pass through separate server and driver processes
before reaching the hardware.]
Microkernels vs. Monolithic

- Microkernel claims
  - Essential to deal with scalability issues
  - Allows a better designed system
  - Essential to deal with the complexity of large operating systems
  - Makes the system work reliably
- However
  - Large-scale microkernel systems do not exist
  - Research systems exist up to 24p (unconfirmed rumors of 64p)
  - IPC overhead vs. a monolithic kernel's function calls
  - Need for context switches within the kernel
  - Message transfer issues
  - Significant effort is spent optimizing around these
Isolation vs. Integration

- Microkernel: isolates kernel components
  - Better protected against component failures
  - Defined APIs between the components of the kernel
- Monolithic OS
  - Large, potentially complex code
  - Universal access to data
  - APIs implicitly established by function call conventions
- Difficulty of keeping application state in microkernels
  - Performance issues from not having direct access to relevant data in other subsystems
- Monolithic OSes like Linux also have isolation methods
  - Source code modularization
  - Binary modules
APIs

- A monolithic kernel has flexible internal APIs if no binary APIs are supported, as in Linux.
- A microkernel must attempt to standardize its APIs to ensure that operating system components can be replaced.
- Thus a monolithic kernel can evolve faster than a microkernel.
Competing Technologies within a Monolithic Kernel

- A variety of locks that can be used to architect synchronization methods:
  - Atomic operations
  - Reference counts
  - Read-Copy-Update (RCU)
  - Spinlocks
  - Semaphores
- New approaches to locking are frequently introduced to solve particularly hard problems.
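Two of the primitives listed above can be sketched in userspace C11 (an assumption of this example; the kernel has its own implementations): an atomic reference count and a minimal spinlock built from an `atomic_flag`, combined the way a monolithic kernel freely mixes synchronization methods.

```c
/* Sketch: atomic refcount + minimal spinlock protecting a shared counter. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int refcount = 1;             /* atomic operation / refcount */
static atomic_flag lock = ATOMIC_FLAG_INIT; /* spinlock */
static long shared_counter = 0;             /* protected by the spinlock */

static void spin_lock(void) {
    /* Spin until we are the thread that flips the flag from clear to set. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    (void)arg;
    atomic_fetch_add(&refcount, 1);         /* take a reference */
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        shared_counter++;                   /* critical section */
        spin_unlock();
    }
    atomic_fetch_sub(&refcount, 1);         /* drop the reference */
    return NULL;
}

long run_demo(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    return shared_counter;
}
```

The point of the slide is that all of these primitives coexist and compete inside one kernel, and each subsystem picks whichever fits its access pattern.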
Scaling up Linux

- Per-CPU areas
- Per-node structures
- Memory allocators aware of the distance to memory
- Lock splitting
- Cache line optimization
- Memory allocation control from user space
- Sharing is a problem
  - Local memory is best
  - Tolerating larger distances means larger systems become possible
  - The bigger the system, the smaller the portion of memory that is local.
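The "per-CPU areas" and "cache line optimization" items above share one idea: give each processor private data, padded so that two processors never update the same cache line. A hypothetical userspace sketch (threads stand in for CPUs; the 64-byte line size is an assumption):

```c
/* Sketch: per-thread counters padded to a cache line to avoid false sharing. */
#include <pthread.h>
#include <string.h>

#define NTHREADS  4
#define ITERS     100000
#define CACHELINE 64                  /* assumption: 64-byte cache lines */

struct percpu_counter {
    long count;
    char pad[CACHELINE - sizeof(long)];   /* keep neighbors off this line */
} __attribute__((aligned(CACHELINE)));

static struct percpu_counter counters[NTHREADS];

static void *worker(void *arg) {
    struct percpu_counter *c = arg;
    for (int i = 0; i < ITERS; i++)
        c->count++;                   /* no lock: the cache line is private */
    return NULL;
}

long percpu_total(void) {
    pthread_t t[NTHREADS];
    memset(counters, 0, sizeof(counters));
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    long total = 0;                   /* summing is the only shared step */
    for (int i = 0; i < NTHREADS; i++)
        total += counters[i].count;
    return total;
}
```

Updates stay local and uncontended; only the rare summation touches every counter, which is how per-CPU statistics in Linux behave.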
Single Processor System

- All computation on a single processor
- The only parallelism that needs to be managed is with the I/O subsystem
- Memory is slow compared to the processor
- The speed of the system depends on the effectiveness of the cache
- All memory accesses have the same performance

[Diagram: a processor connected through cachelines to memory, with an attached I/O subsystem]
Symmetric Multi-Processing (SMP)

- Multiple processors
- New need for synchronization between processors
- Cache control issues
- Performance enhancement through multiple processors working independently
- Cacheline contention
- Data layout challenges: shared vs. processor-local
- All memory accesses have the same performance

[Diagram: two CPUs sharing memory through their cachelines, with an attached storage subsystem]
Non-Uniform Memory Architecture (NUMA)

- Multiple SMP-like systems called "nodes"
- Memory at various distances (NUMA)
- Interconnect
- MESI-type cache coherency protocols
- SLIT tables
- Memory placement
  - Node-local (e.g. memory of node 2 for a processor on node 3)
  - Device-local

[Diagram: four nodes, each with two CPUs and local memory reached through
cachelines, joined by a NUMA interconnect; a storage subsystem and a network
interface attach to the interconnect; memory on the other nodes is remote]
Allocators for a Uniform Memory Architecture

- Page chunks
  - Page allocator
  - Anonymous memory
  - File-backed memory
  - Swapping
- Slab allocator
- Device DMA allocator
- Page cache
  - read() / write()
  - Mmapped I/O

[Diagram: the page allocator supplies page chunks to the page cache and its
buffers, the slab allocator, vmalloc, anonymous pages, process memory, the
kernel core, device drivers, and the PCI subsystem]
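The slab allocator named above can be illustrated with a toy userspace sketch (not kernel code; names like `cache_create` are invented for this example): carve a page-sized chunk into fixed-size objects and hand them out from a free list, roughly the way `kmem_cache_alloc()`/`kmem_cache_free()` manage kernel objects.

```c
/* Toy slab-style object cache: one "page" split into equal objects. */
#include <stddef.h>
#include <stdlib.h>

#define SLAB_SIZE 4096   /* one "page" */

struct slab_cache {
    size_t obj_size;
    void  *freelist;     /* linked list threaded through the free objects */
    char  *slab;         /* backing page */
};

struct slab_cache *cache_create(size_t obj_size) {
    if (obj_size < sizeof(void *))
        obj_size = sizeof(void *);   /* room for the freelist pointer */
    struct slab_cache *c = malloc(sizeof(*c));
    c->obj_size = obj_size;
    c->slab = malloc(SLAB_SIZE);
    c->freelist = NULL;
    /* Thread every object onto the free list. */
    for (size_t off = 0; off + obj_size <= SLAB_SIZE; off += obj_size) {
        void *obj = c->slab + off;
        *(void **)obj = c->freelist;
        c->freelist = obj;
    }
    return c;
}

void *cache_alloc(struct slab_cache *c) {
    void *obj = c->freelist;
    if (obj)
        c->freelist = *(void **)obj; /* pop the head object */
    return obj;
}

void cache_free(struct slab_cache *c, void *obj) {
    *(void **)obj = c->freelist;     /* push back onto the list */
    c->freelist = obj;
}
```

Allocation and free are both a pointer swap, and a just-freed object is handed out again first, which keeps it cache-hot; that locality is the main reason the real slab allocator exists.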
NUMA Allocators

- Memory management per node
- Memory state and allocation possibilities
- Traversal of the zonelist (or nodelist)
- Process location vs. memory allocation
  - Scheduler interactions
  - Predicting memory use?
- Memory load balancing
  - Support for shifting the memory load
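The zonelist traversal above can be sketched as follows (a hedged model, not the kernel's data structures; node count, distances, and names are illustrative): each node holds a list of all nodes ordered by distance, and an allocation walks that list until it finds a node with free pages, preferring local memory and falling back to ever more distant nodes.

```c
/* Model of zonelist fallback: allocate locally, fall back by distance. */
#define NODES 4

static int free_pages[NODES];

/* zonelist[n] = nodes in preferred (closest-first) order for node n. */
static const int zonelist[NODES][NODES] = {
    {0, 1, 2, 3},
    {1, 0, 3, 2},
    {2, 3, 0, 1},
    {3, 2, 1, 0},
};

/* Allocate one page for a process running on `node`; return the node the
 * page came from, or -1 if every node is exhausted. */
int alloc_page_node(int node) {
    for (int i = 0; i < NODES; i++) {
        int candidate = zonelist[node][i];
        if (free_pages[candidate] > 0) {
            free_pages[candidate]--;   /* take a page from this node */
            return candidate;
        }
    }
    return -1;                         /* out of memory everywhere */
}
```

The interaction with the scheduler mentioned above follows directly from this model: where a process runs determines which zonelist its allocations walk, so placement and allocation policy cannot be tuned independently.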