Extreme High Performance Computing
or Why Microkernels Suck

Christoph Lameter, Ph.D.
christoph@lameter.com
Technical Lead Linux Kernel Software
Silicon Graphics, Inc.

Ottawa Linux Symposium
2007-06-30


Introduction

Short intro to High Performance Computing
How high does Linux currently scale?
Conceptual comparison: microkernel vs. monolithic OS (Linux)
Fundamental scaling problems of a microkernel-based architecture
Monolithic kernels are also modular
Why does Linux scale so well and adapt to ever larger and more complex machines?
Current issues
Conclusion: the microkernel is an idea taken to unhealthy extremes.



Applications of High Performance Computing

Solve complex, computationally expensive problems

Scientific Research
  Physics (quantum mechanics, nuclear phenomena)
  Cosmology
  Space
  Biology (gene analysis, viruses, bacteria, etc.)

Simulations
  Weather (hurricanes)
  Study of molecules and new substances
  Complex data analysis

3D design
  Interactive modeling (e.g. car design, aircraft design)
  Structural analysis



Dark Matter Halo Simulation for the Milky Way



Black Hole Simulation



Carbon Nanotube-polymer composite material



Forecast of Hurricane Katrina



Airflow Simulations



High Performance Computer Architectures

Supercomputer
  Single memory space
  NUMA architecture: memory nodes / distant memory
  Challenge: scaling the operating system

Cluster
  Multiple memory spaces
  Networked commodity servers
  Network communication critical for performance
  Challenge: redesigning applications for a cluster

Mainframe
  Single uniform memory space with multiple processors
  Scalable I/O subsystem
  Mainly targeted at I/O transactions
  Reliable and maintainable (24x7 availability)



NASA Columbia Supercomputer with 10240 processors



Current Maximum Scaling of a Single Linux Kernel

This is not a cluster
  Single address space
  Processes communicate using shared memory

Currently deployed configurations
  Single kernel boots 1024 processors
  8 Terabytes of main memory
  10 GB/sec I/O throughput

Known working configurations
  4096 processors
  256 TB of memory

Next-generation platform
  16384 processors
  4-8 Petabytes (1 PB = 2^50 bytes) of memory



Monolithic kernel vs. microkernel

[Diagram: in the monolithic design the Application runs on top of a single Monolithic OS that sits directly on the Hardware; in the microkernel design the Application communicates with separate Server and Driver components layered above the Hardware.]



Microkernels vs. Monolithic

Microkernel claims
  Essential to deal with scalability issues
  Allows a better-designed system
  Essential to deal with the complexity of large operating systems
  Makes the system work reliably

However
  Large-scale microkernel systems do not exist
  Research systems exist up to 24p (unconfirmed rumors of 64p)
  IPC overhead vs. a monolithic kernel's function calls (see the sketch below)
  Need for context switches within the kernel
  Message transfer issues
  Significant effort is spent on optimizing around these.
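The IPC-overhead bullet can be made concrete with a minimal userspace sketch (not from the talk): a trivial "service" is invoked once as a plain function call and once as a message round trip to a separate server process over pipes, a deliberately simplified stand-in for microkernel IPC.

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Monolithic style: the "service" is an ordinary function call. */
    static long add_one(long x) { return x + 1; }

    int main(void)
    {
        int to_srv[2], to_cli[2];
        long v = 0;

        pipe(to_srv);
        pipe(to_cli);

        if (fork() == 0) {                    /* "server" process */
            long req;
            close(to_srv[1]);                 /* keep only the ends we use */
            while (read(to_srv[0], &req, sizeof(req)) == sizeof(req)) {
                long rep = add_one(req);      /* the actual work is trivial */
                write(to_cli[1], &rep, sizeof(rep));
            }
            _exit(0);
        }

        /* Microkernel style: every request is a message round trip,
         * i.e. two copies and at least two context switches. */
        for (int i = 0; i < 100000; i++) {
            write(to_srv[1], &v, sizeof(v));
            read(to_cli[0], &v, sizeof(v));
        }
        printf("via IPC:            %ld\n", v);

        /* Monolithic style: the same work as direct function calls. */
        v = 0;
        for (int i = 0; i < 100000; i++)
            v = add_one(v);
        printf("via function calls: %ld\n", v);

        close(to_srv[1]);                     /* server sees EOF and exits */
        wait(NULL);
        return 0;
    }

Timing the two loops (for example with time ./a.out) shows the per-request cost gap that the optimization effort mentioned above tries to narrow.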



Isolation vs. Integration

Microkernel isolates kernel components
  More resilient to failures
  Defined APIs between the components of a kernel

Monolithic OS
  Large, potentially complex code
  Universal access to data
  API implicitly established by the function call convention

Difficulty of keeping application state in microkernels
Performance issues from not having direct access to relevant data of other subsystems

Monolithic OSes like Linux also have isolation methods
  Source code modularization
  Binary modules (see the sketch below)
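As a concrete example of the "binary modules" point, a minimal sketch of a loadable Linux kernel module using the standard module_init/module_exit interface; the names and messages are invented for illustration:

    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");
    MODULE_DESCRIPTION("Minimal example of a loadable module in a monolithic kernel");

    static int __init example_init(void)
    {
        printk(KERN_INFO "example module loaded\n");
        return 0;
    }

    static void __exit example_exit(void)
    {
        printk(KERN_INFO "example module unloaded\n");
    }

    module_init(example_init);
    module_exit(example_exit);

Once loaded with insmod, such a module runs in the same address space as the rest of the kernel and calls exported kernel functions directly; isolation comes from source organization and convention, not from the hardware protection boundaries a microkernel uses.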



APIs

A monolithic kernel has flexible internal APIs if, as in Linux, no binary APIs have to be supported.

A microkernel must try to standardize its APIs to ensure that operating system components can be replaced.

Thus a monolithic kernel can evolve faster than a microkernel.



Competing Technologies within a Monolithic Kernel

A variety of locks can be used to architect synchronization methods (see the sketch below):
  Atomic operations
  Reference counts
  Read-Copy-Update (RCU)
  Spinlocks
  Semaphores

New approaches to locking are frequently introduced to solve particularly hard problems.
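A hedged kernel-style sketch (not from the talk) of three of these primitives side by side: a spinlock-protected counter, an atomic reference count, and an RCU-protected pointer. The structure and field names are invented; the locking calls themselves are the standard Linux interfaces.

    #include <linux/spinlock.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* 1. Spinlock: short critical sections; the holder may not sleep. */
    static DEFINE_SPINLOCK(stats_lock);
    static unsigned long event_count;

    static void count_event(void)
    {
        spin_lock(&stats_lock);
        event_count++;
        spin_unlock(&stats_lock);
    }

    /* 2. Atomic reference count: no lock needed for the counter itself. */
    struct session {
        atomic_t refs;
        /* ... payload ... */
    };

    static void session_put(struct session *s)
    {
        if (atomic_dec_and_test(&s->refs))
            kfree(s);
    }

    /* 3. RCU: readers take no lock at all; writers publish a new copy. */
    static struct session *current_session;

    static void reader(void)
    {
        struct session *s;

        rcu_read_lock();
        s = rcu_dereference(current_session);
        /* ... read-only use of s ... */
        rcu_read_unlock();
    }

    static void writer(struct session *newp)
    {
        struct session *old = current_session;

        rcu_assign_pointer(current_session, newp);
        synchronize_rcu();      /* wait until all readers are done with old */
        kfree(old);
    }

Each primitive trades generality for cost: the spinlock serializes everyone, the atomic count serializes only the counter word, and RCU removes the reader-side cost entirely at the price of a more involved update path.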



Scaling up Linux

Per-CPU areas
Per-node structures
Memory allocators aware of the distance to memory
Lock splitting
Cache line optimization
Memory allocation control from user space

Sharing is a problem
  Local memory is best
  Larger distances mean larger systems are possible
  The bigger the system, the smaller the portion of local memory.

A sketch of per-CPU data and node-local allocation follows below.
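A hedged kernel-style sketch (not in the slides) of two of the techniques above; the variable and function names are invented, but DEFINE_PER_CPU, get_cpu_var/put_cpu_var, kmalloc_node and numa_node_id are the standard Linux interfaces:

    #include <linux/percpu.h>
    #include <linux/cpumask.h>
    #include <linux/slab.h>
    #include <linux/topology.h>

    /* Per-CPU counter: each processor updates its own copy on its own
     * cache line, so the hot path needs no lock and bounces no lines. */
    static DEFINE_PER_CPU(unsigned long, local_hits);

    static void note_hit(void)
    {
        get_cpu_var(local_hits)++;   /* disables preemption, uses this CPU's copy */
        put_cpu_var(local_hits);
    }

    static unsigned long total_hits(void)
    {
        unsigned long sum = 0;
        int cpu;

        for_each_online_cpu(cpu)     /* the slow path walks all copies */
            sum += per_cpu(local_hits, cpu);
        return sum;
    }

    /* Node-local allocation: place the object on the NUMA node the
     * caller is currently running on, so later accesses stay local. */
    static void *alloc_local_buffer(size_t size)
    {
        return kmalloc_node(size, GFP_KERNEL, numa_node_id());
    }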



Single Processor System

All computation on a single processor
The only parallelism that needs to be managed is with the I/O subsystem
Memory is slow compared to the processor
The speed of the system depends on the effectiveness of the cache
All memory accesses have the same performance

[Diagram: processor with its cachelines, memory, and the I/O subsystem]



Symmetric Multi-Processing (SMP)

Multiple processors
New need for synchronization between processors
Cache control issues
Performance enhancement through multiple processors working independently
Cacheline contention (see the sketch below)
Data layout challenges: shared vs. processor-local
All memory accesses have the same performance

[Diagram: CPU 1 and CPU 2 with their cachelines sharing one memory and a storage subsystem]
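To make the cacheline-contention and data-layout points concrete, a small userspace C sketch (not from the talk): two threads increment counters that either share a cache line or are padded onto separate lines. The 64-byte line size and the structure names are assumptions for illustration.

    #include <pthread.h>
    #include <stdio.h>

    #define ITERATIONS 50000000L

    /* Shared layout: both counters live in the same cache line, so every
     * increment on one CPU invalidates the line in the other CPU's cache. */
    struct shared_layout {
        long a;
        long b;
    } shared;

    /* Processor-local layout: each counter gets its own cache line. */
    struct padded_counter {
        long value;
        char pad[64 - sizeof(long)];    /* assume 64-byte cache lines */
    } __attribute__((aligned(64)));
    struct padded_counter local[2];

    static void *bump_shared_a(void *arg) { for (long i = 0; i < ITERATIONS; i++) shared.a++; return NULL; }
    static void *bump_shared_b(void *arg) { for (long i = 0; i < ITERATIONS; i++) shared.b++; return NULL; }
    static void *bump_local_0(void *arg)  { for (long i = 0; i < ITERATIONS; i++) local[0].value++; return NULL; }
    static void *bump_local_1(void *arg)  { for (long i = 0; i < ITERATIONS; i++) local[1].value++; return NULL; }

    static void run(void *(*f0)(void *), void *(*f1)(void *), const char *name)
    {
        pthread_t t0, t1;

        pthread_create(&t0, NULL, f0, NULL);
        pthread_create(&t1, NULL, f1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        printf("%s done\n", name);      /* compare wall time of the two runs */
    }

    int main(void)
    {
        run(bump_shared_a, bump_shared_b, "contended (same cache line)");
        run(bump_local_0, bump_local_1, "independent (padded)");
        return 0;
    }

Built with -pthread, the padded run typically finishes much faster on a multiprocessor, which is exactly why shared vs. processor-local data layout matters.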



Non-Uniform Memory Architecture (NUMA)

• Multiple SMP-like systems called "nodes"
• Memory at various distances (NUMA)
• Interconnect
• MESI-type cache coherency protocols
• SLIT tables (System Locality Information Table, describing node distances)
• Memory placement (see the sketch below)
  • Node local (e.g. memory of node 2 used by processor 3 in the diagram)
  • Device local

[Diagram: four nodes, each with two CPUs, their cachelines and local memory; remote memory is reached over the NUMA interconnect; a storage subsystem and a network interface are attached]
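Memory placement can also be controlled explicitly from user space. A hedged sketch using libnuma (link with -lnuma); the node number and buffer size are arbitrary:

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t size = 64UL << 20;               /* 64 MB, arbitrary */
        void *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }

        printf("highest node: %d\n", numa_max_node());

        /* Run on node 0 and place the buffer's pages on node 0 as well,
         * so all accesses are node-local. */
        numa_run_on_node(0);
        buf = numa_alloc_onnode(size, 0);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        memset(buf, 0, size);   /* touch the memory so pages are actually placed */

        numa_free(buf, size);
        return 0;
    }

The numactl utility offers the same control without code changes, for example: numactl --cpunodebind=0 --membind=0 ./app.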



Allocators for a Uniform Memory Architecture

Page chunks
Page allocator
Anonymous memory
File-backed memory
Swapping
Slab allocator
Device DMA allocator
Page cache
read() / write()
Mmapped I/O

A usage sketch of these allocator interfaces follows the diagram.

[Diagram: relationships between the Page Allocator, Slab allocator, Vmalloc, Page Cache, Buffers, Process Memory, Anonymous Pages, Kernel Core, Device Drivers and the PCI Subsystem]
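A hedged kernel-style usage sketch (not in the slides) of the main allocator interfaces named above; the cache name and sizes are invented, error handling is omitted, and the exact kmem_cache_create signature has varied across kernel versions:

    #include <linux/gfp.h>
    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static struct kmem_cache *request_cache;

    static int allocator_examples(void)
    {
        struct page *pages;
        void *small, *obj, *big;

        /* Page allocator: 2^2 = 4 contiguous physical pages. */
        pages = alloc_pages(GFP_KERNEL, 2);

        /* Slab allocator, generic: small physically contiguous buffer. */
        small = kmalloc(256, GFP_KERNEL);

        /* Slab allocator, dedicated cache for one object type. */
        request_cache = kmem_cache_create("example_requests", 128, 0, 0, NULL);
        obj = kmem_cache_alloc(request_cache, GFP_KERNEL);

        /* vmalloc: large, virtually contiguous allocation. */
        big = vmalloc(4 << 20);

        /* ... use the memory ... */

        vfree(big);
        kmem_cache_free(request_cache, obj);
        kmem_cache_destroy(request_cache);
        kfree(small);
        __free_pages(pages, 2);
        return 0;
    }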



NUMA Allocators

Memory management per node
Memory state and possibilities of allocation
Traversal of the zonelist (or nodelist)
Process location vs. memory allocation
Scheduler interactions
Predicting memory use?
Memory load balancing
Support to shift the memory load (see the sketch below)
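The user-space side of shifting the memory load can be sketched with libnuma's policy interface (not from the talk); the buffer size and the preferred node are arbitrary assumptions:

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        size_t len = 256UL << 20;           /* 256 MB, arbitrary */
        void *buf;

        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        /* Interleave one large region across all nodes so that no single
         * node carries the whole memory load. */
        buf = numa_alloc_interleaved(len);
        memset(buf, 0, len);                /* touch pages so they are placed */

        /* Prefer node 0 for everything this process allocates afterwards;
         * the kernel falls back along the zonelist if node 0 is full. */
        numa_set_preferred(0);

        numa_free(buf, len);
        return 0;
    }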

