Low-Power High Performance Computing

Panagiotis Kritikakos

August 16, 2011

MSc in High Performance Computing

The University of Edinburgh

Year of Presentation: 2011


Abstract

The emerging development of computer systems for HPC requires a change in processor architecture. New design approaches and technologies need to be embraced by the HPC community, both to make Exascale supercomputers possible within the next two decades and to reduce the CO2 emissions of supercomputers and scientific clusters, leading to greener computing. Power is listed as one of the most important issues and constraints for future Exascale systems. In this project we build a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM (Marvell 88F6281), against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.


Contents

1 Introduction 1

1.1 Report organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background 3

2.1 RISC versus CISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 HPC Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.1 System architectures . . . . . . . . . . . . . . . . . . . . . . . 4

2.2.2 Memory architectures . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Power issues in modern HPC systems . . . . . . . . . . . . . . . . . . 9

2.4 Energy and application efficiency . . . . . . . . . . . . . . . . . . . . . 10

3 Literature review 12

3.1 Green500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Supercomputing in Small Spaces (SSS) . . . . . . . . . . . . . . . . . 12

3.3 The AppleTV Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.4 Sony Playstation 3 Cluster . . . . . . . . . . . . . . . . . . . . . . . . 13

3.5 Microsoft XBox Cluster . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.6 IBM BlueGene/Q . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.7 Less Watts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.8 Energy-efficient cooling . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.8.1 Green Revolution Cooling . . . . . . . . . . . . . . . . . . . . 15

3.8.2 Google Data Centres . . . . . . . . . . . . . . . . . . . . . . . 15

3.8.3 Nordic Research . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.9 Exascale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Technology review 19

4.1 Low-power Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1.1 ARM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1.2 Atom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.3 PowerPC and Power . . . . . . . . . . . . . . . . . . . . . . . 22

4.1.4 MIPS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5 Benchmarking, power measurement and experimentation 25

5.1 Benchmark suites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.1.1 HPCC Benchmark Suite . . . . . . . . . . . . . . . . . . . . . 25



5.1.2 NPB Benchmark Suite . . . . . . . . . . . . . . . . . . . . . . 25

5.1.3 SPEC Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 26

5.1.4 EEMBC Benchmarks . . . . . . . . . . . . . . . . . . . . . . . 26

5.2 Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.2.1 HPL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.2.2 STREAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.2.3 CoreMark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5.3 Power measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.3.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.3.2 Measuring unit power . . . . . . . . . . . . . . . . . . . . . . 29

5.3.3 The measurement procedure . . . . . . . . . . . . . . . . . . . 29

5.4 Experiments design and execution . . . . . . . . . . . . . . . . . . . . 30

5.5 Validation and reproducibility . . . . . . . . . . . . . . . . . . . . . . 31

6 Cluster design and deployment 33

6.1 Architecture support . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6.1.1 Hardware considerations . . . . . . . . . . . . . . . . . . . . . 33

6.1.2 Software considerations . . . . . . . . . . . . . . . . . . . . . 34

6.1.3 Soft Float vs Hard Float . . . . . . . . . . . . . . . . . . . . . 34

6.2 Fortran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6.3 C/C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6.4 Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6.5 Hardware decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.6 Software decisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.7 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

6.8 Porting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

6.8.1 Fortran to C . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

6.8.2 Binary incompatibility . . . . . . . . . . . . . . . . . . . . . . 40

6.8.3 Scripts developed . . . . . . . . . . . . . . . . . . . . . . . . . 41

7 Results and analysis 42

7.1 Thermal Design Power . . . . . . . . . . . . . . . . . . . . . . . . . . 42

7.2 Idle readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

7.3 Benchmark results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

7.3.1 Serial performance: CoreMark . . . . . . . . . . . . . . . . . . 44

7.3.2 Parallel performance: HPL . . . . . . . . . . . . . . . . . . . . 50

7.3.3 Memory performance: STREAM . . . . . . . . . . . . . . . . 58

7.3.4 HDD and SSD power consumption . . . . . . . . . . . . . . . 61

8 Future work 63

9 Conclusions 64

A CoreMark results 66



B HPL results 67

C STREAM results 69

D Shell Scripts 70

D.1 add_node.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

D.2 status.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

D.3 armrun.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

D.4 watt_log.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

D.5 fortran2c.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

E Benchmark outputs samples 73

E.1 CoreMark output sample . . . . . . . . . . . . . . . . . . . . . . . . . 73

E.2 HPL output sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

E.3 STREAM output sample . . . . . . . . . . . . . . . . . . . . . . . . . 75

F Project evaluation 76

F.1 Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

F.2 Work plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

F.3 Risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

F.4 Changes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

G Final Project Proposal 78

G.1 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

G.2 The work to be undertaken . . . . . . . . . . . . . . . . . . . . . . . . 78

G.2.1 Deliverables . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

G.3 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

G.4 Additional information / Knowledge required . . . . . . . . . . . . . . 79



List of Tables

6.1 Cluster nodes hardware specifications . . . . . . . . . . . . . . . . . . 36

6.2 Cluster nodes software specifications . . . . . . . . . . . . . . . . . . . 37

6.3 Network configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7.1 Maximum TDP per processor. . . . . . . . . . . . . . . . . . . . . . . 42

7.2 Average system power consumption on idle. . . . . . . . . . . . . . . . 43

7.3 CoreMark results with 1 million iterations. . . . . . . . . . . . . . . . . 44

7.4 HPL problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

7.5 HPL problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

7.6 STREAM results for 500MB array size. . . . . . . . . . . . . . . . . . 58

A.1 CoreMark results for various iterations. . . . . . . . . . . . . . . . . . 66

B.1 HPL problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

B.2 HPL problem sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

B.3 HPL results for N=500. . . . . . . . . . . . . . . . . . . . . . . . . . . 68

C.1 STREAM results for 500MB array size. . . . . . . . . . . . . . . . . . . 69



List of Figures

2.1 Single Instruction Single Data (Reproduced from Blaise Barney,

LLNL). .................................. 5

2.2 Single Instruction Multiple Data (Reproduced from Blaise Barney,

LLNL). .................................. 6

2.3 Multiple Instruction Single Data (Reproduced from Blaise Barney,

LLNL). .................................. 6

2.4 Multiple Instruction Multiple Data (Reproduced from Blaise Barney,

LLNL). ............................... 6

2.5 Distributed memory architecture (Reproduced from Blaise Barney,

LLNL). .................................. 7

2.6 Shared Memory UMA architecture (Reproduced from Blaise Barney,

LLNL). ............................... 8

2.7 Shared Memory NUMA architecture (Reproduced from Blaise Barney,

LLNL). ............................... 8

2.8 Hybrid Distributed-Shared Memory architecture (Reproduced from

Blaise Barney, LLNL). .......................... 8

2.9 Moore’s law for power consumption (Reproduced from Wu-chun Feng, LANL). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1 GRCooling four-rack CarnotJet TM system at Midas Networks (source

GRCooling) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.2 Google data-centre at Finland, next to the Finnish gulf (source Google). 16

3.3 NATO ammunition depot at Rennesøy, Norway (source Green Mountain

Data Centre AS). . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.4 Projected power demand of a supercomputer (M. Kogge) . . . . . . 18

4.1 OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko). . . . 20

4.2 Intel D525 Board with Intel Atom dual-core. . . . . . . . . . . . . . 21

4.3 IBM’s BlueGene/Q 16-core compute node (Timothy Prickett Morgan,

The Register). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.4 Pipelined MIPS, showing the five stages (instruction fetch, instruction

decode, execute, memory access and write back (Wikimedia

Commons). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.5 Motherboard with Loongson 2G processor (Wikimedia Commons). 24



5.1 Power measurement setup. . . . . . . . . . . . . . . . . . . . . . . . 29

6.1 The seven-node cluster that was built as part of this project. . . . . 38

6.2 Cluster connectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

7.1 Power readings over time. . . . . . . . . . . . . . . . . . . . . . . . . 43

7.2 CoreMark results for 1 million iterations. . . . . . . . . . . . . . . . 45

7.3 CoreMark results for 1 thousand iterations. . . . . . . . . . . . . . . 46

7.4 CoreMark results for 2 million iterations. . . . . . . . . . . . . . . . 46

7.5 CoreMark results for 1 million iterations utilising 1 thread per core. 47

7.6 CoreMark performance for 1, 2, 4, 6 and 8 cores per system. . . . . 48

7.7 CoreMark performance speedup per system. . . . . . . . . . . . . . 49

7.8 CoreMark performance on Intel Xeon. . . . . . . . . . . . . . . . . 49

7.9 Power consumption over time while executing CoreMark. . . . . . . 50

7.10 HPL results for large problem size, calculated with ACT’s script. . 52

7.11 HPL results for problem size 80% of the system memory. . . . . . . 52

7.12 HPL results for N=500. . . . . . . . . . . . . . . . . . . . . . . . . . 53

7.13 HPL total power consumption for N equal to 80% of memory. . . . 54

7.14 HPL total power consumption for N calculated with ACT’s script. . 55

7.15 HPL total power consumption for N=7296. . . . . . . . . . . . . . . 56

7.16 HPL total power consumption for N=500. . . . . . . . . . . . . . . . 56

7.17 Power consumption over time while executing HPL. . . . . . . . . . 57

7.18 STREAM results for 500MB array size. . . . . . . . . . . . . . . . 59

7.19 STREAM results for 3GB array size. . . . . . . . . . . . . . . . . . 60

7.20 Power consumption over time while executing STREAM. . . . . . . 61

7.21 Power consumption with 3.5" HDD and 2.5" SSD. . . . . . . . . . . 62



Listings

2.1 Assembly on RISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Assembly on CISC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4



Acknowledgements

I would like to thank my supervisors Mr Sean McGeever and Dr. Lorna Smith. Their guidance and help throughout the project were of great value and greatly contributed to the successful completion of this project.


Chapter 1

Introduction

With the continuous evolution of computer systems, power is becoming more and more of a constraint in modern systems, especially those targeted at supercomputing and HPC in general. The demand for continuously increasing performance requires additional processors per board, where electrical power and heat are limiting factors. This is discussed in detail in DARPA’s Exascale Computing study [1]. For the last few years there has been increasing interest in the use of GPUs in HPC, as they offer FLOP-per-Watt performance far greater than standard CPUs. Designing power-limited systems can have a negative effect on the delivered application performance, due to less powerful processors and designs not suited to the required tasks, and as a consequence reduces the scope and effectiveness of such systems. For the upcoming Exascale systems this is going to be a major issue. New design approaches need to be considered, exploiting low-power architectures and technologies that can deliver acceptable performance for HPC and other scientific applications at reasonable power levels.

The Green500 [6] list argues that the focus of high-performance systems for the past decades has been to increase performance relative to price. Increasing performance, and consequently speedup, does not necessarily mean that the system is efficient. SSS reports that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold. Clearly, we have been building less and less efficient supercomputers, thus resulting in the construction of massive data-centers, and even, entirely new buildings (and hence, leading to an extraordinarily high total cost of ownership). Perhaps a more insidious problem to the above inefficiency is that the reliability (and usability) of these systems continues to decrease as traditional supercomputers continue to follow Moore’s Law for Power Consumption." [8] [9].

Up to now, chip vendors have been following Moore’s law [8]. When more than one core is incorporated within the same chip, the clock speed per core is decreased. This is not an issue, as two cores with a reduced clock speed give better performance than a single chip with a relatively higher clock speed. Decreasing the clock speed decreases the electrical power needed, as well as the corresponding heat produced within the chip. This concept is followed in most modern multi-core chips. The idea behind low-power HPC stands on the same ground: a significant number of low-power, low-electricity-consumption chips and systems can be clustered together. This could deliver the performance required by HPC and other scientific applications in an efficient manner, in terms of both application performance and energy consumption.

Putting together nodes with low-power chips will not solve the problem right away. As these architectures are not widely used in the HPC field, the required tools, mainly compilers and libraries, might not be available or supported. An effort may be required to port them to the new architectures. Even with the tools in place, the codes themselves may require porting and optimisation as well, in order to exploit the underlying computing power. From a management perspective, every megawatt of reduced power consumption means savings of $1M per year for large supercomputers, as the IESP Roadmap reports [2]. The IESP Roadmap also reports that high-end servers (which are also used to build HPC clusters) were estimated to consume 2% of North American power as of 2006. The same report mentions that IDC (International Data Corporation) estimates that HPC systems will become the largest fraction of the high-end server market. That means the impact of the electrical power required by such systems needs to be reduced [2].

In this project we designed and built a hybrid cluster, investigating, measuring and evaluating the performance of low-power CPUs, such as the Intel Atom and ARM, against a commodity Intel Xeon CPU of the kind found in standard HPC and data-centre clusters. Three main factors are considered: computational performance and efficiency, power efficiency and porting effort.

1.1 Report organisation

This dissertation is organised in three main groups of chapters. The first group includes chapters 2 to 4, presenting background material and the literature and technology reviews. Chapter 5 forms a group of its own, discussing the benchmark suites and the benchmarks considered and used, the power measurement techniques and methods, as well as the experimentation process used throughout the project. The third group includes chapters 6 to 9, discussing the design and deployment of our hybrid low-power cluster, the results and analysis of the experiments that were conducted, suggestions for future work and, finally, the conclusions of the project.



Chapter 2

Background

In this chapter we compare RISC and CISC systems and present the system and memory architectures that can be found in HPC, explaining what each one means. In addition, we discuss the power issues in modern HPC systems and how energy efficiency relates to application efficiency.

2.1 RISC versus CISC

The majority of modern commodity processors, which are also used within the field of HPC, implement the CISC (Complex Instruction Set Computing) architecture. However, the need for energy efficiency, lower cost, multiple cores and scaling is leading to a simplification of the underlying architectures, requiring hardware vendors to develop energy-efficient, high-performance RISC (Reduced Instruction Set Computing) processors.

RISC emphasises a simple instruction set made of highly optimised, single-clock instructions and a large number of general-purpose registers. That is a better match to integrated circuits and compiler technology than complex instruction sets [3] [4]. Complex operations can be composed by the compiler, minimising the need for additional transistors. That places the emphasis on software and leaves more transistors to be used as memory registers. For instance, in assembly language, multiplying two variables and storing the result in the first variable (i.e. a=a*b) would look like the following on a RISC system (assuming 2:3 and 5:2 are memory locations).

LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A

Listing 2.1: Assembly on RISC



Each instruction - LOAD, PROD, STORE - executes in a single clock cycle, so the whole operation takes one clock cycle per instruction issued. Due to the simplicity of the operations, though, the processor performs the task relatively quickly.

CISC takes the view that hardware is always faster than software and that a multi-clock complex instruction set, adding transistors to the processor, can deliver better performance. That also minimises the number of assembly lines. A single instruction can then carry out several low-level operations, as opposed to RISC processors, where each instruction performs a single simple operation. CISC places the emphasis on hardware by implementing additional transistors to execute complex instructions. Within a CISC system, the multiplication example above requires a single line of assembly code.

MULT 2:3, 5:2

Listing 2.2: Assembly on CISC

In this case, the system must support an additional instruction, MULT. This is a complex instruction that carries out the whole multiplication directly in hardware, without the need to specify separate LOAD and STORE instructions. However, due to its complexity, the execution time would be approximately the same as that of the RISC sequence. For large, computationally intensive codes running on supercomputers with thousands of cores, the additional transistors needed to handle complex instructions can create power and heat issues and place large energy demands on the systems themselves.
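For illustration, both listings above express a single high-level statement. A minimal, hypothetical C sketch of that statement is shown below; compiling it with gcc -S (a standard GCC option) on different machines lets one compare the instruction sequences the compiler actually emits for RISC and CISC targets.

/* mult.c - the statement a = a * b that Listings 2.1 and 2.2 express in
 * RISC and CISC assembly.  Compile with "gcc -S -O0 mult.c" to obtain the
 * assembly generated for the host architecture. */
#include <stdio.h>

int main(void)
{
    volatile int a = 6, b = 7;  /* volatile stops the compiler folding the product */
    a = a * b;                  /* multiply and store the result back in a */
    printf("%d\n", a);
    return 0;
}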

Modern RISC processors have become more complex than their early versions. They implement additional, more complex instructions and can execute two instructions per clock cycle. However, when comparing modern RISC against modern CISC processors, the differences in complexity and architectural design still exist, resulting in differences in both performance and energy consumption.

2.2 HPC Architectures

In this section we present the different architectures in terms of systems and memory. Both RISC and CISC processors can belong to any of the architectures discussed below.

2.2.1 System architectures

High Performance Computing and parallel architectures were first classified by Michael J. Flynn. Flynn’s taxonomy 1 defines four classifications of architectures based on the instruction and data streams. These classifications are:

1 IEEE Trans.Comput., vol. C-21, no.9, p.948-60, Sept. 1972



• SISD - Single Instruction Single Data

• SIMD - Single Instruction Multiple Data

• MISD - Multiple Instruction Single Data

• MIMD - Multiple Instruction Multiple Data

Single Instruction Single Data: This classification defines a serial system that does not provide any form of parallelism for either stream (instruction or data). A single instruction is executed per clock cycle, taking a single data stream as input. Systems that belong to this group are old mainframes, workstations and standard single-core personal computers.

Figure 2.1: Single Instruction Single Data (Reproduced from Blaise Barney,

LLNL).

Single Instruction Multiple Data: This classification defines a type of parallel processing,

where each processor executes the same set of instructions on a different stream of

data on every clock cycle. Each instruction is issued by the front-end and each processor

can communicate with any other processor but has access only to its own memory.

Array and vector processors, as well as GPUs, belong to this group.

Multiple Instruction Single Data: This classification defines the most uncommon parallel architecture, where multiple processors execute different instructions on the same data stream on every clock cycle. This architecture can be used for fault tolerance, where different systems working on the same data stream must report the same results.

Multiple Instruction Multiple Data: This classification defines the most common parallel architecture used today. Modern multi-core desktops and laptops fall within this category. Each processor executes a different instruction stream on a different data stream on every clock cycle.

2.2.2 Memory architectures

There are two main memory architectures that can be found within HPC systems: distributed memory and shared memory. An MIMD system can be built with either memory architecture.



Figure 2.2: Single Instruction Multiple Data (Reproduced from Blaise Barney,

LLNL).

Figure 2.3: Multiple Instruction Single Data (Reproduced from Blaise Barney,

LLNL).

Figure 2.4: Multiple Instruction Multiple Data (Reproduced from Blaise Barney,

LLNL).




Distributed memory: In this architecture, each processor has its own local memory, apart from caches, and is connected to every other processor via an interconnect. This requires the processors to communicate via the message-passing programming model. This memory architecture enables the development of Massively Parallel Processing (MPP) systems. Examples of such systems include the Cray XT6, IBM BlueGene and any Beowulf cluster. Each node acts as an individual system, running its own copy of the operating system. The total memory size can be increased by adding more processors and, in theory, can grow to any size. However, performance and scalability rely on an appropriate interconnect, and adding nodes introduces system management overhead.
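As an illustration of the message-passing model that distributed-memory systems impose, the following minimal sketch (assuming an MPI implementation such as MPICH or Open MPI is installed; it is not code from this project) passes a value around a ring of processes, each of which can only see its own memory.

/* mpi_ring.c - minimal message-passing sketch for a distributed-memory system.
 * Each process owns its data and exchanges it only through explicit messages.
 * Example build and run: mpicc mpi_ring.c -o mpi_ring && mpirun -np 4 ./mpi_ring */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, next, prev, token, received;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    next = (rank + 1) % size;            /* neighbour to send to */
    prev = (rank + size - 1) % size;     /* neighbour to receive from */
    token = rank;
    received = -1;

    /* explicit communication: no process can read another's memory directly */
    MPI_Sendrecv(&token, 1, MPI_INT, next, 0,
                 &received, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d received %d from rank %d\n", rank, size, received, prev);
    MPI_Finalize();
    return 0;
}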

Figure 2.5: Distributed memory architecture (Reproduced from Blaise Barney,

LLNL).

Shared memory: In this architecture, each processor has access to a global shared memory. Communication between the processors takes place via reads and writes to that memory, using the shared-variable programming model. The most common architecture of this type is Symmetric Multi-Processing (SMP), which can be divided into two variants: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA). A UMA system is a single SMP machine in which each processor has equal access to the global memory, while a NUMA system is made by physically linking two or more SMP systems, where each system can directly access the memory of the others, although with non-uniform access times. The processors do not require message passing, but an appropriate shared-memory programming model. Example systems include IBM and Sun HPC servers and any multi-processor PC or commodity server. The system appears as a single machine to the external user and runs a single copy of the operating system. Scaling the number of processors within a single system is not trivial, as memory access becomes a bottleneck.
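As a contrast to the message-passing sketch above, the following minimal shared-variable sketch (assuming a compiler with OpenMP support, e.g. gcc -fopenmp; again not code from this project) has all threads co-operate through a single array held in shared memory.

/* omp_sum.c - minimal shared-memory sketch: threads co-operate through shared
 * data rather than messages.  Example build: gcc -fopenmp omp_sum.c -o omp_sum */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N];          /* shared array, visible to every thread */
    double sum = 0.0;
    int i;

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;          /* each thread fills its share of the array */

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += a[i];             /* the reduction avoids a race on the shared sum */

    printf("%d threads available, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}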

Hybrid Distributed-Shared Memory: This can be characterised as the most common memory architecture used in supercomputers and other clusters today. It employs both distributed and shared memory and is usually made up by interconnecting multiple SMP (UMA) systems, where each system has direct access only to its own memory and needs to send explicit messages to the other systems in order to communicate.
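A hedged sketch of how the two models are typically combined on such a hybrid system (assuming both an MPI library and OpenMP are available) is:

/* hybrid.c - sketch of the hybrid model: MPI between nodes, OpenMP within a node.
 * Example build: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, provided;

    /* request an MPI library that tolerates threaded callers */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        /* shared memory inside the node, explicit messages between nodes */
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}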



Figure 2.6: Shared Memory UMA architecture (Reproduced from Blaise Barney,

LLNL).

Figure 2.7: Shared Memory NUMA architecture (Reproduced from Blaise Barney,

LLNL).

Figure 2.8: Hybrid Distributed-Shared Memory architecture (Reproduced from

Blaise Barney, LLNL).



2.3 Power issues in modern HPC systems

Modern HPC systems and clusters are usually built from commodity multi-core systems. Connecting such systems with a fast interconnect can create supercomputers and offer the platforms that scientists and other HPC users need. The increase in speed is mainly achieved by increasing the number of cores within each system, while dropping the clock frequency of each core, and by increasing the number of systems in each cluster. The main issue with the CPU technology used today is that it is designed without power efficiency in mind, solely following Moore’s law for theoretical performance. While this has worked for Petascale systems that use such processors, it is a challenge for the design, build and deployment of supercomputers that need to achieve Exascale performance.

In order to address, and to some extent bypass, the power issues with current technology, the use of GPUs is increasing, as they offer better FLOP-per-Watt performance. Physicists, among others, suggest that Moore’s law will gradually cease to hold true around 2020 [3]. That introduces the need for new CPU technology and design, as supercomputers will no longer be able to rely on Moore’s law to increase their theoretical peak performance. Alan Gara of IBM says that "the biggest part (of energy savings) comes from the use of a new processor optimised for energy efficiency rather than thread performance". He continues that in order to achieve that, "a different approach needs to be followed for building supercomputers, and that is the use of scalable, energy-efficient processors". Other experts have addressed the power issues in a similar manner. Pete Beckman of ANL argues that "the issue of electrical power and the shift to multiple cores will dramatically change the architecture and programming of these systems". Bill Dally, chief scientist at NVIDIA, states that "an Exascale system needs to be constructed in such a way that it can run in a machine room with a total power budget not higher than what supercomputers use today". This can be achieved by improving the energy efficiency of the computing resources, closing the gap to Exascale computing at acceptable power levels.

The CPU is not the only significant power consumer in modern systems: memory, communications and storage also add greatly to the overall power consumption. Memory transistors are charged every time a specific memory cell needs to be accessed. On commodity systems, memory chips are independent components, separate from the processor (RAM, not cache memory). This increases the power cost, as an additional memory interface and bus are needed for communication between the memory and the processor. Embedded devices follow the System-on-Chip (SoC) concept, where all the components are part of the same module, reducing distances and interfaces, and hence power.

Communication between nodes, rather than between the components of a single node, also requires power. The longer the distance between systems, the more power is needed to drive the signal between them. Optical and serial links are already used to make communication faster and more efficient, which partly addresses the power issue. On the other hand, the larger a system becomes, the more communication it needs. It is therefore important to keep the distance between the independent nodes as short as possible. Decreasing the size of each node and keeping the extremes of a cluster close together can significantly reduce power needs and costs.

Figure 2.9: Moore’s law for power consumption (Reproduced from Wu-chun Feng, LANL).

Commodity storage devices such as Hard Disk Drives (HDDs) are the most common within HPC clusters, due to their simplicity, easy maintainability and relatively low cost. The target is to get a faster interconnect between the nodes and the storage pools, rather than to replace the storage devices themselves. High I/O is not very common in HPC in general, but it is very common in specific science fields that use HPC resources, such as Astronomy, Biology and Geosciences, which tend to work with enormous data-sets. Such data-intensive use-cases will increase storage demands in terms of capacity, performance and power. Physically smaller HDDs and SSDs (Solid State Drives) are becoming more common in data-intensive research and applications.

2.4 Energy and application efficiency

The driving force behind building new systems until very recently, and still for most vendors, has been to achieve the highest clock speed possible, following Moore’s law. However, it is predicted that around 2020 Moore’s law will gradually cease to hold and a replacement technology will need to be found: transistors will be so small that quantum theory or atomic physics will take over and electrons will leak out of the wires [5]. Even with today’s systems, Moore’s law does not guarantee application efficiency and certainly does not imply energy efficiency as the overall clock speed increases. On the contrary, application efficiency follows May’s law 2 , which states that software efficiency halves every 18 months, compensating for Moore’s law. The main reason behind this is that every new generation of hardware introduces new, complex hardware optimisations handled by the compiler, and compilers come up against an efficiency barrier with parallel computing. These two issues, especially that of energy efficiency, can be considered the biggest constraints on the design and development of acceptable Exascale systems in terms of performance, efficiency, consumption and cost. To address this issue, HPC vendors and institutes have started using GP-GPUs (General Purpose Graphics Processing Units) within supercomputers, to achieve high performance without adding extra high-power commodity processors, leading to hybrid supercomputers. The fastest supercomputer in the world today is a RISC system, the K computer of the RIKEN Advanced Institute for Computational Science (AICS) in Japan, using SPARC64 processors and delivering a performance of 8.62 petaflops. A petaflop is equivalent to 1,000 trillion calculations per second. This system consumes 9.89 megawatts. The second fastest supercomputer, the Tianhe-1A of the National Supercomputing Center in Tianjin, China, is a hybrid machine able to achieve 2.56 petaflops while consuming 4.04 megawatts. This is achieved by combining commodity CPUs, Intel Xeon, with NVIDIA GPUs. These numbers clearly show the difference that GPUs can make in terms of power consumption for large systems.

GPUs are able to execute specially ported code in much less time than standard CPUs, mainly due to their large number of cores and their design simplicity, delivering better performance per Watt. While a GPU can cost more in terms of instantaneous power, it performs the operations very quickly, so that over the length of a run it overcomes the cost and proves to be both more energy efficient and more application efficient than a standard CPU. In addition, it takes the processing load off the main processor, reducing the energy demands on the standard CPU. Low-power processors and low-power clusters follow the same concept, using a large number of cores with the simplicity of reduced instruction sets. We can also hypothesise, based on the increased use of GPUs and the porting of applications to these platforms, that in the future the programming models for GPUs will spread even further and GPUs will become easier to program. In that case, the standard CPU could play the role of data distributor to the GPUs, with low-power CPUs being the most suitable candidates for such a task, as they would not need to undertake computationally intensive work.

From a power consumption perspective, the systems mentioned earlier consume 9.89 and 4.04 megawatts, for the K computer and Tianhe-1A respectively. The K computer is listed in 6th position on the Green500 list. The most power-efficient supercomputer, the IBM BlueGene/Q Prototype 2 hosted at NNSA/SC, consumes 40.95 kW and achieves 2097.19 MFLOPs per Watt. It is listed in 110th position on the Top500 list, delivering 85.9 TFLOPs when executing the Linpack benchmark.
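The efficiency figures quoted above follow directly from dividing sustained performance by power drawn. A small sketch of the calculation, using the BlueGene/Q Prototype 2 numbers quoted in the text, is given below.

/* flops_per_watt.c - the MFLOPs-per-Watt metric used by the Green500 list,
 * illustrated with the BlueGene/Q Prototype 2 figures quoted in the text. */
#include <stdio.h>

int main(void)
{
    double gflops   = 85880.0;   /* sustained Linpack performance in GFLOPs */
    double power_kw = 40.95;     /* measured power draw in kW */

    /* convert GFLOPs to MFLOPs and kW to W, then divide */
    double mflops_per_watt = (gflops * 1000.0) / (power_kw * 1000.0);
    printf("%.2f MFLOPs per Watt\n", mflops_per_watt);   /* about 2097.19 */
    return 0;
}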

2 May’s Law and Parallel Software - http://www.linux-mag.com/id/8422/



Chapter 3

Literature review

In this chapter we look at projects related to low-power computing that have built and benchmarked low-power clusters.

3.1 Green500

The Green500 list is a re-ordering of the well-known TOP500 list, ranking the most energy-efficient supercomputers. Green500 raises awareness about power consumption, promotes alternative total-cost-of-ownership performance metrics, and aims to ensure that supercomputers only simulate climate change and do not create it [6]. Green500 was started in April 2005 by Dr. Wu-chun Feng at the IEEE IPDPS Workshop on High-Performance, Power-Aware Computing.

3.2 Supercomputing in Small Spaces (SSS)

The SSS project was started in 2001 by Wu-chun Feng, Michael S. Warren and Eric H. Wiegle, aiming at low-power architectural approaches and power-aware, software-based approaches. In 2002, the SSS project deployed the Green Destiny cluster, a 240-node system consuming 3.2 kW, placing it at #393 on the TOP500 list at the time.

The SSS project has been making the case that traditional supercomputers need to stop following Moore’s law for power consumption. Modern systems have been becoming less and less efficient, following May’s law, which states that software efficiency halves every 18 months. The project supports this with the fact that "from the early 1990s to the early 2000s, the performance of our n-body code for galaxy formation improved by 2000-fold, but the performance per watt only improved 300-fold and the performance per square foot only 65-fold" [9].



3.3 The AppleTV Cluster

A research team at the Ludwig-Maximilians University in Munich, Germany, has built and experimented with a low-power ARM cluster made of AppleTV devices, the AppleTV Cluster. They also evaluated another ARM-based system, a BeagleBoard xM [28]. The team used CoreMark, High Performance Linpack, Membench and STREAM to measure the CPU (serial and parallel) and memory performance of each system. The CoreMark benchmark scored 1920 and 2316 iterations per second on the BeagleBoard xM and the AppleTV respectively. On the HPL benchmark, the systems achieved 22.6 and 57.5 MFLOPs for single-precision operation; in double precision they achieved 29.3 and 40.8 MFLOPs for the BeagleBoard xM and the AppleTV respectively. The support for NEON acceleration (128-bit registers) on the BeagleBoard allowed it to achieve 33.8 MFLOPs in single-precision mode.

In terms of memory performance, the team reports copying rates of 481.1 and 749.8 MB/s for the BeagleBoard xM and the AppleTV respectively. The researchers state that a modern Intel Core i7 CPU with 800MHz DDR2 RAM (the same frequency and technology as in the ARM systems used) can deliver more than ten times the reported bandwidth [28].

The power consumption of the AppleTV cluster, which achieves an overall system performance of 160.4 MFLOPs, is 10 Watts for the whole cluster when executing the HPL benchmark and 4 Watts when idle. That results in 16 MFLOPs per Watt when fully executing the benchmark.

3.4 Sony Playstation 3 Cluster

Researchers at North Carolina State University have built a Sony PS3 cluster 3 . The Sony PS3 uses an eight-core Cell Broadband Engine processor at 3.2 GHz and 256MB of XDR RAM, suitable for SMP and MPI programming. The 9-node cluster ran a PowerPC version of Fedora Linux and achieved a total of 218 GFLOPs and 25.6 GB/s memory bandwidth. The researchers do not state any power consumption measurements. However, the power consumption of Sony PS3 consoles varies from 76 Watts up to 200 Watts in normal use, and the consoles ship with a 380 Watt power supply. The processor feature size varies from a 90nm Cell CPU down to a 45nm Cell.

3 Sony PS3 Cluster - http://moss.csc.ncsu.edu/ mueller/cluster/ps3/



3.5 Microsoft XBox Cluster

Another research team, at the University of Houston, has built a low-cost computer cluster from unmodified XBox game consoles 4 . The Microsoft XBox comes with an Intel Celeron/P3 733 MHz processor and 64MB of DDR RAM. The 4-node cluster achieved a total of 1.4 GFLOPs when executing High Performance Linpack on Debian GNU/Linux, consuming between 96 and 130 Watts. That gives a range of 10.7 to 14.58 MFLOPs per Watt. The cluster supported MPI and the Intel C++ and Fortran compilers.

3.6 IBM BlueGene/Q

In terms of high-end supercomputing projects, the IBM BlueGene/Q prototype machines aim at designing and building energy-efficient supercomputers based on embedded processors. On the latest Green500 list (June 2011), the BlueGene/Q Prototype 2 is listed as the most energy-efficient system, achieving a total of 85880 GFLOPs overall performance. That translates to 2097.19 MFLOPs per Watt, as it consumes 40.95 kW. The second most energy-efficient entry belongs to the BlueGene/Q Prototype 1, achieving 1684.20 MFLOPs per Watt. The BlueGene/Q is not yet available on the market.

3.7 Less Watts

Rising concerns over power efficiency, the drive to cut power costs and the reduction of overall CO2 emissions have pushed software vendors to look into saving power at the software level. The Open Source Technology Center of Intel Corporation has established an open source project, LessWatts.org, that aims to save power with Linux on Intel platforms. The project focuses on end users, developers and operating system vendors by delivering the components and tools that are needed to reduce the energy required by the Linux operating system 6 . The project targets desktops, laptops and commodity servers and achieves power savings by enabling, or disabling, specific features of the Linux kernel.

3.8 Energy-efficient cooling

Apart from the research into reducing the overall energy of a system by using energy-efficient processors, there has been research done, and solutions produced, to reduce the cooling needs of clusters and data-centres, which require huge amounts

4 Microsoft XBox Cluster - http://www.bgfax.com/xbox/home.html

6 Less Watts. Saving Power with Linux - http://www.lesswatts.org/



of power in total, including both the power needed by the systems and that needed by the cooling infrastructure. The main driving force behind such methods is the growing cost of keeping large systems and clusters at the correct temperature. HPC clusters require sophisticated and effective cooling infrastructure as well, and such infrastructure might use more energy than the computing systems themselves. The new cooling systems do not solve the issues of heating within the processor, the efficiency of a system, or its scalability to beyond-petaflop performance. However, they introduce an environmentally friendly cooling infrastructure, cutting maintenance costs and overall energy demands for large clusters, similar to those of supercomputers.

3.8.1 Green Revolution Cooling

Green Revolution Cooling is a US-based company that offers cooling solutions for data-centres. They use a fluid submersion technology, GreenDEF TM , that reduces the cooling energy used by clusters by 90-95% and server power usage by 10-20% [19]. While these figures are interesting for commodity servers, and even more so for cooling systems, such approaches do not target the power efficiency of the processor architecture or the basic power needs of the systems themselves. These solutions can be used with existing or future HPC clusters in order to achieve an overall low-power, and environmentally friendly, infrastructure.

Figure 3.1: GRCooling four-rack CarnotJet TM system at Midas Networks (source

GRCooling) .

3.8.2 Google Data Centres

Google has been investing in smart, innovative and efficient designs for the large data-centres it uses to provide web services to millions of users. Two of its data-centres in Europe, one in Belgium and one in Finland, do not use any air conditioning or chiller systems; instead they cool the systems using natural resources, such as the air temperature and water. In Belgium, the average air temperature is lower than the average temperature that cooling systems provide to data-centres, so it can be used to cool the systems. Moreover, as the data-centre is close to an industrial canal, the water is purified and used to cool the systems. In Finland, the facility is built next to the Gulf of Finland, enabling the low temperature of the sea water to be used to cool the data-centre [20].

Figure 3.2: Google data-centre at Finland, next to the Finnish gulf (source Google).

3.8.3 Nordic Research

Institutions, as well as industry, in Scandinavia and Iceland are investigating green, energy-efficient solutions to support large HPC and data-centre infrastructure at the lowest cost and with reduced CO2 emissions. To achieve this, projects aim to exploit abandoned mines (the Lefdal Mine Project) [19], a retired NATO ammunition depot inside mountain halls (Green Mountain Data Centre AS) [22], and new data-centres designed in remote mountain locations, close to hydro-electric power plants, for natural cooling and green energy resources [23].

A new initiative has been signed between DCSC (Denmark), UNINETT Sigma (Norway), SNIC (Sweden) and the University of Iceland for the Nordic Supercomputer, to operate in Iceland later in 2011. Iceland was chosen because its climate offers suitable natural resources for cooling such a computing infrastructure. Iceland produces 70% of its electricity from hydro, 29.9% from geothermal and only 0.1% from fossil fuels [24].



Figure 3.3: NATO ammunition depot at Rennesøy, Norway (source Green Mountain

Data Centre AS).

3.9 Exascale

The increasing number of computationally intensive problems and applications, such as weather prediction, nuclear simulation or the analysis of space data, has created the need for new computing facilities targeting Exascale performance. The IESP defines Exascale as "a system that is taken to mean that one or more key attributes of the system has 1,000 times the value of what an attribute of a Petascale system of 2010 has". Building Exascale systems with current technological trends would require huge amounts of energy, among other things such as storage rooms and cooling, to keep them running. Wilfried Verachtert, high-performance computing project manager at the Belgian research institute IMEC, argues that "the power demand for an Exascale computer made using today’s technology would keep 14 nuclear reactors running. There are a few very hard problems we have to face in building an Exascale computer. Energy is number one. Right now we need 7,000MW for Exascale performance. We want to get that down to 50MW, and that is still higher than we want."

There are two main approaches being investigated for the design and build of Exascale systems, the Low-power, Architectural Approach and the Project Aware, Software-based Approach [10], though both are still at the prototype level:

• Low-power, Architectural Approach: This is the approach we have chosen to work on in this project. Low-power, energy-efficient processors replace the standard commodity, high-power processors used in HPC clusters up to now. Using energy-efficient processors would enable system engineers to build larger systems, with larger numbers of processors, in order to achieve Exascale performance at acceptable levels. IBM’s BlueGene/Q Prototype 2 is right now the most energy-efficient, low-power supercomputer of its size, using low-power PowerPC processors [10].

The same architectural approach can be followed for other parts of the hardware: energy-efficient storage devices, efficient high-bandwidth networking and appropriate power supplies can all decrease the total footprint of each system.

• Project Aware, Software-based Approach: It is suggested by many systems researchers that the low-power architectural approach sacrifices too much performance, to a degree that is unacceptable for HPC applications. A more architecture-independent approach is therefore suggested. This involves the use of high-power CPUs that support dynamic voltage and frequency scaling, which allows the design and programming of algorithms that conserve power by scaling the processor voltage and frequency up and down as needed by the application [10]. A brief sketch of this mechanism is given below.
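The sketch below illustrates the frequency-scaling mechanism this approach relies on. It assumes a Linux system exposing the cpufreq sysfs interface with the userspace governor available (paths, governors and valid frequencies vary between kernels and platforms), and it is not part of this project’s method.

/* dvfs_sketch.c - illustrative only: lower the clock frequency of CPU 0 through
 * the Linux cpufreq sysfs interface.  Assumes the "userspace" governor is
 * available and that the program runs with sufficient privileges; sysfs paths
 * and supported frequencies differ between kernels and boards. */
#include <stdio.h>

static int write_sysfs(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* hand CPU 0 to the userspace governor, then request 800 MHz (800000 kHz) */
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor", "userspace");
    write_sysfs("/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "800000");
    return 0;
}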

The approach chosen for this project is the Low-power, Architectural Approach, as it enables the design and building of reliable, efficient HPC systems of any size and does not require any significant change to existing parallel algorithms and code. In specific designs and use-cases, a hybrid approach (a combination of both approaches) might be the golden mean between acceptable performance, power consumption, efficiency and reliability.

Figure 3.4 below presents the projected power demand of supercomputers from 2006 up to 2020. Given that the graph was compiled in 2010 and that its 2011 predictions match the current TOP500 systems, we can trust its predictions of supercomputer power demand over time, even allowing for some deviation. This justifies the need for energy-efficient supercomputers.

Figure 3.4: Projected power demand of a supercomputer (M. Kogge)



Chapter 4

Technology review

In this chapter we examine the most developed and likely candidates among low-power processors for HPC.

4.1 Low-power Architectures

The low-power processor is not a new trend in the processor business. It is, however, a new necessity in modern computer systems, especially supercomputers. Energy-efficient processors have been used for many years in embedded systems as well as in consumer electronic devices. Systems used in the HPC field have also long used low-power RISC processors such as Sun’s SPARC and IBM’s PowerPC. In this section we look into the most likely low-power processor candidates for future supercomputing systems.

4.1.1 ARM

ARM processors are widely used in many portable consumer devices, such as mobile phones and handheld organisers, as well as in networking equipment and other embedded devices such as the AppleTV. Modern ARM cores, such as the Cortex-A8 (single-core, ranging from 600MHz to 1.2GHz), the Cortex-A9 (single-core, dual-core and quad-core versions with clock speeds up to 2GHz) and the upcoming Cortex-A15 (dual-core and quad-core versions, ranging from 1GHz to 2.5GHz), are 32-bit processors using 16 registers and designed under the Harvard memory model, where the processor has two separate memories, one for instructions and one for data. This allows two simultaneous memory fetches. As ARM cores are RISC cores, they implement the simple load/store model.

The latest ARM processor in production, and available in existing systems, is the ARM Cortex-A9, using the ARMv7 architecture, which is ARM’s first-generation superscalar architecture. It is the highest-performance ARM processor, designed around an advanced, high-efficiency, dynamic-length, multi-issue superscalar, out-of-order, speculating 8-stage pipeline. The Cortex-A9 processor delivers unprecedented levels of performance and power efficiency with the functionality required for leading-edge products across a broad range of systems [9] [10]. The Cortex-A9 comes in both multi-core (MPCore) and single-core versions, making it a promising alternative for low-power HPC clusters. What ARM cores lack is a 64-bit address space, as they support only 32-bit addressing. The recent Cortex-A9 comes with an optional NEON media and floating-point processing engine, aiming to deliver higher performance for the most intensive applications, such as video encoding [11].

The Cortex-A8 also uses the ARMv7 architecture, but implements a 13-stage integer pipeline and a 10-stage NEON pipeline. The NEON support is used for accelerating multimedia as well as signal-processing applications. The default support for NEON in the Cortex-A8 reflects the fact that this processor is mainly designed for embedded devices. However, NEON technology can also be used as an accelerator for processing multiple data elements with a single instruction. This enables the ARM core to perform four multiply-accumulate operations per cycle via dual-issue to two pipelines [11]. NEON supports 64-bit and 128-bit operation and is able to perform both integer and floating-point operations.
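As a hedged sketch of how NEON can be exercised from C without hand-written assembly (assuming a GCC toolchain for ARMv7; the exact -mfpu and -mfloat-abi flags are platform dependent, and GCC may additionally need -ffast-math before it will vectorise single-precision arithmetic on NEON, since NEON is not fully IEEE compliant), a simple loop of the following form is a candidate for auto-vectorisation.

/* saxpy.c - a simple single-precision loop of the kind GCC can map onto the
 * 128-bit NEON unit when NEON code generation is enabled, e.g. with
 * gcc -O3 -mfpu=neon -mfloat-abi=softfp -ffast-math -ftree-vectorize saxpy.c */
#include <stdio.h>

#define N 4096

void saxpy(float a, const float *x, float *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];   /* candidate for NEON vectorisation */
}

int main(void)
{
    static float x[N], y[N];
    int i;
    for (i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(3.0f, x, y, N);
    printf("y[0] = %f\n", y[0]);  /* expect 5.0 */
    return 0;
}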

Commercial server manufacturers are already shipping low-power servers with ARM cores. A number of different low-cost, low-power ARM boxes and development boards are also available on the market, such as the OpenRD, DreamPlug, PandaBoard and BeagleBoard. Moreover, NVIDIA has announced the Denver project, which aims to build custom ARM-based CPU cores coupled with its GPUs, targeting both personal computers and supercomputers [9].

Figure 4.1: OpenRD board SoC with ARM (Marvell 88F6281) (Cantanko).



4.1.2 Atom

Atom is Intel’s low-power processor, aimed at laptops and at low-cost, low-power servers and desktops, with clock speeds ranging from 800MHz to 2.13GHz. It supports both 32-bit and 64-bit registers and, being an x86-based architecture, it is so far one of the most suitable alternative candidates to standard high-power processors. Server vendors already ship systems with Atom chips and, due to their low price, they can be very appealing for prototype low-power systems that do not require software alterations. Each instruction loaded into the CPU is translated into micro-operations performing a memory load and a store operation on each ALU, extending the traditional RISC design and allowing the processor to perform multiple tasks per clock cycle. The processor has a 16-stage pipeline, where each pipeline stage is broken down into three parts: decoding, dispatching and cache access [11].

The Intel Atom processor has two ALUs and two FPUs. The first ALU handles shift operations while the second handles jumps. The FPUs are used for arithmetic operations, including integer ones: the first FPU is used for addition only, while the second handles SIMD (single instruction, multiple data) operations and operations involving multiplication and division. Basic operations can be executed and completed within a single clock cycle, while the processor can take up to 31 clock cycles for more complex instructions, such as floating-point division. The newest models support Hyper-Threading technology, allowing parallel execution of two threads per core and presenting four logical cores on a dual-core system [11].

Figure 4.2: Intel D525 Board with Intel Atom dual-core.



4.1.3 PowerPC and Power

PowerPC is one of the oldest low-power RISC processor families used in the HPC field and is still used in one of the world’s fastest supercomputers, IBM’s BlueGene/P. PowerPC processors are also available in standard commercial servers for general-purpose computing, not just for HPC. They support both 32-bit and 64-bit operation. PowerPC processors are mainly found in IBM’s own systems, making them an expensive solution for low-budget projects and institutes.

The latest BlueGene/Q uses one of the latest Power processors, the A2. The PowerPC A2 is described as massively multicore and multi-threaded, with 64-bit support. Its clock speed ranges from 1.4GHz to 2.3GHz. Being a massively multicore processor, it can support up to 16 cores per chip and is 4-way multi-threaded, allowing simultaneous multithreading of up to 64 threads per processor [18]. Each chip has integrated memory and I/O controllers.

Figure 4.3: IBM’s BlueGene/Q 16-core compute node (Timothy Prickett Morgan,

The Register).

Due to its low power consumption and its flexibility, the design of the A2 is used in the PowerEN (Power Edge of Network) processor, which is a hybrid between a networking processor and a standard server processor. This type of processor is also known as a wire-speed processor, merging characteristics of network processors, such as low-power cores, accelerators, integrated network and memory I/O, smaller memory line sizes and low total power, with characteristics of standard processors, such as full ISA cores and support for standard programming models, operating systems, hypervisors and full virtualisation. Wire-speed processors are used in applications in the areas of network processing, intelligent I/O devices, distributed computing and streaming applications. The architectural emphasis on power efficiency drops the power consumption to below 50% of the initial power consumption. The large number of hardware threads delivers better throughput power-performance than a standard CPU, but with poorer single-thread performance. Power is also minimised by operating at the lowest voltage necessary to function at a specific frequency [17].



4.1.4 MIPS

MIPS is a RISC processor that is used widely in consumer devices, most popularly in the Sony PlayStation (PSX) and the Sony PlayStation Portable (PSP). Being a low-power processor, its design is based on RISC, with all instructions completing in one cycle. It supports both 32-bit and 64-bit registers and implements the Von Neumann memory architecture.

Figure 4.4: Pipelined MIPS, showing the five stages (instruction fetch, instruction decode, execute, memory access and write back) (Wikimedia Commons).

Being of RISC design, MIPS uses a fixed-length, regularly encoded instruction set built around the load/store model, which is a fundamental concept of the RISC architecture. The arithmetic and logic operations in the MIPS design use 3-operand instructions, enabling compilers to optimise the formulation of complex expressions, branch/jump options and delayed jump instructions. Floating-point registers are supported in both 32-bit and 64-bit widths, in the same way as the general purpose registers. Superscalar implementations are enabled by the absence of integer condition codes. MIPS offers flexible high-performance caches and memory management with well-defined cache control options. The 64-bit floating-point registers and the pairing of two single 32-bit floating-point operations improve the overall performance and speed up specific tasks by enabling SIMD [31] [32] [33] [34].

MIPS Technologies license their architecture designs to third parties so that they can design and build their own MIPS-based processors. The Chinese Academy of Sciences has designed the MIPS-based Loongson processor. Chinese institutes have started designing and building MIPS chips for their next generation supercomputers [8]. China's Institute of Computing Technology (ICT) has licensed the MIPS32 and MIPS64 architectures from MIPS Technologies [35].

Figure 4.5: Motherboard with Loongson 2G processor (Wikimedia Commons).

Looking at the market, commercial MIPS products do not target the server market, or the generic computing market, making it almost impossible to identify appropriate off-the-shelf systems for designing and building a MIPS low-power HPC cluster with the needed software support for HPC codes.



Chapter 5

Benchmarking, power measurement

and experimentation

In this chapter we give a brief description of the benchmarking suites we have considered and of the final benchmarks we have run.

5.1 Benchmark suites

5.1.1 HPCC Benchmark Suite

The HPCC suite consists of seven low-level benchmarks, reporting performance on floating-point operations, memory bandwidth and communication latency and bandwidth. The most common benchmark for measuring floating-point performance is Linpack, used widely for measuring the peak performance of supercomputer systems. While all of the benchmarks are written in C, Linpack builds upon the BLAS library, which is written in Fortran. In order to compile the benchmarks successfully on the ARM architecture, the GNU version of the BLAS library, which is available in C, has to be used. While the HPCC benchmarks are easy to compile and execute, they do not represent a complete HPC or scientific application. They are useful for identifying the performance of a system at a low level, but do not represent the performance of a system as a whole when executing a complete HPC application [14]. The HPCC Benchmarks are free of cost.

5.1.2 NPB Benchmark Suite

The NAS Parallel Benchmarks are developed by the NASA Advanced Supercomputing (NAS) Division. This benchmarking suite provides benchmarks for MPI, OpenMP, High-Performance Fortran and Java, as well as serial versions of the parallel codes. The suite provides 11 benchmarks, the majority of which are developed in Fortran, with only 4 written in C. Most of the benchmarks are low-level, targeting specific system operations, such as floating-point operations per second, memory bandwidth and I/O performance. Examples of full applications are provided as well for acquiring more accurate results on the performance of high performance systems [15]. The NAS Parallel Benchmarks are free of cost.

5.1.3 SPEC Benchmarks

The Standard Performance Evaluation Corporation (SPEC) provides a large variety of benchmarks, both kernel and application benchmarks, for many different systems, including MPI and OpenMP versions. The suites of interest to the HPC community are the SPEC CPU, MPI, OMP and Power benchmarks. The majority of the benchmarks represent HPC and scientific applications, allowing the overall performance of a system to be measured.

The CPU benchmarks are designed to provide performance measurements that can be used to compare computationally intensive workloads on different computer systems. The suite provides CPU-intensive codes, stressing a system's processor, memory subsystem and compiler. It provides 29 codes, where 25 are available in C/C++ and 6 in Fortran.

The MPI benchmarks are used for evaluating MPI-parallel, floating-point, compute-intensive performance across a wide range of cluster and SMP hardware. The suite provides 18 codes, where 12 are developed in C/C++ and 6 in Fortran.

The OMP benchmarks are used for evaluating floating-point, compute-intensive performance on SMP hardware using OpenMP applications. The suite provides 11 benchmarks, with only 2 of the codes available in C and 9 in Fortran.

The Power benchmark is one of the first industry-standard benchmarks used to measure the power and performance of servers and clusters in the same way as is done for performance alone. While this allows power measurements, it does not allow the performance of an HPC or other scientific application to be observed, as it uses Java server-based codes to evaluate the system's power consumption.

5.1.4 EEMBC Benchmarks

The EEMBC (Embedded Microprocessor Benchmark Consortium) provides a wide range of benchmarks for benchmarking embedded devices such as those used in networking, digital media, automotive, industrial, consumer and office equipment products. Some of the benchmarks are free of cost and open source, while others are given under licence, academic or commercial. The benchmark suites provide codes for measuring single-core and multi-core performance, power consumption, telecom/networking performance and floating-point performance, as well as various codes for different classes of consumer electronic devices.



5.2 Benchmarks

In this section we describe the benchmarks we used to evaluate the systems used in this project. These benchmarks do not represent full HPC codes, but are established and well-defined benchmarks used widely for reporting the performance of computing systems. Full HPC codes tend to take a long time to execute, which proved to be a constraint for the project in terms of the available time. That is an additional reason behind the decision to run simpler, kernel benchmarks, where the data sets can be defined by the user.

5.2.1 HPL

We use the High-Performance Linpack to measure the performance in flops of each different

system. HPL solves a random dense linear system in double precision arithmetic

either on a single or on distributed-memory systems. The algorithm used in this code

uses "a 2D block-cyclic data distribution - Right-looking variant of the LU factorisation

with row partial pivoting featuring multiple look-ahead depths - Recursive panel

factorisation with pivot search and column broadcast combined - Various virtual panel

broadcast topologies - bandwidth reducing swap-broadcast algorithm - backward substitution

with look-ahead of depth 1" [16] [17]. The results outline how long it takes

to solve the linear system and how many Mflops or Gflops are achieved during the

computational process. HPL is part of the HPCC Benchmark suite.
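To make the reported figures easier to interpret, the sketch below (in C, illustrative only and not part of HPL itself) computes a flop rate from a problem size N and a measured run time, using the 2/3*N^3 + 2*N^2 operation count that HPL applies when it reports its own Gflops figure; the example values are made up.

#include <stdio.h>

/* Estimate the HPL flop rate from the problem size N and the wall-clock
 * time in seconds, using the standard operation count 2/3*N^3 + 2*N^2
 * for LU factorisation with back substitution. */
static double hpl_gflops(double n, double seconds)
{
    double flops = (2.0 / 3.0) * n * n * n + 2.0 * n * n;
    return flops / seconds / 1.0e9;
}

int main(void)
{
    /* Illustrative numbers only, not measured results from this project. */
    printf("N = 10000, t = 600 s -> %.2f Gflops\n", hpl_gflops(10000.0, 600.0));
    return 0;
}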

5.2.2 STREAM

The STREAM benchmark is a synthetic benchmark that measures memory bandwidth and the computation rate for simple vector kernels [12]. The benchmark tests four different memory functions: copy, scale, add and triad. It reports the bandwidth in MB/s as well as the average, minimum and maximum time it takes to complete each of the operations. STREAM is part of the HPCC Benchmark suite. It can be executed either in serial or in multi-threaded mode.
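For illustration, the triad kernel at the heart of STREAM is essentially the loop sketched below (a simplified C/OpenMP version, not the official STREAM source; the array length and the scalar are arbitrary choices):

#include <stdio.h>
#include <omp.h>

#define N 2000000              /* illustrative array length (~16 MB per array) */

static double a[N], b[N], c[N];

int main(void)
{
    const double q = 3.0;      /* the scalar used by scale and triad */
    long i;

    /* Initialise (and first-touch) the arrays before timing. */
    #pragma omp parallel for
    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.5; }

    /* Triad: one multiply and one add per element, reading two arrays
     * and writing a third. */
    double t = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];
    t = omp_get_wtime() - t;

    /* Three arrays of 8-byte doubles are moved per element. */
    printf("Triad bandwidth: %.1f MB/s\n", 3.0 * sizeof(double) * N / t / 1.0e6);
    return 0;
}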

5.2.3 CoreMark

The CoreMark benchmark is developed by the Embedded Microprocessor Benchmark Consortium. It is a generic, simple benchmark targeted at the functionality of a single processing core within a system. It uses a mixture of read/write, integer and control operations, including matrix manipulation, linked-list manipulation, state machine operations and Cyclic Redundancy Check (CRC), an operation that is commonly used in embedded systems. The benchmark reports how many iterations are performed in total and per second, plus the total execution time and total processor ticks. It can be executed either in serial or in multi-threaded mode, enabling hyper-threaded cores to be evaluated more effectively. CoreMark does not represent a real application, but stresses the processor's pipeline operations, memory accesses (including caches) and integer operations [26].

5.3 Power measurement

Power measurement techniques vary and measurements can be taken at many different points in the system. Power consumption can be measured between the power supply and the electrical socket, between the motherboard (or another hardware part of the system) and the power supply, as well as between individual parts of the system. Initially, we want to measure the system as a whole. That will let us know which systems can be bought "off-the-shelf" on a best performance-per-Watt basis.

For our experiments we adopt the technique used by the Green500 to measure the power consumption of a system. That is, using a power meter between the power supply's AC input of a selected unit and a socket connected to the external power supply system. That allows us to measure the power consumption of the system as a whole. The power meter reports the power consumption of the system at any time and in any state, whether idle or running a specific code. By enabling logging of data at specific times, we can identify the power consumption at any moment it is required.

An alternative method of measuring the same form of power consumption is to use sensor-enabled software tools installed within the operating system. That requires the hardware to provide the necessary sensors. Of the systems we have used, the high-power Intel Xeon systems provided the necessary sensors and software, allowing us to use software tools on the host system to measure the power consumption. The low-power systems do not provide sensor support, preventing us from using software tools to gather their power consumption. Due to this, we have used external power meters on all of the systems, in order to treat all the readings equally by using the same method and to make the experiments fairer.

Power measurement can also be performed on individual components of the system. That would allow us to measure specifically how much power each processor consumes without being affected by any other parts of the system. With this method, we could also measure the power requirements and consumption of different parts of the system, such as the processor and the memory. While this is of great interest, and perhaps one of the best ways to qualify and quantify in detail where power is going and how it is used by each component, due to time constraints we could not invest the time and effort in this method for this project.



5.3.1 Metrics

In this project we use the same metric as the Green500 list, the "performance-per-Watt" (PPW) metric used to rank the energy efficiency of supercomputers. The metric is defined by the following equation:

    PPW = Performance / Power                                    (5.1)

Performance in equation (5.1) is defined as the maximal performance achieved on the corresponding benchmark, measured in GFLOPS (Giga FLoating-point OPerations per Second) for High Performance Linpack, MB/s (Mega Bytes per Second) for STREAM and Iter/s (Iterations per Second) for CoreMark. Power in equation (5.1) is defined as the average system power consumption during the execution of each benchmark for the given problem size, measured in Watts.
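As a worked example, using the CoreMark figures that appear later in Table 7.3 for the Intel Xeon system:

    PPW = 6636.40 iterations/s / 119 W ≈ 55.8 iterations/s per Watt

which corresponds to the PPW value reported for that system.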

5.3.2 Measuring unit power

The power measurements were performed using the Watts up? PRO ES 7 and the CREATE ComfortLINE 8 power meters. The meter is placed between the power supply's AC input of the machine to be monitored and the socket connected to the external power supply infrastructure. This reports the Watts consumed at any time. The power meter is provided with a USB interface and software that allow us to record the data we need on an external system and study them at any desired time. This methodology reflects the technique followed to submit performance results to the Green500 list [12]. The basic setup is illustrated by figure 5.1.

Figure 5.1: Power measurement setup.

5.3.3 The measurement procedure

The measurement procedure consists of nine simple steps, similar to those described in the Green500 List Power Measurement Tutorial [15].

7 Watts up? - http://www.wattsupmeters.com/

8 CREATE - The Energy Education Experts - http://www.create.org.uk/



1. Connect the power meter between the electricity socket and the physical machine.

2. Power on the meter (if required).

3. Power on the physical machine.

4. Start the power usage logger.

5. Initialise and execute the benchmark.

6. Start recording of power consumption.

7. Finish recording of power consumption.

8. Record the reported benchmark performance.

9. Load power usage data and calculate average and PPW.

Once the physical machine and the power meter are connected and both running, we initialise the execution of the benchmark and then start recording the power consumption data of the system. We use a problem size large enough to keep the fastest system busy long enough to provide a reliable recording of power usage during the execution time. That gives even more execution time on the other systems, allowing us to gather accurate power consumption data for every system we are examining. For each benchmark the problem size can vary depending on hardware limitations (e.g. memory size, storage).
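As an illustration of step 9 above, the short C sketch below averages per-second Watt readings from a plain-text log (one value per line) and derives a PPW figure from a benchmark score supplied on the command line. It is a minimal sketch under those assumptions, not the actual logging software used in this project.

#include <stdio.h>
#include <stdlib.h>

/* Average per-second power readings and report PPW = performance / power.
 * Assumed usage: ./ppw watts.log <benchmark_score>
 * Both the log format and the program name are illustrative. */
int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <watt_log> <performance>\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "r");
    if (f == NULL) {
        perror("fopen");
        return 1;
    }

    double watts, sum = 0.0;
    long samples = 0;
    while (fscanf(f, "%lf", &watts) == 1) {  /* one reading per second */
        sum += watts;
        samples++;
    }
    fclose(f);

    if (samples == 0) {
        fprintf(stderr, "no readings found\n");
        return 1;
    }

    double average = sum / samples;
    double performance = atof(argv[2]);
    printf("average power: %.2f W over %ld samples\n", average, samples);
    printf("PPW: %.2f (performance units per Watt)\n", performance / average);
    return 0;
}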

5.4 Experiments design and execution

Experimentation is the process that defines a series of experiments, or tests, that are conducted in order to discover something about a particular process or system. In other words, "experiments are used to study the performance of processes and systems" [25]. The performance of a system, though, depends on variables and factors, both controllable and uncontrollable.

The accuracy of an experiment, meaning the success of the measurement and the observation, depends on these controllable and uncontrollable variables and factors, as they can affect the results. These variables can vary in different conditions and environments. For instance, the execution of unnecessary applications while conducting the experiments is a controlled variable that can affect the experimental results negatively. The operating system's CPU scheduling algorithm, on the other hand, is not a controlled variable and can vary within the same operating system when executed on a different architecture. This plays a major role in the differentiation of the results from system to system. Similarly, the architecture of the CPU itself is an uncontrolled variable that will affect the results. The controllable factors for this project have been identified as follows:

• Execution of non-Operating System specific applications and processes.



• Installation of unnecessary software packages as that can result in additional

power consumption for unneeded services.

• Multiple uses of a system by different users.

These factors have been eliminated in order to get more representative and unaffected results. The uncontrolled factors have been identified as follows:

• Operating System scheduling algorithms.

• Operating System services/applications.

• Underlying hardware architecture and implementation.

• Network noise and delay.

From this list, the only factor that is partially controlled is the network noise and delay. We use private IPs with NAT, which prevents the machines from being contacted from outside the private network unless they issue an external call. Keeping the systems that are to be measured outside a public network eliminates the noise and delay that come over the physical wire from devices connected to that network. Finally, the technical phase of experimentation has been separated into seven different stages:

• Designing the experiments.

• Planning the experiments.

• Conducting the experiments.

• Collecting the data.

• Generating data sets.

• Generating graphs and tables.

• Results analysis

5.5 Validation and reproducibility

The validation of each benchmark is confirmed either by the validation tests that each benchmark provides, such as the residual tests in HPL, or by being accepted for publication, which was the case for the CoreMark results that are published on the CoreMark website 9. The STREAM benchmark also states at the end of each run whether it validates or not. Since all of the experiments with all of the benchmarks validated, we can claim accuracy and correctness for the results we present below.

Reproducibility is confirmed by executing each benchmark four times with the same options and definitions. The average of all of the runs is taken and presented in the results that follow. The power readings were taken once every second during the execution of each benchmark. The average is then calculated to identify the average power consumption of each system when running a specific benchmark.

9 CoreMark scores - http://www.coremark.org/benchmark/index.php?pg=benchmark



Chapter 6

Cluster design and deployment

In this chapter I discuss the hardware and software specifications of the hybrid cluster

I have designed and built as part of this project. I discuss the issues I encountered and

how I solved them.

6.1 Architecture support

6.1.1 Hardware considerations

To evaluate the performance of low-power processors effectively we need a suitable infrastructure that enables us to run the same experiments across a number of different systems, both low-power and high-power, in order to perform a comparison on equal terms. Identifying systems identical in every aspect apart from the CPU is realistically not feasible within the time and budget of this project. Therefore, the experiments are designed in such a way that enables us to measure the same software metrics. For the analysis of the results we take into consideration any important differences in the hardware that can affect the interpretation of the results.

The project experiments with different architectures, such as standard x86 [9] (i.e. Intel Xeon), RISC x86 [10] (i.e. Intel Atom) and ARM [11] (Marvell Sheeva 88F6281). These fall within a modern comparison and experimentation of CISC (Complex Instruction Set Computing) versus RISC (Reduced Instruction Set Computing) designs for HPC appliances. Each of these architectures, though, uses a different register width (i.e. 32/64-bit). For instance, both x86 architectures support 64-bit registers while ARM supports only 32-bit registers. That may prove to be an issue for scientific codes from a software performance perspective, as the same code may behave and perform differently when compiled on 32-bit and 64-bit systems.

While the registers (processor registers, data registers, address registers etc.) are one of the main differences between architectures, identical systems are very hard to build when using chips of different architectures, in terms of the other parts of the hardware. The boards have to be different, memory chips and sizes may differ, and networking support can differ as well (e.g. Megabit versus Gigabit Ethernet support). Also, different hard disk types, such as HDD versus SSD, will affect the total power consumption of a system.

6.1.2 Software considerations

Moving from the architectural differences to the software level, some tool-chains (libraries, compilers, etc.) are not identical for every architecture. For instance, the official GNU GCC ARM tool-chain is at version 4.0.3 while the standard x86 version is at 4.5.2. We solved this by using the binary distributions that come by default with the Linux distributions of specific vendors; in our case Red Hat, which ships GCC 4.1.2 with their operating system on any supported architecture. The source code can also be used to compile the needed tools, but that proves to be a time-consuming, and sometimes non-trivial, task. It might be the only way, though, of installing a specific version of the tool-chain when there is no binary compiled for the needed architecture.

The compiled Linux distributions available for ARM, such as Debian GNU/Linux and Fedora, are compiled for the ARMv5 architecture, which is an older ARM architecture than the one the latest ARM processors are based on, ARMv7. Other distributions, such as Slackware Linux, are compiled for the even older ARMv4 architecture. Using an operating system, compilers, tools and libraries that are compiled for an older architecture does not take advantage of the additional capabilities of the instruction set of the newest architecture. A simple example is the comparison between x86 and x86_64 systems. A standard x86-compiled operating system running on x86_64 hardware would not take advantage of the larger virtual and physical address spaces, preventing applications and codes from using larger data sets.

Intel Atom, on the other hand, does not have any issues with compiler, tools or software support. Being an x86-based architecture, it supports and can handle any x86 package that is available for the commodity high-power hardware used widely nowadays in scientific clusters and supercomputers.

6.1.3 Soft Float vs Hard Float

Soft floats use an FPU (Floating Point Unit) emulator at the software level, while hard floats use the hardware's FPU. As we described earlier, most modern ARM processors come with FPU support. However, in order to provide full FPU support, the required tools and libraries need to be re-compiled from scratch. Dependency packages would also need to be re-compiled, and that can include low-level libraries such as the C library. The supported Linux distributions, compilers, tools and libraries that target the ARMv5 architecture use soft floats, as ARMv5 does not come with hardware FPU support. Therefore, they are unable to take advantage of the processor's FPU and the additional NEON SIMD instructions. It is reported that recompiling the whole operating system from scratch with hard float support can increase the performance by up to 300% [27]. At the time of writing there is no distribution fully available that takes advantage of the hardware FPU, and recompiling the largest part of a distribution from scratch is beyond the scope of this project.

6.2 Fortran

The GNU ARM tool-chain provides C and C++ compilers but not a Fortran compiler. That is a limitation in itself, as it means no Fortran code can be compiled and run widely on the ARM architecture. That can be a restricting factor for many scientists and HPC system designers at this moment, as there is a great number of HPC and scientific applications written in Fortran. Specific Linux distributions, such as Debian GNU/Linux, Ubuntu and Fedora, provide their own compiled GCC packages, including Fortran support.

On a non-supported system, porting Fortran code to C can be time consuming. A way to do this is to use Netlib's f2c [22] library, which is capable of automatically porting Fortran code to C. Even when the whole code is ported to C successfully, additional work might be needed to link the MPI or OpenMP calls correctly within the C version. What is more, the f2c tool supports only Fortran 77 codes. As part of this project, we have created a basic script to automate the process of converting and compiling the original Fortran 77 code to C. Other proprietary and open-source compilers, such as G95, PathScale and PGI, do not yet provide Fortran, or other, compilers for the ARM architecture.

6.3 C/C++

The C/C++ support of the ARM architecture is perfectly acceptable and at the same level as on the other architectures. However, we have used the GNU C/C++ compiler and have not investigated any proprietary compilers. Compiler suites that are common in HPC, such as PathScale and PGI, do not support the ARM architecture. Both MPI and OpenMP are supported on all the architectures that we have used, without any need for additional software libraries or porting of the existing codes.
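As a quick illustration (a minimal sketch, not one of the project's benchmarks), a hybrid MPI/OpenMP C program such as the one below compiles unchanged with the GNU tool-chain on all of the architectures used here, e.g. with mpicc -fopenmp:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

/* Minimal hybrid MPI + OpenMP example: every MPI process opens an OpenMP
 * parallel region and each thread reports its identity. */
int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, size, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}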

6.4 Java

The Java runtime environment is supported on the ARM architecture as well, by the official Oracle Java for embedded systems for ARMv5 (Soft Float), ARMv6 (Hard Float) and ARMv7 (Hard Float). It lacks, though, the Java compiler, which would require developing and compiling the application on a system of another architecture that provides the Java compiler and then executing the resulting binary on the ARM system.

6.5 Hardware decisions

In order to evaluate the systems, the design of the cluster reflects that of a hybrid cluster, interconnecting systems of different architectures. Our cluster consists of the following machines.

Processor Memory Storage NIC Status

Intel Xeon 8-core (E5507) 16GB DDR3-1.3GHz SATA 1GigE Front-end / Gateway

Intel Xeon 8-core (E5507) 16GB DDR3-1.3GHz SATA 1GigE Compute-node 1

Intel Xeon 8-core (E5507) 16GB DDR3-1.3GHz SATA 1GigE Compute-node 2

Intel Xeon 8-core (E5507) 16GB DDR3-1.3GHz SATA 1GigE Compute-node 3

Intel Atom 2-core (D525) 4GB DDR2-800MHz SATA 1GigE Compute-node 4

Intel Atom 2-core (D525) 4GB DDR2-800MHz SATA 1GigE Compute-node 5

ARM (Marvell 88F6281) 1-core 512MB DDR2-800MHz NAND 1GigE Compute-node 6

ARM (Marvell 88F6281) 1-core 512MB DDR2-800MHz NAND 1GigE Compute-node 7

Table 6.1: Cluster nodes hardware specifications

The cluster provides access to 34 cores, 57GB of RAM and 3.6TB of storage. All of the systems, both the gateway and the compute-nodes, are connected to a single switch. The gateway has a public and a private IP and each compute-node a private IP. That enables all the nodes to communicate with each other, while the gateway allows them to access the external public network and the Internet if needed.

6.6 Software decisions

The software details of each system are outlined in the table that follows.



System OS C/C++/Fortran MPI OpenMP Java

Front-end SL 5.5 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVA 1.6

Node1 SL 5.5 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVA 1.6

Node2 SL 5.5 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVA 1.6

Node3 SL 5.5 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVA 1.6

Node4 SL 5.5 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVA 1.6

Node5 SL 5.5 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVA 1.6

Node6 Fedora 8 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVAE 1.6

Node7 Fedora 8 GCC-4.1.2 MPICH2-1.3.2 GCC-4.1.2 JAVAE 1.6

Table 6.2: Cluster nodes software specifications

The x86-based systems run Scientific Linux 5.5 x86_64, with the latest supported GNU Compiler Collection, which provides C, C++ and Fortran compilers. We have installed the latest MPICH2 version to enable programming with the message-passing model. Regarding OpenMP support, GCC provides support for shared-variable programming using OpenMP directives by specifying the correct flag at compile time. For Java, we have deployed Oracle's SDK, which provides both the compiler and the runtime environment.

Regarding the ARM systems, there are some differences. The operating system installed is Fedora 8, which belongs to the same family as Scientific Linux, both being Red Hat related projects, but this specific version is older. Deploying a more recent operating system is possible, but due to project time limitations we used the pre-installed operating system as well as its compilers and libraries. However, the GCC version is the same across all systems. MPI and OpenMP are supported by MPICH2 and GCC respectively. In relation to Java, Oracle provides an official version of the Java Runtime Environment for embedded devices. That version is the one we can run on the ARM architecture. It lacks, though, the Java compiler, allowing only the execution of pre-compiled Java applications.

The batch system used to connect the front-end and the nodes is Torque, which is based on OpenPBS 10. Torque comes with both the server and client sides of the batch system as well as with its own scheduler, which is, however, not very flexible. We did not face any issues installing and configuring the batch system across the different architectures and systems.

10 PBS Works - Enabling On-Demand Computing - http://www.pbsworks.com/



Figure 6.1: The seven-node cluster that was built as part of this project.



6.7 Networking

In terms of networking connectivity the front-end acts as a gateway to the public network and the Internet; therefore it has a public IP which can be used to access it remotely as a login-node. As the front-end needs to communicate with the nodes as well, it uses a second interface with a private IP within the network 192.168.1.0/24. Each of the compute-nodes uses a private IP on a single NAT (Network Address Translation) interface. That allows each node to communicate with every other node in the cluster as well as the front-end, which is used as a gateway when communication with the public network is needed.

Hostname IP Status

lhpc0 129.215.175.13 Gateway

lhpc0 192.168.1.1 Front-end

lhpc1 192.168.1.2 compute-node

lhpc2 192.168.1.3 compute-node

lhpc3 192.168.1.4 compute-node

lhpc4 192.168.1.5 compute-node

lhpc5 192.168.1.6 compute-node

lhpc6 192.168.1.7 compute-node

lhpc7 192.168.1.8 compute-node

Table 6.3: Network configuration

The physical connectivity between the systems is illustrated by the figure below.

Figure 6.2: Cluster connectivity



6.8 Porting

The main reason for porting an application is incompatibility between the architecture the application was initially developed for and the targeted architecture. As we have already mentioned in this report, the ARM architecture does not widely support a Fortran compiler. As a result, specific Linux distributions have to be used, or Fortran code has to be ported to C or C++, in order to run it successfully on ARM. It is not part of this project to investigate the extent to which this can be done, either for the benchmarks used or for any other HPC or scientific application.

The Intel Atom processor, being of x86 architecture, can support all the widely used HPC and scientific tools and codes. That means no porting needs to be done for any benchmark or code we wish to run on such a platform. Thus, Atom systems can be used to build low-power clusters for HPC with Fortran support. Hybrid clusters (i.e. consisting of Atom and other low-power machines) can be deployed as well. That would require the appropriate configuration of the batch system into different queues, reflecting the configuration of each group of systems. For instance, there could be a Fortran-supported queue and a generic queue for C/C++. Queues that group together systems of the same architecture can be created as well, in the same way as is already done with GPU queues and standard CPU queues on clusters and supercomputers.

6.8.1 Fortran to C

While investigating the issue of Fortran support on ARM, I came across a possible workaround for platforms that do not support Fortran. This is the f2c tool (i.e. Fortran-to-C) from the Netlib repository, which can convert Fortran code to C. There are two main issues with this tool. Firstly, f2c is developed for converting only Fortran 77 code to C. Secondly, and more relevant to HPC and scientific codes, calls to the MPI and OpenMP libraries might not be converted successfully, causing the converted C code to fail to compile even when linked correctly with the MPI and OpenMP C libraries. The f2c tool was used, for instance, to port the LAPACK library to C, and has also influenced the development of the GNU g77 compiler, which uses a modified version of the f2c runtime libraries. We think that with more effort and a closer study of f2c, it could be used to convert HPC codes directly from Fortran 77 to C.
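To give a flavour of what such a conversion involves, the sketch below shows roughly the shape of C that an f2c-style translation produces for a trivial, hypothetical Fortran 77 routine: every argument becomes a pointer, the routine name gains a trailing underscore, and the 1-based array indexing is preserved by adjusting the pointers. The real f2c output is generated against its own f2c.h header and is considerably more verbose.

#include <stdio.h>

typedef long int integer;      /* stand-ins for the typedefs normally */
typedef double doublereal;     /* supplied by f2c.h                   */

/* Hypothetical Fortran 77 original, for reference:
 *       SUBROUTINE DAXPY1(N, A, X, Y)
 *       DO 10 I = 1, N
 *  10      Y(I) = Y(I) + A * X(I)
 */
int daxpy1_(integer *n, doublereal *a, doublereal *x, doublereal *y)
{
    integer i;

    /* Pointer adjustment so x[1]..x[*n] mirror the Fortran indexing. */
    --x;
    --y;
    for (i = 1; i <= *n; ++i)
        y[i] += *a * x[i];
    return 0;
}

int main(void)
{
    integer n = 3;
    doublereal a = 2.0, x[3] = {1.0, 2.0, 3.0}, y[3] = {0.0, 0.0, 0.0};

    daxpy1_(&n, &a, x, y);     /* called with pointers, as f2c-style code expects */
    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}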

6.8.2 Binary incompatibility

Another issue with hybrid systems made of different architectures is the binary incompatibility of compiled code. A code that is compiled on an x86 system will not, in most cases, be able to execute on the ARM architecture, and vice versa, unless it is a very basic one without system calls that relate to the underlying system architecture. This is a barrier for the design and deployment of hybrid clusters, like the one we built for this project.

This architecture incompatibility requires the existence and availability of login-nodes for each architecture, so that users are able to compile their applications for the target platform. In addition, each architecture should provide its own batch system. However, in order to eliminate the need for additional systems and the added complexity of additional schedulers and queues, a single login-node can be used with specific scripts to enable code compilation for the different architectures. This single front-end, as well as the scheduler (be it the same machine or another), can have different queues for each architecture, allowing users to submit jobs to the desired platform each time without conflicts or faulty runs due to binary incompatibility.

6.8.3 Scripts developed

To ease and automate the deployment and management of the cluster, as well as the power readings, we have developed a few shell scripts as part of this project. The source of the scripts can be found in Appendix D.

• add_node.sh: Adds a new node to the batch system. It copies all the necessary files to the targeted system, starts the required services, mounts the filesystems and attaches the node to the batch pool. Usage: ./add_node.sh [architecture]

• status.sh: Reports the status of each node, i.e. whether the batch services are running or not. Usage: ./status.sh

• armrun.sh: Can be used to execute any command remotely on the ARM systems from the x86 login-node. In particular, it can be used to compile ARM-targeted code from the x86 login-node without requiring a login to an ARM system. Usage: ./armrun

• watt_log.sh: Captures power usage on Dell PowerEdge servers with IPMI sensor support. It logs the readings in a defined file, from which the average can also be calculated. Usage: ./watt_log.sh [application to monitor]

• fortran2c.sh: Converts Fortran 77 code to C using the f2c tool and generates the binary after compiling the resulting C file. Usage: ./fortran2c.sh



Chapter 7

Results and analysis

In this chapter we present and analyse the results we gathered during the experimentation process on the hybrid cluster we built during this project. We start by discussing the quoted Thermal Design Power and the idle power consumption of each system, and then go into more detail for each benchmark individually.

7.1 Thermal Design Power

Each processor vendor defines a maximum Thermal Design Power (TDP). This is the maximum amount of heat the processor's cooling system is required to dissipate and therefore the maximum power a processor is expected to use. It is expressed in Watts. Below we present the values as given by the vendors of each of the processors we used.

Processor GHz TDP Per core

Intel Xeon, 4-core 2.27 80 Watt 20 Watt

Intel Atom, 2-core 1.80 13 Watt 6.5 Watt

ARM (Marvell 88F6281) 1.2 0.87 Watt (870 mW) 0.87 Watt

Table 7.1: Maximum TDP per processor.

The Intel Xeon systems use two quad-core processors, each with a TDP of 80 Watts, giving a total maximum of 160 Watts per system. These first values alone give us a clear initial idea of the power consumption of each system. Dividing the TDP of the processor by the number of cores we get 20 Watts for each Intel Xeon core, 6.5 Watts for each Intel Atom core and just 870 mW for the ARM (Marvell 88F6281). From this, we can clearly see the difference between commodity server processors, low-power server processors and purely embedded processors. The cooling mechanism within each system is scaled up or down according to the scope of the system and the design of the processor.



7.2 Idle readings

In order to identify the power consumption of a system when it is idle (i.e. not processing), we gathered the power consumption readings without running any special software or any of the benchmarks. We measured each system for 48 hours, allowing us to get a concrete indication of how much power each system consumes in idle mode. The results are listed below.


Processor Watt

Intel Xeon, 8-core 118 Watt

Intel Atom, 2-core 44 Watt

ARM (Marvell 88F6281) 8 Watt

Table 7.2: Average system power consumption on idle.

Figure 7.1: Power readings over time.

In figure 7.1 we can see that each system tends to use relatively more power when it boots and then stabilises, keeping a constant power consumption rate over time when not executing any special software. Thus, these results reflect the power consumption of the systems while running their respective operating systems after a fresh installation, with the only additional service running being the batch system that we installed. We can also observe that the Intel Xeon system tends to increase its power usage slightly, by 1 Watt, from 118 to 119, approximately every 20 seconds, most probably due to a specific operating system service or procedure. The results justify and confirm the TDP values as given by each manufacturer, as the systems with the lowest TDP values are also those that consume the least power when idle.

7.3 Benchmark results

In this section we present and discuss the results of each benchmark individually across the various architectures and platforms on which they were executed.

7.3.1 Serial performance: CoreMark

Table 7.3 shows the results of the CoreMark benchmark. As the power consumption of the CPU drops, its efficiency increases. For instance, the Intel Xeon system performs 55.76 iterations per second per Watt consumed, Intel Atom 65.9 and ARM 206.63. In terms of power efficiency, ARM is ahead of the other two candidates. The trade-off comes in the total execution time, as the ARM, being single-core, takes 3.5x and 1.5x longer to complete the iterations than Intel Xeon and Intel Atom respectively. Intel Atom, while it consumes less than half the power of Intel Xeon, has a performance-per-Watt (PPW) that does not differ greatly from that of Intel Xeon, while taking 2.3x longer to complete the operations.

Processor Iterations/Sec Total time Usage PPW

Intel Xeon 6636.40 150.68 119 Watt 55.76 Iters.

Intel Atom 2969.70 336.73 45 Watt 65.9 Iters.

ARM (Marvell 88F6281) 1859.67 537.72 9 Watt 206.63 Iters.

Table 7.3: CoreMark results with 1 million iterations.

Calculating the total power consumption for performing the same number of iterations, the ARM (Marvell 88F6281) proves to be the most power efficient, with Intel Atom following and Intel Xeon consuming the maximum amount of power. The systems consume in total 17930 Watt, 16152 Watt and 4839 Watt for Intel Xeon, Intel Atom and ARM respectively.
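These totals appear to be the product of the average power draw and the total execution time from Table 7.3; for Intel Xeon, for example, 119 W × 150.68 s ≈ 17931, in line with the figure quoted above.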



Figure 7.2: CoreMark results for 1 million iterations.

The same differences in performance, and in power consumption as well, are observed with both smaller and larger numbers of iterations, as presented in figures 7.3 and 7.4 respectively. We also observe that the number of iterations per second remains approximately the same regardless of the total number of iterations. The total execution time increases proportionally as the total number of iterations increases. The differences in execution times between the various systems stay near the same values, with power consumption also keeping to the same levels. The results that bring the ARM system ahead of the other two candidates in terms of performance-per-Watt can be explained by the simplicity of the CoreMark benchmark, which targets integer operations.



Figure 7.3: CoreMark results for 1 thousand iterations.

Figure 7.4: CoreMark results for 2 million iterations.

The results presented so far show the performance of a single core per system. The Intel Xeon system, though, has 8 cores and the Intel Atom 4 logical cores (2 cores with Hyper-Threading on each core). The results with all the threads turned on for each system are illustrated by figure 7.5.

Figure 7.5: CoreMark results for 1 million iterations utilising 1 thread per core.

We can observe that the performance increases almost proportionally for Intel Xeon and Intel Atom, achieving in total 51516.21 and 9076.67 iterations per second, giving 432.90 and 201.7 iterations per second per Watt respectively. With these results, the ARM processor is ahead of Intel Atom by 4.93 iterations per second per Watt, and 261.65 iterations per second per Watt behind Intel Xeon, which has a significantly higher clock speed, 2.27GHz versus 1.2GHz. With these considerations in mind, as well as the fact that this ARM processor does not support 64-bit registers, we could argue that there is plenty of room for development and progress for the ARM microprocessor, as we can also see from its current developments in multi-core support and NEON acceleration.

CoreMark is not based on, and does not represent, any real application, but it allows us to draw some conclusions specifically about the performance of a single core and the CPU itself. The presented results show clearly that the CPU with the highest clock speed, and the highest architectural complexity, can achieve the highest performance, being able to perform a larger number of iterations per second in a shorter total execution time. In our experiments, Intel Xeon, which achieves the best performance, also uses the highest amount of power, both instantaneously and over the whole execution, to perform the total number of iterations. Based on the figures and results presented earlier in this section, ARM is the most efficient processor on a performance-per-Watt basis, handling integer operations very efficiently.



Looking solely at iterations per second, figures 7.6 and 7.7 show how each system performs, in terms of iterations and speedup, for the serial version as well as for 2, 4, 6 and 8 cores. Intel Xeon scales well, while Amdahl's law applies to Intel Atom and ARM once we exploit more threads than the physical number of cores. Thus, the ARM system, being a single-core machine, performs any task, serial or multi-threaded, serially.

Figure 7.6: CoreMark performance for 1, 2, 4, 6 and 8 cores per system.

Figure 7.7: CoreMark performance speedup per system.

The same rule applies to Intel Xeon. Figure 7.8 shows that the Intel Xeon system hits the performance wall once more threads are allocated than the actual number of cores on the system.

Figure 7.8: CoreMark performance on Intel Xeon.

In figure 7.9 we can see the power changes over time while benchmarking each system with the CoreMark benchmark. As can clearly be seen, the power usage throughout the execution of the benchmark on each system is stable. The low-power systems do not raise their power consumption as much as the high-power Intel Xeon system. An explanation for this can be that, in order to keep the load balanced between the processors, the system utilises more than a single core even when executing a single thread, thus requiring more power. The Intel Xeon system increases its power usage by 5.88%, Intel Atom by 0.8% and ARM by 12.5%.

Figure 7.9: Power consumption over time while executing CoreMark.

7.3.2 Parallel performance: HPL

For HPL, we used four different approaches to identify a suitable problem size for each system. The first is the rule of thumb suggested by the HPL developers, giving a problem size using nearly 80% of the total system memory 11. The second uses an automated script provided by Advanced Clustering Technologies, Inc. that calculates the ideal problem size based on the information given for the target system 12. The third uses the ideal problem size of the smallest machine on all of the systems. The fourth uses a very small problem size, to identify differences in performance depending on problem size, since larger problem sizes that do not fit in the physical memory of the system need to make use of swap memory, with a corresponding drop in performance. All the problem sizes are presented in table 7.4.

11 http://www.netlib.org/benchmark/hpl/faqs.html#pbsize

12 http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
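As a point of reference for these sizes: an N×N double-precision matrix occupies 8N² bytes, so a problem filling roughly 80% of the 16GB of a Xeon node corresponds to N ≈ sqrt(0.8 × 16×10⁹ / 8) ≈ 40000, close to the value of 41344 produced by the ACT script in table 7.4.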



Processor               Problem size  Block size  Method
Intel Xeon              13107         128         HPL
Intel Atom              3276          128         HPL
ARM (Marvell 88F6281)   409           128         HPL
Intel Xeon              41344         128         ACT
Intel Atom              20608         128         ACT
ARM (Marvell 88F6281)   7296          128         ACT
Intel Xeon              7296          128         Equal size
Intel Atom              7296          128         Equal size
ARM (Marvell 88F6281)   7296          128         Equal size
Intel Xeon              500           32          Small size
Intel Atom              500           32          Small size
ARM (Marvell 88F6281)   500           32          Small size

Table 7.4: HPL problem sizes.

Table 7.5 presents the results of the HPL benchmark for all the different problem sizes we used. Figures 7.10 and 7.11 present the results of the benchmark with problem sizes defined by the HPL rule of thumb and by ACT.

Processor               GFLOPs  Usage     PPW            Problem size
Intel Xeon              1.22    197 Watt  6.1 MFLOPs     13107
Intel Atom              4.28    55 Watt   77.81 MFLOPs   3276
ARM (Marvell 88F6281)   1.11    9 Watt    123.3 MFLOPs   409
Intel Xeon              1.21    197 Watt  6.1 MFLOPs     41344
Intel Atom              3.48    55 Watt   63.27 MFLOPs   20608
ARM (Marvell 88F6281)   1.10    9 Watt    122 MFLOPs     7296
Intel Xeon              1.21    197 Watt  6.14 MFLOPs    7296
Intel Atom              4.15    55 Watt   75.45 MFLOPs   7296
ARM (Marvell 88F6281)   1.10    9 Watt    122.2 MFLOPs   7296
Intel Xeon              7.18    197 Watt  36.44 MFLOPs   500
Intel Atom              5.46    55 Watt   99.27 MFLOPs   500
ARM (Marvell 88F6281)   1.13    9 Watt    125.5 MFLOPs   500

Table 7.5: HPL results for each problem size.



Figure 7.10: HPL results for large problem size, calculated with ACT's script.

Figure 7.11: HPL results for problem size 80% of the system memory.

The GFLOPs rate as well as the power consumption remain at the same level for both the Intel Xeon and ARM systems, while Intel Atom improves its overall performance by 800 MFLOPs and 14.54 MFLOPs per Watt when using a problem size equal to 80% of the memory. We also experimented with a smaller problem size, N = 500, which allows the systems to achieve higher performance. These results are illustrated in figure 7.12.

Figure 7.12: HPL results for N=500.

This experiment shows that Intel Xeon is capable of achieving relatively high performance for small problem sizes, while Intel Atom increases its performance by approximately 2 GFLOPs and the ARM (Marvell 88F6281) by 20 MFLOPs, staying within the same levels of performance as for the large problem sizes. Despite the increase in performance for both Intel Xeon and Intel Atom, ARM achieves the best performance-per-Watt, with 152.2 MFLOPs per Watt against 96.5 MFLOPs per Watt for Intel Atom and 37.51 MFLOPs per Watt for Intel Xeon. These results should not surprise us. In the Green500 list, the first entry belongs to the BlueGene/Q Prototype 2, which is ranked as the 110th fastest supercomputer in the TOP500 list, meaning that the fastest supercomputer is not necessarily the most power efficient, and vice versa.

For the reported performance, we must not underestimate the fact that the installed operating systems (including tools, compilers and libraries) on the ARM machines, as well as the processor design and implementation by Marvell that we used, do not support a hardware FPU and use soft float, i.e. an FPU at software level. That prevents the systems from using the NEON SIMD acceleration, from which the executed benchmarks could benefit. As there is an increased interest from both the desktop/laptop and the HPC communities in exploiting low-power chips, we can reasonably expect that hardware FPU support will be available in the near future, enabling applications to achieve higher performance and take full advantage of the underlying hardware. It is reported that NEON SIMD acceleration can increase HPL performance by 60% [28]. ARM states that NEON technology can accelerate multimedia and signal processing algorithms by at least 3x on ARMv7 [29] [30].

While the performance in GFLOPs is of course important and interesting, we must not leave aside the total execution time and the total power consumption a system needs in order to solve a problem of a given size. The CoreMark results have shown that the ARM system can achieve the best performance in terms of performance-per-Watt as well as in overall power consumption for integer operations. The results for HPL differ from this and are less clear-cut, depending on the problem size.

In figure 7.13 we can see the total power that is used by each system when solving a problem whose size is equal to 80% of the total main system memory. This experiment clearly shows that the larger the memory is, the larger the problem size required, which then leads to more total power usage. We see in this experiment that Intel Xeon uses 236250 Watt and takes 119.24 seconds to complete, Intel Atom uses 3004 Watt and takes 54.62 seconds to complete, while ARM uses 36.63 Watts in total and takes 4.07 seconds to complete, for N equal to 13107, 3276 and 409 for Intel Xeon, Intel Atom and ARM respectively.

Figure 7.13: HPL total power consumption for N equal to 80% of memory.

The great difference in problem sizes does not allow us to draw specific conclusions, either for the achieved performance in GFLOPs or for the power usage. Figure 7.14 presents the total power consumption for a given problem size calculated with the ACT script, that is for N equal to 41344 for Intel Xeon, 20608 for Intel Atom and 7296 for ARM (Marvell 88F6281). That results in 7625270 Watt and 38707.10 seconds for Intel Xeon, 919756.75 Watt and 16722.85 seconds for Intel Atom, and 211988 Watt and 23554.23 seconds for ARM. We can see here that as the problem size increases for each node, both the total power usage and the execution time increase, as expected. In this experiment we see that while Intel Atom is able to solve the linear problem faster, the ARM system is still ahead when comparing performance-per-Watt.

Figure 7.14: HPL total power consumption for N calculated with ACT's script.

In order to quantify the total power consumption in a better way, we performed another experiment with a problem size N equal to 7296 on all systems. Figure 7.15 presents the power consumption for each system. We can see that this problem size is solved relatively quickly on Intel Xeon and Intel Atom, taking 41934 Watt and 213.95 seconds for Intel Xeon and 34266.1 Watt and 623.02 seconds for Intel Atom. The ARM system uses in total 211988 Watt and takes 23554.23 seconds to solve the problem. That brings it to the bottom of power efficiency for this given problem, due to the lack of a floating-point unit at hardware level.

In order to quantify the differences in performance, we draw the same graph for the problem size N equal to 500. This problem size is rather too small to draw concrete conclusions on the performance of the Intel Xeon and Intel Atom systems, as they both solve the problem within a second, using 197 and 52 Watts respectively, while the ARM system takes 7.37 seconds consuming 66.33 Watts in total, 130.67 Watts less than Intel Xeon and 11.33 Watts more than Intel Atom. The results are illustrated by figure 7.16.



Figure 7.15: HPL total power consumption for N=7296.

Figure 7.16: HPL total power consumption for N=500.



All the results clearly show that the ARM system lags in terms of floating-point operations, although it is competitive in terms of performance-per-Watt for small floating-point problem sizes. As we have mentioned, the ARM system we used, an OpenRD Client with a Marvell Sheeva 88F6281, does not implement the FPU at hardware level, nor does it provide the NEON acceleration for SIMD processing. The underlying compilers and libraries perform the floating-point operations at software level and that is a performance drawback for the system. Intel Atom is very competitive when compared to the high-power Intel Xeon, as it can achieve reasonably high performance with relatively low power consumption.

The graph in figure 7.17 shows the power consumption over time for each system when executing the HPL benchmark. The low-power systems reach the peak of their power consumption, and keep it at a stable rate, only a few seconds after the benchmark starts executing. On the other hand, as with the CoreMark benchmark, the high-power Intel Xeon system takes approximately 10 to 15 seconds to reach its peak power consumption, which it then keeps at a stable rate during the execution of the HPL benchmark. This justifies the suggestion in the Green500 power measurement tutorial to start recording the actual power consumption 10 seconds after the benchmark has been initialised. The Intel Xeon system raises its power consumption by 56%, Intel Atom by 17.9% and ARM by 14.28%.

It is important to note here that the build-up of power consumption for real applications, and for different types of applications, might differ from that of the HPL benchmark, or any other benchmark.

Figure 7.17: Power consumption over time while executing HPL.



7.3.3 Memory performance: STREAM

Processor               Function   Rate (MB/s)   Avg. time (s)   Usage
Intel Xeon              Copy       3612.4793     0.0978          118 Watt
                        Scale      3642.3530     0.0968
                        Add        3960.9033     0.1334
                        Triad      4009.4806     0.1319
Intel Atom              Copy       2851.0365     0.1236          44 Watt
                        Scale      2282.0852     0.1543
                        Add        3033.9793     0.1742
                        Triad      2237.8844     0.2361
ARM (Marvell 88F6281)   Copy        777.8065     0.4029          8 Watt
                        Scale       190.8710     1.6398
                        Add         173.9241     2.6886
                        Triad       113.8851     4.0880

Table 7.6: STREAM results for 500MB array size.

As an overall observation, we see that the power consumption does not increase at all when performing intensive memory operations with the STREAM benchmark and a small array size. ARM proves to be the most efficient in terms of performance-per-Watt, as it copies 97.2MB per Watt consumed, while Intel Atom and Intel Xeon copy 54.8MB and 64.79MB per Watt respectively. That is, ARM is 1.7x and 1.5x more efficient in terms of performance for the actual power used. These results reflect the performance of the ARM system when using the maximum amount of memory that can be handled by the OpenRD Client, 512MB of physical memory in total. The results are presented in figure 7.18.
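For clarity, the ARM figure can be reproduced by dividing the sustained Copy rate by the measured draw; the helper function below is only a sketch of that calculation.

# Sketch: bandwidth-per-Watt as (STREAM Copy rate in MB/s) / (measured draw in Watt).
bw_per_watt() { echo "scale=1; $1 / $2" | bc; }
bw_per_watt 777.8065 8      # ARM: ~97.2MB copied per second per Watt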

We executed an additional experiment on the Intel Atom and Intel Xeon boxes with a larger array size of 3GB, nearly the maximum that can be handled by the Intel Atom system, which has 4096MB of physical memory available. This experiment showed a differentiation in the power consumption of each system, increasing the usage by 4 Watts on each, to 122 and 48 Watts on Intel Xeon and Intel Atom respectively. The identical increase on both systems may reflect the similarities they share, both being of the x86 architecture. The performance results with the 3GB array size are presented in figure 7.19. We can see that the performance of both systems stays at the same levels as with the smaller array: Intel Xeon slightly increases its performance and power efficiency on the Copy and Scale functions, from 3612MB/s to 3627MB/s and from 3642MB/s to 3670MB/s respectively, while Add and Triad decrease slightly with the larger array, from 3960MB/s to 3943MB/s and from 4009MB/s to 3991MB/s. These differences are so small that they fall within statistical error.

The performance differences between the various memory subsystems can be explained by the bandwidth interface and the frequency of each system. The Intel Xeon system uses a higher-bandwidth interface and a higher data-rate frequency (DDR3 at 1333MHz) than the other two systems (DDR2 at 800MHz). Looking more closely at the low-power systems, both use the same bandwidth interface and data-rate frequency. The large bandwidth advantage of the Intel Atom system lies in the fact that its memory subsystem is made up of two chips of 2GB each, while the ARM system uses four chips of 128MB each. This allows the Intel Atom system to fit the whole array (500MB) into a single chip, requiring less data movement. Nevertheless, the ARM system maintains a higher performance-per-Watt than the Intel Atom system.

[Bar chart: power (Watt) and bandwidth (MB/s) per STREAM function (Copy, Scale, Add, Triad) for Intel Xeon E5507, Intel Atom D525 and the ARM system, 500MB array.]

Figure 7.18: STREAM results for 500MB array size.

[Bar chart: power (Watt) and bandwidth (MB/s) per STREAM function (Copy, Scale, Add, Triad) for Intel Xeon E5507 and Intel Atom D525, 3GB array.]

Figure 7.19: STREAM results for 3GB array size.

Figure 7.20 confirms that the power consumption remains stable from one second of the sample to the next, for both of the array sizes used to stress the memory subsystem of each system. With the larger 3GB array, the Intel Xeon system increases its power consumption by 3.38% and the Intel Atom by 9.1%.
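These percentages follow directly from the measured draws (118 to 122 Watts for Intel Xeon, 44 to 48 Watts for Intel Atom); the following sketch simply restates that arithmetic.

# Sketch: percentage increase in draw from the 500MB to the 3GB array.
pct_increase() { echo "scale=2; ($2 - $1) * 100 / $1" | bc; }
pct_increase 118 122     # Intel Xeon: ~3.38%
pct_increase 44 48       # Intel Atom: ~9.09%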

[Line chart: power (Watt) over time during STREAM execution for Intel Xeon, Intel Atom and ARM.]

Figure 7.20: Power consumption over time while executing STREAM.

7.3.4 HDD and SSD power consumption

We have mentioned earlier in this work that altering components within the targeted systems could affect performance, by either increasing or decreasing it. The component that is easiest to test is the storage device. By default, the Intel Xeon and Intel Atom machines come with commodity SATA hard disk drives (with a SCSI interface in the case of Intel Xeon). We replaced the hard disk drive on one of the Intel Atom machines with a SATA solid-state drive. An SSD has no spinning platters and thus avoids the power required to spin them.

During the experiments we performed, varying power consumption was observed and we cannot identify a specific pattern, apart from the general observation that the SSD decreases the overall power consumption of the system. When idle, the system with the SSD uses 6 Watts less than the system with the HDD. On the CoreMark experiments, the SSD system consumes 3 Watts less. On STREAM, the SSD system consumes 4 Watts less, while when executing HPL the difference is 10 Watts, giving a total draw of 58 Watts. These results are illustrated in figure 7.21.

[Bar chart: power (Watt) when idle and under CoreMark, STREAM and HPL, for the HDD-based and SSD-based Intel Atom systems.]

Figure 7.21: Power consumption with 3.5" HDD and 2.5" SSD.

HDDs of smaller physical size, for instance 2.5" instead of the standard 3.5", may decrease the power consumption as well. As we did not have such a disk available, we could not confirm this hypothesis. Previous research suggests that as the physical size of the disk decreases, its power consumption decreases as well, improving the power efficiency of the whole system [36][37].

These differences in power not only reduce the running costs and improve the scalability, in terms of power, of such systems, but also allow the deployment of extra nodes that consume the power saved on the components. For instance, the maximum difference between the HDD and SSD systems is 10 Watts, which is enough for an additional ARM system, which consumes at most 9 Watts. At a larger scale, the power saved from a single component can allow the deployment of additional compute nodes that consume no more power than would otherwise have been spent on a less power-efficient component in each system.

While CoreMark, HPL and STREAM do not perform intensive I/O operations, they

allow us to measure the standalone power consumption of the SSD and compare it

against that of the HDD.



Chapter 8

Future work

Future work in this field could investigate a number of different possibilities, as outlined below:

• Real HPC/scientific applications: Real HPC and scientific applications could be executed on the existing cluster and their results used for analysis and comparison against the results presented in this dissertation.

• Modern ARM: The cluster could be extended by deploying more modern ARM systems, such as the Cortex-A9 and the upcoming Cortex-A15, which support hardware FPUs and multiple cores.

• Intensive I/O: Additional I/O-intensive benchmarks and applications could be executed to identify the power consumption of such applications, rather than of applications and codes that do not make heavy use of I/O operations.

• Detailed power measurements: More detailed power measurements could be performed by measuring each system component individually and quantifying how and where exactly power is used.

• CPUs vs. GPUs: Comparison of the performance and performance-per-Watt of low-power CPUs and GPUs.

• Parallelism: Extend the existing cluster by adding a significant number of low-power nodes to exploit more parallelism.


Chapter 9

Conclusions

This dissertation has achieved its goals: it has researched the current trends and technologies in low-power systems and techniques for High Performance Computing infrastructures and has reported on the related work in the field. We have also designed and successfully built a hybrid seven-node cluster consisting of three different systems, Intel Xeon, Intel Atom and ARM (Marvell 88F6281), providing access to 34 cores, 57GB of RAM and 3.6TB of storage, and have described the issues faced and how they were solved. The cluster environment supports programming in both the message-passing and shared-variable models; MPI, OpenMP and Java threads are supported on all of the platforms. We have experimented with, and analysed, the performance, power consumption and power efficiency of each system in the cluster.

Observing the market and the development of HPC systems, low-power processors will start to become one of the default choices in the very near future. The energy demands of large systems will require a shift towards processors and systems that consider energy in their design. Consumer electronics devices are becoming more and more powerful, as they need to execute computationally intensive applications, yet they are still designed with energy efficiency in mind.

To qualify and quantify the computational performance and efficiency, as well as the power efficiency, of each system, we ran three main benchmarks (CoreMark, High Performance Linpack and STREAM) in order to measure each system on a performance-per-Watt basis for integer operations, floating-point operations and memory bandwidth. On CoreMark, the serial integer benchmark, the ARM system achieves the best performance-per-Watt, with 206.63 iterations per Watt, against 55.76 and 56.03 iterations per Watt for Intel Xeon and Intel Atom respectively on a single thread, and 432.90 and 171.25 when utilising every thread of every core. This allows us to conclude that the ARM processor is very competitive and can achieve very high scores on integer operations, performing better than the Intel Atom, which is a dual-core processor with Hyper-Threading support, providing four logical cores.

The ARM system does not have a hardware FPU, due to its ARMv5 architecture, and so lacks performance on floating-point operations, as we can see from the HPL results: it achieves at most 1.37 GFLOPs, while Intel Xeon achieves 7.39 GFLOPs and Intel Atom 6.08 GFLOPs. In terms of power consumption, while ARM achieves the best performance-per-Watt, 152.2 MFLOPs per Watt versus 37.51 and 96.50 for Intel Xeon and Intel Atom respectively, it takes much longer to solve large problems. That introduces a high overhead in total power consumption, with the consequence that it uses more power in total than either Intel Xeon or Intel Atom.

In terms of memory performance, for small array sizes the power consumption remains at minimal levels. Larger data sizes, above 2GB, increase the consumption on Intel Xeon and Intel Atom by 4 Watts. The ARM system is able to handle only small data sizes, up to 512MB. Intel Xeon achieves the highest bandwidth as it uses DDR3 at 1333MHz, while Intel Atom and ARM use DDR2 at 800MHz. Intel Atom also scores higher than ARM, as it is able to store the maximum data set the ARM system can handle within a single memory chip, unlike ARM, which spreads it over four individual memory chips.

Individual components affect system performance and power as well. We have observed that SSD storage can reduce the power consumption by 3 to 10 Watts when compared to a standard 3.5" HDD at 7200rpm. Other components, such as a different memory subsystem, interconnect or power supply, could also affect system performance. Due to time as well as budget constraints, we did not experiment with different components for each of these subsystems.

In terms of porting and software support, all of the tested platforms support C, C++, Fortran and Java, although ARM provides only the Java Runtime Environment and not the Java compiler. Intel Atom, being an x86-based architecture (despite its RISC-like internal design), supports and is fully binary-compatible with any x86 system currently in use. ARM does not provide the same binary compatibility with existing systems, due to the architectural differences, and requires recompilation of the targeted code. What is more, ARMv5 is not capable of performing floating-point operations at the hardware level and has to use soft float instead. The latest architecture, ARMv7, provides hardware FPU functionality as well as SIMD acceleration. To take advantage of the hardware FPU and SIMD acceleration, changes need to be made at the software level as well: Linux distributions, or the needed compilers and libraries with all their dependencies, need to be recompiled for the ARMv7 architecture in order to support hardware FPUs and improve the overall system performance.
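As an illustrative sketch only, and not something we could test on our ARMv5 hardware, a rebuild targeting an ARMv7 system with a hardware FPU and NEON would typically use compiler flags along the following lines (the source file name is an example):

# Example GCC invocation for an ARMv7 target with hardware FPU and NEON;
# flags are illustrative and the file name is hypothetical.
gcc -O2 -march=armv7-a -mfpu=neon -mfloat-abi=hard -o stream stream.c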

There is an emerging interest among the HPC communities in exploiting low-power architectures and new designs to support, efficiently and reliably, the design and development of Exascale systems. This, in combination with the market developments in consumer devices from desktops to mobile phones, ensures that the functionality and performance of low-power processors will keep increasing to levels acceptable for HPC and scientific applications. The approaching end of Moore's law introduces an extra need for the development of such systems.



Appendix A

CoreMark results

Processor Iterations Iterations/Sec Total time (sec) Threads Consumption PPW

Intel Xeon 100000 6617.25 15.11 1 119 Watt 55.60

Intel Atom 100000 2954.12 33.85 1 54 Watt 54.70

ARM 100000 1859.70 53.77 1 9 Watt 206.63

Intel Xeon 1000000 6636.40 150.68 1 119 Watt 55.76

Intel Atom 1000000 2969.70 336.73 1 53 Watt 56.03

ARM 1000000 1859.67 537.72 1 9 Watt 206.63

Intel Xeon 2000000 6610.49 302.54 1 126 Watt 54.46

Intel Atom 2000000 2953.23 677.22 1 54 Watt 54.68

ARM 2000000 1861.36 1074.48 1 9 Watt 206.81

Intel Xeon 1000000 51516.21 155.89 8 119 Watt 432.90

Intel Atom 1000000 9076.67 440.69 4 53 Watt 171.25

ARM 1000000 1859.67 537.72 1 9 Watt 206.63

Table A.1: CoreMark results for various iterations.
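The PPW column appears to be the measured iterations per second divided by the measured consumption; as a minimal check, for the first row:

# Sketch: PPW = Iterations/Sec divided by consumption in Watt.
echo "scale=2; 6617.25 / 119" | bc    # ~55.60 iterations per second per Watt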



Appendix B

HPL results

Processor                                        Problem size   Block size   Method
Intel Xeon                                       13107          128          HPL
Intel Atom                                       3276           128          HPL
ARM (Marvell 88F6281)                            409            128          HPL
Intel Xeon                                       41344          128          ACT
Intel Atom                                       20608          128          ACT
ARM (Marvell 88F6281)                            7296           128          ACT
Intel Xeon, Intel Atom, ARM (Marvell 88F6281)    7296           128          Equal size
Intel Xeon, Intel Atom, ARM (Marvell 88F6281)    500            32           Small size

Table B.1: HPL problem sizes.



Processor                 GFLOPs   Usage      PPW             Problem size
Intel Xeon                1.22     197 Watt   6.1 MFLOPs      13107
Intel Atom                4.28     55 Watt    77.81 MFLOPs    3276
ARM (Marvell 88F6281)     1.11     9 Watt     123.3 MFLOPs    409
Intel Xeon                1.21     197 Watt   6.1 MFLOPs      41344
Intel Atom                3.48     55 Watt    63.27 MFLOPs    20608
ARM (Marvell 88F6281)     1.10     9 Watt     122 MFLOPs      7296
Intel Xeon                1.21     197 Watt   6.14 MFLOPs     7296
Intel Atom                4.15     55 Watt    75.45 MFLOPs    7296
ARM (Marvell 88F6281)     1.10     9 Watt     122.2 MFLOPs    7296
Intel Xeon                7.18     197 Watt   36.44 MFLOPs    500
Intel Atom                5.46     55 Watt    99.27 MFLOPs    500
ARM (Marvell 88F6281)     1.13     9 Watt     125.5 MFLOPs    500

Table B.2: HPL results for the problem sizes of Table B.1.
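Similarly, the PPW column here appears to be the achieved GFLOPs expressed in MFLOPs and divided by the measured usage; for example, for the Intel Atom run with N=3276:

# Sketch: PPW in MFLOPs per Watt = GFLOPs * 1000 / usage in Watt.
echo "scale=2; 4.28 * 1000 / 55" | bc    # ~77.81 MFLOPs per Watt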

Processor GFLOPs Usage PPW

Intel Xeon 7.39 197 Watt 37.51 MFLOPs

Intel Atom 6.08 63 Watt 96.50 MFLOPs

ARM 1.37 9 Watt 152.2 MFLOPs

Table B.3: HPL results for N=500.



Appendix C

STREAM results

Processor               Size    Function   Rate (MB/s)   Avg. time (s)   Usage
Intel Xeon              500MB   Copy       3612.4793     0.0978          118 Watt
                                Scale      3642.3530     0.0968
                                Add        3960.9033     0.1334
                                Triad      4009.4806     0.1319
Intel Atom              500MB   Copy       2851.0365     0.1236          52 Watt
                                Scale      2282.0852     0.1543
                                Add        3033.9793     0.1742
                                Triad      2237.8844     0.2361
ARM (Marvell 88F6281)   500MB   Copy        777.8065     0.4029          8 Watt
                                Scale       190.8710     1.6398
                                Add         173.9241     2.6886
                                Triad       113.8851     4.0880
Intel Xeon              3GB     Copy       3627.5380     0.5886          122 Watt
                                Scale      3670.4334     0.5816
                                Add        3943.3052     0.8120
                                Triad      3991.4984     0.8022
Intel Atom              3GB     Copy       2875.2246     0.7422          56 Watt
                                Scale      2275.0291     0.9379
                                Add        3035.8659     1.0544
                                Triad      2269.4263     1.4103

Table C.1: STREAM results for array sizes of 500MB and 3GB.



Appendix D

Shell Scripts

D.1 add_node.sh

#!/bin/bash

#

# Author: Panagiotis Kritikakos

NODE=$1

ARCH=$2

# Prepare passwordless SSH access to the new node
ssh root@${NODE} 'mkdir /root/.ssh; chmod 700 /root/.ssh'
scp /root/.ssh/id_dsa.pub root@${NODE}:.ssh/authorized_keys

# ARM nodes need their own fstab
if [ "${ARCH}" == "ARM" ] || [ "${ARCH}" == "arm" ]; then
  scp fstab.arm root@${NODE}:/etc/fstab
else
  scp fstab root@${NODE}:/etc/fstab
fi

# Distribute the common configuration files
scp hosts root@${NODE}:/etc/hosts
scp profile root@${NODE}:/etc/profile
scp mom_priv.config root@${NODE}:/var/spool/torque/mom_priv/config
scp pbs_mom root@${NODE}:/etc/init.d/.

# Mount the shared directories and start the Torque MOM daemon
ssh root@${NODE} 'mount /home'
ssh root@${NODE} 'mkdir /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} 'mount /usr/local/mpich2-1.3.2p1'
ssh root@${NODE} '/sbin/chkconfig --add pbs_mom \
  && /sbin/chkconfig --level 234 pbs_mom on'
ssh root@${NODE} '/sbin/service pbs_mom start'

# Restart the PBS server so that it picks up the new node
qterm -t quick
pbs_server
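The script takes the node's hostname and its architecture as arguments; the hostnames in the following example invocations are hypothetical:

./add_node.sh lhpc7 ARM     # ARM node: copies the ARM-specific fstab
./add_node.sh lhpc2 xeon    # any other architecture value selects the default fstab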

D.2 status.sh

#!/bin/bash

#

# Author: Panagiotis Kritikakos

# Query the Torque MOM status on every node listed in nodes.txt
for i in $(cat nodes.txt)
do
  ssh root@$i 'hostname; service pbs_mom status'
done

D.3 armrun.sh

#!/bin/bash

#

# Author: Panagiotis Kritikakos

# Forward the given command line to the ARM node for execution
ARMHOST=lhpc6
ARGUMENTS="$*"
ssh ${ARMHOST} "${ARGUMENTS}"

D.4 watt_log.sh

#!/bin/bash

#

# Author: Panagiotis Kritikakos

option=$1

logfile=$2

code=$3

# Compute the average of all samples recorded in the log file
getAvg(){
  totalwatts=$(awk '{total = total + $1} END {print total}' ${logfile})
  elements=$(wc -l < ${logfile})
  avgwatts=$(echo "${totalwatts} / ${elements}" | bc)
  printf "\n\n Average watts: ${avgwatts}\n\n"
}

# "average" mode only reports on an existing log file
if [ "${option}" == "average" ]; then
  getAvg
  exit 0
fi

if [ $# -lt 3 ] || [ $# -gt 3 ]; then
  echo " Specify logfile and code"
  exit 1
fi

if [ -e ${logfile} ]; then rm -f ${logfile}; fi

# Sample the system-level power reading once per second
# for as long as the monitored code is still running
codeis=$(ps aux | grep ${code} | grep -v grep | wc -l)
while [ ${codeis} -gt 0 ]; do
  sudo /usr/sbin/ipmi-sensors | grep -w "System Level" | \
    awk '{print $5}' | awk '{ sub("\\.*0+$",""); print }' >> ${logfile}
  sleep 1
  codeis=$(ps aux | grep ${code} | grep -v grep | wc -l)
done

getAvg
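The script is driven by a mode argument, a log file and the name of the process to watch; the file and binary names in this example invocation are illustrative only:

# Record one sample per second while the HPL binary is running, then report:
./watt_log.sh log hpl_watts.log xhpl
# Later, recompute the average from an existing log:
./watt_log.sh average hpl_watts.log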

D.5 fortran2c.sh

#!/bin/bash

# Translate a Fortran source file to C with f2c and compile the result
fortranFile=$1
fileName=$(echo $1 | sed 's/\(.*\)\..*/\1/')
f2c $fortranFile
gcc ${fileName}.c -o $fileName -lf2c
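For example, assuming a Fortran source file called matmul.f (the name is illustrative), the script produces an executable of the same base name:

./fortran2c.sh matmul.f     # translates matmul.f to matmul.c with f2c and builds ./matmul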



Appendix E

Benchmark outputs samples

E.1 CoreMark output sample

The output that follows is a sample output from an Intel Xeon system when executing

CoreMark with 100000 iterations and a single thread.

2K performance run parameters for coremark.

CoreMark Size : 666

Total ticks : 15112

Total time (secs): 15.112000

Iterations/Sec : 6617.257808

Iterations : 100000

Compiler version : GCC4.1.2 20080704 (Red Hat 4.1.2-50)

Compiler flags : -O2 -DPERFORMANCE_RUN=1 -lrt

Memory location : Please put data memory location here

(e.g. code in flash, data on heap etc)

seedcrc : 0xe9f5

[0]crclist : 0xe714

[0]crcmatrix : 0x1fd7

[0]crcstate : 0x8e3a

[0]crcfinal : 0xd340

Correct operation validated. See readme.txt for run and reporting rules.

CoreMark 1.0 : 6617.257808 / GCC4.1.2 20080704 (Red Hat 4.1.2-50) -O2

-DPERFORMANCE_RUN=1 -lrt / Heap

E.2 HPL output sample

The output that follows is a sample output from an Intel Atom system when executing

HPL with problem size N=407.

Gflops : Rate of execution for solving the linear system.



The following parameter values will be used:

N : 407

NB : 128

PMAP : Row-major process mapping

P : 1

Q : 1

PFACT : Right

NBMIN : 4

NDIV : 2

RFACT : Crout

BCAST : 1ringM

DEPTH : 1

SWAP : Mix (threshold = 64)

L1 : transposed form

U : transposed form

EQUIL : yes

ALIGN : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.

- The following scaled residual check will be computed:

||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )

- The relative machine precision (eps) is taken to be 1.110223e-16

- Computational tests pass if scaled residuals are less than 16.0

==================================================================

T/V N NB P Q Time Gflops

--------------------------------------------------------------------------

WR11C2R4 3274 128 2 2 54.62 4.287e-01

--------------------------------------------------------------------------

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=0.0051440 ...... PASSED

==================================================================

Finished 1 tests with the following results:

1 tests completed and passed residual checks,

0 tests completed and failed residual checks,

0 tests skipped because of illegal input values.

----------------------------------------------------------------------------

End of Tests.

==================================================================



E.3 STREAM output sample

The output that follows is a sample output from an Intel Atom system when executing

STREAM with array size 441.7MB.

-------------------------------------------------------------

STREAM version $Revision: 5.9 $

-------------------------------------------------------------

This system uses 8 bytes per DOUBLE PRECISION word.

-------------------------------------------------------------

Array size = 19300000, Offset = 0

Total memory required = 441.7 MB.

Each test is run 10 times, but only

the *best* time for each is used.

-------------------------------------------------------------

Printing one line per active thread....

-------------------------------------------------------------

Your clock granularity/precision appears to be 2 microseconds.

Each test below will take on the order of 1194623 microseconds.

(= 597311 clock ticks)

Increase the size of the arrays if this shows that

you are not getting at least 20 clock ticks per test.

-------------------------------------------------------------

WARNING -- The above is only a rough guideline.

For best results, please be sure you know the

precision of your system timer.

-------------------------------------------------------------

Function Rate (MB/s) Avg time Min time Max time

Copy: 777.8065 0.4029 0.3970 0.4318

Scale: 190.8710 1.6398 1.6178 1.6900

Add: 173.9241 2.6886 2.6632 2.7319

Triad: 113.8851 4.0880 4.0673 4.1260

-------------------------------------------------------------

Solution Validates

-------------------------------------------------------------



Appendix F

Project evaluation

F.1 Goals

The project has achieved the following goals, as set by the project proposal and as presented within this dissertation:

• Report on low-power architectures targeted for HPC systems.

• Report on related work done in the field of low-power HPC.

• Report on the analysis and specification of requirements for the low-power HPC

project.

• Report on the constraints of the available architectures on their use in HPC.

• Functional low-power seven-node cluster targeted for HPC applications.

• A specific set of benchmarks that can run across all chosen architectures.

• Final MSc dissertation.

The final project proposal can be found in Appendix G.

F.2 Work plan

The schedule presented in the Project Preparation report has been followed and we have met the deadlines it describes. Slight changes were made and the schedule had to be adjusted as the project progressed; the changes applied to the time scales of certain tasks.



F.3 Risks

During the project preparation, the following risks have been identified:

• Risk 1: Unavailable architectures.

• Risk 2: Unavailable tool-chains.

• Risk 3: More time required to build cluster/port code than to run benchmarks.

• Risk 4: Unavailability of identical tools /underlying platform.

• Risk 5: Architectural differences.

In the end these risks did not affect the project, and we managed to mitigate, as described in the project preparation report, any that did occur. However, we came across another two risks that had not been initially identified:

• Risk 6: Service outage and support by the University's related groups.

• Risk 7: Absence due to summer holidays.

The first affected the project for a week and slowed down the experimentation process, as the cluster could not be accessed remotely due to network issues that were later solved. During this outage, the cluster had to be accessed physically to conduct experiments and gather results. The second did not cause any issues to the project itself, but had there been no absence, more experiments and benchmarks could perhaps have been designed and executed.

F.4 Changes

The most important change was the decision over which benchmarks to run. We left aside the SPEC benchmarks, as they would require long execution times that could not be afforded within the project. We also left out the NAS Parallel Benchmarks, as they proved somewhat complicated to execute in a similar manner on all three architectures and would also take rather long to finish execution and gather the needed results. For those reasons we finally decided to proceed with the CoreMark benchmark to measure the serial performance of a core, High-Performance Linpack to measure the parallel performance of a system, and STREAM to measure the memory bandwidth of a system. These three benchmarks have been a good choice, as they are widely used and accepted in the HPC field, are configurable, are easy to run and complete their execution in a relatively short time, enabling us to design a number of different experiments for qualifying and quantifying the results.



Appendix G

Final Project Proposal

G.1 Content

The main scope of the project is to investigate, measure and compare the performance

of low-power CPUs versus standard commodity 32/64-bit x86 CPUs when executing

selected High-Performance Computing applications. Performance factors to be investigated

include: the computational performance along with power consumption and

porting effort of standard HPC codes across to the low-power architectures.

Using 32/64-bit x86 as the baseline, a number of different low-power CPUs will be

investigated and compared, such as ARM, Intel Atom and PowerPC. The performance,

in terms of cost and efficiency, of the various architectures will be measured by using

well-known and established benchmarking suites. Due to the differences in the architectures

and the available supported compilers, a set of appropriate benchmarks will

need to be identified. Fortran compilers are not available on the ARM platform; therefore a number of C or C++ codes that represent either HPC applications or parts of HPC operations will need to be identified, in order to put the systems under stress.

G.2 The work to be undertaken

G.2.1 Deliverables

• Report on low-power architectures targeted for HPC systems.

• Report on related work done in the field of low-power HPC.

• Report on the analysis and specification of requirements for the low-power HPC

project.

• Report on the constraints of the available architectures on their use in HPC, e.g. 32-bit only, toolchain availability, existing code ports.



• Functional low-power cluster, between 6 and 12 nodes, targeted for HPC applications.

• A specific set of codes that can run across all chosen architectures.

• Final MSc dissertation.

• Project presentation.

G.3 Tasks

• Survey of available and possible low-power architecture for HPC use.

• Survey on existing work done in the low-power HPC field.

• Deployment of low-power HPC cluster.

• Identification of appropriate set of benchmarks to run on all architectures, run

experiments and analyse the results.

• Writing of the dissertation reflecting the work undertaken and the outcomes of

the project.

G.4 Additional information / Knowledge required

• Programming knowledge and skills are assumed as the benchmark codes might

require porting.

• Systems engineering knowledge to build up, configure and deploy low-power

cluster.

• Understanding of different methods/techniques of power measuring for computer

systems.

• Presentation skills for writing a good dissertation and presenting the results of the project in public.



Bibliography

[1] P. M. Kogge and et al., "Exascale Computing Study: Technology Challenges in

Achieving Exascale Systems", DARPA Information Processing Techniques Office,

Washington, DC, pp. 278, September 28, 2008.

[2] J. Dongarra, et al., "International Exascale Software Project: Roadmap 1.1",

http://www.Exascale.org/mediawiki/images/2/20/IESP-roadmap.pdf, February

2011

[3] D. A. Patterson and D. R. Ditzel, "The Case for the Reduced Instruction Set

Computer," ACM SIGARCH Computer Architecture News, 8: 6, 25-33, Oct.

1980.

[4] D. W. Clark and W. D. Strecker, "Comments on 'The Case for the Reduced Instruction Set Computer'", ibid, 34-38, Oct. 1980.

[5] Michio Kaku, "The Physics of the Future", 2010

[6] S. Sharma, Chung-Hsing Hsu and Wu-chun Feng, "Making a Case for a

Green500 List", 20th IEEE International Parallel & Distributed Processing Symposium

(IPDPS), Workshop on High-Performance, Power-Aware Computing

(HP-PAC), April 2006

[7] W. Feng, M. Warren, E. Weigle, "Honey, I Shrunk the Beowulf", In the Proceedings

of the 2002 International Conference on Parallel Processing, August 2002

[8] Wu-chun Feng, The Importance of Being Low Power in High Performance Computing, CTWatch QUARTERLY, Volume 1 Number 3, Page 12, August 2005

[9] NVIDIA Press, http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09

&version=live&releasejsp=release_157&xhtml=true&prid=705184, Accessed

13 May 2011

[10] HPC Wire, http://www.hpcwire.com/hpcwire/2011-03-07/china_makes_its_own_supercomputing_cores.html, Accessed 13 May 2011

[11] Katie Roberts-Hoffman, Pawankumar Hedge, ARM (Marvell 88F6281) vs. Intel

Atom: Architectural and Benchmark Comparisons, EE6304 Computer Architecture

Course project, University of Texas at Dallas, 2009



[12] R. Ge, X. Feng, H. Pyla, K. Cameron, W. Feng, Power Measurement Tutorial for

the Green500 List, The Green500 List: Environmentally Responsible Supercomputing,

June 27, 2007

[13] J. J. Dongarra, the LINPACK benchmark: an explanation, In the Proceedings

of the 1st International Conference on Supercomputings, Springer-Verlag New

York, Inc. New York, NY, USA, 1988

[14] Piotr R. Luszczek et al., The HPC Challenge (HPCC) benchmark suite, In the

Proceeding of SC ’06 Proceedings of the 2006 ACM/IEEE conference on Supercomputing,

New York, NY, USA, 2006

[15] D. Weeratunga et al., "The NAS Parallel Benchmarks", NAS Technical Report

RNR-94-007, NASA Ames Research Center, Moffett Field, CA, March 1994

[16] Cathy May, et al., "The PowerPC Architecture: A Specification for A New Family

of RISC Processors", Morgan Kaufmann Publishers, 1994

[17] Charles Johnson, et al., A Wire-Speed Power Processor: 2.3GHz 45nm SOI with

16 Cores and 64 Threads, IEEE International Solid-State Circuits Conference,

White paper, 2010

[18] D.M. Tullsen, S.J. Eggers, H.M. Levy, "Simultaneous multithreading: maximizing on-chip parallelism", ISCA '95, pp. 392-403, June 22, 1995

[19] Green Revolution Cooling, http://www.grcooling.com, Accessed 2 June 2011

[20] Google Data Centers, http://www.google.com/corporate/datacenter/index.html,

Accessed 2 June 2011

[21] Sindre Kvalheim, "Lefdal Mine Project", META magazine, Number 2: 2011, p.

14-15, Notur II Project, February 2011

[22] Knut Molaug, "Green Mountain Data Centre AS", META magazine, Number 2:

2011, p. 16-17, Notur II Project, February 2011

[23] Bjørn Rønning, "Rjukan Mountain Hall - RMH, META magazine, Number 2:

2011, p. 18-19, Notur II Project, February 2011

[24] Jacko Koster, "A Nordic Supercomputer in Iceland", META magazine, Number

2: 2011, p. 13, Notur II Project, February 2011

[25] Douglas Montgomery, "Design and Analysis of Experiments", John Wiley &

Sons, sixth edition, 2004

[26] CoreMark an EMMBC Benchmark, http://www.coremark.org, Accessed 12 May

2011

[27] Genesi’s Hard Float optimizations speeds up Linux performance up to

300% on ARM Laptops, http://armdevices.net/2011/06/21/genesis-hardfloat-optimizations-speeds-up-linux-performance-up-to-300-on-arm-laptops/,

Accessed 21 June 2011



[28] K. Furlinger, C. Klausecker, D. Kranzlmuller, The AppleTV-Cluster: Towards

Energy Efficient Parallel Computing on Consumer Electronic Devices, Whitepaper,

Ludwig-Maximilians-Universitat, April 2011

[29] NEON TM Technology, http://www.arm.com/products/processors/technologies/neon.php,

Accessed 21 June 2011

[30] ARM, ARM NEON support in the ARM compiler, White Paper, September 2008

[31] MIPS Technologies, MIPS64 Architecture for Programmers Volume I: Introduction

to the MIPS64, v3.02

[32] MIPS Technologies, MIPS64 Architecture for Programmers Volume I-B: Introduction

to the microMIPS64, v3.02

[33] MIPS Technologies, MIPS64 Architecture for Programmers Volume II: The

MIPS64 Instruction Set, v3.02

[34] MIPS Technologies, MIPS Architecture For Programmers Volume III: The

MIPS64 and microMIPS64 Privileged Resource Architecture, v3.12

[35] MIPS Technologies, China's Institute of Computing Technology Licenses

Industry-Standard MIPS Architectures, http://www.mips.com/newsevents/newsroom/release-archive-2009/6_15_09.dot,

Accessed 21 June 2011

[36] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, Energy-efficient disk replacement

and file placement techniques for mobile systems with hard disks, In the

Proceedings of the 2007 ACM symposium on Applied computing, New York,

NY, USA 2007

[37] Young-Jin Kim, Kwon-Taek Kwon, Jihong Kim, Energy-efficient file placement

techniques for heterogeneous mobile storage systems, In the Proceeding EM-

SOFT ’06 Proceedings of the 6th ACM & IEEE International conference on Embedded

software, ACM New York, NY, USA 2006

