Estimating Potential Parallelism by Data-Dependence Profiling


Andreas Donig
Department of Informatics and Mathematics
University of Passau
Passau, Germany
Email: donig@fim.uni-passau.de

Abstract—With the rise of multi-core consumer hardware,

today’s software developers face the challenging task of refactoring

existing sequential code assets with a view to exploiting

modern multi-core processors’ parallel execution capabilities.

While there is a growing range of methods to support parallelism

in software products, a software engineer nonetheless has to

decide which parts of an existing software system are worth

the effort of modification. We propose an extensible data-dependence

profiling framework that facilitates the estimation

of a software component’s inherent potential for parallelism.

Our framework is based on the Low Level Virtual Machine

(LLVM) compiler infrastructure. We use LLVM to perform a

static analysis of a given code and instrument the code at the

intermediate representation level to record memory accesses and

control flow. The information recorded, together with the result

of the static analysis, is used to estimate the parallelism inherent

in the program. We evaluate our tool on a benchmark covering

algorithmic problems like searching, sorting and numerical

methods.

I. INTRODUCTION

The spread of multi-core consumer hardware is an ongoing

trend which arose in the last decade to manage CPU power

dissipation and increasing design complexity. Today, when

taking a look at the two biggest contemporary processor

manufacturers' product ranges, we can observe that most

of the available central processing units are dual-core, quad-core

or hexa-core processors, and some have an even higher

core count [1], [2]. With the rise of smartphones, the end-consumers'

demand for the availability of software performing

computationally intensive tasks like playing video, rendering

graphics or recognizing speech on mobile phones influenced

the manufacturers to build multi-core processors into the

upcoming generation of devices. Additionally, the great popularity

of computer games led to the broad availability of

graphics processing units offering a potential of hundreds of

cores in a single device.

While software developers creating a new software product

are in a good position to adjust the design to make use of

these opportunities, there still exist large bodies of sequential

code and many companies are often unwilling to throw away

a software product which has done its job for a long time [3].

In this situation software developers face the challenging task

of refactoring an old software system to exploit the modern

hardware’s parallel execution capabilities, often without

having designed the product themselves and possibly without


contact to the original authors, who may meanwhile have left

the company or may simply be unfamiliar with the modern

concepts of parallel programming.

These factors could cause the refactoring process to take

a long time, resulting in high costs. It would be of benefit

to the developers to have a tool which could tell them which

components of the existing software have the highest inherent

potential for parallelism and therefore are prime targets for

modifications to exploit this parallelism. Even when using

parallelising compilers, it would be beneficial to be able to

determine whether a manual optimisation is worthwhile. A tool

that creates parallel speedup estimates for sequential programs

could reduce the costs of software refactoring. Mindful of

these facts, we offer Parrot, an extensible profiling framework

facilitating the estimation of a software component’s inherent

potential for parallelism.

II. BACKGROUND

Compilers are software systems translating computer programs

from a source language to a target language. A common

compiler architecture has three components: the frontend, the

optimiser and the backend. The frontend parses the source

code and builds an abstract syntax tree (AST), which is a

language-dependent in-memory data structure. The AST is

translated to a language-independent intermediate representation

for optimisation. The optimiser then carries out different

kinds of transformations on the intermediate representation

with the goal of enhancing the program’s execution performance.

The backend, commonly also known as the code generator,

subsequently translates the optimised program to the target

language carrying out target-dependent optimisations.

This three-phase design allows the reuse of an existing

set of optimisations as well as existing backends

with newly implemented language frontends, which only have

to translate the source language to the compiler's language-independent

intermediate representation. Modern compilers

are usually designed to be multi-pass, organised as a series

of passes operating on one or multiple intermediate representations

instead of referring back to the program’s source

text. Using this approach, it is possible to convey knowledge

gathered about the compiled code from one pass to another.

The Low Level Virtual Machine (LLVM) compilation

framework was designed as a set of libraries with a well-


defined API [4], allowing software developers to implement

compiler passes working on LLVM’s intermediate representation

commonly referred to simply as IR. LLVM IR instructions

can use any number of local variables, which are commonly

called temporaries. LLVM IR is a three-address code in

static single-assignment (SSA) form: each temporary is

assigned exactly once.

The LLVM Pass Framework [5] allows software developers

to implement analysis and transformation compiler passes using

various C++ libraries. The passes operate on the program’s

in-memory intermediate representation.
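
For illustration, a minimal module pass, written against the legacy pass manager API documented in [5], looks roughly as follows. This is a sketch of the general shape Parrot's passes take, not code taken from Parrot, and the header paths follow recent LLVM releases:

    #include "llvm/Pass.h"
    #include "llvm/IR/Module.h"
    #include "llvm/Support/raw_ostream.h"

    using namespace llvm;

    namespace {
    // A do-nothing module pass skeleton; Parrot's passes follow this shape.
    struct SkeletonPass : public ModulePass {
      static char ID;                 // the address of ID identifies the pass
      SkeletonPass() : ModulePass(ID) {}

      bool runOnModule(Module &M) override {
        errs() << "visiting module: " << M.getName() << "\n";
        return false;                 // false: the module was left unmodified
      }
    };
    }

    char SkeletonPass::ID = 0;
    static RegisterPass<SkeletonPass> X("skeleton", "Example module pass");

Registering the pass this way lets it be loaded into LLVM's opt driver and scheduled like any built-in pass.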

The most formative architectural design decision in Parrot

was to implement the analysis method on top of the LLVM

compiler infrastructure. Both the LLVM intermediate representation

and the LLVM Pass Framework are documented

extensively and designed thoroughly. By basing it on LLVM,

Parrot inherits support for a broad range of programming

languages as well as a solid foundation for extensibility.

III. THE PARROT DATA-DEPENDENCE PROFILER

Parrot’s architecture is divided into three major components.

The pass component builds upon the LLVM compiler infrastructure

and consists of two LLVM module passes written in

C++. The profiling library is written in C and is intended to be

compiled to LLVM IR and linked with the program analysed.

The analysis component implements the analysis method. It

is written in C++ and makes heavy use of the Boost Graph

Library (BGL) [6].

A. The Pass Component

The pass component consists of two LLVM module passes.

The static dataflow analysis pass creates a static dataflow

graph. The instrumentation pass augments the module with

calls to functions in the profiling library, a process commonly

called instrumentation, and subsequently links this library with

the module.

The static dataflow analysis pass performs a static dataflow

analysis on an LLVM IR module. LLVM provides the dataflow

from definition to use of a temporary, but not the dataflow

across function parameters and returns. The static dataflow

pass infers the latter information and stores everything as a

graph data structure on disk in the GraphML file format.
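
The paper does not specify how the GraphML file is produced. One plausible approach, sketched below, serializes a Boost Graph Library adjacency list with write_graphml; the vertex payload and the example labels are invented for illustration:

    #include <fstream>
    #include <string>
    #include <boost/graph/adjacency_list.hpp>
    #include <boost/graph/graphml.hpp>

    // Hypothetical payload: one vertex per IR value; edges model
    // def-use dataflow and flow across function parameters/returns.
    struct VertexInfo { std::string label; };

    using Graph = boost::adjacency_list<boost::vecS, boost::vecS,
                                        boost::directedS, VertexInfo>;

    int main() {
      Graph g;
      auto def = boost::add_vertex(VertexInfo{"%t1 = add i32 %a, %b"}, g);
      auto use = boost::add_vertex(VertexInfo{"ret i32 %t1"}, g);
      boost::add_edge(def, use, g);   // dataflow edge: definition -> use

      boost::dynamic_properties dp;
      dp.property("label", boost::get(&VertexInfo::label, g));

      std::ofstream out("dataflow.graphml");
      boost::write_graphml(out, g, dp, /*ordered_vertices=*/true);
    }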

The instrumentation pass traverses an LLVM IR module's

instructions. When an inspected instruction meets certain

requirements (cf. III-B), the pass inserts instructions calling

profiling functions provided by the profiling library. To facilitate

the call of these functions, the pass first loads the profiling

library as an LLVM IR module and adds declarations for the

profiling functions contained therein. After the instrumentation

has been carried out, the pass links the module with the

profiling library. This approach automatically prevents the

instrumentation of the profiling library itself, which could

otherwise cause the program to recurse indefinitely.
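
As a sketch of the insertion step, the function below places a call in front of every store instruction. The callback name parrot_mem_access and its signature are invented here (Parrot's real profiling interface is described next, in III-B), and the code uses LLVM's typed-pointer API:

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/InstIterator.h"
    #include "llvm/IR/Instructions.h"
    #include "llvm/IR/Module.h"

    using namespace llvm;

    // Insert "void parrot_mem_access(i8*, i64, i1)" before every store.
    static void instrumentStores(Module &M) {
      LLVMContext &C = M.getContext();
      FunctionCallee Hook = M.getOrInsertFunction(
          "parrot_mem_access", Type::getVoidTy(C),
          Type::getInt8PtrTy(C),   // accessed address
          Type::getInt64Ty(C),     // access size in bytes
          Type::getInt1Ty(C));     // true = write, false = read

      const DataLayout &DL = M.getDataLayout();
      for (Function &F : M)
        for (Instruction &I : instructions(F))
          if (auto *SI = dyn_cast<StoreInst>(&I)) {
            IRBuilder<> B(SI);     // insertion point directly before SI
            Value *Addr = B.CreatePointerCast(SI->getPointerOperand(),
                                              Type::getInt8PtrTy(C));
            uint64_t Bytes =
                DL.getTypeStoreSize(SI->getValueOperand()->getType());
            B.CreateCall(Hook, {Addr, B.getInt64(Bytes), B.getInt1(true)});
          }
    }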

B. The Profiling Library

The profiling library is a dynamic program analysis library

written in C. It is compiled to an LLVM bitcode module and

linked together with other IR modules. It contains a header file

providing data structures used by itself and Parrot’s other two

components. The instrumentation adds calls to the profiling

library in the proper places to record four types of profiling

events: basic block entry, function call, function return and

memory access.

Basic block entry profiling events are triggered when the

program’s control flow enters a basic block. This profiling

event provides information on the entered basic block’s identity

and its size in instructions.

Function call profiling events are triggered right before

every function call instruction. This profiling event provides

information about the basic block containing the call instruction

and the call’s position within the basic block.

Function return profiling events are triggered right after

every function call instruction. Like the function call profiling

events, they provide information related to the basic block

which contains the function call. They additionally provide

information on the position of the instruction after the function

call and information on the basic block’s number of instructions,

like a basic block entry profiling event does.

Memory access profiling events are triggered after every

memory access. Because LLVM IR is a load/store architecture,

this is restricted to load and store instructions. In addition to

the basic block which contains the memory access and the

load/store instruction’s position inside the basic block, memory

access profiling events record the memory address, the size in

bytes and the type of memory access (read/write).
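
The paper does not show the profiling library's interface. The C-style sketch below, with entirely hypothetical names, declares one callback per event type, with parameters matching the information described above, and gives a sample implementation that appends fixed-size records to a trace file:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical callbacks, one per profiling event type. */
    extern "C" {
      void parrot_bb_entry(uint32_t bb, uint32_t ninstr);
      void parrot_fn_call(uint32_t bb, uint32_t pos);
      void parrot_fn_return(uint32_t bb, uint32_t pos, uint32_t ninstr);
      void parrot_mem_access(uint32_t bb, uint32_t pos, const void *addr,
                             uint64_t size, int is_write);
    }

    static FILE *parrot_trace(void) {
      static FILE *f = NULL;
      if (!f) f = fopen("parrot.trace", "wb"); /* opened on first event */
      return f;
    }

    /* Sample implementation: record an event tag, block id and size. */
    void parrot_bb_entry(uint32_t bb, uint32_t ninstr) {
      uint32_t rec[3] = {0u, bb, ninstr};
      fwrite(rec, sizeof rec, 1, parrot_trace());
    }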

The purpose of the basic block entry, function call and

function return profiling event types is to allow the analysis

component to recreate the complete execution trace, while

the memory access profiling event’s purpose is to enable

the analysis of possible write-read memory conflicts, which

result in data dependencies that are hard to compute precisely

at compile time.
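
To make the idea concrete: replaying the recorded memory events in execution order, remembering the last writer of every address, and emitting an edge whenever a read hits a remembered address yields exactly these write-read (read-after-write) dependences. The sketch below works at byte granularity and uses simplified records rather than Parrot's actual data structures:

    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // One replayed memory event: which dynamic instruction touched
    // which byte of memory, and whether it was a read or a write.
    struct MemEvent {
      uint64_t instr;   // dynamic instruction id from the execution trace
      uint64_t addr;    // accessed byte address
      bool write;       // true = store, false = load
    };

    // Returns write-read dependence edges (writer id, reader id).
    std::vector<std::pair<uint64_t, uint64_t>>
    rawDependences(const std::vector<MemEvent> &trace) {
      std::unordered_map<uint64_t, uint64_t> lastWriter; // addr -> instr
      std::vector<std::pair<uint64_t, uint64_t>> edges;
      for (const MemEvent &e : trace) {
        if (e.write) {
          lastWriter[e.addr] = e.instr;     // this store now owns addr
        } else {
          auto it = lastWriter.find(e.addr);
          if (it != lastWriter.end())
            edges.emplace_back(it->second, e.instr); // writer -> reader
        }
      }
      return edges;
    }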

Because of the execution overhead induced by the profiling

and the amount of gathered data which has to be stored, it

is of high priority to keep the number of triggered profiling

events small. It might be possible to reduce the number of

profiling events triggered further at run time at the price of a

more complicated analysis component.

C. The Analysis Component

The analysis component is a standalone program written in

C++. It uses the information recorded by the profiling library,

together with the static analysis result produced by the pass

component, to build up a dynamic dataflow graph. A dynamic

dataflow graph (DDG) is a directed acyclic graph which

describes the dataflow between instructions during the program

execution [7]. Assuming an infinite number of processors

without inter-processor communication overhead, where each

instruction takes exactly one cycle but can only be executed

after all the instructions it depends on have been executed,


TABLE I
RESULTS OF EXPERIMENTS

program       #bbentry   #memaccess    result
bisearch.c          21           26   2.43077
ctsort.c            49          142   4.10687
issort.c           182           97   2.62766
mgsort.c           373           23   2.50994
qksort.c           802          273   3.05318
rxsort.c           322          765   6.38949
interpol.c         458          666   6.54737
lsqe.c              49           89   4.55172
root.c              42           43   2.82
intmm.c             10           62   16.6

a program’s minimum execution time equals the number of

vertices contained in its DDG’s critical path. The program’s

average potential parallelism (the average number of instructions executing

at the same time) is calculated as its DDG's number of vertices

divided by the minimum execution time.
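
Under this machine model, the estimate reduces to a longest-path computation on the DDG. The sketch below assumes the edge list was produced while replaying the trace, so every edge points from an earlier to a later dynamic instruction (a valid topological order); all names are illustrative:

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // n = number of DDG vertices (executed instructions); each edge
    // (from, to) is a dependence with from < to (trace order).
    double averageParallelism(
        uint64_t n, std::vector<std::pair<uint64_t, uint64_t>> edges) {
      if (n == 0) return 0.0;
      std::sort(edges.begin(), edges.end()); // sources ascend: safe to relax
      std::vector<uint64_t> level(n, 1);     // earliest cycle per vertex
      for (auto [from, to] : edges)
        level[to] = std::max(level[to], level[from] + 1);
      uint64_t critical = *std::max_element(level.begin(), level.end());
      // minimum execution time = critical path length in vertices;
      // average parallelism    = number of vertices / minimum time
      return double(n) / double(critical);
    }

The result column of Table I reports this quotient for each benchmark.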

IV. EVALUATION

Parrot was tested with a benchmark of ten programs written

in C. The benchmark covers algorithmic problems like

searching, sorting and numerical methods: bisearch.c uses

binary search to find a string in an array containing twelve

strings. The *sort.c programs sort an array containing ten integers,

where issort.c uses Insertion Sort, mgsort.c uses Merge

Sort, qksort.c uses Quicksort and rxsort.c uses Radix Sort.

interpol.c performs polynomial interpolation. lsqe.c performs

least-squares estimation. root.c finds the roots of an equation.

intmm.c multiplies two 4 × 4 integer matrices. All tests were

executed on an Apple MacBook Pro running Mac OS X

10.7.3 with a 2.3 GHz Intel Core i5 dual-core processor

and 8 GB of 1333 MHz DDR3 RAM, of which about 3 GB were

available for the tests.

All benchmark programs have a run time of less than one

second. The analysis process took between a few seconds and a few minutes. The

results are presented in Table I, which shows the number of

recorded basic block entries, the number of recorded memory

accesses, and the estimated parallel speedup. For instance,

n × n matrix multiplication has an ideal speedup of n². The

estimated speedup of 16.6 roughly matches the theoretical ideal of 16,

because the intmm.c benchmark multiplies 4 × 4 matrices. A

detailed evaluation can be found elsewhere [8].

V. CONCLUSION

Parrot was implemented within a three-month timeframe. It

consists of roughly 1600 lines of code, the biggest portion of

which is the analysis component with about 780 lines written

in C++. The static dataflow analysis pass has about 300 lines

and the instrumentation pass about 260 lines of code, both

also written in C++. The profiling library was implemented

with roughly 240 lines of C code. These numbers show that

not much code is needed to implement a data-dependence

profiling tool allowing the estimation of a program’s potential

for parallelism.

During the development process, scalability proved to be the

most challenging requirement when performing the estimation

of a program’s potential for parallelism. The enormous amount

of profiling data produced had to be written to disk,

slowing down the analysis process. The integration of real-time

compression should be considered. The design of algorithms

that reduce the number of instructions held in memory would

also be a rewarding field of future work. Nevertheless, our

tool turned out to be able to analyse important computational

problems like sorting an array with Quicksort or Merge Sort

and multiplying matrices, which encourages us to continue

the tool's development in the future. The decision to

implement the analysis method on top of the LLVM compiler

infrastructure makes it possible to analyse programs written in

a variety of languages such as Ada, C, C++, D, Fortran and

Objective-C, which are supported by LLVM. Adding support

for an additional language is possible by writing an LLVM

compiler frontend for that language; no changes to the analysis

code are necessary. Future work includes correlating the results

with the source code to determine the origin of parallelism

in the source. Also, inter-processor communication should be

taken into account.

ACKNOWLEDGMENT

The author wishes to express his gratitude to Dr. Armin

Größlinger and Prof. Christian Lengauer, who were abundantly

helpful and offered invaluable assistance, support and

guidance. Deepest gratitude is also due to the members of

the LooPo Project [9] at the University of Passau, Andreas

Simbürger, Andreas Baier and Stephan Kleisinger, for their

constructive ideas, which influenced the development of this tool.

REFERENCES

[1] Intel Corporation. Intel processor pricing. [Online]. Available: http://files.shareholder.com/downloads/INTC/1990986410x0x548319/F5088C1E-A041-4E55-8C2D-AB04AFA8C84C/Mar_4_12_Recommended_Customer_Price_List.pdf

[2] Advanced Micro Devices, Inc. AMD processor price list. [Online]. Available: http://www.amd.com/us/products/pricing/Pages/pricing.aspx

[3] L. Wills, T. Taha, L. Baumstark Jr., and S. Wills, "Estimating potential parallelism for platform retargeting," in Proceedings of the Ninth Working Conference on Reverse Engineering (WCRE'02). IEEE Computer Society, 2002. [Online]. Available: http://dl.acm.org/citation.cfm?id=882506.885136

[4] C. Lattner. The architecture of open source applications: LLVM. [Online]. Available: http://www.aosabook.org/en/llvm.html

[5] C. Lattner and J. Laskey. Writing an LLVM pass. [Online]. Available: http://llvm.org/docs/WritingAnLLVMPass.html

[6] J. Siek, L. Lee, and A. Lumsdaine. The Boost Graph Library (BGL). [Online]. Available: http://www.boost.org/doc/libs/1_50_0/libs/graph/doc/index.html

[7] J. Mak and A. Mycroft, "Limits of parallelism using dynamic dependency graphs," in Proceedings of the Seventh International Workshop on Dynamic Analysis (WODA '09). ACM, 2009, pp. 42–48. [Online]. Available: http://doi.acm.org/10.1145/2134243.2134253

[8] A. Donig, "Estimating Potential Parallelism by Data-Dependence Profiling," Bachelor's Thesis, University of Passau, Apr. 2012.

[9] The LooPo team. Polyhedral Loop Parallelization: LooPo. [Online]. Available: http://www.infosun.fmi.uni-passau.de/cl/loopo/
