Estimating Potential Parallelism
by Data-Dependence Profiling
Abstract—With the rise of multi-core consumer hardware,
today's software developers face the challenging task of refactoring
existing sequential code assets with a view to exploiting
modern multi-core processors' parallel execution capabilities.
While there is a growing range of methods to support parallelism
in software products, a software engineer nonetheless has to
decide which parts of an existing software system are worth
the effort of modification. We propose an extensible data-dependence
profiling framework that facilitates the estimation
of a software component's inherent potential for parallelism.
Our framework is based on the Low Level Virtual Machine
(LLVM) compiler infrastructure. We use LLVM to perform a
static analysis of a given code and instrument the code at the
intermediate-representation level to record memory accesses and
control flow. The recorded information, together with the result
of the static analysis, is used to estimate the parallelism inherent
in the program. We evaluate our tool on a benchmark covering
algorithmic problems such as searching, sorting and numerical methods.
I. INTRODUCTION

The spread of multi-core consumer hardware is an ongoing
trend which arose in the last decade to manage CPU power
dissipation and increasing design complexity. A look at the
product ranges of the two biggest contemporary processor
manufacturers shows that most of the available central processing
units are dual-core, quad-core or hexa-core processors, and
some have an even higher core count [1], [2]. With the rise of
smartphones, end consumers' demand for software performing
computationally intensive tasks like playing video, rendering
graphics or recognizing speech on mobile phones has led
manufacturers to build multi-core processors into the upcoming
generation of devices. Additionally, the great popularity of
computer games has led to the broad availability of graphics
processing units offering hundreds of cores in a single device.
While software developers creating a new software product
are in a good position to adjust its design to make use of
these opportunities, there still exist large bodies of sequential
code, and many companies are unwilling to throw away
a software product which has done its job for a long time [3].
In this situation, software developers face the challenging task
of refactoring an old software system to exploit the modern
hardware's parallel execution capabilities, often without
having designed the product themselves and possibly without
contact with the original authors, who may meanwhile have left
the company or may simply be unfamiliar with modern
concepts of parallel programming.
These factors can cause the refactoring process to take
a long time, resulting in high costs. It would be of benefit
to the developers to have a tool which could tell them which
components of the existing software have the highest inherent
potential for parallelism and therefore are prime targets for
modifications to exploit this parallelism. Even when using
parallelising compilers, it would be beneficial to be able to
determine whether a manual optimisation is worthwhile. A tool
that creates parallel speedup estimates for sequential programs
could reduce the costs of software refactoring. Mindful of
these facts, we offer Parrot, an extensible profiling framework
facilitating the estimation of a software component’s inherent
potential for parallelism.
II. BACKGROUND

Compilers are software systems that translate computer programs
from a source language to a target language. A common
compiler architecture has three components: the frontend, the
optimiser and the backend. The frontend parses the source
code and builds an abstract syntax tree (AST), which is a
language-dependent in-memory data structure. The AST is
translated to a language-independent intermediate representation
for optimisation. The optimiser then carries out different
kinds of transformations on the intermediate representation
with the goal of enhancing the program’s execution performance.
The backend, commonly also known as code generator,
subsequently translates the optimised program to the target
language carrying out target-dependent optimisations.
This three-phase design allows an already existing set of
optimisations, as well as already existing backends, to be
reused with newly implemented language frontends, which only have
to translate the source language to the compiler's language-independent
intermediate representation. Modern compilers
are usually designed to be multi-pass, organized as a series
of passes operating on one or multiple intermediate representations
instead of referring back to the program’s source
text. Using this approach, it is possible to convey knowledge
gathered about the compiled code from one pass to another.
The Low Level Virtual Machine (LLVM) compilation
framework was designed as a set of libraries with a well-defined
API [4], allowing software developers to implement
compiler passes working on LLVM's intermediate representation,
commonly referred to simply as IR. LLVM IR instructions
are in three-address code and can use any number of local
variables, which are commonly called temporaries. Since each
temporary can be assigned only once, LLVM IR
is in static single-assignment (SSA) form.
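For illustration, the following small IR fragment (a minimal sketch, not taken from Parrot) shows both properties: every instruction is in three-address form, and each temporary is defined exactly once.

; minimal LLVM IR sketch: three-address instructions in SSA form
define i32 @example(i32 %x, i32 %y) {
entry:
  %sum = add i32 %x, %y      ; %sum is defined exactly once
  %twice = mul i32 %sum, 2   ; operands may be reused, definitions may not
  ret i32 %twice
}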
The LLVM Pass Framework [5] allows software developers
to implement analysis and transformation compiler passes using
various C++ libraries. The passes operate on the program’s
in-memory intermediate representation.
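As a minimal sketch of such a pass (using the legacy C++ pass API of the LLVM 3.x era, current at the time of writing; the pass name and output are illustrative, not part of Parrot), an analysis-only module pass that counts the functions defined in a module could look as follows:

#include "llvm/Pass.h"
#include "llvm/Module.h"   // llvm/IR/Module.h in later LLVM versions
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

namespace {
  // An analysis-only module pass: it inspects the IR but does not change it.
  struct CountFunctions : public ModulePass {
    static char ID;
    CountFunctions() : ModulePass(ID) {}

    virtual bool runOnModule(Module &M) {
      unsigned Defined = 0;
      for (Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
        if (!F->isDeclaration())
          ++Defined;
      errs() << "module defines " << Defined << " function(s)\n";
      return false; // false: the module was not modified
    }
  };
}

char CountFunctions::ID = 0;
static RegisterPass<CountFunctions> X("countfuncs", "Count defined functions",
                                      false, false);

Parrot's two passes operate at exactly this granularity: they are module passes that traverse the in-memory IR of a whole compilation unit.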
The most formative architectural design decision in Parrot
was to implement the analysis method on top of the LLVM
compiler infrastructure. Both the LLVM intermediate representation
and the LLVM Pass Framework are documented
extensively and designed thoroughly. By basing it on LLVM,
Parrot inherits support for a broad range of programming
languages as well as a solid foundation for extensibility.
III. THE PARROT DATA-DEPENDENCE PROFILER
Parrot’s architecture is divided into three major components.
The pass component builds upon the LLVM compiler infrastructure
and consists of two LLVM module passes written in
C++. The profiling library is written in C and is intended to be
compiled to LLVM IR and linked with the program analysed.
The analysis component implements the analysis method. It
is written in C++ and makes heavy use of the Boost Graph
Library (BGL) [6].
A. The Pass Component
The pass component consists of two LLVM module passes.
The static dataflow analysis pass creates a static dataflow
graph. The instrumentation pass augments the module with
calls to functions in the profiling library, a process commonly
called instrumentation, and subsequently links this library with
the module.
The static dataflow analysis pass performs a static dataflow
analysis on an LLVM IR module. LLVM provides the dataflow
from definition to use of a temporary, but not the dataflow
across function parameters and returns. The static dataflow
pass infers the latter information and stores everything as a
graph data structure on disk in the GraphML file format.
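The paper does not prescribe a particular GraphML writer; as one hypothetical illustration, the Boost Graph Library, which Parrot's analysis component uses anyway, can serialise a graph to GraphML directly:

// Hypothetical sketch of GraphML serialisation with the BGL;
// Parrot's pass component may use a different writer.
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/graphml.hpp>
#include <fstream>

typedef boost::adjacency_list<boost::vecS, boost::vecS,
                              boost::directedS> DataflowGraph;

void write_dataflow_graph(const DataflowGraph &g, const char *path) {
  std::ofstream out(path);
  boost::dynamic_properties dp; // vertex/edge attributes omitted in this sketch
  boost::write_graphml(out, g, dp);
}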
The instrumentation pass traverses an LLVM IR module's
instructions. When an inspected instruction meets certain
requirements (cf. III-B), the pass inserts instructions calling
profiling functions provided by the profiling library. To facilitate
the call of these functions, the pass first loads the profiling
library as an LLVM IR module and adds declarations for the
profiling functions contained therein. After the instrumentation
has been carried out, the pass links the module with the
profiling library. This approach automatically prevents the
instrumentation of the profiling library, which could cause the
program to recurse indefinitely.
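The following sketch illustrates the principle of such instrumentation with the LLVM 3.x-era C++ API; the hook parrot_mem_access and its simplified signature are assumptions for illustration, not Parrot's actual interface:

#include "llvm/Module.h"
#include "llvm/Instructions.h"
#include "llvm/Support/IRBuilder.h" // llvm/IR/IRBuilder.h in later LLVM

using namespace llvm;

// Insert "void parrot_mem_access(i8 *addr, i1 isWrite)" before every store.
// Hook name and (simplified) signature are illustrative only.
static void instrumentStores(Module &M) {
  LLVMContext &C = M.getContext();
  Constant *Hook = M.getOrInsertFunction("parrot_mem_access",
      Type::getVoidTy(C), Type::getInt8PtrTy(C), Type::getInt1Ty(C),
      (Type *)0); // NULL-terminated parameter list in the old API

  for (Module::iterator F = M.begin(), FE = M.end(); F != FE; ++F)
    for (Function::iterator BB = F->begin(), BE = F->end(); BB != BE; ++BB)
      for (BasicBlock::iterator I = BB->begin(), IE = BB->end(); I != IE; ++I)
        if (StoreInst *SI = dyn_cast<StoreInst>(&*I)) {
          IRBuilder<> B(SI); // builder positioned right before the store
          Value *Addr = B.CreatePointerCast(SI->getPointerOperand(),
                                            Type::getInt8PtrTy(C));
          B.CreateCall2(Hook, Addr, B.getTrue()); // true: this is a write
        }
}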
B. The Profiling Library
The profiling library is a dynamic program analysis library
written in C. It is compiled to an LLVM bitcode module and
linked together with other IR modules. It contains a header file
providing data structures used by itself and Parrot’s other two
components. The instrumentation adds calls to the profiling
library in the proper places to record four types of profiling
events: basic block entry, function call, function return and
memory access.
Basic block entry profiling events are triggered when the
program’s control flow enters a basic block. This profiling
event provides information on the entered basic block’s identity
and its size in instructions.
Function call profiling events are triggered right before
every function call instruction. This profiling event provides
information about the basic block containing the call instruction
and the call’s position within the basic block.
Function return profiling events are triggered right after
every function call instruction. Like the function call profiling
events, they provide information related to the basic block
which contains the function call. They additionally provide
information on the position of the instruction after the function
call and information on the basic block’s number of instructions,
like a basic block entry profiling event does.
Memory access profiling events are triggered after every
memory access. Because LLVM IR is a load/store architecture,
this is restricted to load and store instructions. In addition to
the basic block which contains the memory access and the
load/store instruction’s position inside the basic block, memory
access profiling events record the memory address, the size in
bytes and the type of memory access (read/write).
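Taken together, the four event types suggest a profiling interface along the following lines (a hypothetical C sketch; all names and parameter lists are assumptions rather than Parrot's actual header):

/* Hypothetical profiling-library interface; names and signatures are
   illustrative only. Each function records one event in the trace. */
#include <stdint.h>

typedef enum { PARROT_READ, PARROT_WRITE } parrot_access_kind;

/* control flow entered basic block bb_id, which has n_instr instructions */
void parrot_bb_entry(uint32_t bb_id, uint32_t n_instr);

/* the call instruction at position pos inside basic block bb_id is about to run */
void parrot_call(uint32_t bb_id, uint32_t pos);

/* control returned to the instruction after the call at position pos */
void parrot_return(uint32_t bb_id, uint32_t pos, uint32_t n_instr);

/* a load/store at position pos touched size bytes at addr */
void parrot_mem_access(uint32_t bb_id, uint32_t pos, void *addr,
                       uint32_t size, parrot_access_kind kind);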
The purpose of the basic block entry, function call and
function return profiling event type is to allow the analysis
component to recreate the complete execution trace, while
the memory access profiling event’s purpose is to enable
the analysis of possible write-read memory conflicts, which
result in data dependences that are hard to compute precisely
at compile time.
Because of the execution overhead induced by the profiling
and the amount of gathered data which has to be stored, it
is of high priority to keep the number of triggered profiling
events small. It might be possible to reduce the number of
profiling events triggered further at run time at the price of a
more complicated analysis component.
C. The Analysis Component
The analysis component is a standalone program written in
C++. It uses the information recorded by the profiling library,
together with the static analysis result produced by the pass
component, to build up a dynamic dataflow graph. A dynamic
dataflow graph (DDG) is a directed acyclic graph which
describes the dataflow between instructions during the program's
execution [7]. Assuming an infinite number of processors
without inter-processor communication overhead, where each
instruction takes exactly one cycle but can only be executed
after all the instructions it depends on have been executed,
a program's minimum execution time equals the number of
vertices contained in its DDG's critical path. The program's
average potential parallelism (the number of instructions executing
at the same time) is calculated as its DDG's number of vertices
divided by the minimum execution time.

TABLE I
RESULTS OF EXPERIMENTS

Program       #bbentry   #memaccess   Estimated speedup
bisearch.c          21           26             2.43077
ctsort.c            49          142             4.10687
issort.c           182           97             2.62766
mgsort.c           373           23             2.50994
qksort.c           802          273             3.05318
rxsort.c           322          765             6.38949
interpol.c         458          666             6.54737
lsqe.c              49           89             4.55172
root.c              42           43             2.82
intmm.c             10           62             16.6
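The estimate itself reduces to a longest-path computation over the DDG. The following sketch (assumed graph type and function name, not Parrot's actual code) shows how it can be computed with the BGL via a traversal in topological order:

// Average potential parallelism: |V| divided by the number of vertices
// on the longest (critical) path, each instruction costing one cycle.
#include <boost/graph/adjacency_list.hpp>
#include <boost/graph/topological_sort.hpp>
#include <boost/tuple/tuple.hpp>
#include <algorithm>
#include <iterator>
#include <vector>

typedef boost::adjacency_list<boost::vecS, boost::vecS,
                              boost::directedS> DDG;

double average_parallelism(const DDG &g) {
  std::size_t n = boost::num_vertices(g);
  if (n == 0) return 0.0;

  std::vector<std::size_t> order;
  boost::topological_sort(g, std::back_inserter(order)); // reverse order

  std::vector<std::size_t> depth(n, 1); // longest path ending at each vertex
  for (std::vector<std::size_t>::reverse_iterator v = order.rbegin();
       v != order.rend(); ++v) {        // iterate in topological order
    boost::graph_traits<DDG>::out_edge_iterator e, e_end;
    for (boost::tie(e, e_end) = boost::out_edges(*v, g); e != e_end; ++e) {
      std::size_t succ = boost::target(*e, g);
      depth[succ] = std::max(depth[succ], depth[*v] + 1);
    }
  }
  std::size_t critical = *std::max_element(depth.begin(), depth.end());
  return double(n) / double(critical); // vertices per critical-path step
}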
IV. EVALUATION

Parrot was tested with a benchmark of ten programs written
in C. The benchmark covers algorithmic problems like
searching, sorting and numerical methods: bisearch.c uses
binary search to find a string in an array containing twelve
strings. The *sort.c programs sort an array containing ten integers,
where issort.c uses Insertion Sort, mgsort.c uses Merge
Sort, qksort.c uses quicksort and rxsort.c uses Radix Sort.
interpol.c performs polynomial interpolation. lsqe.c performs
least-squares estimation. root.c finds the roots of an equation.
intmm.c multiplies two 4 × 4 integer matrices. All tests were
executed on an Apple MacBook Pro running Mac OS X
10.7.3 with a 2.3 GHz Intel Core i5 dual-core processor
and 8 GB of 1333 MHz DDR3 RAM, of which about 3 GB were
available for the tests.
All benchmark programs have a run time of less than one
second. The analysis process took from seconds to minutes. The
results are presented in Table I, which shows the number of
recorded basic block entries, the number of recorded memory
accesses, and the estimated parallel speedup. For instance,
n × n matrix multiplication has an ideal speedup of n², since
all n² entries of the result can be computed independently. The
estimated speedup of about 16 matches this theoretical ideal,
because the intmm.c benchmark multiplies 4 × 4 matrices. A
detailed evaluation can be found elsewhere [8].
V. CONCLUSION

Parrot was implemented within a three-month timeframe. It
consists of roughly 1600 lines of code, the biggest portion of
which is the analysis component with about 780 lines written
in C++. The static dataflow analysis pass has about 300 lines
and the instrumentation pass about 260 lines of code, both
also written in C++. The profiling library was implemented
with roughly 240 lines of C code. These numbers show that
not much code is needed to implement a data-dependence
profiling tool allowing the estimation of a program's potential
for parallelism.
During the development process, scalability proved to be the
most challenging requirement when performing the estimation
of a program’s potential for parallelism. The enormous amount
of profiling data produced had to be written to the disk,
slowing down the analysis process. The integration of real-time
compression should be considered. The design of algorithms
to reduce the number of instructions held in memory should
also be a rewarding direction for future work. Nevertheless, our
tool turned out to be able to analyse important computational
problems, like sorting an array with quicksort or merge sort
and multiplying matrices, which encourages us to continue
the tool's development in the future. The decision to
implement the analysis method on top of the LLVM compiler
infrastructure makes it possible to analyse programs written in
a variety of languages such as Ada, C, C++, D, Fortran and
Objective-C, which are supported by LLVM. Support for an
additional language can be added by writing an LLVM
compiler frontend for that language; no changes to the analysis
code are necessary. Future work includes correlating the results
with the source code to determine the origin of parallelism
in the source. Also, inter-processor communication should be
taken into account.
ACKNOWLEDGMENT

The author wishes to express his gratitude to Dr. Armin
Größlinger and Prof. Christian Lengauer, who were abundantly
helpful and offered invaluable assistance, support and
guidance. Deepest gratitude is also due to the members of
the LooPo Project [9] at the University of Passau, Andreas
Simbürger, Andreas Baier and Stephan Kleisinger, for their
constructive ideas influencing the development of this tool.
REFERENCES

[1] Intel Corporation. Intel processor pricing. [Online]. Available: F5088C1E-A041-4E55-8C2D-AB04AFA8C84C/Mar 4 12 Recommended Customer Price List.pdf
[2] Advanced Micro Devices, Inc. AMD processor price list. [Online].
[3] L. Wills, T. Taha, L. Baumstark Jr., and S. Wills, "Estimating potential parallelism for platform retargeting," in Proceedings of the Ninth Working Conference on Reverse Engineering (WCRE'02). IEEE Computer Society, 2002. [Online]. Available: http://dl.acm.org/citation.
[4] C. Lattner. The architecture of open source applications: LLVM. [Online]. Available: http://www.aosabook.org/en/llvm.html
[5] C. Lattner and J. Laskey. Writing an LLVM pass. [Online]. Available:
[6] J. Siek, L. Lee, and A. Lumsdaine. The Boost Graph Library (BGL). [Online]. Available: http://www.boost.org/doc/libs/1_50_0/libs/graph/doc/
[7] J. Mak and A. Mycroft, "Limits of parallelism using dynamic dependency graphs," in Proceedings of the Seventh International Workshop on Dynamic Analysis (WODA '09). ACM, 2009, pp. 42–48. [Online]. Available: http://doi.acm.org/10.1145/2134243.2134253
[8] A. Donig, "Estimating Potential Parallelism by Data-Dependence Profiling," Bachelor's Thesis, University of Passau, Apr. 2012.
[9] The LooPo team. Polyhedral Loop Parallelization: LooPo. [Online].