PROGRAM STRUCTURE TREES - Software Systems Lab

PROGRAM STRUCTURE TREES

A comparison of different algorithms 

Tobias Grosser

University of Passau 

February 22, 2010 

Starting with a general introduction to program structure trees
(PSTs), this work gives an overview of several algorithms for creating
PSTs. The algorithms are compared in terms of runtime complexity,
storage efficiency, implementation complexity, restrictions on the
control flow graphs (CFGs) they can analyse, and the kinds of control
flow regions they can detect.

Furthermore, it is shown how these algorithms can be used to analyse
control flow automata as used in software model checking.

Contents

1 Introduction
2 Basic components
  2.1 Control flow graph
  2.2 (Simple) Region
  2.3 Refined Region
  2.4 Program structure tree
3 Different algorithms
  3.1 Regions in structured programs
  3.2 The Program Structure Tree
  3.3 The Refined Program Structure Tree
  3.4 Dominance Tree based RPST Calculation
4 Comparison
5 Region detection in static program analysis
  5.1 The control flow automaton
  5.2 Convert a CFG to a CFA
References
A Tables


1 Introduction

Modern static analysis tools and optimizing compilers apply powerful
analyses and optimization techniques to programs. To achieve the best
possible results, analyses and transformations are developed that are
powerful but often only applicable if the program under analysis
fulfils certain properties. In general, a complete program does not
satisfy the restrictions imposed by the intended analysis. However, it
is possible to extract and analyse just those regions of a program that
satisfy the required restrictions. One way to find these regions is to
find all possible regions and to remove the ones that do not satisfy
the restrictions.

It is therefore worthwhile to understand the different algorithms
available for detecting the regions in a program and to investigate the
advantages and disadvantages of each.

2 Basic components

2.1 Control flow graph 

In compilers, the code of a function as seen in Figure 1 can be
described using a control flow graph (CFG) G. G = (V, E) consists of a
set of vertices V, called basic blocks, and a set of edges E connecting
these basic blocks. Every basic block contains a list of statements.

The execution of a function is defined as a walk over the CFG, where
every time a basic block is passed its statements are executed in
linear order. The walk always starts at a specific basic block, the
entry basic block, and ends when it arrives at a basic block that is
terminated by a "return" statement.

To represent non-linear control flow, branch statements may terminate a
basic block. Based on the result of a condition, these branch
statements pass control to another basic block. The control flow always
follows the edges of the CFG.
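As an illustration of the definition above, the following minimal sketch stores a CFG as a successor dictionary and performs a walk over it. The block names ("header", "latch" and so on) and the `choose` callback are hypothetical; the graph is only assumed to mirror the loop of foo() in Figure 1.

```python
# Hypothetical block names; each key is a basic block, each value its successors.
cfg = {
    "entry": ["header"],        # int i, a, b; i = 0
    "header": ["body", "ret"],  # if (i != 100)
    "body": ["then", "else"],   # a = 3; if (i == a)
    "then": ["latch"],          # b = 5
    "else": ["latch"],          # b = 8
    "latch": ["header"],        # i++
    "ret": [],                  # return
}

def edges(cfg):
    """Return the edge set E as a list of (source, target) pairs."""
    return [(src, dst) for src, succs in cfg.items() for dst in succs]

def walk(cfg, entry, choose):
    """Follow CFG edges from the entry block until a block without successors."""
    trace, block = [], entry
    while True:
        trace.append(block)
        succs = cfg[block]
        if not succs:  # models the block terminated by "return"
            return trace
        block = choose(block, succs)

# Stand-in for evaluating branch conditions: run the loop body once, then leave.
visits = {"header": 0}

def choose(block, succs):
    if block == "header":
        visits["header"] += 1
        return succs[0] if visits["header"] == 1 else succs[1]
    return succs[0]

trace = walk(cfg, "entry", choose)
```

The walk only ever moves along entries of the successor lists, which is the sense in which control flow follows the edges of the CFG.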

2.2 (Simple) Region 

A connected subgraph of the CFG that has only two connections to the
remaining CFG, an incoming and an outgoing edge, is called a (single
entry single exit) region. Such a region can be analyzed and
transformed like a separate function. This can be modeled as seen in
Figure 2 by replacing the orange region with a call to a function that
contains the orange CFG region. Moving or replacing the entire region
is as simple as moving two edges in the CFG or, if the region is
extracted as a function, changing a function call.

A region is called a trivial region if it contains exactly one basic
block. A region A is called a canonical region if there is no set of
regions that can be combined to construct A.
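The region definition can be checked mechanically. The sketch below is a naive test, not one of the detection algorithms discussed later: it counts the edges crossing the boundary of a candidate block set and checks connectivity. The successor dictionary and its block names are hypothetical and only assumed to mirror foo() from Figure 1.

```python
cfg = {  # hypothetical successor lists for foo()
    "entry": ["header"], "header": ["body", "ret"], "body": ["then", "else"],
    "then": ["latch"], "else": ["latch"], "latch": ["header"], "ret": [],
}

def boundary_edges(cfg, blocks):
    """Return (incoming, outgoing) edges between the block set and the rest."""
    blocks = set(blocks)
    incoming = [(s, d) for s in cfg for d in cfg[s]
                if s not in blocks and d in blocks]
    outgoing = [(s, d) for s in cfg for d in cfg[s]
                if s in blocks and d not in blocks]
    return incoming, outgoing

def is_connected(cfg, blocks):
    """Check connectivity of the induced subgraph, ignoring edge direction."""
    blocks = set(blocks)
    if not blocks:
        return False
    neighbours = {b: set() for b in blocks}
    for s in cfg:
        for d in cfg[s]:
            if s in blocks and d in blocks:
                neighbours[s].add(d)
                neighbours[d].add(s)
    seen, stack = set(), [next(iter(blocks))]
    while stack:
        b = stack.pop()
        if b not in seen:
            seen.add(b)
            stack.extend(neighbours[b] - seen)
    return seen == blocks

def is_simple_region(cfg, blocks):
    """Single entry single exit: one incoming and one outgoing edge."""
    incoming, outgoing = boundary_edges(cfg, blocks)
    return is_connected(cfg, blocks) and len(incoming) == 1 and len(outgoing) == 1

loop_body = {"body", "then", "else", "latch"}
```

Under these assumptions the loop body is a region, the single block "then" is a trivial region, and {"body", "then", "else"} is not a region because two edges leave it.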

2.3 Refined Region 

The definition of a region can be extended to a so-called refined
region. A refined region is a connected subgraph of a CFG that can be
transformed into a region by inserting two empty basic blocks that join
multiple entry or exit edges.


    void foo() {
        int i, a, b;
        for (i = 0; i != 100; i++) {
            a = 3;
            if (i == a)
                b = 5;
            else
                b = 8;
        }
        return;
    }

(a) Source code

[Figure: (b) Control flow graph]

Figure 1: A simple function

[Figure: (a) Region in foo(), (b) Region replaced, (c) Function "orange()"]

Figure 2: Extracting a region into a function


[Figure: (a) Refined region, (b) After transformation, with inserted blocks t_1 and t_2]

Figure 3: Transform an extended region to a plain region

Every region is also a refined region. The definitions for trivial and 

canonical regions also apply to refined regions. 
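The transformation of a refined region into a simple region can be sketched as follows. This is a minimal illustration under stated assumptions: the edge structure is a guess at the graph of Figure 3 (the edges cannot be recovered from the figure), the function name `to_simple_region` is invented, and the sketch only handles the case where all entry edges share one target and all exit edges share one target.

```python
# Assumed reading of Figure 3: a and b enter the region at c, d and e leave to g.
cfg = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": ["g"], "e": ["g"], "g": []}

def boundary_edges(cfg, blocks):
    incoming = [(s, d) for s in cfg for d in cfg[s]
                if s not in blocks and d in blocks]
    outgoing = [(s, d) for s in cfg for d in cfg[s]
                if s in blocks and d not in blocks]
    return incoming, outgoing

def to_simple_region(cfg, blocks, pre="t_1", post="t_2"):
    """Insert two empty merge blocks so the block set becomes a simple region."""
    blocks = set(blocks)
    incoming, outgoing = boundary_edges(cfg, blocks)
    entries = {d for _, d in incoming}
    exits = {d for _, d in outgoing}
    if len(entries) != 1 or len(exits) != 1:
        raise ValueError("edges cannot be joined by two merge blocks")
    entry, exit_ = entries.pop(), exits.pop()
    new_cfg = {b: list(succs) for b, succs in cfg.items()}
    new_cfg[pre] = [entry]    # all entry edges are joined in front of the region
    new_cfg[post] = [exit_]   # all exit edges are joined inside the region
    for src, _ in incoming:
        new_cfg[src] = [pre if d == entry else d for d in new_cfg[src]]
    for src, _ in outgoing:
        new_cfg[src] = [post if d == exit_ else d for d in new_cfg[src]]
    return new_cfg, blocks | {post}

new_cfg, region = to_simple_region(cfg, {"c", "d", "e"})
incoming, outgoing = boundary_edges(new_cfg, region)
```

After the transformation the only edges crossing the region boundary are t_1 -> c and t_2 -> g. The sketch also shows why this approach modifies the program: two new basic blocks appear in the CFG.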

An analysis of the Polyhedron¹ benchmark showed that only about 30% of
all refined regions are simple regions. The complete analysis can be
found in Table 2.

2.4 Program structure tree 

In general there are multiple canonical regions in every CFG. Two
canonical regions A and B may contain the same basic blocks. In this
case either region A is completely contained in region B, such that all
basic blocks in A are also in B, or B is completely contained in A.
This relationship is represented as a program structure tree, where a
region B is an ancestor of A if A is completely contained in B.

The example in Figure 4 shows a CFG where the simple regions are marked
with blue and the refined regions with orange borders.
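The containment relation can be turned into a tree directly. The following minimal sketch assumes the canonical regions are already known and given as block sets that are either nested or disjoint; the region names and block names are hypothetical and modelled on foo() from Figure 1. It is quadratic in the number of regions and only illustrates the tree structure, not an efficient construction.

```python
# Hypothetical canonical regions of foo(), each given as a set of basic blocks.
regions = {
    "function": frozenset({"entry", "header", "body", "then", "else", "latch", "ret"}),
    "loop": frozenset({"header", "body", "then", "else", "latch"}),
    "loop_body": frozenset({"body", "then", "else", "latch"}),
    "then": frozenset({"then"}),
    "else": frozenset({"else"}),
}

def build_region_tree(regions):
    """Map each region to its parent: the smallest region strictly containing it."""
    parent = {}
    for name, blocks in regions.items():
        enclosing = [(len(other), other_name)
                     for other_name, other in regions.items()
                     if blocks < other]  # proper subset test
        parent[name] = min(enclosing)[1] if enclosing else None
    return parent

parent = build_region_tree(regions)
```

Every region that strictly contains another one is its ancestor; the parent is the smallest such ancestor, and the region without an enclosing region is the root.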

3 Different algorithms

Different algorithms are available that can be used to analyze a CFG
and build a region tree. They differ in the CFGs they can analyze, the
kind of region they can detect, the implementation effort necessary,
the runtime complexity and the required prerequisites.

¹ http://www.polyhedron.com


[Figure: CFG with basic blocks a to l]

Figure 4: Program structure tree with simple regions (blue) and refined
regions (orange).


3.1 Regions in structured programs 

Obtaining a region tree of a structured, annotated program is trivial.
Every loop and every condition is a region. The nesting of the region
tree is equivalent to the nesting of the loops and conditions in the
original program.

This approach is therefore taken in several algorithms without
explicitly being described as a region detection algorithm.

However, as soon as more complicated constructs such as loops with
multiple exits (breaks), exceptions or even gotos appear, this approach
no longer works. Unfortunately, many programming languages allow at
least some of these constructs, so a more general approach is
necessary.
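For the structured case described at the start of this section, the region tree can be read off the syntax tree. The sketch below assumes a hypothetical tuple encoding of a structured program, (kind, branch, ...) for compound constructs and ("stmt", text) for plain statements, and keeps only the nesting of loops and conditions.

```python
# Hypothetical syntax tree for foo() from Figure 1.
foo = ("function", [
    ("for", [
        ("stmt", "a = 3"),
        ("if", [("stmt", "b = 5")], [("stmt", "b = 8")]),
    ]),
    ("stmt", "return"),
])

def region_tree(node):
    """Return (kind, child_regions) for compound constructs, None for statements."""
    kind, *branches = node
    if kind == "stmt":
        return None
    children = []
    for branch in branches:
        for child in branch:
            sub = region_tree(child)
            if sub is not None:
                children.append(sub)
    return (kind, children)

tree = region_tree(foo)
```

The encoding has no way to express a break, an exception or a goto, which is exactly the limitation described above.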

3.2 The Program Structure Tree 

The algorithm published in [4] can nowadays be seen as the classical
approach to calculating region trees or, as they are called in that
paper, program structure trees (PSTs) for a general, possibly
irreducible, CFG. One reason for this is that the algorithm is fast and
straightforward.

Based on a simple data structure, called a bracket list, the CFG is
analysed without requiring any previously computed information. The
runtime is in O(V + E).

The algorithm can detect simple regions, but no refined regions. To
detect refined regions, a possible approach is to insert merge basic
blocks beforehand. However, this has two drawbacks. First of all,
deciding where to insert merge blocks is not trivial and probably
requires some analysis. Furthermore, modifying the program often
invalidates existing analyses such as dominance information, which is
undesirable in a production compiler.

3.3 The Refined Program Structure Tree 

In [5] the PST approach was extended and the so-called Refined Process
Structure Tree (RPST), referred to here as the Refined Program
Structure Tree, was introduced. The RPST was used to model workflows in
business processes, but it can also be applied to control flow graphs.

One of the main advantages of the RPST is the refined definition of a
region, which allows it to represent not only simple regions with just
one entry and one exit edge, but also regions that have several entry
and exit edges which could be joined into a single entry or exit edge.
This refinement permits the detection of regions that cannot be handled
in a plain PST.

To calculate the RPST, a preliminary analysis is required to build the
triconnected components of the CFG as described in [3] and
corrected/improved in [2]. If this analysis is not yet available, the
effort required to implement the RPST construction algorithm seems to
be high, especially as the triconnected components algorithm is not
trivial.

Another drawback of this algorithm and the refined region definition is
that a region cannot be described in constant memory, but has to be
defined by all incoming and outgoing edges. To know whether a basic
block is part of a region, an auxiliary data structure has to store a
mapping between basic blocks and regions; otherwise a graph walk is
required.


Analysis        PST                     RPST                     DRPST
Applicable      All CFGs                All CFGs                 All CFGs
Precision       Basic regions           Enhanced regions         Enhanced regions
Runtime         O(V + E)                O(V + E)                 O((V + E)^2), probably better
Prerequisites   None                    Triconnected components  Dominance and postdominance trees
Representation  2 edges                 All edges in region      2 basic blocks
BB in region    Extra mapping required  Extra mapping required   Based on dominance info

Table 1: Comparison of different region detection algorithms

3.4 Dominance Tree based RPST Calculation 

In winter 2009 another approach, the dominance tree based RPST (DRPST),
was developed as a region detection algorithm for the LLVM compiler
toolkit. The objective was to achieve the same precision as the
previously described algorithm while taking advantage of already
existing analyses.

One of the most common analyses in restructuring compilers is the
(post)dominance information. Therefore a relatively simple algorithm
was developed that calculates a region tree based on the
(post)dominance information already available.

The algorithm is able to detect all refined regions on any (even
unstructured) CFG. A first analysis of the runtime complexity has shown
an upper bound of O((V + E)^2); however, it seems possible to prove
even better performance in the order of magnitude of
O((V + E) + log(V + E)).

Another advantage of a dominance tree based approach is the constant
time operation to check whether a basic block is part of a region. This
is possible because the algorithm can take advantage of the existing
(post)dominance information. It also makes it possible to store the
description of a region in a constant amount of memory: two references
to basic blocks.
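To illustrate how a region stored as two basic blocks can be queried, the sketch below computes dominator sets with a simple iterative scheme and tests membership from an (entry, exit) pair. Both the membership formula and the helper names are assumptions made for illustration, not the algorithm described above; the exit block is taken to be the first block outside the region, and the CFG with its block names is a hypothetical model of foo() from Figure 1.

```python
cfg = {  # hypothetical successor lists for foo()
    "entry": ["header"], "header": ["body", "ret"], "body": ["then", "else"],
    "then": ["latch"], "else": ["latch"], "latch": ["header"], "ret": [],
}

def dominators(cfg, root):
    """Iteratively compute, for every block, the set of blocks dominating it.

    Assumes every block is reachable from root.
    """
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes - {root}:
            new = {n}
            if preds[n]:
                new |= set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def reverse(cfg):
    """Reverse all edges; dominators of the reversed CFG are postdominators."""
    rev = {n: [] for n in cfg}
    for src, succs in cfg.items():
        for dst in succs:
            rev[dst].append(src)
    return rev

def contains(dom, entry, exit_, block):
    """Assumed membership test: entry dominates block, and block is not
    behind an exit that is itself dominated by entry."""
    return entry in dom[block] and not (exit_ in dom[block] and entry in dom[exit_])

dom = dominators(cfg, "entry")
postdom = dominators(reverse(cfg), "ret")
loop_body = {b for b in cfg if contains(dom, "body", "header", b)}
loop = {b for b in cfg if contains(dom, "header", "ret", b)}
```

With dominance stored as sets, each membership query is a fixed number of set lookups, and a region needs no more than its entry and exit block as description.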

4 Comparison

Table 1 sums up the attributes of all algorithms that can handle
general CFGs. Because of their better precision, the RPST and DRPST
algorithms seem to be the most powerful analyses. In theory the RPST
algorithm is already optimal in terms of runtime complexity, coverage
and precision; in practice, however, it requires a lot of
implementation effort.

This problem seems to be solved by the DRPST algorithm, which can be
implemented easily if dominance information is already available. The
only drawback is that its runtime complexity has not yet been proven
optimal. However, first tests did not reveal any limitation caused by
this.

5 Region detection in static program analysis

In static program analysis, especially software model checking,
extremely expensive analyses are often performed. To obtain reasonable
runtimes it is therefore necessary to reduce the effort required as
much as possible. Program region trees offer various possibilities to
reduce complexity.


One possibility is to analyze different parts of a program with
different accuracy, such that only interesting regions are analyzed
precisely. Other regions could be analyzed roughly and, if required, be
refined later.

Another possibility is to hide the complexity of a program by
considering only the overall effect of a region on the program state,
rather than simulating every single change that the execution of a
program region implies. As the complexity of many algorithms is related
to the number of analyzed program states, reducing the number of
program states might improve runtime significantly.

5.1 The control flow automaton 

Static program analysis often uses a control flow automaton (CFA) to
represent a program, in contrast to the CFGs that are more widespread
in optimizing compilers.

A control flow automaton is a graph G = (V, E) with a set of vertices V
and a set of edges E. Every vertex represents a program state.
Transitions between program states are defined by the edges, each of
which connects two states and applies an operation. The operation can
be either a set of calculations or an assume operation.

5.2 Convert a CFG to a CFA

To convert a CFG to a CFA as seen in Figure 5, it is necessary to find
the program states and transitions in the CFG. In a CFG the program
state is changed by the operations executed in the basic blocks. Before
and after the operations in a basic block, there is a defined program
state. Therefore we can create two new program states c_1 and c_2 for
every basic block BB, one defining the state before and one the state
after the execution of BB. The transition between these two program
states can be defined as the execution of the operations contained in
the basic block. If the basic block ends in a terminating branch
statement, its conditions can be applied as assume operations on the
transitions from c_2 to the next program states.
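The conversion described above can be sketched in a few lines. The input encoding is hypothetical: `blocks` maps a block name to its statements and `branches` maps a block name to (condition, target) pairs. Which branch of block c leads to d and which to e is an assumption, as is the choice to split every block into two states rather than only block c.

```python
# Hypothetical encoding of a block c with "i = 100; if (b == 80)".
blocks = {"c": ["i = 100"], "d": [], "e": []}
branches = {"c": [("b == 80", "d"), ("b != 80", "e")]}

def cfg_to_cfa(blocks, branches):
    """Return CFA transitions as (state, operation, state) triples."""
    transitions = []
    for name, statements in blocks.items():
        before, after = f"{name}_1", f"{name}_2"
        # Executing the block body moves from the state before to the state after.
        transitions.append((before, ("ops", tuple(statements)), after))
        # Each branch condition becomes an assume operation on an outgoing edge.
        for condition, target in branches.get(name, []):
            transitions.append((after, ("assume", condition), f"{target}_1"))
    return transitions

cfa = cfg_to_cfa(blocks, branches)
states = {s for src, _, dst in cfa for s in (src, dst)}
```

Every basic block contributes exactly two states, so the number of CFA states in this sketch is twice the number of basic blocks.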


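[Figure: (a) CFG with a basic block containing "i = 100; if (b == 80)", (b) CFA with states c_1 and c_2, the transition "i = 100" between them and the assume transitions "b == 80" and "b != 80" leaving c_2]

Figure 5: Convert a CFG basic block to CFA nodes and transitions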

References

[1] Thomas Ball. What’s in a region?: or computing control dependence regions in near-linear time for reducible control flow. ACM Lett. Program. Lang. Syst., 2(1-4):1–16, 1993.

[2] Carsten Gutwenger and Petra Mutzel. A linear time implementation of SPQR-trees. In GD ’00: Proceedings of the 8th International Symposium on Graph Drawing, pages 77–90, London, UK, 2001. Springer-Verlag.

[3] John E. Hopcroft and Robert Endre Tarjan. Dividing a graph into triconnected components. SIAM J. Comput., 2(3):135–158, 1973.

[4] Richard Johnson, David Pearson, and Keshav Pingali. The program structure tree: Computing control regions in linear time. pages 171–185, 1994.

[5] Jussi Vanhatalo, Hagen Völzer, and Jana Koehler. The refined process structure tree. In BPM ’08: Proceedings of the 6th International Conference on Business Process Management, pages 100–115, Berlin, Heidelberg, 2008. Springer-Verlag.


A Tables

Program      Plain  Extendable  Plain / Extendable
ac.s            65          77                0.84
aermod.s      1069        5262                0.20
air.s          255         486                0.52
capacita.s     102         232                0.44
channel.s       56          69                0.81
doduc.s        241         702                0.34
fatigue.s       44         133                0.33
gas_dyn.s       76         219                0.35
induct.s        66         239                0.28
linpk.s         44          99                0.44
mdbx.s          84         272                0.31
nf.s            57         122                0.47
protein.s       93         313                0.30
rnflow.s       291         786                0.37
test_fpu.s     242         650                0.37
tfft.s          24          58                0.41
Sum           2809        9719                0.29

Table 2: Regions in the polyhedron.com benchmark

