On the analysis of

protein interaction networks

by

William Paul Kelly

A thesis submitted for the degree of

Doctor of Philosophy of the University of London

Department of Mathematics

Imperial College London

180 Queen’s Gate

London, England

October, 2009


© 2009 William Paul Kelly

All rights reserved

Typeset in Times by LaTeX

Graphs typeset in R for Mac OS X

This dissertation is the result of my own work

and includes nothing which is the outcome of

work done in collaboration except where

specifically indicated in the text.

This dissertation is not substantially the same

as any submitted by the author for any other

degree or diploma or other qualification at

any other university.

No part of this dissertation has already been,

or is currently being submitted by the author

for any other degree or diploma or other

qualification.

This dissertation does not exceed 50,000

words, including appendices, footnotes,

tables and equations. It does not contain

more than 100 figures.

This work is supported by a Wellcome Trust

grant and completed in the Department of

Mathematics and Centre for Bioinformatics

at Imperial College, London.


Abstract

Protein interaction networks describe the reported protein interactions found in an organism.

Understanding their organisation will have an impact on all areas of systems

biology. The amount of interaction data has expanded dramatically since the advent of

high-throughput experimental technologies. However, interaction data are believed to

contain a high proportion of false-positive interactions as well as true interactions. Incorporating

knowledge of other biological characteristics may allow more reliable interaction

networks to be produced.

This thesis presents an analysis of the reported Saccharomyces cerevisiae protein interaction

network, providing an overview of its contents and a comparison of the contributing

experimental techniques. Algorithms for constructing random networks are described and

used to assess whether the network’s topology depends upon biological covariates. It is

shown that the choice of random network generation algorithm can affect the conclusions

drawn.

Phylogenetic trees of S. cerevisiae proteins are compared in order to assess possible evolutionary

linkage of protein-protein interactions. The similarity of phylogenetic tree topologies

found between interacting proteins is compared to that found for a variety of

randomly constructed networks. Whilst the orthologues of interacting proteins show a

tendency to be conserved together, the topologies are not more similar than those found

for random networks. However, topological similarity is shown to be a means of differentiating

between interacting and non-interacting protein pairs that have been reported as

binding together in the same multi-protein complex structure.

Finally, a model is described that predicts interactome size and false discovery rate for

reported data. The model uses all available interaction data to present the relationship

between error rate, interactome size, and the proportion of observed true interactions.

True interactions are classified using repeated data, and plausible

interactome sizes are used to assess the number of reported interactions necessary to find

true interactions reliably.


Contents

List of Figures
List of Tables
List of Abbreviations
List of Mathematical Notation
Acknowledgements
1 Introduction
1.1 Thesis overview
1.1.1 Scope
1.1.2 Outline
1.1.3 Publications
1.2 Biological systems
1.2.1 Genomes
1.2.2 Proteins
1.2.3 Protein interactions
1.2.4 HIV example
1.3 Comparative genomics
1.3.1 Sequence alignment
1.3.2 Phylogenetic trees
1.3.3 Correlated evolution
1.4 Protein interaction data
1.4.1 Traditional methods
1.4.2 High-throughput methods
1.4.3 Interaction inference
1.4.4 Computational predictions
1.5 Graph theory
1.5.1 Graph properties
1.5.2 Graph ensembles
1.6 Noise in interactions
1.6.1 Sampling notation
1.6.2 Error rates
1.6.3 Error and size estimates
1.7 Summary
2 An exploratory analysis of interaction data
2.1 Introduction
2.2 Interactome databases
2.3 Analysis of the BioGRID S. cerevisiae database
2.3.1 Year of publication
2.3.2 Experiment size and technique
2.3.3 Self-interactions
2.3.4 Gene Ontology annotations of interacting proteins
2.3.5 Gene Ontology annotations and experimental techniques
2.3.6 Repeated interactions
2.4 Interaction networks
2.4.1 Local graph structure
2.4.2 Degree sequence
2.5 Discussion
3 Graph ensembles
3.1 Introduction
3.2 Methods
3.2.1 Data
3.2.2 Rewiring
3.2.3 Perturbations
3.3 Results
3.3.1 Rewiring
3.3.2 Perturbations
3.4 Discussion
4 Phylogenetic topologies of interacting proteins
4.1 Introduction
4.2 Methods
4.2.1 Data
4.2.2 Correlated divergence
4.2.3 Measuring topological differences
4.2.4 Phylogenetic analyses
4.3 Results
4.3.1 Phylogenetic profiles
4.3.2 Topological similarity
4.3.3 Phylogenetic methods
4.3.4 Further analyses
4.4 Discussion
5 Measuring the interactome
5.1 Introduction
5.2 Methods
5.2.1 Data
5.2.2 Coupon collecting
5.2.3 Single coupon
5.2.4 Multiple coupons
5.2.5 Finding true interactions
5.3 Results
5.3.1 Interactome size
5.3.2 Experiment size
5.3.3 Classification
5.4 Discussion
6 Conclusions
6.1 Summary
6.2 Conclusions
6.3 Further work
A Mathematical techniques
A.1 Likelihood analysis of degree distributions
A.2 Scaling degree random graphs
A.3 Exponential random graphs
A.4 Further biological random graphs
B Data tables for biological traits
B.1 Experimental interaction techniques
B.2 Further Gene Ontology annotation analysis
C Graph ensemble output
D Phylogenetic topology
D.1 Phylogenetic topologies
D.2 Supplementary phylogenetic results
D.3 Escherichia coli phylogenetic trees
E Sampling schemes


Figures

1.1 Interacting proteins
1.2 HIV virion
1.3 Sequence alignment
1.4 A phylogenetic tree
1.5 Example protein interaction network
1.6 Complex interaction models
1.7 A graph
1.8 Overlap method
1.9 High-throughput interaction overlap
1.10 Sample space overlap
2.1 Number and type of interaction reported in S. cerevisiae by year
2.2 Experimental techniques contribution to BioGRID
2.3 Molecular function annotations of reported interactions
2.4 Cellular component annotations of reported interactions
2.5 Biological process annotations of reported interactions
2.6 Proportion of matching functional annotations by experiment technique
2.7 Proportion of matching component annotations by experiment technique
2.8 Proportion of matching biological process annotations by experiment technique
2.9 Accrual of reported yeast protein interactions over time
2.10 Rank-degree plots of network data.
3.1 Node shuffle
3.2 Network shuffle
3.3 Bipartite shuffle
3.4 Biological node shuffle
3.5 Biological network shuffle
3.6 Co-expression trait for ensembles using LC graph
3.7 Complex trait for ensembles using LC graph
3.8 Gene Ontology traits for graph ensembles
3.9 Component and clustering traits for graph ensembles
3.10 Co-expression trait for topological ensembles
3.11 Instability and distance for GO perturbations
3.12 Null homology perturbations for complex annotations
3.13 Null homology perturbations for process annotations
3.14 Similarity score by perturbation method
4.1 Phylogeny of study species
4.2 Topology edit distance
4.3 Phylogenetic profiles for each ensemble
4.4 Phylogenetic profile differences
4.5 Topological matching for LC interaction graph
4.6 Mismatch score using LC interaction graph
4.7 Topological similarity for LC interaction graph
4.8 Similarity of topologies for different tree algorithms
5.1 Single coupon function
5.2 S. cerevisiae physical interactome size
5.3 S. cerevisiae genetic interactome size
5.4 Experiment and interactome size
5.5 Single or multiple coupons
5.6 Multiple coupon interactome size results
B.1 GO slim matching annotations through time
B.2 Known GO annotations for PPIs by method
C.1 Graph ensemble traits for DIP data
C.2 Graph ensemble traits for CORE data
D.1 Phylogeny results for DIP (PROML trees)
D.2 Phylogeny results for CORE (PROML trees)
D.3 Phylogeny results for LC (PAML trees)
D.4 Phylogeny results for LC (PARS trees)
D.5 Phylogenetic topology matches for E. coli data
E.1 Node sampling
E.2 Edge sampling
E.3 Edge discovery


Tables

1.1 HTP experimental methodologies
1.2 Error rate notation
1.3 FDR estimates
1.4 S. cerevisiae interactome size predictions
2.1 Interaction databases
2.2 Self-interactions found from each experimental technique
2.3 Components and degree for empirical graphs
2.4 Clustering coefficients
2.5 AIC analysis of possible degree distribution
3.1 Empirical graph traits
3.2 Size of homology sets
4.1 Similarity for each phylogenetic tree construction algorithm
4.2 Complex results
5.1 Interaction datasets
5.2 Classification performance if ρm = 20,000
5.3 Classification performance if ρm = 40,000
B.1 Interaction prediction methodologies
B.2 GO slim annotation classes
B.3 BioGRID experimental methods
D.1 Number of topologies


Abbreviations

BioGRID   Biological general repository for interaction datasets
BLAST     Basic Local Alignment Search Tool
CORE      PPI subset taken from DIP (interaction graph)
DIP       Database of interacting proteins (interaction graph)
DNA       Deoxyribonucleic acid
ER        Erdős-Rényi (random graph)
ERGM      Exponential random graph model
FDR       False discovery rate
FN        False negative
FP        False positive
FRET      Fluorescence resonance energy transfer
GCC       Giant connected component
GO        Gene Ontology
HTP       High-throughput experiment
LC        Literature curated PPIs (interaction graph)
MIPS      Munich information center for protein sequences
ML        Maximum likelihood
mRNA      Messenger ribonucleic acid
MS        Mass spectrometry
PAML      Phylogenetic analysis by maximum likelihood (phylogeny inference)
PARS      Parsimony (phylogeny inference)
PHYLIP    Phylogeny inference package
PIN       Protein interaction network
PPI       Protein-protein interaction
PROML     Protein maximum likelihood (phylogeny inference)
RNA       Ribonucleic acid
SSE       Small scale experiment
TN        True negative
TP        True positive
Y2H       Yeast two-hybrid


Mathematical Notation

E(X)        Expectation of random variable X
P(X)        Probability of event X
D(A)        Distance matrix for protein A
r_{A,B}     Pearson correlation coefficient between A and B
|z|         Absolute value of z
E           Edge set
V           Node set
G ∼ (V, E)  Graph with nodes, v ∈ V, and edges, e ∈ E
d(v)        Degree of node v
C(v)        Clustering coefficient of node v
N(v)        Set of neighbours of node v
β(v)        Biological characteristic for node v
φ(e)        Biological characteristic for edge e
Π(G)        Network trait for graph G
∆(G, Φ)     Biological trait, φ, for graph G
η_{A,B}     Distance between phylogenetic topologies for proteins A and B
Γ_{A,B}     Similarity of topologies for proteins A and B


Acknowledgements

I would like to thank all the members of the Centre for Bioinformatics for providing

a helpful and friendly environment. In particular: Ino Agrafioti, Sara Dobbins, Isabel

Holmquist, Piers Ingram, Paul Kirk, Yussanne Ma, Ronald Stewart and Tom Thorne.

I thank my supervisors Niall Adams and Michael Stumpf for providing guidance, advice

and support throughout my research at Imperial College. David Stephens and Frank Kelly

also provided additional support through different sections of the project.

This thesis was greatly enhanced by those who have commented on and proofread it

throughout. Thanks go to Niall Adams, Paul Kirk, Katherine Sharrocks and Michael

Stumpf.

My friends have provided a huge amount of diversionary support over the last four years,

alleviating some of the pressure at key stages of my Ph.D. Thanks also go to Mum, Dad,

my brother and Kat for love and support throughout the project.

Finally, I am grateful for the funding and support from the Wellcome Trust for both the

Ph.D. and preceding master's degree, which enabled me to complete my studies at Imperial College.


Chapter 1

Introduction

This chapter presents an overview of this thesis and reviews background material used in

subsequent chapters. The scope and contents of the thesis are outlined (Section 1.1). The

genome and the interactome are introduced (Section 1.2). Comparative genomics is discussed

together with the use of phylogenetics for classifying potential protein interactions

(Section 1.3). Techniques used to generate protein interaction data are also introduced

(Section 1.4). Graph theory is introduced and relevant notation defined (Section 1.5).

The literature concerning error in protein interaction data and the notation used are

presented (Section 1.6).


1.1. THESIS OVERVIEW Introduction

1.1 Thesis overview

Systems biology is an emerging inter-disciplinary field which studies the function of biological

organisms using a breadth of experimental and computational approaches. A

primary aim of these systems approaches is to provide mechanistic, quantitative and predictive

models for the dynamics of biological interactions (Schwikowski et al., 2000;

Luscombe et al., 2001). Molecular information is used to study the interactions between

elements of a biological entity. These interactions form networks that are used to model

the overall system (Hintze and Adami, 2008).

This thesis explores the properties of a protein interaction network. The biological features

of proteins are used to generate network models of protein-protein interactions. Repeated

experimental data are used to form an estimate of the unknown number of distinct

interactions found in the interactome and to compare the different experimental techniques

that have been used to generate interaction network data.

1.1.1 Scope

This thesis assumes that biological characteristics can explain aspects of the structure and

evolution of protein interaction networks. However, the data are considered from two

distinct perspectives. First, a collection of empirical graphs are assumed to represent the

complete protein interaction network and are analysed. Second, the reported data are

assumed, as a result of experimental noise, to be a collection of true and false interactions

which form a subset of the complete interaction set. This latter view is used to model the

number of different possible protein interactions and assess the reliability of the published

data.

Algorithms that generate protein interaction networks are used either to understand how these

systems have evolved, or to provide a means of assessing the relevance of biological characteristics.

One aim of this thesis is to understand how statistical analyses of biological

characteristics on large scale interaction networks are affected by the choice of random

graph null models. The relevance of particular biological characteristics (which have been

used to find protein interactions previously) and the accuracy of current protein interaction

data are also considered. A further aim is to develop a means of elucidating the

complete set of possible protein-protein interactions from reported datasets. Finally, this

thesis serves as an overview of the current state of Saccharomyces cerevisiae interactome

data.

1.1.2 Outline

Chapter 1 presents the required biological background. This breaks down into an introduction

to genomics and comparative analyses before discussing recent research on

physical networks. Then the published literature regarding interaction data error rates and

possible sizes of protein interaction networks is introduced. The chapter concludes with

a discussion regarding how the interaction data have been generated.

Chapter 2 analyses the available interaction data in S. cerevisiae and presents the empirical

graphs that are subsequently used in this thesis. The protein interaction data for S.

cerevisiae are analysed in order to motivate the biological constraints used to generate

random networks. Due to the recent proliferation of new experimental data the analyses

are crucial both to reappraise previous results and to fix (in time) the context in which this

work is conducted.

Chapter 3 describes a variety of random graph ensembles. These graph ensembles are

motivated by the biological literature and factors considered relevant to protein interaction

network structure. The ensemble averages of various covariates are compared and

contrasted across the graph ensembles and the empirical data.

Chapter 4 uses the graph ensembles to test whether protein interactions have more similar

phylogenetic topologies than would be expected by chance in the random ensembles. This

is tested both for individual interactions as well as in the context of the graphs produced

by subsamples of the reported S. cerevisiae interactome.

Chapter 5 presents a model to determine the number of distinct interactions found in

the interactome. The relationship between the interactome size and the number of falsely

reported interactions is assessed. The model is also used to assess the number of interaction

reports required to use validated information to reliably classify the true interactions

from erroneous data. The error rates for different types of experimental interaction data

are compared along with predicted interactome sizes for protein-protein and protein-DNA

interactions in S. cerevisiae.

Chapter 6 draws the work together through a summary and general conclusion before a

discussion of future work that can be used to develop the methods and results presented

in this thesis.

1.1.3 Publications

Contributions from this thesis have been published and the references are:

1. (Stumpf et al., 2007) Stumpf, MPH, Kelly, WP, Thorne, T, and Wiuf, C. Evolution

at the system level: the natural history of protein interaction networks. Trends in

Ecology & Evolution, 22:366–373, 2007.

WPK completed analysis on S. cerevisiae PPI and evolutionary rate correlations

for this article. Figure 1.1 is from this article.

2. (Kelly and Stumpf, 2008) Kelly, WP and Stumpf, MPH. Protein-protein interactions:

from global to local analyses. Current Opinion in Biotechnology, 19:396–403, 2008.

WPK and MPHS wrote the article, WPK performed the data analysis and created

figures which also appear in Chapter 2 as Figures 2.3, 2.4, 2.5, and 2.9.


1.2. BIOLOGICAL SYSTEMS Introduction

1.2 Biological systems

Genomic techniques paved the way for the biological sciences to characterise the molecular

constituents of life (Bruggeman and Westerhoff, 2007). These constituents have been

found to organise and function through various systems of molecular interactions. Networks

can be used to describe biological interactions such as: the atomic interactions

occurring between protein structures; the interactions of metabolites and proteins during

specific cellular events such as the cell cycle; and, on a macroscopic level, the interrelationships

between organisms in an ecosystem (Alm and Arkin, 2003). Systems approaches

aim to develop an understanding of the inter-relationships between proteins,

metabolites or other molecules across organisms (Barabasi and Oltvai, 2004).

Modern high-throughput techniques, taking measurements on a system-wide level, are

well suited to the global analysis and modelling of networks and processes (LaCount

et al., 2005). The published data, when adequately verified, can be used to train computational

models, as well as to validate models that have been proposed (Shen et al., 2007).

In parallel, computational methods have the potential to reduce noise and systematic errors

(Gilchrist et al., 2004), whilst also forming a new means of providing constructive

feedback across in vitro and in vivo experiments.

The yeast species Saccharomyces cerevisiae is the study organism for this thesis. Multiple

studies have provided global analyses of its interactions (Gavin et al., 2006;

Hart et al., 2006). Its cells are approximately spherical, around ten micrometres in diameter,

and are easy to culture and to experiment upon. The species

has at least 5,800 distinct proteins and its genome sequence has 12,495,682 base pairs

(Hirschman et al., 2006). S. cerevisiae has a large number of proteins homologous to

human proteins, including cell cycle and signalling proteins, making it a good model organism

for experiments probing fundamental eukaryotic processes. S. cerevisiae is one

of the most intensively studied eukaryotic model organisms in molecular and cell biology

(Hong et al., 2008).

1.2.1 Genomes

Deoxyribonucleic acid (DNA) is the hereditary material of the vast majority of organisms.

DNA encodes all of the information required for the processes of individual cells, and

consequently the functions and inherited characteristics of organisms. The DNA of a cell

comprises that cell’s genome – the book of instructions.

Definition 1.1 (Genome) A genome is all the genetic information, the entire genetic complement

of the hereditary material, possessed by an organism.

DNA is a polymer consisting of four nucleotide bases: adenine (A), cytosine (C), guanine

(G) and thymine (T). Sequences of these nucleotides are joined by covalent bonds to

form strands of DNA, each strand of DNA forming hydrogen bonds with a second strand

in a specific manner known as complementary base pairing. DNA is composed of two

complementary strands in the shape of a double helix. Each base in a strand forms

hydrogen bonds with a specific partner base – adenine with thymine; cytosine with guanine.
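
The pairing rule can be illustrated with a minimal Python sketch. The function names and the example sequence are illustrative assumptions and do not come from the thesis; only the pairing rule itself (A with T, C with G) is taken from the text above.

```python
# Complementary base pairing: adenine with thymine, cytosine with guanine.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction.

    The two strands of a double helix run antiparallel, so the partner
    strand is conventionally written as the reversed complement.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    example = "ATGC"  # hypothetical sequence, for illustration only
    print(complement(example))
    print(reverse_complement(example))
```

Applying the reverse complement twice returns the original strand, which reflects the symmetry of the pairing rule.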

1.2.2 Proteins

A protein is formed from a DNA sequence. This sequence of bases, a gene, is ‘read’ using

the cellular enzyme RNA polymerase to produce an RNA copy of the DNA, which is in

turn ‘read’ by the cell’s ribosomes to produce a protein. Within each gene, some of the

DNA does not directly provide information that can be read to produce proteins; these are

non-coding sequences, or introns. The sequences that can be read

to produce proteins are coding sequences, or exons.

Proteins comprise sequences of the twenty different amino acids, joined together as per

the instructions found within the cell’s DNA. A stretch of DNA that codes for a single

protein is called a gene. As there are 20 amino acids that produce proteins but only 4

nucleotide bases found in DNA, a single nucleotide base cannot code for one amino acid;

nor can a pair of bases, which gives only 16 (4^2) combinations. A sequence of three

contiguous bases therefore forms a unit, or codon. 61 of the 64 (4^3) possible codons each

map to a fixed amino acid, and there is redundancy in the genetic code, with each amino

acid being coded for by up to 6 different codons. The 3 remaining codons are stop codons,

which signal termination of translation within messenger RNA. The methionine codon, or

start codon, initiates the production of a protein.
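
The codon counts quoted above can be checked with a short Python sketch. It assumes the standard genetic code, written in the DNA alphabet (T in place of the U found in messenger RNA) and laid out in the compact base ordering T, C, A, G. The variable and function names, and the example sequence, are illustrative and not taken from the thesis.

```python
from collections import Counter
from itertools import product

# Standard genetic code (assumed), one letter per codon in TCAG order;
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)

# Map each of the 4^3 = 64 codons to its amino acid (or stop signal).
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

STOP_CODONS = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

# Number of codons coding for each amino acid (the redundancy of the code).
DEGENERACY = Counter(aa for aa in CODON_TABLE.values() if aa != "*")


def translate(dna: str) -> str:
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE), "codons in total")
    print(sum(DEGENERACY.values()), "codons map to an amino acid")
    print(len(DEGENERACY), "distinct amino acids")
    print(max(DEGENERACY.values()), "codons at most per amino acid")
    print(STOP_CODONS)
    print(translate("ATGGCCTAA"))  # hypothetical sequence: start, Ala, stop
```

The table is built from a single string so that the correspondence between codon and amino acid can be inspected row by row.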

Definition 1.2 (Protein) A protein, p, is a sequence of amino acids defined by the DNA

sequence of a gene.

As the collection of DNA, including the genes, forms an organism’s genome, so the collection

of proteins expressed by an organism forms the organism’s proteome. Protein interaction

network research is concerned with the inter-relationships between the proteins

of a proteome.

Definition 1.3 (Proteome) A proteome, $P$, is the complete set of proteins, $\{p_1, \ldots, p_n\}$,

expressed by a genome.

1.2.3 Protein interactions

The functional operation of biological processes and systems is dependent on the interrelationships

between proteins. Understanding these interactions helps not only to elucidate

how the system works but also to increase our knowledge regarding the evolution of

organisms and function.

Protein interactions are observed using a variety of different experimental techniques, as

discussed in Section 1.4. Although the theoretical concept of a protein interaction

is well-defined, observed interactions are subject to errors and misclassification.

Accordingly, care has to be taken to note the difference between observed interactions

and true interactions. This is expanded upon in Section 1.6 where the space of possible

protein-protein interactions is defined, and in Chapter 5.

Definition 1.4 (Protein interaction) A protein interaction is the binding of a protein, p,

to another molecule.

Interactomes form the complete set of possible molecular interactions which can occur

within the cell (Sanchez et al., 1999). These sets may include interactions between any

type of biological molecule, including proteins. Throughout this thesis, the complete set

of interactions found within the proteome, P , comprises an interactome – or a protein

interaction network (PIN). Consequently, the complete interactome may form a superset

of the interactions that may occur within a particular individual or environment for the

studied system. The networks observed in this thesis involve two types of protein based

interaction:

• physical: the binding (molecules join to form combined structure) of two different

proteins, a protein-protein interaction (PPI) (Collins et al., 2007a; Tarassov et al., 2008).

• genetic: the binding of a protein to a component of the genetic sequence (Boone

et al., 2007).

Definition 1.5 (Interactome) An interactome is the complete collection of biological interactions, of a given type, found within an organism. This is the set of interactions that can be detected experimentally under any conditions, in vivo or in vitro, for the defined set of molecules.

Protein domains are the components of a protein that have been found to exist independently of the rest of the structure, for instance units that have been found in several different proteins. In general, physical protein interactions are found to exist between particular pairs of domains. A typical PPI is shown in Figure 1.1 where two proteins are shown bound to each other, and the protein domains involved in binding are depicted in different colours.

Figure 1.1: Interacting proteins. This shows the structure of the porcine pancreatic α-amylase (blue) bound with a bean lectin-like inhibitor (protein with two domains: red and yellow) (Gilles et al., 1996). The interaction occurs solely between the blue and red domains.

1.2.4 HIV example

This example is given to put the previously detailed theory into the context of a model

organism – the HIV-1 provirus. Figure 1.2 shows the HIV virion. This is a well-studied

virus with a complete genome less than 10 kilobases long, encoding 15 different proteins.

Although this virus has a short genome sequence, researchers are interested in both the set of relationships within its proteome, and those that can occur between human and HIV proteins (Sharrocks, 2007).

HIV-1 has been reported to interact with 1,448 human proteins (Ptak et al., 2008). This involves 2,589 HIV-1 to human protein interactions. Across the different proteins of HIV-1, a single regulatory protein, Tat (Trans-Activator of Transcription), participates in around a third of these unique interactions. The high number of interactions, and the disproportionate number of reported interactions with only one of the proteins, shows the potential inhomogeneity of networks even when such a small set of proteins is considered. This illustrates the possible scale of the human interactome, which contains interactions between 20,000-24,000 proteins, and emphasises the need to first understand model examples. For model organisms, the interactions of interest are those that can occur between proteins found solely in the organism’s own genome sequence: those interactions forming the organism’s interactome.

Figure 1.2: HIV virion. HIV-1 provirus is approximately 9.2kb in length, encoding 15 distinct proteins (image from Sharrocks (2007)).


1.3. COMPARATIVE GENOMICS Introduction

1.3 Comparative genomics

The field of comparative genomics mirrors biological studies that have been conducted

for decades in ecology and the study of taxa diversity (MacDonald, 1979; Bangert et al.,

2006; Pratt et al., 2008). Comparisons are made in order to understand organic diversity

and to appreciate the role of evolution in creating that diversity (Allen et al., 2005; Bangert

et al., 2006).

Evolutionary work is complicated by a lack of direct knowledge of the history of different

organisms, although data for some model organisms are more readily available than

others. Studies of model organisms with short generation times – such as Escherichia

coli (Konagurthu and Lesk, 2008) or S. cerevisiae (Wolfe, 2006) – have been most useful

when aiming to understand the links between characteristics of genomes, genes and

proteins.

1.3.1 Sequence alignment

Sequence alignment provides a means of measuring similarity between strings: in this

case for DNA, RNA, or amino acid sequences. Sequences that show high levels of similarity

may be linked by function, structure, or close evolutionary relationships (Yang,

2006). Sequences are aligned according to a scoring matrix; a multiple sequence alignment (MSA) is shown in Figure 1.3. The scoring matrix depends on the alphabet of possible items found in the sequences.

Definition 1.6 (Alphabet) An alphabet is the set of letters that make up the possible items in a code, e.g. the DNA genetic code has an alphabet of $\{A, C, G, T\}$.

Definition 1.7 (Sequence) A sequence, of length $q$, is a $q$-tuple of letters, $(a_1, a_2, \ldots, a_q)$, where each letter, $a_i$, is in an alphabet $\mathcal{A}$: $a_i \in \mathcal{A}$.
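As an illustration of how a scoring matrix defined on an alphabet is used, the following sketch scores one fixed pairwise alignment column by column. The match, mismatch and gap values are arbitrary choices made for the example and are not taken from any published matrix; the function names are hypothetical.

```python
DNA_ALPHABET = frozenset("ACGT")
GAP = "-"

def is_sequence_over(seq, alphabet=DNA_ALPHABET):
    """Check that every letter of the sequence belongs to the alphabet."""
    return all(letter in alphabet for letter in seq)

def make_scoring_matrix(alphabet, match=1, mismatch=-1):
    """Illustrative matrix: one score for identical letters, one for differing letters."""
    return {(a, b): (match if a == b else mismatch)
            for a in alphabet for b in alphabet}

def alignment_score(row_a, row_b, matrix, gap_penalty=-2):
    """Sum the column scores of two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == GAP or b == GAP:
            score += gap_penalty      # a column containing a gap
        else:
            score += matrix[(a, b)]   # a column pairing two letters
    return score

matrix = make_scoring_matrix(DNA_ALPHABET)
print(alignment_score("ACGT-A", "ACCTGA", matrix))
```

For amino acid sequences the same function applies with a matrix defined over the twenty-letter alphabet.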

Sequence similarities are commonly interpreted as indicating some biologically relevant

link between the studied DNA, RNA or protein (Ramani et al., 2008). Alignments are

used to classify sequences of unknown function or origin by inference from sequences

that have been more often studied. Accordingly, the roles of genes and proteins in non-model organisms can be inferred from work completed on experimental species using sequence alignments (Lehner and Fraser, 2004).

Figure 1.3: Sequence alignment. An example section of an MSA of five similar amino acid sequences from different yeast species. Each dash, ‘-’, represents a gap in the alignment.

Differences between sequences can be used to determine the probable evolutionary history

of biological sequences (Felsenstein, 1984). Under appropriate models of genetic

evolution, the genetic distances between different genes, proteins, or organisms can be

found through the alignment of relevant genetic material. These alignments are used to

produce phylogenetic trees that represent the relationships between different biological

samples, as discussed further in Section 1.3.2.

BLAST (Altschul et al., 1990) is a program used to compare sequences such as DNA or

amino acids. A BLAST search is performed on a sequence of interest, in general a complete

gene or protein, and this is aligned against a library of other sequences. The BLAST

alignment algorithm is similar to the Smith and Waterman (1981) algorithm for sequence

alignment, producing an ordered list of sequences similar to the query information. This

list may include homologous proteins.
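To make the idea of an ordered list of similar sequences concrete, the sketch below computes Smith-Waterman local alignment scores and ranks a small library against a query. It is not BLAST: BLAST relies on heuristics that are not shown here, and the scoring values, linear gap penalty and function names are assumptions chosen for illustration.

```python
def smith_waterman_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a simple linear gap penalty."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]  # dynamic programming table
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(
                0,                              # start a new local alignment
                H[i - 1][j - 1] + substitution,  # align the two letters
                H[i - 1][j] + gap,               # gap in seq_b
                H[i][j - 1] + gap,               # gap in seq_a
            )
            best = max(best, H[i][j])
    return best

def rank_library(query, library, **scoring):
    """Return (score, name) pairs, highest score first (ties by name)."""
    scored = [(smith_waterman_score(query, seq, **scoring), name)
              for name, seq in library.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical library of three short sequences.
library = {"seq1": "ACGT", "seq2": "TTTT", "seq3": "GGACGGG"}
print(rank_library("ACGT", library))
```

The quadratic cost of filling the table for every library sequence is the reason faster heuristic searches are preferred for large libraries.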

Definition 1.8 (Homology) Proteins are homologous if they have evolved from a common

ancestor. This may be indicated through a high level of sequence similarity, as

measured by alignment.

BLAST searches, on appropriate libraries of distinct individual species, can be used to

find orthologous proteins, as demonstrated in Figure 1.3. Paralogous proteins are found

by the same means, but using a library of sequences from the same proteome as the query

protein. Paralogous proteins are brought about by a gene duplication event in an ancestral

species. These duplication events are a consequence of some error which increases the

size of the genome by replicating some subset of the DNA through a variety of means

including whole chromosome duplications (Hakes et al., 2007b).

Paralogous genes, at the point of duplication, are in general identical to genes already

found in the genome. This leads to functional redundancy, as it is often not advantageous to have two identical genes. Consequently, mutations that disrupt the structure and function of one of the two genes are not selected against, enabling novel functions or other forms of evolutionary development to arise (Zhang, 2003). Duplication events, and the subsequent evolution of redundant genes, are an important driver of PIN development (Pastor-Satorras et al., 2003).

Definition 1.9 (Orthology) Orthologous proteins have evolved from a common ancestor,

separated by a speciation event. Orthologous proteins are homologous and found in

different species.

Definition 1.10 (Paralogy) Paralogous proteins have a common ancestor in the same

species, arising due to a gene duplication. Paralogous proteins are homologous and

occur in the same species.

1.3.2 Phylogenetic trees

Phylogenetic trees are used to represent evolutionary relationships between genomes,

genes or proteins. Differences between sequences are used to reconstruct a branching

process of divergence from a common ancestor, resulting in a diagrammatic representation

of the historical evolutionary relationships between different entities (see Figure 1.4).

Definition 1.11 (Phylogenetic tree) A phylogenetic tree details the inferred evolutionary

relationships among a set of species, genes, or proteins.

Phylogenetic trees not only depict information as to the evolution of particular sequences,

but may also inform about theoretical inter-relationships between sequences (such as their

participation in PPIs (Sato et al., 2003)).

The phylogenetic trees considered have two main components: the distances between

sequences and a set of branching events that represent when sequences have diverged. A branching event, where a line splits in Figure 1.4, describes a common ancestor diverging into distinct entities. Branching, or divergence, events are characterised by the topology of the phylogenetic tree. The model used to reconstruct the phylogenetic trees can produce two sets of possible topologies: bifurcating or multifurcating.

Figure 1.4: A phylogenetic tree. This shows a sample phylogenetic tree, with time flowing from left to right, for 6 Saccharomyces species. The approximate numbers of million years (Myr) to the common ancestors between S. cerevisiae and S. paradoxus, S. mikatae and S. kudriavzevii are shown (Wolfe, 2006).

Definition 1.12 (Bifurcating tree) Bifurcating trees are such that a branching event results

in exactly two divergent sequences, a binary tree.

Definition 1.13 (Multifurcating tree) Multifurcating trees are such that a branching event

can result in any number of divergent sequences.

Figure 1.4 shows a tree on 6 different yeast species. This tree may also be represented using a bracket notation, representing the locations of branching events between different lineages. For example, the topology of the tree in Figure 1.4 is: (S. castellii, (S. bayanus, ((S. cerevisiae, S. paradoxus), (S. mikatae, S. kudriavzevii)))). Within any pair of brackets, the ordering is irrelevant, e.g. for two species (S. cerevisiae, S. paradoxus) = (S. paradoxus, S. cerevisiae). This is a bifurcating tree as each branching event divides the sequences into two subsets.

Multifurcating trees can have any number of sets at each branching event. This includes,

for three species, the tree topology: (S.cerevisiae, S.paradoxus, S.mikatae). The set

of multifurcating trees includes all bifurcating tree topologies on the same number of

sequences.
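The bracket notation maps naturally onto nested tuples. The sketch below, which assumes that all leaf labels are distinct, converts a tree into an order-independent form so that two bracketings can be compared, and tests whether a tree is bifurcating. The function names are hypothetical.

```python
def canonical(tree):
    """Order-independent form of a nested-tuple tree (leaf labels assumed distinct)."""
    if isinstance(tree, str):
        return tree  # a leaf
    return frozenset(canonical(child) for child in tree)

def same_topology(tree_a, tree_b):
    """True if the two bracketings describe the same branching events."""
    return canonical(tree_a) == canonical(tree_b)

def is_bifurcating(tree):
    """True if every branching event splits into exactly two subsets."""
    if isinstance(tree, str):
        return True
    return len(tree) == 2 and all(is_bifurcating(child) for child in tree)

YEAST_TREE = ("S. castellii",
              ("S. bayanus",
               (("S. cerevisiae", "S. paradoxus"),
                ("S. mikatae", "S. kudriavzevii"))))

print(is_bifurcating(YEAST_TREE))
print(same_topology(("S. cerevisiae", "S. paradoxus"),
                    ("S. paradoxus", "S. cerevisiae")))
print(is_bifurcating(("S. cerevisiae", "S. paradoxus", "S. mikatae")))
```

Because a multifurcating tree places no restriction on the number of children, every tree accepted by `is_bifurcating` is also a valid multifurcating tree.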

A variety of different methods are used to construct phylogenetic trees that detail the evolutionary

linkages between different sequences, although they can be described broadly as

falling into the following categories: parsimony; maximum likelihood; or distance methods.

They produce a tree given a sequence alignment for a collection of sequences, $A_i$. Let $a_{i,j}$ be letter $j$ from sequence $i$ in the alignment.

Parsimony methods are non-parametric approaches used to find phylogenies. A maximum

parsimony method assigns a model of evolutionary change onto the sequence alphabet.

Then the best tree is found by determining the minimum number of letter changes required to match letter $j$ for all sequences, $A_i$. Each step is a change, for a given $a_{i,j}$, from

one letter to another. The algorithm aims to produce a phylogenetic tree that minimises

the total number of changes across the alignment.
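Counting the changes required on one candidate tree can be sketched with Fitch's algorithm, a standard procedure for bifurcating trees in which every change has equal cost. The thesis does not name a specific algorithm, so this choice is an assumption, and the search over candidate trees is not shown.

```python
def fitch(tree, site):
    """Return (possible ancestral letters, minimum changes) for one alignment column.

    tree: nested 2-tuples of leaf names; site: dict mapping leaf name to its letter.
    """
    if isinstance(tree, str):
        return {site[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, site)
    right_states, right_changes = fitch(right, site)
    shared = left_states & right_states
    if shared:
        # The children can agree on a letter: no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: one additional change is required.
    return left_states | right_states, left_changes + right_changes + 1

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[j] for name, seq in alignment.items()})[1]
        for j in range(length)
    )

# Hypothetical four-sequence alignment and two candidate topologies.
alignment = {"a": "AC", "b": "AC", "c": "GC", "d": "GT"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

Under maximum parsimony the candidate with the smaller score would be preferred.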

Maximum Likelihood methods are parametric, employing some probability model of

sequence evolution. They use expected patterns of mutational change, alongside the

probability model used, to find the most likely tree arrangement. These maximum likelihood

(ML) methods, including algorithms found in the Phylogeny Inference Package

(PHYLIP) (Felsenstein, 1995) or Phylogenetic Analysis by Maximum Likelihood (PAML)

(Yang, 2004), take a model of evolutionary change for the letters of the sequences being

considered – e.g. amino acids for a set of aligned protein sequences. The model assumes

this pattern of evolutionary change, and then assesses the probability of each potential tree

arrangement for every position – i.e. letter j – of the sequence alignment. The tree that

is the most likely, after permuting through all possible combinations, is the phylogeny for

the sequence alignment assessed (Mount, 2004).

Distance methods produce trees based on the number of differences between sequences

found in a MSA. For instance, neighbour-joining algorithms produce a phylogeny by

adding the most similar sequence as an additional branch to a given tree by using the

evolutionary distances found between the sequences.
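The input to a distance method can be sketched as follows. The example uses the proportion of differing ungapped positions (an uncorrected distance, chosen here for simplicity) and simply reports the closest pair of sequences; the full neighbour-joining procedure uses a corrected joining criterion that is not reproduced here.

```python
def p_distance(row_a, row_b, gap="-"):
    """Proportion of ungapped alignment columns at which two rows differ."""
    compared = differences = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # ignore columns containing a gap
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared

def distance_matrix(msa):
    """Symmetric matrix of pairwise distances between the rows of an MSA."""
    n = len(msa)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = p_distance(msa[i], msa[j])
    return D

def closest_pair(D):
    """Indices (i, j), i < j, of the two most similar sequences."""
    n = len(D)
    pairs = ((i, j) for i in range(n) for j in range(i + 1, n))
    return min(pairs, key=lambda pair: D[pair[0]][pair[1]])

# Hypothetical three-row alignment.
msa = ["ACGT", "ACGA", "TTGA"]
D = distance_matrix(msa)
print(D, closest_pair(D))
```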

1.3.3 Correlated evolution

Studies have asserted linkage between the evolutionary rate of proteins and PPIs (Pellegrini

et al., 1999; Goh and Cohen, 2002; Gertz et al., 2003; Pazos et al., 2005). For example, chemokines and their corresponding receptors show evidence for correlated evolution

reflected by similarity of their respective phylogenetic trees (Goh et al., 2000). In the case

of TGFβ ligands and their receptors (Gertz et al., 2003), topological similarities between

closely related proteins’ phylogenies have been used to find novel PPIs.

Definition 1.14 (Correlated evolution) The level of correlated evolution between two

proteins is the linkage, or correlation, found between the evolutionary rates of the two

proteins.

Pellegrini et al. (1999) introduced the phylogenetic profile as whole genome sequences

became widely available. Phylogenetic profiles have been used to infer the complexes or

pathways in which an unknown protein participates, or to predict protein function (Loganantharaj

and Atwi, 2007).

Definition 1.15 (Phylogenetic profile) A phylogenetic profile for a protein is an n-bit

string which details whether an orthologue exists, defined by some threshold on sequence

similarity, for the protein in each of n distinct species.
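A phylogenetic profile can be sketched as a string of bits built from the best sequence similarity found in each species. The similarity values, species names and threshold below are hypothetical, and the Hamming distance is used only as one simple way of comparing two profiles.

```python
def phylogenetic_profile(best_similarity, species, threshold):
    """Build an n-bit string: '1' where an orthologue is called in that species.

    best_similarity maps species name to the best similarity score found there;
    a species with no hit is treated as having score 0.
    """
    return "".join(
        "1" if best_similarity.get(name, 0.0) >= threshold else "0"
        for name in species
    )

def hamming(profile_a, profile_b):
    """Number of species in which the two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same species")
    return sum(a != b for a, b in zip(profile_a, profile_b))

species = ["sp1", "sp2", "sp3", "sp4"]               # hypothetical species
hits_a = {"sp1": 0.9, "sp2": 0.2, "sp4": 0.7}         # hypothetical scores
hits_b = {"sp1": 0.8, "sp3": 0.6, "sp4": 0.9}
profile_a = phylogenetic_profile(hits_a, species, threshold=0.5)
profile_b = phylogenetic_profile(hits_b, species, threshold=0.5)
print(profile_a, profile_b, hamming(profile_a, profile_b))
```

Proteins with identical or near-identical profiles are the candidates for shared pathways or complexes.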

The mirrortree approach is based on an observation that interacting or functionally related

proteins have similar phylogenetic trees (Juan et al., 2008a). The mirrortree algorithm

(Pazos and Valencia, 2001; Juan et al., 2008b) uses MSA of orthologous sequences, and

the underlying species distance matrix, to help predict PPIs. The correlation between distance

matrices for proteins is used to help find potential PPIs. Distance matrices detail

the evolutionary time separating sequences based on a probability model of evolution between

the sequence alphabet. An n × n matrix contains information on distances between

n sequences. Suppose we have found homologous proteins for A and B in n species.

Definition 1.16 (Distance matrix) A distance matrix, $D$, is a two-dimensional array where each entry, $d_{i,j}$, is the distance between sequence $i$ and sequence $j$.

The basic mirrortree algorithm uses distance matrices for each protein of a proteome to aid PPI classification (Pazos and Valencia, 2001). Let the distance matrix, $D(A)$, for protein $A$ be defined such that $d_{i,j}(A)$ is the evolutionary distance between homologous proteins found in species $i$ and $j$. Let the correlation of the evolutionary rates of $A$ and $B$ be the Pearson correlation coefficient, $r$, of the two distance matrices:

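$$
r_{A,B} = \frac{\sum_{i<j} \left( d_{i,j}(A) - \bar{d}_{i,j}(A) \right) \left( d_{i,j}(B) - \bar{d}_{i,j}(B) \right)}{\sqrt{\sum_{i<j} \left( d_{i,j}(A) - \bar{d}_{i,j}(A) \right)^2 \, \sum_{i<j} \left( d_{i,j}(B) - \bar{d}_{i,j}(B) \right)^2}}
$$

where $\bar{d}_{i,j}(A)$ denotes the mean of the distances $d_{i,j}(A)$ and the sums are taken over pairs of species $i < j$.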
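The Pearson correlation between two distance matrices can be sketched as below. The sketch assumes that both matrices are symmetric, list the same $n$ species in the same order, and that only the entries with $i < j$ are compared; the function names are hypothetical.

```python
from math import sqrt

def upper_triangle(D):
    """Entries d[i][j] with i < j, read row by row."""
    n = len(D)
    return [D[i][j] for i in range(n) for j in range(i + 1, n)]

def mirrortree_r(D_a, D_b):
    """Pearson correlation between the off-diagonal distances of two matrices."""
    x, y = upper_triangle(D_a), upper_triangle(D_b)
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("matrices must cover the same species (at least three)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant distances")
    return covariance / sqrt(spread_x * spread_y)

# Hypothetical distance matrices for proteins A and B over three species.
D_A = [[0, 1, 2], [1, 0, 3], [2, 3, 0]]
D_B = [[0, 2, 4], [2, 0, 6], [4, 6, 0]]
print(mirrortree_r(D_A, D_B))
```

A value close to 1 indicates that the two proteins have diverged at proportional rates across the species considered.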


Definition 1.17 (Co-evolution) Co-evolution of two proteins occurs if divergent changes

in one protein are complemented by compensatory changes in the second protein.

Distance matrix methods assume evidence of co-evolution to justify the PPI predictions

(Jothi et al., 2005, 2006), rather than correlated evolution – as highlighted in Hakes et al.

(2007a). It is important to differentiate between these concepts, as the distinction guides the interpretation of results and ensures that the biological characteristics are correctly defined. The coefficient itself, $r_{A,B}$, details linkage in the evolutionary rates and nothing about how the

proteins have actually evolved. Co-evolution, on the other hand, requires evolutionary

changes to be complementary between the candidate proteins.

Whilst evidence of co-evolution implies correlated evolution, the opposite does not hold,

as a similarity of evolutionary rates does not mean the mutations are necessarily compensatory.

Correlated evolution may simply reflect the evolutionary divergence that has occurred, which has been linked with the expression rates of individual proteins rather than with PPIs

(Jordan et al., 2003; Agrafioti et al., 2005; Drummond et al., 2006).

Chapter 4 explores the possible link of evolution with PPIs focusing on the topology

of phylogenetic trees. Sequence alignment tools are employed to identify orthologous

proteins that are used to construct phylogenetic trees for each protein in S. cerevisiae.

The topologies are then used to assess whether interacting proteins’ phylogenies are more

similar in observed PINs than expected through comparison to the properties of randomly

generated networks.


1.4. PROTEIN INTERACTION DATA Introduction

1.4 Protein interaction data

Experimental mapping of biological networks is challenging and requires considerable

resources and effort. A collection of large protein interaction datasets exists for some

organisms (Hermjakob et al., 2004; Breitkreutz et al., 2008). Large quantities of protein

interaction data, for model organisms such as S. cerevisiae, became available as a

consequence of high-throughput experimental technologies (Bader et al., 2008). These

experiments report thousands of putative interactions each year (Collins et al., 2007a).

In contrast, relatively few interactions had been reported in total before the turn of the

century.

The large number of reported protein interactions represents the work of dozens of groups

over many years (e.g. Uetz et al. (2000); Ito et al. (2001); Gavin et al. (2002); Lappe and

Holm (2004)). The techniques used across these groups vary considerably but may be divided

into two broad categories: traditional methods that delineate individual interactions

or the interactions between a small number of proteins; and high-throughput methods that

probe thousands of possible interactions simultaneously. Techniques analyse different

genetic, biochemical or physical traits, probing various subsamples of the complete set

of protein pairs. These studies have helped to enable resources such as the Gene Ontology

relational database to be developed (Camon et al., 2004; Hong et al., 2008) and have

allowed protein interaction maps to be produced.

Figure 1.5 shows an example PIN based on the data from two S. cerevisiae based experiments.

Each protein has a variety of biological characteristics and every interaction can

be described by a collection of biological and experimental details. Although a protein interaction either occurs or does not occur between each protein pair, the available experimental data are often expressed quantitatively rather than qualitatively. This output is often reduced to binary information for each presented protein pair to enable its simple description

as a graph, as shown in Figure 1.5. The types of experiment used to generate the

static S. cerevisiae PPIs used throughout this thesis are described in this section.
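The reduction of quantitative output to a binary graph can be sketched as a thresholding step. The protein names, scores and threshold are hypothetical, edges are treated as undirected, and self-interactions are dropped purely to keep the example simple.

```python
def binarise(scores, threshold):
    """Keep an undirected edge for each protein pair scoring at or above the threshold.

    scores maps (protein_a, protein_b) to a quantitative experimental score.
    """
    edges = set()
    for (a, b), score in scores.items():
        if a != b and score >= threshold:  # simplification: ignore self-pairs
            edges.add(frozenset((a, b)))
    return edges

def adjacency(edges):
    """Neighbour sets for every protein that has at least one edge."""
    neighbours = {}
    for edge in edges:
        a, b = tuple(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return neighbours

# Hypothetical scores; the pair (P1, P2) was measured in both orientations.
scores = {("P1", "P2"): 0.9, ("P2", "P1"): 0.4,
          ("P2", "P3"): 0.2, ("P1", "P3"): 0.6}
edges = binarise(scores, threshold=0.5)
print(adjacency(edges))
```

The choice of threshold determines which pairs become edges, so the resulting graph is a property of the analysis as well as of the experiment.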

1.4.1 Traditional methods

Traditional or small-scale experiments (SSE) are mainly hypothesis-driven tests that aim to answer specific biological questions (Cusick et al., 2009). They focus on understanding biochemical properties, binding affinities, or how processes are performed through combinations of protein interactions. PPI data from these techniques are limited to those proteins targeted to address hypotheses of interest.

Figure 1.5: Example protein interaction network. The nodes represent proteins, whilst each edge represents an interaction reported between the two proteins that it joins. The data are from Yuan et al. (2001) and Gurunathan et al. (2002), which form a subset of the BioGRID database. Different colours represent different biological process annotations that have been assigned to the proteins; an annotation is a label concerning the biological properties of the protein.

Fluorescence resonance energy transfer (FRET) generates information on interacting proteins

and provides in vivo spatial information using spectroscopy (Andrews and Demidov,

1999; Raveh et al., 2009). The proteins of interest are associated with complementary fluorophores

that fluoresce when located closely together. FRET can be used to observe

proteins binding as well as the abundance of the bound structure in vivo.

X-ray crystallography is used to determine molecular structures at the atomic level. Performing X-ray crystallography on these structures, which may include

bound proteins, provides information about how the constituent parts bind with each other

(Meinke et al., 2008). Other structural methods, such as nuclear magnetic resonance

(NMR) (Freifelder, 1982), can also be used to provide similar information about protein

complexes (Kiel et al., 2008).

Atomic force microscopes, which can measure to a resolution of fractions of a nanometre,

can be used to measure interaction forces. These microscopy methods enable the analysis

of protein interactions at the molecular level, but only for single interactions (Gaczynska

et al., 2004).

1.4.2 High-throughput methods

High-throughput (HTP) experiments aim to survey as large a number of PPIs as possible

using technology that can be scaled to test thousands of protein pairs (Cusick et al., 2009).

These techniques can be readily automated and generally report more interactions than

SSEs. Coverage (the protein pairs tested) of the interactome space depends on a variety

of experimental limitations including unknown systematic bias and the inability to test

certain proteins. Some HTP experiments may also exhibit bias towards testing proteins of

particular function or interest.

Affinity capture, including co-immunoprecipitation, with mass spectrometry (Ho et al.,

2002; Gavin et al., 2002) and yeast-two-hybrid (Ito et al., 2001; Uetz et al., 2000) techniques

have been used extensively to identify PPIs. Mass spectrometry (MS) analyses

proteins in vitro by producing peptide ions which are recognized by their mass-to-charge

ratios and consequently can be directly associated to particular proteins.

Yeast two-hybrid (Y2H) experiments require a transcription factor gene that produces

two protein domains, DNA-binding and DNA-activating, which are both essential for

the transcription of an associated reporter gene. The DNA-binding and DNA-activating

domains, which are required in close proximity for the reporter gene to be transcribed,

are separated for the Y2H experiment. A protein of interest (bait) is fused to the DNA-binding domain, and another protein (prey) is fused to the DNA-activating domain. Neither fusion protein, nor any of the four original parts, is sufficient alone to initiate the transcription of the reporter gene. The bait and prey are reported to bind if the

reporter gene is transcribed when the two fused proteins are present (Ito et al., 2001).

There is a lack of symmetry when using bait-prey techniques. Whether a reported interaction

can be replicated with the bait and prey swapped is important when determining

the reliability of reported PPIs (Scholtens et al., 2008). The interaction characteristics of

each protein cannot be assumed to be just the collection of all observed interactions as

the data contain noise. Information on the context of the experimental technique can also

help to improve confidence in the predictions.
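Checking whether a reported interaction is replicated with bait and prey swapped can be sketched as follows. The protein names are hypothetical, and a one-way observation may simply mean that the reverse orientation was never tested, which this sketch cannot distinguish from a failed replication.

```python
def reciprocity(observations):
    """Split unordered pairs by whether both bait-prey orientations were reported.

    observations: iterable of (bait, prey) tuples.
    Returns (reciprocal_pairs, one_way_pairs) as sets of frozensets.
    """
    directed = set(observations)
    reciprocal, one_way = set(), set()
    for bait, prey in directed:
        if bait == prey:
            continue  # self-pairs have no second orientation to check
        pair = frozenset((bait, prey))
        if (prey, bait) in directed:
            reciprocal.add(pair)
        else:
            one_way.add(pair)
    return reciprocal, one_way

# Hypothetical screen results as (bait, prey) pairs.
observed = [("A", "B"), ("B", "A"), ("A", "C")]
both_ways, single = reciprocity(observed)
print(both_ways, single)
```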

A variety of other high-throughput experimental methodologies have been used to populate

protein interaction databases (Shoemaker and Panchenko, 2007a), some of which are

detailed in Table 1.1. These methodologies probe subtly different biological traits, from

which putative interactions have been derived. For instance, gene co-expression studies

observe functional linkages between proteins rather than physical binding relationships

(Bhardwaj and Lu, 2005). These identify different types of interaction, helping to populate

the database of functional (or other biological) characteristics for individual proteins.

Method               Interaction       Assay
Yeast two-hybrid     binary            in vivo
Mass spectrometry    complex           in vitro
Protein microarray   binary, complex   in vitro
Gene co-expression   functional        in vitro
Synthetic lethality  functional        in vivo

Table 1.1: HTP experimental methodologies. Methods used to find different types of protein-protein association, including protein-protein interactions.

Reliability issues in a range of contemporary HTP studies have been highlighted by von

Mering et al. (2002) using a benchmark reference set of thousands of protein interactions.

The putative interactions reported by each mapping showed little agreement relative to

the number of interactions that each global study presented (Ito et al., 2001; Uetz et al.,

2000). Assuming that each method probed the same protein pairs, this suggested a

low true-positive rate, a high false-positive rate, or a combination of both.

von Mering et al. (2002) reported that a variety of the new techniques had FDRs of between

90% and 99%. However, these error rates were estimated by the overlap between a

previously known reference set of PPIs and the new data. Although a flawed comparison

– if all the interactions were already known the experiments were pointless – it highlighted

the skepticism that some contemporary approaches provoked. This skepticism

has resulted in the analysis and development of several techniques designed to estimate

and account for noise (Nariai et al., 2005; Shoemaker and Panchenko, 2007b).

1.4.3 Interaction inference

Interaction data, as well as being reported directly, can also be inferred from other biological

association studies. Techniques such as gene co-expression probe functional linkages

between proteins in vitro. Consequently, the collection of PPIs also includes a body of

inferred evidence that has been derived from these studies alongside the binary PPI data

found directly.

Protein complexes

Protein complexes have added indirect evidence for binary interaction partners. Each

complex is a collection of proteins that have been found to bind as a multi-protein structure

(i.e. one with more than 2 proteins). Krogan et al. (2006) and Gavin et al. (2006)

reported on protein complexes in S. cerevisiae: Krogan et al. (2006) found 547 distinct

complexes, averaging just under 5 proteins per complex; whilst Gavin et al. (2006) published

491 complexes.

There are a variety of methods that can identify pairwise interactions from complex experiment

results. These include the matrix and spoke models (Bader and Hogue, 2002;

Hakes et al., 2007c). Each method, having observed a multi-protein complex, assigns

some subset of the protein pairs as binary interactions, according to some structural argument.

These interactions may not actually have been observed, but are generally reported

along with the complete structure that makes up the protein complex.

Figure 1.6 illustrates these assignments for a toy example of a complex. The complex is

made up of 3 core proteins, which are always present, and a selection of non-essential periphery

proteins.

(a) Matrix The matrix approach assigns protein interactions to all possible pairs found

to co-occur in the experimentally observed complex. This ignores the possibility that

each protein may not actually bind with every other protein, but is used to infer pairwise

interactions from observed complexes.

(b) Spoke The spoke model refines the set of interactions that are assigned based on the

complex found. A subset of the proteins, such as the core proteins found, are assumed to

interact with all other members of the complex.

(c) Observed topology The topology approach observes the actual topology of the protein

complex, assigning an interaction if the topology suggests the proteins actually bind

to each other.
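
The matrix and spoke assignments described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited study: the protein names are hypothetical, and treating every core protein as a spoke hub is an assumption made here for the toy example (other uses of the spoke model choose a single hub).

```python
from itertools import combinations


def matrix_pairs(complex_members):
    """Matrix model: every pair of co-occurring proteins is assigned an interaction."""
    return {tuple(sorted(pair)) for pair in combinations(set(complex_members), 2)}


def spoke_pairs(complex_members, hubs):
    """Spoke model: each hub protein is paired with every other complex member."""
    members = set(complex_members)
    return {tuple(sorted((h, m))) for h in hubs for m in members if m != h}


# Toy complex: three core proteins and two periphery proteins (hypothetical names).
core = ["C1", "C2", "C3"]
periphery = ["P1", "P2"]
complex_members = core + periphery

matrix = matrix_pairs(complex_members)
spoke = spoke_pairs(complex_members, hubs=core)  # assumption: all core proteins act as hubs

print(len(matrix))  # 10 pairs: all 5-choose-2 combinations
print(len(spoke))   # 9 pairs: the periphery-periphery pair is not assigned
```

Under these assumptions the spoke assignment is a subset of the matrix assignment; the only pair it omits is the one between the two periphery proteins.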

Hakes et al. (2007c) studied the differences between these methods to assess whether the

spoke or matrix assignments produced a higher proportion of false-positive interactions.

[Figure 1.6 panels: a protein complex of core and periphery proteins, shown as (a) Matrix, (b) Spoke, (c) Observed]

Figure 1.6: Complex interaction models. Each complex is made up of core and periphery proteins, and binary interactions can be inferred through: (a) the matrix method, which assigns interactions between all possible protein pairs; (b) the spoke method, which attributes interactions from a protein to all other proteins; or (c) a method by which interactions are assigned according to the molecular structure that has been experimentally observed.

Analysis of S. cerevisiae protein complexes showed that smaller complexes are best described

by the matrix model. If the number of proteins in the complex exceeds 5 the spoke

model is a better means of inferring pairwise PPIs.

Ultimately, it is important to note that the biological structures formed by proteins binding

are not solely the result of pairwise interactions. The full collection of protein interactions

includes a set of binary interactions and a separate (possibly overlapping) set of multi-protein

complexes. The sets’ properties are not necessarily identical.

1.4.4 Computational predictions

In silico methods predict protein and domain interactions using trait information, by considering

a variety of physical or functional associations (see Table B.1 in Appendix B).

However, comparisons with the small reference sets of known interactions suggest that

even the most successful novel PPI prediction methods suffer from high false-positive and

false-negative rates (Lu et al., 2005; Mika and Rost, 2006). In silico methods have also

used a combination of the reported PPI data and biological characteristics in an attempt

to reduce the noise found in putative interaction data (Deane et al., 2002).

Experimental data and computational predictions complement each other to form our

knowledge of the true interactome. However, it is possible that the interpretation of the

interactome using computational methods is biased by prior knowledge and assumptions.

Cross-species interactions

A selection of promising prediction methods have been used to infer interactions across

different species (Wojcik et al., 2002; Li et al., 2004; Bork et al., 2004). These aim

to transfer knowledge of interactions from a model organism to another organism. For

example, if proteins A and B have been reported to interact in one species, and if orthologous

(see Definition 1.9) proteins, A′ and B′, can be found in a different species, then the

interaction is assigned to the second species provided certain conditions are met (Gertz

et al., 2003; Albert and Albert, 2004). This is clearly a sensible starting point, but limitations

are also evident: unreliable interaction data will be propagated across species and it

may be difficult to reconcile conflicting data.
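
The transfer of interactions between species described above can be sketched as follows. This is a simplified illustration only: the function name and the orthology mapping are hypothetical, and the sole condition checked is that both partners have an ortholog, whereas the cited methods impose further conditions that are not reproduced here.

```python
def transfer_interactions(source_ppis, orthologs):
    """Map interactions from a source species onto a target species.

    source_ppis: iterable of (A, B) protein pairs reported in the source species.
    orthologs:   dict mapping a source protein to its ortholog in the target
                 species; proteins without a known ortholog are absent.
    """
    predicted = set()
    for a, b in source_ppis:
        # Assumption: an interaction is transferred only if both partners map.
        if a in orthologs and b in orthologs:
            a2, b2 = orthologs[a], orthologs[b]
            if a2 != b2:  # keep the predicted graph simple (no self-interactions)
                predicted.add(tuple(sorted((a2, b2))))
    return predicted


source_ppis = [("A", "B"), ("B", "C"), ("C", "D")]
orthologs = {"A": "A'", "B": "B'", "C": "C'"}  # D has no known ortholog

predicted = transfer_interactions(source_ppis, orthologs)
print(sorted(predicted))  # [("A'", "B'"), ("B'", "C'")]
```

Any false-positive pair in `source_ppis` is carried over unchanged, which is the propagation of unreliable data noted in the text.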

Biological trait based inference

A reference set of interactions can be used to assign belief, or confidence, to potentially

interacting protein pairs. The traits of the reference set, such as sequence data or expression

profiles, can predict which protein pairs are more likely to be in the true interaction

graph (Bader et al., 2004; Ben-Hur and Noble, 2005). Hypothesis testing can be used to

see whether a particular trait is correlated with observed PPI or PIN data (Agrafioti et al.,

2005). If traits are being assessed against a PIN the graph structure (or topology) may

be important, which may be captured using a probabilistic graph ensemble. However,

the choice of model used for comparison subtly, and possibly significantly, affects the

hypothesis tested (Thorne and Stumpf, 2007). Differences between graph ensembles, and

their effects on inferences about PIN data, are explored in Chapter 3.

1.5 Graph theory

This section details graph theory as used in Chapters 2-4. The inter-relationships of proteins

found in an organism’s proteome are studied in this thesis. In order to study these

interactions, the experimental data are represented as a set of binary interactions between

distinct proteins, forming a graph, G. In this thesis, the terms ‘network’ and ‘graph’ are

used interchangeably.

Graphs are used to represent the PINs to enable possible understanding of the evolution

and structure of the interactome. However, each individual interaction may only occur

under specific circumstances and at particular times within the cell cycle. The interactions

all occur on a set of proteins, $V$, which forms the proteome of interest. The aim is to

find which of the possible protein pairs, $\{(v_i, v_j) : i < j,\ v_i \neq v_j \in V\}$, are

interactions and to analyse this set.

An undirected graph, G ∼ (V, E), consists of a set of nodes, V , together with a set of

edges, E. Each edge, e ∈ E, is an unordered pair of nodes found in V. The degree of

each node is equal to the number of edges that connect to it, see Definition 1.20. A graph,

G, has order |V | and size |E|. Figure 1.7 shows a graph with order 9 and size 10.

Figure 1.7: A graph. The red circles are the nodes, the set V , whilst the cyan links between nodes are

the edges, E, of G ∼ (V, E).

Definition 1.18 (Graph) A graph, $G \sim (V, E)$, is a set of nodes $V = \{v_1, \ldots, v_n\}$ and a set of edges $E = \{e_1, \ldots, e_m\} \subseteq \{(v_i, v_j) : i < j,\ v_i \neq v_j \in V\}$.

Definition 1.19 (Subgraph) A graph, $H \sim (V_H, E_H)$, is a subgraph of $G \sim (V_G, E_G)$, written $H \subseteq G$, if and only if $V_H \subseteq V_G$ and $E_H \subseteq E_G$.

Definition 1.20 (Degree) The degree, $d(v_i)$, of node $v_i \in V$ is the number of nodes, $v_j \in V$, such that $(v_i, v_j) \in E$,

\[ d(v_i) = \sum_{j=1}^{n} I\left((v_i, v_j) \in E\right), \]

where

\[ I\left((v_i, v_j) \in E\right) = \begin{cases} 0 & \text{if } (v_i, v_j) \notin E, \\ 1 & \text{if } (v_i, v_j) \in E. \end{cases} \]

The graphs considered here are simple, having no self-interactions, i.e. $(v_i, v_i)$, or edge

directions. A directed graph has a direction on each edge $e = (v_i, v_j)$, leading to each

node having both an in-degree and an out-degree. The graph can also be labelled: each element,

$v \in V$ or $e \in E$, is then associated with some label, $\phi(v)$ or $\phi(e)$.

A simple graph G can be represented as an upper-triangular binary matrix, A. This adjacency

matrix, A, is an n × n matrix detailing the edges found in the graph.

Definition 1.21 (Adjacency matrix) An adjacency matrix, $A$, is an $n \times n$ upper triangular matrix where the entry $a_{i,j}$ denotes whether there is an edge between the nodes $v_i$ and $v_j$ for $i < j$,

\[ a_{i,j} = I\left((v_i, v_j) \in E\right). \tag{1.2} \]
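
As a small illustration of Definitions 1.20 and 1.21, the sketch below builds an upper-triangular adjacency matrix from an edge list and recovers the node degrees from it. The function names and the toy edge list are illustrative assumptions, and nodes are indexed from 0 rather than 1 purely for convenience in Python.

```python
def adjacency_matrix(n, edges):
    """Upper-triangular adjacency matrix: a[i][j] = 1 iff (v_i, v_j) is an edge, i < j."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("simple graphs have no self-interactions")
        i, j = min(i, j), max(i, j)  # store each unordered pair once, above the diagonal
        a[i][j] = 1
    return a


def degrees(a):
    """Degree of node i: entries in row i plus entries in column i."""
    n = len(a)
    return [sum(a[i]) + sum(a[k][i] for k in range(n)) for i in range(n)]


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
A = adjacency_matrix(4, edges)

print(A)           # [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
print(degrees(A))  # [2, 2, 3, 1]
```

Because only the upper triangle is stored, the degree of a node needs both its row and its column; a symmetric matrix would need the row alone.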

1.5.1 Graph properties

The statistical properties of graphs, from analyses of individual nodes to measurements

across the complete graph, G, motivate the studies throughout this thesis. The properties

measured are divided into two distinct types: graph structural characteristics that can be

found from the adjacency matrix; and biological characteristics that require edge or node

labels. Each characteristic is referred to as a trait when it is a global property of the

graph. For example, the average degree is a network trait whilst the average level of

co-expression of interacting proteins across the whole graph is a biological trait.

Network traits

Network traits of a graph are defined as those statistical properties that can be derived

solely from the adjacency matrix, A. These are topological characteristics of the graph.

Some of the important properties, which also have relevance to PINs, are now introduced.

The network trait, $\Pi_\alpha(G)$, for a characteristic $\alpha(v_i)$ (or $\alpha(e_i)$) is defined as the arithmetic

mean of the characteristic across the graph.

Definition 1.22 (Degree sequence) The degree sequence of the graph $G$ is the list of

node degrees from Definition 1.20, $[d(v_1), \ldots, d(v_n)]$, for all nodes $v_i \in V$.

The distribution of degrees found in a graph is used to summarise a graph’s structure.

A graph where each node has the same number of edges may have different statistical

properties (biological or topological) than another graph that has the same size and order

but where most of the edges are found between only a subset of the nodes. Experimentally

determined PINs have been found to contain a set of nodes, hubs, that have a very high

degree (He and Zhang, 2006). Hub enriched biological networks have encouraged the

development of evolutionary models (Stumpf et al., 2007) that attempt to explain how

these graphs have been created.

The clustering coefficient summarises properties of graph nodes and can help the analysis

of graph motifs (recurring subgraphs found in the data). PINs, as well as other physical

networks, have been observed with a small proportion of the possible edges, a sparse

graph, together with a relatively high clustering coefficient for each node (Barabasi and

Albert, 1999; Carlson and Doyle, 1999; Yook et al., 2002).

Definition 1.23 (Neighbours) The neighbours, $N(v_i)$, of a node $v_i$ are those nodes $v_j$

that are connected to the node of interest: $N(v_i) = \{v_j \in V : (v_i, v_j) \in E\}$.

Definition 1.24 (Clustering coefficient) The clustering coefficient, $C(v)$, of a node $v \in V$ is the proportion of pairs of its neighbours that are themselves neighbours,

\[ C(v) = \frac{\sum_{v_i, v_j \in N(v)} I\left((v_i, v_j) \in E\right)}{\binom{d(v)}{2}}. \]

Nodes found in a clique are likely to have higher clustering coefficients than other nodes,

and regions of the graph where there is high proportion of the possible edges are also

referred to as highly clustered subgraphs.
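
The clustering coefficient of Definition 1.24 can be computed directly from neighbour sets, as in the following sketch. The helper names are illustrative, and returning 0 for nodes with fewer than two neighbours is a convention assumed here, since the definition leaves that case undefined.

```python
from itertools import combinations


def neighbours(edges):
    """Neighbour sets N(v) for every node that appears in the edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def clustering_coefficient(v, nbrs):
    """Fraction of pairs of neighbours of v that are themselves connected."""
    k = len(nbrs.get(v, ()))
    if k < 2:
        return 0.0  # assumed convention: no neighbour pairs, so report 0
    links = sum(1 for a, b in combinations(nbrs[v], 2) if b in nbrs[a])
    return links / (k * (k - 1) / 2)


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
nbrs = neighbours(edges)

print(clustering_coefficient(0, nbrs))  # 1.0: its two neighbours are connected
print(clustering_coefficient(2, nbrs))  # one of three neighbour pairs is connected
```

Node 0 sits in a triangle and so scores 1, while node 2 has three neighbours of which only one pair interacts.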

Definition 1.25 (Complete graph) A complete graph, $K_{|V|} \sim (V, E)$, is a graph where

$E$ contains all possible edges, $(v_i, v_j)$, between nodes in $V$.

Definition 1.26 (Clique) A subgraph, $H \sim (V_c, E_c)$, of $G$ is a clique if $V_c \subseteq V$ and $H$ is

complete, $H = K_{|V_c|}$.

Two nodes are connected if there is a path between them, and the graph is connected if

a path can be found between every pair of nodes. Measuring the distance between nodes provides

information about a graph’s connectedness. Examining the different paths between two

nodes, $v_i$ and $v_j$, enables an assessment of the structural stability or robustness of the

graph.

Definition 1.27 (Path) A path in a graph, $G \sim (V, E)$, is a sequence of distinct nodes, $v_i \in V$, such that from each of the nodes there is an edge to the next node, $v_{i+1}$, in the sequence. A path, $P(v_1, v_p)$, is a sequence, $(v_1, v_2, \ldots, v_p)$, such that:

\[ (v_i, v_{i+1}) \in E \quad \forall\, i \in [1, p-1]. \]

Definition 1.28 (Distance) The distance, $\mathrm{dis}(v_i, v_j)$, between two different nodes, $v_i, v_j \in V$, $i \neq j$, is the length of the minimal path, $\mathrm{dis}(v_i, v_j) = \min |P(v_i, v_j)|$, that exists between

the two nodes.
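
For an unweighted graph, the distance of Definition 1.28 can be found with a breadth-first search. The sketch below is illustrative: it assumes that path length is counted in edges (the usual convention, which the definition does not state explicitly) and that the graph is supplied as a dictionary of neighbour sets.

```python
from collections import deque


def distance(adj, source, target):
    """Number of edges on a shortest path, or None if no path exists."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == target:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the two nodes lie in different connected components


# A path 0-1-2-3 plus an isolated node 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}, 4: set()}

print(distance(adj, 0, 3))  # 3
print(distance(adj, 0, 4))  # None
```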

The shortest path lengths found between nodes have also been used as a measure of centrality

for graphs. For PINs, the average path length is small relative to the number of

nodes. The distance between any pair of nodes of the giant connected component (GCC)

is small (Dorogovtsev and Mendes, 2001; Watts, 2004).

Definition 1.29 (Giant connected component) The giant connected component (GCC)

of a graph is the largest connected subgraph.
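
The giant connected component of Definition 1.29 can be extracted by exploring each component in turn and keeping the largest. This is a minimal sketch using a depth-first traversal over a dictionary of neighbour sets; the representation and function name are assumptions made for illustration.

```python
def giant_connected_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # Explore the whole component containing `start`.
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in component:
                    component.add(nxt)
                    stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Three components: {0, 1, 2}, {3, 4} and the isolated node 5.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}, 5: set()}

print(sorted(giant_connected_component(adj)))  # [0, 1, 2]
```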

Definition 1.30 (Robustness) The ability of a system to respond to either external or

internal changes whilst maintaining consistent behaviour.

The robustness of a graph can be measured in terms of its topological robustness by

monitoring the effect of small perturbations on each characteristic (Barabasi and Oltvai,

2004). If, for instance, deleting random edges or nodes from the graph has little effect

on the average distance between nodes, then the distance over the graph can be said to be

robust.

Definition 1.31 (Motif) A motif of a graph is any local, recurring subgraph. For instance,

a ‘triangle’ motif is the complete graph on 3 nodes.

Motifs, or graphlets, refer to small subgraphs of a large graph (Shen-Orr et al., 2002; Milo

et al., 2002; Przulj, 2007). Counting the different motifs, for example the subgraphs that

can occur on 3 nodes, has been used to differentiate between different random graph models

and empirical data. The definitions, however, are not always consistent and it is difficult

to compare subgraphs with different orders. Differences in the counting techniques

have plagued some of the early work on the significance of particular motif features. This

has made it difficult to compare the statistics to expectations or to each other (Kashtan

et al., 2004; Konagurthu and Lesk, 2008). However, analytical results have been obtained

regarding the ability to test the significance of the number of individual motifs found in a

PIN given a random graph model (Picard et al., 2008).
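
As a concrete instance of motif counting, the triangle motif of Definition 1.31 can be counted from neighbour sets. The sketch below is one simple counting scheme, not the method of any cited work: each triangle is found once from each of its three edges, so the total is divided by three.

```python
def count_triangles(edges):
    """Count triangles (complete subgraphs on 3 nodes) in an undirected simple graph."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    unique_edges = {tuple(sorted(e)) for e in edges}
    # A common neighbour of both endpoints closes a triangle on that edge.
    closed = sum(len(nbrs[u] & nbrs[v]) for u, v in unique_edges)
    return closed // 3


edges = [(0, 1), (0, 2), (1, 2), (2, 3), (1, 3)]
print(count_triangles(edges))  # 2: the triangles 0-1-2 and 1-2-3
```

Making the counting convention explicit in this way matters because, as noted above, differences in counting techniques make motif statistics hard to compare.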

Empirical graphs have been shown to contain various correlations between neighbouring

nodes – for example, nodes of similar degree may also be neighbours. Assortativity

measures correlations between node properties. The assortativity of degree has been of interest

for physical networks (Maslov and Sneppen, 2002; Vázquez et al., 2002; Newman,

2003). For example, in social networks, the probability of each interaction is not independent,

as friendship groups make up highly connected regions, affecting the graph’s

structure (Newman and Park, 2003).

Definition 1.32 (Graph assortativity) Assortativity details the correlation between properties of neighbouring nodes

of a graph, e.g. the correlation of the connectivity of neighbours.
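
One common way to quantify the degree assortativity mentioned in Definition 1.32 is the Pearson correlation between the degrees at the two ends of each edge. The sketch below takes that approach as an assumption; the definition above does not fix a particular statistic, and the function name is illustrative.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at either end of each edge.

    Returns None when every node has the same degree, because the
    correlation is then undefined.
    """
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # Each undirected edge contributes both orderings so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    if n == 0:
        return None
    mean = sum(xs) / n  # xs and ys hold the same values, so they share a mean
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var > 0 else None


star = [(0, 1), (0, 2), (0, 3)]
print(degree_assortativity(star))  # -1.0: the hub only links to degree-1 nodes
```

A negative value, as for the star graph, indicates that high-degree nodes tend to connect to low-degree nodes.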

Each characteristic, for a node or edge, can be averaged over the graph under consideration.

This ensemble average can then be compared with other graphs to observe differences,

e.g. the difference in the average degrees of graphs G and H.

Biological traits

The biological properties of a graph are described by biological traits, $\Delta(G, \Phi)$. These are

assessed using both the graph, $G$, and a characteristic $\Phi = \{\phi_1, \ldots, \phi_m\}$ of the graph’s

edges (or nodes). Unlike network traits, a biological trait will not necessarily be invariant

under permutation of node labels.

Biological traits can include information on the properties or sequence composition of the

proteins as well as measurements of protein activity such as abundance measurements.

When the biological characteristic of interest, $\beta(\cdot)$, is associated to each node, $v_i \in V$, a

function $f : \beta(V) \times \beta(V) \to \mathbb{R}$ is used to find a characteristic, $\Phi$, for the edges.
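
The step from a node characteristic to an edge characteristic can be illustrated as follows. Everything specific in this sketch is an assumption for the example: the abundance values are invented, the absolute difference is just one possible choice of the function f, and averaging the edge values into a single trait mirrors the arithmetic mean used for network traits.

```python
def edge_characteristic(edges, beta, f):
    """Apply f to the node characteristics at either end of each edge."""
    return {(u, v): f(beta[u], beta[v]) for u, v in edges}


def biological_trait(edges, beta, f):
    """Average the edge characteristic over all edges of the graph."""
    values = list(edge_characteristic(edges, beta, f).values())
    return sum(values) / len(values)


# Hypothetical node characteristic: a protein abundance measurement.
beta = {"A": 10.0, "B": 14.0, "C": 2.0}
edges = [("A", "B"), ("B", "C")]


def abs_difference(x, y):
    return abs(x - y)


phi = edge_characteristic(edges, beta, abs_difference)
print(phi)                                            # {('A', 'B'): 4.0, ('B', 'C'): 12.0}
print(biological_trait(edges, beta, abs_difference))  # 8.0
```

Relabelling the nodes changes which abundance values meet across each edge, which is why such a trait need not be invariant under permutation of node labels.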

Biological characteristics have been proposed as being linked with PPIs (Valencia and

Pazos, 2002; Bhardwaj and Lu, 2005; Thorne and Stumpf, 2007), and as a means of classifying

PPIs (Salwinski and Eisenberg, 2003; Yu and Fotouhi, 2006; Skrabanek et al.,

2008; Ramani et al., 2008). These classification methods use various biological characteristics

to predict interactions (Ben-Hur and Noble, 2005; Srinivasan et al., 2007) or

confer additional support for observed PPIs (Bader et al., 2004; Shen et al., 2007).

Bader et al. (2004) developed a quantitative method for evaluating the biological relevance

of PPIs from large scale experimental data. Information from other sources, such as

mRNA expression, genetic interactions and other biological annotations, was compared

with PPIs. These characteristics are used to assign levels of confidence to the putative

PPIs to enable the generation of more reliable interaction datasets.

Genes that produce proteins which interact are believed to be found in close physical

proximity on the genome (Overbeek et al., 1999; Skrabanek et al., 2008). These interacting

proteins have shown correlated functional and process annotations across many

organisms and genomes. Classification methods have used the cellular location of proteins

to define protein pairs that do not interact (Jansen et al., 2003; Jansen and Gerstein,

2004).

Protein complexes have been shown to be linked to the function of proteins (Krogan et al.,

2006), and are linked to the set of PPIs. In order to clearly differentiate between a complex

and a PPI, the former is defined here as involving more than 2 different proteins. Either

way, complexes must be linked with PPIs as each is formed by a collection of individual

interactions.

Definition 1.33 (Protein complex) A protein complex is a bound protein structure containing

more than two proteins.

Correlated mRNA expression patterns of proteins have been used to infer function across

species (Eisen et al., 1998; Marcotte et al., 1999; Stuart et al., 2003). Transcriptional

mRNA levels have also been used to determine the age of physical PPIs and to infer physical

binary interactions (Deane et al., 2002; Jansen et al., 2003). Ramani et al. (2008) compared

human mRNA co-expression patterns between orthologous genes in other species,

to demonstrate the ability of mRNA data to identify proteins that are found in the same

protein complexes and PPIs.

Gene Ontology

Biological traits are annotated within databases through the use of frameworks such as

Gene Ontology (GO) (Ashburner et al., 2000; Camon et al., 2004). GO is a structured

hierarchy for gene and protein annotations based on their involvement in: biological processes;

cellular components; or molecular functions. Each category forms a relational vocabulary

in which the terms are hierarchically related to one another, similar to the EC

nomenclature for enzymes.

GO is used to categorise, and compare, different proteins with the potential to determine

PPIs or improve the reliability of experimental predictions (Lin et al., 2004). GO slim,

a non-hierarchy based classification system based on the relational vocabulary, is used

throughout this thesis. This allows a broad overview of the ontology, enabling easier

comparison of terms without any requirement to define semantic distances between different

annotations.

For each category, known annotations relate to particular biological properties that have

been experimentally determined. For instance, for cellular components, a protein will

be annotated with each component in which it has been found (e.g. nucleus, plasma, or

ribosome).

These are a selection of characteristics that have been reported as being associated with interactions.

Accordingly, biological traits related to these findings are used to characterise

observed PIN data throughout the thesis.

1.5.2 Graph ensembles

Random graphs are used extensively throughout this thesis. A random graph is used here

to refer to a graph with a fixed number of nodes and edges, n and m respectively. Random

graphs are generated from various graph probability distributions. These distributions

produce graphs with characteristics and properties that match those found in an empirical

graph. Graphs sampled from a probability distribution are referred to as being drawn from

a particular graph ensemble.

Definition 1.34 (Graph ensemble) Each graph ensemble is a probability distribution

over the space of possible graphs, in general for a fixed number of nodes, n, and/or

edges, m.

The ensembles considered later represent the set of graphs that meet certain constraints.

The constraints used are biological and network traits that are believed to be relevant to

the generation of PPIs or PINs, as described in Section 1.5.1.

Random graph studies have developed over the last 50 years from the study of graphs with

a fixed probability that each edge is present (Erdös and Rényi, 1959) to the generation of

large scale graphs, with thousands of nodes, that exhibit similar properties to real-life

graphs (Watts and Strogatz, 1998). The development of generative methods for real-life

large scale graphs has accompanied the study of the evolution of biological networks

including PINs. This section describes some graph ensembles used in network research

and more recently for PIN analyses.

Erdös-Rényi graphs

Erdös and Rényi (1959) introduced the random graph. Starting with an empty graph

$G \sim (V, \emptyset)$ on $n$ nodes, $V = \{v_1, \ldots, v_n\}$, edges are added at random with the same

probability. The number of edges $m$ found in each random graph is fixed, so sampling

(without replacement) $m$ edges from the complete set of edges $\{(v_i, v_j) : v_i, v_j \in V,\ i < j\}$

determines the graph.

However, the Erdös-Rényi (ER) model of random graphs used throughout this thesis is

subtly different to the above description. A random graph is denoted by G (n, p), a graph

with n nodes and such that every possible edge occurs independently with probability p.

Definition 1.35 (Erdös-Rényi Graph) An ER graph is a graph, $G(n, p)$, on $n$ nodes such that

each edge, $(v_i, v_j)$, occurs independently with probability $p$, $P\left((v_i, v_j) \in E\right) = p$.
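
The two random graph constructions described in this subsection, a fixed number of edges m and an independent edge probability p, can be contrasted in a short sketch. The function names are illustrative, and a seeded generator is passed in only so that the example is repeatable.

```python
import random
from itertools import combinations


def er_gnm(n, m, rng):
    """Fixed-size variant: sample m edges without replacement from all pairs."""
    return rng.sample(list(combinations(range(n), 2)), m)


def er_gnp(n, p, rng):
    """G(n, p) variant: every possible edge is included independently with probability p."""
    return [pair for pair in combinations(range(n), 2) if rng.random() < p]


rng = random.Random(0)
fixed = er_gnm(10, 7, rng)
independent = er_gnp(10, 0.2, rng)

print(len(fixed))        # always 7
print(len(independent))  # varies between draws, with expectation 0.2 * 45 = 9
```

In the first construction the graph size is exact, whereas in G(n, p) it is a binomial random variable with mean p n(n-1)/2.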

Although these graphs can be used as random samples to compare to observed data, they

do not replicate some of the prominent properties of real-life data. The combination of

a small number of edges and high clustering coefficients, as seen in empirical systems

(Takemoto and Oosawa, 2005), is not well represented by ER random graphs.

Small-world graphs

Small-world graphs (Dorogovtsev and Mendes, 2001; Watts, 2004) describe networks

where nodes can be reached from each other by traversing a small number of edges, so

the average path length is small. The path length between two nodes, $v_i, v_j \in V$, $i \neq j$,

is $\mathrm{dis}(v_i, v_j) \leq \log(n) = \log(|V|)$. A further typical property of small-world models is

that the graphs are sparse. An average node has a small number of neighbours, and the

graph size |E|


1.5. GRAPH THEORY Introduction

in the range 2 < γ < 3. Many empirical networks have been observed to be scale-free in

addition to having small-world features (Li et al., 2007), although others have not (Small et al., 2007, 2008).

Empirical PINs have been shown to have both small-world and scale-free properties (Almaas,

2007).

Definition 1.36 (Scale-free graph) A scale-free graph $G$ on $n$ nodes is such that the degree

distribution follows a power-law distribution, $P(d(v_i) = k) \sim k^{-\gamma}$.

The networks of the internet (Carlson and Doyle, 1999), power grids (Watts and Strogatz,

1998), and latterly PPIs (Stumpf and Wiuf, 2005) have been found to have a small number

of hubs along with a degree distribution similar to the power-law distribution. In biological

systems, Barabasi and Albert (1999) also observed that empirical networks exhibited

an abundance of hubs and claimed that the degree distribution of the PIN is best described

by the same power-law degree model.

The role of hubs and cliques in PINs has generated much interest (He and Zhang, 2006;

Batada et al., 2006a,b; Kim et al., 2008). Their ability to confer apparent robustness has

also been studied extensively (Barabasi and Oltvai, 2004; Wagner, 2005). Hub enriched

graphs are robust (see Definition 1.30) to random deletion of edges (Albert and Barabasi,

2000; Yook et al., 2002), as these perturbations do not greatly affect the average path

length between nodes. However, removal of particular hubs can drastically alter the average

path length across the graphs, hence these graphs are referred to as ‘robust yet fragile’

(Wagner, 2005; Doyle et al., 2005).

The power-law distribution has been used as a diagnostic test for the degree sequences

of PIN data (Reguly et al., 2006). However, it has been shown that the scaling properties

of the observed degree distribution are not best approximated by this simple probability

distribution (Tanaka et al., 2005).

Stumpf et al. (2005a) used a likelihood based approach (see Section A.1 in Appendix A)

to assess how best to model the degree distribution for empirical PIN data. For observed

data, the likelihood analysis of the degree distribution can allow an interpretation, over

those assessed, of the most likely generation model. The authors showed that simple

power-law models do not provide the best description of the observed data from S. cerevisiae,

where discretised log-normal distributions are seen to be a better fit. The conflicting

reports found in the literature suggest that using these simple probability models for

the degree distribution may be a flawed means of modelling the empirical data.

Biological graphs

Empirical graphs have also generated interest in further models that fix network traits

such as the degree sequence, clustering coefficients and other local structure (Barabasi

and Albert, 1999; Park and Newman, 2004; Stumpf et al., 2007; Kim and Marcotte, 2008).

Several different types of graph model have been used to model PINs including: graph

ensembles that generate scale-free degree sequences (Aiello et al., 2000; Gkantsidis et al.,

2003; Li et al., 2005); exponential random graph models (ERGMs) that have been used

in social network research (Pattison and Wasserman, 1999; Robins et al., 2007); mixture

models that use ER random graphs (Daudin et al., 2007); and geometric random graphs

(Higham et al., 2008) (described in Section A.2 in Appendix A).

Graph models have also been developed that use evolutionary motivation (Aiello et al.,

2000), starting from a graph with two nodes by duplicating nodes and preferentially attaching

new edges to highly connected nodes. Duplication-divergence graphs model the

process of gene duplication: each new node added is identical to an existing node and

then small changes are made to reflect evolutionary divergence (Gkantsidis et al., 2003).

These aim both to model the observed PIN data and to improve knowledge of how

the networks may have evolved. ERGMs produce graphs that have the same expected

trait statistics as empirical network data (Pattison and Wasserman, 1999), enabling an

assessment regarding the types of graphs generated with given traits. These techniques

can be used to test the influence of traits on the graph or when assessing the significance

of observed features against what would be expected in a random graph with the same

properties.
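
The duplication-divergence idea described above can be sketched as a simple growth process. This is a deliberately minimal illustration and not the model of any cited paper: the seed graph of two connected nodes, the uniform choice of the node to duplicate, and the single parameter `delta` (the probability that an inherited edge is lost) are all assumptions made here.

```python
import random


def duplication_divergence(n, delta, rng):
    """Grow a graph to n nodes by duplication followed by divergence.

    Returns a dict mapping each node to its set of neighbours.
    """
    nbrs = {0: {1}, 1: {0}}  # assumed seed graph: two connected nodes
    while len(nbrs) < n:
        new = len(nbrs)
        parent = rng.randrange(new)
        # Duplication: the new node inherits every interaction of its parent.
        # Divergence: each inherited edge is then lost with probability delta.
        kept = {v for v in nbrs[parent] if rng.random() >= delta}
        nbrs[new] = kept
        for v in kept:
            nbrs[v].add(new)
    return nbrs


graph = duplication_divergence(50, 0.3, random.Random(1))
print(len(graph))  # 50 nodes
```

With `delta = 0` every copy is exact, and with `delta = 1` every new node is isolated; intermediate values give graphs whose traits can be compared with those of an observed PIN.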

1.6 Noise in interactions

It is essential to appreciate how accurately empirical data reflect the true interactome. For

this reason it is of crucial importance to have an understanding of noise found in PIN data

when performing any global PIN analyses. Noise may be stochastic, may arise from systematic

experimental error, or may be related to biological properties of the proteins being tested (interactions

may be transient or condition dependent). The focus in Chapter 5 is on how the

amount of stochastic error can be measured. Error definitions and literature concerning

estimations of the PIN size and PPI data error rates are discussed in this section.

Determining whether two proteins interact is hard to achieve. Many issues with experimental

data exist including: biases or systematic errors from experimental techniques

(Aloy and Russell, 2002; Chiang et al., 2007); how to use binding affinities to infer interactions

(Aloy and Russell, 2006); and basic uncertainties regarding our understanding of

the regulation system of the cell.

Interactions may only occur under specific conditions, or experimental techniques may be

inaccurate, producing sets of data where interactions cannot be known for certain. Similarly,

these experimental techniques may only be able to produce a subset of the true

interactions, making it difficult to define absolutely proteins that do not interact. Prediction

algorithms that require reference sets of both interacting and non-interacting protein pairs

need such non-interacting pairs, as well as true interactions, to work effectively

(Ben-Hur and Noble, 2005).

PINs are represented using uncertain data to produce static empirical graphs. The edges,

interactions, of the graph either exist (1) or do not exist (0). Although this may be correct

for the actual protein interaction network, over all environments, our representation is

based on the collation of putative interactions, as opposed to an analysis of a subgraph of

the true graph. Theoretically, the interactome sought is the set of all interactions that can

occur under any conditions in vivo or in vitro for a fixed collection of proteins (and their

possible genotypes).

Measurement error is often ignored in graph based studies of these systems (for example

in Schwikowski et al. (2000) and Barabasi and Oltvai (2004)). Biological traits can aid in

minimising the number of false reported interactions considered in the graphs: by pruning

putative interaction data according to some known biological information (Deane et al.,

2002), or using interaction set overlaps to find errors in each dataset in contrast to known


interactions (D’haeseleer and Church, 2004). The choice of known interactions, however,

will undoubtedly be biased until the complete true interactome is discovered.

Chapters 3-5 study experimental data in order to interpret empirical graphs and also to

develop a model to find the size of the true PIN, which forms the interactome of interest

throughout the thesis.

1.6.1 Sampling notation

The set of nodes, $V$, for the PINs considered is a proteome (see Definition 1.3). Let the complete set of protein pairs be $E_\Omega$, forming the edge set for a graph of the interaction sample space, $\Omega$. This graph is used to assess the coverage (tested protein pairs) and error rates of experimental data.

Definition 1.37 (Interaction sample space) The interaction sample space, $\Omega$, is the complete graph, excluding loops, between node pairs found in the proteome $V$ (so $\Omega \sim K_{|V|}$).

\[ \Omega \sim (V, E_\Omega), \quad \text{where } E_\Omega = \{(v_i, v_j) : v_i, v_j \in V,\ v_i < v_j\}. \]

The interaction graph, $G$, is the protein interaction network or interactome, as introduced in Definition 1.5. Any pair of nodes, $v_i, v_j \in V$, is either in the true (unknown) interactome $G$, or not.

Definition 1.38 (Interaction) An interaction is a pair of proteins $(v_i, v_j)$ that is in the true interaction graph, $G$.

Definition 1.39 (False interaction) A false interaction is a pair of proteins $(v_i, v_j)$ that is not in the true interaction graph, $G$.

Let any edge, for example $(v_i, v_j)$, be either an interaction or a false interaction, based on whether it is found in the true interaction graph, $G$. Let the set of false interactions, on the node set $V$, be defined as the false interaction graph, $G'$. The union of these graphs is the interaction sample space, $G \cup G' = \Omega$.


Definition 1.40 (Interaction graph) The interaction graph, $G$, has an edge between each pair of proteins that interact in the considered interactome.

\[ G \sim (V, E), \quad \text{where } (v_i, v_j) \in E \iff v_i \text{ and } v_j \text{ bind}. \]

Definition 1.41 (False interaction graph) The false interaction graph, $G'$, has edges between all pairs of proteins in the proteome that do not interact.

\[ G' \sim (V, E'), \quad \text{where } E' = E_\Omega \setminus E. \]

The set of all edges, $E_\Omega$, is needed in order to estimate the proportion of possible protein pairs that have been tested, and to assess the error rates defined in Definitions 1.42-1.45.
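The sampling notation can be illustrated on a toy proteome. The following is a minimal sketch, assuming a hypothetical four-protein node set and a hypothetical true edge set (in practice $E$ is unknown); the names `V`, `E_omega`, `E` and `E_false` are chosen here for illustration only.

```python
from itertools import combinations

# Hypothetical toy proteome; real proteomes contain thousands of proteins.
V = ["P1", "P2", "P3", "P4"]

# E_Omega: every unordered pair of distinct proteins (complete graph, no loops).
E_omega = {frozenset(p) for p in combinations(V, 2)}

# Hypothetical true interactome edges E (unknown in practice).
E = {frozenset(("P1", "P2")), frozenset(("P2", "P3"))}

# False interaction edges E' = E_Omega \ E.
E_false = E_omega - E

n = len(V)
assert len(E_omega) == n * (n - 1) // 2   # |E_Omega| = C(|V|, 2)
assert E | E_false == E_omega             # G and G' together cover Omega
assert not (E & E_false)                  # and they share no edges
```

Using `frozenset` for each pair makes the edges unordered, which matches the convention that each pair of proteins appears once in $E_\Omega$.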

1.6.2 Error rates

Any experiment, $P$, may be represented as a graph. The reported interaction dataset, $E_P$, forms the edges of the graph, $G_P$, whilst the nodes, $V_P$, are the set of proteins that appear in the interaction data. The experiment size is defined as the number of edges, $|E_P|$, that are in $G_P$. Given uncertainty in experimental interaction data the edges found in the graph may be members of $E$ or $E'$.

Two further pieces of information are required to measure the error rates for a particular experiment: the complementary set of protein pairs that were tested but gave negative results, and the true interactome ($G$). The latter is unknown whilst the negative results are not generally explicitly reported. This makes it difficult to assess the error rates. The set of tested interactions, those that have been considered in the experiment, $P$, could be any superset of $E_P$. Chapter 5 assumes that all protein pairs from $V_P$ are tested in any experiment $P$, a technique known as node sampling (Lee et al., 2006) (see Appendix E). The graph $G$, or a representative reference set sampled from the interactome, is also required. The notation used for experimental error is now briefly outlined and presented in Table 1.2.


Definition 1.42 (False positives) The false-positive set of interactions, $FP$, contains reported protein pairs that are erroneously reported as true interactions (as these are in fact not interacting proteins). These interactions are actually found in $E'$. Denote by $p_{FP}$ the conditional probability that a reported interaction is a false-positive.

Definition 1.43 (True positives) The true-positive set of interactions, $TP$, contains interactions that are correctly reported (found in $E$) in an experiment. Denote by $p_{TP}$ the conditional probability that an interaction is a true-positive.

Definition 1.44 (False negatives) The false-negative set of interactions, $FN$, are those that are tested but incorrectly not reported in an experiment (in $E$). Denote by $p_{FN}$ the probability that an interaction is a false-negative.

Definition 1.45 (True negatives) The true-negative set of interactions, $TN$, are those that are tested but correctly not reported in an experiment (in $E'$). Denote by $p_{TN}$ the probability that an interaction is a true-negative.

                  Edge ∈ E    Edge ∉ E
Edge found        $p_{TP}$    $p_{FP}$
Edge not found    $p_{FN}$    $p_{TN}$

Table 1.2: Error rate notation. Four error characteristics required to study the available interactome data. Each $p_A$ is a conditional probability regarding whether an interaction is reported (edge found) or not (edge not found). The probabilities are conditional on whether the interaction is actually a true interaction, so whilst the columns sum to 1 the rows may not.

The false discovery rate (FDR) is directly associated with the size estimates for the interactome

model described in Chapter 5. The determination of the sets of true interactions

and false interactions, for a dataset, forms the key to estimating the test statistic FDR.

Throughout the thesis, when referring to PPI data, this is used as a summary statistic for

a dataset, rather than an expectation.

Definition 1.46 (False discovery rate) The false discovery rate for a dataset, P , is the

proportion of false-positive interactions found.

\[ FDR = \frac{|FP|}{|TP| + |FP|}. \]
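The definitions in this subsection can be made concrete with a small sketch. It assumes, purely for illustration, that the true edge set is known and that every pair of proteins appearing in the experiment was tested (the node sampling assumption); the helper names `pair` and `error_summary` and the toy edges are hypothetical.

```python
from itertools import combinations

def pair(a, b):
    """Unordered protein pair."""
    return frozenset((a, b))

def error_summary(reported, true_edges):
    """Confusion counts and FDR for one experiment under node sampling.

    reported   : set of unordered pairs reported by the experiment (E_P)
    true_edges : set of unordered pairs assumed to form the interactome E
    """
    nodes = sorted({v for e in reported for v in e})           # V_P
    tested = {pair(a, b) for a, b in combinations(nodes, 2)}   # all pairs on V_P
    tp = reported & true_edges
    fp = reported - true_edges
    fn = (tested - reported) & true_edges
    tn = (tested - reported) - true_edges
    fdr = len(fp) / (len(tp) + len(fp)) if reported else float("nan")
    return {"TP": len(tp), "FP": len(fp), "FN": len(fn), "TN": len(tn), "FDR": fdr}

# Toy example: three true interactions, three reported pairs.
true_edges = {pair("A", "B"), pair("B", "C"), pair("C", "D")}
reported = {pair("A", "B"), pair("A", "C"), pair("C", "D")}
summary = error_summary(reported, true_edges)
```

In this toy case one of the three reported pairs is not a true interaction, so the FDR is 1/3. With real data the true edge set would have to be replaced by a reference set, which brings in the coverage issues discussed later in this section.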


1.6.3 Error and size estimates

PPIs are observed in a variety of environments and using different methods. The experimental

techniques employed to determine interactions (discussed in Section 1.4 on

page 34) present differing amounts of noise. There is a need to reassess the reliability

of PIN data as more interaction studies are published. Studies interested in error rates

primarily focus on individual large experimental datasets (von Mering et al., 2002). However,

this thesis is interested in assessing the FDR found across a collection of studies,

rather than a single experiment.

Several studies have estimated the FDR and interactome size of S. cerevisiae. They use

similar methods to each other and the estimates can be dependent on each other. However,

they differ in their approach to the set of tested protein pairs and their use of different experimental

studies. A rapid accumulation of new PPI data in recent years (see Chapter 2)

has also affected the estimates. This section describes a collection of methods that have

been used to find the size of the S. cerevisiae physical interactome.

Overlapping interactions

D’haeseleer and Church (2004) presented an overlap method for estimating error rates

in PIN data sets. As a consequence of their methodology they are also able to estimate

the size of the S. cerevisiae PPI interactome. Their overlap method estimates the FDR

from 3 data sets: a reference set (taken from a reliable PPI source) along with two other

large experimental sets. Using the overlaps between all the sets, and assuming that these

overlaps are error free (both as a consequence of the validation, and also the assumption

that the reference is highly accurate) the ratio that should occur if the other sets are error

free can be found.

Figure 1.8: Overlap method. The overlap found between three different interaction datasets, two of which are being compared against a reference set. FDR is estimated using the ratios of the number of interactions, I-VI, found for the 3 datasets. A and B are experimental datasets whilst REFERENCE is a set of true interactions. The area separated by the dashed line represents estimated interaction noise, containing V and VI interactions from sets A and B respectively.

Figure 1.8 shows the overlap sets, between the 3 different datasets, that are used to find the false discovery rate. The FDR estimate is found from the ratios of the number of interactions, I-VI, found in each part of Figure 1.8. A and B are experimental datasets whilst the REFERENCE set is a set of true interactions. The area separated by the dashed line contains all the false interactions. The FDR is approximated by assuming that the data were obtained independently such that the ratio between I and II is equal to the ratio of III and IV. The sizes of V and VI can then be found that solve,

\[ \mathrm{IV} = \frac{\mathrm{III} \cdot \mathrm{II}}{\mathrm{I}}. \tag{1.3} \]

The solution to this equation may not be unique, and the size of VI has also been separately

estimated to help find a unique FDR (Deng et al., 2003).
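The arithmetic behind Equation 1.3 can be sketched as follows. The assignment of regions to set operations is an assumed reading of Figure 1.8 rather than a statement of the published method: I is taken as the part of the A-B overlap found in the reference, II as the rest of that overlap, III as reference interactions found in exactly one of A and B, and the remaining pairs found in exactly one of A and B as IV + V + VI. The function name `overlap_noise` and the integer stand-ins for protein pairs are hypothetical.

```python
def overlap_noise(A, B, reference):
    """Estimate IV and the combined noise |V| + |VI| from three sets of pairs.

    Assumed region labels (one reading of Figure 1.8):
      I    = A & B & reference       II = (A & B) - reference
      III  = (A ^ B) & reference     rest = (A ^ B) - reference = IV + V + VI
    """
    n_I = len(A & B & reference)
    n_II = len((A & B) - reference)
    n_III = len((A ^ B) & reference)
    n_rest = len((A ^ B) - reference)
    if n_I == 0:
        raise ValueError("region I is empty; the ratio I:II is undefined")
    n_IV = n_III * n_II / n_I          # Equation 1.3
    noise = max(n_rest - n_IV, 0.0)    # only the sum |V| + |VI| is identified
    return n_IV, noise

# Toy data: integers stand in for protein pairs.
A = set(range(0, 100))
B = set(range(80, 160))
reference = set(range(0, 20)) | set(range(90, 100)) | set(range(1000, 1050))
n_IV, noise = overlap_noise(A, B, reference)
```

Under this reading only the total noise is determined by the overlaps, which is consistent with the remark that the solution may not be unique unless the size of VI is estimated separately.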

Figure 1.9, an example of I-VI found in Figure 1.8 for reported interaction studies, shows

the overlaps found between PPI data for sets assessed by D’haeseleer and Church (2004).

The FDRs found using this method ranged from 0.46 to 0.90. Interactome size estimates

from the overlap found between pairs of studies were: Uetz et al. (2000) and Ito et al.

(2001) [8,535-10,127]; Ho et al. (2002) and Gavin et al. (2002) [7,257-25,440].

Grigoriev (2003) measured the overlap of interaction data for each protein to find the

number of interactions. The number of interactions for a protein from a particular experiment

are assumed to be binomially distributed, similarly to the assumption used for the

coupon collecting model described in Chapter 5.

[Figure 1.9 labels and region counts as extracted (the assignment of counts to regions is not recoverable): MIPS, Ito et al, Uetz et al; 1411, 47, 30, 54, 157, 4241, 706.]

Figure 1.9: High-throughput interaction overlap. This Venn diagram shows the overlap between two protein interaction studies and a reference set. The reference set is from MIPS (Güldener et al., 2006) whilst the interaction data are reported in Ito et al. (2001) and Uetz et al. (2000).

Suppose that for each protein, $A$, $a_A$ different interactions are found in one study and $b_A$ different interactions in a second study. The average overlap, $O$, between the interaction sets for every protein is used to eliminate noise and find the number of interactions, $n_A$, for protein $A$. Then $n_A$ for each individual protein, $A$, is found from,

\[ n_A = \frac{a_A b_A}{O}. \tag{1.4} \]

Grigoriev (2003) concluded that S. cerevisiae proteins have an average degree of between 3 and 5, leading to an estimate of between 16,000 and 26,000 distinct interactions in S. cerevisiae. These estimates excluded proteins thought to generate a number of false-positive interactions. The method can easily be extended to take account of known error rates as $a_A$ and $b_A$ can be adjusted for each protein, $A$, to reflect the occurrence of false interactions.
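Equation 1.4 has the same form as a simple capture-recapture estimate, and can be evaluated directly. The sketch below uses hypothetical per-protein counts and a hypothetical average overlap; the function name `estimated_degree` is introduced here for illustration and none of the numbers are taken from Grigoriev (2003).

```python
def estimated_degree(a, b, overlap):
    """Equation 1.4: n_A = a_A * b_A / O."""
    if overlap <= 0:
        raise ValueError("overlap must be positive")
    return a * b / overlap

# Hypothetical counts (a_A, b_A) for three proteins in two studies.
counts = {"P1": (4, 3), "P2": (2, 5), "P3": (6, 2)}
O = 2.5  # hypothetical average overlap between the two studies

degrees = {p: estimated_degree(a, b, O) for p, (a, b) in counts.items()}
mean_degree = sum(degrees.values()) / len(degrees)
```

A small overlap relative to $a_A b_A$ implies a large number of interactions yet to be observed, which is why false-positive reports (which inflate $a_A$ and $b_A$ without increasing the overlap) would bias the estimate upwards unless corrected.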

Protein coverage

The coverage of each experiment is of vital importance when considering error rates.

Each experimental technique has possibly different amounts of noise or may only be able

to test a subspace of the possible protein pairs. Experimental noise and coverage bias


have been considered in order to produce FDR and interactome size estimates (Chiang

et al., 2007; Huang et al., 2007; Gentleman and Huber, 2007). The probability of a false-positive,

or the ability to test certain protein pairs, may not be the same for each type of

experiment and this may influence the error results found using overlap methods.

Huang et al. (2007) used capture-recapture tests for yeast-two-hybrid experiments to estimate

the FDR, coverage and interactome size of S. cerevisiae. This method is similar to

an overlap study (see Figure 1.8) although the coverage is fixed as replicates are analysed

where possible. This indicated that between 15% and 27% of the yeast (and potentially

45% of worm and fly) data are misclassified as interactions.

Hart et al. (2006) focused on improving the methodology set out in Figure 1.8 (D’haeseleer

and Church, 2004). Early studies were found to understate the interactome size as each

dataset probed different sets of protein pairs. Using estimated error rates and intersection

of datasets they estimated which protein pairs have been tested. The estimated size is

then scaled to take account of this coverage. The S. cerevisiae PIN was found to have

38,000-76,000 interactions and current data only contained half of the true interactions

(Hart et al., 2006).

The need to assess the coverage of each study was highlighted by Gentleman and Huber

(2007). Direct comparison of interaction data fails to take account of the different protein

pairs tested in each study (illustrated in Figure 1.10). Each study does not test every

protein-pair in the whole set E Ω . Accordingly, unless this is explicitly taken into account,

the overlap between studies will be lower than expected, increasing the reported error rate.

Stumpf et al. (2008) assessed the size of the S. cerevisiae interactome whilst discussing the possible sizes

of various interactomes including H. sapiens. The proportion of proteins tested in each

study and an assumed model for the true interaction graph, G, are used to find the size of

the interactome. The authors estimated that the S. cerevisiae interactome size is 24,000-

26,000. The study showed the potential differences in interactome size in comparison to

proteome sizes. Although H. sapiens has 20% more genes than Caenorhabditis elegans

it has over twice as many PPIs – over 500,000 in contrast to approximately 250,000.

False-negative interactions

Figure 1.10: Sample space overlap. When using overlap in PPI experiments, the underlying space that is being probed is of crucial importance to the final estimate. Many maps suggest they are globally sampling the protein space, but in reality, as a result of experimental limitations, are observing only a subset of the possible interaction pairs. Each black circle represents a protein, whilst I, II, IV show the spaces tested by each method. III is the space that should be used as a reference set of proteins when the overlap in interactions between studies I and II is computed.

The rate of false-positive and false-negative interactions has also been estimated using repeated interaction data from two studies. Chiang et al. (2007) performed error analysis using: the number of repeated reported interactions ($X$); the number of tested protein pairs for which no interaction is reported in either dataset ($Y$); and the number of non-repeated reported interactions ($Z$). The error rates, for a pair of datasets, were found using the following relationships:

\[
\begin{aligned}
E(X) &= m(1 - p_{FN})^2 + n p_{FP}^2, \\
E(Y) &= m p_{FN}^2 + n(1 - p_{FP})^2, \\
E(Z) &= 2m p_{FN}(1 - p_{FN}) + 2n p_{FP}(1 - p_{FP}),
\end{aligned}
\tag{1.5}
\]

where $m$ is the number of true interactions (observable interactome size) and $n$ is the number of false interactions. If we know the protein set, $V$, tested against each other then a further condition is

\[ \binom{|V|}{2} = m + n. \]

Equations 1.5 reveal the trade-off between $p_{FP}$ and $p_{FN}$ inherent in two-dataset comparisons. If a reference set is used to find the FDR in experimental data then careful consideration of its coverage and false-negative rate is required for accurate results. The reference set should not be assumed to be just a random subset of the true interactome.
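The expected counts for a two-dataset comparison can be checked numerically. The sketch below assumes the symmetric form in which every tested pair is reported by both tests, by neither, or by exactly one, so that the three expectations sum to $m + n$; the function name `expected_counts` and the parameter values are hypothetical.

```python
from math import comb

def expected_counts(m, n, p_fp, p_fn):
    """Expected X, Y, Z for two independent tests of every protein pair.

    m: number of true interactions, n: number of false interactions.
    """
    ex = m * (1 - p_fn) ** 2 + n * p_fp ** 2                     # reported twice
    ey = m * p_fn ** 2 + n * (1 - p_fp) ** 2                     # reported by neither
    ez = 2 * m * p_fn * (1 - p_fn) + 2 * n * p_fp * (1 - p_fp)   # reported once
    return ex, ey, ez

# Hypothetical setting: 100 proteins, 50 true interactions.
n_pairs = comb(100, 2)
m = 50
n = n_pairs - m
ex, ey, ez = expected_counts(m, n, p_fp=0.01, p_fn=0.5)

# Every tested pair falls into exactly one class.
assert abs((ex + ey + ez) - n_pairs) < 1e-6
```

Because $n$ is much larger than $m$, even a small $p_{FP}$ contributes a large share of the singly reported pairs, which is one way to see the trade-off between $p_{FP}$ and $p_{FN}$ when only $X$, $Y$ and $Z$ are observed.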

Reported estimates

Tables 1.3-1.4 detail FDR and PIN interactome size estimates for S. cerevisiae. The FDR

estimates are high, ranging from around 0.15 to 0.90 for interaction sets, but are still informative

due to the low marginal probability that a random protein pair forms an interaction.

As time has passed, and more data have become available, the size estimates have tended

to increase. However, the current consensus appears to suggest that the number of distinct

PPIs is between 20,000 and 40,000. Hart et al. (2006) present a larger and wider prediction

than other methods such as that of Stumpf et al. (2008) who estimate perhaps a third fewer

S. cerevisiae PPIs.

Study                            Data                   FDR
D’haeseleer and Church (2004)    Uetz et al. (2000)     0.46
D’haeseleer and Church (2004)    Gavin et al. (2002)    0.50-0.68
D’haeseleer and Church (2004)    Ho et al. (2002)       0.79-0.90
Huang et al. (2007)              Ito et al. (2001)      0.15-0.27

Table 1.3: FDR estimates. This table contains some FDR estimates found in S. cerevisiae PPI studies.

Study                       Year    Interactome Size
Tucker et al. (2001)        2001    8,000-12,000
von Mering et al. (2002)    2002    > 30,000 (a)
Bader and Hogue (2002)      2002    ≈ 20,000
Sprinzak et al. (2003)      2003    10,000-17,000
Grigoriev (2003)            2003    16,000-26,000
Hart et al. (2006)          2006    38,000-76,000
Stumpf et al. (2008)        2008    24,000-26,000

(a) Includes inferred interactions from matrix complex annotations, which possibly overstate binary PPIs.

Table 1.4: S. cerevisiae interactome size predictions. Published estimates of S. cerevisiae’s interactome size from different sources, along with the year of publication.



1.7. SUMMARY Introduction

1.7 Summary

Chapter 1 has outlined the concepts that feature prominently in Chapters 2-5. The graphs,

random graph methods and traits are relied upon in all subsequent chapters. Biological

interactions form graphs on thousands of distinct nodes, containing an unknown number

of inter-relationships between individual nodes. These relationships are studied in order

to understand how biological organisms function, and to appreciate possible differences

in complexity across different species. An understanding of how function may be inferred

across organisms enables easier experimentation on model organisms.

Graphs have been studied in an attempt to understand better the structural properties of

biological networks. Random graph models have made progress in simulating the degree

sequences, and elements of the local structure, of the observed graphs. The degree

sequence, as well as other network traits, has been used to characterise biological networks.

Whilst the degree sequence does not fully summarise all aspects of the graph, it

is believed to contain information pertinent to the underlying evolution of the graph and

how that relates to the stability and operation of the system. The traits of empirical graph

data are discussed further in Chapter 2 along with analysis of the biological traits used in

subsequent chapters.

Random graphs can be compared to empirical data to assess the relevance of traits and

characterise properties of real life networks, as pursued in Chapter 3. In turn, graph studies

can further the biological understanding of PINs as well as the underlying proteins and

biological interactions. Evolutionary rates of protein pairs have been linked, along with

a number of biological characteristics, to the incidence of PPIs. Studies have looked for

linkage over small subsamples of the complete interactome which may have been biased

by prior knowledge. Chapter 4 uses graph theory alongside the phylogenetic concepts

introduced in this chapter to compare phylogenetic tree topologies.

Experimental data from PPI studies form the knowledge about interactomes. However, the

data are both a subsample of the true interactome and may contain a number of incorrectly

observed protein pairs. Accordingly, other biological traits that are perhaps easier to

measure have been used to classify possible interactions and measure the error rates of

published data. The coverage of an experiment (the protein pairs that have been tested as

interactors) is paramount when considering its error. Chapter 5 introduces a model that

finds the FDR and interactome size from any experimental dataset.



Chapter 2

An exploratory analysis of interaction

data

This chapter presents an analysis of the S. cerevisiae interactome data. A collection of

the most prominent databases of S. cerevisiae PPI data are introduced (Section 2.2). The

biological traits of the S. cerevisiae interactome data found in BioGRID are then analysed

(Section 2.3). Finally, three empirical PINs are defined from the interaction databases and

their traits contrasted (Section 2.4). These graphs are used in subsequent chapters.



2.1. INTRODUCTION PIN Analysis

2.1 Introduction

The vast increase in interaction data, for instance from novel methods introduced in Section

1.4, since early studies on PINs makes reappraisal of biological and network characteristics

essential. Section 2.2 details some of the important PPI databases that contain S.

cerevisiae data. The analyses in Section 2.3 give an understanding of the current state of

the S. cerevisiae interactome data, as well as the properties of the graph data that are used

subsequently in this and later chapters.

The interactome is further investigated in Section 2.4, through the use of three empirical

graphs taken from the Database of Interacting Proteins (DIP) and the Biological General

Repository for Interaction Datasets (BioGRID). These datasets are analysed and their

traits compared to inform later work. Validations of individual interactions are also discussed

as these can confer information about the reliability of putative data.

2.2 Interactome databases

A variety of databases are available that include different samples of the complete set of

interaction data. These databases provide different services: some include computationally

inferred interactions, whilst others report only experimentally determined PPIs that

satisfy certain criteria. Table 2.1 details a selection of these interactome sources. Three

key S. cerevisiae databases are described here: Munich Information Center for Protein

Sequences (MIPS); Database of Interacting Proteins (DIP); and the database of primary

concern throughout this thesis, Biological General Repository for Interaction Datasets

(BioGRID).

Database Website Interactions

BioGRID (Stark et al., 2006) www.thebiogrid.org 220,000

IntAct (Kerrien et al., 2006) www.ebi.ac.uk/intact 170,000

BIND (Bader et al., 2001) bind.ca 84,000

DIP (Xenarios et al., 2002) dip.doe-mbi.ucla.edu 57,000

MPact (Güldener et al., 2006) mips.gsf.de/services/ppi 15,000

MIPS (Güldener et al., 2006) mips.gsf.de/services/ppi 4,000

Table 2.1: Interaction databases. A selection of available online sources of protein interaction

data, for different species, that can be used to form empirical PINs.



2.3. ANALYSIS OF THE BIOGRID S. CEREVISIAE DATABASE PIN Analysis

MIPS interaction data are found in the MPact database (Mewes et al., 2006). MIPS includes

a set of high confidence yeast protein interactions from small scale experiments.

It separates HTP from SSE experimental data enabling researchers to use MIPS PPIs as a

potentially reliable reference set.

DIP was established to collate interactions found in S. cerevisiae (Xenarios et al., 2002).

DIP includes, in addition to the list of interaction data found from the literature, a CORE

subset of interactions. CORE PPIs are those from DIP that satisfy criteria which aim to

improve data quality (Deane et al., 2002). Although the majority of the data found in DIP

come from S. cerevisiae studies, it also now contains interaction information on around

twenty other organisms.

BioGRID aims to store all the interactions experimentally reported in published literature

(Breitkreutz et al., 2008). The data include physical (i.e. protein-protein) and genetic

(e.g. protein-DNA) interactions and are regularly updated. It contains curated – reviewed

manually – interaction data originally formed by a comprehensive curation of published

articles (Reguly et al., 2006). Binary PPIs are reported from multi-protein complex data

using the spoke model (see Section 1.4.3). Having been established in 2006, it now contains

over 220,000 interaction reports from 21,638 publications for 22 organisms. S. cerevisiae

contributes over half of the reported interactions to the BioGRID database. This

extensive source contains repeats of reported PPIs.

2.3 Analysis of the BioGRID S. cerevisiae database

This section presents an analysis of the available S. cerevisiae PIN data from BioGRID.

Owing to the rate of data generation, as illustrated later by Figure 2.1, it seems apt to

reassess the properties of the reported interactions used here rather than relying on claims

made in previous studies. The properties of genetic and physical interactions, found

using the methods introduced in Section 1.4, are studied in this section.

BioGRID v2.0.39 (April 2008 release) is used here, containing 115,024 reported S. cerevisiae

protein interactions. This section details the stratification of these data according to

the experimental techniques used to obtain them, the year in which they were published,

and the size of the experiment (i.e. SSE or HTP) from which they were obtained. Additionally,

the numbers of self-interactions and the GO annotations of interacting proteins

found using different experimental methods are explored. The GO annotations are presented

in order to assess how these traits relate to the reported interactions, having been

previously linked with PPIs. The overlap methods, and other error rate techniques, have

relied on the ability to compare data from contrasting techniques, so the biological properties

of the experimental methods are also compared in this section. Finally, the repeated

interactions are analysed as a prelude to the estimation of PIN size and FDR described in

Chapter 5.

Throughout, when counting the number of distinct interactions, no distinction is made

between bait and prey. For example, it makes no difference if proteins A and B are

reported as (A, B) or (B, A). From a biophysical point of view this may not be the

best interpretation, but it simplifies comparison between techniques and the

comparison of multiply repeated interactions.
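The convention of ignoring bait and prey order can be sketched as a canonical key for each reported pair. This is a minimal illustration with hypothetical reports; the helper name `canonical` is an assumption and not part of any BioGRID tooling.

```python
from collections import Counter

def canonical(a, b):
    """Order-independent key for a reported pair: (A, B) and (B, A) coincide."""
    return tuple(sorted((a, b)))

# Hypothetical (bait, prey) reports.
reports = [("A", "B"), ("B", "A"), ("A", "C")]

report_counts = Counter(canonical(a, b) for a, b in reports)
distinct = len(report_counts)                               # distinct interactions
repeated = {p for p, c in report_counts.items() if c > 1}   # reported more than once
```

Here the reports (A, B) and (B, A) collapse to a single distinct interaction that has been reported twice, which is the quantity used when repeated interactions are analysed.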

2.3.1 Year of publication

Figure 2.1 shows the previously mentioned increase in reported interactions whilst dividing

them into genetic and physical interactions. In general, more interactions are reported

every year. However, during the early years shown, the reported data predominantly pertained

to genetic interactions and in the last decade physical PPIs have become the main

reported data. There are 66,464 physical and 48,560 genetic reported interactions in BioGRID.

In 2002, when there was a marked increase in the number of physical interactions reported,

HTP techniques based on mass spectrometry led to large scale complex identification

alongside other novel PPIs (Ho et al., 2002; Gavin et al., 2002). There has been

a further surge in the number of newly reported genetic interactions since 2003. Collins

et al. (2007a) presented an analysis of the genetic interactions found from the newly acquired

physical complex data, detailing previously unknown functional information. This

built on the earlier protein complex data of Krogan et al. (2006) and Gavin et al. (2006).

HTP techniques have become more widely used in recent years. The reduction in the

yearly reported PPIs immediately after HTP studies in 2002 may reflect an early difficulty

of publishing new experimental data obtained using these methods. However, the

techniques are now used more regularly along with other HTP methods for genetic (Pan

et al., 2006; Collins et al., 2007b) and physical (Ptacek et al., 2005; Collins et al., 2007a)

interactions. These studies have found thousands of interactions in recent years, of which

the majority are still novel (as shown in Figure 2.9).

[Figure 2.1: plot of reported interactions (axis ticks 0 to 30,000) by year of publication (1977 to 2007), for physical and genetic interactions, with an inset (axis ticks 0 to 150) covering 1977 to 1991.]

Figure 2.1: Number and type of interaction reported in S. cerevisiae by year. Genetic interactions dominate the earliest data until 1997. The inset shows, in more detail, the types of interaction reported between 1977 and 1991. Since 1997, the majority of interactions reported have been physical.

2.3.2 Experiment size and technique

Reported interactions originate from studies using a number of different experimental

techniques. There are 22 different experimental techniques contained in BioGRID. The

physical and genetic interaction sets are divided by these techniques (see Table B.3 in

Appendix B). Physical interactions are sourced mainly from affinity capture, two-hybrid

and FRET experiments whilst the genetic interactions are found using other techniques.

Interactions come from 7,393 different published studies across 5,232 yeast proteins,

comprising two main types of study: HTP and SSE. HTP studies form a small proportion

of experiments with only 52 studies (0.7%) reporting more than 100 interactions and 11

(0.1%) that contribute over 1,000 interactions to the database. However, the majority of

the data (77,854 interactions or 68%) come from these 52 studies. Amongst the 7,341


SSEs (i.e. those reporting 100 or fewer interactions), 88% report fewer than 10 distinct

interactions. Several authors have questioned the reliability of large scale studies relative

to small scale experiments (Aebersold and Mann, 2003; Phizicky et al., 2003; Bader et al.,

2004). Figure 2.2 shows the number of studies and interactions (including repeats) produced

by the different experimental methods.
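The size-based split into HTP and SSE studies can be sketched as a count of interactions per publication. The threshold of 100 follows the text; whether repeats within a study are counted is not stated, so counting distinct pairs per study is an assumption here, as are the function name `classify_studies` and the synthetic records.

```python
HTP_THRESHOLD = 100  # studies reporting more than this many interactions are HTP

def classify_studies(records, threshold=HTP_THRESHOLD):
    """records: iterable of (study_id, protein_a, protein_b).

    Returns {study_id: "HTP" or "SSE"} using distinct unordered pairs per study.
    """
    pairs_by_study = {}
    for study, a, b in records:
        pairs_by_study.setdefault(study, set()).add(tuple(sorted((a, b))))
    return {
        study: ("HTP" if len(pairs) > threshold else "SSE")
        for study, pairs in pairs_by_study.items()
    }

# Synthetic records: one large study and one small study.
records = [("S1", f"P{i}", f"P{i + 1}") for i in range(150)]
records += [("S2", "P1", "P2"), ("S2", "P2", "P1"), ("S2", "P3", "P4")]
classes = classify_studies(records)
```

A study reporting exactly 100 interactions is classed as SSE under this rule, matching the definition of SSEs as those reporting 100 or fewer interactions.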

Figure 2.2: Experimental techniques contribution to BioGRID. The majority of the physical

interactions (techniques in the red dashed area) have been found using affinity capture MS and yeast-two-hybrid

studies whilst the majority of the studies focus on affinity capture western. The spread of the genetic

data (all within the green dashed area) is more even across a wider range of methodologies.

Figure 2.2 shows that physical interactions have been primarily identified from data generated

by affinity capture using either western blot or mass spectrometry (MS). Affinity

capture MS studies produce the most interactions both overall (59% of physical PPIs) and

per published article (on average, 157 interactions). The yeast-two-hybrid (Y2H) studies

also contribute a large number of the interactions (15% of physical), whilst the remaining

methodologies contribute a smaller proportion of physical reports. Genetic data are


spread more evenly across 8 techniques, although phenotypic enhancement (37% of genetic)

and synthetic lethality (26% of genetic) experiments contribute the majority of the

data.

As mentioned, affinity capture MS studies contribute, on average, over 150 distinct interactions

to the data, which is far higher than any of the other methods. Y2H (on average,

13 interactions per study) and biochemical activity (23) are the only other methods

that contribute, on average, more than 10 physical interactions per published article.

These 3 techniques make up the majority of the HTP data (if defined by experiment size).

For genetic studies, phenotypic suppression (26), phenotypic enhancement (25), synthetic

lethality (22) and synthetic growth defect (21) techniques contribute more than 10 interactions

per study.

2.3.3 Self-interactions

There is a collection of proteins that have been shown to self-interact, for instance forming

a dimer of two identical proteins (homodimer). Self-interactions have been reported using

the majority of experimental techniques, as shown in Table 2.2. In order to compare the

biological characteristics of PPIs produced by each method it is necessary to appreciate

the propensity of self-interaction reporting. Trivially, the biological characteristics found

for any self-interaction will match as only a single protein is involved. Self-interactions

found in the BioGRID data are removed from subsequent analyses: interest lies only in

those PPIs found between different proteins.

Protein-peptide (29.2%) and co-crystal structure (18.2%) methods exhibit a high proportion of self-interactions,

as shown in Table 2.2. This would clearly influence any interaction

analysis, or classification, based on the similarity of interacting partners’ biological annotations.

Self-interactions have rarely been reported using genetic methods (6 distinct

reports from dosage rescue, 3 from phenotypic suppression and 4 across the synthetic

techniques; see, for example, Mösch and Fink (1997); Brizzio et al. (1999); Harkness

et al. (2002); Umemura et al. (2007)). Much of this variation can simply be explained

by the nature of the respective experimental methodology.


Method                     Interactions   Self-interactions [%]
Affinity Capture-MS              24,295   227 [0.9]
Affinity Capture-RNA                 57     1 [1.8]
Affinity Capture-Western          4,523   124 [2.7]
Biochemical Activity              5,192    23 [0.4]
Co-crystal Structure                132    24 [18.2]
Co-fractionation                    562    14 [2.5]
Co-localization                     304     3 [1.0]
Co-purification                   1,193    19 [1.6]
Dosage Growth Defect                 63     0 [0.0]
Dosage Lethality                    433     0 [0.0]
Dosage Rescue                     3,059     6 [0.2]
Far Western                          53     1 [1.9]
FRET                                 68     4 [5.9]
Phenotypic Enhancement           15,948     0 [0.0]
Phenotypic Suppression            4,395     3 [0.1]
Protein-peptide                     113    33 [29.2]
Protein-RNA                          33     0 [0.0]
Reconstituted Complex             1,748    91 [5.2]
Synthetic Growth Defect           5,809     0 [0.0]
Synthetic Lethality               9,638     2 [0.0]
Synthetic Rescue                  1,931     2 [0.1]
Two-hybrid                        7,802   345 [4.4]

Table 2.2: Self-interactions found from each experimental technique. The number of

distinct self-interactions found in the BioGRID data is shown. Protein-peptide and co-crystal structure

techniques have a far higher tendency to report self-interactions, whilst genetic techniques do not report

many protein-DNA self-interactions.
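The percentages in Table 2.2 follow directly from the two count columns. The short Python sketch below recomputes them from the tabulated values; the variable and function names are illustrative and are not taken from the original analysis.

```python
# Counts copied from Table 2.2:
# method -> (distinct interactions, distinct self-interactions).
TABLE_2_2 = {
    "Affinity Capture-MS": (24295, 227),
    "Affinity Capture-RNA": (57, 1),
    "Affinity Capture-Western": (4523, 124),
    "Biochemical Activity": (5192, 23),
    "Co-crystal Structure": (132, 24),
    "Co-fractionation": (562, 14),
    "Co-localization": (304, 3),
    "Co-purification": (1193, 19),
    "Dosage Growth Defect": (63, 0),
    "Dosage Lethality": (433, 0),
    "Dosage Rescue": (3059, 6),
    "Far Western": (53, 1),
    "FRET": (68, 4),
    "Phenotypic Enhancement": (15948, 0),
    "Phenotypic Suppression": (4395, 3),
    "Protein-peptide": (113, 33),
    "Protein-RNA": (33, 0),
    "Reconstituted Complex": (1748, 91),
    "Synthetic Growth Defect": (5809, 0),
    "Synthetic Lethality": (9638, 2),
    "Synthetic Rescue": (1931, 2),
    "Two-hybrid": (7802, 345),
}


def self_interaction_percent(total, self_count):
    """Percentage of a method's distinct interactions that are self-interactions."""
    return 100.0 * self_count / total


percentages = {
    method: round(self_interaction_percent(total, self_count), 1)
    for method, (total, self_count) in TABLE_2_2.items()
}

# Interactions left for each method once self-interactions are removed.
non_self = {method: total - self_count
            for method, (total, self_count) in TABLE_2_2.items()}

# Methods ranked by their self-interaction percentage, highest first.
ranked = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
for method, pct in ranked[:3]:
    print(f"{method}: {pct}%")
```

The `non_self` counts are the interactions that remain for each method after self-interactions are removed, which is the set carried forward to the subsequent analyses.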

2.3.4 Gene Ontology annotations of interacting proteins

A selection of probabilistic methods have been proposed that use location, functional, and

process annotations to find novel, or prune existing, PPI data (Chinnasamy et al., 2006;

Skrabanek et al., 2008). Some methods require interactions to have matching functional

characteristics before they are included in the training data used to represent true interactions

(Jansen and Gerstein, 2004), and others take protein pairs that do not co-localise to

generate a negative training set (Jansen et al., 2003). Accordingly, the biological characteristics

of proteins that interact should predominantly show matching properties for the

available GO categories, as these have already been used to assign training sets of PPIs.

GO annotations for function, location, and process are used to assess how well these biological

characteristics reflect the BioGRID PPI data. An organism specific S. cerevisiae


‘GO slim’ scheme, based on GO categories and taken from the Saccharomyces Genome

Database (SGD), is used for comparison of proteins. The three GO categories have different

annotations for: 21 molecular functions; 23 cellular components; and 32 biological

processes. Each protein can have multiple annotations or no known annotation. In the

latter case, the protein is either classed as unknown or ignored dependent on the analysis.

The proportions of PPIs that have been reported for all different combinations of GO

annotation are shown in Figures 2.3-2.5.
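The quantity plotted in each heatmap cell, the proportion of possible protein pairs for an annotation pair that have been reported as interacting, can be sketched as follows. This is a minimal illustration only: the protein names, annotations and interactions are hypothetical, and the counting convention (each unordered pair of distinct proteins is counted once for every annotation pair it can realise) is an assumption rather than the procedure used to produce the figures.

```python
from itertools import combinations, combinations_with_replacement

# Hypothetical toy data: protein -> set of GO slim annotations.
annotations = {
    "P1": {"protein binding"},
    "P2": {"protein binding", "motor activity"},
    "P3": {"motor activity"},
    "P4": {"hydrolase activity"},
}

# Hypothetical reported interactions between distinct proteins
# (self-interactions are excluded, as in the text).
reported = {
    frozenset(("P1", "P2")),
    frozenset(("P2", "P3")),
    frozenset(("P1", "P4")),
}


def pair_proportions(annotations, reported):
    """For each unordered annotation pair (a, b), return the fraction of
    distinct protein pairs carrying a on one side and b on the other
    that appear in `reported`."""
    classes = sorted({a for anns in annotations.values() for a in anns})
    result = {}
    for a, b in combinations_with_replacement(classes, 2):
        possible = 0
        hits = 0
        for p, q in combinations(sorted(annotations), 2):
            ap, aq = annotations[p], annotations[q]
            if (a in ap and b in aq) or (b in ap and a in aq):
                possible += 1
                hits += frozenset((p, q)) in reported
        if possible:  # skip annotation pairs with no possible protein pair
            result[(a, b)] = hits / possible
    return result


props = pair_proportions(annotations, reported)
```

Entries of the form `(a, a)` correspond to the diagonal of a heatmap, and the remaining entries to its off-diagonal cells.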

Figure 2.3: Molecular function annotations of reported interactions. A heatmap describing

the proportion of possible protein pairs that are reported in BioGRID, stratified according to the GO

annotation molecular function category. The proportion of protein pairs reported ranges from 0% to 32% (from white to

red). The diagonal classes, showing incidence of matching annotations, exhibit the highest proportion of

reported interactions.

Figure 2.3 shows, for functional annotations, the percentage of protein pairs for each particular

annotation that have been reported as interactions, ranging from 0% to 32% of the

possible protein pairs. Proteins that share the same annotation have the highest marginal

probability of having been reported as interacting, although this set of interactions is only

20% of the reported PPIs. The rest of the reported data are from proteins that do not


share functional annotations. For example, a high proportion of interactions have been reported

between protein binding and motor activity classes, and also between the enzyme

regulator and signal transducer activity classes.

Figure 2.4 shows how cellular component annotations relate to the observed PPIs. There

are 23 different annotations, and again the within annotation protein pairs show the highest

tendency to have been observed as interactions. For this GO category fewer than a

fifth of the potential protein pairs have been reported for any annotation characteristic.

Several annotation pairs also exhibit a higher proportion of PPIs than some of the within

annotation groups, e.g. between cell wall and extracellular proteins, and also between

nucleus and chromosome. Only 28% of the reported interactions are between proteins

that share the same component annotation.

Figure 2.4: Cellular component annotations of reported interactions. A heatmap describing

the proportion of possible protein pairs that are reported in BioGRID, stratified according to the

GO annotation cellular component category. The proportion of protein pairs reported ranges from 0% (white) to 17%

(red). The diagonal classes, showing incidence of matching annotations, exhibit the highest proportion of

reported interactions.


Biological process annotations cover a larger number of classes. Figure 2.5 shows the

reported PPI distribution across the categories, which behaves similarly to the other ontology

groups. Relationships showing the highest marginal rate of reported interactions are

between proteins with identical annotations. However, once again, a small overall percentage

(approximately 21%) of the reported data are from protein pairs that have the

same annotation class. Although the electron transport proteins appear to show little

agreement with the other classes, there are only a limited number of interactions reported

involving proteins from this class (17 in total), making any inference about how these

components interact with other classes difficult.

Figure 2.5: Biological process annotations of reported interactions. A heatmap describing

the proportion of possible protein pairs that are reported in BioGRID, stratified according to the GO annotation

biological process category. The proportion of protein pairs reported ranges from 0% (white) to 16% (red). The

diagonal classes, showing incidence of matching annotations, exhibit the highest proportion of reported

interactions.

In summary, although within class annotations are highly represented, these interactions

only represent a small proportion of the overall PPI data. Whilst it may be appealing

to assume that some clear link between these biological annotation categories and PPIs


exists, and attribute the other data to noise, the evidence suggests that assigning training

sets for PPI predictions that rely on GO characteristics is misleading. The probabilistic

methods may, in fact, be predicting GO characteristic linkage rather than producing

reliable PPI predictions.

2.3.5 Gene Ontology annotations and experimental techniques

Experimental techniques may not be designed to sample the same protein pairs, just as

some are not designed to find homodimers. Overlap studies, introduced in Section 1.6.3,

have reported error rates based on the comparison of data from different experiments,

in general using a reference set from a different technique (e.g. the MIPS set used in

D’haeseleer and Church (2004) is predominantly drawn from SSE methods). The level

of similarity shown between GO characteristics for each method may give an indication

of how similar the data are and guide whether it is a valid assumption to use a reference

set from SSE methods to test the reliability of HTP data.

For all subsequent analysis, an interaction is defined as having matching annotations if

both proteins have the same known GO annotation for a particular annotation category.

Figures 2.6-2.8 show the proportion of matching annotations found for PPIs reported by

each experimental technique. For each analysis an interaction is included only if there

exists a known GO annotation for each protein.

The dashed line, in each figure, shows the proportion of matching annotations for the complete

physical or genetic interaction set. Each bar, for a given experimental technique, has

a colour density illustrating the p-value of a proportion test between matching annotations

found in the particular experiment type and the complete interaction set (physical or genetic).

These are more translucent if there is less evidence to support the experimental

statistic being different from that of the physical or genetic data (and fully coloured if

significant at the 5% level as being different to the complete set proportion).
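The matching rule and the comparison against the complete interaction set can be sketched in a few lines of Python. Several points here are assumptions and not details taken from the analysis: "the same annotation" is read as sharing at least one term within the category, the test is a one-sample, two-sided normal approximation against the complete-set proportion, and all proteins, annotations and counts are hypothetical.

```python
import math


def matching_proportion(interactions, annotation):
    """Return (matches, n) over interactions in which both proteins have a
    known annotation; a match means the two annotation sets share a term."""
    matches = 0
    n = 0
    for p, q in interactions:
        ap, aq = annotation.get(p), annotation.get(q)
        if not ap or not aq:  # skip pairs with an unannotated protein
            continue
        n += 1
        matches += bool(ap & aq)
    return matches, n


def proportion_test(matches, n, p0):
    """Two-sided one-sample z-test of matches/n against a reference
    proportion p0 (normal approximation). Returns (z, p_value)."""
    p_hat = matches / n
    standard_error = math.sqrt(p0 * (1.0 - p0) / n)
    z = (p_hat - p0) / standard_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical annotations: protein "D" has no known annotation.
annotation = {"A": {"x"}, "B": {"x", "y"}, "C": {"z"}, "D": set()}
interactions = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "C")]
matches, n = matching_proportion(interactions, annotation)

# Hypothetical technique: 60 matches in 100 annotated interactions,
# compared with a complete-set matching proportion of 0.5.
z, p_value = proportion_test(60, 100, 0.5)
significant = p_value < 0.05  # the 5% level used for full colouring
```

A two-sample test, or an exact binomial test, would be equally reasonable readings of "proportion test"; the one-sample form is used here only to keep the sketch short.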

Figure 2.6 presents results for the molecular function category. The genetic and physical

interactions show marked differences in the level of matching annotations, with fewer

found in genetic data. Both biochemical activity techniques and protein-RNA techniques

produce a low level of matching annotations. In contrast, interactions elucidated using

co-crystal structure exhibit the highest proportion of matching annotations. Co-crystal

structure methods also contribute one of the highest proportions of self-interactions (18.2%; Table 2.2), and the 108

other PPIs found using this method show a high level of annotation concordance from

165 studies.

[Figure 2.6 (bar chart). Axis: matching function proportion, 0.0 to 0.8. One bar per experimental technique, grouped into physical interactions and genetic interactions, each labelled with its number of interactions.]

Figure 2.6: Proportion of matching functional annotations by experiment technique.

The proportion of matching functional annotations found are shown for each experimental technique.

Dashed lines show average proportion across complete genetic or physical interaction set. Bar density

shows p-value of binomial proportion test, assessing similarity, between technique and genetic or physical

dataset.

Figure 2.7 shows the proportion of matching cellular component annotations found for

reported interactions. The overall difference between the proportion of matching annotations

in the genetic and physical techniques is smaller than for functional annotations, and

the overall propensity to match is higher. The larger physical datasets comprise

interactions reported by affinity capture MS, affinity capture western, biochemical activity and two-hybrid (again

referred to as Y2H) techniques. These datasets all exhibit different annotation

characteristics. Affinity capture MS contributes around half of the physical data and has

a significantly different propensity for matching annotations than any of the other techniques

that contribute more than 5% of the data. Two-hybrid and biochemical techniques


produce interactions with a relatively low propensity of having matching component, or

functional, annotations.

[Figure 2.7 (bar chart). Axis: matching component proportion, 0.0 to 0.8. One bar per experimental technique, grouped into physical interactions and genetic interactions, each labelled with its number of interactions.]

Figure 2.7: Proportion of matching component annotations by experiment technique.

The proportion of matching component annotations found are shown for each experimental technique.

Dashed lines show average proportion across complete genetic or physical interaction set. Bar density

shows p-value of binomial proportion test, assessing similarity, between technique and genetic or physical

dataset.

Figure 2.8 shows the same matching annotation data for biological process annotations.

The proportions of matching annotations found for each technique replicate the trends

shown for molecular function annotations in Figure 2.6. Biochemical activity and Y2H

studies report interactions with a lower propensity to share biological process annotations,

whilst some of the smaller scale experimental techniques (including co-crystal

structure) display very high levels of concordance. In general, the level of matching annotations

shown by each technique is significantly different from those produced by the

other techniques.

[Figure 2.8 (bar chart). Axis: matching process proportion, 0.0 to 0.8. One bar per experimental technique, grouped into physical interactions and genetic interactions, each labelled with its number of interactions.]

Figure 2.8: Proportion of matching biological process annotations by experiment

technique. The proportion of matching biological process annotations found are shown for each experimental

technique. Dashed lines show average proportion across complete genetic or physical interaction

set. Bar density shows p-value of binomial proportion test, assessing similarity, between technique and

genetic or physical dataset.

Overall, Figures 2.6-2.8 show large differences in the proportion of matching annotations

found for genetic or physical interactions. Over one third of S. cerevisiae proteins

have unknown annotations, and for each method, whether genetic

or physical, both proteins in a reported interaction have known annotations

around 60% of the time (shown in Figure B.2 in Appendix B). Affinity capture and far-western

techniques report interactions with an enriched level of known annotations. FRET

interaction data almost always involve proteins that have been annotated in all three GO

categories suggesting this technique focuses on well studied proteins. The use of interaction

data from particular experimental methods as high confidence reference sets may

bias the final analysis of novel methods, as some of the SSE methods (such as FRET)

appear highly biased in their reporting of PPIs relative to the overall corpus of reported

interactions.


2.3.6 Repeated interactions

Figures 2.6-2.8 have shown that biological differences exist between the characteristics

of genetic and protein-protein interactions. Accordingly, the characteristics of these networks

may be fundamentally different and to eliminate any possible confounding influence

the remaining analyses consider only the physical interactome. In order to assess the

overall coverage of reported PPI data, the number of verifications of reported interactions

can be used. These can be used to assess either how long it may take to report the complete

interactome, or to assess the error rates found in the data (conditional on a known

interactome size).

Figure 2.9 shows the number of PPIs reported (and stored in BioGRID) until 2007, divided

into novel or repeated reports. The complete data contain 30,074 interactions that

have been reported once and 9,506 other interactions that have been reported more than

once. These validated interactions therefore make up approximately 24% of the distinct

PPIs. Across the complete physical data, containing 66,464 reports, there are 1,679 (4%)

interactions that have been reported at least 5 times and 321 (0.8%) that have been reported

more than 10 times.
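The distinction between novel and repeated reports amounts to counting how often each unordered protein pair appears in the list of reports. The Python sketch below shows this on a hypothetical list of reports (the protein names are placeholders), and then checks the proportion of validated interactions using the counts quoted above.

```python
from collections import Counter


def report_summary(reports):
    """Summarise a list of (protein_a, protein_b) reports, treating each
    interaction as an unordered pair."""
    counts = Counter(frozenset(pair) for pair in reports)
    once = sum(1 for c in counts.values() if c == 1)
    repeated = sum(1 for c in counts.values() if c > 1)
    return {
        "distinct": len(counts),
        "once": once,
        "repeated": repeated,
        "repeated_fraction": repeated / len(counts),
    }


# Hypothetical reports; ("A", "B") and ("B", "A") are the same interaction.
toy_reports = [("A", "B"), ("B", "A"), ("A", "C"),
               ("C", "D"), ("C", "D"), ("D", "C")]
summary = report_summary(toy_reports)

# Arithmetic check using the counts quoted in the text.
reported_once = 30074
reported_more_than_once = 9506
distinct_ppis = reported_once + reported_more_than_once
validated_share = reported_more_than_once / distinct_ppis
print(f"{distinct_ppis} distinct PPIs, {validated_share:.1%} reported more than once")
```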

[Figure 2.9 (bar chart). Axes: reported interactions (0 to 30,000) against year (1977 to 2007); series: novel interactions and repeated interactions.]

Figure 2.9: Accrual of reported yeast protein interactions over time. The numbers of reported

interactions found for S. cerevisiae are shown over the last 30 years. Red indicates novel interactions

while yellow bars represent the reported interactions that have been published before.



2.4. INTERACTION NETWORKS PIN Analysis

A single interaction, between YDR477W and YGL115W, has been reported 39 times.

These proteins, which are both kinases, are part of the protein complex Snf1p. This complex

is essential for regulating transcriptional changes in multiple different biological processes

(Kuchin et al., 2000; Lo et al., 2001), and is homologous to AMP-K which is found

in all eukaryotes (Kemp et al., 1999). Protein kinases modify other proteins through phosphorylation.

This can result in functional change of the target protein, and approximately

30% of yeast proteins can be modified at any time (Ptacek et al., 2005). The proteins

are both of crucial importance to understanding eukaryotic processes and have featured

in commonly completed stress response studies.

Section 2.3 has illustrated some of the properties of BioGRID S. cerevisiae protein interactions.

The vast majority of interactions occur between proteins that do not share the

same GO annotations, and in general each experimental method reports interactions with

significantly different GO annotation characteristics. Accordingly, the use of GO annotations

to form strict training and reference sets for reliability analyses appears flawed

without further evidence regarding the reliability of the BioGRID database. The availability

of PPI validations and a wide variety of published analyses means that the subsequent

studies concentrate on the physical interactome.

2.4 Interaction networks

The development of novel experimental techniques, along with the concurrent increase

in interaction data, makes it difficult to effectively isolate stochastic or systematic errors.

Accordingly, graph analyses presented here use a collection of empirical network datasets

rather than just a single complete set. As well as possibly improving interpretation of

the analyses, this enables observation of any differences between published empirical

networks (e.g. between a network that has been curated and one which has not been).

CORE and DIP graphs are formed from the data contained in DIP, whilst a literature

curated graph (LC) has been generated using BioGRID. These network datasets, referred

to as empirical graphs, are now defined and explored.

The DIP graph contains all distinct PPIs found in the Database of Interacting Proteins

(April 2008). CORE is a subset of the DIP graph found by comparison of the expression

levels and the availability of paralogous interaction data for each interaction in DIP

(Deane et al., 2002). These criteria define the CORE set, which is around a third the size

of DIP and is considered a higher confidence set of interactions.

LC contains a subset, found from hand curation of the literature, of reported interactions

from BioGRID (Reguly et al., 2006). The PPI data have been divided into high-throughput

(HTP) and small scale experimental (SSE) data. The HTP data are taken from

five studies: Uetz et al. (2000); Ito et al. (2000, 2001); Ho et al. (2002); Gavin et al.

(2002).

The three empirical graphs form different samples of S. cerevisiae PPIs, containing data

that has been hand curated (LC), passed some expert criteria (CORE), or is a complete

interaction database (DIP). They are treated as being subsamples of the S. cerevisiae interactome

in Chapters 3-4. The use of multiple observed PINs enables discussion regarding

how different network sizes, and reliability, affect analyses.

Network traits of the empirical graphs can be used to characterise the biological network

(see Section 1.5.1 on page 43). They also form a means of discriminating between individual

models used to generate graphs such as ERGMs (Pattison and Wasserman, 1999)

or geometric models (Higham et al., 2008). The network traits of the empirical graphs are

now compared.

2.4.1 Local graph structure

The three empirical graphs have different sizes, with CORE containing approximately a

third of the interactions found in DIP. The LC graph is the largest graph, having the highest

order, 5,109 nodes or proteins, and size, 21,283 interactions or edges (as shown in Table

2.3). The LC set is split up into HTP and SSE subsets, of which HTP contributes

11,571 distinct interactions whilst SSE data contributes 11,334 distinct interactions to the

graph (the intersection of HTP and SSE containing 1,622 interactions).

Graph   Nodes   Edges    Components   Maximum degree   Mean degree
CORE    2,528    5,728   78            91              4.8
DIP     4,931   17,471   31           283              7.0
LC      5,109   21,283   42           319              8.5

Table 2.3: Components and degree for empirical graphs. The network traits for empirical

data are shown including the size, order and components found in each graph.

The highest connected protein (node with highest degree) has a similar proportion of


the total interactions found in the empirical graph: 1.6% (CORE); 1.6% (DIP); 1.5%

(LC). However, this highest connected protein is connected to a higher proportion of each

graph’s proteins as the graph size increases: 3.6% (CORE); 5.7% (DIP); 6.2% (LC). The

maximum degree increases in line with the size of the graph as opposed to the graph’s

order.
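The percentages quoted for the highest connected protein can be recomputed from Table 2.3. The Python sketch below does so; the function and key names are illustrative only.

```python
# Values copied from Table 2.3: graph -> (nodes, edges, maximum degree).
TABLE_2_3 = {
    "CORE": (2528, 5728, 91),
    "DIP": (4931, 17471, 283),
    "LC": (5109, 21283, 319),
}


def hub_shares(nodes, edges, max_degree):
    """Percentage of the graph's edges, and of its nodes, that involve the
    node of highest degree."""
    return {
        "share_of_edges": round(100.0 * max_degree / edges, 1),
        "share_of_nodes": round(100.0 * max_degree / nodes, 1),
    }


shares = {name: hub_shares(*values) for name, values in TABLE_2_3.items()}
for name, s in shares.items():
    print(f"{name}: {s['share_of_edges']}% of edges, {s['share_of_nodes']}% of nodes")
```

The share of edges is nearly constant across the three graphs while the share of nodes grows, which is the pattern expected if the maximum degree tracks the number of edges (the size) and not the number of nodes (the order).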

The percentage of [nodes, edges] found in the (largest) component, the GCC, for each

graph are: CORE [92.7%, 97.9%]; DIP [98.8%, 99.8%]; LC [98.4%, 99.8%]. The GCC

for each graph suggests that the CORE graph is structurally different from the other empirical

graphs as it has a smaller proportion of the edges and nodes in its GCC.

Table 2.4 lists the clustering coefficients for the empirical graphs and the average clustering

coefficient found for the set of graphs with the same degree sequence. Graphs with

an identical degree sequence, on average, exhibit a significantly lower clustering coefficient

than the empirical graphs. CORE data exhibits the highest level of clustering of the

empirical graphs whilst producing the lowest level of clustering in the associated random

graphs. This suggests that the data are highly clustered around small sets of nodes, which

do not have many edges between them. In contrast, the LC and DIP graphs have lower

clustering coefficients which could be a consequence of the higher average degree of each

node in these graphs. However, the average trait for the random graphs is significantly

lower in all cases.

        Clustering coefficient
Graph   Observed   Random (avg.)
CORE    0.205      0.005
DIP     0.094      0.009
LC      0.125      0.013

Table 2.4: Clustering coefficients. This table shows the clustering coefficients found for the empirical

data in comparison to the average coefficient for a random graph with the same degree sequence.
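The comparison in Table 2.4 relies on random graphs that share the degree sequence of an empirical graph. One common way to obtain such graphs is repeated double edge swaps, sketched below in plain Python. This is an illustration under stated assumptions and not the procedure used to produce the table: the clustering coefficient is taken to be the mean local coefficient, the toy graph (a ring of triangles) is hypothetical, and the number of swaps is arbitrary.

```python
import random
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; nodes of degree < 2 contribute 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def rewire(adj, n_swaps, rng):
    """Degree-preserving randomisation by double edge swaps:
    (a-b, c-d) -> (a-d, c-b), rejecting self-loops and duplicate edges."""
    adj = {n: set(nbrs) for n, nbrs in adj.items()}  # work on a copy
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edges)), 2)
        a, b = edges[i]
        c, d = edges[j]
        if rng.random() < 0.5:  # choose the orientation of the second edge
            c, d = d, c
        if len({a, b, c, d}) < 4 or d in adj[a] or b in adj[c]:
            continue
        adj[a].remove(b); adj[b].remove(a)
        adj[c].remove(d); adj[d].remove(c)
        adj[a].add(d); adj[d].add(a)
        adj[c].add(b); adj[b].add(c)
        edges[i], edges[j] = (a, d), (c, b)
        done += 1
    return adj


def ring_of_triangles(n_triangles):
    """Hypothetical toy graph: triangles joined in a ring by single edges."""
    adj = {i: set() for i in range(3 * n_triangles)}

    def add(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for t in range(n_triangles):
        a, b, c = 3 * t, 3 * t + 1, 3 * t + 2
        add(a, b); add(b, c); add(a, c)
        add(c, (3 * (t + 1)) % (3 * n_triangles))
    return adj


graph = ring_of_triangles(10)
observed = average_clustering(graph)
rewired = rewire(graph, n_swaps=200, rng=random.Random(0))
randomised = average_clustering(rewired)
print(f"observed {observed:.3f}, after rewiring {randomised:.3f}")
```

Averaging `randomised` over many independently rewired copies would give a value comparable in spirit to the "Random (avg.)" column.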

2.4.2 Degree sequence

Figure 2.10 shows rank-degree plots on a log-log scale for the nodes of each of the empirical

graphs: CORE, DIP and LC. Plots with a straight line have been used to justify the

use of a power-law distribution for PIN degree sequences. Whilst the degree sequences

do not fall perfectly on a line, there may be evidence of a power law and scaling relationship

if the majority of proteins with few interactions are excluded. However, overall


the interaction graphs do not appear to have a degree distribution that can be reasonably

taken to be described by a power-law. A large proportion of the nodes have a limited

number of edges, whilst a handful have a large number of the edges. The CORE data,

which have around a third of the edges of DIP, also contain fewer nodes with noticeably

higher degree than the DIP or LC graphs.







[Figure 2.10: three panels, (a) CORE, (b) DIP and (c) LC. Axes in each panel: node rank (1 to 5000) and node degree (1 to 100).]

Figure 2.10: Rank-degree plots of network data. These show rank-degree plots for the degree

sequence data, on a logarithmic plot.

In order to assess the fit of degree sequence data to a Poisson distribution and some commonly

used heavy-tailed distributions, maximum log-likelihood analyses (Burnham and

Anderson, 1998) have been performed (described in Section A.1 in Appendix A). Table

2.5 shows the Akaike information criterion (AIC) results that are used to choose

between the tested models (where the models have up to 4 free parameters). The PIN

degree sequences demonstrate relationships more indicative of a heavy-tail distribution

than a Poisson distribution. The best fit of this selection is the stretched exponential

model, whilst the discretised log-normal distribution fits nearly as well. The exponential

distribution, at its maximum likelihood parameters, fits the data less well than the

power-law distribution, as has been found previously (Reguly

et al., 2006). However, the assertion that the degree sequences are best modelled by the

power-law distribution is not supported even by a comparison with only a handful of

other heavy-tailed alternatives. Accordingly, it may be more reasonable to generate random

graphs that have the same degree sequence as the empirical data, to avoid placing

assumptions on the character of the degree distribution that are not well supported by

analysis.
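The model comparison rests on the AIC, which is 2k - 2 ln L for a model with k free parameters and maximised likelihood L; lower values indicate a better trade-off between fit and complexity. The Python sketch below illustrates the calculation for two one-parameter models on a hypothetical degree sequence, and then computes AIC differences from the values in Table 2.5. The fitting procedure actually used is described in Appendix A and is not reproduced here; the geometric distribution stands in for a discrete exponential model as an assumption of this sketch.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def poisson_loglik(degrees):
    """Maximised Poisson log-likelihood (rate estimated by the mean)."""
    lam = sum(degrees) / len(degrees)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in degrees)


def geometric_loglik(degrees):
    """Maximised log-likelihood of P(k) = p (1 - p)^(k - 1), k = 1, 2, ..."""
    p = len(degrees) / sum(degrees)
    return sum(math.log(p) + (k - 1) * math.log(1 - p) for k in degrees)


# Hypothetical heavy-tailed degree sequence (not taken from the data).
degrees = [1] * 50 + [2] * 20 + [3] * 10 + [5] * 5 + [10] * 3 + [40] * 2
aic_poisson = aic(poisson_loglik(degrees), 1)
aic_geometric = aic(geometric_loglik(degrees), 1)

# AIC values copied from Table 2.5.
TABLE_2_5 = {
    "CORE": {"Poisson": 18750, "Exponential": 13220, "Power-law": 12550,
             "Log-normal": 11820, "Stretched exponential": 11800},
    "DIP": {"Poisson": 62290, "Exponential": 29700, "Power-law": 28850,
            "Log-normal": 27160, "Stretched exponential": 27130},
    "LC": {"Poisson": 83820, "Exponential": 32480, "Power-law": 31790,
           "Log-normal": 29940, "Stretched exponential": 29900},
}


def aic_differences(row):
    """Difference between each model's AIC and the lowest AIC in the row."""
    best = min(row.values())
    return {model: value - best for model, value in row.items()}


deltas = {graph: aic_differences(row) for graph, row in TABLE_2_5.items()}
best_model = {graph: min(row, key=row.get) for graph, row in TABLE_2_5.items()}
```

On the tabulated values, the log-normal model lies within 20 to 40 AIC units of the stretched exponential for all three graphs, whereas the power-law model is several hundred units worse.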

2.5. DISCUSSION PIN Analysis

Graph   Poisson   Exponential   Power-law   Log-normal   Stretched exponential
CORE    18750     13220         12550       11820        11800
DIP     62290     29700         28850       27160        27130
LC      83820     32480         31790       29940        29900

Table 2.5: AIC analysis of possible degree distribution. This table shows the AIC for 5

different possible degree distributions. The values relate to the log-likelihood and number of parameters

for the distribution used and each empirical graph’s degree sequence.

The three empirical graphs have similar degree distributions even though they are of different

sizes. The CORE set is more highly clustered than the larger networks, although it

also has more connected components than the larger graphs. The graphs are all on subsets

of the complete proteome, and do not appear to be complete subgraphs of the true

interactome as the highest degree scales with the size rather than the order of the graph.

The degree distribution of the graph can be easily replicated by fixing the structure of the

graph, rather than using a simple probability model for the degrees of each node, which

does not closely fit the empirical data. Accordingly, rewiring approaches may be a better

means of generating random graphs to compare with the reported networks.

2.5 Discussion

The majority of interaction data has been published in the last 5 years, and the publication

of novel interactions from S. cerevisiae, as shown in Figure 2.9, is still common. However,

the number of distinct PPIs reported already exceeds, or is comparable to, predicted

interactome sizes (listed in Table 1.4 on page 61). This could reflect that empirical data

contain erroneous interactions, suggesting a need for means to improve their reliability.

Recently, S. cerevisiae PPI data have been supplemented by two global studies of multiprotein

complexes (Krogan et al., 2006; Gavin et al., 2006). These have added validations

for some binary interactions whilst also presenting an extra biological characteristic that

may help to predict further binary PPIs. Any correlation of interactions with biological

characteristics may be aided by complex annotations and enable better understanding of

how best to infer binary interactions from complex annotations.

Interaction data here are reported from a wide variety of experimental sources. As shown

in Figures 2.3-2.5, there is a tendency for physical interactions to be between proteins

that share GO slim biological characteristics. However, there are some inter-annotation

groups that exhibit a high number of reported interactions. Electron transport proteins


have been found to interact with those involved with membrane organisation, whilst nuclear

proteins show an abundance of interactions with chromosomal proteins. Both of

these observations are consistent with the biological system, as electron transportation

occurs on internal mitochondrion membranes and chromosomes are found within the nucleus.

Electron transport proteins are known to be connected by non-protein co-factors

when performing their role within the cell, supporting the lack of within category interactions

found for this annotation. Clearly, interactions both within annotation classes,

and between classes, are biologically important. All these linkages may reflect important

properties of the interaction network and preserving these graph characteristics may be

crucial if aiming to create a plausible graph model.

The proportion of matching annotations found for each experimental technique varies

widely. The differences found in GO characteristics for each technique could be a consequence

of either experimental design or noise. If the techniques can all test the same set

of protein pairs, then this suggests a higher level of noise from some techniques (depending

on which experimental technique is most accurate). However, it seems more realistic

that experimental design explains these differences and the reliability of each experiment

can best be assessed by replicated data, rather than comparison with results from different

techniques.

It was also shown that data obtained by high-throughput methodologies exhibit a less pronounced

level of similarity in their GO slim biological characteristics than data from small-scale

experimental techniques. This corroborates the evidence supplied in Table 2.2 which

suggests different techniques have different propensities for reporting self-interactions.

The concept that different techniques produce different subsets of the data must be taken

into account when comparing data to measure noise or generalising about PPI characteristics,

especially when examining the contribution of HTP experiments (52 studies

contribute 68% of the data).

The high prevalence of self-interactions found using x-ray crystallography or peptide-protein

methods could be explained in a number of ways. First, the intricacy of the technique:

x-ray crystallography requires protein crystals, obtained using a high concentration of

protein in solution. The propensity of a protein to form a self-interaction, or homodimer,

under these conditions may be abnormally, non-physiologically, high. Second, both x-ray

crystallography and peptide-protein methods aim to isolate specific structures, and

consequently may set out deliberately to elucidate homodimers. Third, the lack of self-interactions

reported using other methods may be explained by the more macroscopic

tools they employ. Whereas x-ray crystallography and peptide-protein methods actually

observe the protein structures, this visible information is lost when mass spectrometry is

employed – a protein is either present or not, and the level of the protein (i.e. whether

there is twice as much, as in a homodimer) cannot be determined.

Whilst biological traits can inform us about the structure of a graph, or characterise aspects of it,

the reliability of individual PPIs, and of the graph as a whole, is informed by validations

and replicated reports of interactions. Previous publications have used specific

methods, or annotations, to verify data, presuming that they are more reliable. However,

as shown by Figures 2.6-2.8, individual methods show massive variation in matching annotations,

so the skewed use of reference data from particular SSE methods may not be

appropriate, for both coverage and experimental reasons. Validated within-technique replicates,

however, can offer a less biologically biased view of the true interactome.

S. cerevisiae PPI graphs feature a small set of highly connected proteins contained in a

sparse graph of thousands of proteins. The majority of these proteins have few interacting

partners and the mean degree ranges from 5 to 8 for our 3 empirical graphs. The empirical

graphs are more highly clustered than graphs with the same degree sequence, containing

small cliques and highly clustered sets that share matching biological annotations.

High levels of clustering may reflect a tendency for small groups of proteins to exhibit

similar interaction partners or be an artefact of the techniques used to find the data. The

majority of the experiments are small scale and focused on small sections of the proteome,

so it is hard to know for certain if this feature is also true for the complete error-free interactome.

Data from complex experiments have also influenced the observed clustering,

whilst yeast two-hybrid data, collected across large components of the proteome, do not

exhibit such high levels of clustering.

The degree sequences of the empirical graphs exhibit characteristics consistent with scale-free

networks, although they are not best modelled by a power-law degree distribution (as

shown in Table 2.5). Indeed, the power-law distribution for the degree sequence may not

capture the intricacy and complexity of the interaction data, as has been seen for graphs

of the internet (Doyle et al., 2005). Instead, the degree distribution could be preserved

simply by fixing the degrees observed in empirical data. This simple approach avoids

using probability models to generate structurally similar random graphs for hypothesis

testing. The use of probability models for the degree distribution could easily result in

comparing empirical data with possibly inaccurate random graphs that do not reasonably

reflect the key properties of the empirical graph.


Chapter 3

Graph ensembles

This chapter analyses the network and biological traits of a selection of random graph

ensembles. These graphs (Section 3.2) replicate network and biological constraints to

gain insights into the evolution, and structure, of PINs. Random graphs sampled from

various graph ensembles are compared with empirical data (Section 3.3). The average

ensemble traits for these methods form the statistics compared.

3.1 Introduction

Topological and biological features of empirical protein interaction graphs may inform

us about the network’s evolution (Stumpf et al., 2007). The interactome may also provide

further information about protein complexes, interactions, and function of biological

systems (Chen et al., 2007) or be used in comparative analyses.

A variety of different graph ensembles, or null models, have been used for PIN analyses

(Milo et al., 2002; Jordan et al., 2003), although the rationale for their choice is not always

clear. Assumptions regarding how the graph is structured or its size and order may bias

conclusions, leading to a model not being appropriate for our hypothesis, and risk falsely

dismissing findings or generating false positive conclusions (May, 2001). Ideally a null

model is used to negate the potential effects of confounding variables or processes. In

practice, it is difficult to find a truly null model as one cannot be certain that features which

have shaped the data are not already woven into the model (Harvey et al., 1983; Strong

et al., 1984).

A selection of studies have investigated whether or not traits of interacting proteins are

different from those of non-interacting proteins (Fraser et al., 2002; Lemos et al., 2004).

Particular topological traits have also been found in a variety of different biological graphs

(Jordan et al., 2003; Berg and Lassig, 2004; Agrafioti et al., 2005). Graph ensembles

which choose certain characteristics to fix have been used to show the significance of traits

observed in the empirical data (under the assumption that the data represent a complete

PIN). For instance, the total number of nodes, or edges, has been fixed and the random

graphs generated compared to an observed graph (Wagner, 2001). Degree sequences,

and other biological traits, of the empirical data have also been fixed in the chosen graph

ensemble (Milo et al., 2002; Thorne and Stumpf, 2007).

To make a reasonable comparison, which may lead to the conclusion that a characteristic

is important in determining the interaction graph, the ensemble graph model should

retain certain properties of the empirical data. However, when generating graphs (where

the properties of the nodes and edges are important) it is hard to define a satisfactory

parameter set, such as the number of edges, that should be fixed.

In order to assess the possible linkage of traits with protein interactions, or with the graph

structure, the empirical data are compared here to the traits of different graph ensembles

(as introduced in Section 1.5.2 on page 48). Traits used form a selection of properties

that have been reported in the literature as linked with PPIs, or have previously been used

to generate ensembles in biological graph analyses (from Section 1.5.1 on page 46). A

selection of different graph ensembles are proposed and their average trait properties are

measured. The traits assessed relate to the functions, processes and apparent abundance

of the S. cerevisiae proteins, as well as the structural properties of each protein found in

the interaction graph.

Analyses of graph ensembles are complemented by an investigation of perturbed graphs

which are close to each empirical graph. Here the graph is gradually perturbed, only

changing a single edge or pair of nodes at each step. Traits are observed as the graphs

are perturbed and the perturbation effects analysed. This analysis is used to assess the

significance of the trait statistics found for the empirical data as well as being used as a

measure of each trait’s robustness.

3.2 Methods

Biological and network traits are studied through the generation of various random graphs

with identical order and size. The data are assumed to represent the complete PINs, ignoring

noise in an attempt to model the available reported PPI data. These models, whilst

non-random, are intended to inform us about the possible structure of PINs in alternative species

whose evolutionary histories are similar but divergent. The use of random graph models,

with appropriate assumptions, should hopefully allow an assessment of how important

topological structure, or other biological traits, are in the context of the overall graph.

Graphs are generated using two approaches:

• rewiring: graphs are sampled from graph ensembles based on traits of empirical

data (Section 3.2.2);

• perturbing: empirical graphs are altered by permuting nodes or moving edges, one-by-one

(Section 3.2.3).

3.2.1 Data

In order to define biologically motivated rewiring schemes for graph ensembles, biological

traits are used. A number of biological features have been proposed as means of

classifying protein-protein interactions (Salwinski and Eisenberg, 2003; Yu and Fotouhi,

2006; Skrabanek et al., 2008; Ramani et al., 2008), or linked with PPIs (Valencia and

Pazos, 2002; Bhardwaj and Lu, 2005; Thorne and Stumpf, 2007).

A collection of noted biological characteristics are used here for analyses and generation

of random graphs: (a) molecular function, biological process or cellular component annotations

taken from the GO slim ontology (Ashburner et al., 2000); (b) multiprotein

complexes found in Gavin et al. (2006); (c) mRNA expression levels as a proxy for S.

cerevisiae protein expression levels from Cho et al. (1998); and (d) percentage sequence

similarity between S. cerevisiae proteins found from BLAST alignments.

The random graphs analysed in Section 3.3 also use empirical graphs (CORE, DIP or LC)

defined in Section 2.4 on page 79.

3.2.2 Rewiring

Graph ensembles are generated that take account of observed traits of empirical graphs.

Rather than focussing solely on network traits to find a probability model for plausible

PINs, biological constraints are used to construct graphs alongside network traits (which

are either fixed, or not, depending on the graph ensemble).

Graph rewiring (Bender and Canfield, 1978) is used to generate random graphs from the

empirical data. Each rewiring maintains both the order, n, and size, m, of the empirical

graph used to generate the rewired graph.

Definition 3.1 (Rewiring) An edge, e, is rewired if it is deleted from a graph’s edge set,

E, and a new edge, e′, is added to the graph from the same node set, V.

Definition 3.2 (Graph rewiring) A graph, H ∼ (V_H, E_H), is a rewiring of G ∼ (V_G, E_G)

if |E_H| = |E_G| and V_H = V_G.

Each random graph considered is a sample from a graph ensemble (forming a particular

probability distribution over the space of graphs with n nodes and m edges). Comparisons

are made between the empirical graph and those found when sampled from the graph

ensemble. Consequently, the ensemble serves as a null model for the analyses presented

here. Topological and then biological ensembles are discussed. The empirical graphs are

considered to be G ∼ (V, E) throughout.

Topological ensembles

Three different graph ensembles are used that fix certain network traits of the empirical

data: (i) Random graph; (ii) Node shuffle; and (iii) Network shuffle. These take account

of the degree sequence, size, and order of the empirical graph.

(i) Random graph A graph, H, from this ensemble is generated using the ER graph

model (see Section 1.5.2 on page 48). This fixes the order, n, and size, m, to be the same as those of

the empirical graph, G. Biological node traits (such as sequence or annotations) are fixed

and the m edges are sampled uniformly without replacement.
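As a minimal sketch of the random graph ensemble described above, the following Python fragment draws m edges uniformly, without replacement, from all pairs of distinct nodes. The function name `sample_random_graph` and the five-node toy data are illustrative assumptions, and self-interactions are excluded here purely for simplicity. Enumerating every node pair is only practical for small examples.

```python
import itertools
import random


def sample_random_graph(nodes, m, rng=None):
    """Sample m distinct edges uniformly, without replacement, from all node pairs."""
    rng = rng or random.Random()
    # Every unordered pair of distinct nodes is a candidate edge
    # (assumption: self-loops are not allowed in this sketch).
    candidate_pairs = list(itertools.combinations(sorted(nodes), 2))
    return set(rng.sample(candidate_pairs, m))


# Toy example: five hypothetical proteins and five edges.
nodes = ["A", "B", "C", "D", "E"]
random_edges = sample_random_graph(nodes, 5, random.Random(0))
```

Node traits play no part in the sampling, so any trait statistic computed on the result reflects only the fixed order and size.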

(ii) Node shuffle A graph, sampled from this graph ensemble, retains all network traits

of the empirical graph, maintaining the adjacency matrix, A. The node traits are permuted

amongst all the nodes of the graph, G. Although the generated graph, H, has identical

structure to the empirical graph the node specific traits are randomly allocated amongst

the nodes, V .

This graph ensemble produces graphs that retain the precise topological features of the

network, whilst disassociating the biological traits, β_G(v), of each node, v, from its network

characteristics. This enables assessment of whether the structure of the graph and

the node labels are related.
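Node shuffle can be sketched as a permutation of trait labels over a fixed edge set. The mapping `traits`, the labels "b" and "g", and the function name are hypothetical and chosen only for illustration.

```python
import random


def node_shuffle(traits, rng=None):
    """Return a new node -> trait mapping with the trait labels permuted among nodes."""
    rng = rng or random.Random()
    nodes = sorted(traits)
    labels = [traits[v] for v in nodes]
    rng.shuffle(labels)  # the edge set is never touched, so topology is preserved
    return dict(zip(nodes, labels))


# Hypothetical labels: "b" (blue) and "g" (green).
traits = {"A": "b", "B": "g", "C": "b", "D": "b", "E": "g"}
shuffled_traits = node_shuffle(traits, random.Random(1))
```

Because only the labels move, the frequency of each label is unchanged and every network trait of the empirical graph is retained exactly.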

(iii) Network shuffle This graph ensemble generates graphs that preserve network traits,

using the rewiring algorithm presented by Bender and Canfield (1978). The degree of

each node, d_G(v), and the biological trait of each node, β_G(v), are fixed. Edges are

randomly distributed under these constraints. The number of legal moves may be small

under certain conditions, primarily as the proportion of possible edges increases. In the

case of PIN data this is not a concern in general as the graphs are sparse.

Figure 3.1: Node shuffle. This figure shows the process used for node shuffle. The labels for each node are permuted (e.g. node colour) such that the topology of the graph is fixed. [Panels: (a) randomly assign labels to proteins; (b) perform analysis on simulated network.]

Network shuffle rewiring produces graphs with the same degree sequence and such that

each node has identical degree and biological characteristics. The neighbours of each

node are altered whilst the degree of each node, d_G(v_i), is fixed:

$$H \sim (V, E') \ \text{where for each}\ v_i \in V,\ d_H(v_i) = d_G(v_i). \qquad (3.1)$$
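The constraint in Equation 3.1 can be illustrated with a double-edge swap, a common way to randomise a graph while keeping every node degree fixed. This is a swapped-in, generic technique and may differ in detail from the procedure cited in the text; it also makes no claim to sample uniformly. All names and the six-node ring are illustrative assumptions.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node that appears in the edge set."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def network_shuffle(edges, n_swaps, rng=None, max_tries=100_000):
    """Degree-preserving randomisation by repeated double-edge swaps."""
    rng = rng or random.Random()
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    edge_set = set(edge_list)
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create a self-loop or a duplicate edge.
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_set


# Toy example: a six-node ring.
ring = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F"), ("A", "F")}
rewired = network_shuffle(ring, n_swaps=10, rng=random.Random(2))
```

Each accepted swap removes one edge at each of four endpoints and adds one back, so d_H(v_i) = d_G(v_i) holds after every step.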

Figure 3.2: Network shuffle. The degree of each node, [A,B,C,D,E], is fixed along with the node characteristic, colour, whilst the edges are randomly rewired. [Panels: (a) randomly reassign edges; (b) retain degree of each protein. Graphs maintain the degree distribution and other characteristics of each node.]

Biological ensembles

The following graph ensembles fix both network and biological traits of the empirical

data: (iv) Bipartite shuffle; (v) Biological node shuffle; and (vi) Biological network shuffle.

(iv) Bipartite shuffle This uses an edge characteristic, Φ = {φ_1, ..., φ_m}, to rewire

the empirical graph. Each edge, e_i, is rewired such that it retains the same characteristic,

φ_i. Figure 3.3 shows an example rewiring when the trait, φ(.), is the colour of the

connected nodes. Edges are rewired randomly whilst maintaining the connections of particular

colours. The number of edges that connect each of the possible characteristics – in

this case (blue, blue), (green, green), or (blue, green) – are fixed.

Figure 3.3: Bipartite shuffle. Each edge is rewired uniformly and at random to one of the set of node pairs that share the same edge characteristic, φ(e_i). [Panel: (a) permute edges to fix label frequencies. In the example, φ(e_1) = φ(e_2) = φ(e_5) = (g, b) and φ(e_3) = φ(e_4) = (b, b).]

The bipartite shuffle ensemble ignores the network traits of the empirical graph, instead replicating the types of biological trait between connected nodes. It can be viewed as performing a set of random graph rewirings, as for graph ensemble (i). Each rewiring, however, is over a subgraph of nodes that have particular characteristics. These are the subsets of nodes with either the same node characteristic or with two specific characteristics. All of the edges, e_i, retain the characteristic, φ_i, fixing the graph trait statistic, ∆(G, Φ), for that characteristic. This can be extended to fix multiple trait statistics, although this increases the complexity of the task whilst also limiting the size of the set of graphs that can be sampled.
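The view of bipartite shuffle as a set of independent random graph rewirings, one per edge characteristic, can be sketched as follows. The sketch assumes a single label per node and takes the edge characteristic to be the unordered pair of node labels; the labels, edges and function names are hypothetical.

```python
import itertools
import random
from collections import Counter, defaultdict


def edge_class(u, v, label):
    """Unordered pair of node labels, e.g. ('b', 'g')."""
    return tuple(sorted((label[u], label[v])))


def bipartite_shuffle(edges, label, rng=None):
    """Resample edges within each edge class, keeping the class counts fixed."""
    rng = rng or random.Random()
    # Group every candidate node pair by its edge characteristic.
    pairs_by_class = defaultdict(list)
    for u, v in itertools.combinations(sorted(label), 2):
        pairs_by_class[edge_class(u, v, label)].append((u, v))
    # Count how many empirical edges carry each characteristic.
    class_counts = Counter(edge_class(u, v, label) for u, v in edges)
    new_edges = set()
    for cls, k in sorted(class_counts.items()):
        # Uniform sample, without replacement, within this class of node pairs.
        new_edges.update(rng.sample(pairs_by_class[cls], k))
    return new_edges


# Hypothetical five-node example with labels "b" (blue) and "g" (green).
label = {"A": "b", "B": "g", "C": "b", "D": "b", "E": "g"}
edges = {("A", "B"), ("B", "C"), ("D", "E"), ("A", "C"), ("C", "D")}
shuffled = bipartite_shuffle(edges, label, random.Random(3))
```

Node degrees are not constrained here, which matches the statement that this ensemble ignores the network traits of the empirical graph.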

This ensemble technique is not used in the rewiring component of this chapter. However,

it is used to define biological network shuffle and the same biological rewiring constraint

defines possible perturbations applied to empirical data in Section 3.2.3.

(v) Biological node shuffle This graph ensemble produces a subset of the graphs that

can be sampled from node shuffle, retaining the topology of the observed graph, G. Biological

node shuffle permutes the nodes such that each node, v_i, is switched with one, v_j,

sharing a particular characteristic, β(v_i) = β(v_j).

Figure 3.4: Biological node shuffle. This permutes each node to another node, v_i → v_j, such that β(v_i) = β(v_j) for the particular characteristic, β, under consideration. [Panel: (a) permute nodes to maintain test characteristic. In the example, β(A) = β(C) = β(D) = b and β(B) = β(E) = g.]
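A minimal sketch of biological node shuffle, assuming one label per node: nodes are permuted only within groups that share the constrained characteristic, and the edge set is relabelled accordingly. The function name and toy data are illustrative assumptions.

```python
import random
from collections import defaultdict


def biological_node_shuffle(edges, label, rng=None):
    """Permute nodes within label classes and relabel the edges accordingly."""
    rng = rng or random.Random()
    groups = defaultdict(list)
    for v in sorted(label):
        groups[label[v]].append(v)
    mapping = {}
    for members in groups.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        # Each node is sent to a node carrying the same label.
        mapping.update(zip(members, shuffled))
    new_edges = {tuple(sorted((mapping[u], mapping[v]))) for u, v in edges}
    return new_edges, mapping


# Hypothetical labels following the colours used in the figures.
label = {"A": "b", "B": "g", "C": "b", "D": "b", "E": "g"}
edges = {("A", "B"), ("B", "C"), ("D", "E"), ("A", "C"), ("C", "D")}
permuted_edges, mapping = biological_node_shuffle(edges, label, random.Random(4))
```

The mapping is a bijection on the node set, so the relabelled graph has the same topology as the original, and the proportion of edges with matching labels for the constrained characteristic is unchanged.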

(vi) Biological network shuffle This graph ensemble is based on the algorithm used to

produce the network shuffle graph ensemble. An edge, e_h, has a characteristic, φ(e_h),

determined by characteristics of the nodes it connects, φ(e_h) = φ(v_i, v_j). Each edge is

rewired to maintain the degree of each node, d_G(v), as in network shuffle, and retains the

characteristic of the rewired edge, e_h. So e_h → e′_h if φ(e_h) = φ(e′_h).

This rewiring algorithm is a form of bipartite shuffle graph rewiring in which the

bipartite graphs are randomly rewired according to the Bender and Canfield (1978)

rewiring that forms the basis for the network shuffle ensemble (an approach similar

to that taken by Thorne and Stumpf (2007)).
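One way to combine the two constraints is a double-edge swap that is accepted only when the swapped endpoints carry the same labels, so that both the degree of each node and the characteristic of each edge are retained. This is a hedged sketch using a generic swap scheme, not necessarily the implementation used for the results; names and data are hypothetical and a single label per node is assumed.

```python
import random
from collections import Counter


def edge_class(u, v, label):
    """Unordered pair of node labels for the edge (u, v)."""
    return tuple(sorted((label[u], label[v])))


def degrees(edges):
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def biological_network_shuffle(edges, label, n_swaps, rng=None, max_tries=100_000):
    """Double-edge swaps that keep node degrees and edge characteristics fixed."""
    rng = rng or random.Random()
    edge_list = sorted(tuple(sorted(e)) for e in edges)
    edge_set = set(edge_list)
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a1, a2), (b1, b2) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            b1, b2 = b2, b1  # try either orientation of the second edge
        # Biological constraint: matching labels at corresponding endpoints.
        if label[a1] != label[b1] or label[a2] != label[b2]:
            continue
        new_a, new_b = tuple(sorted((a1, b2))), tuple(sorted((b1, a2)))
        if a1 == b2 or b1 == a2 or new_a in edge_set or new_b in edge_set:
            continue  # would create a self-loop or duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new_a, new_b}
        edge_list[i], edge_list[j] = new_a, new_b
        done += 1
    return edge_set


label = {"A": "b", "B": "g", "C": "b", "D": "b", "E": "g", "F": "b"}
edges = {("A", "B"), ("C", "E"), ("D", "F"), ("A", "C"), ("D", "E")}
bio_rewired = biological_network_shuffle(edges, label, 5, random.Random(5))
```

The `max_tries` guard is an added safety assumption: under tight constraints few legal swaps may exist, and the loop then stops early rather than running indefinitely.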

Constraints used for the biological ensembles could involve any number of biological traits. However, only the fixing of one characteristic for each edge/node is assessed, along with the resulting effects on the trait ensemble averages.

Figure 3.5: Biological network shuffle. This retains the degree of each node, d_G(v), and also rewires each edge, e, to one of the available node pairs that share the same edge characteristic, φ(e). [Panels: (a) permute edges to fix label frequencies; (b) retain degree of each protein. In the example, φ(e_1) = φ(e_2) = φ(e_5) = (g, b) and φ(e_3) = φ(e_4) = (b, b).]

Four biological traits are used for each biological ensemble: complex membership [complex];

functional annotation [function]; cellular component annotation [component];

biological process annotation [process]. For each graph ensemble, 1,000 random graphs

are sampled for each empirical graph. The following 11 graph ensembles are compared

against the 3 empirical graphs:

• (1) Random graph [size and order fixed]

• (2) Node shuffle [nodes permuted according to node characteristic]

• Biological node shuffle: (3) [process]; (4) [component]; (5) [function]; (6) [complex]

• (7) Network shuffle [edges rewired according to edge characteristic]

• Biological network shuffle: (8) [process]; (9) [component]; (10) [function]; (11)

[complex]

Graphs are sampled from each ensemble uniformly across all graphs which satisfy the

relevant constraints. In order to sample a random graph, H, from the empirical graph, G,

the following is implemented:

1. INITIALISE NEW EMPTY (E_H = ∅) GRAPH, H ∼ (V_G, E_H), AND EMPTY SET, T

2. RANDOMLY PICK e ∈ E_G \ T

3. FIND SET S OF (v_1, v_2) ∉ E_H, v_1, v_2 ∈ V_G, WHICH FULFILL ENSEMBLE CONSTRAINTS

4. SAMPLE e′ ∈ S IF NON-EMPTY OR RETURN TO 2

5. ADD EDGE SO E_H = E_H ∪ {e′}, T = T ∪ {e}

6. RETURN TO 2 UNLESS T = E_G

The algorithm may require knowledge of the degree of each node in G and partially

formed H during the course of the implementation, as well as fixed node characteristics

related to the biology of each protein.
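The six steps can be sketched in Python as below. The ensemble constraint is passed in as a hypothetical predicate `allowed(old_edge, new_pair, current_edges)`; the example predicate reproduces the bipartite shuffle constraint under the assumption of one label per node. The `max_attempts` guard is an addition of this sketch, since the steps as written could cycle if no valid pair remains. Enumerating all node pairs at every step is naive and suitable only for small illustrations.

```python
import itertools
import random
from collections import Counter


def sample_ensemble_graph(nodes, edges, allowed, rng=None, max_attempts=10_000):
    """Build H edge by edge, replacing each empirical edge with an allowed pair."""
    rng = rng or random.Random()
    all_pairs = list(itertools.combinations(sorted(nodes), 2))
    empirical = {tuple(sorted(e)) for e in edges}
    new_edges, done = set(), set()                    # step 1: E_H and T
    attempts = 0
    while done != empirical:                          # step 6
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("no valid completion found; restart the sampler")
        e = rng.choice(sorted(empirical - done))      # step 2
        candidates = [p for p in all_pairs            # step 3
                      if p not in new_edges and allowed(e, p, new_edges)]
        if not candidates:
            continue                                  # step 4: return to step 2
        new_edges.add(rng.choice(candidates))         # steps 4 and 5
        done.add(e)
    return new_edges


# Hypothetical constraint: the new pair must share the old edge's label pair.
label = {"A": "b", "B": "g", "C": "b", "D": "b", "E": "g"}


def edge_class(pair):
    return tuple(sorted((label[pair[0]], label[pair[1]])))


def same_class(old_edge, new_pair, current_edges):
    return edge_class(old_edge) == edge_class(new_pair)


edges = {("A", "B"), ("B", "C"), ("D", "E"), ("A", "C"), ("C", "D")}
sampled = sample_ensemble_graph(label, edges, same_class, random.Random(6))
```

A degree constraint could be expressed through the same predicate by inspecting `current_edges`, which is why the partially formed edge set is passed to it.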

3.2.3 Perturbations

A graph is perturbed by rewiring a single edge or pair of nodes (forming a set of edge

rewirings dependent on which nodes are changed). The random graphs are used to see

how traits are affected by small changes to the empirical data, which may represent small

evolutionary changes or effects of noise. The distance of each perturbed graph from the

empirical graph, and the stability of its properties, are measured.

Definition 3.3 (Perturbed graph) A graph, G_{i+1} ∼ (V_{i+1}, E_{i+1}), is a perturbed graph

of G_i ∼ (V_i, E_i) if either: they differ by a single edge; or two nodes have been permuted.

Both share the same order, |V_{i+1}| = |V_i|, and size, |E_{i+1}| = |E_i|. The subscript, i, is the

number of perturbation steps taken from the empirical graph, G.

Studies that analysed the effect of using incomplete data (de Silva et al., 2006; Lee et al.,

2006) or subgraphs of true graphs (Stumpf and Wiuf, 2005; Stumpf et al., 2005b) have

shown that certain biases in the assessment of structural network traits are inevitable when

using a subset of the true data. Subgraphs may not have properties, such as the

degree distribution, similar to those of the complete true graph. It is important to know whether a graph

ensemble can be linked to empirical PINs as well as if similarly motivated perturbations

reproduce similar trait statistics.

The rate of change of traits, and a measure of how different the graphs are, are used to compare

perturbed and empirical graphs. Definition 3.3 introduces

a method for performing a single perturbation to a graph, or a step. The number of steps

between graphs may not be adequate as a comparison between graphs generated by different

perturbation methods, as steps may cancel out or lead to different rates of topological

change. The differences between perturbed graphs are summarised using two measures:

distance and instability between the graphs. These measures are defined for graphs that

share the same order and size.

Distance between graphs, G ∼ (V, E G ) and H ∼ (V, E H ), is defined as the number of

different edges found (given the same order and size). This can be easily found from the

(upper triangular) adjacency matrices for the two different graphs, forming a Hamming

distance (Hamming, 1950). The distance is always a multiple of two due to each graph

having the same size.

Definition 3.4 (Graph distance) Given two graphs, G ∼ (V, E_G) and H ∼ (V, E_H),

with adjacency matrices, A and B, let the distance between the graphs be defined as:

$$c(G, H) = \sum_{i<j} |a_{i,j} - b_{i,j}|.$$
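For undirected graphs on a shared node set, the sum over the upper triangle of the adjacency matrices in Definition 3.4 equals the number of node pairs that form an edge in exactly one of the two graphs. A minimal sketch, with hypothetical edge sets:

```python
def graph_distance(edges_g, edges_h):
    """Number of node pairs that are an edge in exactly one of the two graphs."""
    g = {frozenset(e) for e in edges_g}
    h = {frozenset(e) for e in edges_h}
    return len(g ^ h)  # symmetric difference = Hamming distance on edge indicators


# Moving one edge (C-D becomes D-E) changes two entries of the upper triangle.
g_edges = {("A", "B"), ("B", "C"), ("C", "D")}
h_edges = {("A", "B"), ("B", "C"), ("D", "E")}
moved_distance = graph_distance(g_edges, h_edges)
```

In this example one edge is lost and one is gained, giving a distance of two, in line with the observation that the distance between graphs of equal size is always even.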


of the trait across all node pairs, ∆(Ω, Φ):

$$s(G, H, \Phi) = \frac{\Delta(G, \Phi) - \Delta(H, \Phi)}{\Delta(G, \Phi) - \Delta(\Omega, \Phi)}.$$
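Read literally, the ratio is 0 when the trait of H equals that of G and 1 when it has fallen to the level of the complete graph, ∆(Ω, Φ). The sketch below evaluates it for the LC complex trait (0.51) and the complete-graph value (0.03) reported in Table 3.1; the perturbed value of 0.39 is hypothetical and used only to show the arithmetic.

```python
def instability(trait_g, trait_h, trait_complete):
    """s(G, H, Phi): change in the trait, scaled by its excess over the complete graph."""
    denominator = trait_g - trait_complete
    if denominator == 0:
        raise ValueError("empirical trait equals the complete-graph trait")
    return (trait_g - trait_h) / denominator


# (0.51 - 0.39) / (0.51 - 0.03) = 0.12 / 0.48, i.e. about 0.25
s_example = instability(trait_g=0.51, trait_h=0.39, trait_complete=0.03)
```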

Let graph G_n be a graph that is n perturbations from the empirical graph G. Now denote

the distance and instability for these perturbed graphs as: c_n = c(G, G_n) and s_n(Φ) =

s(G, G_n, Φ) respectively. The distance, c_n, instability, s_n, and traits of the graph, G_n,

generated by n perturbations to the empirical graph G are analysed. The three approaches

used to perturb graphs are:

• Biological edge: a random edge, e_a, is rewired to form a new edge, e_b ∉ E, such

that φ(e_a) = φ(e_b).

• Biological node: a random node, v_i, is permuted with another node, v_j, sharing a

given characteristic, β(v_i) = β(v_j).

• Biological shuffle: two edges e_a = (v_a1, v_a2) and e_b = (v_b1, v_b2), sharing node

characteristics β(v_a1) = β(v_b1) and β(v_a2) = β(v_b2), are deleted and replaced with

e′_a = (v_a1, v_b2) and e′_b = (v_b1, v_a2).

GO annotations ([process], [component], [function]) and a series of homology sets (H_α)

are used to constrain perturbations. The homology sets (H_α) are defined by scores from

BLAST (for proteins A and B: A ∈ H_α(B) if score(A, B) > α).
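The homology sets can be built directly from the stated rule, A ∈ H_α(B) if score(A, B) > α. In the sketch below the score table, protein names and thresholds are hypothetical and are not real BLAST output; scores are stored per ordered pair, so no symmetry is assumed.

```python
from collections import defaultdict


def homology_sets(scores, alpha):
    """Map each protein B to the set of proteins A with score(A, B) > alpha."""
    sets = defaultdict(set)
    for (a, b), score in scores.items():
        if score > alpha:
            sets[b].add(a)
    return dict(sets)


# Hypothetical scores, for illustration only.
scores = {("P1", "P2"): 85.0, ("P3", "P2"): 40.0, ("P2", "P1"): 85.0}
h50 = homology_sets(scores, alpha=50.0)
h30 = homology_sets(scores, alpha=30.0)
```

Raising α shrinks each set, which narrows the pool of nodes or edges that a constrained perturbation may exchange.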

3.3 Results

Rewired (Section 3.3.1) and perturbed (Section 3.3.2) graphs are presented in this section.

The biological trait statistics used are, for GO and complex annotations, the proportion of

edges found between nodes with matching annotations. For expression data, the trait is

the average level of co-expression, using Spearman’s rank correlation coefficient, for all

graph edges. The clustering coefficient for a graph is the average clustering coefficient

across the full node set, as introduced for a given node in Definition 1.24 on page 44.
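Two of these trait statistics can be sketched briefly. The sketch assumes that two proteins match when they share at least one annotation, and uses the standard local clustering coefficient (the fraction of a node's neighbour pairs that are themselves connected), with nodes of degree below two contributing zero; Definition 1.24 is not reproduced here, so that convention is an assumption. The co-expression trait is not covered. All data are hypothetical.

```python
import itertools
from collections import defaultdict


def matching_proportion(edges, annotations):
    """Proportion of edges whose endpoints share at least one annotation."""
    edges = list(edges)
    hits = sum(1 for u, v in edges if annotations[u] & annotations[v])
    return hits / len(edges)


def average_clustering(nodes, edges):
    """Mean local clustering coefficient across the full node set."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    total = 0.0
    for v in nodes:
        k = len(neighbours[v])
        if k < 2:
            continue  # assumption: such nodes contribute a coefficient of zero
        linked = sum(1 for a, b in itertools.combinations(neighbours[v], 2)
                     if b in neighbours[a])
        total += linked / (k * (k - 1) / 2)
    return total / len(nodes)


# Toy graph: a triangle A-B-C with a pendant node D attached to C.
nodes = ["A", "B", "C", "D"]
edges = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")}
annotations = {"A": {"x"}, "B": {"x"}, "C": {"x", "y"}, "D": {"z"}}
match_trait = matching_proportion(edges, annotations)
clustering_trait = average_clustering(nodes, edges)
```

In the toy graph three of the four edges join proteins with a shared annotation, and the local coefficients are 1, 1, 1/3 and 0, giving a mean of 7/12.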

The trait statistics for each empirical graph are compared to the ensemble trait averages

found for rewired and perturbed graphs. An ensemble trait average is the arithmetic mean

of the trait values found for (1,000) random samples from the considered graph ensemble.

For each rewiring ensemble assessed the available space of allowed graphs greatly

exceeded the number sampled and there were no computational problems sampling from

each set. The variability of the trait values for each ensemble method is also contrasted

and compared, whilst the instability of each trait is focused on when observing the effects

of perturbations on the empirical graph data.

The results presented are descriptive in nature. Network shuffle and node shuffle ensembles

are contrasted to the traits observed for random graphs and those found for empirical

data. The effects of constraining on biological characteristics as well as network characteristics

are also observed. The aim is to assess the biological properties of each of the

ensembles, and highlight anomalous findings, to inform the use of graph null models for

further network analyses and those presented in Chapter 4.

3.3.1 Rewiring

The trait statistics that graphs from each ensemble produce are presented in this section.

It was found that graph ensembles used here do not reproduce the observed traits of empirical

graphs, which are shown in Table 3.1.

For each set of analyses, the level of each trait is displayed as a proportion of that observed

for the empirical graphs. Trends across the empirical graphs are similar. LC, which is the

largest empirical graph, is used for illustration here whilst further results can be found in

Appendix C.

                Trait
Graph           Co-expression  Complex  Function  Process  Component  Clustering
CORE            0.11           0.59     0.31      0.41     0.50       0.21
DIP             0.07           0.51     0.17      0.23     0.38       0.09
LC              0.09           0.51     0.19      0.28     0.44       0.13
Complete (Ω)    0.01           0.03     0.03      0.03     0.18       1

Table 3.1: Empirical graph traits. Trait values are detailed for the three empirical graphs and

the complete graph with nodes for all proteins found in LC (i.e. the graph with all possible edges). The

traits for matching GO categories, complex annotations, average co-expression and clustering coefficients

are detailed. For each GO category and the complex annotations the trait statistic is the proportion of

edges found between nodes with matching annotations. The average level of co-expression of the nodes that

are connected forms the co-expression trait and the clustering coefficient trait is the average clustering

coefficient for each node.

Co-expression rates

Figure 3.6 shows the co-expression graph trait for each graph ensemble. Although there

are large differences between the ensembles, none of the ensembles produce traits close to

the co-expression trait of the empirical data. Biological ensembles that constrain complex

annotations, [complex], produce the closest trait values to those observed in the empirical

graphs. Biological network shuffle [complex] graphs have co-expression trait values of

80-90% of the trait value for LC or CORE, and less than 80% when compared to DIP.

Random graph and node shuffle ensembles, in contrast, produce graphs that have approximately

20% of the LC trait value.

Network shuffle ensembles produce graphs that have higher, and less variable, trait statistics

than the equivalent node shuffle ensembles. The node shuffle ensembles show higher

variance in trait values than the random graph ensemble, although when no further biological

constraints are applied the mean trait is approximately equal.

Complex annotations

The proportion of matching complex annotations shows similar trends to those seen for

the co-expression trait values. Figure 3.7 shows the proportion of edges found between

proteins that have been reported in the same protein complexes. Each boxplot shows the

average trait value produced for graph ensembles as a proportion of the value found for

LC (shown in Table 3.1). Figure 3.7 shows that the non-biological ensembles (random graph, network shuffle or node shuffle) all produce a similar number of matching complex annotations.

Figure 3.6: Co-expression trait for ensembles using LC graph. Average co-expression trait results are shown rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The lowest four (yellow) boxplots relate to biological network shuffle results, whilst the red relate to biological node shuffle. The fifth from bottom (green) boxplot is for the unconstrained network shuffle, the blue boxplot for node shuffle and the top (cyan) boxplot shows the results for the random graph ensemble. Co-expression found increases to the right on the graph, and network shuffle ensembles show the highest levels, although lower than that found in the empirical graph. [Boxplot rows, top to bottom: Random graph; Node shuffle; Node shuffle [process]; Node shuffle [component]; Node shuffle [function]; Node shuffle [complex]; Network shuffle; Network shuffle [process]; Network shuffle [component]; Network shuffle [function]; Network shuffle [complex]. Horizontal axis: average co-expression, 0.0 to 1.0.]

As for the other traits, network shuffle produces less variability in the complex trait values

than either random graph, or the most variable ensemble node shuffle. The trait does not

depend on whether the degree sequence is fixed, or whether the labels are fixed along with

the node degrees. The ensembles which only constrain network characteristics produce

an ensemble trait average almost identical to that shown by the random graph ensemble.

The random graph ensemble average trait produces only a tenth of the matching complex annotations that are seen in the empirical LC graph.

Figure 3.7: Complex trait for ensembles using LC graph. Matching complex annotations trait results are shown rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The bottom four (yellow) boxplots relate to biological network shuffle results, whilst the red relate to biological node shuffle. The fifth from bottom (green) boxplot is for the unconstrained network shuffle, the blue boxplot for node shuffle and the cyan shows the results for the random graph ensemble. In contrast to co-expression data, the variance of the trait is lower, shown by the smaller width of the boxplots. [Boxplot rows, top to bottom: Random graph; Node shuffle; Node shuffle [process]; Node shuffle [component]; Node shuffle [function]; Node shuffle [complex]; Network shuffle; Network shuffle [process]; Network shuffle [component]; Network shuffle [function]; Network shuffle [complex]. Horizontal axis: proportion matching complex annotations, 0.0 to 1.0.]

Biological network shuffle ensembles produce a higher proportion of matching complex

annotations, between just under 20% and 35%, than the biological node shuffle ensembles.

After [complex], the [process] constrained ensembles produce graphs with the highest

trait value, followed by [function] and lastly [component] for the GO categories. All

these ensembles produce graphs that exhibit less than half of the complex annotation

matches found in the LC data. This is the lowest proportion, in general, of matching annotations

retained of all biological traits examined in this section. Clearly, the [complex]


constrained ensembles produce the closest match to the LC graph for this trait. However,

owing to multiple annotations (which mean that matching annotated links can be broken

under rewiring as a unique annotation is chosen), both ensemble averages are lower than

the value found for the LC graph.

Gene Ontology

The ensemble averages produced for the matching GO category traits are shown in Figures

3.8-3.9. By construction, the trait values are closest to the empirical value for ensembles

that are constrained by the same biological characteristic (i.e. [function] produces

almost identical results for matching function annotations). Otherwise, however, the trait

statistics are consistently lower in each different ensemble (whether network or biological

characteristics are constrained) than those seen for LC. This is confirmed by the two

further empirical graphs (shown in Appendix D).

Although biological ensembles reproduce similar trait statistics for the characteristic constrained,

there are still slight differences to the empirical trait. These differences must be

a consequence of multiple annotations for each protein.

Topological ensembles, network shuffle and node shuffle with no biological constraints,

exhibit different results across the GO slim categories (although consistent with earlier

observed traits). For each trait, network shuffle ensemble graphs produce a higher proportion

of the matching annotations than node shuffle, which permutes the labels over a fixed

topological structure. The network shuffle ensemble produces higher values for the component

trait than the biological node shuffle [function] ensemble, shown in Figure 3.9(a).

Otherwise topological ensembles all produce lower traits than any of the biological ensembles

presented.
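As an illustration of the node shuffle comparison described above, the following minimal Python sketch permutes protein labels over a fixed topology and rescales a matching-annotation trait by its empirical value. The trait is assumed here to be the fraction of edges whose two proteins share at least one annotation; the function names and the toy data are hypothetical and are not taken from the analysis code used for this chapter.

```python
import random

def matching_trait(edges, annotations):
    """Fraction of edges whose two endpoints share at least one annotation."""
    if not edges:
        return 0.0
    hits = sum(1 for u, v in edges if annotations[u] & annotations[v])
    return hits / len(edges)

def node_shuffle(edges, rng):
    """Permute node labels uniformly at random; the topology is untouched."""
    nodes = sorted({n for e in edges for n in e})
    permuted = nodes[:]
    rng.shuffle(permuted)
    relabel = dict(zip(nodes, permuted))
    return [(relabel[u], relabel[v]) for u, v in edges]

# Toy data: hypothetical proteins A-E and annotation classes x, y, z.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
annotations = {"A": {"x"}, "B": {"x"}, "C": {"x", "y"}, "D": {"y"}, "E": {"z"}}

empirical = matching_trait(edges, annotations)
rng = random.Random(0)
samples = [matching_trait(node_shuffle(edges, rng), annotations)
           for _ in range(200)]
# Each sample expressed as a proportion of the empirical trait value,
# which is how the boxplots in this section are scaled.
rescaled = [s / empirical for s in samples]
```

Because only the labels move, every sampled graph has exactly the same degree sequence and local structure as the input; any change in the trait comes from which proteins sit on which nodes.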

Clustering coefficient

Each graph ensemble fixes different topological network characteristics, with all the node

shuffle methods fixing the complete topological structure of the observed graph. So inevitably,

Figure 3.9(b) shows that the clustering coefficient found for each node shuffle

ensemble (biological or topological) is the same as that found for LC.
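The invariance noted above can be checked directly. The sketch below assumes that the trait is the average of the local clustering coefficients (the fraction of a node's neighbour pairs that are themselves linked, with nodes of degree below two contributing zero); the helper names and the toy graph are hypothetical.

```python
import random
from itertools import combinations

def average_clustering(edges):
    """Mean over nodes of the fraction of neighbour pairs that are linked."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # degree 0 or 1 contributes a coefficient of 0
        linked = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        total += 2.0 * linked / (k * (k - 1))
    return total / len(adj)

def relabel(edges, rng):
    """Node shuffle: permute labels over the fixed topology."""
    nodes = sorted({n for e in edges for n in e})
    permuted = nodes[:]
    rng.shuffle(permuted)
    mapping = dict(zip(nodes, permuted))
    return [(mapping[u], mapping[v]) for u, v in edges]

# Toy graph: a triangle A-B-C with a pendant node D attached to C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
c_empirical = average_clustering(edges)
c_shuffled = average_clustering(relabel(edges, random.Random(1)))
```

The two values agree (up to floating point rounding) for any permutation, since relabelling cannot create or destroy triangles.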

[Figure 3.8 plot: two panels of boxplots, one boxplot per ensemble; rows, top to bottom: Random graph; Node shuffle; Node shuffle [process]; Node shuffle [component]; Node shuffle [function]; Node shuffle [complex]; Network shuffle; Network shuffle [process]; Network shuffle [component]; Network shuffle [function]; Network shuffle [complex]. Panel (a) Function, x-axis: Proportion matching function annotations (0.0 to 1.0). Panel (b) Process, x-axis: Proportion matching process annotations (0.0 to 1.0).]

Figure 3.8: Gene Ontology traits for graph ensembles. Ensemble trait averages for two Gene Ontology categories are shown rescaled in proportion

to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The bottom four (yellow) boxplots relate to biological network

shuffle results, whilst the red relate to biological node shuffle. The fifth from bottom (green) boxplot is for the unconstrained network shuffle, the second from top

(blue) boxplot for node shuffle and the top (cyan) boxplot shows the results for the random graph ensemble.

[Figure 3.9 plot: two panels of boxplots with the same rows as Figure 3.8. Panel (a) Component, x-axis: Proportion matching component annotations (0.0 to 1.0). Panel (b) Clustering, x-axis: Average clustering coefficient (0.0 to 1.0).]

Figure 3.9: Component and clustering traits for graph ensembles. Ensemble traits are shown for the graphs sampled for clustering and biological

component traits, rescaled in proportion to the trait value for the LC graph. The red line, always at 1, shows the empirical trait statistic. The bottom four (yellow)

boxplots relate to biological network shuffle results, whilst the red (or four of those on the red line in (b)) boxplots relate to biological node shuffle. The green

boxplot is for the unconstrained network shuffle, the second from top (blue) boxplot for node shuffle and the top (cyan) boxplot shows the results for the random graph

ensemble.

In general, for other ensembles such as random graph and network shuffle, the clustering

coefficient is less than a quarter of the value found in the equivalent empirical

graph. Biological network shuffle [complex] forms an exception to the low clustering

coefficients produced by these ensembles. The complex annotation constraint results in

an ensemble average clustering coefficient of over half of that found for LC. Ultimately,

however, only node shuffle reproduces the local structure of the empirical data: something

it does by design.

Differences between empirical graphs

Although the analyses show similar trends across all three empirical graphs, the average co-expression

trait results differ between DIP and the other two empirical graphs. For the

DIP graph there is a noticeable difference between the trait statistics found for node shuffle

and random graph ensembles. This is not found in the results for LC (seen earlier in Figure

3.6) or CORE. Figure 3.10 shows the trait values found for average co-expression for

each of the empirical graphs, as a proportion of the empirical observation. This shows the

co-expression trait results across the topological ensembles: node shuffle (Figure 3.10(a));

network shuffle (Figure 3.10(b)); and random graph (Figure 3.10(c)). Random graph and

network shuffle ensembles show the same properties, reproducing similar proportions of

the co-expression that is evident in the relevant empirical data. However, the DIP co-expression

proportion found for graphs generated from the node shuffle ensemble is

noticeably higher.

Overall, rewiring the empirical data generates graphs that have lower trait values, although

for each graph similar proportions of the empirical trait value are generated by each ensemble

method. Node shuffle ensembles generate graphs with similar biological traits to

those found in random graph samples (except for DIP co-expression trait shown in Figure

3.10(a)). The trait statistics are not greatly affected by fixing the exact structure of

the empirical graph, as shown through comparison with the random graph ensemble. However,

node shuffle trait values are more variable than those produced by graphs sampled from

random graph ensembles. The third topological method, network shuffle, retains a higher

proportion of matching annotations than equivalent node shuffle ensembles. The average

distance between the random graphs and empirical data was, as expected, influenced by

the constraints placed on the ensemble. Network shuffle ensemble graphs were on average

closest to the empirical data, followed by the equivalent node shuffle ensembles and then

the random graph ensemble.
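The network shuffle algorithm itself is defined earlier in this chapter and is not repeated here. As a rough illustration of how an edge rewiring can keep every protein's degree fixed while changing the wiring, the sketch below uses a standard degree-preserving double edge swap. This is an assumed stand-in rather than the exact procedure used for the results above, and the function name, the `max_tries` safeguard and the toy graph are hypothetical.

```python
import random

def network_shuffle(edges, n_swaps, rng, max_tries=10000):
    """Degree-preserving rewiring by repeated double edge swaps.

    Two edges (a, b) and (c, d) are replaced by (a, d) and (c, b),
    provided no self-loop or duplicate edge would be created.
    """
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: swap would create a self-loop or no change
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Toy graph: a ring of eight nodes with two chords.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 0),
         (0, 4), (2, 6)]
rewired = network_shuffle(edges, n_swaps=20, rng=random.Random(0))
```

Each accepted swap removes two edges and adds two, so node labels and degrees stay fixed while the neighbourhoods change; this contrasts with node shuffle, where the wiring is fixed and the labels move.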

[Figure 3.10 plot: three panels of density histograms, each overlaying the CORE, DIP and LC graphs. Panels: (a) Node shuffle; (b) Network shuffle; (c) Random graph. x-axis: Proportion of empirical co-expression (0.0 to 0.5); y-axis: Density (0 to 20).]

Figure 3.10: Co-expression trait for topological ensembles. Histograms show the density of samples from the topological ensembles for the co-expression

trait. The x-axis is the proportion of the empirical trait observed for each sampled graph – CORE, DIP or LC. DIP shows different behaviour for the node

shuffle ensemble than the other two empirical graphs. For each other ensemble method, the trait values show similar behaviour relative to the empirical trait

value.


3.3.2 Perturbations

The rewired graph ensembles present a means of generating random graphs based on

basic assumptions about the biological and network properties of PINs. Whilst these

ensembles did not generate graphs that shared biological traits with the empirical data,

they showed different properties dependent on the network characteristics maintained.

Empirical graphs are now perturbed, rather than rewired, to assess how constraints on

the graph’s evolution affect the same graph traits. The trait statistics are now measured

against the distance found between graphs, whilst in Section 3.3.1 the distance between

the graphs was not considered.

Perturbed graphs were generated using the CORE graph (which has 2,528 proteins and

5,728 interactions). Stability and closeness measures are used to compare each of the

simulated graphs. First, the use of GO category constraints on perturbations is described,

and second the analyses performed on homology sets ($H_\alpha$) for sets with scores

$\alpha \in \{10, 100, 500, 1000\}$. Results for the perturbations are the average statistic across the runs (for

each of 0–10,000 steps from the empirical graph). Each constraint has been used for the

three different perturbation approaches found in Section 3.2.3: (a) biological edge; (b)

biological node; and (c) biological shuffle.

This section concentrates on the relationship observed between graph distance and instability

(defined in Section 3.2.3). The graph distance (introduced in Definition 3.4) is a

Hamming distance between each of the perturbed graphs, which is always a multiple of

two, and the CORE graph (and is between 0 and 11,456 as CORE has 5,728 edges).

Instability (introduced in Definition 3.5) is a measure of how the trait values compare to

the trait values found for CORE and the complete graph (on the same node set as CORE).

For the traits considered the CORE graph trait value is higher than that found on average

in the complete graph. In general, a negative instability means that a trait value is larger

than the value found for CORE and an ER random graph has an expected trait instability

of 1 (as an ER graph is a random sample of all possible edges).
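The two measures can be sketched as follows. Definitions 3.4 and 3.5 are not reproduced in this section, so the code relies on assumptions: the distance counts the node pairs that are an edge in exactly one of the two graphs, and the instability is taken to be the linear rescaling (trait of CORE minus trait of the perturbed graph) divided by (trait of CORE minus trait of the complete graph). This form is chosen only because it has the properties stated above (zero for CORE itself, one at the complete graph value, negative when the trait exceeds that of CORE); the function names and toy values are hypothetical.

```python
def graph_distance(edges_a, edges_b):
    """Hamming distance between two undirected graphs on the same node set:
    the number of node pairs that are an edge in exactly one graph."""
    a = {frozenset(e) for e in edges_a}
    b = {frozenset(e) for e in edges_b}
    return len(a ^ b)

def instability(trait_value, trait_empirical, trait_complete):
    """Assumed linear form: 0 at the empirical value, 1 at the complete
    graph value, negative if the trait is larger than the empirical value."""
    return (trait_empirical - trait_value) / (trait_empirical - trait_complete)

# Toy example: one rewiring step replaces edges (3, 4) and (5, 6)
# by (3, 6) and (5, 4), so two edges are lost and two are gained.
core = [(1, 2), (3, 4), (5, 6)]
perturbed = [(1, 2), (3, 6), (5, 4)]
d = graph_distance(core, perturbed)

# Hypothetical trait values: empirical 0.8, complete graph 0.2.
s_same = instability(0.8, 0.8, 0.2)
s_complete = instability(0.2, 0.8, 0.2)
s_above = instability(0.9, 0.8, 0.2)
```

When the number of edges is preserved, every lost edge is matched by a gained edge, which is why the distance is always even and cannot exceed twice the number of edges (2 × 5,728 = 11,456 for CORE).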

Gene Ontology constrained graphs

Figure 3.11 shows the distance and instability for the perturbations when GO annotations

are used to constrain each step. Aside from the fixed trait, the non-constrained traits move


towards the value found for the complete graph, although the biological shuffle method

produces the smallest instability and distance of the three methods.

Figure 3.11(a) shows the instability and distance results for biological node [function].

These perturbed graphs are the furthest away from the empirical graphs for a fixed number

of perturbations. Figures 3.11(b)-3.11(c) show the general behaviour of the other two

perturbation methods. Biological edge graphs are closer than biological node graphs to

the empirical data, and biological shuffle graphs are closer still, for fixed steps. This illustrates

the extra constraints that are placed on the adjacency matrices by these algorithms.

Homology null sets

The genetic similarity of proteins should be linked to their interaction partners. If the

perturbations made to an empirical interaction graph are constrained according to genetic

similarity it is expected that the biological traits would be retained more readily than

by chance. To test this hypothesis, four sets of homologous sequences were generated

for each protein, using similarity scores of at least {10, 100, 500, 1000} to define which

protein pairs are considered homologous.

These four sets are of various sizes, forming

subsets of the proteome, V, for each protein. Clearly, the set for each protein decreases in

size as the score increases ($H_\xi \subseteq H_\beta \;\forall\, \xi > \beta$). Table 3.2 shows the average number of

sequences found for each protein.

Score, α    10      100     500     1000

Proteins    65.95   1.92    0.20    0.04

Table 3.2: Size of homology sets. Average set sizes of $H_\alpha(v)$ for a protein v and score α. The

average set size that nodes can be permuted within for the simulated perturbations decreases as the score

α is increased.
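The nesting of the homology sets, and the resulting fall in average set size, can be illustrated with a short sketch. It assumes that pairwise similarity scores are available as a symmetric lookup and that a protein is excluded from its own set; the function names, protein identifiers and scores are hypothetical and are not the data summarised in Table 3.2.

```python
def homology_sets(scores, proteome, alpha):
    """H_alpha(v): proteins whose similarity score with v is at least alpha."""
    sets = {v: set() for v in proteome}
    for (u, v), score in scores.items():
        if score >= alpha:
            sets[u].add(v)
            sets[v].add(u)
    return sets

def mean_set_size(sets):
    """Average number of homologous sequences per protein."""
    return sum(len(s) for s in sets.values()) / len(sets)

# Hypothetical proteins and pairwise similarity scores.
proteome = ["p1", "p2", "p3", "p4"]
scores = {("p1", "p2"): 1200, ("p1", "p3"): 150,
          ("p2", "p4"): 40, ("p3", "p4"): 12}

thresholds = [10, 100, 500, 1000]
by_alpha = {a: homology_sets(scores, proteome, a) for a in thresholds}
sizes = [mean_set_size(by_alpha[a]) for a in thresholds]
```

Raising the threshold can only remove members from a set, so the sets at a higher score are always subsets of those at a lower score and the averages cannot increase.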

Two different null sets were constructed to assess the effect of constraining perturbations

on homology sets. First, random sets were chosen such that members of $H'_\alpha(v)$ are

picked (uniformly) at random so that $\forall\, v \in V : |H'_\alpha(v)| = |H_\alpha(v)|$. Second, structure sets

were generated such that $H'_\alpha(v) = H_\alpha(f(v))$, where $f : V \to V$ is a random permutation,

or bijection, of the original protein set. Structure retains the size and structure of

the homology sets whereas random just maintains the size of each homology set.
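The two null constructions can be sketched as below. This is a minimal illustration under stated assumptions: homology sets are held as a dictionary from each protein to a set of proteins, and the random null draws members from all other proteins (excluding the protein itself, which is a choice made here). The function names and the toy sets are hypothetical.

```python
import random

def random_null(homology, proteome, rng):
    """Random null: each protein keeps its set size, but the members are
    drawn uniformly at random from the other proteins."""
    null = {}
    for v in proteome:
        others = sorted(p for p in proteome if p != v)
        null[v] = set(rng.sample(others, len(homology[v])))
    return null

def structure_null(homology, proteome, rng):
    """Structure null: H'(v) = H(f(v)) for a random bijection f on V, so
    the collection of sets is kept but reassigned to other proteins."""
    nodes = sorted(proteome)
    image = nodes[:]
    rng.shuffle(image)
    f = dict(zip(nodes, image))
    return {v: set(homology[f[v]]) for v in nodes}

# Hypothetical homology sets for six proteins.
proteome = ["p1", "p2", "p3", "p4", "p5", "p6"]
homology = {"p1": {"p2", "p3"}, "p2": {"p1"}, "p3": {"p1"},
            "p4": set(), "p5": {"p6"}, "p6": {"p5"}}

rng = random.Random(0)
rand_sets = random_null(homology, proteome, rng)
struct_sets = structure_null(homology, proteome, rng)
```

The random null preserves the size of the set attached to each protein, whereas the structure null preserves the sets themselves and only changes which protein each set is attached to.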

[Figure 3.11 plot: three panels, each with one curve per trait (Function, Component, Process, Co-expression, Complex). Panels: (a) Biological node [function]; (b) Biological edge [component]; (c) Biological shuffle [process]. x-axis: Stability, s(.) (−0.5 to 0.5); y-axis: Distance, c(.).]

Figure 3.11: Stability and distance for GO perturbations. Distance and instability measurements are shown for a selection of biological traits

(GO annotations, complex annotations and average co-expression) when CORE is perturbed by the three perturbation methods. The y-axis shows the distance,

c (.), from the empirical graph. The x-axis shows the instability, s (.), for a particular biological trait statistic. All of the trait values converge towards that for

the complete graph as the empirical graph is perturbed. All traits show similar instability and distance relationships except for the trait that is being explicitly

constrained in the algorithm.

[Figure 3.12 plot: two panels, each with one curve for Homologues, Structure [NULL] and Random [NULL]. Panels: (a) Biological edge $H_{100}(v)$, complex trait; (b) Biological edge $H_{500}(v)$, complex trait. x-axis: Stability, s(.) (−0.4 to 0.4); y-axis: Distance, c(.).]

Figure 3.12: Null homology perturbations for complex annotations. The trait results for perturbation simulations of null sets, $H'_\alpha(v)$, are shown

in comparison to the sets, $H_\alpha$, determined by sequence homology. Figures 3.12(a)-3.12(b) show the complex annotation trait instability, s (.), and distance, c (.),

using the biological edge perturbation algorithm. Structure graphs reach comparable distances to the homology sets, although their instability is several times

larger, showing that instability is lower if edges are rewired according to genetic sequence similarity. The instability, along with the possible

distance, reduces as the score is increased.

[Figure 3.13 plot: two panels, each with one curve for Homologues, Structure [NULL] and Random [NULL]. Panels: (a) Biological node $H_{500}(v)$, process trait; (b) Biological node $H_{1000}(v)$, process trait. x-axis: Stability, s(.) (−0.4 to 0.4); y-axis: Distance, c(.).]

Figure 3.13: Null homology perturbations for process annotations. The trait results for perturbation simulations of null sets, $H'_\alpha(v)$, are shown

in comparison to the sets, $H_\alpha$, determined by sequence homology. Figures 3.13(a)-3.13(b) show the process annotation trait instability, s (.), and distance, c (.),

using the biological node perturbation algorithm. The instability, along with the possible distance, reduces as the score is increased. If the score is 1,000 then the

homology set perturbed graphs have a maximum distance of 100, so only 2% of the edges differ.

Figures 3.12-3.13 show the results, for complex annotation and GO process traits, of these

null set simulations. Distance is plotted against instability for the graphs, $G_n$, generated

by the perturbation simulations. These plots show a noticeable difference in the behaviour

of the true homologue sets in comparison to the null sets. Maintaining the structure of the

homologue sets leads to graphs that are approximately the same distance apart as those

from the true homology sets, although the traits under the true homology sets are less volatile (exhibiting a significantly

lower instability, which means they do not change as much).

As the score (α) increases both the instability and distance get smaller for a given number

of perturbations. Figures 3.12(b) and 3.13(a) show that although the null models for

biological node and biological edge reach similar distances apart across the simulations,

the homologue sets show large differences in the average distances reached. As found

for the ensemble methodologies, rewiring edges as opposed to nodes results in a lower

variability in the distance from the empirical data.

The structure and random null model comparisons show that rewiring according to homologous

sequence information generates graphs with more stable traits than expected

by chance for these particular null models. The homologous sequences, therefore, show

a higher level of annotation similarity (or level of co-expression) than found for random

protein pairs.

Similarity constrains instability and distance

Figure 3.14 shows that as the similarity score, α, increases the sampled perturbed graphs

exhibit smaller distance and instability values. This is found for the null models above as

well, although there are differences between the perturbation methodologies. The biological

node perturbed graphs have greater instability and distance statistics than comparable

graphs (measured by number of steps) found using biological edge perturbations.

Figure 3.14 shows the distance and instability found for each of the three perturbation

algorithms, for all considered scores. For each method, the instability reduces as the

score increases, and when comparable graphs the same distance from the empirical data

exist, the instability is lower as the homology score is increased. For a fixed distance,

instability is comparable across all three methods. The differences in the distance reached

by the methods – node, edge, or shuffle – are a consequence of the different topological

constraints placed on the graph, its nodes and edges.

[Figure 3.14 plot: three panels, each with one curve per similarity score (10, 100, 500, 1000). Panels: (a) Biological node; (b) Biological edge; (c) Biological shuffle. x-axis: Stability, s(.) (−0.5 to 0.5); y-axis: Distance, c(.).]

Figure 3.14: Similarity score by perturbation method. Distance and instability values are displayed as the similarity score, α, increases in the three

figures. The shown plots are for the GO cellular component annotation trait.


3.4 Discussion

A series of simulations have been performed on empirical PINs in this chapter, to assess

whether random graphs replicate the biological traits found in empirical data. Random

graphs have been formed by making both small changes to the observed graph, and completely

rewiring the data by different algorithms. Whilst the graph ensembles produced

graphs that did not share trait statistics with empirical data, the analysis has shown that

certain characteristics can be more closely reproduced through a variety of topological, or

network dependent, means.

Rewiring has allowed the similarity of graph ensembles to the PINs to be observed as well

as how the ensembles differ in relation to each other. The node shuffle ensemble results

suggest that the biological traits are not necessarily dependent on topological structure.

The variability of the trait values observed is higher for node shuffle graphs than other

tested ensembles, reflecting the effect of rewiring nodes in graphs that exhibit scaling

properties (see Section 2.5 on page 83). This shows the effect of the small number of

hub proteins and how they increase the variability of the measurements if only the degree

sequence is maintained.

Traits can be maintained more readily by constraining characteristics for the edges of

the random graphs. If biological characteristics of the edges are maintained, alongside

the degree of each particular node, then the trait statistics produced are closer to those

found in the empirical data than those found from solely topology dependent ensembles

– e.g. node shuffle or random graph ensembles. The biological network shuffle ensemble

shows that extra biological constraints can be used to produce graphs that maintain the

proportion of matching annotations.

Biological constraints can be used to increase the expected trait statistics for each random

graph generated. Functional annotations and complex annotations show the highest level

of correlation, whilst the complex annotations are the best constraint if the clustering coefficient

is important when generating random graphs. The complex annotations cover the

largest number of possible classes (one for each of the 547 observed protein complexes)

and have the smallest number of proteins per class. Therefore, these complex annotations are

located in some of the highly clustered and well connected neighbourhoods of the empirical

graph. This may be a consequence of how complex data have been used to focus

binary interaction testing, or reflect the true biology of the graph.


The three empirical graphs display similar trends across most analyses, although the node

shuffle ensemble graphs show surprising differences for the DIP data. The graph data are

all supposed to be representations of the same S. cerevisiae PIN. This suggests that there

are fundamental differences between the graphs which may be explained either by error or

different levels of coverage across the annotated proteins. However, even if, as observed,

different results are obtained for different realisations it does not necessarily hold that this

can be linked to the true interactome. The property noted may be a consequence of DIP

being the only data set used that has not been curated.

GO annotations can be used to fix certain biological traits. When a GO characteristic is

used under the perturbation methods, the other GO annotation measures rapidly approach

the complete graph trait value regardless of the algorithm used. Within the traits assessed,

functional annotations are most highly correlated with complex annotations.

Homology sets have been shown to be linked to the biological traits. The instability and

distance that perturbed graphs reached using these sets are significantly lower than seen for

equivalently sized null sets. These results also show that set structure, as well as the size of

the sets, affects the instability of the traits alongside the distance between the empirical

data and the resulting graphs.

Multiple characteristic annotations have meant that certain annotations are not retained

even though the rewiring algorithm has aimed to fix them. For each rewiring, a single

annotation has been chosen to determine how the node, or edge, is rewired. For the

GO categories this meant that biological node shuffle graphs exhibited a lower trait value

than the empirical data, whilst biological network shuffle produced graphs with almost

identical (in general even marginally higher) trait values. Each rewiring technique has a

different effect on the empirical graph. Biological network shuffle rewires each edge to

retain a characteristic, whilst biological node shuffle retains only a given characteristic for

the rewired node.

Extra constraints could be added that would undoubtedly increase the rewired, or perturbed,

graphs’ similarities to empirical graphs. However, the low graph distance seen

between the shuffle method graphs and empirical data shows that this ensemble may not

generate effective random graphs for comparison with the empirical data. For higher homology

scores, the shuffle perturbation method only alters a very small number of possible

edges (showing at most 2% difference), suggesting that the sampled graphs will retain the

empirical graph’s properties by design, irrespective of their significance.


The effect of topological structure, or biological structure, underlying the empirical data

should not be ignored when analysing complex graph structures. Graph ensembles offer

a means of generating different random graph structures for network analysis. However,

maintaining the topology or a node characteristic does not appear sufficient to generate

graphs that share the traits of the empirical PINs. The graph structure, which all the node

shuffle ensembles maintain, is not a sufficient property to reproduce any of the tested non-network

traits. Indeed, for LC and CORE empirical data, the node shuffle ensemble trait

averages are similar to those found in the random graph ensemble, perhaps showing that

graph structure does not influence the similarity of GO or complex annotations.

This chapter highlights the importance of using appropriate null models when testing

hypotheses on large scale graphs. This may change the outcome of hypothesis tests.

Indeed, the trait under consideration should be tested against a variety of different graph

ensemble probability distributions in order to effectively disassociate the possible effects

of topological, as well as other possibly biological, confounding factors from the analysis.

The different ensembles provide a means of clearly defining what is meant by ‘expected

by chance’ in the network context. Whereas the linkage of individual PPIs to particular

traits can be made by assessment against an ER random graph, this is not true if the trait

is believed to be linked to the network structure or other possibly biological covariates.

The ensembles enable a more subtle view of linkage between traits and PINs, allowing a

test of whether the traits found in the observed interactome are more similar than would be

expected in a random graph with clearly defined properties.
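One way to make ‘expected by chance’ concrete is to compare the observed trait with the trait values sampled from each ensemble in turn. The sketch below computes a one-sided empirical p-value per ensemble. The add-one correction and the one-sided form are choices made for this illustration, not a description of the tests used in this thesis, and all trait values shown are hypothetical.

```python
def empirical_p_value(observed, ensemble_traits):
    """One-sided empirical p-value: the share of sampled graphs whose trait
    is at least as large as the observed trait (with add-one correction)."""
    exceed = sum(1 for t in ensemble_traits if t >= observed)
    return (exceed + 1) / (len(ensemble_traits) + 1)

# Hypothetical observed trait and sampled trait values for three ensembles.
observed = 0.75
ensembles = {
    "random graph": [0.10, 0.12, 0.08, 0.11],
    "node shuffle": [0.09, 0.20, 0.05, 0.14],
    "network shuffle": [0.30, 0.80, 0.28, 0.33],
}

p_values = {name: empirical_p_value(observed, traits)
            for name, traits in ensembles.items()}
```

The same observed value can look extreme against one ensemble and unremarkable against another, which is the sense in which the choice of null model may change the outcome of a hypothesis test.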



Chapter 4

Phylogenetic topologies of interacting proteins

This chapter presents a study of the phylogenetic topologies of yeast proteins (Section 4.2),

analysing whether or not the topological properties of a protein’s phylogenetic tree are

more similar between interacting proteins than would be expected. Further analysis contrasts

the linkage of expression and topological characteristics between interacting proteins

or proteins found in the same complex (Section 4.3).


4.1 Introduction

The connection between the degree of a protein, and the ability of that protein to change,

or evolve, is of considerable interest. That a protein involved in a high number of interactions

can be evolutionarily constrained by those interactions has been suggested in the

past (Fraser et al., 2002). Several studies indicate a linkage between the evolutionary rate

of proteins and the number of PPIs in which they are involved (Pellegrini et al., 1999;

Goh and Cohen, 2002; Gertz et al., 2003; Pazos et al., 2005). Conversely, the extent to

which PPIs may influence the evolutionary properties of proteins has been estimated using

relative sequence conservation by Jordan et al. (2003), who suggest that evolutionary

rate shows a much stronger association with factors other than a protein’s degree.

In addition to the connection between the number of PPIs in which a protein partakes and

the evolution of that particular protein, the idea that two proteins can evolve in tandem

has been postulated. The proteins involved in a small number of E. coli PPIs have been

shown to have correlated evolutionary rates (Pazos et al., 2005). This finding has been

replicated for particular protein families (Jothi et al., 2005; Juan et al., 2008b). These

studies focus primarily on employing distance methods to demonstrate phylogenetic similarity

and do not directly compare the topological properties of explicitly reconstructed

protein phylogenetic trees. Instead they measure the similarity between branch lengths of

the phylogeny, assuming the same model of evolution across the complete tree. The topological

information, under an assumption of co-evolution, should be significantly linked

to the presence of PPIs.

Although the construction of accurate phylogenetic trees (for many species) is computationally

difficult, it is necessary to assess whether the topology of these phylogenies can

be used to predict PPIs, an end to which protein phylogenetic profiles (Pellegrini et al.,

1999), distance matrices (Pazos and Valencia, 2001; Sato et al., 2003; Pazos et al., 2005),

and other measures of co-evolution between proteins (Goh et al., 2000; Goh and Cohen,

2002; Gertz et al., 2003; Ramani and Marcotte, 2003) have already been put.

This chapter assesses whether the topological properties of a protein’s phylogenetic tree

and interactions are linked. The hypothesis that phylogenetic topologies of interacting

proteins are more similar than those of protein pairs connected in random graphs is tested using random

graph ensembles. A variety of different graph ensembles are used, along with a collection

of empirical PINs, to assess whether the topologies of PPIs, or the set of PPIs seen in the

full PIN, are more similar than those found in the different graph ensembles.


High levels of concordance between the individual protein phylogenetic trees are anticipated

as these should tend to follow the species tree. Whether or not characteristics

of phylogenetic trees, especially their topology, show concordance between interacting

proteins greater than would be expected in random graphs has not previously been explicitly

tested on a global level. The similarity of phylogeny topologies found in S. cerevisiae

PIN data is compared to the same trait for a collection of random graph ensembles which

were introduced in Chapter 3.

The difference between complex and binary interaction data is also assessed. To illustrate

potential differences in these sets, the hypothesis that co-expression rates are higher for

protein pairs within complexes than for those outside complexes is tested using expression

data alongside the topological characteristics of the proteins’ phylogenetic trees.

4.2 Methods

This section describes the analyses applied to explore the role of evolutionary constraints

on S. cerevisiae protein pairs and the S. cerevisiae PIN. These have also been applied to

complex annotation data, as introduced in Section 3.2.1 on page 89, and the topological

similarity has been compared to co-expression rates as a possible classifier of PPI or

multi-protein complex membership.

4.2.1 Data

PIN data are used for the empirical graphs (CORE, DIP and LC) as defined in Section 2.4

on page 79. Expression data, complex data and GO annotations used are taken from the

sources described in Section 3.2.1 on page 89. To generate phylogenetic trees for each S.

cerevisiae protein, a selection of 9 other yeast species, from Saccharomyces and Candida

genera, have been mined for orthologous proteins using BLAST in the same manner as

described in Agrafioti et al. (2005). The 10 species form a range of yeasts with common

ancestry to S. cerevisiae of between approximately 10 million years (S. paradoxus) and

over 300 million years (S. pombe), as shown in Figure 4.1. Protein coding sequences

for each proteome used have been translated from their genome sequences (Mewes et al.,

2006).


Figure 4.1: Phylogeny of study species. This shows the evolutionary relationship of the ten yeast

species used (Wolfe, 2006; Fitzpatrick et al., 2006). S. cerevisiae proteins resulting from gene duplication

events are thought to retain the same interactions as the original gene for millions of years rather than tens

or hundreds of millions of years (Wagner, 2001). The genera Saccharomyces and Candida feature in the ten

species.

BLAST queries were used to identify if orthologous proteins exist for each S. cerevisiae

protein in the other species, to enable the creation of each protein’s phylogenetic

tree. Multiple sequence alignments (MSA) (see Section 1.3.1) were performed using

CLUSTALW (Thompson et al., 2002) for each S. cerevisiae protein and the most similar

protein from every other species, as discovered using BLAST.

Phylogenetic tree topologies for each protein were inferred from the MSAs. Three different

algorithms were used to infer the phylogenetic trees: PARS and PROML from Phylip

3.6 (Felsenstein, 1995); and the Codonml routine from PAML (Yang, 2007). For each

phylogeny method, the analysis is restricted to those proteins where trees were inferred

unambiguously (as an algorithm may return multiple trees with equal confidence). A tree

for each S. cerevisiae protein is tested across a subset of the 10 related species, dependent

on the availability of orthologous protein sequences.

The species tree, shown in Figure 4.1, and the protein trees may not necessarily agree


(Tajima, 1983). As well as the protein trees being on a subset of the 10 study species

(dependent on the availability of homologues) the topology may also be different. The

species tree aims to depict the evolutionary history, whilst the protein trees represent

how a set of homologous proteins have evolved relative to each other through time. The

differences are particularly apparent when the divergence time between the species is

short (Pamilo and Nei, 1988), so for the yeast species used here there should be apparent

variability (which is required for meaningful results) between the trees produced.

4.2.2 Correlated divergence

Proteins that co-evolve have similar evolutionary paths (Pazos and Valencia, 2008) where

the mutational changes in each protein are triggered by changes in the co-evolving protein

– i.e. the changes are compensatory. One consequence of co-evolution between protein

pairs is a tendency to see similar rates of evolutionary change which are reflected through

the branch lengths exhibited on the protein phylogenetic trees (Juan et al., 2008a). These

branch lengths, whilst indicative of possible co-evolutionary behaviour, may also be indicative

of correlated evolutionary rates, which may also be non-compensatory, as has

been shown in S. cerevisiae (Hakes et al., 2007a). The correlation observed in the evolutionary

distances is a consequence of constraints on the evolutionary rate, rather than a

consequence of compensatory changes.

Whilst co-evolutionary behaviour between proteins will influence their rates of evolution,

it also should affect the topology of their respective phylogenetic trees. If the proteins do

interact, then each divergent split (reflected in the topology) will trigger changes in the

co-evolving protein.

If proteins, labelled A and B, co-evolve then any evolutionary change in protein A will

trigger compensatory changes in the second protein B – and vice versa. If A diverges

forming proteins A ′ and A ′′ , then B will be triggered into diverging into B ′ and B ′′

(although it may be true that these new proteins are identical). Accordingly, the topology

of the protein trees should reflect co-evolutionary pressures that may be the result of the

proteins interacting across the study species. Phylogeny analysis alone cannot discover

protein pairs where this is true, as no genetic correlation is seen. Phylogenetic trees can

be used, however, to observe similar rates of evolutionary change or if divergence events

occur in similar patterns when comparing different proteins.


In order to provide an alternative view of phylogenetic similarity the topologies of the

trees are compared. Measurement of topological similarity aims to discover potentially

co-evolutionary relationships between proteins A and B where both proteins diverge (becoming

A ′ ≠ A ′′ and B ′ ≠ B ′′ ) and share phylogenetic tree topology. When the topologies

match, protein pairs are defined as co-diverging across the study species (or the subset

where homologous proteins exist).

Definition 4.1 (Co-divergence) Co-divergent proteins, over a set of species, are those

that share the same protein phylogenetic tree topology.

The topologies of protein phylogenetic trees are the same if the proteins exhibit the same

pattern of divergences, although topology alone cannot distinguish between compensatory

and non-compensatory divergence. Protein trees will differ from the consensus species

tree (found for the complete genome rather than a protein sequence), and these changes

are assessed for linkage between reported PPIs and the phylogenetic topology similarity

– as an assessment of whether PPIs exhibit evidence of co-evolution across yeast species.

4.2.3 Measuring topological differences

In order to measure similarity, an edit distance, η, between phylogenetic topologies on

a set of n species is defined. This distance is based on a nearest-neighbour interchange

method (Felsenstein, 2003).

A phylogenetic tree topology, e.g. ((1, 2), (5, (3, 4))), contains a set of species, {1,2,3,4,5},

and divergence events or internal nodes, represented by brackets. Topologies are neighbours

if they can be made identical when a single species is moved across a node. For

the string tree notation, across a node means either: (i) swapping a species with the first

bracket either side in the string (deleting unnecessary brackets, e.g. ((1), 2) = (1, 2));

or (ii) creating a bracket around two species in the same set of brackets – e.g. (1, 2, 3) is

a neighbour of ((1, 2), 3). For ((1, 2), (5, (3, 4))) the neighbours are: (1, 2, (5, (3, 4))),

((1, 2), (5, 3, 4)), and (5, (1, 2), (3, 4)). Neighbours are found from the set of multifurcating

trees as defined in Section 1.3.2 on page 28. Figure 4.2 shows a minimal sequence of

neighbouring phylogenetic trees to travel from topology ((1, 3), (2, 4, 5)), for protein A,

to ((1, 2), (5, (3, 4))), for protein B. The distance, η A,B , is the minimum number of tree

topology changes required to generate matching trees.
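The neighbour rule above can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the implementation used for this work: trees are taken to be rooted, unordered nestings of integer species labels written as nested tuples; `canon` and `neighbours` are hypothetical helper names; and only the forward moves described in the text are generated (a species moved across one bracket, or a bracket created around two species in the same set of brackets).

```python
from itertools import combinations


def canon(tree):
    """Nested tuples -> nested frozensets, dropping brackets around a single item."""
    if isinstance(tree, int):
        return tree
    children = frozenset(canon(child) for child in tree)
    return next(iter(children)) if len(children) == 1 else children


def is_leaf(node):
    return isinstance(node, int)


def neighbours(tree):
    """Topologies reachable in one move (assumed reading of the neighbour rule)."""
    tree = canon(tree)
    if is_leaf(tree):
        return set()
    found = set()
    leaves = [c for c in tree if is_leaf(c)]
    groups = [c for c in tree if not is_leaf(c)]
    # (ii) bracket two species that sit in the same set of brackets
    if len(tree) >= 3:
        for a, b in combinations(leaves, 2):
            found.add(canon((tree - {a, b}) | {frozenset({a, b})}))
    # (i) move a species inwards, into a sibling group
    for x in leaves:
        for g in groups:
            found.add(canon((tree - {x, g}) | {g | {x}}))
    # (i) move a species outwards, from a child group into this group
    for g in groups:
        for x in (c for c in g if is_leaf(c)):
            found.add(canon((tree - {g}) | {canon(g - {x}), x}))
    # apply the same moves deeper in the tree
    for g in groups:
        for g2 in neighbours(g):
            found.add(canon((tree - {g}) | {g2}))
    found.discard(tree)
    return found
```

Under these assumptions the sketch recovers the three neighbours listed for ((1, 2), (5, (3, 4))), and a chain of five such moves links the two trees compared in Figure 4.2. It does not show that five is the minimum; computing η in general would need a shortest-path search over this neighbour relation.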


Figure 4.2: Topology edit distance. An example of the measure of similarity between phylogenetic

tree topologies: ((1,3),(2,4,5)) and ((1,2),(5,(3,4))). The score here is 5.

Each protein may have a different number of homologous proteins on which the phylogenetic

tree is based. The number of possible trees (see Section D.1 in Appendix D) is

dependent on the number of species included. As a consequence of this, the edit distance

is not directly comparable if the trees have a different number of species. The similarity

of topologies, Γ_{A,B} ∈ [0, 1], which takes account of the number of species, is:

\Gamma_{A,B} = 1 - \frac{\eta_{A,B}}{M_n}, \qquad (4.1)

where η_{A,B} is the score between two trees sharing the same n species and M_n is the

maximum possible score between two trees on n species.

The maximum edit distance between two phylogenetic trees on n species is found by the

recursion:

M_{n+1} = M_n + (n - 2), \qquad (4.2)

with M_3 = 2.
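As a worked check of Equations 4.1 and 4.2, the recursion can be unrolled directly. The sketch below is illustrative only; the function names `max_edit_distance` and `similarity` are hypothetical and do not come from the original analysis code.

```python
def max_edit_distance(n):
    """M_n from the recursion M_{n+1} = M_n + (n - 2), with M_3 = 2."""
    if n < 3:
        raise ValueError("at least 3 species are needed")
    m = 2  # M_3
    for k in range(3, n):
        m += k - 2  # step from M_k to M_{k+1}
    return m


def similarity(eta, n):
    """Gamma = 1 - eta / M_n for two trees on the same n species."""
    return 1 - eta / max_edit_distance(n)
```

For five species the recursion gives M_5 = 5, so the pair of trees in Figure 4.2, which has a score of 5, sits at the maximum distance and has similarity 0.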

4.2.4 Phylogenetic analyses

The similarity in the phylogenetic tree topologies of interacting proteins is assessed.

Given two trees, their topologies match if the phylogenetic trees, on the set of species

that appear in both topologies, are (non-trivially) identical. This requires that the two

trees share at least 3 different species. Along with the match characteristic, both the

score, η, and similarity, Γ, are used to assess the similarity of PPIs in the empirical graphs

(CORE, DIP and LC). The orthologue information for each protein is used to construct


the phylogenetic profiles for each protein pair compared.
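The match characteristic can be sketched as follows. This is a minimal illustration, assuming trees are rooted, unordered nestings of integer species labels written as nested tuples; `leaves`, `restrict` and `topologies_match` are hypothetical names, and returning `None` for pairs sharing fewer than 3 species is a choice made here to mark them as not comparable.

```python
def leaves(tree):
    """Set of species labels appearing in a nested-tuple tree."""
    if isinstance(tree, int):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def restrict(tree, keep):
    """Prune the tree to the species in `keep`, as canonical nested frozensets."""
    if isinstance(tree, int):
        return tree if tree in keep else None
    kept = [r for r in (restrict(child, keep) for child in tree) if r is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # a bracket around a single item is dropped
    return frozenset(kept)


def topologies_match(tree_a, tree_b, min_shared=3):
    """True/False if comparable on the shared species, None otherwise."""
    shared = leaves(tree_a) & leaves(tree_b)
    if len(shared) < min_shared:
        return None
    return restrict(tree_a, shared) == restrict(tree_b, shared)
```

For example, ((1, 2), (5, (3, 4))) and ((1, 2), (3, 4)) share species 1 to 4 and match once species 5 is pruned, whereas ((1, 2), (5, (3, 4))) and ((1, 3), (2, 4)) do not.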

Analyses are completed on the empirical graphs and sampled random graphs from different

graph ensembles (see Section 3.2 on page 88). These graph ensembles, making up 11

different probability distributions on graphs with fixed size and order, are:

• (1) Random graph [size and order fixed]

• (2) Node shuffle [nodes permuted according to node characteristic]

• Biological node shuffle: (3) [process]; (4) [component]; (5) [function]; (6) [complex]

• (7) Network shuffle [edges rewired according to edge characteristic]

• Biological network shuffle: (8) [process]; (9) [component]; (10) [function]; (11)

[complex]

This chapter focuses primarily on the differences between the three types of ensemble:

random graph; node shuffle; network shuffle. These ensembles probe different aspects

of a putative association between the PIN and phylogenetic properties of the constituent

proteins. As discussed at length in Chapter 3, node shuffle graph ensembles fix the graph

structure and the phylogenetic tree labels are permuted randomly amongst the nodes.

Network shuffle graph ensembles associate a tree phylogeny and fixed degree with each

node but randomise the interactions. These probe the relative similarity of interacting

phylogenies against the traits produced by various types of random graph.
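The contrast between the two unconstrained ensembles can be sketched in code. The version below is a generic illustration, not the sampling procedure defined in Chapter 3: `node_shuffle` permutes tree labels over a fixed graph, and `network_shuffle` uses the standard double-edge swap as an assumed way of rewiring edges while keeping every node's degree. It assumes a simple undirected graph with at least two edges; all names are hypothetical.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    return Counter(node for edge in edges for node in edge)


def node_shuffle(labels, rng):
    """Permute the tree labels amongst the nodes; the edges are untouched."""
    nodes = list(labels)
    values = [labels[n] for n in nodes]
    rng.shuffle(values)
    return dict(zip(nodes, values))


def network_shuffle(edges, n_swaps, rng):
    """Rewire by double-edge swaps: (a,b),(c,d) -> (a,d),(c,b)."""
    edge_list = [tuple(sorted(e)) for e in edges]
    present = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # reject swaps that would create self-loops or duplicate edges
        if a == d or c == b or new1 == new2 or new1 in present or new2 in present:
            continue
        present -= {edge_list[i], edge_list[j]}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
    return edge_list
```

In the node shuffle each node keeps its neighbours but may receive a different tree, so the link between degree and tree is broken; in the network shuffle each node keeps both its tree and its degree, and only its partners change.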

For the analyses, 1,000 graphs are sampled from each graph ensemble. Traits are compared

for these graphs with the empirical data. For each empirical graph, the three different

phylogenetic techniques (PROML, PARS and PAML) are also contrasted.
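A comparison of an empirical trait with the values from sampled graphs might look like the sketch below. It assumes that a central 95% sample range is obtained by discarding the lowest and highest 2.5% of the sampled values; that convention, and the names `sample_range` and `compare_to_ensemble`, are assumptions made for illustration.

```python
def sample_range(values, level=0.95):
    """Central sample range: drop the lowest and highest (1 - level) / 2."""
    ordered = sorted(values)
    k = round(len(ordered) * (1 - level) / 2)
    return ordered[k], ordered[len(ordered) - 1 - k]


def compare_to_ensemble(empirical, values, level=0.95):
    """Summarise sampled trait values and locate the empirical value."""
    low, high = sample_range(values, level)
    return {
        "mean": sum(values) / len(values),
        "low": low,
        "high": high,
        "inside": low <= empirical <= high,
    }
```

With 1,000 sampled graphs this keeps the central 950 trait values, and an empirical value outside that range would be read as differing from the ensemble.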



4.3. RESULTS Phylogenetic topologies

4.3 Results

The results from the phylogenetic analyses are now presented. First, the phylogenetic

profiles are assessed for each empirical graph and graph ensemble method. Second, the

topological similarity of interacting proteins is presented and followed by a comparison of

the three different phylogenetic tree construction algorithms. Finally, additional analyses

of data from E. coli and on the most closely related yeast species are presented.

The results presented in this section focus on the PROML phylogenetic trees, although

there is also a comparison of the three phylogeny techniques in Section 4.3.3. The number

of trees generated differs between the methods, as an algorithm may return no result or an ambiguous set of trees:

PROML – 4,380; PARS – 3,617; and PAML – 4,260. The average

number of species presented in each protein tree, across all different phylogeny methods,

is greater than 6.

4.3.1 Phylogenetic profiles

The orthologue data for each S. cerevisiae protein form a 9-bit phylogenetic profile where

each bit signifies the presence or absence of an identifiable (sequence) orthologue in a

given yeast species. Proteins for which no orthologue data are available have been excluded

from the analysis, as they have been for all subsequent topological analyses. On

average, each protein has more than five identifiable orthologous proteins across the 9

searched species (after those proteins which did not produce phylogenetic trees have been

discarded).

For each edge, the difference between the profiles of the two connected proteins is measured.

This reflects the number of species where only one of the proteins is conserved.

Figure 4.3 shows the phylogenetic profile differences found for the sampled ensembles

alongside a red line showing the average for LC. Across all the ensembles, the phylogenetic

profile difference is higher in general for the graphs sampled from each ensemble in

comparison to the value found for LC. Network shuffle ensembles are closer to the LC value than the node

shuffle ensembles, as found in Chapter 3 for other biological traits. Similarly, the node

shuffle ensembles produce higher variability than either the random graph ensembles or

the network shuffle ensembles. If the edge rewirings are constrained by complex annotations

(the [complex] ensembles) the phylogenetic profile differences are closest to those

found in the empirical graph.
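The profile difference measured for each edge is a count of mismatching bits between two presence/absence profiles. A minimal sketch, with hypothetical protein identifiers and function names, assuming each profile is a tuple of 9 bits in a fixed species order:

```python
def profile_difference(profile_a, profile_b):
    """Number of species where exactly one of the two proteins has an orthologue."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def mean_edge_difference(edges, profiles):
    """Average profile difference over the edges of a graph."""
    diffs = [profile_difference(profiles[u], profiles[v]) for u, v in edges]
    return sum(diffs) / len(diffs)


# Hypothetical 9-bit profiles (1 = orthologue found in that species).
profiles = {
    "A": (1, 1, 1, 0, 0, 0, 1, 1, 1),
    "B": (1, 1, 0, 0, 0, 1, 1, 1, 1),
    "C": (0, 0, 0, 1, 1, 1, 0, 0, 0),
}
```

Here A and B differ in two species, while A and C have orthologues in complementary sets of species and so take the maximum difference of 9.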

Figure 4.3: Phylogenetic profiles for each ensemble. Boxplots of the phylogenetic profile differences for edges in graphs for the 11 graph ensembles (random graph; node shuffle and network shuffle, each unconstrained and constrained by [process], [component], [function] and [complex]). Horizontal axis: average difference in phylogenetic profile (3.1 to 3.7). The red line shows the average phylogenetic profile difference found for edges found in the LC graph.

Figure 4.4 shows a selection of the graph ensembles – node shuffle, network shuffle,

random graph and biological node shuffle [complex] – against the true output for the

three empirical graphs: CORE, DIP and LC. The proportion of interacting proteins (after

rewiring or the red dot for empirical data) is shown for each possible phylogenetic profile difference

(0–9). The horizontal axis shows the differences found between phylogenetic profiles,

ranging from 0 (both proteins have orthologues in exactly the same species) to 9 (one of

the compared proteins has orthologues in only those species that the other does not).

The results show that empirical interactions exhibit a higher propensity for similar phylogenetic

profiles across all four shown ensembles. Biological node shuffle [complex]

ensemble graphs, shown in Figure 4.4(d), are closer to the empirical data than any of the

other ensembles. For all graph datasets the phylogenetic profiles with 3 or fewer differences

are found more often among the real interacting pairs than in tested random graph ensembles.

Figure 4.4: Phylogenetic profile differences. Panels: (a) Random graph; (b) Network shuffle; (c) Node shuffle; (d) Biological node shuffle [complex]. Each panel plots the proportion of edges (vertical axis, 0.00 to 0.25) against the profile difference (horizontal axis, 0 to 9) for CORE, DIP and LC, with the empirical values marked. The differences in phylogenetic profiles, for each edge, shown as a proportion of the comparisons made across the data for four different graph ensembles. The empirical data, shown as red dots for each boxplot, are generally higher for differences less than 4, showing that the observed PPIs are more likely to share phylogenetic profiles than those edges found in any of the random graph ensembles.

Figure 4.4 shows that there is little difference between the results found for each empirical

graph. Although the graphs are of different sizes, the PPIs in each of them show similar


phylogenetic profile differences in both the random ensembles and empirical data. An

exception to this is the DIP graph, where a higher proportion of edges are found between

proteins that have matching phylogenetic profiles. The proportion of matching phylogenetic

profiles is also higher in the node shuffle ensemble graphs sampled using DIP than

in either the empirical results for CORE or LC or the other graph ensembles using DIP.

4.3.2 Topological similarity

The phylogenetic topologies for each graph are measured in three ways across all edges:

the proportion of matching topologies; the topology score, η; and the similarity score,

Γ, found on average. The similarity found in the empirical data should be higher than

that found for random ensembles if there is any evidence for enriched co-evolutionary behaviour

between interacting proteins. This section describes analysis using the PROML

phylogenetic trees, although the trends between the ensemble methods are the same for

each of the tree construction methods (PAML, PROML and PARS), which are in Appendix

D.
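The three graph-level traits can be sketched for a list of per-edge comparisons. The code below is illustrative only: each edge is assumed to be summarised by its score η and the number n of species shared by the two trees, `graph_traits` is a hypothetical name, and the closed form used for M_n is obtained by unrolling Equation 4.2 from M_3 = 2.

```python
def max_edit_distance(n):
    """Closed form of M_{n+1} = M_n + (n - 2), M_3 = 2 (derived here by unrolling)."""
    return 2 + (n - 3) * (n - 2) // 2


def graph_traits(edge_scores):
    """edge_scores: list of (eta, n_shared) pairs, one per comparable edge."""
    total = len(edge_scores)
    matches = sum(1 for eta, _ in edge_scores if eta == 0)
    mean_eta = sum(eta for eta, _ in edge_scores) / total
    mean_gamma = sum(1 - eta / max_edit_distance(n) for eta, n in edge_scores) / total
    return {
        "match_proportion": matches / total,
        "mean_score": mean_eta,
        "mean_similarity": mean_gamma,
    }
```

Because the average score does not account for n, an edge with a score of 2 on five shared species (Γ = 0.6) and one with a score of 2 on four shared species (Γ = 1/3) contribute equally to it, which is why the similarity is reported alongside.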

Figure 4.5 shows the proportion of matching topologies for each of the graph ensembles

in comparison to the proportion found for the LC graph (shown as a red line). Once again,

the node shuffle ensemble shows higher variance of the trait than other graph ensembles.

Each of the biological node shuffle ensembles constrained by a GO category exhibits a

higher proportion of matching topologies than is seen in either LC or any of the network

shuffle ensembles or random graph ensemble. Indeed, the average level of topology

matching seen in all but the [complex] constrained ensembles is higher than found in the

LC graph.

Figure 4.6 shows the topological scores between proteins that have interactions for LC

and the graph ensembles. The topological score trait, which measures the average score

between phylogenetic topologies, is higher for the random graph and node shuffle ensembles

than for LC data. The reported interactions have more similar topological trees, if

measured by the average score, than those found in sampled graphs from either the random

graph or node shuffle ensembles. However, the average trait score for the network

shuffle ensembles is lower than seen for the LC graph, and lowest for the basic network

shuffle ensemble.

Figure 4.5: Topological matching for LC interaction graph. Boxplots for the distribution of matching topologies found for each of the 11 sampled graph ensembles (horizontal axis: average topology matches, 0.34 to 0.46). The red line shows the result for the LC graph. The [complex] constrained graphs are the only ensembles that present fewer matching topologies than found in the empirical data.

Score results, seen in Figure 4.6, may be influenced by the number of lineages compared in each tree comparison. Indeed, for each edge the average number of shared orthologues (the number of orthologues found in both proteins) is lower for each network shuffle ensemble. This will downwardly influence the average score as the topological score, η, does not take account of the number of lineages in each tree comparison. In contrast, higher scores will possibly be evident between proteins whose phylogenetic profiles are more similar.

Figure 4.7 shows the same results for the similarity measure, Γ, which takes account of

the number of lineages and the score when comparing phylogenetic topologies. Unlike

the results for the average score, all of the network shuffle ensembles are now not significantly different from the value observed empirically, although the average similarity is consistently marginally lower. Node shuffle ensembles have a lower similarity measure than the empirical data, although the sampled distribution of average similarities overlaps with the empirical result. The random graph ensemble has significantly lower similarity.

Figure 4.6: Mismatch score using LC interaction graph. Boxplots for the distribution of average scores between topologies found for the 11 graph ensembles (horizontal axis: average score, 1.6 to 2.4). The red line shows the trait statistic for LC. Node shuffle and random graph ensembles exhibit higher scores than found in the empirical data, whilst network shuffle ensembles under any of the tested constraints exhibit a lower average score than is found in the empirical graph.

Whilst the topologies of individual PPIs are more similar than expected for a random protein

pair, the average similarity across the empirical graph is not higher than is expected if

the degree of each protein is maintained and the edges reshuffled as is the case for network

shuffle ensembles.

Figure 4.7: Topological similarity for LC interaction graph. Boxplots for the distribution of average similarity between topologies found for the 11 graph ensembles (horizontal axis: average similarity, 0.81 to 0.86). The red line shows the trait statistic for LC. The empirical data have similar similarity values as those found for the network shuffle graphs, whilst the similarity is significantly higher for each of these than is found in graphs sampled from the random graph ensemble.

4.3.3 Phylogenetic methods

The level of similarity for the PROML tree construction method showed little difference

between the similarity found in graphs from the network shuffle ensembles and the real

empirical phylogenetic topology. However, the level of similarity found for the tree construction

methods shows more variability. Table 4.1 shows the similarity trait, Γ, for each

of the empirical graphs using trees constructed by each of PAML, PARS and PROML.

There are large differences between the trait statistics for each phylogenetic construction

algorithm, although the trends are similar when comparing the traits produced by

graph ensemble with those seen empirically.

                            Similarity, Γ
Tree construction   Graph   Real    Node shuffle            Network shuffle
PAML                CORE    0.74    0.747 [0.737,0.757]     0.742 [0.737,0.747]
                    DIP     0.75    0.760 [0.750,0.769]     0.743 [0.740,0.746]
                    LC      0.74    0.750 [0.741,0.760]     0.741 [0.739,0.743]
PROML               CORE    0.84    0.848 [0.839,0.857]     0.846 [0.843,0.849]
                    DIP     0.84    0.841 [0.834,0.849]     0.838 [0.837,0.840]
                    LC      0.84    0.831 [0.822,0.839]     0.837 [0.836,0.839]
PARS                CORE    0.90    0.901 [0.893,0.909]     0.899 [0.896,0.903]
                    DIP     0.90    0.893 [0.884,0.900]     0.893 [0.891,0.895]
                    LC      0.89    0.882 [0.874,0.890]     0.893 [0.891,0.895]

Table 4.1: Similarity for each phylogenetic tree construction algorithm. Average similarity, Γ, for each phylogenetic tree construction algorithm, for the empirical graphs. The results for node shuffle and network shuffle graph ensembles are given along with the 95% sample range for the similarity trait.

PARS topologies show a higher level of concordance across the PPIs in all cases. The two

maximum likelihood phylogenetic algorithms (PAML and PROML) produce lower levels

of similarity. Whilst each algorithm produces a different level of similarity for the empirical

graph data, these different values are also seen in the random graphs sampled from

the network shuffle and node shuffle ensembles. Differences between the phylogenetic

algorithms are also reflected in the level of matching topologies found in the empirical

data for each tree algorithm. For example, in the case of the CORE data, phylogenies

inferred using PAML match in approximately 17% of comparisons; phylogenies inferred

using PROML match in 42%; and phylogenies inferred using PARS match in 57%. The

contrasting results for these phylogenetic algorithms may be explained by the differences

between the possible number of bifurcating and multifurcating topologies (see Table D.1)

as well as the tree search heuristics used.

Figure 4.8 shows the average similarity, for comparisons made on a fixed number of orthologues,

for each of the tree construction methods. The empirical traits are shown,

along with the results using the network shuffle ensembles. Figure 4.8(a) uses the PAML

phylogenetic trees and the similarity of phylogenetic topologies increases as they share

more orthologous sequences. The similarity levels range from 0.65 to 0.80 for the empirical

PIN data. Figures 4.8(b) and 4.8(c) show a different trend (for PARS and PROML

trees) as the similarity of topologies decreases significantly as the number of shared orthologous

sequences increases. PARS trees exhibit average similarity greater than 0.90

for comparisons between proteins which share 4 orthologous proteins in the study species.

Figure 4.8: Similarity of topologies for different tree algorithms. Panels: (a) PAML; (b) PROML; (c) PARS. Each panel plots the average similarity (vertical axis, 0.6 to 1.0) against the number of shared species (horizontal axis, 4 to 10) for CORE, DIP and LC, with the empirical values marked. Average level of similarity for topology comparisons (when rewiring using network shuffle) made for a fixed number of shared species. The average similarity for each method is different, and the trends seen in PAML show marked difference to those methods taken from Phylip. The variance of similarities for each tree construction method increases as more species are compared.


4.3.4 Further analyses

Phylogeny analyses have also been applied to a small set of E. coli PPI data (producing the

mirrortree results found in Pazos et al. (2005)). The results (in Appendix D.3) corroborate

those presented earlier in this section. The E. coli set consisted of 118 proteins in addition

to phylogenetic information on 47 bacterial species. The similarity between topologies

found for the E. coli data, using network shuffle graphs (with no biological constraints),

matched the reported results for the S. cerevisiae data.

As well as there being no obvious link between PPIs and the similarity of phylogenetic topologies

across the full set of study species, the same results hold for a smaller subset of the 10 study

species. To assess if the divergent range of study species makes a difference to the topological

analyses, the 3 most divergent species (S. pombe, C. albicans and S. kluyveri) have

been excluded and the same comparisons carried out.

Similarity and the proportion of matching topologies increase as species are excluded.

However, leaving out these species does not change the topological results when viewed

in relation to the results for the random graph ensembles on the same tree topologies.

The network shuffle and empirical similarity levels are almost the same and node shuffle

ensembles produce graphs with only marginally less similarity. The choice of null

ensemble potentially affects the outcome of the analysis, although none of the measures

produce differences as large as those seen for biological traits or the clustering coefficient

measured in Chapter 3.

The [complex] ensembles use complex annotations to determine how edges or nodes

can be rewired in the sampled random graphs. This constraint resulted in the closest

results to empirical data for the average difference in phylogenetic profiles, scores and the

number of shared orthologues. Similarity of phylogenetic topologies, or co-divergence,

is assessed for PPIs and protein pairs that co-occur in complexes in order to observe

differences between these two classes. Co-expression rates, which have also been linked

with PPIs, are also contrasted with the similarity.

Similarity levels are on average higher for reported PPIs than the set of protein pairs that

have been found in the same complex, as displayed in Table 4.2. However, for other

traits (such as co-expression, phylogenetic profile differences or functional annotations)

the observed correlation differs. Reported interactions that do not have matching complex

annotations produce more divergent phylogenetic profiles, and lower functional similarity,



4.4. DISCUSSION Phylogenetic topologies

than the average for any protein pair that has matching complex annotations. Finally, the

complete set of protein pairs exhibits a higher level of similarity, across phylogenetic

topologies, than protein pairs that have been reported in the same complex.

                       LC PPIs                                      Protein pairs
Trait                  All      Same complex   Different complex    Same complex   All
Edges                  21283    3663           3582                 33373          (5109 choose 2)
Function               0.19     0.46           0.15                 0.18           0.03
Co-expression          0.09     0.20           0.10                 0.17           0.01
Phylogenetic profile   3.24     2.94           3.23                 3.17           3.38
Similarity, Γ          0.84     0.83           0.84                 0.81           0.83

Table 4.2: Complex results. Results for phylogenetic topologies (PROML trees) and the co-expression trait. These show how traits differ for sets of reported PPI and reported matching complex annotations. Co-expression trait is the average co-expression found between all protein pairs considered.

4.4 Discussion

This chapter showed that there is no significant evidence for phylogenies of interacting

proteins to show higher levels of topological similarity than expected in a PIN by chance.

This finding was further investigated, to address potential reasons for such a conclusion,

by: (i) employing different phylogenetic inference approaches; (ii) using a range of different

PIN data sets; (iii) investigating the role of protein abundance as a potentially confounding

variable; and (iv) investigating the diversity of phylogenetic trees of proteins

forming complexes.

The objective was to determine whether protein phylogenetic tree topologies are more

similar among pairs of interacting proteins than among pairs of proteins for which no interactions

have been reported, accounting for the PIN structure. Empirical graph data have

been assessed against a variety of graph ensembles. The two main ensembles which take

account of topological structure (network shuffle and node shuffle) show contrasting results

regarding the similarity of tree topologies. Node shuffle results suggest a marginally

higher level of both topological matches, and of the similarity of empirical data. In contrast,

network shuffle graph ensembles produced similarity which is not significantly different

from the empirical data’s similarity, highlighting the importance of choosing an

appropriate graph ensemble when probing traits of biological networks.



4.4. DISCUSSION Phylogenetic topologies

What does emerge from contrasting network shuffle and node shuffle ensembles is the role

of hub proteins. This is particularly apparent from the wider variation in the node shuffle

ensemble results. This variation is primarily due to changes in the phylogenetic profile

of the highly connected proteins. In network shuffle the topology-degree relationship

remains fixed, and because degree-degree correlations are low, less variability is observed

in the probability of matches.

The tree phylogeny methods used have made use of both likelihood and parsimony approaches.

Although there are some differences between the methods, as seen in Figure

4.8, the topological results are consistent across the methods. However, the tree edit

distance used, whilst easy to compute, may not be the best means of comparing protein

phylogenetic trees. For the likelihood approaches, PROML and PAML, an alternative

analysis could be completed that compares the full likelihoods across the possible tree

topologies. This would generate an alternative measure of the similarity between the

trees, and perhaps be a better indicator of similarity (although the need for a fair comparison

between trees on different numbers of lineages may make this a difficult procedure).

In practice, the use of heuristics to find the best tree makes application of such likelihood

comparisons cumbersome, whilst the tree edit distance used works across all phylogeny

approaches and is easy to implement.

Although there is little evidence for the proportion of matching phylogenetic tree topologies

to be significantly enriched in the empirical graph data when compared to structurally

similar graphs, the topologies show notable results when interacting and non-interacting

proteins that occur in the same yeast complexes are compared (in Table 4.2). The complex

protein pairs show a higher correlation to mRNA co-expression levels than for random

protein pairs or reported interactions that are not in the same complex. However, all reported

interactions exhibit a higher propensity to share topologies than protein pairs that

co-occur in a complex but have not been reported to interact.

Moreover, the mRNA co-expression data show a higher level of co-expression for proteins

that appear in the same complex, above those found for reported PPIs. The use

of co-expression data for PPI classification should actively take account of the potential

confounding factor of complex membership. Topological similarity, however, appears to

deliver the same results on average for protein interactions found within complexes or

those that occur between complexes. The observed similarity is greater than that found

for non-interacting protein pairs that are found in the same complex. This suggests that

non-interacting complex partners are more divergent in topology than a random pair of


proteins, which may be surprising given an assumption that proteins with similar functional

roles have a higher propensity to co-evolve.

These results concerning the topology of interacting proteins do not, however, necessarily

contradict previous work regarding the co-evolution of interacting proteins (Goh et al.,

2000; Goh and Cohen, 2002; Ramani and Marcotte, 2003; Pazos et al., 2005). Measures

of the evolutionary rate or functional similarity are not accounted for in this analysis and

could be linked with interactions; in yeast (and also in Caenorhabditis elegans), however,

there is evidence that such a correlation among the evolutionary rates of interacting

proteins is at best weak (Agrafioti et al., 2005). Several sets of authors have also shown

that it is in fact the expression level of a gene (or a measure that may act as a proxy for

gene expression level, such as the codon-adaptation index (Sharp and Li, 1987)) which

explains most of the variation in protein evolutionary rate (Jordan et al., 2003; Agrafioti

et al., 2005; Drummond et al., 2006; Hakes et al., 2007a) and not properties related to the

topology of the interaction network. This also appears to be independent of noise in, and

incompleteness of, the PIN data (de Silva et al., 2006).

This chapter has highlighted the conceptual difference between predicting individual interactions,

and predicting the whole interactome of an organism. Correlations among

pairs of proteins may be used to detect some interaction, or perhaps complex, partners of

a protein. For any given protein it will frequently be found that some of its interaction

partners have similar properties. Whilst the set of protein pairs with very similar properties

is enriched for true interactors, not all are or have been reported as interactors. It

is important to realise that although co-evolution has been shown to be important across

key functional proteins, evidence for any evolutionary properties – correlated evolution,

co-evolution, or our co-divergence measure – may be absent or weak when the whole

interactome is being considered. Overall, the observed level of phylogenetic similarity is

not higher for the empirical PINs than that expected for a random network with the same

degree sequence.



Chapter 5

Measuring the interactome

This chapter describes a model for finding the interactome size or the false-discovery rate

for interaction data (Section 5.2). The model estimates the size and false-discovery rate

using the number of repeated interactions, and provides suggestions as to how repeated

reports can be used to reduce noise (Section 5.3).



5.1. INTRODUCTION Interactome size

5.1 Introduction

Knowledge of the interactome size provides a view about the biological complexity of an

organism (Copley, 2008). Proteomes may have similar orders whilst exhibiting different

interactome sizes (Stumpf et al., 2008). Interactome size determination also allows an

appreciation of how close the reported data are to a full picture of the underlying true

interaction network.

Recent publications have assessed the quality of the available protein interaction data for

S. cerevisiae (Chiang et al., 2007; Scholtens et al., 2008). In parallel, as discussed in Section

1.6.3 on page 56, studies have used graph theoretic methods to find the interactome

size for a variety of species (Stumpf et al., 2008). The estimates are generally based on

small collections of HTP studies (Grigoriev, 2003). Reference sets (such as MIPS or SSE

data) are often used in parallel to estimate the error rates in HTP data (D’haeseleer and

Church, 2004).

In this chapter a model is presented which estimates interactome size and FDR from

reported interaction data. The model uses all the data that have been reported, rather

than the output from a small number of studies. Multiply reported interactions are used

to obtain estimates for interactome size and FDR.

The model presented here assumes that interaction data are sampled independently so that

the reporting of PPIs can be viewed as a coupon collecting problem or multiple capture-recapture

approach (Shokouhi et al., 2006). A coupon, treated as an individual protein

interaction, is drawn from an urn of fixed size. The observed reported interactions are

drawn from either an urn containing true interactions, or a second urn containing false

interactions.

Modern global mappings of protein interactions (using HTP methods) attempt to survey

as many protein pairs as possible to find protein interactions and produce over two thirds

of the data. Accordingly, HTP data can be reasonably equated to independent sampling

from the set of true, or false, interactions. SSEs, however, make up the majority of the

experiments and produce a significant minority of reported interactions. SSEs have been

viewed as more reliable but are difficult to summarise from a sampling point of view. For

the sampling of proteins, an independent sampling approach has been shown to be the mean-field

approximation to non-independent and non-random sampling (Stumpf et al., 2008,

Supporting Information). In this case, the same argument is used to justify assuming the



5.2. METHODS Interactome size

independent sampling of interactions for the PPI data.

The FDR is used to observe the influence of false-positive interactions which have been

found to be inherent in experimental studies (von Mering et al., 2002). The proposed

coupon collecting model is analysed and compared to an alternative model which models

the effect of drawing multiple coupons simultaneously. Finally, the use of validation

information is proposed and analysed as a means of separating reported PPI data into

interactions and false interactions. The chapter aims to understand whether the observed

error of HTP data is an insurmountable barrier to its use in elucidating the complete

interactome in S. cerevisiae. The use of validations, and the coupon model proposed, are

tested to assess how the current methods could be utilised to elucidate the full interactome.

5.2 Methods

This section builds on methods developed in the field of systems biology over the last

decade. Studies have focused on the size of the complete interactome with a hope that

this will aid the assessment of whether reported interactions are true or false positives

(Salwinski and Eisenberg, 2003). Section 1.6 on page 52 introduced error rate notation

for PPIs, and methods used to find the interactome size were discussed in Section 1.6.3 on

page 56. The use of validated information from two HTP datasets features prominently

in the discovery of interactome size (Grigoriev, 2003).

Validation data are used to estimate the number of distinct interactions in the S. cerevisiae

interactome. The coupon model described here also provides, for a given FDR and interactome

size, an assessment of the number of reported interactions required to separate

PPI data into interacting and non-interacting sets.

5.2.1 Data

The data required to model the FDR and interactome size are taken from the BioGRID

interaction dataset for S. cerevisiae. The information required for the coupon collecting

model consists of: the number of different protein pairs observed ($m_{obs}$); the number of

interactions reported ($s_{obs}$); the number of distinct interactions reported ($i_{obs}$); and a list

of experiment sizes ($\{r^{obs}_1, r^{obs}_2, \ldots, r^{obs}_q\}$) as defined in Section 1.6.2.


Table 5.1 shows data used in Section 5.3, although the experiment sizes ($r^{obs}_k$) are not

shown. The number of different protein pairs observed, $m_{obs}$, is defined using the number

of distinct proteins, ň, in the data according to $m_{obs} = \binom{\check{n}}{2}$. This is used as an estimate

for the complete dataset, rather than a proposal of which protein pairs have been assessed

in each experiment. In order to assess the size of the complete interactome this figure

provides an estimate of the observed coverage of the methods from which the interactions

have been sampled.

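Dataset and model parameters:

| Size, $r_k$ | Experiments | All, $s_{obs}$ | Distinct, $i_{obs}$ | Proteins, ň |
|---|---|---|---|---|
| All PPI | 4,167 | 59,956 | 41,313 | 4,967 |
| ≥ 5 | 932 | 54,320 | 39,222 | 4,856 |
| < 5 | 3,235 | 5,636 | 4,768 | 2,203 |
| ≥ 10 | 398 | 50,884 | 37,761 | 4,817 |
| < 10 | 3,769 | 9,072 | 6,914 | 2,550 |
| ≥ 100 | 32 | 42,573 | 33,779 | 4,719 |
| < 100 | 4,135 | 17,383 | 12,452 | 3,216 |
| ≥ 1000 | 7 | 35,596 | 28,710 | 4,239 |
| < 1000 | 4,160 | 24,360 | 18,119 | 4,011 |
| All genetic | 4,426 | 44,275 | 38,071 | 3,793 |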

All genetic 4,426 44,275 38,071 3,793

Table 5.1: Interaction datasets. The different data subsets, of physical PPIs from BioGRID, used

to find FDR, κ, and interactome size. The protein data exclude proteins that have only been reported as

self-interacting. Both subsets (≥ and



(2008), is:

\[
\rho = \frac{\binom{n}{2}}{\binom{\check{n}}{2}} = \frac{n (n - 1)}{\check{n} (\check{n} - 1)}. \tag{5.1}
\]

This estimator is used to find the total number of interactions from those found by sampling

only ň proteins (node sampling is further described in Appendix E). This provides

an unbiased estimate of the complete network size assuming that the interactions have

been sampled uniformly (Stumpf et al., 2008). Uniform sampling of the possible interactions

is also a necessary assumption of the coupon model later introduced. Equation 5.1

assumes that $\binom{\check{n}}{2}$ protein pairs are observed, or tested, to produce reported data. The

total number of possible proteins found in S. cerevisiae, n, is here defined to be 5,800

(Hirschman et al., 2006). The number of proteins in a dataset, ň, is used to find ρ.
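
As a quick arithmetic check of Equation 5.1, the following minimal Python sketch computes ρ and the number of observable protein pairs for n = 5,800 and the ň = 4,967 proteins of the complete physical dataset in Table 5.1. The function name `scaling_factor` is illustrative only and is not taken from the thesis.

```python
from math import comb

def scaling_factor(n_total: int, n_observed: int) -> float:
    """Ratio of possible protein pairs in the proteome to pairs among observed proteins (Equation 5.1)."""
    return comb(n_total, 2) / comb(n_observed, 2)

rho = scaling_factor(5800, 4967)  # n and the protein count of the "All PPI" row of Table 5.1
m_obs = comb(4967, 2)             # number of observable protein pairs
print(round(rho, 4), m_obs)
```

The resulting ρ agrees with the value of approximately 1.36 quoted in Section 5.3.1.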

5.2.2 Coupon collecting

A model is proposed to describe the sampling of true and false interactions from the S.

cerevisiae PIN. The aim is to estimate the overall population size of true interactions by

using knowledge of the total number of observed interactions and the number of times

repeated interactions are observed.

The model can be considered to be a capture-recapture approach (Bunge and Fitzpatrick,

1993; Chao, 2001). These approaches have commonly been employed in the literature in

order to find a population’s size or to elucidate its class structure. The overlap found between

two samples (i.e. the number of items recaptured) is used to estimate the complete

population’s size, as also used for interactome analyses (see Section 1.6.3).
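
The two-sample overlap idea can be illustrated with the classical Lincoln-Petersen form of capture-recapture, in which the population size is estimated as the product of the two sample sizes divided by their overlap. This is a generic sketch: the text above does not state which two-sample estimator the earlier interactome studies used, and the function name and the numbers below are hypothetical.

```python
def lincoln_petersen(n_first: int, n_second: int, n_overlap: int) -> float:
    """Two-sample capture-recapture estimate of a population size.

    n_first, n_second: sizes of the two samples; n_overlap: items found in both.
    """
    if n_overlap == 0:
        raise ValueError("no recaptures: the population size cannot be estimated")
    return n_first * n_second / n_overlap

# Hypothetical example: two screens reporting 1,000 and 800 interactions, 40 of them shared.
print(lincoln_petersen(1000, 800, 40))
```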

Multiple capture-recapture (Shokouhi et al., 2006) is an extension of this approach to

account for any number of samples. This has been used to estimate the size of different

populations by observing the overlap between different samples (Xu et al., 2007). This

method has also been generalised to use non-uniform sample sizes (Thomas, 2008).

The population size estimator introduced in this section is equivalent to the homogeneous

capture-recapture estimator (Shokouhi et al., 2006) when taking samples (of size 1) with


replacement from a finite population. The population considered to begin with is only

true interactions (no false interactions are considered). It is also a natural extension of the

overlap methodologies which have focused on individual PPI experiments. Sections 5.2.3

and 5.2.4 describe extensions to the simple estimator. First, the capture-recapture model

is adapted in order to account for the presence of false as well as true interactions. The

model is then further modified so that non-uniform sample sizes may be considered.

Suppose that interactions, or coupons, are sampled (where each sample reports one interaction)

with replacement from an urn containing m different interactions. Having sampled

i distinct interactions, the probability that the next sampled interaction is novel is,

\[
P(\text{novel interaction sampled} \mid i \text{ distinct interactions}) = \frac{m - i}{m}. \tag{5.2}
\]

The number of samples to find a novel interaction, given that $i \geq 0$ have already been collected, is geometrically distributed with success parameter $\theta = \frac{m - i}{m}$. For this geometrically distributed variable, the expected number of samples required to find a novel interaction is $E(\text{samples, to find novel interaction}) = \frac{1}{\theta} = \frac{m}{m - i}$. Thus, using the linearity of expectations, the expected number of samples, $S$, to collect $i$ distinct interactions is:

\[
\begin{aligned}
E(S, \text{to find } i \text{ distinct interactions}) &= \sum_{k=0}^{i-1} E(\text{novel sample} \mid k \text{ distinct}) \\
&= 1 + \frac{m}{m - 1} + \ldots + \frac{m}{m - i + 1} \\
&= m \sum_{k=0}^{i-1} \frac{1}{m - k}.
\end{aligned} \tag{5.3}
\]

The variance of the number of samples, $S$, to collect $i$ distinct interactions, can also be found from the sum of variances of independent geometric random variables,

\[
\begin{aligned}
V(S, \text{to find } i \text{ distinct interactions}) &= \sum_{k=0}^{i-1} V(\text{novel sample} \mid k \text{ distinct}) \\
&= \sum_{k=0}^{i-1} \left( \frac{m}{m - k} \right)^2 \left( 1 - \frac{m - k}{m} \right) \\
&< \sum_{k=0}^{i-1} \left( \frac{m}{m - k} \right)^2 \\
&< m^2 \frac{\pi^2}{6}.
\end{aligned} \tag{5.4}
\]

The coefficient of variation ($CV$) for the distribution of samples necessary to find $i$ distinct interactions is,

\[
CV = \frac{\sqrt{V(S, i)}}{E(S, i)} < \frac{\pi}{\sqrt{6}} \, \frac{1}{\sum_{k=0}^{i-1} \frac{1}{m - k}}. \tag{5.5}
\]

For the coupon distribution the CV decreases as m increases, and is below 1 for all parameters

of interest, informing about the reliability of the model presented in Section 5.2.3.

When m is known then Equation 5.3 can be used to estimate the number of samples necessary

to have found all of the distinct interactions. Alternatively, given the number of

distinct interactions found, i, and a given m, the expected number of interactions sampled,

S, can be compared to observed data.
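
The quantities in Equations 5.3 to 5.5 are straightforward to evaluate numerically. The sketch below is an illustration under the assumption that i ≤ m; the function names and the parameter values are chosen here for the example and do not come from the thesis.

```python
from math import pi, sqrt

def expected_samples(m: int, i: int) -> float:
    """E(S): expected draws with replacement needed to see i distinct items out of m (Equation 5.3)."""
    return m * sum(1.0 / (m - k) for k in range(i))

def variance_samples(m: int, i: int) -> float:
    """V(S): sum of geometric variances (1 - p) / p**2 with p = (m - k) / m (Equation 5.4)."""
    return sum((m / (m - k)) ** 2 * (1.0 - (m - k) / m) for k in range(i))

def cv_bound(m: int, i: int) -> float:
    """Upper bound on the coefficient of variation (Equation 5.5)."""
    return (pi / sqrt(6.0)) / sum(1.0 / (m - k) for k in range(i))

m, i = 1000, 500  # illustrative values only
cv = sqrt(variance_samples(m, i)) / expected_samples(m, i)
print(expected_samples(m, i), cv, cv_bound(m, i))
```

With i = m the first function reduces to the familiar coupon collector expectation, m times the m-th harmonic number.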

5.2.3 Single coupon

The single coupon model now described is a modified version of that introduced in Section

5.2.2. Suppose that PPIs are reported from a set of protein pairs, $E_{obs}$, defined as all

pairs of ň different proteins (those protein pairs that can be observed experimentally). Let

$m_{obs}$ be the size of $E_{obs}$, so $m_{obs} = \binom{\check{n}}{2}$.

The observed PPIs are either found as edges of the true interaction graph or false interaction graph (introduced in Section 1.6 on page 52). These may be considered as being found in two urns containing either: PPIs, $e_a = (v_i, v_j)$, found in $E \cap E_{obs}$; or false interaction protein pairs, $e_b = (v_k, v_h)$, found in $E' \cap E_{obs}$. Each reported interaction is found in one of these urns, since $E \cup E' = E_\Omega$. Let $m$ be the size of $E \cap E_{obs}$ and $m'$ the size of $E' \cap E_{obs}$.

The proportion of reported data that are found in $E'$ is also the FDR, $\kappa$. $s_{obs}$ is the observed number of reported interactions. Now let $S$ be the number of interactions sampled from $E$ and $S'$ be sampled from $E'$. Then, suppose $S \in [0, s_{obs}] \cap \mathbb{Z}$ is fixed,

\[
S = (1 - \kappa) s_{obs}, \tag{5.6}
\]

and also trivially, $S' = \kappa s_{obs}$. Then $\kappa$ can be found directly from $S$.

The observed number of distinct interactions, $i_{obs}$, is made up of those sampled from $E$ and those from $E'$. Let $i$ be sampled from $E$ and $i'$ be from $E'$. As $E \cap E' = \emptyset$, $i_{obs} = i + i'$. In summary,

\[
\begin{aligned}
s_{obs} &= S' + S, \\
i_{obs} &= i' + i, \\
m_{obs} &= m' + m.
\end{aligned} \tag{5.7}
\]

$m$ and $i$ are required to satisfy Equations 5.7, alongside a fixed $S$ (from which $\kappa$ is found) and $s_{obs}$. $S$ (and $S'$) are assumed to be the expected number of samples necessary to find $i$ (and $i'$) interactions and $E(S, i) = S$ (from Section 5.2.2). $m$ and $i$ are sought that satisfy Equations 5.7 and,

\[
\begin{aligned}
S &= m \sum_{k=0}^{i-1} \frac{1}{m - k}, \\
S' &= m' \sum_{k=0}^{i'-1} \frac{1}{m' - k}.
\end{aligned} \tag{5.8}
\]

k=0

In order to find a solution for $m$, $i$ and $S$, solutions are sought such that the following function, $g(m, i)$, is zero,

\[
\begin{aligned}
g(m, i) &= s_{obs} - m \sum_{k=0}^{i-1} \frac{1}{m - k} - m' \sum_{k=0}^{i'-1} \frac{1}{m' - k} \\
&= s_{obs} - m \sum_{k=0}^{i-1} \frac{1}{m - k} - (m_{obs} - m) \sum_{k=0}^{i_{obs}-i-1} \frac{1}{m_{obs} - m - k}.
\end{aligned} \tag{5.9}
\]

An approximate solution for $m$ and $i$ is found by assuming that the parameters are from a continuous function (rather than discrete as they are in truth) such that for all $m \in [0, m_{obs}]$, solutions are sought (if they exist) for $i \in [0, i_{obs}]$. The complete interactome size, $m_\Omega$, is then found for a given solution using $\rho$, the scaling factor introduced in Equation 5.1, and $m$:

\[
m_\Omega = \rho m = \frac{n (n - 1)}{\check{n} (\check{n} - 1)} m. \tag{5.10}
\]

Uniqueness of solution

In order to examine the possible uniqueness of $i$ such that $g(m, i) = 0$, take $m$ and $i$ both as positive reals (as performed to find a solution). The expectation found in Equation 5.3 is approximated as the following only for this section,

\[
\begin{aligned}
E(S, \text{to find } i \text{ distinct interactions}) &= m \sum_{k=0}^{i-1} \frac{1}{m - k} \\
&= m \sum_{k=1}^{i-1} \frac{1}{m - k} + 1 \\
&\approx m \int_0^{i-1} \frac{1}{m - x} \, dx + 1 \\
&= m \log \left( \frac{m}{m - i + 1} \right) + 1.
\end{aligned} \tag{5.11}
\]
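
The quality of the logarithmic approximation in Equation 5.11 can be checked against the exact sum of Equation 5.3. The following is a small sketch with illustrative parameter values; the function names are not from the thesis.

```python
from math import log

def expected_samples_exact(m: int, i: int) -> float:
    """Exact sum of Equation 5.3."""
    return m * sum(1.0 / (m - k) for k in range(i))

def expected_samples_log(m: float, i: float) -> float:
    """Continuous approximation of Equation 5.11."""
    return m * log(m / (m - i + 1)) + 1.0

m, i = 1000, 500  # illustrative values only
exact = expected_samples_exact(m, i)
approx = expected_samples_log(m, i)
print(exact, approx, abs(exact - approx) / exact)
```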

Now to examine the uniqueness of a solution for the coupon model, Equations 5.8 are approximated using Equation 5.11 as,

\[
\begin{aligned}
S &\approx m \log \left( \frac{m}{m - i + 1} \right), \\
S' &\approx m' \log \left( \frac{m'}{m' - i' + 1} \right).
\end{aligned} \tag{5.12}
\]

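$g(m, i)$ now is,

\[
\begin{aligned}
g(m, i) &\approx s_{obs} - m \log \left( \frac{m}{m - i + 1} \right) - m' \log \left( \frac{m'}{m' - i' + 1} \right) \\
&= s_{obs} - m \log \left( \frac{m}{m - i + 1} \right) - (m_{obs} - m) \log \left( \frac{m_{obs} - m}{(m_{obs} - m) - (i_{obs} - i) + 1} \right),
\end{aligned} \tag{5.13}
\]

and the derivative of $g(m, i)$ with respect to $i$ is,

\[
\frac{\partial g(m, i)}{\partial i} = - \frac{m}{m - i + 1} + \frac{m_{obs} - m}{(m_{obs} - m) - (i_{obs} - i) + 1}. \tag{5.14}
\]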

(a) $g(10000, i)$ for $i \in [0, i_{obs}]$.

(b) $g(50000, i)$ for $i \in [0, i_{obs}]$.

Figure 5.1: Single coupon function. $g(m, i)$ for the physical interactome data parameters, $s_{obs} = 59956$, $m_{obs} = \binom{4967}{2}$ and $i_{obs} = 41313$. For $m \in \{10000, 50000\}$ the function can be seen to have a single solution satisfying $g(m, i) = 0$.


This derivative is negative if

\[
m \left( (m_{obs} - m) - (i_{obs} - i) + 1 \right) > (m_{obs} - m) (m - i + 1),
\]

which reduces to

\[
\frac{m_{obs}}{m} > \frac{i_{obs} - 2}{i - 1}. \tag{5.15}
\]

As protein interaction graphs are assumed to be sparse (i.e. $m \ll m_{obs}$), it follows that $\frac{\partial g(m, i)}{\partial i}$ can be positive only for small $i$. Figure 5.1 shows the behaviour of $g(m, i)$ for the physical data parameters taken from Table 5.1.

Using Equation 5.13 and setting $i = 1$ for simplicity,

\[
g(m, 1) = s_{obs} - (m_{obs} - m) \log \left( \frac{m_{obs} - m}{(m_{obs} - m) - i_{obs}} \right), \tag{5.16}
\]

which is positive for all parameter sets defined in Table 5.1 and $m \ll m_{obs}$.

Further, Equation 5.14 is decreasing in i, so the second derivative of g with respect to i

is negative. Therefore, as g (m, 1) for considered m is positive, if an i exists such that

g (m, i) = 0 then the solution is unique.
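
Because g(m, 1) is positive and the solution in i is unique, a simple bisection on the approximate form of g in Equation 5.13 is enough to locate it. The sketch below is one possible way to do this and is not necessarily the procedure used to produce the results in Section 5.3; the function names, the bracketing interval and the tolerance are assumptions made for illustration. The data values are those of the "All PPI" row of Table 5.1.

```python
from math import comb, log

def g(m, i, s_obs, m_obs, i_obs):
    """Logarithmic approximation of g(m, i) from Equation 5.13."""
    m_false, i_false = m_obs - m, i_obs - i
    true_part = m * log(m / (m - i + 1))
    false_part = m_false * log(m_false / (m_false - i_false + 1))
    return s_obs - true_part - false_part

def solve_i(m, s_obs, m_obs, i_obs, tol=1e-6):
    """Bisection for i with g(m, i) = 0; returns None if no sign change is bracketed."""
    lo, hi = 1.0, float(min(m, i_obs))
    if g(m, lo, s_obs, m_obs, i_obs) <= 0 or g(m, hi, s_obs, m_obs, i_obs) >= 0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(m, mid, s_obs, m_obs, i_obs) > 0:
            lo = mid   # still before the root
        else:
            hi = mid   # past the root
    return 0.5 * (lo + hi)

s_obs, i_obs, m_obs = 59956, 41313, comb(4967, 2)  # "All PPI" row of Table 5.1
m = 10000                                          # one of the values plotted in Figure 5.1
i = solve_i(m, s_obs, m_obs, i_obs)
S = m * log(m / (m - i + 1))   # approximate number of samples from the true urn (Equation 5.12)
kappa = 1 - S / s_obs          # Equation 5.6 rearranged for the FDR
print(i, kappa)
```

Repeating the calculation over a grid of m values gives one (m, κ) pair per solvable m, which is the kind of relationship summarised in Section 5.3.1.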

5.2.4 Multiple coupons

Rather than a series of independent studies reporting individual interactions, the S. cerevisiae

data have been published in studies producing multiple interactions. Each study,

P k , contains a set of reported interactions E Pk . A multiple coupon model assumes that

interactions are drawn without replacement from the observable protein pairs, E obs . This

differs from the assumption in Section 5.2.3 where each interaction is drawn from E obs

with replacement.

Recall that the number of true interactions, the interactome size, is m. Now suppose

that q experiments, P 1 , . . . , P q , are conducted and that the number of true interactions

reported in experiment P k is r k . For each experiment, P k , let p h,j,k be the probability of

drawing (j − h) novel true interactions, given that h distinct true interactions are observed


in experiments $\{P_1, \ldots, P_{k-1}\}$. The probability $p_{h,j,k}$ can be described as a transition matrix (each state referring to the number of distinct interactions sampled) where for the $k$th experiment,

\[
p_{h,j,k} =
\begin{cases}
0 & \text{if } j < h, \\
\dfrac{\binom{m - h}{j - h} \binom{h}{r_k - j + h}}{\binom{m}{r_k}} & \text{if } j \geq h,
\end{cases} \tag{5.17}
\]

which is equivalent to,

\[
p_{h,j,k} = \frac{(m - h)! \, h! \, r_k! \, (m - r_k)!}{(j - h)! \, (m - j)! \, (r_k - j + h)! \, (j - r_k)! \, m!} \quad \text{if } j \geq h. \tag{5.18}
\]
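
Equation 5.17 is a hypergeometric probability and can be evaluated directly with binomial coefficients. The following sketch uses small illustrative values and a hypothetical function name; it also checks that each row of the transition matrix sums to one.

```python
from math import comb

def transition_probability(h: int, j: int, r_k: int, m: int) -> float:
    """p_{h,j,k}: probability of moving from h to j distinct true interactions when
    experiment k reports r_k true interactions out of m possible ones (Equation 5.17)."""
    new = j - h           # novel interactions drawn in this experiment
    repeats = r_k - new   # draws that repeat interactions already seen
    if new < 0 or repeats < 0:
        return 0.0
    return comb(m - h, new) * comb(h, repeats) / comb(m, r_k)

m, h, r_k = 20, 5, 4  # illustrative values only
row = [transition_probability(h, j, r_k, m) for j in range(h, h + r_k + 1)]
print(row, sum(row))
```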

To find possible values of κ and m that are consistent with the data found in Table 5.1,

different values of m, s and i are simulated. Unlike the single coupon model, however,

the experiments provide s obs samples and in each experiment the reported interactions

have to be split into true (e ∈ E) and false (e ∈ E ′ ) reported interactions. The complete

experiment sizes, $\{r_{1,obs}, r_{2,obs}, \ldots, r_{q,obs}\}$, are such that,

\[
s_{obs} = \sum_{k=1}^{q} r_{k,obs}. \tag{5.19}
\]

In order to simulate this model, $\kappa \in \left\{ \frac{1}{s_{obs}}, \ldots, \frac{s_{obs} - 1}{s_{obs}} \right\}$ is chosen, and then the number of interactions drawn from the urns of true interactions and false interactions are uniformly, and at random, selected such that $\{r_1, r_2, \ldots, r_q\}$ are sampled from the interaction urn ($E$) and $\{r'_1, r'_2, \ldots, r'_q\}$ are sampled from the false interaction urn ($E'$). These are sampled such that $r_k + r'_k = r_{k,obs} \; \forall \, k \in [1, q]$ and $\sum_{k=1}^{q} r_k = (1 - \kappa) s_{obs}$.

For each possible κ (along with a collection of 1,000 sampled experiment sizes) and each

m ∈ [0, m obs ] the average number of distinct interactions, ǐ, is found through simulation

and forms a possible solution for m and κ only if ǐ = i obs . The multiple coupon model is

simulated in order to assess the effect of sampling from experiments of different sizes, in

contrast to the simple with replacement model in Section 5.2.3. The model is also used

to assess the possible interactome size, and FDR, predictions found from HTP and SSE

data.
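
The simulation step for the true-interaction urn can be sketched as follows: each experiment draws its interactions without replacement, and the number of distinct interactions is averaged over repeated runs. This is only an illustration of the sampling scheme described above, with hypothetical experiment sizes, function names and run counts; the false-interaction urn would be treated in the same way with sizes r′ and urn size m′.

```python
import random

def simulate_distinct(m: int, experiment_sizes, rng: random.Random) -> int:
    """One realisation: every experiment draws r_k interactions without replacement
    from an urn of m interactions; returns the number of distinct interactions seen."""
    seen = set()
    for r_k in experiment_sizes:
        seen.update(rng.sample(range(m), r_k))
    return len(seen)

def mean_distinct(m: int, experiment_sizes, n_runs: int = 1000, seed: int = 0) -> float:
    """Average number of distinct interactions over n_runs simulated datasets."""
    rng = random.Random(seed)
    total = sum(simulate_distinct(m, experiment_sizes, rng) for _ in range(n_runs))
    return total / n_runs

sizes = [50, 20, 5, 1, 1]  # hypothetical numbers of true interactions per experiment
print(mean_distinct(200, sizes, n_runs=200))
```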


5.2.5 Finding true interactions

False interaction and true interaction data, using the single coupon model, can be generated

for known protein pairs m obs , interactome size m, and FDR κ (having found solutions

κ and m from data in Table 5.1). The effect on the number of times an interaction, e, is

reported, V (e), is simulated for different s obs values to assess how the repeats can be used

to classify true interactions (e ∈ E) and false interactions (e ∈ E ′ ).

Let $V(e)$ be the number of times an interaction, $e$, has been reported. For $e \in E$ and $\kappa$ such that $s = (1 - \kappa) s_{obs}$ is large, the probability that an interaction, $e \in E$, is reported $V(e)$ times is approximated as being Poisson distributed:

\[
P(V(e) = k \mid e \in E) = \binom{s}{k} \left( \frac{1}{m} \right)^k \left( 1 - \frac{1}{m} \right)^{s-k} \approx \exp \left( - \frac{s}{m} \right) \left( \frac{s}{m} \right)^k \frac{1}{k!}, \tag{5.20}
\]

and similarly if $e \in E'$,

\[
P(V(e) = k \mid e \in E') \approx \exp \left( - \frac{s'}{m'} \right) \left( \frac{s'}{m'} \right)^k \frac{1}{k!}. \tag{5.21}
\]
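
The closeness of the Poisson approximation in Equation 5.20 can be checked numerically. The sketch below compares the binomial and Poisson probabilities for a few report counts; the values of s and m are illustrative and the function names are not from the thesis.

```python
from math import comb, exp, factorial

def binomial_pmf(k: int, s: int, m: int) -> float:
    """Probability that one of m equally likely interactions is drawn exactly k times in s draws."""
    p = 1.0 / m
    return comb(s, k) * p**k * (1.0 - p) ** (s - k)

def poisson_pmf(k: int, s: float, m: float) -> float:
    """Poisson approximation with rate s / m (right-hand side of Equation 5.20)."""
    lam = s / m
    return exp(-lam) * lam**k / factorial(k)

s, m = 36000, 20000  # illustrative values only
for k in range(4):
    print(k, binomial_pmf(k, s, m), poisson_pmf(k, s, m))
```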

For any reported interaction, e, P (e ∈ E) = 1 − f and P (e ∈ E ′ ) = f. V (e) is used to

classify e as being sampled from E if:

P (e ∈ E|V (e)) > P (e ∈ E ′ |V (e)) . (5.22)

To find the threshold value, $k^*$, to ensure minimum error when using $V(e)$ for classification, the solution of the following is found,

\[
P(e \in E \mid V(e) = k^*) = \frac{P(V(e) = k^* \mid e \in E) \, P(e \in E)}{P(V(e) = k^*)}. \tag{5.23}
\]

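So using in turn Equations 5.20-5.23 the threshold value $k^*$ (noting $k^* \in [0, \min(s, s')]$) is the solution of,

\[
\begin{aligned}
m \exp \left( - \frac{s}{m} \right) \left( \frac{s}{m} \right)^{k^*} \frac{1}{k^*!} (1 - \kappa) &= m' \exp \left( - \frac{s'}{m'} \right) \left( \frac{s'}{m'} \right)^{k^*} \frac{1}{k^*!} \kappa, \\
\left( \frac{s}{m} \right)^{k^*} \left( \frac{m'}{s'} \right)^{k^*} &= \frac{\kappa}{1 - \kappa} \frac{m'}{m} \exp \left( \frac{s}{m} - \frac{s'}{m'} \right), \\
\left( \frac{s m'}{m s'} \right)^{k^*} &= \frac{\kappa m'}{(1 - \kappa) m} \exp \left( \frac{s}{m} - \frac{s'}{m'} \right), \\
\left( \frac{s (m_{obs} - m)}{m (s_{obs} - s)} \right)^{k^*} &= \frac{\kappa (m_{obs} - m)}{(1 - \kappa) m} \exp \left( \frac{s}{m} - \frac{s_{obs} - s}{m_{obs} - m} \right), \\
k^* &= \frac{\log \left( \frac{\kappa (m_{obs} - m)}{(1 - \kappa) m} \right) + \left( \frac{s}{m} - \frac{s_{obs} - s}{m_{obs} - m} \right)}{\log \left( \frac{s (m_{obs} - m)}{m (s_{obs} - s)} \right)}.
\end{aligned} \tag{5.24}
\]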

Equation 5.24 defines a k ∗ (assuming an equal cost of misclassifying either class) for

given s obs , m obs , m and κ such that e is classified as being in E if,

V (e) > k ∗ . (5.25)

Let $C(e \mid V(e))$ be the class predicted for interaction, $e$. Then,

\[
C(e \mid V(e)) =
\begin{cases}
E & \text{if } V(e) > k^*, \\
E' & \text{otherwise}.
\end{cases} \tag{5.26}
\]

In order to observe how the classifier, C, performs, s obs interactions are drawn from the

single coupon model using the parameters above. k ∗ is found for a given s obs using Equation

5.24 and each interaction, e, is classified as either a true interaction, e ∈ E, or false

interaction, e ∈ E ′ .
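
The threshold in the final line of Equation 5.24 and the classifier of Equation 5.26 can be sketched as below. The pair m = 20,000 and κ = 0.4 is chosen purely for illustration and is not a fitted solution of the coupon model; the function names are likewise hypothetical. The values of s obs and ň are taken from the "All PPI" row of Table 5.1.

```python
from math import comb, log

def threshold(s_obs: float, m_obs: float, m: float, kappa: float) -> float:
    """k* from the final line of Equation 5.24."""
    s = (1.0 - kappa) * s_obs   # samples drawn from the true urn
    s_false = s_obs - s         # samples drawn from the false urn
    m_false = m_obs - m         # size of the false urn
    numerator = log(kappa * m_false / ((1.0 - kappa) * m)) + (s / m - s_false / m_false)
    denominator = log(s * m_false / (m * s_false))
    return numerator / denominator

def classify(v: int, k_star: float) -> str:
    """Equation 5.26: class E (true) if the report count exceeds k*, otherwise class E'."""
    return "E" if v > k_star else "E'"

k_star = threshold(s_obs=59956, m_obs=comb(4967, 2), m=20000, kappa=0.4)
print(k_star, classify(1, k_star), classify(2, k_star))
```

For these illustrative parameters the threshold lies between one and two reports, so an interaction would be classified as true only once it has been reported at least twice.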

Let $c_{FP}$ be the percentage of interactions misclassified as true interactions (a second false discovery rate) and $c_{TP}$ be the percentage of true interactions (which can be actually tested) correctly identified (the sensitivity). These are both equivalent to previously used notation for interactions set out in Section 1.6.2 on page 54, but they are now defined clearly to avoid confusion with the FDR, $\kappa$, used for the coupon models:

\[
\text{false discovery rate} = c_{FP} = \frac{|E' \cap \{e \in E_\Omega : V(e) > k\}|}{|\{e \in E_\Omega : V(e) > k\}|} \tag{5.27}
\]

and,

\[
\text{sensitivity} = c_{TP} = \frac{|E \cap \{e \in E_\Omega : V(e) > k\}|}{m}. \tag{5.28}
\]

The misclassification rate is computed for the observed interactome size, $m$, and FDR, $\kappa$, estimates found from the coupon model, for $s_{obs} \in [40000, 600000]$.
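
Given simulated report counts and knowledge of which interactions are true, Equations 5.27 and 5.28 reduce to simple set operations. The sketch below uses toy data and a hypothetical function name, and returns fractions rather than percentages.

```python
def classifier_rates(report_counts, true_edges, k, m):
    """Return (c_FP, c_TP) for threshold k (Equations 5.27 and 5.28).

    report_counts: dict mapping each reported interaction e to V(e).
    true_edges: set of interactions in E (known here because the data are simulated).
    m: number of true interactions that could be observed.
    """
    predicted = {e for e, v in report_counts.items() if v > k}
    if not predicted:
        return 0.0, 0.0
    false_hits = len(predicted - true_edges)
    true_hits = len(predicted & true_edges)
    return false_hits / len(predicted), true_hits / m

counts = {"a-b": 3, "a-c": 1, "b-d": 2, "c-d": 2, "d-e": 1}  # toy report counts
truth = {"a-b", "b-d", "a-c"}                                # toy set of true interactions
c_fp, c_tp = classifier_rates(counts, truth, k=1, m=4)
print(c_fp, c_tp)
```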

5.3 Results

Section 5.3.1 presents the estimated S. cerevisiae interactome size along with an investigation

into the interplay between FDR, interactome size, and the proportion of distinct

interactions reported. Section 5.3.2 contrasts the multiple coupons model predictions

with those of the simple single coupon model. These results show how the predicted error

rates change, for a given interactome size, when the model takes account of experiment

size. Section 5.3.3 makes use of the range of FDR, κ, and interactome size estimates,

ρm, found in Section 5.3.1 to see how the number of reported interactions, s obs , changes

the reliability of true interaction classification using the number of validations for each

interaction.

5.3.1 Interactome size

The BioGRID physical interaction data, found in Table 5.1, were used to find the results

for FDR and interactome size shown in Figure 5.2. The scaling factor, ρ, is approximately

1.36. Figure 5.2(a) shows the relationship between FDR, κ, and interactome size, ρm.

This shows that the FDR, for the complete data, could be between 0 and 0.6, whilst the

interactome has fewer than 100,000 interactions. Using interactome size estimates guided

by the literature of 20,000-40,000 interactions (from Table 1.4 on page 61) produces an

estimated FDR across the complete data of 0.32-0.47. Similarly, using FDR estimates

from the literature (which have predicted an FDR of larger than 0.2 in general) suggests

that the interactome has fewer than 60,000 interactions.

Figure 5.2(b) shows the proportion of the interactome, $\frac{i}{\rho m}$, that has been reported for the

range of FDR estimates. Somewhere between 40% and 80% of the complete true interactome

has been found depending on the FDR. A higher FDR, due to its associated lower

interactome size (in Figure 5.2(a)), means that a higher proportion of the interactome


has been reported. There are fewer unseen true interactions if there is more noise (interactions

sampled from false interaction urn, E ′ ), a result consistent with how validation

information is used to find the FDR and interactome size in the coupon models.

(a) Size, $m$, and FDR, $\kappa$. Axes: false positive rate; interactome size.

(b) Proportion known, $\frac{i}{\rho m}$, and FDR, $\kappa$. Axes: false positive rate; proportion of interactome found.

Figure 5.2: S. cerevisiae physical interactome size. The results found for S. cerevisiae using

the single coupon model are shown in the two plots. Figure 5.2(a) displays the estimated FDR and size.

Figure 5.2(b) shows how the FDR relates to the proportion of the complete interactome that has been

reported.

Figure 5.3 shows the interactome size predictions for genetic interaction data (from Table

5.1). Although a similar number of distinct interactions are available for genetic and

physical interactions, the proportion of known interactions is substantially lower. Figure

5.3 shows that the single coupon model estimates that less than 40% of the genetic

interactions have been reported and that the FDR is less than 0.8 for any interactome size.

Genetic interactome size estimates, using the same range of plausible FDRs (0.32-0.47)

suggested by published PPI interactome sizes, are 80,000-150,000. However, these FDR

estimates have been found using different experimental methods which may not be applicable

to the genetic data. The genetic interactome size estimates suggest that this interactome

is much larger than the physical interactome (if the same FDR is assumed for

each dataset). However, this estimate has been made by a model that assumes only the

occurrence of n proteins, rather than modelling the binding of proteins to DNA. This may

act to underestimate the genetic interactome size as the number of possible interactions is

doubled. But the results suggest that there are at least several times more genetic than

protein-protein interactions in S. cerevisiae.

(a) Size, $m$, and FDR, $\kappa$. Axes: false positive rate; interactome size.

(b) Proportion known, $\frac{i}{\rho m}$, and FDR, $\kappa$. Axes: false positive rate; proportion of interactome found.

Figure 5.3: S. cerevisiae genetic interactome size. Plots show the results found using the single coupon model, for the genetic interaction data found in BioGRID. Figure 5.3(a) displays the estimated FDR and size. Figure 5.3(b) shows how the FDR relates to the proportion of the complete interactome that has been reported.

5.3.2 Experiment size

Each type of dataset may have a different FDR (some published estimates for HTP studies
are shown in Table 1.3 on page 61). To compare the noise found in HTP and SSE data,
the complete interactome data are split by experiment, $P_k$, according to experiment
size, $r_{k,\mathrm{obs}}$. The estimated FDR, κ, for a given interactome size is then compared between
$\mathrm{SSE}_\epsilon = \{P_k : r_{k,\mathrm{obs}} < \epsilon\}$ and $\mathrm{HTP}_\epsilon = \{P_k : r_{k,\mathrm{obs}} \geq \epsilon\}$ data for $\epsilon \in \{5, 10, 100, 1000\}$.

Table 5.1 details the data used for parameter values in the single coupon model.
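
The split of experiments into SSE and HTP sets by a size threshold can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `experiments` dictionary, its identifiers and the toy protein pairs are hypothetical, and the experiment size $r_{k,\mathrm{obs}}$ is taken to be the number of distinct interactions reported by experiment k.

```python
def split_by_experiment_size(experiments, epsilon):
    """Partition experiments into (SSE_eps, HTP_eps) by reported size.

    experiments: dict mapping a (hypothetical) experiment id k to the set
    of interactions P_k it reports; len(P_k) plays the role of r_{k,obs}.
    """
    sse = {k: p for k, p in experiments.items() if len(p) < epsilon}
    htp = {k: p for k, p in experiments.items() if len(p) >= epsilon}
    return sse, htp


# Toy data (assumed, not taken from BioGRID): experiments of size 2, 5 and 12.
experiments = {
    "study_a": {("P1", "P2"), ("P1", "P3")},
    "study_b": {("P%d" % i, "P%d" % (i + 1)) for i in range(5)},
    "study_c": {("P%d" % i, "P%d" % (i + 2)) for i in range(12)},
}

for epsilon in (5, 10):
    sse, htp = split_by_experiment_size(experiments, epsilon)
    print(epsilon, sorted(sse), sorted(htp))
```

Every experiment falls in exactly one of the two sets for a given threshold, so the FDR estimates for the two sets are based on disjoint groups of publications.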

Figure 5.4 shows the FDR, κ, and interactome size, ρm, results for these SSE and HTP

datasets. The figures show, for a fixed interactome size, the variation in FDR between

reported interaction sets.

In general, for a fixed interactome size, SSE experiments have a lower FDR than the
HTP data (when defined using the same $\epsilon$). HTP data produce a wider range of possible
interactome sizes. For $\mathrm{HTP}_{1000}$ (containing only 7 studies reporting more than 1,000
interactions) the possible maximal interactome size is 150,000, assuming that FDR is
negligible.

[Figure 5.4, two panels; x-axis: interactome size (0 to 140,000), y-axis: false positive rate (0.0 to 0.8). (a) $\mathrm{HTP}_\epsilon = \{P_k : r_{k,\mathrm{obs}} \geq \epsilon\}$; curves: All, 5+, 10+, 100+, 1000+. (b) $\mathrm{SSE}_\epsilon = \{P_k : r_{k,\mathrm{obs}} < \epsilon\}$; curves: All, 5−, 10−, 100−, 1000−.]

Figure 5.4: Experiment and interactome size. Plots show the interactome size, ρm, and FDR,
κ, estimates found for $\mathrm{SSE}_\epsilon$ and $\mathrm{HTP}_\epsilon$ data using the single coupon model. The $\mathrm{HTP}_\epsilon$ data are less
reliable than the complete data whilst the $\mathrm{SSE}_\epsilon$ data are more reliable, although there is a general lack
of validations in the smallest datasets ($\mathrm{SSE}_5$) which may suggest either a higher FDR or poor modelling
performance.

If only the smallest interaction experiments are considered, $\mathrm{SSE}_5$, the single coupon
model estimates a higher FDR than is found for the complete data. This suggests that the
$\mathrm{SSE}_5$ set is less reliable than the complete interaction set. However, it is more probable
that for $\mathrm{SSE}_5$ the scaling factor used, or the model’s dependence on multiple publications
detailing exactly the same results, is not as realistic as for the other datasets.

The single coupon model ignores the effect of experiment size. This will have a more

profound effect on the HTP results, as the single coupon model provides a better description

of data produced by smaller experiments. Section 5.2.4 described a multiple coupon

model that takes explicit account of experiment sizes, $\{r_{1,\mathrm{obs}}, r_{2,\mathrm{obs}}, \ldots, r_{q,\mathrm{obs}}\}$, (at the expense
of simplicity) which is now used to estimate the size and FDR for the same datasets.

Figure 5.5 shows the difference between the predicted FDR and size values for the multiple
and single coupon models. The figures show minimal changes to the solutions
when only smaller experiments are considered (in this example $\mathrm{SSE}_{100}$), whilst the effect
of using the $\mathrm{HTP}_{100}$ data is more pronounced.

[Figure 5.5, two panels; x-axis: interactome size (0 to 150,000), y-axis: false positive rate (0.0 to 0.8); curves: single sample, multiple samples. (a) $\mathrm{SSE}_{100} = \{P_k : r_{k,\mathrm{obs}} < 100\}$. (b) $\mathrm{HTP}_{100} = \{P_k : r_{k,\mathrm{obs}} \geq 100\}$.]

Figure 5.5: Single or multiple coupons. Plots show the differences between the interactome size,
ρm, and FDR, κ, estimates found using the single and multiple coupon models for: (a) $\mathrm{SSE}_{100}$ and (b)
$\mathrm{HTP}_{100}$ data. The single coupon results are shown in red and the multiple coupon results are shown in
blue. The two models differ little on the SSE data, whilst the difference between their predictions is larger
when considering the HTP experiments.

Figure 5.6 shows $\mathrm{HTP}_{100}$ and $\mathrm{SSE}_{100}$ results from the multiple coupon model, along
with the results for the full physical data shown in black. The complete data results
show only minor differences from the estimated FDR and sizes found for the single coupon
model (shown in Figure 5.2). The maximal interactome size is about 50% larger for
the $\mathrm{HTP}_{100}$ set than for the $\mathrm{SSE}_{100}$ set. For interactome size estimates from recent
publications of 20,000-40,000 the FDR estimates for each dataset are: 0.31-0.46 (all);
0.38-0.54 ($\mathrm{HTP}_{100}$); and 0.24-0.42 ($\mathrm{SSE}_{100}$). The lower FDR estimates relate to a higher
estimated interactome size.

[Figure 5.6, single panel; x-axis: interactome size (0 to 150,000), y-axis: false positive rate (0.0 to 0.8); curves: All, 100−, 100+.]

Figure 5.6: Multiple coupon interactome size results. Plot shows the FDR and size results for
the multiple coupon model. The results are shown for three datasets: complete physical data; $\mathrm{SSE}_{100}$; and
$\mathrm{HTP}_{100}$. This shows the differences between the data, with the maximal size, ρm, being over 50% more
for the $\mathrm{HTP}_{100}$ data in contrast with the $\mathrm{SSE}_{100}$ data.

5.3.3 Classification

Assuming that interactions are sampled by the single coupon model, a threshold can be
found to classify a reported interaction, e, as being from $E$ or $E'$ according to the number
of times reported, $V(e)$ (as set out in Section 5.2.5). In order to assess how the number of
interactions, $s_{\mathrm{obs}}$, influences the misclassification rates we use estimates for $m_{\mathrm{obs}}$, m and
κ from Section 5.3.1.

Two different sets of parameters for [ρm; κ] are used, taken from earlier results for the
single coupon model: [20,000; 0.47] and [40,000; 0.32]. The scaling factor, ρ, is 1.36 as
found for the complete BioGRID data. This is used to define the observable space of true
interactions, m, along with $m_{\mathrm{obs}}$. Equation 5.24 defines a $k^*$ for given $s_{\mathrm{obs}}$, $m_{\mathrm{obs}}$, m and
κ such that e is classified in $E$ or $E'$. From $k^*$, a $k_{\mathrm{int}}$ is found such that $e \in E$ if,

\[ V(e) \geq k_{\mathrm{int}}. \tag{5.29} \]

Tables 5.2-5.3 show how the misclassification rates change as the number of reported
interactions, $s_{\mathrm{obs}}$, changes. Using the currently available data, the model estimates that
between 41% and 64% of the true interactions can be found using repeated interaction
data. At the same time, the misclassification rate for this set is lower than 1%. If only
the 7 experiments which report more than 1,000 interactions were repeated then the use
of repeated information would be able to find approximately 88% of the observable true
interactions. This assumes that the repeated experiments would yield a similar number of
reported interactions.

s_obs     k_int   c_FP    [c_FP^0.05, c_FP^0.95]   c_TP    [c_TP^0.05, c_TP^0.95]
40,000    2       0.003   [0.002, 0.004]           0.424   [0.417, 0.431]
60,000    2       0.004   [0.003, 0.005]           0.638   [0.631, 0.644]
80,000    2       0.006   [0.004, 0.007]           0.784   [0.778, 0.789]
100,000   2       0.008   [0.006, 0.009]           0.876   [0.871, 0.880]
200,000   3       0       [0, 0]                   0.975   [0.973, 0.977]
300,000   3       0       [0, 0]                   0.999   [0.998, 0.999]
400,000   4       0       [0, 0]                   1       [0.999, 1]
500,000   4       0       [0, 0]                   1       [1, 1]
600,000   4       0       [0, 0]                   1       [1, 1]

Table 5.2: Classification performance for ρm = 20,000. Simulated classification performance
of $C(e \mid V(e))$ for parameters ρm = 20,000 and κ = 0.47. The $c_{FP}$ and $c_{TP}$ rates are shown as
$s_{\mathrm{obs}}$ increases, including the 95% sample range. $k_{\mathrm{int}}$ is the minimum number of validations, $V(e)$, required
for an interaction, e, to be classified as being sampled from $E$.

The results show that even if the reported interaction data have a high FDR of 0.47, replicated

experiments can be used to find the stochastic error that has been assumed to be

inherent in the HTP technologies. This is also without the use of any corroborating biological

evidence (as used to generate the CORE graph) on those interactions treated as

being sampled from the true interaction set E.


s_obs     k_int   c_FP    [c_FP^0.05, c_FP^0.95]   c_TP    [c_TP^0.05, c_TP^0.95]
40,000    2       0.001   [0, 0.002]               0.238   [0.234, 0.241]
60,000    2       0.001   [0.001, 0.002]           0.405   [0.400, 0.410]
80,000    2       0.002   [0.001, 0.002]           0.553   [0.548, 0.558]
100,000   2       0.002   [0.002, 0.003]           0.673   [0.669, 0.678]
200,000   2       0.007   [0.006, 0.007]           0.945   [0.943, 0.947]
300,000   2       0.014   [0.013, 0.015]           0.992   [0.992, 0.993]
400,000   3       0       [0, 0]                   0.995   [0.994, 0.996]
500,000   3       0       [0, 0]                   0.999   [0.999, 1]
600,000   3       0       [0, 0.001]               1       [1, 1]

Table 5.3: Classification performance for ρm = 40,000. Simulated classification performance
of $C(e \mid V(e))$ for parameters ρm = 40,000 and κ = 0.32. The $c_{FP}$ and $c_{TP}$ rates are shown as
$s_{\mathrm{obs}}$ increases, including the 95% sample range. $k_{\mathrm{int}}$ is the minimum number of validations, $V(e)$, required
for an interaction, e, to be classified as being sampled from $E$.
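
The true-positive column of Tables 5.2 and 5.3 can be roughly reproduced without simulation. The sketch below is an approximation assumed here, not the procedure used to generate the tables: it supposes that the (1 − κ) s_obs true reports fall uniformly on the observable true interactions (the interactome size ρm divided by the scaling factor ρ = 1.36), so that the number of reports V(e) of a true interaction is approximately Poisson distributed, and it takes the true-positive rate to be P(V(e) ≥ k_int). The function name is hypothetical. The false-positive column is not attempted, as it depends on the size of the false interaction set E′, which is not restated in this section.

```python
import math


def approx_true_positive_rate(s_obs, interactome_size, kappa, k_int, rho=1.36):
    """Poisson approximation (an assumption) to the proportion of observable
    true interactions reported at least k_int times."""
    m = interactome_size / rho        # observable true interactions
    lam = (1.0 - kappa) * s_obs / m   # mean number of reports per true interaction
    # P(V < k_int) for V ~ Poisson(lam)
    p_below = sum(math.exp(-lam) * lam ** v / math.factorial(v) for v in range(k_int))
    return 1.0 - p_below


# Rows of Table 5.2 (interactome size 20,000, kappa = 0.47) as (s_obs, k_int).
for s_obs, k_int in [(40_000, 2), (60_000, 2), (100_000, 2), (200_000, 3)]:
    rate = approx_true_positive_rate(s_obs, 20_000, 0.47, k_int)
    print(s_obs, k_int, round(rate, 3))
```

For the rows checked, the approximation lies within about 0.01 of the tabulated values, which suggests that the tabulated true-positive rates are driven mainly by the mean number of reports per observable true interaction.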

5.4 Discussion

Coupon models have been used to find the association between FDR and interactome
size for an interaction dataset. The models require validated interactions to form a non-negligible
proportion of the data in order to provide plausible results. The available
S. cerevisiae interaction data meet this requirement, forming a good dataset with a large number of validations.
Results suggest that the maximal FDR for the physical data is 0.6, and that given an
interactome size of 20,000-40,000 the FDR is 0.32-0.47. In contrast, the genetic interactome
is predicted to be several times bigger than the S. cerevisiae PIN, and less than a
third of its true interactions have been reported.

The coupon models require an FDR to predict the exact interactome size. However, the
model can be used to verify published estimates for either FDR or size, as they should
produce plausible estimates for the other parameters. For instance, reported FDRs for
HTP data (see Table 1.3 on page 61) have been published with a rate of over 0.9. The
coupon model, run on only the largest datasets ($\mathrm{HTP}_{1000}$), suggests that this is not possible
and the FDR is less than 0.8 even in an extreme case (where observable interactome
size, m, is minimal given the sample data).

The multiple coupon model produces a similar range of possible sizes to the single coupon
model, although the FDR estimates are higher in the single coupon model for a fixed
interactome size. For high estimates of interactome size (which only appear possible
when viewing the HTP data) the difference between the two models is larger, showing
the need to take account of the experiment size when only considering HTP data. The
fact that SSE data present a smaller maximal size of the interactome perhaps indicates a
difference in the coverage of these experiments when compared to the HTP results. The
scaling factor used (which is found using the same method for all datasets) should reflect
the size of the experiment in order to compensate for the apparently lower coverage of the
small experiments.

Repeated interaction information can be used to find a reference set of PPIs that have

not been classified by any biological characteristics. This requires uniform experimental

testing across the observable space of protein pairs, thus resulting in the uniform sampling

of true and false interactions that is assumed for the urn models. Whilst this may

be unrealistic, more recent HTP techniques present an opportunity for all the observable

protein pairs to be tested in this manner.

Uniform sampling of interactions, which is at least correct in the mean-field approximation,

has been assumed in the implementation of the coupon model. This assumption is

further supported by the technical set up of the larger experiments that contributed the

majority of the PPI data. However, the role of systematic error in any of the experimental

methods is consequently ignored. If the sampling is significantly skewed towards a

particular subset of proteins, or particular interactions, then the overall interactome size

estimates will be lower than in reality, as will the FDR estimates.

The sampling factor has also made an implicit assumption about the coverage of each of

the interaction datasets. This sampling factor means that it is assumed that each protein

has at least one true or false interaction. If no reported interaction data can be found for

a protein there is assumed to be no evidence that it has actually been tested. As a consequence

of this assumption, the interactome size estimates may be overstated. Similarly, if

studies do not test all protein pairs from the protein set inferred by the interaction reports,

then the size may be understated. However, without further information on negative results

from these studies, it is hard to effectively find the coverage for all the studies used.

This further signifies the need for the community to report the results of protein-protein

interaction studies more fully; in particular it is important to report protein pairs with

negative results.

Differences in how the protein pairs have been sampled may make the comparison of

error rates between SSE and HTP studies inaccurate. If the SSE size estimates are too

low, then the relative FDR will increase for a given interactome size, further reducing the

difference between the FDR seen for HTP and SSE data. Overall, the FDR is found to be
up to 50% larger in the biggest HTP experiments than in the smallest SSE experiments.

The coupon model also assumes that errors are stochastic in nature, rather than systematic.

This will lead to systematic errors being wrongly assigned as true interactions, as

they will almost certainly appear more often than stochastic errors. Ignoring this potential

set of errors will increase the interactome size estimate, whilst reducing the FDR presented

by the coupon models. In order to take account of systematic errors from different

techniques, the same multiple coupon model should be reapplied to all the data from each

technique. Given enough data, the amount of systematic error from each technique can

then be assessed and a more reliable interactome size estimate reached.

Over half a million sampled interactions may be required to classify all the reported interactions

correctly. This would be lower if the FDR could be reduced in experimental

replicates. However, these data could just be found by repeating already observed HTP

experiments around 10 times and reporting all the data separately so that validations can

be found. Then, if the scaling factor is appropriate, validations enable complete elucidation

of the (approximately 73%) PPIs that are currently observable. Then further inference

methods using biological characteristics, or new experimental methods, can use this PPI

reference set to fully elucidate the S. cerevisiae interactome.



Chapter 6

Conclusions

This chapter gives a summary of the work described in this thesis, the conclusions drawn,

and finally a general discussion of the areas which require further work.


6.1 Summary

This thesis has presented a collection of PIN based analyses, using the S. cerevisiae interaction

data as the illustrative example.

Chapter 2 presented the characteristics of currently available protein interaction data for

S. cerevisiae. Each experimental technique probes subtly different protein pairs and thus

also protein interactions. This makes a comparison of techniques fraught with difficulty

when assessing reliability. The S. cerevisiae PIN is relatively highly clustered forming a

graph with over 5,000 nodes.

Chapter 3 described and analysed the properties of random graph ensembles that retain

network and biological characteristics of the empirical data. Ensemble averages, for various

traits, were found to differ significantly depending on whether the degree distribution,

graph structure, or biological characteristics were fixed in the random ensembles. The

variability of trait statistics is larger when the adjacency matrix of the empirical graph

is fixed when compared to ER random graphs, showing the effect of a small number of

nodes with high connectivity. The use of biological and network characteristics may affect

subsequent analysis regarding how biological covariates are linked with PPIs and PINs.

A range of suitable graph ensembles should be tested when assessing trait associations in

order to appreciate better how these traits are linked to the observed graphs.

Chapter 4 used the random ensembles introduced in Chapter 3 to assess whether phylogenetic

topologies of S. cerevisiae proteins are linked to PPIs. Although they are found to be

more similar than the topologies of randomly selected protein pairs, if the random graph

ensemble fixes the network structure this linkage disappears. Accordingly, it is hard to

distinguish whether the topologies are linked to the PPIs, or to the network structure that

is found for the empirical data.

Chapter 5 described a model for estimating the interactome size, false discovery rate and

proportion of true interactions that have been found. This model showed that the physical

and genetic interactomes are of substantially different sizes. The current knowledge

of genetic interactions is more limited than that found for physical interactions. The

false-discovery rates found in HTP and SSE data are closer than previously thought, although
HTP data are noisier. However, replicated sampling can be used to eliminate errors
and a doubling of the currently available reported interactions would greatly increase the
number of true interactions that can be found using repeated information.


6.2 Conclusions

Graph structure plays an important role in the analysis of whether biological covariates

are significantly correlated with PINs. Underlying assumptions in studies regarding the

interactome’s structure, i.e. an ER or scale-free random graph, have the power to change

the conclusions about the significance of a variety of biological traits. The choice of null

model, without further knowledge about the true interactome, has to be clearly defined

and understood when used to assess the significance of traits found in the empirical data.

Some graph ensembles have been shown to generate random graphs that show very similar

biological properties to empirical data (Thorne and Stumpf, 2007). In these cases, if
the generation method is claimed to be linked to the biology, or possibly to explain the evolution of the
empirical data, there is a need to support the assertion more fully by showing that the graph
ensemble does produce more similar traits than would be expected by chance. Although

this is hard to define, depending on the ensemble method used, the graph distances between

the empirical and random graphs can be used to help with this assessment. A

graph distance measure, or comparison of the ensemble with those graphs that are a similar

distance away, may give more confidence in the biological relevance of the proposed

ensemble, or lead to the conclusion that the ensemble is just over-fitted to the empirical

data.

Reported evidence for co-evolution between PPIs (Goh et al., 2000) is not backed up by

an analysis of S. cerevisiae protein phylogenetic topologies. The network structure of the

reported data, and how it is used to define a null graph set, is key to the resulting level

of similarity found between the phylogenetic topologies of PPIs. Whilst, on average,

the similarity is higher for an observed PPI than for a random protein-pair, the similarity

is not significantly higher for the PIN when compared to networks with the same degree

distribution. There is a need to clearly differentiate between analyses of particular families

of PPIs, all PPIs, or features of the PIN in future work. Globally, the properties of protein

pairs may be too diverse and insufficiently specific or informative to reliably predict PPIs

when taken in isolation.

The available literature-curated data, although clearly more reliable than large-scale studies,
appear to have non-negligible error rates. HTP data, which have always been assumed
to be error prone (Schwartz et al., 2009), do however have coverage properties
which are better understood. Ignoring the tested set of protein pairs used in these studies
has led to an exaggeration of the noise found in HTP studies (Gentleman and Huber,
2007), a finding backed up by the analysis presented here. The release of raw results from these
HTP experiments will undoubtedly aid further research and help to improve the coupon
models presented here, as well as enabling better measurement of the false-positive and
false-negative rates.

The observable S. cerevisiae PPI network, under the assumption of an FDR lower than 0.5,

has fewer than 40,000 interactions. Of these interactions, around half have been reported in

BioGRID. PPI data have a significant level of false-positive data. The discovery of some

novel interactions perhaps requires new experimental methods or further replicated HTP

studies, although the level of error found in HTP is not too high to exclude the possibility

that the observable set of interactions can be found solely from these studies.

6.3 Further work

Graph ensemble work presented in this thesis has shown how differences in the assumed

graph structure can affect biological traits. This exploratory work should be extended

to include additional rewiring algorithms and biological characteristics in order to understand

better correlations between PPIs, PINs, and covariates. Distances between empirical
data and the random graphs enable an assessment of how similar the trait values
would be if any constraints were removed. This distance should be further used to assess
the effect of published ensembles upon the stability of traits found in the empirical data.
It will also provide a better guide as to how different the random graphs are from empirical
data. If some graph distance between empirical data and the constructed random graphs
is not used, there is a risk that reported results, and the credence given to particular PIN
generation methods, may be over-interpreted.

The diversity of experimental techniques, and differences in how PPIs are reported, make

comparison of SSE and HTP data difficult and potentially misleading. There is a need for

a repository of interaction data that quantifies all experimental results (where possible)

and lists both the set of reported positive interactions, and false interactions, when quantitative

information is unavailable. SSE and HTP data should be separated in this database

to enable easier analyses and also to contribute alternative training sets for future algorithms.

Phylogenetic topology results reiterate previous findings that have suggested a lack of

co-evolutionary signal found from mirrortree of protein interactions in the S. cerevisiae


PIN (Hakes et al., 2007a). These analyses should also be completed on PIN data from

other organisms. Together this will lead to a better understanding of how interactions are

retained when independent observations, in different organisms such as C. glabrata, of

the homologous proteins have been shown to either interact, or not interact. Analysis of

interaction data for closely related species will also provide further means of assessing

possible co-evolutionary effects between interacting proteins.

Finally, the coupon model provides a means of using all the reported interaction data to

assess the FDR and interactome size. The model can further be used to compare error rates

of different experimental techniques as well as data found from other model organisms.

It also can serve as a means of estimating the number of replicated HTP experiments

required to isolate the true interactions and false interactions that can be observed for each

set of conditions. The coupon model can also be further used to compare the reliability

of each biological technique for a given interactome, in order to help assess the value of

interaction data from any methodology.



Appendix A

Mathematical techniques

This appendix provides further information on the random graph techniques introduced

in Chapter 1.

A.1 Likelihood analysis of degree distributions

Suppose a probability model defines the observed degree distribution, $P(d(v_i) = k; \theta)$,
where $\theta$ are the model’s parameters. Maximum likelihood estimation can be used to
estimate the parameters which best reproduce the observed degree data,
$D = \{d(v_1), \ldots, d(v_n)\}$. The log-likelihood is defined as:

\[ \log(L(\theta)) = \sum_{i=1}^{n} \log(P(d(v_i) = k; \theta)). \tag{A.1} \]

In order to compare the degree distribution models, likelihoods are measured for each
model. Model selection is performed by comparing the likelihoods found for each distribution
after taking account of the different numbers of parameters in each model. As the
models are non-nested, the Akaike information criterion (AIC) is used to choose between
the models (Burnham and Anderson, 1998). AIC, the measure used to choose the best
model for $P(d(v_i) = k; \theta)$, is:

\[ \mathrm{AIC} = 2\left(-\log\left(L(\hat{\theta})\right) + d\right), \tag{A.2} \]

where $\hat{\theta}$ is the maximum likelihood estimate of $\theta$ and d is the number of parameters found
in the model.
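
As a minimal sketch of this model comparison, the Python below fits two candidate degree models by maximum likelihood and compares them using Equation A.2. The choice of a Poisson and a geometric model, the function names and the toy degree sequence are all assumptions made for illustration; they are not the models or data analysed in the thesis.

```python
import math


def poisson_loglik(degrees):
    """Log-likelihood of a Poisson model at its MLE (rate = mean degree)."""
    lam = sum(degrees) / len(degrees)
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in degrees)


def geometric_loglik(degrees):
    """Log-likelihood of a geometric model P(k) = (1 - p)^k p on k = 0, 1, ...
    at its MLE, p = 1 / (1 + mean degree)."""
    p = 1.0 / (1.0 + sum(degrees) / len(degrees))
    return sum(k * math.log(1.0 - p) + math.log(p) for k in degrees)


def aic(loglik, n_params):
    """Equation A.2: AIC = 2(-log L(theta_hat) + d)."""
    return 2.0 * (-loglik + n_params)


# Toy, heavy-tailed degree sequence (invented for illustration).
degrees = [1, 1, 1, 1, 2, 2, 3, 5, 8, 20]

for name, loglik in [("Poisson", poisson_loglik(degrees)),
                     ("geometric", geometric_loglik(degrees))]:
    print(name, round(aic(loglik, n_params=1), 2))
```

The model with the lower AIC is preferred. Both toy models have a single parameter, so here the comparison reduces to comparing the maximised likelihoods; the penalty term matters once candidate models differ in their number of parameters.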

A.2 Scaling degree random graphs

Graph ensemble models for generating graphs with scale-free properties have frequently

been proposed in the literature, taking a variety of guises (Aiello et al., 2000; Gkantsidis

et al., 2003; Li et al., 2005; Stumpf et al., 2007). Some of these methods have aimed to reproduce

the observed biological systems using pseudo-evolutionary schemas. Biological

concepts such as gene duplication events, the evolutionary divergence of similar genes,

or functional importance of particular genes have been used to justify concepts such as

preferential attachment and duplication-divergence. These methods start with a small network

(e.g. two nodes with a single edge) and define an iterative scheme for generating

network edges as more nodes are added.

These generative models use two primary features to produce graphs with scale-free degree

sequences. The different features used generate graphs with various levels of clustering,

dependent on the parameter values (the probability of an edge or duplication of a

node) (Chung et al., 2003).

(a) Preferential attachment. The first generative model is preferential attachment (PA)

where added nodes are more likely to connect to highly connected nodes (Dorogovtsev

et al., 2000). This forms a methodology of generating graphs that exhibit power-law

degree distributions. However, the aim of PA may not be to replicate the actual mechanism

that drove evolution of empirical data.

(b) Duplication-divergence. A second technique, duplication-divergence (DD), takes

some inspiration from actual duplication events found in biological systems (Chung et al.,

2003). The new node is assumed to be a duplicate of a node in the network, thus preserving

all its edges. The divergence stage then randomly mutates the edges that are preserved

by this duplication event, mirroring the role of duplication and specialisation of proteins

that has been observed in real biological systems.


(c) Duplication-attachment. Finally, duplication-attachment (DA) methods combine the

properties of PA and DD to generate random graphs. Nodes are duplicated, although there

is no divergence, and the inheritance of edges is random. Preferential attachment events

occur as new nodes are added to the graph.
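
The first two growth schemes can be sketched as follows. This is a simplified illustration rather than any of the cited implementations: both functions start from two nodes joined by a single edge, the PA version adds exactly one edge per new node, and the DD version models divergence by keeping each inherited edge independently with a hypothetical probability `keep_prob`.

```python
import random


def preferential_attachment(n_nodes, seed=0):
    """Grow a graph in which each new node links to one existing node chosen
    with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in 'stubs' once per incident edge, so a uniform draw
    # from 'stubs' is a degree-proportional draw over nodes.
    stubs = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(stubs)
        edges.append((new, target))
        stubs.extend([new, target])
    return edges


def duplication_divergence(n_nodes, keep_prob=0.5, seed=0):
    """Grow a graph by copying a random existing node and then losing each
    inherited edge with probability 1 - keep_prob (assumed divergence rule)."""
    rng = random.Random(seed)
    neighbours = {0: {1}, 1: {0}}
    for new in range(2, n_nodes):
        parent = rng.randrange(new)
        inherited = {v for v in sorted(neighbours[parent]) if rng.random() < keep_prob}
        neighbours[new] = inherited
        for v in inherited:
            neighbours[v].add(new)
    return neighbours


pa_edges = preferential_attachment(200)
dd_graph = duplication_divergence(200)
print(len(pa_edges), sum(len(nb) for nb in dd_graph.values()) // 2)
```

In this DD variant a new node can lose all of its inherited edges and remain isolated; published schemes differ in how they handle that case.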

The degree distributions typical of these growth models exhibit similar scaling properties

to observed PPIs. The models generate graphs which have a small collection of

hubs connected to the majority of the nodes in the graph that have a small number of

neighbours. Duplication-attachment produces a degree distribution most similar to the S.

cerevisiae protein interaction network, primarily as the highest degrees are lower than in

the preferential-attachment simulation.

There are a number of other methods, aside from these ‘evolution’ based schemes, that

create graphs with scale-free degree sequences, including: generalised random graphs;

power-law random graphs (Aiello et al., 2000); and random degree-preserving rewiring

(Gkantsidis et al., 2003). However, these are not discussed any further as all of these

methods have been found to be asymptotically equivalent (Li et al., 2005).

A.3 Exponential random graphs

Exponential random graph models (ERGMs) have been used to create graphs with the

same properties as empirical social graphs (Pattison and Wasserman, 1999; Robins et al.,

2007). Saul and Filkov (2007) have recently applied this technique to biological graphs.

Typically, we have access to some measurements regarding the observed graphs, including

either network or biological traits, and would like to make random graphs that share these

properties. This technique is similar to the graph ensembles used to study traits and

stability in Chapter 3.

Assuming that ERGMs form a reliable model of the data, they can be used for a variety

of reasons. Their parameters can be interpreted as conferring information on the relative

importance of traits, perhaps also enabling feature selection over the available traits when

attempting to model the data effectively. The ERGMs are also easily extendable, as extra

traits can be added to the model, enabling a better fit to the data.

Let G be a random graph, on n nodes, with adjacency matrix A. Let a be an observed

adjacency matrix of a graph. Now consider a series of traits, $\{z_1(a), z_2(a), \ldots, z_m(a)\}$.
These can be node or edge specific, including structural traits such as motifs. We assume
a log-linear model,

\[ \log(P(A = a)) \propto \theta_1 z_1(a) + \ldots + \theta_m z_m(a), \tag{A.3} \]

or, equivalently,

\[ P(A = a) = \frac{1}{\kappa(\theta)} \exp(\theta_1 z_1(a) + \ldots + \theta_m z_m(a)) = \frac{1}{\kappa(\theta)} \exp(\theta^\top z(a)), \tag{A.4} \]

where $\theta = [\theta_1, \ldots, \theta_m]^\top$, $z(a) = [z_1(a), \ldots, z_m(a)]^\top$, and $\kappa(\theta)$ is a normalising constant
which ensures that $P(A = a)$ is a true probability distribution.

The model can be related to a logistic regression model. Let $A_{i,j}$ be the element $(i, j)$ of
the matrix, whilst $A^c_{i,j}$ denotes the remaining entries. Then:

\[
\begin{aligned}
P(A_{i,j} = 1 \mid A^c_{i,j}) &= \frac{P(A_{i,j} = 1, A^c_{i,j})}{P(A^c_{i,j})} \\
&= \frac{P(A_{i,j} = 1, A^c_{i,j})}{P(A_{i,j} = 1, A^c_{i,j}) + P(A_{i,j} = 0, A^c_{i,j})} \\
&= \frac{P(A = a^+)}{P(A = a^+) + P(A = a^-)},
\end{aligned}
\tag{A.5}
\]

where $a^+$ is the graph where $a_{i,j} = 1$, and $a^-$ the graph where $a_{i,j} = 0$.

Then, from Equation A.4:

\[ P(A_{i,j} = 1 \mid A^c_{i,j}) = \frac{\exp(\theta^\top z(a^+))}{\exp(\theta^\top z(a^+)) + \exp(\theta^\top z(a^-))}. \tag{A.6} \]

Using the similar expression for $P(A_{i,j} = 0 \mid A^c_{i,j})$, we can write,

\[ \log\left(\frac{P(A_{i,j} = 1 \mid A^c_{i,j})}{P(A_{i,j} = 0 \mid A^c_{i,j})}\right) = \theta^\top \left(z(a^+) - z(a^-)\right), \tag{A.7} \]

where $\delta = z(a^+) - z(a^-)$ is a vector known as the change statistic. Note that Equation A.7
has a similar form to a logistic regression model.
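
The change statistic and the conditional edge probability of Equations A.6 and A.7 can be illustrated on a toy graph. In the sketch below the two traits (edge count and triangle count), the parameter values and the four-node graph are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations


def traits(adj):
    """z(a) = [number of edges, number of triangles] for a symmetric 0/1 matrix."""
    n = len(adj)
    edges = sum(adj[i][j] for i, j in combinations(range(n), 2))
    triangles = sum(adj[i][j] * adj[j][k] * adj[i][k]
                    for i, j, k in combinations(range(n), 3))
    return [edges, triangles]


def change_statistic(adj, i, j):
    """delta = z(a+) - z(a-), where a+ and a- set entry (i, j) to 1 and 0."""
    plus = [row[:] for row in adj]
    minus = [row[:] for row in adj]
    plus[i][j] = plus[j][i] = 1
    minus[i][j] = minus[j][i] = 0
    return [zp - zm for zp, zm in zip(traits(plus), traits(minus))]


def conditional_edge_probability(adj, i, j, theta):
    """Equation A.6 rewritten as a logistic function of theta^T delta."""
    logit = sum(t * d for t, d in zip(theta, change_statistic(adj, i, j)))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy graph: path 0-1-2 plus an isolated node 3.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0]]
theta = [-1.0, 0.5]  # assumed parameters for [edges, triangles]

print(change_statistic(adj, 0, 2))
print(conditional_edge_probability(adj, 0, 2, theta))
```

Adding the edge (0, 2) creates one edge and closes one triangle, so the change statistic is [1, 1] and the log-odds of that edge are the sum of the two parameters.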

A vector of change statistics is used to find the parameters. Using a logistic regressor for

the parameters assumes that the training data are independent, with no interdependence

amongst the nodes. Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE) (Snijders,

2002) incorporates dependencies into the estimation to eliminate this problem. An

ERGM has been used for PINs showing a good fit to the observed graph (Saul and Filkov,

2007).
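The change statistic of Equation A.7 can be computed directly for a small graph. The sketch below is illustrative only: it assumes two traits (edge and triangle counts) and hypothetical function names, and it evaluates the conditional edge probability of Equation A.6 in its logistic form. In a pseudo-likelihood fit, one such vector per dyad would form the rows of a logistic regression design matrix.

```python
import itertools
import math


def traits(adj):
    """z(a) = [edges, triangles] for a symmetric 0/1 adjacency matrix (illustrative traits)."""
    n = len(adj)
    edges = sum(adj[i][j] for i, j in itertools.combinations(range(n), 2))
    triangles = sum(
        adj[i][j] * adj[i][k] * adj[j][k]
        for i, j, k in itertools.combinations(range(n), 3)
    )
    return [edges, triangles]


def change_statistic(adj, i, j):
    """delta = z(a+) - z(a-), toggling only the dyad (i, j)."""
    plus = [row[:] for row in adj]
    minus = [row[:] for row in adj]
    plus[i][j] = plus[j][i] = 1
    minus[i][j] = minus[j][i] = 0
    return [zp - zm for zp, zm in zip(traits(plus), traits(minus))]


def conditional_edge_probability(adj, i, j, theta):
    """P(A_ij = 1 | rest) as the logistic function of theta^T delta."""
    log_odds = sum(t * d for t, d in zip(theta, change_statistic(adj, i, j)))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Star centred on node 2 (edges 0-2, 1-2, 2-3): adding 0-1 closes one triangle.
adj = [
    [0, 0, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
delta = change_statistic(adj, 0, 1)
```

Here `delta` holds one additional edge and one additional triangle, so the log-odds of the dyad (0, 1) is simply the sum of the two parameters.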

A.4 Further biological random graphs

Daudin et al. (2007) proposed an adaptation of the standard ER graph model that incorporates the heterogeneity observed among the nodes of biological graphs. Perhaps confusingly, they denoted these models ERMGs.

The ERMG model assumes that each node, $v$, of the graph, $G$, is in one of $Q$ clusters with prior probabilities $\{\alpha_1, \ldots, \alpha_Q\}$. A further variable describes the probability that nodes from different clusters are connected: let $\pi_{i,j}$ be the probability that a node from cluster $i$ has an edge with a node from cluster $j$. This additional variable allows control over the connectivity of the graph, allowing the model to generate highly clustered subgraphs whilst still having a sparse set of edges.
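As a minimal sketch of the generative side of this model, the code below draws a cluster for each node from the prior probabilities and then places each edge independently with the probability given by the pair of clusters. The function and argument names are hypothetical, and this illustrates only how a graph could be sampled, not the fitting procedure of Daudin et al. (2007).

```python
import random


def sample_ermg(n, alpha, pi, seed=None):
    """Draw cluster labels and an undirected edge set from a mixture model.

    alpha: prior cluster probabilities (length Q).
    pi:    Q x Q symmetric matrix; pi[q][r] is the edge probability between
           a node in cluster q and a node in cluster r.
    """
    rng = random.Random(seed)
    clusters = rng.choices(range(len(alpha)), weights=alpha, k=n)
    edges = set()
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < pi[clusters[u]][clusters[v]]:
                edges.add((u, v))
    return clusters, edges


# Two dense clusters joined by sparse between-cluster edges.
clusters, edges = sample_ermg(
    n=50, alpha=[0.5, 0.5], pi=[[0.6, 0.02], [0.02, 0.6]], seed=1
)
```

Choosing large diagonal and small off-diagonal entries of `pi` gives the behaviour described above: tightly connected groups within an overall sparse graph.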

To fit the model to data, an adapted Expectation-Maximisation (EM) algorithm is used together with the Bayesian Information Criterion (BIC) to determine the optimal number of clusters. Within each cluster, the degree distributions are then found to be better modelled as Poisson distributions than as the scale-free distributions observed over the complete graph. This also raises an additional point: subgraphs of a scale-free graph are not necessarily scale-free (Stumpf et al., 2005b), which perhaps limits the conclusions that can be drawn when attempting to model these systems using incomplete data.

Another random graph model assigns a distance function between different nodes and sets the probability of an edge according to the distance between the nodes. Higham et al. (2008) developed a geometric random graph model that embeds PPI data into Euclidean space, testing whether the edges occur according to some distance function. This algorithm suggested that two-dimensional Euclidean space was as effective as higher dimensional space to explain the connectivity found in empirical PINs.
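The forward direction of a geometric random graph can be sketched in a few lines. This example assumes the simplest distance rule (connect two nodes when their Euclidean distance is at most a fixed radius) and uniformly placed points in the unit cube; both are illustrative assumptions, and the code does not reproduce the embedding algorithm of Higham et al. (2008), which works in the opposite direction, from an observed graph to coordinates.

```python
import itertools
import math
import random


def geometric_random_graph(n, radius, dim=2, seed=None):
    """Place n points uniformly in [0, 1]^dim and join pairs within `radius`."""
    rng = random.Random(seed)
    points = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
    edges = {
        (u, v)
        for u, v in itertools.combinations(range(n), 2)
        if math.dist(points[u], points[v]) <= radius
    }
    return points, edges


points, edges = geometric_random_graph(n=200, radius=0.1, dim=2, seed=0)
```

Varying `dim` while holding the expected number of edges roughly fixed is one way to explore how much the embedding dimension matters for the resulting connectivity.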



Appendix B

Data tables for biological traits

This appendix includes further data and analysis to complement Chapter 2.

B.1 Experimental interaction techniques

This section presents some further information on the experimental techniques used to find and infer protein interactions. It covers the different types of computational interaction prediction method; a full list of the BioGRID experimental techniques; and the number of proteins with each annotation for each GO ontology.

Method                  Interaction  Association
Bayesian networks       Domain       Physical
Classification          Domain       Physical
Domain association      Domain       Physical
Domain pair exclusion   Domain       Physical
Gene co-expression      Protein      Functional
Gene neighbour          Protein      Functional
Phylogenetic profile    Protein      Functional
Rosetta stone           Protein      Functional
Sequence co-evolution   Domain       Functional
Synthetic lethality     Protein      Functional

Table B.1: Interaction prediction methodologies. This shows tools that can be used to predict

interactions or association between proteins or domains using a combination of in vivo and in silico

methods.


(a) Molecular function

Annotation Proteins

Structural molecule activity 204

Protein kinase activity 107

Transporter activity 248

Hydrolase activity 267

Transcription regulator activity 232

Oxidoreductase activity 155

Transferase activity 288

Isomerase activity 34

RNA binding 185

Phosphoprotein phosphatase activity 42

Peptidase activity 93

DNA binding 154

Translation regulator activity 39

Protein binding 300

Nucleotidyltransferase activity 62

Lyase activity 62

Ligase activity 73

Motor activity 17

Helicase activity 52

Enzyme regulator activity 116

Signal transducer activity 54

(b) Cellular component

Annotation Proteins

Ribosome 133

Nucleus 1289

Nucleolus 182

Plasma membrane 157

Mitochondrion 563

Vacuole 88

Peroxisome 43

Cytoplasm 1192

Cell wall 47

Membrane fraction 44

Endoplasmic reticulum 209

Mitochondrial membrane 109

Chromosome 100

Cytoskeleton 61

Microtubule organizing center 44

Membrane 89

Bud 81

Cell cortex 49

Endomembrane system 68

Golgi apparatus 87

Cytoplasmic membrane-bound vesicle 52

Site of polarized growth 70

Extracellular region 10

(c) Biological process

Annotation Proteins

Protein biosynthesis 219

Morphogenesis 14

Transcription 260

Transport 362

Organelle organization and biogenesis 121

Lipid metabolism 85

Meiosis 98

Electron transport 5

DNA metabolism 264

Amino acid and derivative metabolism 98

RNA metabolism 291

Ribosome biogenesis and assembly 140

Cell wall organization and biogenesis 92

Protein modification 271

Carbohydrate metabolism 72

Pseudohyphal growth 39

Cellular respiration 58

Cell budding 24

Vitamin metabolism 34

Protein catabolism 64

Cytoskeleton organization and biogenesis 99

Generation of precursor metabolites and energy 57

Nuclear organization and biogenesis 47

Vesicle-mediated transport 188

Cell cycle 111

Response to stress 147

Signal transduction 71

Sporulation 43

Cell homeostasis 41

Conjugation 44

Cytokinesis 51

Membrane organization and biogenesis 19

Table B.2: GO slim annotation classes. These tables detail the annotation classes for the three different Gene Ontology categories. For each class, the

number of S. cerevisiae proteins within that class is also given.


(a)

Physical method

Affinity Capture-MS

Affinity Capture-Western

Two-hybrid

Co-localization

FRET

Affinity Capture-RNA

Reconstituted Complex

Protein-peptide

Co-purification

Co-fractionation

Biochemical Activity

Co-crystal Structure

Far Western

Protein-RNA

(b)

Genetic method

Synthetic Lethality

Synthetic Growth Defect

Synthetic Rescue

Dosage Lethality

Dosage Growth Defect

Dosage Rescue

Phenotypic Enhancement

Phenotypic Suppression

Table B.3: BioGRID experimental methods. These tables show the experimental techniques that

produce either genetic or physical protein interactions.

B.2 Further Gene Ontology annotation analysis

Figure B.1 shows the proportion of matching GO slim annotations for interactions reported between 1990 and 2007. The contribution of HTP data is evident after 2000. This is associated with a lower level of similarity in annotations between the more recently reported interactions.

As HTP techniques started to produce interaction data, there has been a larger proportion of interactions between proteins of dissimilar annotations. This trend could be due to: (a) a higher false-positive rate; (b) a bias in the discovery of interactions in earlier literature; or (c) a need to assess the reliability of the GO annotations.

The choice between these explanations would be straightforward if matching annotations were found to be necessary for certain protein-protein interactions. However, evidence for this cannot come directly from prior assumed knowledge. Indeed, overconfidence in prior knowledge would contribute towards ignoring the possibilities that (b) or (c) may be resulting in the perceived errors, rather than the higher FDR implied by (a). Evidence of higher levels of matching annotations between interacting proteins should be shown through uniform random testing of protein pairs, or alternatively guided by validations of studies that probe the same (or complete) interaction space.

[Figure B.1 plot: matching proportions (0.0 to 1.0) of Component, Process and Function annotations for each year from 1990 to 2007, annotated with the number of interactions and studies per year.]

Figure B.1: GO slim matching annotations through time. This shows the proportions of matching GO slim annotations for reported interactions between 1990 and 2007. The numbers in (green, red) are the (interactions, studies) for each year.

Ptacek et al. (2005) published an HTP study, contributing 4,182 protein-protein interactions, that focused on protein phosphorylation. This is a regulatory mechanism for basic processes that is thought to affect up to 30% of proteins at any given time. This single study showed below-average annotation similarity, with function and process annotations matching for interactors in only 6% of those with known information, whilst the component annotations matched in 37% of cases. This biochemical activity study explains the low level of similarity for interactions reported in 2005. Biochemical activity studies show low levels of annotation similarity for function and process across all (240) available studies (shown in Figures 2.6-2.8). The additional 239 biochemical activity studies contribute a further 1,010 (19%) novel interactions, with the Ptacek et al. (2005) data contributing the majority of the data for this technique.


[Figure B.2 panels: (a) Molecular function (known function proportion); (b) Biological process (known process proportion); (c) Cellular component (known component proportion). Each panel plots a proportion from 0.0 to 1.0 for every physical and genetic interaction technique listed in Table B.3.]

Figure B.2: Known GO annotations for PPIs by method. This shows the proportion of known GO annotations found for PPIs reported by each experimental technique. Dashed lines show the average proportion across the complete genetic or physical interaction set. Bar density shows the p-value of a binomial proportion test, assessing similarity, between each technique and the genetic or physical dataset. FRET exhibits a far higher proportion of known GO annotations than other experimental techniques.



Appendix C

Graph ensemble output

This appendix details results for the CORE and DIP graphs to complement the LC results presented in the main text of Chapter 3. These show the same trait information presented for the LC graph data: GO annotation traits; the complex annotation matching trait; co-expression levels; and the clustering coefficient found for each of the ensemble methods.

[Figure C.1 panels: (a) Coexpression (average co-expression); (b) Complex (proportion matching complex annotations); (c) Function (proportion matching function annotations); (d) Process (proportion matching process annotations); (e) Component (proportion matching component annotations); (f) Clustering (average clustering coefficient). Each panel has an axis from 0.0 to 1.0 and one row per graph ensemble: Random graph; Node shuffle; Node shuffle [process], [component], [function] and [complex]; Network shuffle; Network shuffle [process], [component], [function] and [complex].]

Figure C.1: Graph ensemble traits for DIP data. Trait statistic values for the graph ensembles

shown as a proportion of trait found in DIP. The effects are similar to those seen for LC in Chapter 3. Node

shuffle ensembles exhibit higher variability and similar characteristics to random graph whilst network

shuffle ensembles replicate the trait values seen in the empirical data more closely.

[Figure C.2 panels: (a) Coexpression (average co-expression); (b) Complex (proportion matching complex annotations); (c) Function (proportion matching function annotations); (d) Process (proportion matching process annotations); (e) Component (proportion matching component annotations); (f) Clustering (average clustering coefficient). Each panel has an axis from 0.0 to 1.0 and one row per graph ensemble: Random graph; Node shuffle; Node shuffle [process], [component], [function] and [complex]; Network shuffle; Network shuffle [process], [component], [function] and [complex].]

Figure C.2: Graph ensemble traits for CORE data. Trait statistic values for the graph ensembles

shown as a proportion of trait found in CORE. The effects are similar to those seen for LC in

Chapter 3. Node shuffle ensembles exhibit higher variability and similar characteristics to random graph

whilst network shuffle ensembles replicate the trait values seen in the empirical data more closely.



Appendix D

Phylogenetic topology

Further results found using the methods presented in Chapter 4 are detailed in this appendix.

These include analysis of the E. coli PPIs based on the protein phylogenetic tree

topologies as well as how to find the number of possible bifurcating and multifurcating

trees for any number of lineages.

D.1 Phylogenetic topologies

The number of possible distinct phylogenetic topologies is determined by the number of species, or sequences, from which the phylogenetic tree is generated (Felsenstein, 2003). For rooted bifurcating trees the total number of possible topologies for $n \geq 2$ species, $T_n$, is:

$$T_n = \frac{(2n - 3)!}{2^{n-2} (n - 2)!} \tag{D.1}$$

Multifurcating rooted trees can have any degree at each internal node of the tree, so the number of different topologies on $n$ species is greater than that found for bifurcating trees. The total number of different multifurcating trees is the sum over the number of internal nodes, $m$, of:

$$T_{n,m} = (n + m - 2)\, T_{n-1,m-1} + m\, T_{n-1,m}, \tag{D.2}$$

for $m \in [1, n - 1]$, with $T_{n,1} = 1$ and $T_{n,m} = 0 \ \forall\, m \geq n$.

The maximum edit distance between two phylogenetic trees on $n$ species is defined by the recursion $M_{n+1} = M_n + (n - 1)$, with $M_3 = 2$. The possible topologies, and associated maximum scores, $M_n$, between distinct trees on $n$ species are shown in Table D.1.

Species  Bifurcating trees  Multifurcating trees  Maximum score
      1                  1                     1              −
      2                  1                     1              −
      3                  3                     4              2
      4                 15                    26              4
      5                105                   236              7
      6                945                 2,752             11
      7             10,395                39,208             16
      8            135,135               660,032             22
      9          2,027,025            12,818,912             29
     10         34,459,425           282,137,824             37

Table D.1: Number of topologies. The number of rooted labelled trees for n species.
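The counts in Table D.1 can be reproduced with a short sketch. It assumes the product-of-odd-numbers form, 1 × 3 × ... × (2n − 3), for rooted bifurcating trees, the recursion for multifurcating trees with a factor m on the second term, and a step of (n − 1) from M_n to M_{n+1}; these are the forms that agree with the tabulated values. The function names are hypothetical.

```python
import math
from functools import lru_cache


def bifurcating_trees(n):
    """Rooted, labelled bifurcating topologies on n species: 1 * 3 * ... * (2n - 3)."""
    return math.prod(range(1, 2 * n - 2, 2))


@lru_cache(maxsize=None)
def trees_with_internal_nodes(n, m):
    """Rooted, labelled trees on n >= 2 species with exactly m internal nodes."""
    if m == 1:
        return 1  # a single internal node joined to every species
    if m >= n:
        return 0  # too many internal nodes for n species
    return ((n + m - 2) * trees_with_internal_nodes(n - 1, m - 1)
            + m * trees_with_internal_nodes(n - 1, m))


def multifurcating_trees(n):
    """Sum over the number of internal nodes, m = 1, ..., n - 1."""
    if n == 1:
        return 1  # a single species gives one trivial tree
    return sum(trees_with_internal_nodes(n, m) for m in range(1, n))


def maximum_score(n):
    """Maximum score M_n, defined for n >= 3 with M_3 = 2."""
    if n < 3:
        return None
    score = 2
    for k in range(3, n):
        score += k - 1  # M_{k+1} = M_k + (k - 1)
    return score


rows = [
    (n, bifurcating_trees(n), multifurcating_trees(n), maximum_score(n))
    for n in range(1, 11)
]
```

Each tuple in `rows` corresponds to one row of Table D.1, with `None` standing in for the undefined maximum score at one and two species.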

D.2 Supplementary phylogenetic results

This section details the phylogenetic tree results for the three different empirical graphs

and tree construction methods. These show the same trait information presented for the

LC graph data, and PROML trees, in Chapter 4.

[Figure D.1 panels: (a) Topology matches (average topology matches); (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons (average orthologous species). Each panel has one row per graph ensemble: Random graph; Node shuffle; Node shuffle [process], [component], [function] and [complex]; Network shuffle; Network shuffle [process], [component], [function] and [complex].]

Figure D.1: Phylogeny results for DIP (PROML trees). Boxplots show the distribution of

a trait statistic for graph ensembles: matching topologies; topology score; topology similarity; average

lineages for topology comparisons.

[Figure D.2 panels: (a) Topology matches (average topology matches); (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons (average orthologous species). Each panel has one row per graph ensemble: Random graph; Node shuffle; Node shuffle [process], [component], [function] and [complex]; Network shuffle; Network shuffle [process], [component], [function] and [complex].]

Figure D.2: Phylogeny results for CORE (PROML trees). Boxplots show the distribution

of a trait statistic for graph ensembles: matching topologies; topology score; topology similarity; average

lineages for topology comparisons.

[Figure D.3 panels: (a) Topology matches (average topology matches); (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons (average orthologous species). Each panel has one row per graph ensemble: Random graph; Node shuffle; Node shuffle [process], [component], [function] and [complex]; Network shuffle; Network shuffle [process], [component], [function] and [complex].]

Figure D.3: Phylogeny results for LC (PAML trees). Boxplots show the distribution of a trait

statistic for graph ensembles: matching topologies; topology score; topology similarity; average lineages

for topology comparisons.

[Figure D.4 panels: (a) Topology matches (average topology matches); (b) Topology score (average score); (c) Topology similarity (average similarity); (d) Lineage comparisons (average orthologous species). Each panel has one row per graph ensemble: Random graph; Node shuffle; Node shuffle [process], [component], [function] and [complex]; Network shuffle; Network shuffle [process], [component], [function] and [complex].]

Figure D.4: Phylogeny results for LC (PARS trees). Boxplots show the distribution of a trait

statistic for graph ensembles: matching topologies; topology score; topology similarity; average lineages

for topology comparisons.


D.3 Escherichia coli phylogenetic trees

E. coli data (Pazos et al., 2005) produce similar results to those found for S. cerevisiae in Chapter 4. Using phylogenetic trees from 47 bacterial species, the fractions of matching topologies in the DIP network for E. coli and in the graph ensemble simulations broadly confirm the results obtained for S. cerevisiae. The phylogenetic profiles are more similar among pairs of interacting proteins than among protein pairs found in graphs drawn from the network shuffle and node shuffle graph ensembles.

There is little evidence that the topology of the protein phylogeny provides any indication of an interaction in general. The protein phylogenetic topologies show more matches in the random graphs than in the empirical data, as shown in Figure D.5. Node shuffle results show the most matching topologies, as a result of a large proportion of interacting proteins sharing homologues in few species. This results in more matches as there are fewer possible topologies. However, the results are similar to those found for yeast across the data.

A similar proportion of protein pairs match in empirical graphs and when two proteins’ trees are randomly compared on the same number of shared homologous sequences (so the complexity of the trees is identical). The node shuffle graph ensemble tends to generate graphs with a lower propensity for phylogenetic topologies of proteins to match, whereas the graphs drawn from the network shuffle ensemble tend to produce more matches. This suggests that the trees which match are generally found from edges involving more highly connected nodes.

[Figure D.5 plot: proportion of matches (0.0 to 0.5) against the number of shared homologues (3 to 10), with series labelled Net Shuffle and Tree Shuffle.]

Figure D.5: Phylogenetic topology matches for E. coli data. This shows the output for E.

coli interaction data, showing a lack of difference between the two types of graph ensemble method and

the empirical graph over comparisons made on the same number of shared homologues. Network shuffle

results are shown in yellow (Net Shuffle) and node shuffle results in green (Tree Shuffle).



Appendix E

Sampling schemes

The sampling scheme used in Chapter 5 assumes that each interaction is reported uniformly across all true interactions found in the complete set of protein pairs, $E_\Omega$. This edge sampling approach means that the reported interactions can be modelled as samples from an urn of interactions. However, in order to estimate the interactome size, an assessment of the proportion of $E_\Omega$ contained in either urn, that is, the coverage of the experimental techniques, is required.

As the reported non-interactions are not known, sampling schemes are used to model the coverage of experimental techniques based on reported interactions. These sampling schemes can then be used to find estimates of the error rates described in Table 1.2. Three different schemes are introduced that may describe how experimental procedures have tested possible protein pairs: node sampling, in which a set of proteins is selected at random and the pairs tested are the combinations of proteins within this set; edge sampling, in which reported interactions are picked at random from the true network, $G$, and, according to some rate, $f$, also from the non-interactions, $G'$; and edge discovery, in which either fixed proteins or particular protein pairs are preferentially tested.

Node sampling

Each experiment, $P$, proposes a set of proteins, $V_P$, to probe for possible interactions. The sampled region is the complete network on these proteins, made up of $G_P \sim (V_P, E_P)$, with the reported interactions $E_P \subseteq \{(v_i, v_j) : v_i, v_j \in V_P\}$. The number of individual interactions tested is $\binom{|V_P|}{2}$ and the non-interactions are assumed to be $G'_P \sim (V_P, K_{V_P} \setminus E_P)$, where $K_{V_P}$ is the complete graph on the set of nodes $V_P$.

To estimate this from a reported experiment, $P$ (if no further evidence exists), the sample set is approximated by using the set of proteins that have an interaction to generate $V_P$. This will possibly be a subset of those proteins actually sampled in the study, understating the overall set of protein pairs tested.

Figure E.1: Node sampling. This shows how interactions are discovered using node sampling. The figures show: (a) complete set of proteins (200 nodes); (b) true underlying interaction network (144 edges on 200 nodes); (c) proteins sampled for each study (coloured by study); (d) interactions tested in a study (forming the coverage); (e) true interactions tested in studies; (f) true interactions not tested.

Figure E.1 shows how a set of studies, covering a large subset of the proteome, may only test a small subset of the true interactions. In this case, the proteins tested have been given, rather than inferred from the reported positive interactions.
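A single node-sampling study can be simulated as follows. This is a minimal sketch under stated assumptions: the study tests every pair among its sampled proteins and reports each true interaction without error, and all function names are hypothetical. The second function shows why inferring the sampled proteins from the reported interactions alone can understate the coverage.

```python
import itertools
import random


def node_sampling(true_edges, proteins, sample_size, seed=None):
    """Simulate one study: sample proteins V_P, then test all pairs among them.

    true_edges is a set of frozenset pairs; error-free reporting is assumed.
    """
    rng = random.Random(seed)
    sampled = set(rng.sample(sorted(proteins), sample_size))  # V_P
    tested = {
        frozenset(pair) for pair in itertools.combinations(sorted(sampled), 2)
    }
    reported = tested & true_edges          # E_P
    non_interactions = tested - true_edges  # tested pairs with no interaction
    return sampled, tested, reported, non_interactions


def inferred_proteins(reported):
    """Proteins recoverable from reported interactions alone (a subset of V_P)."""
    return {protein for edge in reported for protein in edge}


# Toy network: a chain of ten proteins, with five sampled by the study.
true_edges = {frozenset((i, i + 1)) for i in range(9)}
sampled, tested, reported, non_interactions = node_sampling(
    true_edges, range(10), sample_size=5, seed=0
)
```

Comparing `len(inferred_proteins(reported))` with `len(sampled)` gives a direct view of how much of the tested space is hidden when only positive results are published.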


Edge sampling

Unlike node sampling, this scheme supposes that interactions are sampled directly, rather than nodes being chosen as testing candidates. This could be hypothesis driven, according to biological knowledge or other relationships. Each interaction is assumed to be independent and the discovery of any interaction is viewed as a random process that is not influenced by the set of proteins being assessed. Figure E.2 shows an example of how this would find true interactions if we assume that the FDR is zero.

Figure E.2: Edge sampling. This shows how interactions are discovered using edge sampling, assuming zero FDR. The figures show: (a) complete space of possible proteins; (b) true underlying interaction network (144 edges on 200 nodes); (c) edges discovered through randomly sampling interactions (coloured by study); (d) remaining edges not sampled.

Edge sampling is the easiest sampling scheme to use when interested solely in interaction validations and their ability to classify putative interactions, as used in Chapter 5. However, it requires extra information to determine the amount of non-interaction testing that has been completed. The set of tested edges cannot be reconstructed solely from the experiment data, $P$, and its associated interaction network, $G_P \sim (V_P, E_P)$, as no assumptions are made about the unseen protein pairs that have been tested.

One means of assessing the testing completed on non-interactions is to assume a non-zero
FDR, as is apparent in the true data. Now the reported interaction graph is a subsample
of true interactions, $G$, and non-interactions, $G'$. Given a reference set of reported interactions
and non-interactions (which could be obtained using results from Chapter 5),
an assessment of the coverage can be obtained. The non-interactions should be sampled
uniformly across the proteome, so an assessment of the testing of each individual protein
can be reconstructed from this non-interaction reference set. Bias in the testing of non-interactions,
found through those that have been reported, can be used to estimate the set
of tested proteins, and thus the coverage of the testing across the whole proteome.

Edge discovery Figure E.3 shows a sampling scheme where each interaction reported

is local to others reported. For reported interactions the proteins involved are more likely

to be studied further or individual fixed proteins are tested against a large set of other

proteins. This biological bias would result in increased sampling of specific proteins and

perhaps their neighbours.

Figure E.3: Edge discovery. This shows how interactions are discovered using edge discovery,
assuming zero FDR. This discovers the collection of interactions by moving from known interactions to local
protein pairs. The figures show: (a) complete space of possible proteins; (b) true underlying interaction
network (144 edges on 200 nodes); (c) edges discovered through edge discovery; (d) edges not sampled.
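Edge discovery can be sketched as a growth process on the true network. The code below is one possible reading, stated as an assumption: starting from a single seed interaction, each step reports one further true interaction that touches a protein already seen. The uniform random network, the step count and the function names are illustrative choices, not the procedure behind the figure.

```python
import random
from itertools import combinations


def random_network(n_nodes, n_edges, rng):
    """Return n_edges distinct undirected edges (u < v) on n_nodes nodes."""
    return set(rng.sample(list(combinations(range(n_nodes), 2)), n_edges))


def edge_discovery(true_edges, n_steps, rng):
    """Grow the reported edge set outwards from one seed interaction (zero FDR)."""
    neighbours = {}
    for u, v in true_edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    seed = rng.choice(sorted(true_edges))
    discovered = {seed}
    known = set(seed)                      # proteins seen in a reported interaction
    for _ in range(n_steps):
        # True interactions that touch a known protein and are not yet reported.
        frontier = sorted({
            (min(u, v), max(u, v))
            for u in known
            for v in neighbours[u]
            if (min(u, v), max(u, v)) not in discovered
        })
        if not frontier:                   # the local neighbourhood is exhausted
            break
        edge = rng.choice(frontier)
        discovered.add(edge)
        known.update(edge)
    return discovered


rng = random.Random(2)
true_edges = random_network(200, 144, rng)
discovered = edge_discovery(true_edges, n_steps=50, rng=rng)
print(len(discovered), "edges discovered around the seed")
```

Every reported edge is attached to the edges reported before it, so sampling stays concentrated around the seed and never reaches other parts of the network.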

The reported interactions, from edge discovery, can easily generate hub proteins. This
may explain some perceived false-positive proteins such as YBR111W-A (a protein involved
in mRNA export coupled transcription activation) that appears in only 4 studies

but has over 100 unique reported interactions. Millson et al. (2005) also searched the S.

cerevisiae proteome for interactors of HSP90, finding 125 interactions. The tested set of

protein pairs for these studies is the same as the interactions inferred from the protein

complex spoke model. This is in contrast to the matrix interpretation presented by the

node sampling scheme.
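The difference between the spoke and matrix interpretations of the tested set can be made concrete by counting pairs. In this sketch the protein labels are placeholders and `spoke_pairs` and `matrix_pairs` are hypothetical helper names. The spoke set pairs one fixed protein with each candidate, while the matrix set contains every pair among the same proteins.

```python
from itertools import combinations


def spoke_pairs(bait, candidates):
    """Pairs tested when one fixed protein is screened against each candidate."""
    return {frozenset((bait, c)) for c in candidates if c != bait}


def matrix_pairs(proteins):
    """Pairs tested when every protein is tested against every other."""
    return {frozenset(pair) for pair in combinations(sorted(set(proteins)), 2)}


bait = "BAIT"                              # placeholder label, not a real protein
candidates = ["P1", "P2", "P3", "P4"]      # placeholder labels

spoke = spoke_pairs(bait, candidates)
matrix = matrix_pairs([bait] + candidates)
print(len(spoke), "spoke pairs;", len(matrix), "matrix pairs")
```

With n candidates the spoke set has n pairs and the matrix set has (n + 1)n/2, so treating a single-protein screen as a matrix would greatly overstate the number of protein pairs tested.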



References

Aebersold, R and Mann, M, 2003. Mass spectrometry-based proteomics. Nature 422:198–207. (page 68)

Agrafioti, I, Swire, J, Abbott, J, Huntley, D, Butcher, S, and Stumpf, MPH, 2005. Comparative analysis

of the Saccharomyces cerevisiae and Caenorhabditis elegans protein interaction networks. BMC Evolutionary

Biology 5:23. (pages 33, 40, 87, 119, and 137)

Aiello, W, Chung, F, and Lu, L, 2000. A random graph model for massive graphs. Proceedings of the ACM

Symposium on Theory of Computing 171–180. (pages 51, 168, and 169)

Albert, I and Albert, R, 2004. Conserved network motifs allow protein-protein interaction prediction.

Bioinformatics 20:3346–3352. (page 40)

Albert, R and Barabasi, AL, 2000. Topology of evolving networks: local events and universality. Physical

Review Letters 85:5234–5237. (page 50)

Allen, SCH, Byron, A, Lord, JM, Davey, J, Roberts, LM, and Ladds, G, 2005. Utilisation of the budding

yeast Saccharomyces cerevisiae for the generation and isolation of non-lethal ricin A chain variants.

Yeast 22:1287–1297. (page 26)

Alm, E and Arkin, A, 2003. Biological networks. Current Opinion in Structural Biology 13:193–202.

(page 21)

Almaas, E, 2007. Biological impacts and context of network theory. Journal of Experimental Biology

210:1548–1558. (page 50)

Aloy, P and Russell, RB, 2002. Potential artefacts in protein-interaction networks. FEBS Letters 530:253–

254. (page 52)

Aloy, P and Russell, RB, 2006. Structural systems biology: modelling protein interactions. Nature Reviews

Molecular Cell Biology 7:188–197. (page 52)

Altschul, S, Gish, W, Miller, W, Myers, E, and Lipman, D, 1990. Basic Local Alignment Search Tool.

Journal of Molecular Biology 215:403–410. (page 27)

Andrews, D and Demidov, A, 1999. Resonance Energy Transfer. Wiley. (page 35)

Ashburner, M, Ball, C, Blake, J, and Botstein, D, 2000. Gene ontology: tool for the unification of biology.

The Gene Ontology Consortium. Nature Genetics 25:25–29. (pages 47 and 89)

Bader, GD, Donaldson, I, Wolting, C, and Ouellette, B, 2001. BIND—The Biomolecular Interaction Network

Database. Nucleic Acids Research 29:242–245. (page 64)

Bader, GD and Hogue, CWV, 2002. Analyzing yeast protein-protein interaction data obtained from different

sources. Nature Biotechnology 20:991–997. (pages 38 and 61)



Bader, JS, Chaudhuri, A, Rothberg, J, and Chant, J, 2004. Gaining confidence in high-throughput protein

interaction networks. Nature Biotechnology 22:78–85. (pages 40, 46, and 68)

Bader, S, Kühner, S, and Gavin, AC, 2008. Interaction networks for systems biology. FEBS Letters

582:1220–1224. (page 34)

Bangert, RK, Turek, RJ, Rehill, B, Wimp, GM, Schweitzer, JA, et al., 2006. A genetic similarity rule

determines arthropod community structure. Molecular Ecology 15:1379–1391. (page 26)

Barabasi, AL and Albert, R, 1999. Emergence of scaling in random networks. Science 286:509–512.

(pages 43, 50, and 51)

Barabasi, AL and Oltvai, Z, 2004. Network biology: understanding the cell’s functional organization.

Nature Reviews Genetics 5:101–113. (pages 21, 45, 50, and 52)

Batada, NN, Hurst, LD, and Tyers, M, 2006. Evolutionary and physiological importance of hub proteins.

PLoS Computational Biology 2:e88. (page 50)

Batada, NN, Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, et al., 2006. Stratus not altocumulus: a

new view of the yeast protein interaction network. PLoS Biology 4:e317. (page 50)

Ben-Hur, A and Noble, WS, 2005. Kernel methods for predicting protein-protein interactions. Bioinformatics

21:i38–i46. (pages 40, 46, and 52)

Bender, E and Canfield, ER, 1978. The asymptotic number of labeled graphs with given degree sequences.

Journal of Combinatorial Theory, Series A 24:296–307. (pages 89, 90, and 93)

Berg, J and Lassig, M, 2004. Local graph alignment and motif search in biological networks. Proceedings

of the National Academy of Sciences 101:14689–14694. (page 87)

Bhardwaj, N and Lu, H, 2005. Correlation between gene expression profiles and protein-protein interactions

within and across genomes. Bioinformatics 21:2730–2738. (pages 37, 46, and 89)

Boone, C, Bussey, H, and Andrews, BJ, 2007. Exploring genetic interactions and networks with yeast.

Nature Reviews Genetics 8:437–449. (page 24)

Bork, P, Jensen, LJ, von Mering, C, Ramani, AK, Lee, I, and Marcotte, EM, 2004. Protein interaction

networks from yeast to human. Current Opinion in Structural Biology 14:292–299. (page 40)

Breitkreutz, BJ, Stark, C, Reguly, T, Boucher, L, Breitkreutz, A, et al., 2008. The BioGRID Interaction

Database: 2008 update. Nucleic Acids Research 36:D637–D640. (pages 34 and 65)

Brizzio, V, Khalfan, W, Huddler, D, Beh, CT, Andersen, SS, et al., 1999. Genetic interactions between

KAR7/SEC71, KAR8/JEM1, KAR5, and KAR2 during nuclear fusion in Saccharomyces cerevisiae.

Molecular Biology of the Cell 10:609–626. (page 69)

Bruggeman, FJ and Westerhoff, HV, 2007. The nature of systems biology. Trends in Microbiology 15:45–

50. (page 21)

Bunge, J and Fitzpatrick, M, 1993. Estimating the Number of Species: A Review. Journal of the American

Statistical Association 88:364–373. (page 142)

Burnham, K and Anderson, DR, 1998. Model Selection and Inference: A Practical Information-Theoretic

Approach. Springer. (pages 82 and 167)

Camon, EB, Barrell, DG, Lee, V, Dimmer, E, and Apweiler, R, 2004. The Gene Ontology Annotation

(GOA) Database–an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico

Biology 4:5–6. (pages 34 and 47)



Carlson, J and Doyle, J, 1999. Highly optimized tolerance: A mechanism for power laws in designed

systems. Physical Review E 60:1412–1427. (pages 43 and 50)

Chao, A, 2001. An overview of closed capture-recapture models. Journal of Agricultural, Biological, and

Environmental Statistics 6:158–175. (page 142)

Chen, PY, Deane, CM, and Reinert, G, 2007. A statistical approach using network structure in the prediction

of protein characteristics. Bioinformatics 23:2314–2321. (page 87)

Chiang, T, Scholtens, D, Sarkar, D, and Gentleman, R, 2007. Coverage and error models of protein-protein

interaction data by directed graph analysis. Genome Biology R186. (pages 52, 59, and 139)

Chinnasamy, A, Mittal, A, and Sung, WK, 2006. Probabilistic prediction of protein-protein interactions

from the protein sequences. Computational Biological Medicine 36:1143–1154. (page 70)

Cho, R, Campbell, M, Winzeler, E, and Steinmetz, L, 1998. A Genome-Wide Transcriptional Analysis of

the Mitotic Cell Cycle. Molecular Cell 2:65–73. (page 89)

Chung, F, Lu, L, Dewey, T, and Galas, D, 2003. Duplication models for biological networks. Journal of

Computational Biology 10:677–687. (page 168)

Collins, SR, Kemmeren, P, Zhao, XC, Greenblatt, JF, Spencer, F, et al., 2007. Toward a comprehensive atlas

of the physical interactome of Saccharomyces cerevisiae. Molecular & Cellular Proteomics 6:439–450.

(pages 23, 34, and 66)

Collins, SR, Miller, KM, Maas, NL, Roguev, A, Fillingham, J, et al., 2007. Functional dissection of protein

complexes involved in yeast chromosome biology using a genetic interaction map. Nature 446:806–810.

(page 66)

Copley, RR, 2008. The animal in the genome: comparative genomics and evolution. Philosophical Transactions

of the Royal Society B 363:1453–1461. (page 139)

Cusick, ME, Yu, H, Smolyar, A, Venkatesan, K, Carvunis, AR, et al., 2009. Literature-curated protein

interaction datasets. Nature Methods 6:39–46. (pages 34 and 36)

Daudin, JJ, Picard, F, and Robin, S, 2007. A mixture model for random graphs. Statistics for Systems

Biology Group 5840. (pages 51 and 171)

de Silva, E, Thorne, T, Ingram, PJ, Agrafioti, I, Swire, J, et al., 2006. The effects of incomplete protein

interaction data on structural and evolutionary inferences. BMC Biology 4:39. (pages 95 and 137)

Deane, CM, Salwinski, L, Xenarios, I, and Eisenberg, D, 2002. Protein interactions: two methods for

assessment of the reliability of high throughput observations. Molecular & Cellular Proteomics 1:349–

356. (pages 40, 47, 52, 65, and 79)

Deng, M, Sun, F, and Chen, T, 2003. Assessment of the reliability of protein-protein interactions and protein

function prediction. Pacific Symposium on Biocomputing 140–151. (page 57)

D’haeseleer, P and Church, G, 2004. Estimating and improving protein interaction error rates. Proceedings

of the IEEE Computational Systems Bioinformatics Conference. (pages 53, 56, 57, 59, 61, 74, and 139)

Dorogovtsev, SN and Mendes, JFF, 2001. Evolution of Networks. arXiv 0106144. (pages 45 and 49)

Dorogovtsev, SN, Mendes, JFF, and Samukhin, AN, 2000. Structure of growing networks with preferential

linking. Physical Review Letters 85:4633–4636. (page 168)



Doyle, J, Alderson, D, Li, L, Low, S, Roughan, M, et al., 2005. The “robust yet fragile” nature of the

Internet. Proceedings of the National Academy of Sciences 102:14497–14502. (pages 50 and 85)

Drummond, DA, Raval, A, and Wilke, CO, 2006. A single determinant dominates the rate of yeast protein

evolution. Molecular Biology and Evolution 23:327–337. (pages 33 and 137)

Eisen, MB, Spellman, PT, Brown, PO, and Botstein, D, 1998. Cluster analysis and display of genome-wide

expression patterns. Proceedings of the National Academy of Sciences 95:14863–14868. (page 47)

Erdös, P and Rényi, A, 1959. On random graphs. Publicationes Mathematicae Debrecen 6:290–297.

(pages 48 and 49)

Felsenstein, J, 1984. Distance Methods for Inferring Phylogenies: A Justification. Evolution 38:16–24.

(page 27)

Felsenstein, J, 1995. PHYLIP (Phylogeny Inference Package), version 3.57c. University of Washington.

(pages 30 and 120)

Felsenstein, J, 2003. Inferring Phylogenies. Sinauer Associates. (pages 122 and 180)

Fitzpatrick, D, Logue, M, Stajich, J, and Butler, G, 2006. A fungal phylogeny based on 42 complete

genomes derived from supertree and combined gene analysis. BMC Evolutionary Biology 6:99.

(page 120)

Fraser, HB, Hirsh, A, Steinmetz, L, Scharfe, C, and Feldman, M, 2002. Evolutionary rate in the protein

interaction network. Science 296:750–752. (pages 87 and 118)

Freifelder, D, 1982. Physical Biochemistry: Applications to Biochemistry and Molecular Biology. W.H.

Freeman. (page 35)

Gaczynska, M, Osmulski, PA, Jiang, Y, Lee, JK, Bermudez, V, and Hurwitz, J, 2004. Atomic force microscopic

analysis of the binding of the Schizosaccharomyces pombe origin recognition complex and the

spOrc4 protein with origin DNA. Proceedings of the National Academy of Sciences 101:17952–17957.

(page 35)

Gavin, AC, Aloy, P, Grandi, P, Krause, R, Boesche, M, et al., 2006. Proteome survey reveals modularity of

the yeast cell machinery. Nature 440:631–636. (pages 21, 38, 66, 83, and 89)

Gavin, AC, Bosche, M, Krause, R, Grandi, P, Marzioch, M, et al., 2002. Functional organization of the

yeast proteome by systematic analysis of protein complexes. Nature 415:141–147. (pages 34, 36, 57,

61, 66, and 80)

Gentleman, R and Huber, W, 2007. Making the most of high-throughput protein-interaction data. Genome

Biology 8:112. (pages 59 and 164)

Gertz, J, Elfond, G, Shustrova, A, Weisinger, M, Pellegrini, M, et al., 2003. Inferring protein interactions

from phylogenetic distance matrices. Bioinformatics 19:2039–2045. (pages 30, 31, 40, and 118)

Gilchrist, MA, Salter, LA, and Wagner, A, 2004. A statistical framework for combining and interpreting

proteomic datasets. Bioinformatics 20:689–700. (page 21)

Gilles, C, Rousseau, P, Rouge, P, and Payan, F, 1996. Crystallization and preliminary x-ray analysis of

pig porcine pancreatic alpha-amylase in complex with a bean lectin-like inhibitor. Acta Crystallography

581–582. (page 24)



Gkantsidis, C, Mihail, M, and Zegura, E, 2003. The Markov Chain Simulation Method for Generating

Connected Power Law Random Graphs. Proceedings of the SIAM Alenex 16–25. (pages 51, 168,

and 169)

Goh, CS, Bogan, A, Joachimiak, M, Walther, D, and Cohen, F, 2000. Co-evolution of Proteins with their

Interaction Partners. Journal of Molecular Biology 299:283–293. (pages 31, 118, 137, and 164)

Goh, CS and Cohen, F, 2002. Co-evolutionary Analysis Reveals Insights into Protein-Protein Interactions.

Journal of Molecular Biology 324:177–192. (pages 30, 118, and 137)

Grigoriev, A, 2003. On the number of protein-protein interactions in the yeast proteome. Nucleic Acids

Research 31:4157–4161. (pages 57, 58, 61, 139, and 140)

Güldener, U, Münsterkötter, M, Oesterheld, M, Pagel, P, Ruepp, A, et al., 2006. MPact: the MIPS protein

interaction resource on yeast. Nucleic Acids Research 34:D436–D441. (pages 58 and 64)

Gurunathan, S, David, D, and Gerst, JE, 2002. Dynamin and clathrin are required for the biogenesis of a

distinct class of secretory vesicles in yeast. The EMBO Journal 21:602–614. (page 35)

Hakes, L, Lovell, SC, Oliver, SG, and Robertson, DL, 2007. Specificity in protein interactions and its

relationship with sequence diversity and coevolution. Proceedings of the National Academy of Sciences

104:7999–8004. (pages 32, 33, 121, 137, and 166)

Hakes, L, Pinney, JW, Lovell, SC, Oliver, SG, and Robertson, DL, 2007. All duplicates are not equal: the

difference between small-scale and genome duplication. Genome Biology 8:R209. (page 28)

Hakes, L, Robertson, DL, Oliver, SG, and Lovell, SC, 2007. Protein interactions from complexes: a structural

perspective. Comparative and Functional Genomics 2007:5. (page 38)

Hamming, R, 1950. Error detecting and error correcting codes. Bell System Technical Journal 29:2.

(page 96)

Harkness, TAA, Davies, GF, Ramaswamy, V, and Arnason, TG, 2002. The ubiquitin-dependent targeting

pathway in Saccharomyces cerevisiae plays a critical role in multiple chromatin assembly regulatory

steps. Genetics 162:615–632. (page 69)

Hart, GT, Ramani, AK, and Marcotte, EM, 2006. How complete are current yeast and human protein-interaction

networks? Genome Biology 7:120. (pages 21, 59, and 61)

Harvey, P, Colwell, R, Silvertown, J, and May, R, 1983. Null Models in Ecology. Annual Reviews Ecology

and Systematics 14:189–211. (page 87)

He, X and Zhang, J, 2006. Why do hubs tend to be essential in protein networks? PLoS Genetics 2:e88.

(pages 43 and 50)

Hermjakob, H, Montecchi-Palazzi, L, and Lewington, C, 2004. IntAct: an open source molecular interaction

database. Nucleic Acids Research 32:D452–D455. (page 34)

Higham, D, Rasajski, M, and Przulj, N, 2008. Fitting a geometric graph to a protein-protein interaction

network. Bioinformatics 24:1093–1099. (pages 51, 80, and 171)

Hintze, A and Adami, C, 2008. Evolution of complex modular biological networks. PLoS Computational

Biology 4:e23. (page 18)

Hirschman, JE, Balakrishnan, R, Christie, KR, Costanzo, MC, Dwight, SS, et al., 2006. Genome Snapshot:

a new resource at the Saccharomyces Genome Database (SGD) presenting an overview of the

Saccharomyces cerevisiae genome. Nucleic Acids Research 34:D442–D445. (pages 21 and 142)



Ho, Y, Gruhler, A, Heilbut, A, Bader, GD, Moore, L, et al., 2002. Systematic identification of protein

complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415:180–183. (pages 36, 57, 61,

66, and 80)

Hong, EL, Balakrishnan, R, Dong, Q, Christie, KR, Park, J, et al., 2008. Gene Ontology annotations at

SGD: new data sources and annotation methods. Nucleic Acids Research 36:D577–D581. (pages 21

and 34)

Huang, H, Jedynak, BM, and Bader, JS, 2007. Where have all the interactions gone? Estimating the

coverage of two-hybrid protein interaction maps. PLoS Computational Biology 3:e214. (pages 59 and 61)

Ito, T, Chiba, T, Ozawa, R, Yoshida, M, Hattori, M, and Sakaki, Y, 2001. A comprehensive two-hybrid

analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences

98:4569–4574. (pages 34, 36, 37, 57, 58, 61, and 80)

Ito, T, Tashiro, K, Muta, S, Ozawa, R, Chiba, T, et al., 2000. Toward a protein-protein interaction map of the

budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations

between the yeast proteins. Proceedings of the National Academy of Sciences 97:1143–1147. (page 80)

Jansen, R and Gerstein, M, 2004. Analyzing protein function on a genomic scale: the importance of gold-standard

positives and negatives for network prediction. Current Opinion in Microbiology 7:535–545.

(pages 47 and 70)

Jansen, R, Yu, H, Greenbaum, D, Kluger, Y, Krogan, NJ, et al., 2003. A Bayesian Networks Approach for

Predicting Protein-Protein Interactions from Genomic Data. Science 302:449–453. (pages 47 and 70)

Jordan, I, Wolf, Y, and Koonin, EV, 2003. No simple dependence between protein evolution rate and the

number of protein-protein interactions: only the most prolific interactors tend to evolve slowly. BMC

Evolutionary Biology 3:1. (pages 33, 87, 118, and 137)

Jothi, R, Cherukuri, PF, Tasneem, A, and Przytycka, TM, 2006. Co-evolutionary analysis of domains in

interacting proteins reveals insights into domain-domain interactions mediating protein-protein interactions.

Journal of Molecular Biology 362:861–875. (page 33)

Jothi, R, Kann, MG, and Przytycka, TM, 2005. Predicting protein-protein interaction by searching evolutionary

tree automorphism space. Bioinformatics 21:i241–i250. (pages 33 and 118)

Juan, D, Pazos, F, and Valencia, A, 2008. Co-evolution and co-adaptation in protein networks. FEBS Letters

582:1225–1230. (pages 31 and 121)

Juan, D, Pazos, F, and Valencia, A, 2008. High-confidence prediction of global interactomes based on

genome-wide coevolutionary networks. Proceedings of the National Academy of Sciences 105:934–939.

(pages 31 and 118)

Kann, MG, Jothi, R, Cherukuri, PF, and Przytycka, TM, 2007. Predicting protein domain interactions from

coevolution of conserved regions. Proteins 67:811–820. (page 32)

Kann, MG, Shoemaker, BA, Panchenko, AR, and Przytycka, TM, 2008. Correlated Evolution of Interacting

Proteins: Looking Behind the Mirrortree. Journal of Molecular Biology 385:91–98. (page 32)

Kashtan, N, Itzkovitz, S, Milo, R, and Alon, U, 2004. Efficient sampling algorithm for estimating subgraph

concentrations and detecting network motifs. Bioinformatics 20:1746–1758. (page 45)

Kelly, WP and Stumpf, MPH, 2008. Protein-protein interactions: from global to local analyses. Current

Opinion in Biotechnology 19:396–403. (page 20)



Kemp, BE, Mitchelhill, KI, Stapleton, D, Michell, BJ, Chen, ZP, and Witters, LA, 1999. Dealing with

energy demand: the AMP-activated protein kinase. Trends in Biochemical Sciences 24:22–25. (page 79)

Kerrien, S, Alam-Faruque, Y, Aranda, B, and Bancarz, I, 2006. IntAct - open source resource for molecular

interaction data. Nucleic Acids Research 00:D1–D5. (page 64)

Kiel, C, Beltrao, P, and Serrano, L, 2008. Analyzing Protein Interaction Networks Using Structural Information.

Annual Review Biochemistry 77:1–27. (page 35)

Kim, PM, Sboner, A, Xia, Y, and Gerstein, M, 2008. The role of disorder in interaction networks: a

structural analysis. Molecular Systems Biology 4:179. (page 50)

Kim, WK and Marcotte, EM, 2008. Age-dependent evolution of the yeast protein interaction network

suggests a limited role of gene duplication and divergence. PLoS Computational Biology 4:e1000232.

(page 51)

Konagurthu, AS and Lesk, AM, 2008. On the origin of distribution patterns of motifs in biological networks.

BMC Systems Biology 2:73. (pages 26 and 45)

Krogan, NJ, Cagney, G, Yu, H, Zhong, G, Guo, X, et al., 2006. Global landscape of protein complexes in

the yeast Saccharomyces cerevisiae. Nature 440:637–643. (pages 38, 47, 66, and 83)

Kuchin, S, Treich, I, and Carlson, MW, 2000. A regulatory shortcut between the Snf1 protein kinase and

RNA polymerase II holoenzyme. Proceedings of the National Academy of Sciences 97:7916–7920.

(page 79)

LaCount, DJ, Vignali, M, Chettier, R, Phansalkar, A, Bell, R, et al., 2005. A protein interaction network of

the malaria parasite Plasmodium falciparum. Nature 438:103–107. (page 21)

Lappe, M and Holm, L, 2004. Unraveling protein interaction networks with near-optimal efficiency. Nature

Biotechnology 22:98–103. (page 34)

Lee, S, Kim, P, and Jeong, H, 2006. Statistical properties of sampled networks. Physical Review E

73:016102. (pages 54 and 95)

Lehner, B and Fraser, AG, 2004. A first-draft human protein-interaction map. Genome Biology 5:R63.

(page 27)

Lemos, B, Meiklejohn, C, and Hartl, D, 2004. Regulatory evolution across the protein interaction network.

Nature Genetics 36:1059–1060. (page 87)

Li, L, Anderson, D, Tanaka, R, Doyle, J, and Willinger, W, 2005. Towards a Theory of Scale-Free Graphs:

Definition, Properties, and Implications. Internet Mathematics 2:4. (pages 51, 168, and 169)

Li, S, Armstrong, C, Bertin, N, Ge, H, and Milstein, S, 2004. A Map of the Interactome Network of the

Metazoan C. elegans. Science 303:540–543. (page 40)

Li, X, Chen, H, Huang, Z, Su, H, and Martinez, JD, 2007. Global mapping of gene/protein interactions

in PubMed abstracts: a framework and an experiment with P53 interactions. Journal of Biomedical

Informatics 40:453–464. (page 50)

Lin, N, Wu, B, Jansen, R, Gerstein, M, and Zhao, H, 2004. Information assessment on predicting protein-protein

interactions. BMC Bioinformatics 5:154. (page 47)

Lo, WS, Duggan, L, Emre, NC, Belotserkovskya, R, Lane, WS, et al., 2001. Snf1–a histone kinase that

works in concert with the histone acetyltransferase Gcn5 to regulate transcription. Science 293:1142–

1146. (page 79)



Loganantharaj, R and Atwi, M, 2007. Towards validating the hypothesis of phylogenetic profiling. BMC

Bioinformatics 8:s25. (page 31)

Lu, L, Xia, Y, Paccanaro, A, Yu, H, and Gerstein, M, 2005. Assessing the limits of genomic data integration

for predicting protein networks. Genome Research 15:945–953. (page 40)

Luscombe, NM, Greenbaum, D, and Gerstein, M, 2001. What is bioinformatics? A proposed definition and

overview of the field. Methods of Information in Medicine 40:346–358. (page 18)

MacDonald, N, 1979. Simple aspects of foodweb complexity. Journal of Theoretical Biology 80:577–588.

(page 26)

Marcotte, EM, Pellegrini, M, Thompson, M, and Yeates, TO, 1999. A combined algorithm for genome-wide

prediction of protein function. Nature 402:83–86. (page 47)

Maslov, S and Sneppen, K, 2002. Specificity and stability in topology of protein networks. Science

296:910–913. (page 45)

May, RM, 2001. Stability and Complexity in Model Ecosystems. Princeton University Press. (page 87)

Meinke, G, Ezeokonkwo, C, Balbo, P, Stafford, W, Moore, C, and Bohm, A, 2008. Structure of yeast

poly(A) polymerase in complex with a peptide from Fip1, an intrinsically disordered protein. Biochemistry

47:6859–6869. (page 35)

Mewes, HW, Frishman, D, Mayer, K, Munsterkotter, M, Noubibou, O, et al., 2006. MIPS: analysis and

annotation of proteins from whole genomes in 2005. Nucleic Acids Research 34:D169–D172. (pages 65

and 119)

Mika, S and Rost, B, 2006. Protein-protein interactions more conserved within species than across species.

PLoS Computational Biology 2:e79. (page 40)

Millson, SH, Truman, AW, King, V, Prodromou, C, Pearl, LH, and Piper, PW, 2005. A two-hybrid screen

of the yeast proteome for Hsp90 interactors uncovers a novel Hsp90 chaperone requirement in the activity

of a stress-activated mitogen-activated protein kinase, Slt2p (Mpk1p). Eukaryotic Cell 4:849–860.

(page 192)

Milo, R, Shen-Orr, S, Itzkovitz, S, Kashtan, N, Chklovskii, D, and Alon, U, 2002. Network motifs: simple

building blocks of complex networks. Science 298:824–827. (pages 45 and 87)

Mösch, HU and Fink, GR, 1997. Dissection of filamentous growth by transposon mutagenesis in Saccharomyces

cerevisiae. Genetics 145:671–684. (page 69)

Mount, D, 2004. Bioinformatics: Sequence and Genome Analysis. CSHL Press. (page 30)

Nariai, N, Tamada, Y, Imoto, S, and Miyano, S, 2005. Estimating gene regulatory networks and protein-protein

interactions of Saccharomyces cerevisiae from multiple genome-wide data. Bioinformatics

21:i206–i212. (page 37)

Newman, MEJ, 2003. Mixing patterns in networks. Physical Review E 67:026126. (page 45)

Newman, MEJ, 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46:323–351.

(page 49)

Newman, MEJ and Park, J, 2003. Why social networks are different from other types of networks. arXiv

0305612. (page 46)



Overbeek, R, Fonstein, M, D’Souza, M, Pusch, GD, and Maltsev, N, 1999. The use of gene clusters to infer

functional coupling. Proceedings of the National Academy of Sciences 96:2896–2901. (page 47)

Pamilo, P and Nei, M, 1988. Relationships between gene trees and species trees. Molecular Biology and

Evolution 5:568–583. (page 121)

Pan, X, Ye, P, Yuan, DS, Wang, X, Bader, JS, and Boeke, JD, 2006. A DNA integrity network in the yeast

Saccharomyces cerevisiae. Cell 124:1069–1081. (page 66)

Park, J and Newman, MEJ, 2004. The statistical mechanics of networks. arXiv 0405566. (page 51)

Pastor-Satorras, R, Smith, E, and Sole, RV, 2003. Evolving protein interaction networks through gene

duplication. Journal of Theoretical Biology 222:199–210. (page 28)

Pattison, P and Wasserman, S, 1999. Logit models and logistic regressions for social networks: II. Multivariate

relations. British Journal of Mathematical and Statistical Psychology 52:169–193. (pages 51,

80, and 169)

Pazos, F, Ranea, J, Juan, D, and Sternberg, M, 2005. Assessing Protein Co-evolution in the Context of the

Tree of Life Assists in the Prediction of the Interactome. Journal of Molecular Biology 352:1002–1015.

(pages 30, 32, 118, 134, 137, and 186)

Pazos, F and Valencia, A, 2001. Similarity of phylogenetic trees as indicator of protein-protein interaction.

Protein Engineering 14:609–614. (pages 31 and 118)

Pazos, F and Valencia, A, 2008. Protein co-evolution, co-adaptation and interactions. The EMBO Journal

27:2648–2655. (page 121)

Pellegrini, M, Marcotte, EM, Thompson, M, Eisenberg, D, and Yeates, TO, 1999. Assigning protein

functions by comparative genome analysis: protein phylogenetic profiles. Proceedings of the National

Academy of Sciences 96:4285–4288. (pages 30, 31, and 118)

Phizicky, E, Bastiaens, PIH, Zhu, H, Snyder, M, and Fields, S, 2003. Protein analysis on a proteomic scale.

Nature 422:208–215. (page 68)

Picard, F, Daudin, JJ, Koskas, M, Schbath, S, and Robin, S, 2008. Assessing the exceptionality of network

motifs. Journal of Computational Biology 15:1–20. (page 45)

Pratt, RC, Morgan-Richards, M, and Trewick, SA, 2008. Diversification of New Zealand weta (Orthoptera:

Ensifera: Anostostomatidae) and their relationships in Australasia. Philosophical Transactions of the

Royal Society B 363:3427–3437. (page 26)

Przulj, N, 2007. Biological network comparison using graphlet degree distribution. Bioinformatics

23:e177–e183. (page 45)

Ptacek, J, Devgan, G, Michaud, G, Zhu, H, Zhu, X, et al., 2005. Global analysis of protein phosphorylation

in yeast. Nature 438:679–684. (pages 66, 79, and 175)

Ptak, RG, Fu, W, Sanders-Beer, BE, Dickerson, JE, Pinney, JW, et al., 2008. Cataloguing the HIV type 1

human protein interaction network. AIDS Research and Human Retroviruses 24:1497–1502. (page 25)

Ramani, AK, Li, Z, Hart, GT, Carlson, MW, Boutz, DR, and Marcotte, EM, 2008. A map of human protein

interactions derived from co-expression of human mRNAs and their orthologs. Molecular Systems

Biology 4:180. (pages 26, 46, 47, and 89)

Ramani, AK and Marcotte, EM, 2003. Exploiting the Co-evolution of Interacting Proteins to Discover

Interaction Specificity. Journal of Molecular Biology 327:273–284. (pages 118 and 137)



Raveh, A, Riven, I, and Reuveny, E, 2009. The Use of FRET Microscopy to Elucidate Steady State Channel

Conformational Rearrangements and G Protein Interaction with the GIRK Channels. Methods in

Molecular Biology 491:199–212. (page 35)

Reguly, T, Breitkreutz, A, Boucher, L, Breitkreutz, BJ, Hon, G, et al., 2006. Comprehensive curation and

analysis of global interaction networks in Saccharomyces cerevisiae. Journal of Biology 5:11. (pages 50,

65, 80, and 82)

Robins, G, Pattison, P, Kalish, Y, and Lusher, D, 2007. An introduction to exponential random graph (p*)

models for social networks. Social Networks 29:173–191. (pages 51 and 169)

Salwinski, L and Eisenberg, D, 2003. Computational methods of analysis of protein-protein interactions.

Current Opinion in Structural Biology 13:377–382. (pages 46, 89, and 140)

Sanchez, C, Lachaize, C, Janody, F, Bellon, B, Röder, L, et al., 1999. Grasping at molecular interactions

and genetic networks in Drosophila melanogaster using FlyNets, an Internet database. Nucleic Acids

Research 27:89–94. (page 23)

Sato, T, Yamanishi, Y, Horimoto, K, Toh, H, and Kanehisa, M, 2003. Prediction of protein-protein interactions

from phylogenetic trees using partial correlation coefficient. Genome Informatics 14:496–497.

(pages 28 and 118)

Saul, ZM and Filkov, V, 2007. Exploring biological network structure using exponential random graph

models. Bioinformatics 23:2604–2611. (pages 169 and 171)

Scholtens, D, Chiang, T, Huber, W, and Gentleman, R, 2008. Estimating node degree in bait-prey graphs.

Bioinformatics 24:218–224. (pages 36 and 139)

Schwartz, AS, Yu, J, Gardenour, KR, Finley, RL, and Ideker, T, 2009. Cost-effective strategies for completing

the interactome. Nature Methods 6:55–61. (page 164)

Schwikowski, B, Uetz, P, and Fields, S, 2000. A network of protein-protein interactions in yeast. Nature

Biotechnology 18:1257–1261. (pages 18 and 52)

Sharp, P and Li, WH, 1987. The codon adaptation index - a measure of directional synonymous codon

usage bias, and its potential applications. Nucleic Acids Research 15:1281–1295. (page 137)

Sharrocks, K, 2007. Host cell factors facilitating HIV-1 Integration. PhD Thesis. (page 25)

Shen, J, Zhang, J, Luo, X, Zhu, W, Yu, K, et al., 2007. Predicting protein-protein interactions based only

on sequences information. Proceedings of the National Academy of Sciences 104:4337–4341. (pages 21

and 46)

Shen-Orr, S, Milo, R, Mangan, S, and Alon, U, 2002. Network motifs in the transcriptional regulation

network of Escherichia coli. Nature Genetics 31:64–69. (page 45)

Shoemaker, BA and Panchenko, AR, 2007. Deciphering protein-protein interactions. Part I. Experimental

techniques and databases. PLoS Computational Biology 3:e42. (page 36)

Shoemaker, BA and Panchenko, AR, 2007. Deciphering protein-protein interactions. Part II. Computational

methods to predict protein and domain interaction partners. PLoS Computational Biology 3:e43.

(page 37)

Shokouhi, M, Zobel, J, and Scholer, F, 2006. Capturing collection size for distributed non-cooperative

retrieval. SIGIR Proceedings 316–323. (pages 139 and 142)


Skrabanek, L, Saini, HK, Bader, GD, and Enright, AJ, 2008. Computational prediction of protein-protein

interactions. Molecular Biotechnology 38:1–17. (pages 46, 47, 70, and 89)

Small, M, Walker, DM, and Tse, CK, 2007. Scale-free distribution of avian influenza outbreaks. Physical

Review Letters 99:188702. (page 50)

Small, M, Xu, X, Zhou, J, Zhang, J, Sun, J, and Lu, JA, 2008. Scale-free networks which are highly

assortative but not small world. Physical Review E 77:066112. (page 50)

Smith, TF and Waterman, MS, 1981. Identification of common molecular subsequences. Journal of Molecular

Biology 147:195–197. (page 27)

Snijders, T, 2002. Markov chain Monte Carlo estimation of exponential random graph models. Journal of

Social Structure. (page 171)

Sprinzak, E, Sattath, S, and Margalit, H, 2003. How Reliable are Experimental Protein–Protein Interaction

Data? Journal of Molecular Biology 919–923. (page 61)

Srinivasan, BS, Shah, NH, Flannick, JA, Abeliuk, E, Novak, AF, and Batzoglou, S, 2007. Current progress

in network research: toward reference networks for key model organisms. Briefings in Bioinformatics

8:318–332. (page 46)

Stark, C, Breitkreutz, BJ, Reguly, T, Boucher, L, Breitkreutz, A, and Tyers, M, 2006. BioGRID: a general

repository for interaction datasets. Nucleic Acids Research 34:D535–D539. (page 64)

Strong, DR, Simberloff, D, Abele, LG, and Thistle, AB, 1984. Ecological communities: Conceptual issues

and the evidence. Princeton University Press. (page 87)

Stuart, JM, Segal, E, Koller, D, and Kim, SK, 2003. A gene-coexpression network for global discovery of

conserved genetic modules. Science 302:249–255. (page 47)

Stumpf, MPH, Ingram, PJ, Nouvel, I, and Wiuf, C, 2005. Statistical Model Selection Methods Applied to

Biological Networks. arXiv 0506013. (page 50)

Stumpf, MPH, Kelly, WP, Thorne, T, and Wiuf, C, 2007. Evolution at the system level: the natural history

of protein interaction networks. Trends in Ecology & Evolution 22:366–373. (pages 20, 43, 51, 87,

and 168)

Stumpf, MPH, Thorne, T, de Silva, E, Stewart, R, An, H, et al., 2008. Estimating the size of the human

interactome. Proceedings of the National Academy of Sciences 105:6959–6964. (pages 59, 61, 139, 141,

and 142)

Stumpf, MPH and Wiuf, C, 2005. Sampling properties of random graphs: The degree distribution. Physical

Review E 72:036118. (pages 50 and 95)

Stumpf, MPH, Wiuf, C, and May, RM, 2005. Subnets of scale-free networks are not scale-free: Sampling

properties of networks. Proceedings of the National Academy of Sciences 102:4221–4224. (pages 95

and 171)

Tajima, F, 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437–460.

(page 121)

Takemoto, K and Oosawa, C, 2005. Evolving networks by merging cliques. Physical Review E 72:046116.

(page 49)

Tanaka, R, Yi, TM, and Doyle, J, 2005. Some protein interaction data do not exhibit power law statistics.

FEBS Letters 579:5140–5144. (page 50)


Tarassov, K, Messier, V, Landry, CR, Radinovic, S, Molina, MMS, et al., 2008. An in vivo map of the yeast

protein interactome. Science 320:1465–1470. (page 23)

Thomas, P, 2008. Generalising multiple capture-recapture to non-uniform sample sizes. SIGIR Proceedings

839–840. (page 142)

Thompson, JD, Gibson, TJ, and Higgins, DG, 2002. Multiple sequence alignment using ClustalW and

ClustalX. Current Protocols in Bioinformatics Chapter 2:Unit 2.3. (page 120)

Thorne, T and Stumpf, MPH, 2007. Generating confidence intervals on biological networks. BMC Bioinformatics

8:467. (pages 40, 46, 87, 89, 93, and 164)

Tucker, CL, Gera, JF, and Uetz, P, 2001. Towards an understanding of complex protein networks. Trends

in Cell Biology 11:102–106. (page 61)

Uetz, P, Giot, L, Cagney, G, Mansfield, TA, Judson, RS, et al., 2000. A comprehensive analysis of protein-protein

interactions in Saccharomyces cerevisiae. Nature 403:623–627. (pages 34, 36, 37, 57, 58, 61,

and 80)

Umemura, M, Fujita, M, Yoko-O, T, Fukamizu, A, and Jigami, Y, 2007. Saccharomyces cerevisiae CWH43

is involved in the remodeling of the lipid moiety of GPI anchors to ceramides. Molecular Biology of the

Cell 18:4304–4316. (page 69)

Valencia, A and Pazos, F, 2002. Computational methods for the prediction of protein interactions. Current

Opinion in Structural Biology 12:368–373. (pages 46 and 89)

Vázquez, A, Pastor-Satorras, R, and Vespignani, A, 2002. Large-scale topological and dynamical properties

of the Internet. Physical Review E 65:066130. (page 45)

von Mering, C, Krause, R, Snel, B, Cornell, M, Oliver, SG, et al., 2002. Comparative assessment of large-scale

data sets of protein-protein interactions. Nature 417:399–403. (pages 37, 56, 61, and 140)

Wagner, A, 2001. The yeast protein interaction network evolves rapidly and contains few redundant duplicate

genes. Molecular Biology and Evolution 18:1283–1292. (pages 87 and 120)

Wagner, A, 2005. Robustness and Evolvability in Living Systems. Princeton University Press. (page 50)

Watts, DJ, 2004. Small Worlds: The Dynamics of Networks Between Order and Randomness. Princeton

University Press. (pages 45 and 49)

Watts, DJ and Strogatz, S, 1998. Collective dynamics of ‘small-world’ networks. Nature 393:440–442.

(pages 48 and 50)

Wojcik, J, Boneca, IG, and Legrain, P, 2002. Prediction, assessment and validation of protein interaction

maps in bacteria. Journal of Molecular Biology 323:763–770. (page 40)

Wolfe, K, 2006. Comparative genomics and genome evolution in yeasts. Philosophical Transactions of the

Royal Society B 361:403–412. (pages 26, 29, and 120)

Xenarios, I, Salwinski, L, Duan, X, Higney, P, Kim, SM, and Eisenberg, D, 2002. DIP, the Database of

Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids

Research 30:303–305. (pages 32, 64, and 65)

Xu, J, Wu, S, and Li, X, 2007. Estimating collection size with logistic regression. SIGIR Proceedings

789–790. (page 142)

Yang, Z, 2004. PAML: Phylogenetic Analysis by Maximum Likelihood. (page 30)


Yang, Z, 2006. Computational Molecular Evolution. Oxford University Press. (page 26)

Yang, Z, 2007. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution

24:1586–1591. (page 120)

Yook, SH, Jeong, H, and Barabasi, AL, 2002. Modeling the Internet’s large-scale topology. Proceedings of

the National Academy of Sciences 99:13382–13386. (pages 43 and 50)

Yu, J and Fotouhi, F, 2006. Computational approaches for predicting protein-protein interactions: a survey.

Journal of Medical Systems 30:39–44. (pages 46 and 89)

Yuan, C, Yongkiettrakul, S, Byeon, IJ, Zhou, S, and Tsai, MD, 2001. Solution structures of two FHA1-

phosphothreonine peptide complexes provide insight into the structural basis of the ligand specificity of

FHA1 from yeast Rad53. Journal of Molecular Biology 314:563–575. (page 35)

Zhang, J, 2003. Evolution by gene duplication: an update. Trends in Ecology & Evolution 18:292–298.

(page 28)
