“A structure based flexible search method for motifs in RNA” Isana ...

STRMS 

“A structure based flexible search method for motifs in RNA”

Isana Veksler-Lublinsky 

Michal Ziv-Ukelson 

Danny Barash 

Klara Kedem

Outline 

• Background
• Motivation
• RNA’s structure representations
• Trees comparison
• Our Algorithm
• Results

The Central Dogma of Molecular Biology

DNA → (transcription) → RNA → (translation) → Protein

Non Coding RNA 

- An RNA molecule that is not translated into a protein.
- Non-coding RNAs have been found to have roles in a great variety of processes.


The Recent Example 

The Nobel Prize in Physiology or Medicine 2006

"for their discovery of RNA interference - gene silencing by double-stranded RNA"

Andrew Z. Fire, Craig C. Mello


Non Coding RNA Families 

• They are not conserved in sequence, but they are conserved in structure.
• They have a role in regulating gene expression.
• Examples: tRNA, rRNA, snoRNA, microRNA, siRNA, riboswitch.

Motivation 

• The goal is to discover ncRNA motifs in a sequence database.
• Most RNA motif search methods start from the primary sequence and only then take into account secondary structure considerations.
• Since different motifs vary in structure rigidity and in local sequence constraints, there is a need for algorithms and tools that can be fine-tuned according to the searched RNA motif.

Our Goal 

Discover ncRNA motifs in a sequence database. 

[Figure: a QUERY structure searched against a genome sequence of millions of nucleotides, e.g. ACGCUGACGUAGUCAGUAGACGAC AGACAGAUACGUCACCGCAGAUAC GCAUAGUAGCAGUAGCAGAUGACG …]

Are there any appearances of this 

structure in the genome?

The tool - STRMS 

(Structural RNA Motif Search): 

• Input: the secondary structure of the query, including local sequence and structure constraints, and a target sequence database.
• Output: all occurrences of the query in the target, ranked by their similarity to the query (as an HTML file).
• The tool is flexible and takes into account a large number of sequence options.
• Our approach combines:
    - pre-folding with MFOLD (Zuker, 2003)
    - an RNA pattern matching algorithm [O(mn)] based on subtree homeomorphism for ordered, rooted trees.

Our method consists of two phases:

RNA’s Secondary Structure 

[Figure labels: pseudoknot, single-stranded region, bulge loop, stem, interior loop, hairpin loop, junction (multiloop). Image: Wuchty]


(((((((..((((.......)))).(((((.......))))).....(((((.......))))))))))))


Graph

Ordered rooted tree 

Shapiro, 1988:
• The nodes correspond to elements of secondary structure (hairpin loop, bulge, internal loop or multiloop).
• The edges correspond to base-paired (stem) regions.

Zhang, 1998:
• The nodes of the tree represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with a base or a pair of bases, respectively.
• There are two kinds of edges, connecting either consecutive stem base-pairs or a leaf base with the last base-pair in the corresponding stem.
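As an illustration of the Shapiro-style view, the following minimal sketch turns a bracket string into a coarse ordered tree whose nodes are loops and whose edges are stems. The names `LoopNode`, `pair_table`, `loop_below` and `coarse_tree` are hypothetical, pseudoknots are not handled, and this is not the STRMS tree builder.

```python
from dataclasses import dataclass, field


@dataclass
class LoopNode:
    kind: str                      # "root", "hairpin", "bulge", "interior" or "multiloop"
    stem_len: int = 0              # base pairs on the edge (stem) leading to this node
    children: list["LoopNode"] = field(default_factory=list)


def pair_table(dot_bracket: str) -> list[int]:
    """Partner index for every position, or -1 if the base is unpaired."""
    table, stack = [-1] * len(dot_bracket), []
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()        # raises IndexError on an unmatched ")"
            table[i], table[j] = j, i
    if stack:
        raise ValueError("unmatched '('")
    return table


def loop_below(table: list[int], i: int, j: int) -> LoopNode:
    """Node for the loop closed by the stem whose outermost pair is (i, j)."""
    stem_len = 1
    while i + 1 < j - 1 and table[i + 1] == j - 1:   # follow the stacked pairs
        i, j, stem_len = i + 1, j - 1, stem_len + 1
    children, gaps, run, k = [], [], 0, i + 1
    while k < j:
        if table[k] > k:           # an inner stem opens here
            gaps.append(run)
            run = 0
            children.append(loop_below(table, k, table[k]))
            k = table[k] + 1
        else:                      # unpaired base inside the loop
            run += 1
            k += 1
    gaps.append(run)
    if not children:
        kind = "hairpin"
    elif len(children) == 1:       # unpaired bases on one side only: bulge
        kind = "bulge" if 0 in gaps else "interior"
    else:
        kind = "multiloop"
    return LoopNode(kind, stem_len, children)


def coarse_tree(dot_bracket: str) -> LoopNode:
    """Ordered (5' to 3') tree: loops are nodes, stems are edges."""
    table, root, k = pair_table(dot_bracket), LoopNode("root"), 0
    while k < len(table):
        if table[k] > k:
            root.children.append(loop_below(table, k, table[k]))
            k = table[k] + 1
        else:
            k += 1
    return root
```

For example, `coarse_tree("((..((...))..))")` yields a root with one interior-loop child that in turn has one hairpin child. Children are kept in 5' to 3' order, which is what makes the tree ordered.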

Our tree representation 

• Compressed as in [Shapiro, 1988], plus a node for every single-strand component in multiloops.
• Includes additional information on nodes and on edges for the purpose of sequence analysis.
• It is more informative than Shapiro’s tree representation and more compact than Zhang’s.
• This leads to a precise screening of the target text by first selecting candidates whose structural tree representation is similar to that of the query, and then further filtering these candidates by applying sequence considerations.

Our tree representation 

[Figure labels: origin of a single structure, bulge loop, stem-edge, hairpin loop, interior loop, dangling ends, single-strand components of the multiloop]

☺ Single-strand components and stem-edges are annotated with length and sequence.
☺ A small circle node carries only topological information.
☺ The tree structure is generated from a ct file (output from mfold).
☺ The tree construction follows the 5’ to 3’ ordering of the molecule.
☺ The structure is compressed yet retains the sequence information.
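The first step of reading a ct file can be sketched as below. The sketch assumes the usual ct layout (a header line starting with the sequence length, then one line per base with the fields index, base, previous index, next index, pairing partner or 0, and original numbering) and a pseudoknot-free structure; `ct_to_dot_bracket` is a hypothetical helper, not part of mfold or STRMS.

```python
def ct_to_dot_bracket(ct_text: str) -> tuple[str, str]:
    """Return (sequence, bracket string) for the first structure in a ct file."""
    rows = [line.split() for line in ct_text.strip().splitlines()]
    length = int(rows[0][0])               # header: length followed by a title
    sequence, brackets = [], []
    for fields in rows[1:length + 1]:
        index, base, partner = int(fields[0]), fields[1], int(fields[4])
        sequence.append(base)
        if partner == 0:                   # 0 marks an unpaired base
            brackets.append(".")
        else:                              # partner downstream opens, upstream closes
            brackets.append("(" if partner > index else ")")
    return "".join(sequence), "".join(brackets)


# A hand-written toy ct file: a 3 bp stem closing a 3-nucleotide hairpin loop.
TOY_CT = """\
9 toy hairpin
1 G 0 2 9 1
2 G 1 3 8 2
3 G 2 4 7 3
4 A 3 5 0 4
5 A 4 6 0 5
6 A 5 7 0 6
7 C 6 8 3 7
8 C 7 9 2 8
9 C 8 0 1 9
"""
```

Keeping the sequence next to the bracket string mirrors the point above: the compressed tree can then be annotated with lengths and sequences for the later sequence filter.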


Comparison of ordered rooted 

trees 

• Trees are among the most common and well-studied combinatorial structures in computer science. In particular, the problem of comparing trees occurs in several diverse areas such as:
    - computational biology
    - structured text databases
    - image analysis
    - automatic theorem proving
    - compiler optimization.


Comparison of ordered rooted trees

• The following operations are defined on ordered trees:
    - relabel: change the label of a node v in T.
    - delete: delete a non-root node v in T with parent v′, making the children of v become the children of v′. The children are inserted in the place of v as a subsequence in the left-to-right order of the children of v′.
    - insert: the complement of delete. Insert a node v as a child of v′ in T, making v the parent of a consecutive subsequence of the children of v′.
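The three operations can be sketched on a simple ordered tree as follows. The `Tree` class and the function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tree:
    label: str
    children: list["Tree"] = field(default_factory=list)


def relabel(node: Tree, new_label: str) -> None:
    """relabel: change the label of a node."""
    node.label = new_label


def delete(parent: Tree, index: int) -> None:
    """delete: remove parent.children[index]; its children take its place, in order."""
    victim = parent.children[index]
    parent.children[index:index + 1] = victim.children


def insert(parent: Tree, label: str, start: int, stop: int) -> Tree:
    """insert: add a node under parent that adopts parent.children[start:stop]."""
    node = Tree(label, parent.children[start:stop])
    parent.children[start:stop] = [node]
    return node


def show(tree: Tree) -> str:
    """Compact text form, e.g. a(b(d,e),c)."""
    if not tree.children:
        return tree.label
    return f"{tree.label}({','.join(show(c) for c in tree.children)})"
```

Deleting `b` from `a(b(d,e),c)` gives `a(d,e,c)`, and inserting `b` over the first two children restores the original tree, which is why insert is described as the complement of delete.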

1. Edit distance 

• An edit script S between T1 and T2 is a sequence of edit operations turning T1 into T2.
• The edit distance between T1 and T2 is the cost of a cheapest such edit script.
• The tree edit distance problem is to compute the edit distance and a corresponding edit script.
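A minimal sketch of ordered tree edit distance with unit costs, using the textbook recursion on the rightmost roots of two forests with memoization. Trees are written as hypothetical `(label, children)` tuples; this is for small examples only and is not one of the optimized algorithms from the literature.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees in left-to-right order.


@lru_cache(maxsize=None)
def forest_dist(f: tuple, g: tuple) -> int:
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:                                   # only insertions remain
        _, kids = g[-1]
        return forest_dist(f, g[:-1] + kids) + 1
    if not g:                                   # only deletions remain
        _, kids = f[-1]
        return forest_dist(f[:-1] + kids, g) + 1
    (label_v, kids_v), (label_w, kids_w) = f[-1], g[-1]
    return min(
        forest_dist(f[:-1] + kids_v, g) + 1,    # delete rightmost root of f
        forest_dist(f, g[:-1] + kids_w) + 1,    # insert rightmost root of g
        forest_dist(kids_v, kids_w)             # match the two rightmost roots
        + forest_dist(f[:-1], g[:-1])
        + (label_v != label_w),                 # relabel costs 1 if labels differ
    )


def tree_dist(t1: tuple, t2: tuple) -> int:
    return forest_dist((t1,), (t2,))
```

Deleting a root promotes its children into the forest (`f[:-1] + kids_v`), which matches the delete operation defined on ordered trees.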


2. Tree Inclusion 

T1 is included in T2 if there is a sequence of delete 

operations performed on T2 which makes T2 

isomorphic to T1. The tree inclusion problem is to 

decide if T1 is included in T2.


• Polynomial time algorithms exist for these problems. They are all based on the classical technique of dynamic programming and most of them are simple combinatorial algorithms.


Comparison of ordered rooted trees

• Ordered tree comparison is generally computed by tree edit distance, which allows various forms of deletions and insertions in both query and target.
• The search for small non-coding RNAs naturally yields a more specific tree search formulation since we do not allow deletions in the query.
• In our method we apply a weighted pattern matching algorithm for finding the best homeomorphic mapping between two rooted ordered trees.
• Specific constraints on the searched motif can be defined in the input to the search: structural constraints (lengths), allowing or forbidding element deletion in the target, sequence constraints (existence of sibling pseudoknots, local conserved sequence segments).

The Algorithm 

• The subtree isomorphism problem [Matula, 1968, 1978]:
Given a pattern tree P and a text tree T, find a subtree of T which is isomorphic to P, i.e. find a subtree of T, identical in structure to P, that can be obtained by removing entire subtrees of T, or decide that there is no such tree.
• The subtree homeomorphism problem [Chung, 1987, Reyner, 1977, Pinter et al., 2004]:
A variant of the former problem, where degree-2 nodes can also be deleted from the text tree.

Homeomorphism Example

The Algorithm - Motivation 

• Point-mutation events could easily result in an extra bulge in an RNA structure.
• However, in some cases the functional homology to the original, non-mutated structure is still preserved.
• The suggested alignment should be flexible enough to allow the deletion of degree-2 nodes from the target tree.

[Figure: a riboswitch and its functional homologue, with the extra bulge marked]

The Algorithm - Motivation 

• In some cases subtrees may be deleted from the target tree but not from the query tree, as in the tRNA case.

Subtree homeomorphism on ordered rooted trees is more efficient (quadratic in input size) than tree edit distance (cubic in input size).

Subtree Homeomorphism Score 

• Let T1 and T2 be two ordered, rooted, homeomorphic trees.
• A mapping µ : T1 → T2 is a one-to-one map from the nodes of T1 to the nodes of T2 that preserves the ancestor relations of the nodes and their relative order.
• The subtree homeomorphism score of the mapping, denoted S(µ), is built from four annotated terms:
    - a user-defined node-to-node similarity score function;
    - an edge-to-edge similarity score function, where e_u ∈ T1 and e_v ∈ T2 are corresponding edges;
    - the penalty for deleting a degree-2 node from T2;
    - the penalty for deleting any other node.


• Given two rooted ordered trees, P and T, the weighted subtree homeomorphism problem is to find a homeomorphism-preserving mapping µ : P → t from P to some subtree t of T, such that S(µ) is maximal.


• The cost function varies from one application to another, depending upon the amount of information supplied with the query.
• The simplest one just compares the topology of the structures.
• More complex functions include length differences of the structural elements, sequence conservation and pseudoknot matching.
• The node deletion score (i.e., gap penalty) reflects the tradeoff between a gap and a mismatch. As the gap penalty increases, the algorithm tends to match distant nodes to avoid gaps. As different values may suit different needs, our tool enables users to set this parameter for each run.

The Tree Alignment Algorithm 

• A bottom-up, two-level dynamic programming (DP) algorithm:
    - computes an optimal alignment between P and a subtree t of T that maximizes the similarity score between P and t (where P is the query tree and T is the text tree);
    - runs in O(mn) time, where m and n are the number of vertices in P and T respectively.
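The two-level idea can be sketched in a simplified form: an outer table over node pairs (u, v), and an inner alignment of the ordered children of u against the ordered children of v. The sketch below makes several assumptions that are not taken from the talk: it scores nodes only (no edge-to-edge term), uses two flat penalties `skip_deg2` and `skip_subtree`, requires every query child to be matched, and uses memoized recursion instead of explicit tables. All names are hypothetical and this is not the STRMS implementation.

```python
from dataclasses import dataclass, field
from functools import lru_cache

NEG_INF = float("-inf")


@dataclass(eq=False)                 # identity hashing, so nodes can be cache keys
class Node:
    label: str
    children: list["Node"] = field(default_factory=list)


def best_homeomorphic_match(pattern, text, sim, skip_deg2=1.0, skip_subtree=1.0):
    """Return (best score, text node v*) for mapping pattern onto a subtree of text."""

    @lru_cache(maxsize=None)
    def score(u, v):
        """Large-DP entry: best score with u mapped exactly onto v."""
        # Small DP: align the children of u (none may be skipped) with the
        # children of v (any may be skipped at a penalty), preserving order.
        row = [-skip_subtree * j for j in range(len(v.children) + 1)]
        for c in u.children:
            new = [NEG_INF] * (len(v.children) + 1)
            for j, d in enumerate(v.children, start=1):
                new[j] = max(new[j - 1] - skip_subtree,   # skip text child d
                             row[j - 1] + below(c, d))    # match c under d
            row = new
        return sim(u.label, v.label) + row[-1]

    @lru_cache(maxsize=None)
    def below(c, d):
        """Map c onto d, or onto a descendant reached by deleting degree-2 nodes."""
        best = score(c, d)
        if len(d.children) == 1:                          # d has a single child
            best = max(best, below(c, d.children[0]) - skip_deg2)
        return best

    def nodes(t):
        yield t
        for child in t.children:
            yield from nodes(child)

    return max(((score(pattern, v), v) for v in nodes(text)), key=lambda p: p[0])
```

With a similarity of +1 for equal labels and -1 otherwise, a query `M(H, H)` scores 3 against an identical text subtree and 2 against `M(H, B(H))`, where the single-child node `B` (standing for an extra bulge) is deleted at cost `skip_deg2`. Raising the penalties makes such gapped matches less attractive, which is the tradeoff the gap-penalty parameter controls.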


• We define score(u,v) to be the score of the best mapping between a subtree of P rooted in node u ∈ P and a subtree of T rooted in node v ∈ T.

The two-stage DP approach to the tree alignment

• Large DP (L DP): an m*n table over the compared trees; one entry of the table in the figure = score(a,1).
• Small DP (S DP, second-level dynamic programming): activated during computation of each non-leaf entry (u,v) in the L DP in order to compute the optimal mapping between the children of u and the children of v. In the figure it compares the subtrees of f and 9.


The algorithm returns a vertex v* ∈ T that maximizes the score S(µ : P → t_v*) (found in the last row of the L DP).

Dealing with Potential Pseudoknots 

An extension of the subtree homeomorphism algorithm handles the pseudoknot considerations posed by the riboswitches in our study.

Indeed, [Mandal et al., 2003] predicted a potential pseudoknot between the two arms of the purine riboswitch aptamer.

In order to extend our model to take such key information into consideration, we annotate the tree with this additional information by connecting node 2 and node 4 with a “potential pseudoknot” edge.

[Figure labels: node 2 “GGUAU”, node 4 “CCGUA”]

Dealing with Potential Pseudoknots 

Observations: 

• These edges break down the tree-like representation of the RNA secondary structures.
• The potential pseudoknot is confined to the subtree rooted in node 8, i.e., node 2 and node 4 are sibling nodes sharing a common parent node.
• For all riboswitch aptamer queries in this study, only one potential pseudoknot is predicted and it is always formed between two sibling leaf nodes sharing a common parent node.
• The text subtrees could be annotated with any number of potential sibling pseudoknots*.

* Based on loop sequence complementarity analysis that is executed in the preprocessing stage.

[Figure label: sibling pseudoknot edge]
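One simple way to test two sibling loops for base-pairing potential is to look for the longest contiguous antiparallel run of complementary bases. The sketch below is an assumption about what such a preprocessing check could look like, not the analysis used in STRMS; counting G-U wobble pairs and any cut-off on the run length are likewise assumptions.

```python
# Watson-Crick pairs plus G-U wobble pairs (counting wobble is an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def longest_complementary_run(loop_a: str, loop_b: str) -> int:
    """Longest contiguous antiparallel helix that the two loops could form."""
    reversed_b = loop_b[::-1]                 # strands pair in opposite directions
    best = 0
    previous = [0] * (len(reversed_b) + 1)
    for base_a in loop_a:
        current = [0] * (len(reversed_b) + 1)
        for j, base_b in enumerate(reversed_b, start=1):
            if (base_a, base_b) in PAIRS:     # extend the run ending one step back
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def may_form_pseudoknot(loop_a: str, loop_b: str, min_run: int = 3) -> bool:
    """Hypothetical rule: annotate a sibling pseudoknot edge above a cut-off."""
    return longest_complementary_run(loop_a, loop_b) >= min_run
```

The table is the same shape as a longest-common-substring DP, so the check costs time proportional to the product of the two loop lengths, which is small for hairpin loops.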

Updating the S DP 

X: pseudoknot in the query.
Y and Z: candidate pseudoknots in the text.

If arc X is to be matched to arc Y, the optimal DP path must enter block G2 through vertex (0, 2) and leave it through vertex (3, 6). In this case, the weight of the optimal path will be the sum of its three components:

OptPath G1[(0,0),(2,2)] + OptPath G2[(0, 2),(3, 6)] + OptPath G3[(1, 8),(0, 6)]

The optimal pseudoknot matching corresponds to the highest scoring path among all the candidate paths. When the number of candidate paths is constant, the pseudoknot matching increases the time complexity of the main stage by a constant factor only. This is, in practice, the observed case for the riboswitch searches applied in this study.

Taking into account sequence considerations

Variety of sequence considerations:
• Sequence alignment criterion on the single-strand regions like bulges and loops (tRNA and riboswitches).
• Sequence alignment scoring on the compared stems (miRNA).
• Sequence comparisons are performed on the small number of filtered candidates, so the effect of their runtime on the overall search is negligible.

Target database → Filtering by structure constraints → Relatively small number of structures → Applying sequence constraints → Final set of candidates

Experimental Results 

• Riboswitches
• Purine Riboswitch
• tRNA

Purine Riboswitch 

• Riboswitches:
    - Are part of an RNA molecule.
    - Directly bind small target molecules with high affinity and, as a consequence, respond with conformational switching that affects the gene’s activity.
• Purine riboswitch: binds guanine/adenine to regulate purine metabolism and transport.

Purine Riboswitch 

The secondary structure: 

• A three-stem junction with a multiloop connecting two hairpins and the 5’-3’ end.
• Significant sequence conservation occurs within P1 and in the unpaired regions.
• Some base-pairing potential exists between the two stem-loop sequences, which might permit the formation of a pseudoknot.

Results – First dataset 

• FN = 0
• Sensitivity (TP/(TP+FN)) = 1
• PPV (TP/(TP+FP)) = 1, except for Clostridium perfringens
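The two measures above can be written out as a short sketch. The counts in the usage note are made-up illustrative numbers, not results from the talk.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true motif occurrences that were found: TP / (TP + FN)."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Positive predictive value, the fraction of hits that are real: TP / (TP + FP)."""
    return tp / (tp + fp)
```

With hypothetical counts of 8 true positives, 0 false negatives and 2 false positives, `sensitivity(8, 0)` is 1.0 while `ppv(8, 2)` is 0.8, showing how FN = 0 fixes sensitivity at 1 while any false positive lowers the PPV.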

Results – Second dataset 

• The search was conducted in three stages:
    1. Based only on topological similarity, as computed via subtree homeomorphism (S1).
    2. Enhancing the structural comparison with edge and loop length criteria (S2).
    3. Combining the sequence considerations into the search (S3). This reduced the number of false positives to zero or one.
• This shows the importance of the additional constraints supported by our tool in controlling false positives.

Searching for Riboswitches in Newly Sequenced Data

Candidates in the Lactobacillus family:
• Lactobacillus acidophilus at c(237640..237705)
• Lactobacillus delbrueckii at c(251482..251547)
• Lactobacillus salivarius at c(1357553..1357618)

Sequential conservation of nucleotides in the functionally critical positions.

[Mandal et al., 2003]

Searching for riboswitches in newly sequenced data

Structural functionality was further asserted by running the RNAalifold multiple structural alignment program with the three candidate sequences as input.

[Figure labels: consistent mutations, high sequence conservation, compensatory mutations]

Consistent mutations: mutations that conserve the stem structure.

Compensatory mutations: joint events where a mutation in one nucleotide was compensated by a corresponding mutation in the paired nucleotide in order to conserve the stem structure.
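The distinction can be illustrated with a small classifier for one base-paired alignment column pair. It assumes the common reading in which a consistent mutation changes one side of the pair and a compensatory mutation changes both, with pairing preserved in either case; treating G-U as a valid pair is also an assumption, and `classify_pair_change` is a hypothetical helper, not an RNAalifold function.

```python
# Pairs considered able to hold a stem together (G-U wobble included by assumption).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def classify_pair_change(reference: str, other: str) -> str:
    """Compare one base pair (e.g. "GC") in two aligned sequences."""
    if reference not in CANONICAL or other not in CANONICAL:
        return "non-pairing"                    # the stem is not conserved here
    changed = sum(a != b for a, b in zip(reference, other))
    # 0 positions changed: conserved; 1: consistent; 2: compensatory
    return ("conserved", "consistent", "compensatory")[changed]
```

For instance, G-C against G-U is a one-sided change that still pairs (consistent), whereas G-C against A-U needs both nucleotides to change together (compensatory).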

Our Bioinformatics Group 

Nimrod Milo, Isana Veksler, Shay Zakov, Sivan Yogev, Dr. Michal Ziv-Ukelson, Tamar Pinhas, Erez Katzenelson
