31.10.2012 Views

“A structure based flexible search method for motifs in RNA” Isana ...


STRMS

“A structure based flexible search method for motifs in RNA”

Isana Veksler-Lublinsky
Michal Ziv-Ukelson
Danny Barash
Klara Kedem


Outline

- Background
- Motivation
- RNA’s structure representations
- Trees comparison
- Our Algorithm
- Results


The Central Dogma of Molecular Biology

DNA → (transcription) → RNA → (translation) → Protein

Non-Coding RNA

- An RNA molecule that is not translated into a protein.
- Non-coding RNAs have been found to have roles in a great variety of processes.


The Recent Example

The Nobel Prize in Physiology or Medicine 2006

"for their discovery of RNA interference - gene silencing by double-stranded RNA"

Andrew Z. Fire, Craig C. Mello


Non-Coding RNA Families

- They are not conserved in sequence, but they are conserved in structure.
- They have a role in regulating gene expression.
- Examples: tRNA, rRNA, snoRNA, microRNA, siRNA, riboswitch.


Motivation

- The goal is to discover ncRNA motifs in a sequence database.
- Most RNA motif search methods start from the primary sequence and only then take into account secondary structure considerations.
- Since different motifs vary in structure rigidity and in local sequence constraints, there is a need for algorithms and tools that can be fine-tuned according to the searched RNA motif.


Our Goal

Discover ncRNA motifs in a sequence database.

QUERY: [Figure: secondary structure of the query]

Genome Sequence (millions of nucleotides):

ACGCUGACGUAGUCAGUAGACGAC
AGACAGAUACGUCACCGCAGAUAC
GCAUAGUAGCAGUAGCAGAUGACG
...

Are there any appearances of this structure in the genome?


The tool - STRMS (Structural RNA Motif Search)

- Input: the secondary structure of the query, including local sequence and structure constraints, and a target sequence database.
- Output: all occurrences of the query in the target, ranked by their similarity to the query (as an HTML file).
- The tool is flexible and takes into account a large number of sequence options.
- Our approach combines:
  - pre-folding with MFOLD (Zuker, 2003)
  - an O(mn) RNA pattern matching algorithm based on subtree homeomorphism for ordered, rooted trees.


Our <strong>method</strong> consists of two phases:


RNA’s Secondary Structure

[Figure: secondary-structure elements: pseudoknot, single-stranded region, bulge loop, stem, interior loop, hairpin loop, junction (multiloop). Image: Wuchty]


RNA’s Secondary Structure

(((((((..((((.......)))).(((((.......))))).....(((((.......))))))))))))
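The dot-bracket string above encodes base pairs as matching parentheses and unpaired bases as dots. As a minimal sketch of how such a string can be read (the function name `parse_dot_bracket` and the short example string are illustrative assumptions, not part of STRMS), a stack is enough to recover the pairing:

```python
def parse_dot_bracket(structure: str) -> dict:
    """Return a map i -> j (0-based, both directions) for every base pair."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            j = stack.pop()
            pairs[i] = j
            pairs[j] = i
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


# Hypothetical toy structure: an outer stem enclosing a hairpin.
pairs = parse_dot_bracket("((..((...))..))")
print(sorted((i, j) for i, j in pairs.items() if i < j))
```

Nested parentheses map directly onto parent-child relations, which is why the same notation converts naturally into a rooted ordered tree.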


RNA’s Secondary Structure

[Figure: graph representation of the secondary structure]


Ordered rooted tree

Shapiro, 1988:

- The nodes correspond to elements of secondary structure (hairpin loop, bulge, internal loop or multiloop).
- The edges correspond to base-paired (stem) regions.

Zhang, 1998:

- The nodes of the tree represent either unpaired bases (leaves) or paired bases (internal nodes). Each node is labeled with a base or a pair of bases, respectively.
- There are two kinds of edges, connecting either consecutive stem base pairs or a leaf base with the last base pair in the corresponding stem.
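The base-level representation described for Zhang, 1998 can be sketched as follows. This is only an illustration under stated assumptions: the `Node` class, the `build_base_tree` function and the virtual `root` node that holds the top-level elements are hypothetical names introduced here, not taken from either paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # one base (leaf) or a base pair (internal)
    children: list = field(default_factory=list)


def build_base_tree(seq: str, structure: str) -> Node:
    """Build an ordered tree: paired bases become internal nodes, unpaired bases leaves."""
    if len(seq) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    root = Node("root")          # assumed virtual root
    stack = [root]
    for base, ch in zip(seq, structure):
        if ch == "(":
            node = Node(base)    # label is completed when the pair closes
            stack[-1].children.append(node)
            stack.append(node)
        elif ch == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced structure")
            stack.pop().label += base
        else:
            stack[-1].children.append(Node(base))
    if len(stack) != 1:
        raise ValueError("unbalanced structure")
    return root


def count_nodes(node: Node) -> int:
    return 1 + sum(count_nodes(c) for c in node.children)


# Toy hairpin: a 3-base-pair stem closing a 3-base loop.
tree = build_base_tree("GGGAAACCC", "(((...)))")
print(count_nodes(tree))
```

Children are appended in 5’ to 3’ order, so the left-to-right order of siblings follows the molecule. A Shapiro-style tree would collapse each run of stacked pairs into a single edge, which is what makes it coarser and smaller.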


Our tree representation

- Compressed as in [Shapiro, 1988], plus a node for every single-strand component in multiloops.
- Includes additional information on nodes and on edges for the purpose of sequence analysis.
- It is more informative than Shapiro’s tree representation and more compact than Zhang’s.
- This leads to a precise screening of the target text by first selecting candidates whose structural tree representation is similar to that of the query, and then further filtering these candidates by applying sequence considerations.


Our tree representation

[Figure: example tree with labeled parts: origin of a single structure, bulge loop, stem edge, hairpin loop, interior loop, dangling ends, single-strand components of the multiloop]

- Single-strand components and stem edges are annotated with length and sequence.
- A small circle node carries only topological information.
- The tree structure is generated from a ct file (output from mfold).
- The tree construction is ordered by the 5’ to 3’ ordering of the molecule.
- The structure is compressed but also retains the sequence information.
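As a rough sketch of the first step, reading a ct file, the snippet below extracts the sequence and the pairing column and converts them to dot-bracket form. It assumes the usual ct layout (a header line starting with the sequence length, then one line per base with the base index, the base, two neighbour indices, the index of the paired base or 0, and a numbering column). The function names and the toy file contents, including the energy value in the header, are made up for illustration.

```python
def parse_ct(text: str):
    """Return (sequence, pair) where pair[i] is the 1-based partner of base i+1, or 0."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])          # header: length first, free text after
    seq, pair = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        seq.append(fields[1])             # column 2: the base
        pair.append(int(fields[4]))       # column 5: paired index, 0 if unpaired
    return "".join(seq), pair


def to_dot_bracket(pair) -> str:
    out = []
    for i, j in enumerate(pair, start=1):
        out.append("." if j == 0 else "(" if j > i else ")")
    return "".join(out)


# Hypothetical 7-nucleotide hairpin in ct layout.
CT_TEXT = """7 dG = -1.0 toy_hairpin
1 G 0 2 7 1
2 G 1 3 6 2
3 A 2 4 0 3
4 A 3 5 0 4
5 A 4 6 0 5
6 C 5 7 2 6
7 C 6 0 1 7
"""

seq, pair = parse_ct(CT_TEXT)
print(seq, to_dot_bracket(pair))
```

From the pair table, stems, loops and single-strand components can then be grouped into tree nodes and edges in 5’ to 3’ order.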




Comparison of ordered rooted trees

- Trees are among the most common and well-studied combinatorial structures in computer science. In particular, the problem of comparing trees occurs in several diverse areas such as:
  - computational biology
  - structured text databases
  - image analysis
  - automatic theorem proving
  - compiler optimization.


Comparison of ordered rooted trees

- The following operations are defined on ordered trees:
  - relabel: change the label of a node v in T.
  - delete: delete a non-root node v in T with parent v′, making the children of v become the children of v′. The children are inserted in the place of v as a subsequence in the left-to-right order of the children of v′.
  - insert: the complement of delete. Insert a node v as a child of v′ in T, making v the parent of a consecutive subsequence of the children of v′.
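The three operations can be made concrete with a small sketch on an ordered tree stored as nested child lists. The `Node` class, the function signatures (a parent plus child positions) and the `to_str` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def relabel(v: Node, new_label: str) -> None:
    v.label = new_label


def delete(parent: Node, i: int) -> Node:
    """Delete the i-th child of parent; its children take its place, in order."""
    v = parent.children[i]
    parent.children[i:i + 1] = v.children
    return v


def insert(parent: Node, label: str, start: int, end: int) -> Node:
    """Insert a new child that adopts the consecutive children parent.children[start:end]."""
    v = Node(label, parent.children[start:end])
    parent.children[start:end] = [v]
    return v


def to_str(node: Node) -> str:
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(to_str(c) for c in node.children) + ")"


t = Node("a", [Node("b"), Node("c", [Node("d"), Node("e")]), Node("f")])
before = to_str(t)
delete(t, 1)                 # remove c; d and e move up in its place
after_delete = to_str(t)
insert(t, "c", 1, 3)         # re-insert c above the consecutive children d, e
after_insert = to_str(t)
print(before, after_delete, after_insert)
```

The round trip illustrates why insert is described as the complement of delete: re-inserting the node over the same consecutive run of children restores the original tree.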


1. Edit distance

- An edit script S between T1 and T2 is a sequence of edit operations turning T1 into T2.
- The tree edit distance problem is to compute the edit distance and a corresponding edit script.
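The slide does not spell out the distance itself. A common formulation, given here as an assumption rather than as the authors' definition, assigns each operation s a non-negative cost γ(s), sums the costs along a script, and takes the cheapest script:

```latex
\gamma(S) = \sum_{i=1}^{k} \gamma(s_i), \qquad S = s_1, \dots, s_k

\delta(T_1, T_2) = \min \bigl\{\, \gamma(S) \;\bigm|\; S \text{ is an edit script turning } T_1 \text{ into } T_2 \,\bigr\}
```

Here γ and δ are illustrative symbols; with unit costs, δ simply counts the fewest relabel, delete and insert operations needed.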




2. Tree Inclusion

T1 is included in T2 if there is a sequence of delete operations performed on T2 which makes T2 isomorphic to T1. The tree inclusion problem is to decide if T1 is included in T2.




- Polynomial time algorithms exist for these problems. They are all based on the classical technique of dynamic programming and most of them are simple combinatorial algorithms.


Comparison of ordered rooted trees

- Ordered tree comparison is generally computed by tree edit distance, which allows various forms of deletions and insertions in both query and target.
- The search for small non-coding RNAs naturally yields a more specific tree search formulation, since we do not allow deletions in the query.
- In our method we apply a weighted pattern matching algorithm for finding the best homeomorphic mapping between two rooted ordered trees.
- Specific constraints on the searched motif can be defined in the input to the search: structural constraints (lengths), allowing or forbidding element deletion in the target, and sequence constraints (existence of sibling pseudoknots, locally conserved sequence segments).


The Algorithm

- The subtree isomorphism problem [Matula, 1968, 1978]: given a pattern tree P and a text tree T, find a subtree of T which is isomorphic to P, i.e. a subtree that is identical in structure to P and can be obtained by removing entire subtrees of T, or decide that there is no such subtree.
- The subtree homeomorphism problem [Chung, 1987, Reyner, 1977, Pinter et al., 2004]: a variant of the former problem, where degree-2 nodes can be deleted from the text tree.

[Figure: homeomorphism example]


The Algorithm - Motivation

- Point-mutation events could easily result in an extra bulge in an RNA structure.
- However, in some cases the functional homology to the original, non-mutated structure is still preserved.
- The suggested alignment should be flexible enough to allow the deletion of degree-2 nodes from the target tree.

[Figure: riboswitch and its functional homologue, with the bulge marked]


The Algorithm - Motivation

- In some cases subtrees may be deleted from the target tree but not from the query tree, as in the tRNA case.

Subtree homeomorphism on ordered rooted trees is more efficient (quadratic in input size) than tree edit distance (cubic in input size).


Subtree Homeomorphism Score

- Let T1 and T2 be two ordered, rooted, homeomorphic trees.
- A mapping µ : T1 → T2 is a one-to-one map from the nodes of T1 to the nodes of T2 that preserves the ancestor relations of the nodes and their relative order.
- The subtree homeomorphism score of the mapping, denoted S(µ), is built from the following terms [formula not recoverable from the source]:
  - a user-defined node-to-node similarity score function;
  - an edge-to-edge similarity score function, where e_u ∈ T1 and e_v ∈ T2 are corresponding edges;
  - the penalty of deleting a degree-2 node from T2;
  - the penalty for deleting any other node.
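One plausible way to combine these four terms is an additive score with the two penalties subtracted. The form below is a hedged reconstruction: the symbols Δ, Φ, δ, ω and the sets D_2 and D_other are assumptions introduced here, and the original slide formula may differ.

```latex
S(\mu) \;=\; \sum_{u \in T_1} \Delta\bigl(u, \mu(u)\bigr)
       \;+\; \sum_{e_u \in T_1} \Phi\bigl(e_u, e_v\bigr)
       \;-\; \delta \cdot \lvert D_2 \rvert
       \;-\; \omega \cdot \lvert D_{\mathrm{other}} \rvert
```

Under these assumptions, Δ is the node-to-node similarity, Φ the edge-to-edge similarity on corresponding edges, D_2 the set of degree-2 nodes deleted from T2 with per-node penalty δ, and D_other the set of all other deleted nodes with per-node penalty ω.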


Subtree Homeomorphism Score

- Given two rooted ordered trees, P and T, the weighted subtree homeomorphism problem is to find a homeomorphism-preserving mapping µ : P → t from P to some subtree t of T, such that S(µ) is maximal.


Subtree Homeomorphism Score

- The cost function varies from one application to another, depending upon the amount of information supplied with the query.
- The simplest one just compares the topology of the structures.
- More complex functions include length differences of the structural elements, sequence conservation and pseudoknot matching.
- The node deletion score (i.e., gap penalty) reflects the tradeoff between a gap and a mismatch. As the gap penalty increases, the algorithm tends to match distant nodes to avoid gaps. As different values may suit different needs, our tool enables users to set this parameter for each run.


The Tree Alignment Algorithm

- A bottom-up, two-level dynamic programming (DP) algorithm:
  - it computes an optimal alignment between P and the subtree t of T that maximizes the similarity score between P and t (where P is the query tree and T is the text tree);
  - it runs in O(mn) time, where m and n are the number of vertices in P and T respectively.


The Tree Alignment Algorithm

- We define score(u,v) to be: [formula not recoverable from the source], where the two compared objects are a subtree of P rooted in node u ∈ P and a subtree of T rooted in node v ∈ T.


The two-stage DP approach to the tree alignment

[Figure: the compared trees and the two DP tables; the highlighted large-DP entry is score(a,1)]

- Large DP (L DP): an m*n table.
- Small DP (second-level dynamic programming): activated during computation of each non-leaf entry (u,v) in the L DP in order to compute the optimal mapping between the children of u and the children of v. In the figure, the small DP compares the subtrees of f and 9.
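The two-level scheme can be sketched as follows. This is a simplified illustration under explicit assumptions, not the STRMS implementation: edge scores are omitted, `node_sim`, `deg2_penalty` and `del_penalty` are hypothetical parameters, a pruned text subtree costs `del_penalty` per node, and a text node may be bypassed if all but one of its child subtrees are pruned.

```python
from dataclasses import dataclass, field

NEG_INF = float("-inf")


@dataclass(eq=False)
class Node:
    label: str
    children: list = field(default_factory=list)


def postorder(root):
    out = []
    def visit(n):
        for c in n.children:
            visit(c)
        out.append(n)
    visit(root)
    return out


def align_children(u, v, best, del_cost):
    """Small DP: map all children of u, in order, into the ordered children of v."""
    cu, cv = u.children, v.children
    D = [[NEG_INF] * (len(cv) + 1) for _ in range(len(cu) + 1)]
    D[0][0] = 0.0
    for j in range(1, len(cv) + 1):                 # only text subtrees may be skipped
        D[0][j] = D[0][j - 1] - del_cost[id(cv[j - 1])]
    for i in range(1, len(cu) + 1):
        for j in range(1, len(cv) + 1):
            skip = D[i][j - 1] - del_cost[id(cv[j - 1])]
            pair = D[i - 1][j - 1] + best[id(cu[i - 1]), id(cv[j - 1])]
            D[i][j] = max(skip, pair)
    return D[len(cu)][len(cv)]


def best_homeomorphic_match(P, T, node_sim, deg2_penalty=1.0, del_penalty=1.0):
    """Large DP over all (u, v); returns (score, v*) with the root of P mapped to v*."""
    del_cost = {}
    for v in postorder(T):
        del_cost[id(v)] = del_penalty + sum(del_cost[id(c)] for c in v.children)
    match, best = {}, {}
    for u in postorder(P):
        for v in postorder(T):
            m = node_sim(u, v) + align_children(u, v, best, del_cost)
            match[id(u), id(v)] = m
            kids = sum(del_cost[id(c)] for c in v.children)
            b = m
            for c in v.children:
                # bypass v: prune its other subtrees and map u below child c
                b = max(b, best[id(u), id(c)] - deg2_penalty - (kids - del_cost[id(c)]))
            best[id(u), id(v)] = b
    v_star = max(postorder(T), key=lambda v: match[id(P), id(v)])
    return match[id(P), id(v_star)], v_star


def sim(u, v):
    return 1.0 if u.label == v.label else -1.0

# Query a(b, c); the text has an extra single-child node y above c.
P = Node("a", [Node("b"), Node("c")])
T = Node("a", [Node("b"), Node("y", [Node("c")])])
score, v_star = best_homeomorphic_match(P, T, sim)
print(score, v_star.label)
```

In this toy case the three query nodes match by label and the extra single-child text node is bypassed once, so the score is three matches minus one degree-2 penalty. Raising `deg2_penalty` makes such bypasses less attractive, which corresponds to the gap-penalty tradeoff.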


The Tree Alignment Algorithm

The algorithm returns a vertex v* ∈ T that maximizes the score S(µ : P → t_v*) (found in the last row of the L DP).


Dealing with Potential Pseudoknots

Extension of the subtree homeomorphism algorithm to handle the pseudoknot considerations posed by the riboswitches in our study.

Indeed, [Mandal et al., 2003] predicted a potential pseudoknot between the two arms of the purine riboswitch aptamer.

In order to extend our model to take such key information into consideration, we annotate the tree with this additional information by connecting node 2 and node 4 with a “potential pseudoknot” edge.

[Figure: node 2 (“GGUAU”) and node 4 (“CCGUA”) joined by a potential pseudoknot edge]


Dealing with Potential Pseudoknots

Observations:

- These edges break down the tree-like representation of the RNA secondary structures.
- The potential pseudoknot is confined to the subtree rooted in node 8, i.e., node 2 and node 4 are sibling nodes sharing a common parent node.
- For all riboswitch aptamer queries in this study, only one potential pseudoknot is predicted and it is always formed between two sibling leaf nodes sharing a common parent node.
- The text subtrees could be annotated with any number of potential sibling pseudoknots*.

* Based on loop sequence complementarity analysis that is executed in the preprocessing stage.

[Figure: sibling pseudoknot edge]
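The loop sequence complementarity analysis is not specified in the slides. One simple way to approximate it, given purely as an assumption, is to look for the longest contiguous stretch of one loop that can pair antiparallel with the other. The function names, the inclusion of G-U wobble pairs and the `min_run` threshold are hypothetical.

```python
# Watson-Crick pairs plus G-U wobble (the wobble pairs are an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def longest_complementary_run(loop_a: str, loop_b: str) -> int:
    """Length of the longest contiguous antiparallel pairing between two loops."""
    rb = loop_b[::-1]                      # antiparallel: read the second loop 3' to 5'
    best = 0
    prev = [0] * (len(rb) + 1)
    for x in loop_a:                       # longest-common-substring style DP,
        cur = [0] * (len(rb) + 1)          # with "can pair" in place of "equal"
        for j, y in enumerate(rb, start=1):
            if (x, y) in PAIRS:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def is_potential_pseudoknot(loop_a: str, loop_b: str, min_run: int = 3) -> bool:
    return longest_complementary_run(loop_a, loop_b) >= min_run


# Toy loops that are perfect reverse complements of each other.
run = longest_complementary_run("GGGAA", "UUCCC")
print(run, is_potential_pseudoknot("GGGAA", "UUCCC"))
```

Sibling leaf pairs that pass such a test could then be annotated with a potential pseudoknot edge before the search starts.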


Updating the S DP

- X: pseudoknot in the query.
- Y and Z: candidate pseudoknots in the text.

If arc X is to be matched to arc Y, the optimal DP path must enter block G2 through vertex (0, 2) and leave it through vertex (3, 6). In this case, the weight of the optimal path will be the sum of its three components:

OptPath G1[(0,0),(2,2)] + OptPath G2[(0, 2),(3, 6)] + OptPath G3[(1, 8),(0, 6)]

The optimal pseudoknot matching corresponds to the highest scoring path among all the optional paths. When the number of optional paths is constant, the pseudoknot matching increases the time complexity of the main stage by a constant factor only. This is, in practice, the observed case for the riboswitch searches applied in this study.


Taking into account sequence considerations

Variety of sequence considerations:

- Sequence alignment criterion on the single-strand regions like bulges and loops (tRNA and riboswitches).
- Sequence alignment scoring on the compared stems (miRNA).
- Sequence comparisons are performed on the small number of filtered candidates, so the effect of their runtime on the overall search is negligible.

Target database → Filtering by structure constraints → Relatively small number of structures → Applying sequence constraints → Final set of candidates
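The filtering flow above can be sketched as a two-stage pipeline in which the sequence score is computed only for candidates that survive the structural filter. The function `two_stage_search`, its cutoff parameters and the toy candidate records are illustrative assumptions, not the tool's interface.

```python
def two_stage_search(candidates, structure_score, sequence_score,
                     structure_cutoff, sequence_cutoff):
    """Filter by structure first, then score and rank the survivors by sequence."""
    survivors = [c for c in candidates if structure_score(c) >= structure_cutoff]
    scored = [(sequence_score(c), c) for c in survivors]   # runs on few candidates only
    hits = [(s, c) for s, c in scored if s >= sequence_cutoff]
    hits.sort(key=lambda sc: sc[0], reverse=True)
    return hits


# Made-up candidates with precomputed toy scores.
candidates = [
    {"id": "c1", "structure": 0.90, "sequence": 0.80},
    {"id": "c2", "structure": 0.40, "sequence": 0.95},
    {"id": "c3", "structure": 0.95, "sequence": 0.30},
    {"id": "c4", "structure": 0.85, "sequence": 0.90},
]
hits = two_stage_search(
    candidates,
    structure_score=lambda c: c["structure"],
    sequence_score=lambda c: c["sequence"],
    structure_cutoff=0.8,
    sequence_cutoff=0.5,
)
print([c["id"] for _, c in hits])
```

Because the sequence score is evaluated only on the structural survivors, its cost scales with the size of the filtered set rather than with the whole target database.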


Experimental Results

- Riboswitches
  - Purine Riboswitch
- tRNA


Purine Riboswitch

- Riboswitches:
  - Part of an RNA molecule.
  - Directly bind small target molecules with high affinity and, as a consequence, respond with conformational switching that affects the gene’s activity.
- Purine riboswitch: binds guanine/adenine to regulate purine metabolism and transport.


Purine Riboswitch

The secondary structure:

- A three-stem junction with a multiloop connecting two hairpins and the 5’-3’ end.
- Significant sequence conservation occurs within P1 and in the unpaired regions.
- Some base-pairing potential exists between the two stem-loop sequences, which might permit the formation of a pseudoknot.


Results – First dataset

- FN = 0
- Sensitivity (TP/(TP+FN)) = 1
- PPV (TP/(TP+FP)) = 1, except for Clostridium perfringens
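For reference, the two measures can be computed as below. The counts in the usage lines are made-up numbers for illustration, not results from the study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): the fraction of true occurrences that were found."""
    if tp + fn == 0:
        raise ValueError("no positive examples")
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """TP / (TP + FP): the fraction of reported hits that are true occurrences."""
    if tp + fp == 0:
        raise ValueError("no reported hits")
    return tp / (tp + fp)


# Hypothetical counts: every true site found, two extra false hits reported.
print(sensitivity(tp=10, fn=0), ppv(tp=10, fp=2))
```

With FN = 0 the sensitivity is always 1, so any remaining difference between genomes shows up only in the PPV.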


Results – Second dataset

- The search was conducted in three stages:
  1. Based only on topological similarity, as computed via subtree homeomorphism (S1).
  2. Enhancing the structural comparison with edge and loop length criteria (S2).
  3. Combining the sequence considerations into the search (S3). This reduced the number of false positives to zero or one.
- This shows the importance of the additional constraints supported by our tool in controlling false positives.


Searching for Riboswitches in Newly Sequenced Data

Lactobacillus family:

- Lactobacillus acidophilus at c(237640..237705)
- Lactobacillus delbrueckii at c(251482..251547)
- Lactobacillus salivarius at c(1357553..1357618)

Sequential conservation of nucleotides in the functionally critical positions.

[Mandal et al., 2003]


Searching for riboswitches in newly sequenced data

Structural functionality was further asserted by running the RNAalifold multiple structural alignment program with the three candidate sequences as input.

[Figure: alignment annotated with consistent mutations, high sequence conservation and compensatory mutations]

- Consistent mutations: mutations that conserve the stem structure.
- Compensatory mutations: joint events where a mutation in one nucleotide was compensated by a corresponding mutation in the paired nucleotide in order to conserve the stem structure.
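A minimal sketch of how a change at one base-paired position could be classified is given below. It assumes the common convention that a change on one side that still pairs (for example G-C to G-U) is consistent, while a change on both sides that still pairs (for example G-C to A-U) is compensatory. The function name, the `disruptive` category and the inclusion of G-U wobble pairs are assumptions.

```python
# Pairs accepted as stem-forming (Watson-Crick plus G-U wobble, an assumption).
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}


def classify_pair_change(ref_pair: str, alt_pair: str) -> str:
    """Compare a reference base pair with the pair seen at the same stem position."""
    if ref_pair not in CANONICAL:
        raise ValueError(f"reference {ref_pair!r} is not a valid base pair")
    if alt_pair == ref_pair:
        return "conserved"
    if alt_pair not in CANONICAL:
        return "disruptive"            # the stem is not preserved at this position
    changed = sum(a != b for a, b in zip(ref_pair, alt_pair))
    return "compensatory" if changed == 2 else "consistent"


for ref, alt in [("GC", "GC"), ("GC", "GU"), ("GC", "AU"), ("GC", "GA")]:
    print(ref, alt, classify_pair_change(ref, alt))
```

Columns with many compensatory changes are usually read as stronger evidence for a conserved stem than columns that are merely identical in sequence.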


Our Bioinformatics Group

- Nimrod Milo
- Isana Veksler
- Shay Zakov
- Sivan Yogev
- Dr. Michal Ziv-Ukelson
- Tamar Pinhas
- Erez Katzenelson
