BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015

Abstract ID: P

Poster

10th Benelux Bioinformatics Conference bbc 2015

P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS

Lionel Morgado¹* & Frank Johannes²,³.

Groningen Bioinformatics Centre (GBiC), University of Groningen¹; Department of Plant Sciences, Center of Life and Food Sciences Weihenstephan, Technical University Munich²; Institute of Advanced Studies, Technical University Munich³. * lionelmorgado@gmail.com

Small RNAs (sRNAs) play an important role in the regulation of gene expression, either through post-transcriptional silencing or through the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a given sRNA is tightly linked to the Argonaute protein (AGO) to which it binds. High-throughput sequencing in combination with immunoprecipitation techniques has made it possible to determine the sequences of sRNAs that are bound to different AGO families. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing data from A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and classification in plants.

INTRODUCTION

Small RNA molecules are known to play an important role in the control of gene expression. It is therefore of great interest to detect them and to determine the regulatory pathways in which they are involved. With current laboratory methods it is infeasible to test the large number of sRNA candidates, but computational methods can greatly narrow down the list. Nevertheless, sRNA activity is still far from fully understood, which is reflected in the very high false positive rate of the prediction tools currently available. High-throughput sequencing in combination with immunoprecipitation (IP) techniques now makes it possible to access the sRNA sequences associated with specific AGOs. AGO-sRNA binding is a fundamental step in the activation of specific silencing pathways. Here, AGO-sRNA data acquired from A. thaliana are explored with SVM-based algorithms to learn which sequence features drive different AGO-sRNA associations. Using this knowledge, a framework for in silico sRNA detection and classification in plants is presented.

METHODS

A system with three layers of classifiers (see Figure 1) was designed to identify different kinds of sRNA. The 1st layer consists of a binary SVM model that filters out sequences that do not bind to AGO and are therefore most probably inactive. The 2nd layer is composed of an ensemble of binary classifiers, each trained to distinguish sRNAs bound to a specific AGO from those bound to all others. Finally, the 3rd layer comprises a multiclass linear model that assigns the best-matching AGO to a given sRNA, using the scores produced in the previous layer.
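The flow of a sequence through the three layers can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function `classify_srna`, the `Scorer` type, the decision threshold of 0 and all the toy models are hypothetical, and trained SVM decision functions would take the place of the lambdas.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical type: a scorer maps an sRNA sequence to a signed decision score
# (positive = member of the class), as an SVM decision function would.
Scorer = Callable[[str], float]


def classify_srna(
    seq: str,
    ago_filter: Scorer,
    ago_scorers: Dict[str, Scorer],
    layer3_weights: Dict[str, List[float]],
    layer3_bias: Dict[str, float],
) -> Optional[str]:
    """Run one sequence through the three layers; return an AGO name or None."""
    # Layer 1: AGO vs noAGO. Non-binders are discarded (threshold 0 is an assumption).
    if ago_filter(seq) <= 0.0:
        return None

    # Layer 2: one "AGOx vs otherAGO" score per AGO, in a fixed (sorted) order.
    names = sorted(ago_scorers)
    scores = [ago_scorers[name](seq) for name in names]

    # Layer 3: multiclass linear model over the layer-2 scores; highest value wins.
    # Each weight vector must follow the same sorted order as `names`.
    def linear(ago: str) -> float:
        return layer3_bias[ago] + sum(
            w * s for w, s in zip(layer3_weights[ago], scores)
        )

    return max(names, key=linear)


# Toy stand-ins for trained models (illustrative only, not fitted to any data).
toy_filter: Scorer = lambda s: 1.0 if 20 <= len(s) <= 24 else -1.0
toy_scorers: Dict[str, Scorer] = {
    "AGO1": lambda s: 1.0 if s[0] == "U" else -1.0,
    "AGO2": lambda s: 1.0 if s[0] == "A" else -1.0,
}
toy_weights = {"AGO1": [1.0, 0.0], "AGO2": [0.0, 1.0]}
toy_bias = {"AGO1": 0.0, "AGO2": 0.0}

print(classify_srna("UGACAGAAGAGAGUGAGCAC", toy_filter, toy_scorers,
                    toy_weights, toy_bias))
```

Returning `None` at layer 1 keeps sequences predicted as non-binders out of the later layers, mirroring the filtering role described for the first classifier.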

Diverse AGO-sRNA libraries from A. thaliana were explored, namely for AGO1, AGO2, AGO4, AGO5, AGO6, AGO7, AGO9 and AGO10. After the typical RNA-seq library preprocessing, quality check and genome mapping, several features were extracted from the remaining sequences: position-specific base composition, sequence length, k-mer composition and entropy scores. The different feature sets were explored separately and in different combinations.
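As an illustration of the four feature families listed above, the sketch below computes them for a single sequence. The details are assumptions, since the abstract does not specify them: k = 2 for the k-mers, one-hot encoding of the first five positions from the 5' end, and Shannon entropy of the base composition as the entropy score. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict

BASES = "ACGU"


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the base composition (assumed entropy score)."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def kmer_composition(seq: str, k: int = 2) -> Dict[str, float]:
    """Relative frequency of every possible k-mer over the RNA alphabet."""
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    total = max(n_windows, 1)  # avoid division by zero for very short reads
    return {
        f"kmer_{''.join(p)}": counts["".join(p)] / total
        for p in product(BASES, repeat=k)
    }


def positional_composition(seq: str, n_positions: int = 5) -> Dict[str, float]:
    """One-hot encoding of the base at each of the first n positions (5' end)."""
    return {
        f"pos{i + 1}_{b}": 1.0 if i < len(seq) and seq[i] == b else 0.0
        for i in range(n_positions)
        for b in BASES
    }


def extract_features(seq: str, k: int = 2, n_positions: int = 5) -> Dict[str, float]:
    """Combine length, entropy, k-mer and positional features for one sRNA."""
    seq = seq.upper().replace("T", "U")  # accept DNA-alphabet reads as well
    feats = {"length": float(len(seq)), "entropy": shannon_entropy(seq)}
    feats.update(kmer_composition(seq, k))
    feats.update(positional_composition(seq, n_positions))
    return feats


features = extract_features("TGACAGAAGAGAGTGAGCAC")
print(len(features), features["length"], features["pos1_U"])
```

Keeping each family in its own function makes it straightforward to evaluate the feature sets separately or in combination, as described in the text.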

Initially, highly correlated features (Pearson correlation > 0.75) were removed, and the remaining ones were further subjected to selection using SVM-RFE (Guyon et al., 2002) with a linear kernel to handle the large data set size. A 10-fold cross-validation procedure was used to account for the variation in the data: in each round, the best features were determined as those with the highest average weight across the models with the best ROC-AUC score in each cross-validation subset. In each round, the worst-performing third of the remaining features was eliminated, and the process was repeated until no features remained. The best features found were then used to train the final classifiers using RBF kernels with optimized parameters. This was repeated for all models in layers 1 and 2.
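The two selection steps, the correlation filter and the elimination of one third of the features per round, can be sketched without any SVM library. This sketch rests on several assumptions not stated in the abstract: the absolute value of the Pearson coefficient is compared with the threshold, the first feature of a correlated pair is the one kept, and features are ranked by mean absolute weight across folds. The function names and the toy weights are hypothetical; in practice the per-fold weights would come from the best linear SVM of each cross-validation fold.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 for constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0
    return cov / (sx * sy)


def drop_correlated(columns: Dict[str, List[float]], threshold: float = 0.75) -> List[str]:
    """Greedy filter: keep a feature only if |r| <= threshold with all kept ones."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def rfe_round(fold_weights: List[Dict[str, float]], divisor: int = 3) -> List[str]:
    """One elimination round: drop the lowest-ranked 1/divisor of the features.

    fold_weights holds one {feature: weight} dict per cross-validation fold.
    """
    names = list(fold_weights[0])
    mean_abs = {
        n: sum(abs(w[n]) for w in fold_weights) / len(fold_weights) for n in names
    }
    n_drop = max(1, len(names) // divisor)  # always remove at least one feature
    ranked = sorted(names, key=lambda n: mean_abs[n], reverse=True)
    return ranked[: len(ranked) - n_drop]


def elimination_schedule(n_features: int, divisor: int = 3) -> List[int]:
    """Number of features entering each round until none are left."""
    sizes = []
    while n_features > 0:
        sizes.append(n_features)
        n_features -= max(1, n_features // divisor)
    return sizes


# Toy data (illustrative only).
cols = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8.1], "z": [1, -1, 1, -1]}
print(drop_correlated(cols))
print(elimination_schedule(9))
```

Removing a fraction of the features per round rather than one at a time reduces the number of retraining rounds, which matters when each round involves a full 10-fold cross-validation.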

[Figure 1 panels. Layer 1: AGO vs noAGO. Layer 2: AGO1 vs otherAGO, AGO2 vs otherAGO, …, AGO10 vs otherAGO. Layer 3: final AGO prediction.]

FIGURE 1. Proposed architecture for the SVM-based framework.

RESULTS & DISCUSSION

Although the classifiers are still being optimized, preliminary results from the 2nd layer of the framework (see Figure 1) show that the features ranked highest by SVM-RFE do reflect biologically relevant patterns of AGO-sRNA association. Among others, the relevance of the 5’ terminal nucleotide was observed, in agreement with findings from previous work (Mi et al., 2008). Additionally, the accuracy of the trained models ranges from 71% to 86%, showing their capacity to recognize specific AGO-binding patterns.

REFERENCES

Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn 46: 389-422 (2002).

Mi S et al. Sorting of small RNAs into Arabidopsis Argonaute complexes is directed by the 5’ terminal nucleotide. Cell 133(1): 116-27 (2008).

Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5: 413, 209-215 (2014).
