BeNeLux Bioinformatics Conference – Antwerp, December 7-8 <strong>2015</strong><br />
Abstract ID: P<br />
Poster<br />
10th Benelux Bioinformatics Conference <strong>bbc</strong> <strong>2015</strong><br />
P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA<br />
DETECTION AND CLASSIFICATION IN PLANTS<br />
Lionel Morgado¹* & Frank Johannes²,³.<br />
Groningen Bioinformatics Centre (GBiC), University of Groningen¹; Department of Plant Sciences, Center of Life and<br />
Food Sciences Weihenstephan, Technical University Munich²; Institute of Advanced Studies, Technical University<br />
Munich³. *lionelmorgado@gmail.com<br />
Small RNAs (sRNAs) play an important role in the regulation of gene expression, either through post-transcriptional<br />
silencing or through the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a<br />
given sRNA is closely tied to the Argonaute (AGO) protein to which it binds. High-throughput sequencing in<br />
combination with immunoprecipitation techniques has made it possible to determine the sequences of sRNAs that are<br />
bound to different AGOs. Here we apply Support Vector Machines (SVMs) to recent AGO-sRNA sequencing<br />
data from A. thaliana to learn which sRNA sequence features govern differential association with specific AGOs. Our<br />
SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and<br />
classification in plants.<br />
INTRODUCTION<br />
Small RNA molecules are known to play an important<br />
role in the control of gene expression. It is therefore of great<br />
interest to detect them and to determine the<br />
regulatory pathways in which they are involved. Testing the<br />
large number of sRNA candidates with current laboratory<br />
methods is unfeasible, but computational<br />
methods can greatly narrow down the list.<br />
Nevertheless, sRNA activity is still far from fully<br />
understood, which is reflected in the very high false<br />
positive rate of the prediction tools currently available.<br />
High-throughput sequencing combined with<br />
immunoprecipitation (IP) techniques now makes it<br />
possible to obtain the sRNA sequences associated with<br />
specific AGOs. AGO-sRNA binding is a fundamental step<br />
in the activation of specific silencing pathways. Here,<br />
AGO-sRNA data acquired from A. thaliana are explored<br />
with SVM-based algorithms to learn which sequence<br />
features drive different AGO-sRNA associations. Using<br />
this knowledge, a framework for in silico sRNA detection<br />
and classification in plants is presented.<br />
METHODS<br />
A system with three layers of classifiers (see Figure 1) was<br />
designed to identify different kinds of sRNA. The first layer<br />
is a binary SVM model that filters out sequences<br />
that do not bind to any AGO and are therefore most probably<br />
inactive. The second layer is an ensemble of binary<br />
classifiers, each trained to distinguish sRNAs<br />
bound to one specific AGO from those bound to all others. Finally, the<br />
third layer is a multiclass linear model that assigns the<br />
most likely AGO to a given sRNA, using the scores produced<br />
by the previous layer.<br />
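The flow of a sequence through the three layers can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: `predict_ago`, the `Scorer` type and all `toy_*` objects are hypothetical names, and the toy scoring rules are arbitrary placeholders standing in for trained SVM decision functions.<br />

```python
from typing import Callable, Dict, Optional

# Hypothetical scorer type: maps an sRNA sequence to a signed decision score
# (positive = positive class), as an SVM decision function would.
Scorer = Callable[[str], float]


def predict_ago(
    seq: str,
    layer1: Scorer,
    layer2: Dict[str, Scorer],
    layer3_weights: Dict[str, Dict[str, float]],
    layer3_bias: Dict[str, float],
) -> Optional[str]:
    """Return the predicted AGO label, or None if layer 1 rejects the sequence."""
    # Layer 1: AGO vs noAGO filter.
    if layer1(seq) <= 0.0:
        return None

    # Layer 2: one "this AGO vs other AGOs" score per AGO.
    scores = {ago: scorer(seq) for ago, scorer in layer2.items()}

    # Layer 3: multiclass linear model over the layer-2 scores.
    def linear_score(ago: str) -> float:
        weights = layer3_weights[ago]
        return layer3_bias[ago] + sum(weights[name] * s for name, s in scores.items())

    return max(scores, key=linear_score)


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_layer1(seq: str) -> float:
    return 1.0 if 20 <= len(seq) <= 24 else -1.0


toy_layer2: Dict[str, Scorer] = {
    "AGO1": lambda s: 1.0 if s.startswith("U") else -1.0,
    "AGO2": lambda s: 1.0 if s.startswith("A") else -1.0,
}
# Identity weights: layer 3 simply picks the AGO with the highest layer-2 score.
toy_weights = {
    "AGO1": {"AGO1": 1.0, "AGO2": 0.0},
    "AGO2": {"AGO1": 0.0, "AGO2": 1.0},
}
toy_bias = {"AGO1": 0.0, "AGO2": 0.0}
```

Calling `predict_ago(seq, toy_layer1, toy_layer2, toy_weights, toy_bias)` returns `None` for sequences rejected by the first layer, so the later layers only ever see candidates that passed the AGO vs noAGO filter.<br />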
Several AGO-sRNA libraries from A. thaliana were<br />
explored, namely those for AGO1, 2, 4, 5, 6, 7, 9 and 10.<br />
After standard RNA-seq library preprocessing, quality<br />
control and genome mapping, several features were<br />
extracted from the remaining sequences: position-specific<br />
base composition, sequence length, k-mer<br />
composition and entropy scores. The different feature sets<br />
were explored separately and in different combinations.<br />
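The four feature families listed above could be computed along these lines. This is a minimal sketch under stated assumptions: the function names, the choice of k = 2, the restriction to the first five positions, and the use of Shannon entropy of the base composition as the "entropy score" are all illustrative choices that the abstract does not specify.<br />

```python
import math
from collections import Counter
from itertools import product
from typing import Dict

BASES = "ACGU"


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits) of the base composition; assumed entropy score."""
    if not seq:
        return 0.0
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def extract_features(seq: str, k: int = 2, max_pos: int = 5) -> Dict[str, float]:
    """Build a flat feature dictionary for one sRNA sequence."""
    seq = seq.upper().replace("T", "U")
    feats: Dict[str, float] = {"length": float(len(seq))}

    # Position-specific base composition: one-hot encoding of the first
    # max_pos positions (all zeros where the sequence is shorter).
    for i in range(max_pos):
        for base in BASES:
            present = i < len(seq) and seq[i] == base
            feats[f"pos{i + 1}_{base}"] = 1.0 if present else 0.0

    # k-mer composition: relative frequency of every possible k-mer.
    n_windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for kmer in map("".join, product(BASES, repeat=k)):
        feats[f"kmer_{kmer}"] = counts[kmer] / n_windows

    feats["entropy"] = shannon_entropy(seq)
    return feats
```

Keeping each family under its own key prefix (`pos`, `kmer_`, `length`, `entropy`) makes it straightforward to evaluate the feature sets separately or in combination.<br />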
First, highly correlated features (Pearson correlation > 0.75)<br />
were removed, and the remaining ones were<br />
subjected to selection using SVM-RFE (Guyon et al.,<br />
2002) with a linear kernel to handle the large data set size.<br />
A 10-fold cross-validation procedure was used to<br />
account for the variation in the data: in each round, the best<br />
features were taken to be those with the highest<br />
average weight across the models with the best ROC-AUC<br />
score in each cross-validation subset. In each round, the third of<br />
the remaining features with the worst performance was<br />
eliminated, and the process was repeated until no<br />
features remained. The best features found were then<br />
used to train the final classifiers using RBF kernels with<br />
optimized parameters. This was repeated for all models in<br />
layers 1 and 2.<br />
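The two selection steps, correlation filtering followed by recursive elimination of the worst third of the features, can be sketched as below. This is a simplified illustration rather than the authors' code: all function names are hypothetical, the correlation filter is assumed to be greedy and to use the absolute Pearson coefficient, and `weigh` is a placeholder callback standing in for "train linear SVMs under 10-fold cross-validation and return the average weight of each feature".<br />

```python
import math
from typing import Callable, Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns: Dict[str, Sequence[float]],
                    threshold: float = 0.75) -> List[str]:
    """Greedily keep a feature only if |r| <= threshold with every kept one."""
    kept: List[str] = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def rfe_rounds(features: List[str],
               weigh: Callable[[List[str]], Dict[str, float]]) -> List[List[str]]:
    """Return the feature set evaluated in each elimination round.

    Each round drops the lowest-weighted third of the remaining features
    (at least one), until none are left.
    """
    rounds: List[List[str]] = []
    remaining = list(features)
    while remaining:
        rounds.append(list(remaining))
        weights = weigh(remaining)
        n_drop = max(1, len(remaining) // 3)
        ranked = sorted(remaining, key=lambda f: weights[f], reverse=True)
        remaining = ranked[: len(ranked) - n_drop]
    return rounds
```

In a full pipeline, the round with the best cross-validated ROC-AUC would identify the feature subset passed on to the final RBF-kernel classifiers.<br />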
[Figure 1: three-layer architecture. Layer 1: AGO vs noAGO. Layer 2: AGO1 vs otherAGO, AGO2 vs otherAGO, …, AGO10 vs otherAGO. Layer 3: final AGO prediction.]<br />
FIGURE 1. Proposed architecture for the SVM-based framework.<br />
RESULTS & DISCUSSION<br />
Although the classifiers are still being optimized,<br />
preliminary results from the second layer of the framework<br />
(see Figure 1) show that the features ranked highest by SVM-RFE<br />
do reflect biologically significant patterns of<br />
AGO-sRNA association. Among others, the relevance of<br />
the 5’ terminal nucleotide was observed, in agreement<br />
with previous findings (Mi et al., 2008).<br />
Additionally, the accuracy of the trained models<br />
ranges from 71% to 86%, showing their<br />
capacity to recognize specific AGO-binding patterns.<br />
REFERENCES<br />
Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn<br />
46: 389-422 (2002).<br />
Mi S et al. Sorting of small RNAs into Arabidopsis argonaute complexes is directed by the<br />
5’ terminal nucleotide. Cell 133(1): 116-127 (2008).<br />
Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5:<br />
413, 209-215 (2014).<br />