bbc 2015


BeNeLux Bioinformatics Conference – Antwerp, December 7-8 2015
Abstract ID: P Poster
10th Benelux Bioinformatics Conference bbc 2015

P36. LEVERAGING AGO-SRNA AFFINITY TO IMPROVE IN SILICO SRNA DETECTION AND CLASSIFICATION IN PLANTS

Lionel Morgado 1* & Frank Johannes 2,3.
Groningen Bioinformatics Centre (GBiC), University of Groningen 1; Department of Plant Sciences, Center of Life and Food Sciences Weihenstephan, Technical University Munich 2; Institute of Advanced Studies, Technical University Munich 3.
* lionelmorgado@gmail.com

Small RNAs (sRNA) have an important role in the regulation of gene expression, either through post-transcriptional silencing or through the recruitment of repressive epigenetic marks such as DNA methylation. In plants, the mode of action of a given sRNA is tightly linked to the Argonaute protein (AGO) to which it binds. High-throughput sequencing in combination with immunoprecipitation techniques has made it possible to determine the sequences of sRNA that are bound to different families of AGO. Here we apply Support Vector Machines (SVM) to recent AGO-sRNA sequencing data of A. thaliana to learn which sRNA sequence features govern their differential association with certain AGOs. Our SVM classifiers show good sensitivity and specificity and provide a framework for accurate in silico sRNA detection and classification in plants.

INTRODUCTION

Small RNA molecules are known to have an important role in the control of gene expression. It is therefore of great interest to be able to detect them and to determine the regulatory pathways in which they are involved. With current laboratory methods it is infeasible to test the large number of sRNA candidates, but computational methods can greatly narrow down the list. Nevertheless, sRNA activity is still far from fully understood, and this is reflected in the very high false positive rate of the prediction tools currently available.
High-throughput sequencing in combination with immunoprecipitation (IP) techniques now makes it possible to obtain the sRNA sequences associated with specific AGOs. AGO-sRNA binding is a fundamental step in the activation of specific silencing pathways. Here, AGO-sRNA data acquired from A. thaliana are explored with SVM-based algorithms to learn which sequence features drive different AGO-sRNA associations. Using this knowledge, a framework for in silico sRNA detection and classification in plants is presented.

METHODS

A system with 3 layers of classifiers (see Figure 1) was designed to identify different kinds of sRNA. The 1st layer includes a binary SVM model that filters out sequences that do not bind to AGO and are therefore most probably inactive. The 2nd layer is composed of an ensemble of binary classifiers, each trained to capture the differences between sRNA bound to a specific AGO and sRNA bound to all others. Finally, the 3rd layer comprises a multiclass linear model that assigns the best-matching AGO to a given sRNA, using the scores produced in the previous layer.

Diverse AGO-sRNA libraries from A. thaliana were explored, namely from AGO 1, 2, 4, 5, 6, 7, 9 and 10. After the typical RNA-seq library preprocessing, quality check and genome mapping, several features were extracted from the remaining sequences, namely: position-specific base composition, sequence length, k-mer composition and entropy scores. The different feature sets were explored separately and in different combinations.

Initially, highly correlated features (Pearson correlation > 0.75) were removed, and the remaining ones were further subjected to selection using SVM-RFE (Guyon et al., 2002) with a linear kernel to handle the large data set size. A 10-fold cross-validation procedure was used to account for the variation in the data; the best features of each round were those with the highest average weight across the models with the best ROC-AUC score in each cross-validation subset.
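As an illustration of the feature extraction and correlation filtering described in this section, the following Python sketch computes sequence length, an entropy score, position-specific base composition and k-mer composition for a single read, and then greedily drops features that are highly correlated with a feature already kept. It is a minimal sketch, not the authors' implementation: the function names, the ACGU alphabet, k = 2, the five encoded 5' positions, the use of Shannon entropy as the entropy score, the absolute-value reading of the 0.75 threshold and the greedy column order are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import product

import numpy as np

BASES = "ACGU"  # assumed alphabet; DNA-style reads are converted in featurize()


def kmer_composition(seq, k=2):
    """Relative frequency of every possible k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(len(seq) - k + 1, 1)
    return [counts[m] / total for m in kmers]


def shannon_entropy(seq):
    """Shannon entropy (bits) of the base distribution of one sequence."""
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in Counter(seq).values())


def position_one_hot(seq, n_positions=5):
    """One-hot encoding of the first n_positions bases (the 5' end)."""
    vec = []
    for i in range(n_positions):
        base = seq[i] if i < len(seq) else None
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def featurize(seq):
    """Concatenate all feature groups into one numeric vector."""
    seq = seq.upper().replace("T", "U")
    return np.array(
        [len(seq), shannon_entropy(seq)]
        + position_one_hot(seq)
        + kmer_composition(seq)
    )


def drop_correlated(X, threshold=0.75):
    """Return indices of columns to keep.

    Constant columns are discarded first (their correlation is undefined).
    A column is then kept only if its absolute Pearson correlation with
    every previously kept column is at most `threshold`.
    """
    X = np.asarray(X, dtype=float)
    candidates = [j for j in range(X.shape[1]) if X[:, j].std() > 0]
    if len(candidates) < 2:
        return candidates
    corr = np.corrcoef(X[:, candidates], rowvar=False)
    kept = []  # positions within `candidates`
    for a in range(len(candidates)):
        if all(abs(corr[a, b]) <= threshold for b in kept):
            kept.append(a)
    return [candidates[a] for a in kept]
```

With these assumed settings, `featurize` returns a vector of 38 values per read (2 + 5 × 4 + 16), and `drop_correlated` is applied to the matrix obtained by stacking those vectors row by row.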
Each round, the worst-performing third of the remaining features was eliminated, and the process was repeated until no features remained. The best features found were then used to train the final classifiers using RBF kernels with optimal parameters. This was repeated for all models in layers 1 and 2.

FIGURE 1. Proposed architecture for the SVM-based framework. Layer 1: AGO vs noAGO. Layer 2: AGO1 vs otherAGO, AGO2 vs otherAGO, …, AGO10 vs otherAGO. Layer 3: final AGO prediction.

RESULTS & DISCUSSION

Although the classifiers are still being optimized, preliminary results from the 2nd layer of the framework (see Figure 1) show that the features ranked highest by SVM-RFE do reflect biologically significant patterns of AGO-sRNA association. Among others, the relevance of the 5' terminal nucleotide was observed, in agreement with findings from previous work (Mi et al., 2008). Additionally, the accuracies of the trained models range from 71% to 86%, showing their capacity to recognize specific AGO-binding patterns.

REFERENCES

Guyon I et al. Gene selection for cancer classification using support vector machines. Mach Learn 46: 389-422 (2002).
Mi S et al. Sorting of small RNAs into Arabidopsis Argonaute complexes is directed by the 5' terminal nucleotide. Cell 133(1): 116-27 (2008).
Zhou A & Pawlowski WP. Regulation of meiotic gene expression in plants. Front Plant Sci 5: 413, 209-215 (2014).

P37. ANALYSIS OF RELATIONSHIP PATTERNS IN UNASSIGNED MS/MS SPECTRA

Aida Mrzic 1,2*, Wout Bittremieux 1,2, Trung Nghia Vu 4, Dirk Valkenborg 3,5,6, Bart Goethals 1 & Kris Laukens 1,2.
Advanced Database Research and Modeling (ADReM), University of Antwerp 1; Biomedical informatics research center Antwerpen (biomina) 2; Flemish Institute for Technological Research (VITO), Mol 3; Karolinska Institutet, Stockholm 4; CFP, University of Antwerp 5; I-BioStat, Hasselt University 6.
* aida.mrzic@uantwerpen.be

Tandem mass spectrometry (MS/MS) spectra generated in proteomics experiments often contain a large portion of unexplained peaks, despite continuous improvements in search engines. Here we use a pattern mining technique to determine the origin of these unassigned spectra. We discover patterns that indicate the presence of chimeric spectra and missed post-translational modifications (PTMs).

INTRODUCTION

Despite being a rich source of information, mass spectra acquired in mass spectrometry proteomics experiments often contain a significant number of unexplained peaks, or even remain completely unidentified. The unexplained fraction of mass spectra may come from low-quality or chimeric MS/MS spectra, or from unexpected PTMs. To interpret the unexplained data, we propose a structured analysis of the peaks occurring in MS/MS spectra. We employ an unsupervised pattern mining technique (Naulaerts et al., 2015) to discover which peaks are associated with each other and are therefore likely to have a common origin.

METHODS

Frequent itemset mining

The technique we used to discover relationships between frequently co-occurring peaks in MS/MS data is frequent itemset mining, a class of data mining techniques specifically designed to discover co-occurring items in transactional datasets.
The typical example of frequent itemset mining is the discovery of sets of products that are frequently bought together. Here, every set of products purchased together represents a single transaction, which results in a dataset consisting of a large number of supermarket basket transactions that can be mined for frequent patterns (Figure 1). In our approach, a transaction consists of the mass differences between relevant peaks in the MS/MS spectrum.

FIGURE 1. Frequent itemset mining principle.

Mass difference associations

In order to detect relationships between different types of mass spectrometry peaks, a distinction is made between peaks that were relevant for spectrum identification (assigned peaks) and peaks that were not used for the identification (unassigned peaks) (Vu et al., 2014). The mass differences between peaks (either assigned, unassigned, or both) are then calculated, so that for each MS/MS spectrum in the dataset there is a single transaction consisting of all its mass differences. After obtaining these transactions for all MS/MS spectra in the dataset, frequent itemset mining can be employed to detect relationship patterns (Figure 2). These patterns can indicate previously unknown characteristics of the spectra, or even detect novel PTMs.

FIGURE 2. Outline of the approach.

RESULTS & DISCUSSION

In order to evaluate our approach, we used MS/MS datasets from the PRoteomics IDEntifications (PRIDE) database (Vizcaino et al., 2013). This database contains a large number of publicly available datasets from mass-spectrometry-based proteomics experiments. However, the quality of the submitted datasets varies widely, which makes the database a suitable candidate for our pattern mining approach. Preliminary results show that the detected patterns are able to capture valid information in a spectrum. The obtained patterns indicate peaks originating from the same peptide in the case of chimeric spectra, and mass differences originating from common PTMs.
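The transaction construction and mining step described in the Methods can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool used by the authors: the function names, the rounding of mass differences to one decimal place so that equal differences compare equal, the use of all peak pairs in a spectrum, and the simple level-wise (Apriori-style) search are choices made here for clarity, and the peak values are toy numbers without chemical meaning.

```python
from collections import Counter
from itertools import combinations


def mass_difference_transaction(peaks, decimals=1):
    """Turn one spectrum (a list of peak m/z values) into a transaction:
    the set of all pairwise mass differences, rounded to `decimals`."""
    mzs = sorted(peaks)
    return {round(b - a, decimals) for a, b in combinations(mzs, 2)}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Level-wise search for itemsets present in >= min_support transactions.

    Returns a dict mapping each frequent itemset (a sorted tuple) to the
    number of transactions that contain it.
    """
    counts = Counter(item for t in transactions for item in t)
    frequent = {(i,): c for i, c in counts.items() if c >= min_support}
    result = dict(frequent)
    size = 1
    while frequent and size < max_size:
        size += 1
        # Only items that are still part of a frequent itemset can extend one.
        items = sorted({i for itemset in frequent for i in itemset})
        counts = Counter()
        for t in transactions:
            present = [i for i in items if i in t]
            for cand in combinations(present, size):
                # Apriori pruning: every smaller subset must be frequent.
                if all(s in frequent for s in combinations(cand, size - 1)):
                    counts[cand] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
    return result


# Toy spectra: hypothetical peak lists, one list per MS/MS spectrum.
spectra = [
    [100.0, 116.0, 180.0],
    [200.0, 216.0, 280.0],
    [300.0, 316.0, 350.0],
]
transactions = [mass_difference_transaction(s) for s in spectra]
patterns = frequent_itemsets(transactions, min_support=2)
```

In this toy example a frequent pair such as `(16.0, 80.0)` means that both mass differences occur together in at least two spectra. On real data the rounding step would need to be replaced by a tolerance suited to the instrument's mass accuracy.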
REFERENCES

Naulaerts et al. Brief Bioinform, 16(2): 216–231 (2015).
Vizcaino et al. Nucleic Acids Res, 41(D1): D1063-9 (2013).
Vu et al. Proteome Science, 12: 54 (2014).