Algorithmica and molecular biologyThe Pisan experienceFabrizio LuccioGlimpses into the world born from the encounterbetween the machines for sequencing DNAfragments, and computers that assembly thosefragments.
The Department groupMaria Federico (firstname.lastname@example.org)Claudio Felicioli (email@example.com) *Paolo Ferragina * (firstname.lastname@example.org)Roberto Grossi (email@example.com)Fabrizio Luccio * (firstname.lastname@example.org)Roberto Marangoni * (email@example.com)Nadia Pisanti * (firstname.lastname@example.org)* The boss (at least, the who knows everything)* Reference person (she made most of the work)* Trying to escape* gmail: why? (probably paid by Google)* ME! (parasite, but early group initiator)
Glimpses into the world etc ….Algorithms are the winning tool.Sorry…. good algorithms are the winning tool,especially when dealing with very largedata.
Inefficient algorithms….…. have the unpleasant property of resisting tohardware improvement:A polynomial-time algorithm solves a problem on n data in time t 1= c n sAn exponential-time algorithm solves a problem on n data in time t 2= c s nwith c, s constantsWith a computer k times faster, and same running time, we process N > ndata, according to the laws:t 1= c n s , k t 1= c N s N = k 1/s nt 2= c s n , k t 2= c s N k s n = s NN = n + log sk
Publications on sequence algorithmsMercatanti A., Rainaldi G., Mariani L., Marangoni R., Citti L. A method forprediction of accessible sites on an mRNA sequence for target selection ofhammehead ribozymes. J. Computational Biology, 4(9) 641-653, 2002Menconi G., Marangoni R. A compression-based approach for codingsequences identification in prokaryotic genomes, J. Computational Biology (toappear)Corsi C., Ferragina P., Marangoni R. The bioPrompt-box: an ontology-basedclustering tool for searching in biological databases. . BMC bioinformatics (toappear)Cozza A., Morandin F., Galfrè S.G., Mariotti V., Marangoni R., Pellegrini S.TAMGeS: a Three-Array Method for Genotyping of SNPs by a dual-colorapproach. BMC genomics (to appear)Felicioli C., Marangoni R. BpMatch: an efficient algorithm for segmentingsequences, calculating genomic distance and counting repeats, , (submitted)Ferragina P. String search in external memory: algorithms and datastructures. Handbook of Computational Molecular Biology, CRC Press, 2005
Publications on motifsN. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot. A Comparative Study ofBases for Motif Inference. NATO Series on String Algorithmics, 2004.N. Pisanti, M. Crochemore, R. Grossi, M.-F. Sagot. Bases of Motifs forGenerating Repeated Patterns with Wild Cards. IEEE/ACM Transactionson Computational Biology and Bioinformatics 2(1) 40-50, 2005.C.S.Iliopoulos, J.McHugh, P.Peterlongo, N.Pisanti, W.Rytter, M.Sagot. A firstapproach to finding common motifs with gaps, International Journal ofFoundations of Computer Science 16(6) 1145--1155, 2005.N. Pisanti, H. Soldano, M. Carpentier, J. Pothier. Implicit and ExplicitRepresentation of Approximated Motifs. In: Algorithms for Bioinformatics,C. Iliopoulos et al, editors, King's College London Press, 2006.P.Peterlongo, N.Pisanti, F.Boyer, A.Pereira do Lago, M.-F.Sagot. Losslessfilter for multiple repetitions with Hamming Distance. Journal of DiscreteAlgorithms 2007 (to appear).
Major collaborations on motifsLyon (group of Marie-France Sagot)Grenoble (group of Alan Vieri)Paris (group of Henri Soldano)Marne-la-Valle (group of Maxime Crochemore)
Paralogy tree construction……. via transformation distancePisanti N., Marangoni R., Ferragina P., Frangioni A., Savona A.,Pisanelli C., Luccio F. PaTre: A Method for Paralogy TreesConstruction. J. Computational Biology, 5 (10) 791-802, 2003How does the genomic information increase?external imports- Transfections- Horizontal transferEndogenous mechanisms(genic or genomic) duplications:Large scaleTandemDispersedSingle gene
The fate of the copyNon-functional: pseudogeneFunctional: paraloggenome as a set of families of paralogsPARALOGY TREE How does the genome choose the paralog to duplicatewithin a family? Is the duplication rate constant among the variousfamilies? Are sparse duplications correlated to sparse deletions?
Couple-comparison methodTransformation Distance (TD)Often newest genes are the shortest onesTo insert sequences imply paying metabolic costs. To deletesequences has no metabolic costWe need an asymmetric distance:TD(S,T) = the cost of the minimum-length script able totransform S into TElementary operations : Insertion, Copy, Inverted copy
TD: an examplef g hS=ATCGATCAGCTGCCCAATGAATCAGATAAAGTTTC1ÉÉÉÉÉ.ÉÉ11ÉÉ.....16ÉÉÉÉÉÉ..25ÉÉÉÉÉÉÉ35f g hT=ATCGATCAGCTTTCACTACGAATGAATCAGATTGGTAGCTTTGAAATAG1ÉÉÉÉÉÉÉ..11ÉÉÉÉÉÉ...21ÉÉÉÉÉÉÉÉÉÉÉ.ÉÉÉ38ÉÉÉÉÉÉÉ48Script transforming S into TDescription1) copy f copy (1, 1, 11)2) insertion of TTCACTACG insert (TTCACTACG)3) copy g copy (16,21,12)4) insertion of TGGTAGC insert (TGGTAGC)5) inverted copy of h copy (25,38,11,1)
PaTreInput: TD values for each possible couple made by thegenes of the familyBuilding-up of the directed graph of distancesEdmonds’ algorithm: extraction of the LSA (LightestSpanning Arborescence) optimal paralogy treeGeneration of optimal and sub-optimals (space ofquasi-optimal solutions)
PaTre has been tested by simulation…because there are no experimental dataon the history of families of genes
str11MFINFRPstr05str06Cost: 7840 - Distance: 0%MFINFRP104411871035str01 str02 str03757955str04str05str08str07str02505str07394str08704str06526str09428str11str10305str10str09str03str01str04The paralogy tree reconstructedby ClustalW for the Ribosomalproteins genic family ofM. pneumoniae
After having tested PaTre on many examples,we could conclude that PaTre is able tocorrectly reconstruct the simulated history ofgenetic families, , while ClustalW and othersimilarity based methods fail.