Quick review 2012-2013 - LIFL
<strong>Quick</strong> <strong>review</strong> <strong>2012</strong>-<strong>2013</strong><br />
Laurent<br />
<strong>LIFL</strong>, Université Lille 1 - INRIA<br />
Off-site days (Journées au vert)<br />
June 11 and 12, <strong>2013</strong><br />
Laurent Year <strong>2012</strong>-<strong>2013</strong>
But first ...
<strong>Quick</strong> <strong>review</strong> 2011-<strong>2012</strong><br />
Laurent<br />
<strong>LIFL</strong>, Université Lille 1 - INRIA<br />
Off-site days (Journées au vert)<br />
June 13 and 14, <strong>2012</strong><br />
Laurent Year 2011-<strong>2012</strong>
Last year I ended with:<br />
Thank you for the many PJIs supervised this year!!<br />
do not hesitate to propose even more next year :-)<br />
So I can say it again, now adding:<br />
Thank you to this year's permanent and occasional session chairs!!<br />
do not hesitate to take on even more (module) next year :-)
Friday, June 1<br />
Date Time Room # Title Author Supervisor Student1 Student2<br />
<strong>2012</strong>-06-01 08h00 M5-A7 124<br />
<strong>2012</strong>-06-01 08h30 M5-A7 71<br />
<strong>2012</strong>-06-01 09h00 M5-A7 96 Intégration d'un mécanisme de récompenses pour des Nicolas Haderer Romain Rouvoy Nacim Hamdad Benjamin Bertein<br />
<strong>2012</strong>-06-01 09h30 M5-A7 47 Conception d'une interface Web pour la visualisation Adel Boris Couturier<br />
Noureddine Romain Rouvoy Jonathan Decrocq<br />
<strong>2012</strong>-06-01 10h45 M5-A7 93 [AGIL-IT] Site de suivi des demandes de l'accueil RH Lionel Drain<br />
<strong>2012</strong>-06-01 11h15 M5-A7 92 Régis Servant<br />
[AGIL-IT] [http://www.mobitic.fr/] Développement et Patricia Plénacoste Loïc Daara Sébastien Poulmane<br />
<strong>2012</strong>-06-01 11h45 M5-A7 50 Vers un campus ubiquitaire et social Yvan Peter Yvan Peter<br />
<strong>2012</strong>-06-01 08h00 M5-A8 33 Classification des Historiques de Ventes en grande<br />
<strong>2012</strong>-06-01 08h30 M5-A8 45 Application web de consultations de données Jean-Christophe Routier Jean-Christophe Routier<br />
<strong>2012</strong>-06-01 09h00 M5-A8 128 Interface de saisie et de restitution de données Jean-Christophe Routier Jean-Christophe Routier Julien Milan<br />
<strong>2012</strong>-06-01 09h30 M5-A8 88 Application web de gestion du flux des achats Bruno Bogaert Bruno Bogaert Benjamin Jacquet<br />
<strong>2012</strong>-06-01 10h45 M5-A8 63 Souris 3D<br />
<strong>2012</strong>-06-01 11h15 M5-A8 1 Suivi multi-flux d'objets mobiles Chabane Djeraba Chabane Djeraba Alexandre Mandy<br />
<strong>2012</strong>-06-01 11h45 M5-A8 109 Comparaisons de séquences musicales symboliques Mathieu Giraud Mathieu Giraud Corentin Bertiaux Anthony Lerouge<br />
<strong>2012</strong>-06-01 14h00 M5-A7 86 Annotation de génomes Sylvain Denis<br />
Hélène Touzet Mikael Salson Pauline Wauquier<br />
<strong>2012</strong>-06-01 14h30 M5-A7 112 Mise en place d'une base de données des élus et mMikaël Salson Mikaël Salson Goulven Rozec Patience Ngami-Nana<br />
<strong>2012</strong>-06-01 15h00 M5-A7 66 distrSégolène Caboche Eric Piette<br />
Développement d'un outil de visualisation de la Mikael Salson Luigi Palmiero<br />
<strong>2012</strong>-06-01 15h30 M5-A7 52<br />
<strong>2012</strong>-06-01 16h45 M5-A7 11 Recherche dans des millions de courtes séquencesMikaël Salson<br />
Mikaël Salson<br />
Florian Recourt<br />
<strong>2012</strong>-06-01 17h15 M5-A7 94 [AGIL-IT] Création d’un site internet de promotion de Julien Bliart<br />
Laurent Noé<br />
<strong>2012</strong>-06-01 17h45 M5-A7 39<br />
<strong>2012</strong>-06-01 14h00 M5-A9 104<br />
MINY - Multimodality Is Nice for You! Xavier Le Pallec Xavier Le Pallec Alain Laraki<br />
<strong>2012</strong>-06-01 14h30 M5-A9 102 Clement Dufour<br />
<strong>2012</strong>-06-01 15h00 M5-A9 60 Jean Martinet<br />
Kinect et ZCam, le face à face Amel Aissaoui Ramy Arbid Antoni Pauchet<br />
<strong>2012</strong>-06-01 15h30 M5-A9 41 Extraction d'information de Twitter Ali Abbas<br />
Luigi Lancieri Eric Lepretre David Deroo<br />
desSamuel Blanquart Samuel Blanquart Benoît-Charles Detuncq<br />
<strong>2012</strong>-06-01 16h45 M5-A9 35 Développement d'une interface de visualisation Yannick Leroy<br />
<strong>2012</strong>-06-01 17h15 M5-A9 37 Développement de sites déployables pour la gestion Samuel Blanquart Samuel Blanquart Ismael Souissi Samuel Queniart<br />
<strong>2012</strong>-06-01 17h45 M5-A9 98<br />
Implémentation d'un jeu de tir à l'arc avec Kinect Thomas Pietrzak Thomas Pietrzak Guillaume Devos Joffrey Hochart<br />
Visage en relief avec Zcam : application à la détectiAfifa Dahmane Afifa Dahmane Bahare Shirazi<br />
Patricia Plénacoste<br />
Emmanuel Ardiot<br />
Simon Debaecke<br />
Christophe Leemans<br />
Sylvain Mongy Jean-Stéphane Varré Benjamin Fisset Taha Touati<br />
Kouami-Aderibgbe Adekambi<br />
Marc Duez<br />
Matthieu Calmels<br />
Cédric Montay<br />
Géry Casiez Géry Casiez Kevin Pollaert Axel Delahaye<br />
Alecsia : Aide à L'Evaluation et à la Correction SemMikaël Salson Mikaël Salson Ludovic Loridan<br />
Tony Proum<br />
Développement d’une base de données de glycosylAnne Harduin-Lepers Olgo Plechakova & Maria-CeciliaAnthony Tonglet Antoine Baluzolanga-Kiatoko<br />
Paint collaboratif par Smartphones Xavier Le Pallec Xavier Le Pallec Maxime Raverdy Damien Level<br />
Développement avec un framework PHP pour aiderVincent Vatelot Gilles Vanwormhoudt Adel Ben-Elkrizi Mamadou Cellou Dara Diallo
Monday, June 4<br />
Date Time Room # Title Author Supervisor Student1 Student2<br />
<strong>2012</strong>-06-04 08h00 M5-A7 90 [ATOS-IT] MOW (groupe 1 : étudiants 1 et 2)<br />
Lionel Seinturier Lionel Seinturier Gaetan Mallants<br />
<strong>2012</strong>-06-04 08h30 M5-A7 91 [ATOS-IT] MOW (groupe 1 : étudiants 3 et 4) Manuel Servais<br />
<strong>2012</strong>-06-04 09h00 M5-A7 121 Client WindowsPhone pour plate-forme d'échange de Pierre Kopaczewski Lionel Seinturier Matthias Mellouli Louis Dekeister<br />
<strong>2012</strong>-06-04 09h30 M5-A7 80 Cloud Security - Simulations d'attaques distribuées Damien Riquet<br />
Gilles Grimaud Ludovic Moreau<br />
<strong>2012</strong>-06-04 10h45 M5-A7 2 Recherche de sous-graphes dans un graphe, appliquée Laurent Noé<br />
Maude Pupin<br />
<strong>2012</strong>-06-04 11h15 M5-A7 22 Développement d'un outil de dessin facilitant la créat Valérie Leclère<br />
Maude Pupin<br />
<strong>2012</strong>-06-04 11h45 M5-A7 34 Implémentation d’une solution BI et analyse technolSylvain Maude Pupin<br />
Mongy Adil Ayar Morgan Auchede<br />
<strong>2012</strong>-06-04 08h00 M5-A8 75<br />
<strong>2012</strong>-06-04 08h30 M5-A8 72 bas Maria Cecilia Arias<br />
Développement d’interfaces graphiques pour une Olga Plechakova Gorgui-Djire Ndong Naby Gueye<br />
<strong>2012</strong>-06-04 09h00 M5-A8 18 Base de données Intranet du matériel biologique d'une Christophe Remi Duriez<br />
D'Hulst Olga Plechakova Benjamin Bellangeon<br />
<strong>2012</strong>-06-04 09h30 M5-A8 126 intégrat Jean-Frédéric Berthelot Jean-Frédéric Berthelot<br />
Développement d’extensions pour CMS pour Mickael Lemaitre Jerome Deboffles<br />
<strong>2012</strong>-06-04 10h45 M5-A8 127 permettant Jean-Frédéric Berthelot Jean-Frédéric Berthelot<br />
Développement d’un plugin pour digiKam Iliya Ivanov Nathan Damie<br />
<strong>2012</strong>-06-04 11h15 M5-A8 81<br />
<strong>2012</strong>-06-04 11h45 M5-A8 82<br />
<strong>2012</strong>-06-04 14h00 M5-A7 21 Algorithmes de recherche locale pour l’optimisation Arnaud Liefooghe Arnaud Liefooghe Yoann Dufresne<br />
<strong>2012</strong>-06-04 14h30 M5-A7 119<br />
<strong>2012</strong>-06-04 15h00 M5-A7 7 Site Web 2.0 communautaire d'appariement<br />
<strong>2012</strong>-06-04 16h15 M5-A7 62<br />
<strong>2012</strong>-06-04 16h45 M5-A7 55 Module de construction moléculaire 3D Fabrice Aubert<br />
Sébastien Canneaux Nadir Cherifi Mohamed El-Amrani<br />
<strong>2012</strong>-06-04 17h15 M5-A7 69 Fabrice Aubert Fabrice Aubert<br />
Lecteur web de vidéos 360 et WebSocket Nathanaël Deboeuf Pierre Denquin<br />
<strong>2012</strong>-06-04 17h45 M5-A7 95 Pierre-Hubert Olivier<br />
[SCOTLER] Scotler C&C WA Céline Kuttler Jamal-Dine Youlhajen<br />
<strong>2012</strong>-06-04 14h00 M5-A9 144 [ALTERNANT] étudiant:Kévin Labat & entreprise:Audax Vincent Cordonnier Maude Pupin<br />
<strong>2012</strong>-06-04 14h30 M5-A9 145 [ALTERNANT] étudiant:Valentin Lecerf & entreprise:Laurent Vansuypeene Pierre Boulet<br />
<strong>2012</strong>-06-04 15h00 M5-A9 136 [ALTERNANT] étudiant:Nicolas Cousin & entreprise: Céline Bilasco<br />
Céline Kuttler<br />
Nicolas Cousin<br />
<strong>2012</strong>-06-04 16h15 M5-A9 146 [ALTERNANT] étudiant:Florian Ledoux & entreprise: Nicolas Ruff<br />
Sophie Tison Florian Ledoux<br />
<strong>2012</strong>-06-04 16h45 M5-A9 147 [ALTERNANT] étudiant:Benoit Petit & entreprise:Quadr Sébastien Lucas Laetitia Jourdan Benoit Petit<br />
<strong>2012</strong>-06-04 17h15 M5-A9 140 [ALTERNANT] étudiant:Guillaume Gallant & entreprFrançois Pasquereau Laetitia Jourdan<br />
Lionel Seinturier Lionel Seinturier Nicolas Crappe Alexandre Dubus<br />
Evelyne Ferot<br />
Remi Degruson<br />
Clément Pasek<br />
Développement d'un outil interactif de contrôle de la Valérie Leclère Olga Plechakova Chaste Isabane Anais Ngo-Xuan-Coi<br />
Implementation of a web server for protein function Marc Lensink Guillaume Brysbaert Qiang Liu Roshanak Gharagozlou<br />
Development of an XML language for protein function Marc Lensink Guillaume Brysbaert Doga Ozturk<br />
Optimisation de ressources dans les "clouds" GooglFrançois Clautiaux Arnaud Liefooghe Charle-Edmond Bihr<br />
Maxime Morge Maxime Morge Valentine Maillart Tristan Bourgois<br />
Middleware DDS en environnement métro Christophe Gransart Christophe Gransart Seilendria Hadiwardoyo<br />
Kévin Labat<br />
Valentin Lecerf<br />
Guillaume Gallant
Tuesday, June 5<br />
Date Time Room # Title Author Supervisor Student1 Student2<br />
<strong>2012</strong>-06-05 08h00 M5-A9 122 Rubik Francesco Eric Wegrzynowski Oscar Gest<br />
Lego (A) De Comite<br />
Geoffrey Verhille<br />
<strong>2012</strong>-06-05 08h30 M5-A9 125 Rubik Lego (B) Francesco De Comite Eric Wegrzynowski<br />
Guillaume Macke <strong>2012</strong>-06-05 09h00 M5-A9 24 Kinect Jean-Claude Tarby Jean-Claude Tarby Cyprien Cuvillier<br />
Applications Jean-Claude Tarby Jean-Claude Tarby Claude Saint-Georges<br />
<strong>2012</strong>-06-05 09h30 M5-A9 58 pilotées par le cerveau<br />
<strong>2012</strong>-06-05 10h45 M5-A9 77 Migration d’une base de données relationnelles en bas Céline Anas El-Achiqi<br />
Bilasco Marius Bilasco<br />
Marius Bilasco Marius Bilasco<br />
Warren Moreau<br />
Web of Metadata <strong>2012</strong>-06-05 11h15 M5-A9 123<br />
<strong>2012</strong>-06-05 11h45 M5-A9 130 Reprise d'une application pour la génération du WHO Marius Frederic Bellano Bilasco Marius Bilasco Larbi Noufli<br />
[ALTERNANT] étudiant:Kévin Defives & entreprise:ORomain Lahoche<br />
Sophie Tison<br />
Kévin Defives<br />
<strong>2012</strong>-06-05 09h00 M5-A8 139<br />
entreprise: Jean-Jacques Decrucq Mikaël Salson<br />
<strong>2012</strong>-06-05 09h30 M5-A8 142 [ALTERNANT] étudiant:Geoffrey Hecht & Geoffrey Hecht<br />
<strong>2012</strong>-06-05 10h15 M5-A8 134 [ALTERNANT] étudiant:Christopher Coat & entreprisJonathan Christopher Coat<br />
Carpentier Patricia Plenacoste<br />
[ALTERNANT] étudiant:Guillaume Dauster & entrepr Jonathan Alexandre Sedoglavic Guillaume Dauster<br />
<strong>2012</strong>-06-05 11h15 M5-A8 137 Carpentier<br />
[ALTERNANT] étudiant:Vincent Herbulot & entreprisDesrumaux Romain Rouvoy Vincent Herbulot<br />
<strong>2012</strong>-06-05 11h45 M5-A8 143<br />
Sabine<br />
[ALTERNANT] étudiant:Henri Roussez & entreprise:Bertrand Hudzia Philippe Marquet Henri Roussez<br />
<strong>2012</strong>-06-05 14h00 M5-A8 148<br />
[ALTERNANT] étudiant:Amaury David & entreprise:OLaurent Decool<br />
Philippe Marquet<br />
Amaury David<br />
<strong>2012</strong>-06-05 14h30 M5-A8 138<br />
<strong>2012</strong>-06-05 15h30 M5-A8 135<br />
Maxime Colmant<br />
[ALTERNANT] étudiant:Maxime Colmant & entreprisRalida Azzi Xavier Le Pallec<br />
[ALTERNANT] étudiant:Kevin Guilbert & entreprise:Yann Marzack Xavier Le Pallec Kevin Guilbert<br />
<strong>2012</strong>-06-05 16h15 M5-A8 141<br />
Chair: Xavier le Pallec / Laetitia Jourdan<br />
Chair: Fabrice Aubert<br />
Chair: Laurent Noé<br />
Chair: Mikael Salson
A few points on “administrative matters”<br />
1. Cybersecurity bachelor's degree (Licence Cybersécurité): <strong>2012</strong>-<strong>2013</strong> for <strong>2013</strong>-2014 ...<br />
2. UPMC: McF<br />
3. PJI, my friends :-)<br />
4. Reviews (nothing but seeds this year, or nearly ...)
A few points on research<br />
1. Mappi (mid-term slides + see Jenya's talk)<br />
2. Seeds and Product (draft)<br />
3. Peptide Matching → Yoann Dufresne's internship
MAPPI<br />
5 tasks, including those I will describe in the Lille context<br />
• Task 1: New index structures for approximate pattern matching<br />
• Task 2: Mapping for metagenomics and metatranscriptomics<br />
• Task 3: Assembly tools for NGS<br />
• Task 4: Guided assembly of metatranscriptomic data<br />
• Task 5: Bioinformatics pipeline
Task 1: New index structures for approximate pattern matching<br />
Context: Read Mapping<br />
Done:<br />
1. Port of the Wu-Manber algorithm to GPU<br />
[Bit-Parallel Multiple Pattern Matching. T. T. Tran, M. Giraud, J.-S. Varré. PPAM / PBC 2011.]<br />
2. Indexing of k-mer neighborhoods<br />
Goal: take advantage of GPU/CPU cache efficiency<br />
Two indexing methods considered:<br />
• direct indexing (sort the words → binary search)<br />
• perfect hashing<br />
+ not yet published, but some results:<br />
• OpenCL implementation (working on CPU and GPU)<br />
• performance gain between x10 and x60<br />
• read mapper prototype in progress<br />
[<strong>LIFL</strong>] Tuan Tu Tran, Mathieu Giraud, Jean-Stéphane Varré<br />
[LIAFA] Djamal Belazzougui, Mathieu Raffinot
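The "direct indexing" option listed on this slide (sort the words, then search by dichotomy) can be illustrated with a minimal Python sketch. It is only an illustration on exact k-mer lookups: the function names `build_kmer_index` and `lookup` are hypothetical, and the GPU/OpenCL implementation and the neighborhood indexing mentioned on the slide are not reproduced here.

```python
from bisect import bisect_left, bisect_right


def build_kmer_index(text, k):
    """Return the sorted list of (k-mer, position) pairs of `text`."""
    return sorted((text[p:p + k], p) for p in range(len(text) - k + 1))


def lookup(index, kmer):
    """Return all positions of `kmer`, located by binary search in the sorted index."""
    # (kmer, -1) sorts before every real entry for this k-mer,
    # (kmer, +inf) sorts after every real entry for this k-mer.
    lo = bisect_left(index, (kmer, -1))
    hi = bisect_right(index, (kmer, float("inf")))
    return [pos for _, pos in index[lo:hi]]


index = build_kmer_index("ACGTACGA", 3)
print(lookup(index, "ACG"))  # positions where ACG occurs
```

A sorted array keeps all occurrences of a word contiguous in memory, which is one plausible reason such a layout is cache-friendly; the slide itself does not detail this.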
[Figure: rRNA secondary-structure diagram (5' to 3', positions 100 to 1800), with a legend binning positions by information content (bits)]<br />
Task 4: Guided assembly of metatranscriptomic data<br />
Context: identification of ribosomal RNAs (16S/18S, 23S/28S...)<br />
Goals:<br />
• removal<br />
• classification<br />
A new problem on metatranscriptomic data<br />
created by the SSU-ALIGN package (http://eddylab.org/software.html)<br />
structure diagram derived from CRW database (http://www.rna.ccbb.utexas.edu/)
Task 4: Guided assembly of metatranscriptomic data<br />
Context: identification of ribosomal RNAs<br />
Done:<br />
• design of an efficient filter for selecting rRNA families (SortMeRNA)<br />
• work based on the Burst Trie and the Levenshtein automaton<br />
[Figure: burst trie over the alphabet A, C, G, U storing the example words [1] UGAG, [2] CAGC, [3] GGUAU, [4] AUCU, [5] GUUU, [6] CACG, [7] UUU, [8] GGCUU, [9] AGGC, annotated with Levenshtein automaton states]
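To give an intuition of how a trie and an edit-distance computation combine, here is a minimal Python sketch. It is an assumption-laden simplification: a plain dictionary-based trie (not a burst trie) is traversed while one row of the classical edit-distance table is updated per character, which plays the role of the Levenshtein automaton. The names `build_trie` and `search` are hypothetical, and this is not the SortMeRNA implementation.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "$" marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search(trie, query, max_dist):
    """Return sorted (word, distance) pairs with edit distance <= max_dist."""
    results = []

    def walk(node, row):
        # row[c] = edit distance between the current trie prefix and query[:c]
        if "$" in node and row[-1] <= max_dist:
            results.append((node["$"], row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            new_row = [row[0] + 1]
            for c in range(1, len(query) + 1):
                cost = 0 if query[c - 1] == ch else 1
                new_row.append(min(new_row[c - 1] + 1,   # insertion
                                   row[c] + 1,           # deletion
                                   row[c - 1] + cost))   # match / substitution
            # Prune the subtree as soon as no cell can stay within the threshold.
            if min(new_row) <= max_dist:
                walk(child, new_row)

    walk(trie, list(range(len(query) + 1)))
    return sorted(results)


words = ["UGAG", "CAGC", "GGUAU", "AUCU", "GUUU", "CACG", "UUU", "GGCUU", "AGGC"]
print(search(build_trie(words), "GUUU", 1))
```

The pruning test is what makes the traversal a filter: whole subtrees are skipped once every cell of the row exceeds the allowed number of errors.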
Task 4: Guided assembly of metatranscriptomic data<br />
Context: identification of ribosomal RNAs<br />
In progress:<br />
• talk at the London Stringology Days<br />
• publication being prepared for submission<br />
• planned stay at Genoscope for the transition between the end of Task 4 and the start of Task 2, April 10-13<br />
[<strong>LIFL</strong>] Evguenia Kopylova (ANR), Laurent Noé, Hélène Touzet<br />
[GENOSCOPE] Olivier Jaillon
Spaced seed design for precise read-mapping on HMM profiles for<br />
NGS read-mapping<br />
efficient sliding window product on the matrix semi-group<br />
Laurent Noé<br />
May 10, <strong>2012</strong><br />
Abstract<br />
We propose a new method and an associated algorithm to efficiently compute seed sensitivity when<br />
considering that HTS reads are mapped along sub-parts of a known HMM alignment profile. This<br />
computation makes particular sense with positioned spaced seeds. It relies on automata theory<br />
(previous work [KNR06]) combined with a matrix product problem.<br />
Interestingly, it brings to light an “interval product problem” considered more than twenty years<br />
ago in [AS87], but in a “sliding window” form. We propose here an efficient algorithm to compute this<br />
sliding window product using a linear number of products on the (associative, but non-commutative<br />
and non-invertible) matrix semi-group.<br />
This computational scheme is implemented in the ongoing 1.06 version of Iedera<br />
http://bioinfo.lifl.fr/yass/iedera.php.<br />
1 Introduction<br />
Spaced seed design remains an important, but complex and challenging problem. Many papers have been<br />
devoted to this subject (mainly in the last decade), from the mere (but at first unintuitive) idea that such seeds<br />
perform better [CR93, Buh02] and can be optimized [MTL02, BK01], to spaced seed sensitivity<br />
definition and computation [KLMT04], extended seed models and their computation [BBV05, Bro05,<br />
MGB06, CM07, YZ08, II09, KWS+11], and bounds and complexity results [FCLCST05,<br />
NR08, MY09, EM11]. Several software tools are now publicly available to design spaced seeds [SB05, NGK10,<br />
IIMB11] (footnote 1).<br />
High throughput sequencing technologies have thrown new light on the seed design process, mainly<br />
because the reads obtained are relatively short and quality labelled. Some of the most sensitive algorithms<br />
to map such reads onto related genomes use spaced seeds (SHRiMP [RLD+09], ZOOM [LZZ+08],<br />
BFAST [HMN09], PerM [CSC09], LAST [KWS+11], SToRM [NGK10], ...).<br />
But most of the regular seeds designed within these tools are based on the assumption that the mapped<br />
alignment profile remains “unknown”, thus preferring an i.i.d. “randomly” generated profile. There are several<br />
(if not many) cases where this assumption can be removed due to a known profile of what is searched /<br />
filtered out (prior knowledge of the sequences being searched).<br />
We propose in the main part of this paper an extended method to efficiently compute seed sensitivity or<br />
the lossless property when considering that short reads are mapped on substrings of a known HMM alignment<br />
profile. This computation is especially useful when designing positioned spaced seeds. It relies mainly on<br />
dynamic programming on automata, which can be computed by a set of matrix products along overlapping<br />
intervals.<br />
1 Currently, more than one hundred references have been directly related to the spaced seeds problem, see for example<br />
http://www.lifl.fr/~noe/spaced_seeds.html<br />
This “interval product problem” has been considered in [AS87], where the authors provide an efficient solution<br />
in terms of preprocessing, in order to be able to answer any query product with a given constant bound k on the number<br />
of products. We propose here to consider this “interval product problem” with an incremental<br />
aspect, using a form of “sliding window”, and propose an efficient algorithm to compute it using a linear<br />
number of products on the (associative, but non-commutative and non-invertible) matrix semi-group.<br />
In part 2, we briefly recall the seed design principle, focusing on the seed sensitivity computation.<br />
We then state the (matrix) product problem in part 3, and show how it can be solved. Finally, in part<br />
4, we give some measurements on a practical implementation included in the ongoing 1.06 version of Iedera<br />
http://bioinfo.lifl.fr/yass/iedera.php.<br />
2 Seed design process<br />
Spaced seeds are now a frequently used hashing technique for biological sequence analysis. Their implementation<br />
(as direct hashing) is straightforward and brings higher sensitivity for the same theoretical selectivity.<br />
Interestingly, in practice, a slightly reduced computational cost can also be observed when using spaced seeds<br />
compared to contiguous seeds of the same weight.<br />
Spaced seeds have been generalized by several extended seed models (Vector seeds [BBV05], Indel<br />
seeds [MGB06], Subset seeds [KNR06, ZF07, YZ08], Neighbor seeds [CM07]). To increase the overall sensitivity,<br />
they can usually be designed jointly as multiple seeds [YWC+04, SB05], and (on quality labelled<br />
sequences) as positioned seeds [LZZ+08, NGK10].<br />
In addition to the seed model, one needs a selection criterion for good seed shapes: this criterion is<br />
(almost always) established on a model of the alignment being searched (usually a word over a match/mismatch<br />
binary alphabet), itself “weighted” by a probabilistic model. Here again the initially proposed i.i.d. Bernoulli<br />
model [KLMT04] has been extended to Markov models [BKS05] and HMM models [BBV04], with several<br />
extensions [MB07, CP10].<br />
In practice, the criterion considered to select good spaced seed shapes is “the probability to hit at least<br />
once” (sensitivity), or the guarantee to always hit at least once (lossless property).<br />
Such a criterion can be measured by a dynamic programming algorithm on automata, with a probabilistic<br />
model (a probabilistic automaton, e.g. an HMM (vinar)) - represented by regular expressions - computation<br />
involved -<br />
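As a concrete reading of “the probability to hit at least once”, the following Python sketch computes the sensitivity of a spaced seed under the i.i.d. Bernoulli model by brute-force enumeration of all binary alignments. It is only meant for very short alignments and is not the automaton-based dynamic programming used by Iedera; the function names and the seed encoding (`1` for a required match, `0` for a joker position) are assumptions made for the illustration.

```python
from itertools import product


def seed_hits(seed, alignment):
    """True if `seed` matches `alignment` at some position.

    `seed` is a string over {1, 0}: '1' requires a match, '0' accepts anything.
    `alignment` is a sequence over {'1', '0'}: '1' is a match, '0' a mismatch.
    """
    span = len(seed)
    for start in range(len(alignment) - span + 1):
        if all(alignment[start + q] == "1"
               for q, s in enumerate(seed) if s == "1"):
            return True
    return False


def sensitivity(seed, length, p):
    """Probability that `seed` hits an i.i.d. alignment of `length` columns,
    each column being a match with probability `p` (exhaustive enumeration)."""
    total = 0.0
    for bits in product("01", repeat=length):
        if seed_hits(seed, bits):
            ones = bits.count("1")
            total += p ** ones * (1 - p) ** (length - ones)
    return total


# Compare a spaced seed and a contiguous seed of the same weight (3).
print(sensitivity("1101", 12, 0.7), sensitivity("111", 12, 0.7))
```

The enumeration costs 2^length evaluations, which is exactly why the automaton and matrix formulation described in the next section is needed in practice.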
3 Matrices product<br />
Finite automata are frequently represented by matrices (obviously sparse matrices when DFA are used).<br />
In practice, these matrices are multiplied or raised to powers, in such a way that properties of the initial languages of<br />
the finite automata are computed on “semi-rings”: for example, probabilities are computed on a classical<br />
semi-ring $(E = \mathbb{R}_{0 \le r \le 1}, \oplus = +, \odot = \cdot, 0_\oplus = 0, 1_\odot = 1)$, whereas costs are computed on a tropical semi-ring<br />
[Sim88, Pin98, MS09] $(E = \mathbb{N}, \oplus = \min, \odot = +, 0_\oplus = \infty, 1_\odot = 0)$. Sometimes (but not always),<br />
on tropical semi-rings, such costs are log probability ratios; in that case, the underlying problem one has to<br />
solve is to find the best path (if any) in terms of expected value.<br />
More generally, on both classical and tropical semi-rings, the same algorithm can be applied to compute<br />
seed sensitivity [KNR06] for (what is commonly named) the lossy (classical semi-ring) or lossless (tropical<br />
semi-ring) seed design framework.<br />
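The remark that the same algorithm runs on both semi-rings can be made concrete with a small Python sketch: a single matrix product routine, parameterized by the two operations and the neutral element of the first one. The helper names (`matmul`, `prob_product`, `cost_product`) are hypothetical and the matrices are toy examples, not automata taken from the paper.

```python
INF = float("inf")


def matmul(A, B, add, mul, zero):
    """Matrix product over an arbitrary semi-ring (add, mul, zero)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for r in range(n):
        for c in range(p):
            acc = zero
            for k in range(m):
                acc = add(acc, mul(A[r][k], B[k][c]))
            C[r][c] = acc
    return C


def prob_product(A, B):
    """Classical semi-ring (+, x, 0): entries are probabilities."""
    return matmul(A, B, lambda x, y: x + y, lambda x, y: x * y, 0.0)


def cost_product(A, B):
    """Tropical semi-ring (min, +, infinity): entries are costs."""
    return matmul(A, B, min, lambda x, y: x + y, INF)


P = [[0.5, 0.5], [0.0, 1.0]]   # toy transition probabilities
C = [[0, 2], [INF, 0]]         # toy transition costs (INF = no transition)
print(prob_product(P, P))
print(cost_product(C, C))
```

Only associativity of the product is needed by what follows, which is why the sliding window scheme of the next sections applies unchanged to both cases.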
On the classical semi-ring, HMM models (HMM alignment models) are frequently used in language<br />
recognition and seed sensitivity computation [BBV04, KNR06, HR08] : they give a set of probabilities<br />
(emission probabilities for each state, together with transition probabilities between states) that are computed<br />
out of a “profile” alignment. But when such HMM models have to be used with NGS reads to design seeds,<br />
one has to face a new problem : taking into account the fact that the read can be any sub-string generated<br />
by the HMM alignment model, and thus that the computation may start at any “position” on the alignment<br />
HMM : in some way a more challenging problem.<br />
3.1 Sliding windows product<br />
Such computation, translated into matrix form, implies computing, for an ordered set of (non-invertible)<br />
matrices $M_1, M_2, \ldots, M_n$, a set of products in one of the two following forms:<br />
either:<br />
Problem. compute $\prod_{u=i}^{i+w} M_u \quad \forall i \in [1..n-w]$ (1)<br />
where w is the length of the read,<br />
or more generally:<br />
Problem. compute $\prod_{u=i(t)}^{j(t)} M_u \quad \forall t$ with $i(t) \le j(t)$ (2)<br />
such that i(t) and j(t) are two monotonically (+0 or +1)-increasing functions.<br />
Definition (2) is particularly well suited when the length of the read is not fixed: for example with the 454<br />
sequencing process, where homopolymers are read in a single step and thus give variable read lengths. In<br />
other words, definition (1) is just a special case of (2) where, after increasing j up to w, a stepwise<br />
increment of both i and j is applied. We will thus consider the second definition (2) in the next parts.<br />
3.2 Previous work for the Online query product after preprocessing<br />
Alon and Schieber [AS87] have proposed an optimal online way to answer any (non-commutative) product<br />
$\prod_{t=i}^{j} M_t$ for any i and j with a constant number k of products, after a preprocessing in $\Theta(n \cdot \lambda(k,n))$ where<br />
$\lambda(k,n)$ is defined as the inverse of a certain function at the $\lfloor k/2 \rfloor$-th level of the primitive recursive hierarchy.<br />
For example $\lambda(0,n) = \lceil n/2 \rceil$, $\lambda(1,n) = \lceil \sqrt{n} \rceil$, $\lambda(2,n) = \log(n)$, $\lambda(3,n) = \log\log(n)$, $\lambda(4,n) = \log^*(n)$.<br />
This fits perfectly when the length of the window and its position are randomly drawn. But when there<br />
are dependencies between the positions of the windows, a sliding window product may be more appropriate.<br />
3.3 Algorithm proposed to compute the Sliding windows product<br />
In our case online queries are not required, so we can avoid doing both preprocessing and processing by using<br />
an “online sliding window product” that moves the two ends of the window separately or jointly: it<br />
costs an amortized constant number of products on problem 1 and problem 2.<br />
This process does not depend on the size of the sliding window in the second problem (which can be<br />
asymptotically improved otherwise, using an approach similar to [AS87]). We are here able to move both left<br />
and right ends i and j of the window step-wise, keeping a set of matrices, and computing the product for<br />
any of the windows obtained.<br />
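To see why an amortized constant number of products per step is achievable without inverses, here is a Python sketch of a well-known alternative technique, the "two-stack" sliding window aggregation. It is explicitly not the U(k) block scheme of this paper, only a simpler stand-in with the same flavour; the function name and the use of string concatenation as a toy associative, non-commutative, non-invertible product are assumptions made for the illustration.

```python
def sliding_window_products(mats, width, mul):
    """Return the products mats[i] * ... * mats[i + width - 1] for every i,
    using only the associative operation `mul` (no inverse, no commutativity)."""
    front = []         # suffix products of the older items; front[-1] covers all of them
    back_items = []    # newer items, in arrival order
    back_prod = None   # product of back_items, in order
    out = []
    for x in mats:
        back_items.append(x)
        back_prod = x if back_prod is None else mul(back_prod, x)
        if len(front) + len(back_items) > width:
            if not front:
                # Rebuild: turn the newer items into a stack of suffix products.
                acc = None
                for y in reversed(back_items):
                    acc = y if acc is None else mul(y, acc)
                    front.append(acc)
                back_items, back_prod = [], None
            front.pop()  # dropping the top removes the oldest item of the window
        if len(front) + len(back_items) == width:
            if front and back_prod is not None:
                out.append(mul(front[-1], back_prod))
            else:
                out.append(front[-1] if front else back_prod)
    return out


# String concatenation as a toy non-commutative product.
print(sliding_window_products(list("abcdefgh"), 3, lambda a, b: a + b))
```

Each item takes part in at most one product when it arrives, one when it is moved to the `front` stack, and one per reported window, which gives a constant amortized cost per step.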
U(k) definition and “pre”-processing: the main additional data consists of a preprocessed and maintained<br />
set of block products U(k) (for $k \in [i..j]$) as shown in Figure 1. U(k) is defined as the product of<br />
a given contiguous block of matrices of size u(k,j) starting from k (to k + u(k,j) − 1). More precisely<br />
$U(k) = \prod_{t=k}^{k+u(k,j)-1} M_t$. u(k,j) is defined as the largest possible value $2^p$ such that $k + u(k,j) - 1 \le j$<br />
and that u(k,j) divides k. Such U(k) blocks are thus of size $u(k,j) = 2^p$, and this size, once fixed, can only<br />
increase (by doubling) depending on the value of j, before disappearing (when i > k).<br />
Maintaining such matrices U(k) for $k \in [i..j]$ costs at most (in amortized analysis) one product<br />
per increase of j (see Appendix 6.2). Note that increasing i simply deletes the last U(i) and thus does not<br />
require any additional product on the U(k)’s. A pseudo-code of the add right process (increment of j) is provided<br />
in Algorithm 1.<br />
Figure 1: U(k) matrices: example when i = 0 and j = 24<br />
Algorithm 1: add right: increments the right border j by one, and updates the set $U_{i..j}$ using<br />
the matrix $M_j$<br />
Data: the set of matrices $M_1, M_2, \ldots, M_n$, the original set $U_{i..j}$<br />
Result: the updated set $U_{i..j}$<br />
/* a) only before the first increment */<br />
if j = 0 then<br />
    $U_0 \leftarrow M_0$;<br />
/* b) increment j */<br />
inc(j);<br />
/* c) and process the set of $U_{j-t}$ matrices that have to be updated */<br />
$U_j \leftarrow M_j$;<br />
$u \leftarrow j + 1$; $t_{old} \leftarrow 0$; $t \leftarrow 1$;<br />
while u is even and $j - t \ge i$ do<br />
    $U_{j-t} \leftarrow U_{j-t} \cdot U_{j-t_{old}}$;<br />
    $t_{old} \leftarrow t$; $t \leftarrow 2 \cdot t + 1$; $u \leftarrow u / 2$;<br />
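The add right step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: indices start at 0, the left border i stays at 0 (no deletion of U(i)), step a) is folded into the general case by starting from j = -1, and the class name `BlockProducts`, its `size` table and its `product` query are hypothetical additions for the illustration. String concatenation stands in for the matrix product.

```python
class BlockProducts:
    """Maintain the block products U(k) while the right border j grows."""

    def __init__(self, mats, mul):
        self.M = mats
        self.mul = mul
        self.i = 0       # left border (kept fixed in this sketch)
        self.j = -1      # right border; -1 means "empty window"
        self.U = {}      # U[k]: product of the block starting at k
        self.size = {}   # size[k]: number of matrices covered by U[k]

    def add_right(self):
        """Increment j by one and update the affected U(k) blocks."""
        self.j += 1
        j = self.j
        self.U[j], self.size[j] = self.M[j], 1
        u, t_old, t = j + 1, 0, 1
        while u % 2 == 0 and j - t >= self.i:
            k = j - t
            # Merge two adjacent blocks of equal size: the block at k doubles.
            self.U[k] = self.mul(self.U[k], self.U[j - t_old])
            self.size[k] *= 2
            t_old, t, u = t, 2 * t + 1, u // 2

    def product(self, i):
        """Product M[i] ... M[j], obtained by chaining the blocks from i."""
        k, acc = i, None
        while k <= self.j:
            acc = self.U[k] if acc is None else self.mul(acc, self.U[k])
            k += self.size[k]
        return acc


bp = BlockProducts(list("abcdefghijklmnopqrstuvwxy"), lambda a, b: a + b)
for _ in range(25):
    bp.add_right()
print(bp.size[0], bp.size[16], bp.size[24], bp.product(5))
```

With 25 matrices (j = 24) the chain starting at 0 uses the blocks U[0], U[16] and U[24], which matches the blocks named in Figure 1.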
Without considering any previous computation kept, it is directly possible to compute the product<br />
$M_i \cdot M_{i+1} \cdots M_j$ for any i, j (j > i) in O(log(j − i)) products using the updated U(k) set of matrices for<br />
$k \in [i..j]$ (see Appendix 6.1).<br />
But when the product is computed while i and j follow the “increasing step” functions defined before,<br />
the number of products can be reduced to a constant for each i and j step-move (or for both moves when the<br />
distance w separating i and j is fixed):<br />
middle definition: we need to define here the middle m of i and j as the beginning position of the<br />
maximal (in size) U-block included in the interval i..j. In other words, m corresponds to the value between<br />
i and j that can be the “most factorized by 2”. If two maximal blocks are between i and j, we choose the<br />
beginning of the second block (see Figure 3.3) (as it always corresponds to the value m that can be the<br />
“most factorized by 2”). This middle border makes it possible to split the computation in two parts when needed, which<br />
we will call left (colored in green on Figure 3.3) and right (red on Figure 3.3). Note that $m < \frac{1}{3} i + \frac{2}{3} j$.<br />
Note also that when there is only one maximal sized block, $m < \frac{1}{2} i + \frac{1}{2} j$, and when there are two<br />
maximal sized blocks, $m > \frac{2}{3} i + \frac{1}{3} j$.<br />
In the next part, we will compute in two separate parts $M_{i..m-1}$ and $M_{m..j}$, considering the case when<br />
m is fixed first, and then two cases when m is increased.<br />
Figure 2: U(k) matrices: example when i = 1 and j = 24 (here m = 16)<br />
middle unchanged: if we suppose that the middle $m$ does not change during a computational step,<br />
it can be observed that:<br />
• when $j$ is increased (so that $j = j_{old} + 1$), updating the product $M_{m..j}$ can be done with one product,<br />
provided that we keep the previous computation $M_{m..j_{old}}$. Since we also update the<br />
U(k) values at the same time, an amortized single product must be added (Amortization on $j$: see<br />
Appendix 6.2).<br />
Joining $M_{i..m-1}$ with $M_{m..j}$ then costs one extra product, giving a total number of products of three.<br />
• when $i$ is increased ($i = i_{old} + 1$), the previous computation $M_{i_{old}..m-1}$ does not help and can be erased.<br />
However, if we suppose that we keep all the previously computed products $M_{k..m-1}$ in a stack for<br />
all the blocks $U_k$ visited before, reusing and updating this part can be done with one single amortized<br />
product (Amortization on $m$: see Appendix 6.3).<br />
Joining $M_{i..m-1}$ and $M_{m..j}$ then costs one extra product, giving a total number of products of two.<br />
At first sight, a {$cost(i) \le i + m$; $cost(j) \le 3j$} cost is applied (when $m$ does not change). However,<br />
this computation has to be updated when $m$ changes; this is considered in the next part:<br />
middle changed: if we suppose that the middle $m$ does change, the previous computation cut in two<br />
parts $M_{i..m-1}$ and $M_{m..j}$ is somehow “compromised”. Let us now see when $m$ changes, and why:<br />
• when $m$ changes due to a $j$ increase: as $m$ follows the beginning of the largest right-most $U_k$ block,<br />
$j$ can increase the maximal block size by a factor of two, either without changing $m$ (case handled before),<br />
or by jumping to the next power-of-two block, thus from $m_{old} = odd \times 2^p$ to $m = (odd+1) \times 2^p = \frac{odd+1}{2} \times 2^{p+1}$.<br />
This last case has no consequence on the product $M_{m..j}$, which is immediately computed<br />
by the update of the U(k) values since $M_{m..j}$ corresponds to a single maximal block in U(k), thus in<br />
one single product here (and not two).<br />
However, moving $m$ will obviously compromise the left stack of previous computations $M_{k..m_{old}-1}$, which<br />
will no longer help the computation of the next $M_{i..m_{old}-1}$ on the next increase of $i$, since $m_{old}$ is now<br />
pushed to the next power of two $m$; this stack can be erased.<br />
This extra cost can however be amortized by a factor $\frac{9}{8}$ of $\Delta m$, where $\Delta m$ represents the increase of $m$<br />
(Amortization on $m$, see Appendix 6.4). Joining $M_{i..m-1}$ with $M_{m..j}$ then costs one extra product.<br />
Finally, when $m$ changes due to a $j$ increase, a {$cost(j) \le 2j + \frac{9}{8} m$} cost is applied.<br />
• when $i$ is increased so that $i > m$ (thus $i = m + 1$), $m$ can only “jump” to a next block of smaller<br />
size: the cost on the left stack $[i..m-1]$ is already paid, as it corresponds to a “legal” move of $i$ that<br />
is amortized by one product as seen previously (Amortization on $m$: see Appendix 6.3).<br />
However, moving $m$ will obviously compromise the right computation of $M_{m_{old}..j}$, since $m_{old}$ is now<br />
pushed to the next (smaller) block; this computation can be erased and recomputed.<br />
This cost can however be amortized by a factor $\frac{9}{8}$ of $\Delta m$, where $\Delta m$ is the increase of $m$ (Amortization on $m$, see<br />
Appendix 6.5).<br />
Joining $M_{i..m-1}$ and $M_{m..j}$ then costs one extra product.<br />
Finally, when $m$ changes due to an $i$ increase, a {$cost(i) \le i + \frac{9}{8} m$} cost is applied.<br />
To conclude, a {$cost(i) \le i + \frac{9}{8} m$; $cost(j) \le 3j + \frac{1}{8} m$} cost is applied.<br />
4 Practical implementation<br />
First, it is very likely that these bounds can be improved by a more precise analysis. However, going under a<br />
bound of 3.0 per move is unlikely, at least without any initial amortized costs, since we have found at least<br />
one example such that the number of products is 3.00325 per move (see footnote 2).<br />
Moreover, when $j$ is increased by “runs” while $i$ is fixed, the proposed algorithm can be enhanced with a<br />
greedy computation of the $M_{i..j}$ product (which can be done quickly provided that $i$ is fixed for a while). In<br />
practice, this implementation always gives fewer products than the proposed one, but it has not been carefully<br />
analysed so far.<br />
On the other hand, some more practical considerations also show that, when applied to sparse matrices,<br />
such a product cannot be considered as a “constant” operation, but more likely as a “function of the sparsity”.<br />
Such an implementation however needs to know this “sparsity cost” for all the possible products, which, in<br />
practice on unknown automata, is similar to simulating the product, and thus costs as much as the product<br />
itself...<br />
5 Experiments<br />
The previous algorithm has been implemented and tested in iedera where<br />
Speedup in practice ... over the naive range product.<br />
On a typical example, for a window of length 108 (which corresponds to the Illumina read length here) and a<br />
profile of size 1605 (16S rRNA), the number of products for the full computation of the 1605 − 108 + 1 = 1498<br />
windows is 5933 (note that each window needs a displacement of both i and j).<br />
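The figures of this example can be put in perspective with a few lines of arithmetic. The per-window and per-move averages follow directly from the numbers above. The naive baseline is an assumption made here for comparison only: it supposes that every window is recomputed from scratch with (window length − 1) products.

```python
window = 108          # window length (Illumina read length in the example)
profile = 1605        # profile size (16S rRNA in the example)
measured = 5933       # total number of products reported above

windows = profile - window + 1        # number of window positions
per_window = measured / windows       # each window is one i-move and one j-move
per_move = per_window / 2

# Assumed baseline: recompute every window with window - 1 products.
naive = windows * (window - 1)

print(windows, round(per_window, 2), round(per_move, 2), naive)
```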
Footnote 2: the move sequence is 1,2,−1,3..24,−2,−3,25..51,−4,52..72,−5,73..392,−6,393..441,−7,−8,442..577,−9,578..3071, where i-moves are written<br />
with a minus sign.<br />
6 Appendix<br />
6.1 Worst case number of products from i to j<br />
We denote by $n$ the number of single matrices: $n = j - i + 1$ ($n$ is thus the length of the block being<br />
computed with the help of the $U_k$ matrices already given). We illustrate below how to obtain the smallest size $n$<br />
for a given number of products $x$.<br />
• If $x$ is odd, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends, for<br />
$i \in [0..\frac{x-1}{2}]$ (see Figure 3 for $x = 5$):<br />
Figure 3: U(k) matrices and product: example when i = 9 and j = 23 (n = 14)<br />
$$n = 2 \sum_{i=0}^{\frac{x-1}{2}} 2^i = 2\sqrt{2} \times 2^{\frac{x}{2}} - 2 \qquad x = 2 \log_2\left(\frac{n+2}{2\sqrt{2}}\right)$$<br />
• If $x$ is even, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends of a block<br />
of size $2^{\frac{x}{2}}$, for $i \in [0..\frac{x-2}{2}]$ (see Figure 4 for $x = 4$):<br />
Figure 4: U(k) matrices and product: example when i = 9 and j = 19 (n = 10)<br />
$$n = 2 \sum_{i=0}^{\frac{x-2}{2}} 2^i + 2^{\frac{x}{2}} = 3 \times 2^{\frac{x}{2}} - 2 \qquad x = 2 \log_2\left(\frac{n+2}{3}\right)$$<br />
Combining those two cases, it can be shown that when the number of products is set to $x = 1$, 2 or 3,<br />
the minimal size is exactly $2 \times x$, and that when $x > 3$ (or $x = 0$) this minimal bound is never<br />
reached again.<br />
Figure 5: minimal n (for x even and odd) functions compared to 2 × x<br />
In other words, the number of products $x$ is always $\le \frac{n}{2}$.<br />
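The two closed forms can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, `j` is taken as inclusive, and the exhaustive search is limited to small start positions and lengths. It compares the formulas with a brute-force search over the block decomposition and with the line 2 × x.

```python
def min_size(x):
    """Smallest n that needs x products, from the two closed forms."""
    if x % 2 == 1:
        return 2 * 2 ** ((x + 1) // 2) - 2   # equals 2*sqrt(2)*2^(x/2) - 2
    return 3 * 2 ** (x // 2) - 2


def products_needed(i, j):
    """Number of products to join the aligned blocks covering [i..j]."""
    blocks, k = 0, i
    while k <= j:
        size = 1
        while k % (2 * size) == 0 and k + 2 * size - 1 <= j:
            size *= 2
        blocks += 1
        k += size
    return blocks - 1


def brute_force_min_size(x, max_start=64, max_len=64):
    """Exhaustive search of the smallest n reaching x products."""
    return min(n for i in range(max_start) for n in range(1, max_len + 1)
               if products_needed(i, i + n - 1) == x)


if __name__ == "__main__":
    for x in range(6):
        print(x, min_size(x), brute_force_min_size(x), 2 * x)
```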
6.2 Amortized analysis of $U_k$ blocks when $i = 0$ and $j \ge 0$<br />
Summing the number of products needed when computing $U_k$ should be 2 on average, and not 1: a quick<br />
analysis shows that, indeed, if one product is done half of the time, two are done each 1/4, three done each<br />
1/8, and so on, then $\sum_{u=1}^{\infty} \frac{u}{2^u} = 2$.<br />
However, we will show here that the amortized number of products per increase of $j$ is only 1. We use an<br />
amortized analysis, giving one coin each time $j$ is increased ($i$ is supposed to stay at 0, but this assumption<br />
can be dropped since it can be seen as a worst case when updating $U_k$), to show that any sub-block $U_k$<br />
generates one extra coin; thus, when it is grouped with its neighbor block of the same size (itself generating one extra coin),<br />
the father block processed from those two also generates (1+1)−1 = one extra coin.<br />
• this is true for blocks of size 2, since they are built of blocks of size 1 that do not generate any product:<br />
the cost for such a block of size 2 is thus 1, and 1 extra coin remains.<br />
• this can be easily verified for blocks of size $2^p$ ($p > 1$), since by induction hypothesis the two sub-blocks<br />
of size $2^{p-1}$ each give one extra coin: the cost of joining the two sub-blocks then removes<br />
one coin, and one extra coin remains again.<br />
Note that this analysis holds for any $i \ge 0$ and any $j > i$, provided that an extra number of<br />
$j - i$ coins is provided at first.<br />
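The amortized claim can be observed by simply counting. The sketch below assumes i stays at 0, so that the update loop runs once per trailing zero bit of j + 1; the function names are illustrative. It compares the total with n minus the number of 1 bits of n, which stays below n.

```python
def products_at_step(j):
    """Products done by the update when j is reached (i fixed at 0).

    The update loop runs once per trailing zero bit of j + 1.
    """
    u, count = j + 1, 0
    while u % 2 == 0:
        u //= 2
        count += 1
    return count


def total_products(n):
    """Total products spent while j goes from 1 to n - 1 (n matrices)."""
    return sum(products_at_step(j) for j in range(1, n))


if __name__ == "__main__":
    for n in (2, 16, 1000):
        print(n, total_products(n), n - bin(n).count("1"))
```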
6.3 Amortized analysis of the left $M_{i..m-1}$ blocks when $m$ is fixed and $i$ is increased<br />
Summing the number of products needed when computing $M_{i..m-1}$ for any $i$ from 0 to $m$ gives 1 on average:<br />
a quick analysis shows indeed that if zero products are done half of the time (when $i$ is even), one product is<br />
done each 1/4, two done each 1/8, and so on, then $\sum_{u=0}^{\infty} \frac{u}{2^{u+1}} = 1$.<br />
But this does not guarantee that the total number of products paid when increasing $i$ from any value (for<br />
example 0) to $m$ is always less than $m$. Here we will show that the number of products (once $m$ is fixed) for<br />
computing $M_{i..m-1}$ for any $i$ from 0 to a given $m = 2^p$ is $\le 2^p - p - 1$.<br />
m = 2 → 0<br />
m = 4 → 1<br />
m = 8 → 4<br />
m = 16 → 11<br />
A method similar to that of Section 6.2 can be applied.<br />
First we consider the case when $i = 0$ and $m$ has been increased to reach a given (and fixed) value $2^p$.<br />
• this is true when $p = 1$ (thus when $m = 2$) since, using $U_k$ blocks, no product is needed to compute<br />
$M_{0..1}$ and $M_{1..1}$.<br />
• this can be verified for blocks of size $2^p$ ($p > 1$), since we can then use the two sub-blocks of size $2^{p-1}$:<br />
when $i$ is within the first sub-block, as the product is done from $m$ to $i$ and stacked in such a way that any<br />
suffix $M_{k..m}$ is kept, it costs the products produced by this sub-block ($2^{p-1} - (p-1) - 1$) added to the<br />
$\log_2(\frac{m}{2}) = p - 1$ extra products to cover the second sub-block of size $2^{p-1}$; when $i$ is within the second<br />
sub-block, it costs exactly the number of products produced by this sub-block ($2^{p-1} - (p-1) - 1$). Thus, when<br />
summing these two quantities, the number of products is $\le 2 \times (2^{p-1} - (p-1) - 1) + (p-1) = 2^p - p - 1$.<br />
Thus, increasing $i$ from any value $\ge 0$ to $m$ and computing all the possible products (with the help of<br />
the blocks $U_k$) costs $\le m - \log_2(m) - 1$ products, and thus less than $m$.<br />
Note that this analysis holds for any $i \ge 0$ and any $m$ (not necessarily a strict power<br />
of 2, but written as $m = a \times 2^p$ such that $2^p$ is the maximal block size of $U_k$ for $k \in [i..j]$).<br />
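The induction above can be replayed numerically. The sketch below is only a check of the recurrence, with the hypothetical name `left_stack_bound`: it doubles the bound of a sub-block of size $2^{p-1}$, adds the $p - 1$ extra products, and compares the result with the closed form $2^p - p - 1$ and with the four values listed for m = 2, 4, 8 and 16.

```python
def left_stack_bound(p):
    """Bound on the products for the left part when m = 2**p.

    Follows the induction: two sub-blocks of size 2**(p-1), plus
    p - 1 extra products to cover the second sub-block.
    """
    if p == 1:
        return 0
    return 2 * left_stack_bound(p - 1) + (p - 1)


if __name__ == "__main__":
    for p in range(1, 5):
        print(2 ** p, left_stack_bound(p), 2 ** p - p - 1)
```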
6.4 Amortized analysis of the left $M_{i..m-1}$ blocks when $m$ is increased (due to a $j$ increase) and $i$ is fixed<br />
When $j$ increases while $i$ is fixed, $m$ may change to a new (and of course increased) value pointing to an<br />
equal (or twice larger) block: this happens when $m$ goes from $m_{old} = 2^{p_{old}} \times a_{old}$ (with $a_{old}$ odd) to its new<br />
value $m = m_{old} + 2^{p_{old}} = (a_{old}+1) \times 2^{p_{old}} = a \times 2^p$ (with $a = \frac{a_{old}+1}{2}$ and $p = p_{old}+1$), as illustrated on Figure 6.<br />
Figure 6: U(k) matrices and Mi..m−1 product: example when i = 33 and j goes from 47 to 48 (m_old = 36, m = 40)<br />
We are here interested in the computation of $M_{i..m-1}$ due to this $\Delta m = m - m_{old} = 2^{p_{old}}$ increase. In<br />
practice, since $m$ has changed, the full set of left-stack matrices $M_{k..m-1}$ has to be recomputed for some<br />
$k \in [i..m]$, and some products already done for $M_{k..m_{old}-1}$ unfortunately have to be done twice here.<br />
This twice-cost is at most $\log_2(\frac{m_{old}-i+1}{2}) \le \log_2(\frac{m-m_{old}}{2}) = \log_2(\frac{\Delta m}{2})$ ($m_{old} - i$ [...]<br />
[...] $m$ while $j$ is fixed, $m$ must change to a new (and of course increased) value<br />
pointing to a smaller block. This happens when $m$ goes from $m_{old} = 2^{p_{old}} \times a_{old}$<br />
to its new value $m = m_{old} + 2^{p_{old}} = (a_{old}+1) \times 2^{p_{old}}$, as illustrated on Figure 7.<br />
Compared to the case previously seen in Appendix 6.4, there is no twice-cost on the left stack, so this<br />
part is still amortized as in Section 6.3. However, the right part now has to be recomputed.<br />
Figure 7: U(k) matrices and Mm..j product: example when j = 47 and i goes from 32 to 33 (m_old = 32, m = 40)<br />
We are here interested in the computation of $M_{m..j}$ due to this $m$ change (red mark on Figure 7).<br />
This cost is at most $\log_2(\frac{j-m+1}{2}) \le \log_2(\frac{m-m_{old}}{2}) = \log_2(\frac{\Delta m}{2})$ ($j - m$ [...]<br />
Using Appendix 6.2 and Appendix 6.5 (in a similar way to 6.4), moving $i$ thus implies<br />
• a cost of 1 (per $i$ increase) + 1 (per “$m$ amortized increase”, see Section 6.3) when $m$ is fixed,<br />
• a cost of 1 (per $i$ increase) + $\log_2(\frac{\Delta m}{2})$ when $m$ is increased by $\Delta m = 2^p$.<br />
A rapid analysis of the two cases combined shows that this cost can be bounded by $1 \times i + \frac{9}{8} \times m$ (this worst<br />
case can be produced when $\Delta m = 8$ or $\Delta m = 16$). This cost is thus $\le \frac{11}{8} \times i + \frac{3}{4} \times j$ (since $m \le \frac{2j+i}{3}$) and<br />
can be roughly bounded by $2\frac{1}{8}$ per $j$ increase (since $i \le j$).<br />
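The last step can be spelled out. The lines below only expand the two substitutions already stated ($m \le \frac{2j+i}{3}$, then $i \le j$); they are not an additional result.

```latex
\begin{align*}
1 \times i + \tfrac{9}{8} \times m
  &\le i + \tfrac{9}{8} \times \tfrac{2j + i}{3}
   = i + \tfrac{3}{8}\,(2j + i) \\
  &= \tfrac{11}{8} \times i + \tfrac{3}{4} \times j \\
  &\le \left(\tfrac{11}{8} + \tfrac{6}{8}\right) j
   = \tfrac{17}{8} \times j = 2\tfrac{1}{8} \times j
   \qquad (\text{since } i \le j)
\end{align*}
```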
Peptides to match
...
...
this time ... no last-minute excuse found<br />
thanks for the many PJIs supervised or chaired this year!!<br />
1 we are looking for a new PJI chair for next year<br />
(Gery is leaving)<br />
2 we are looking for someone to take over the module<br />
(who is in this room)<br />
Tuesday 28 May<br />
Date | Time | Room | Project | Title | Author | Supervisor | Student1 | Student2<br />
2013-05-28 | 16h00 | M5-A8 | 127 | [ALTERNANT] étudiant:Stephane Drubay & entreprise:Odysys | Pierre-Eric Marez | Philippe Marquet | Stephane Drubay |<br />
2013-05-28 | 16h30 | M5-A8 | 132 | [ALTERNANT] étudiant:Laura Leclercq & entreprise:Odysys | Pierre-Eric Marez | Philippe Marquet | Laura Leclercq |<br />
2013-05-28 | 17h00 | M5-A8 | 137 | [ALTERNANT] étudiant:Kévin Moulart & entreprise:Proges Plu | Philippe Viot | Maude Pupin | Kévin Moulart |<br />
2013-06-03 | 17h30 | M5-A8 | 134 | [ALTERNANT] étudiant:Laurent Leleux & entreprise:Proges Pl | Maryvonne Viot | Maude Pupin | Laurent Leleux |<br />
Wednesday 29 May<br />
Date | Time | Room | Project | Title | Author | Supervisor | Student1 | Student2<br />
2013-05-29 | 09h00 | M5-A9 | 22 | Évolution de l'application de gestion du personnel de l'IEEA | Jean-Christophe Routier | Jean-Christophe Routier | Mélissa Blain | Jérôme Wyckaert<br />
2013-05-29 | 09h30 | M5-A9 | 89 | Annuaire de l'association des anciens de la MIAGE de Lille | Anne-Cécile Caron | Anne-Cécile Caron | Nicolas Vandemeulebrou | Florian Bruffaert<br />
2013-05-29 | 10h30 | M5-A9 | 111 | Outils de communication pour l'association AVERS | Anne-Cécile Caron | Anne-Cécile Caron | Mamadou Bachir Bah |<br />
2013-05-29 | 11h00 | M5-A9 | 90 | Génération de documents pédagogiques | Anne-Cécile Caron | Anne-Cécile Caron | Maxime Boucher | Latifou Sano<br />
2013-05-29 | 14h00 | M5-A9 | 86 | Amélioration d'un logiciel de visualisation d'orbite | Florent Deleflie | Francesco De Comité | Romain Frangi | Dimitri Descamps<br />
2013-05-29 | 14h30 | M5-A9 | 112 | Concours Infotel | Anne-Cécile Caron | Anne-Cécile Caron | Christopher Laethem | Zakariae Azaroual<br />
2013-05-29 | 15h00 | M5-A9 | 113 | Concours Infotel (suite) | Anne-Cécile Caron | Anne-Cécile Caron | Nassim Hassaine | Zouhair Makhout<br />
2013-05-29 | 15h30 | M5-A9 | 45 | Analyse automatique de l'historique Git des logiciels | Martin Monperrus | Martin Monperrus | Sylvain Magnier | Maxence Montauzan<br />
2013-05-29 | 16h00 | M5-A9 | 95 | Export de code source Python en XML | Martin Monperrus | Martin Monperrus | Pierre Frayer | Antoine Goubel<br />
Thursday 30 May<br />
Date | Time | Room | Project | Title | Author | Supervisor | Student1 | Student2<br />
2013-05-30 | 09h30 | M5-A9 | 93 | Optimisation de flots | François Clautiaux | Marie-Emile Voge | Irina Bakardzhieva | Ophélie Debiève<br />
2013-05-30 | 10h30 | M5-A9 | 77 | Suivi d'accueil des enfants dans un centre périscolaire - factur | Periscope | Marius Bilasco | Rémi Kaczmarek | Maxime Vanpeene<br />
2013-05-30 | 11h00 | M5-A9 | 115 | Reprise application de gestion de listes de présences alternan | Marius Bilasco | Marius Bilasco | Alexis Boutrouille | Pierre Bailleul<br />
2013-05-30 | 14h00 | M5-A9 | 116 | Application web de gestion de suivis de recherche de stage | Patricia Plénacoste | Maude Pupin | Soufiane Agadr | Thomas Aubry<br />
2013-05-30 | 14h30 | M5-A9 | 92 | Création d’une base de données sur la glycosylation du poisso | Yann Guerardel | Olga Plechakova | Karl Deleforterie | Franck David<br />
2013-05-30 | 15h30 | M5-A9 | 23 | Base de données et données géographiques | Francis Bossut | Francis Bossut | Pierrick Lesage | Alexandre Bienvenu<br />
2013-05-30 | 16h00 | M5-A9 | 99 | Site web de la MDE | Eric Bros | Raphaël Marvie | Djamel Amara |<br />
Friday 31 May<br />
Date | Time | Room | Project | Title | Author | Supervisor | Student1 | Student2<br />
2013-05-31 | 08h30 | M5-A7 | 2 | Reconstituer le puzzle : depuis des fragments jusqu'à l'ARN | Mikaël Salson | Mikaël Salson | Charles Husquin |<br />
2013-05-31 | 09h00 | M5-A7 | 4 | Alecsia apprend à lire les ODT et PDF | Mikaël Salson | Mikaël Salson | Anthony Tonglet |<br />
2013-05-31 | 10h15 | M5-A7 | 78 | Evolution de l'application de suivi d'alternants et stages | Marius Bilasco | Marius Bilasco | Ayoub Nejmeddine | Sara El-Arbaoui<br />
2013-05-31 | 10h45 | M5-A7 | 81 | Take a photo for me | Marius Bilasco | Marius Bilasco | Jérémie Samson | Victor Paumier<br />
2013-05-31 | 11h15 | M5-A7 | 82 | Interagir avec votre ordinateur de la tête | Marius Bilasco | Marius Bilasco | Mamadou Diop |<br />
2013-05-31 | 11h45 | M5-A7 | 84 | Analyse contextuelle de collections de photos privées | Marius Bilasco | Marius Bilasco | Benjamin Allaert | Benjamin Flahauw<br />
2013-05-31 | 14h30 | M5-A7 | 6 | Frameworks PHP et back-offices pour applications mobiles | Jean-Claude Tarby | Jean-Claude Tarby | Omar Chahbouni | Abderrahime El Idrissi<br />
2013-05-31 | 15h00 | M5-A7 | 8 | Intégration des ondes cérébrales dans la vie courante | Jean-Claude Tarby | Jean-Claude Tarby | Mickaël Duruisseau | Nicolas Coyard<br />
2013-05-31 | 16h15 | M5-A7 | 68 | Conception d'un Raspberry pi dédié aux présentations | Bruno Bogaert | Bruno Bogaert | Louis Billiet | Sylvain Goulliart<br />
2013-05-31 | 16h45 | M5-A7 | 69 | Écosystème pour gestion d'emploi du temps hebdomadaire | Bruno Bogaert | Bruno Bogaert | Dhia Elhak Lakhal | Sylvain Malfait<br />
2013-05-31 | 10h15 | M5-A8 | 46 | Intégration de Drone à une plateforme logicielle | Gwenael Cattez | Gwenael Cattez | Ali Hedjaz | Tony Tran<br />
2013-05-31 | 10h45 | M5-A8 | 65 | Moteur de scripts sous iOS | Nicolas Haderer, Romain | Romain Rouvoy | Benjamin Digeon | Florent David<br />
2013-05-31 | 11h15 | M5-A8 | 66 | Utiliser les téléphones mobiles pour l’estimation de la densité d | Nicolas Haderer | Romain Rouvoy | Julien Duribreux | Justin Dufour<br />
2013-05-31 | 14h00 | M5-A8 | 34 | Interface de visualisation de molécules | Maude Pupin | Laurent Noé | Antonia Ludunge |<br />
2013-05-31 | 14h30 | M5-A8 | 91 | Pipeline d'analyse de régions de cassures | Jean-Stéphane Varré | Jean-Stéphane Varré | Gauvain Marquet |<br />
2013-05-31 | 15h00 | M5-A8 | 30 | Robot lego solveur de Sudoku | Francesco De Comité | Leopold Weinberg | Oulamine Youssef | El Achiqi Anas<br />
2013-05-31 | 16h15 | M5-A8 | 37 | Traitement semi-automatique des feuilles de présence | Géry Casiez | Géry Casiez | Alexis Linke | Maxence Gaudry<br />
2013-05-31 | 16h45 | M5-A8 | 3 | Conception d'un reseau social orienté vidéo | Antoine Thomas | Antoine Thomas | Emmanuel Pede | Thomas Besset<br />
2013-05-31 | 08h30 | M5-A9 | 105 | Suivi d'un capteur en 3D a l'aide d'une webcam | Jean Rioult | Sébastien Ambellouis | Matthieu Fesselier | Guillaume Huylebroeck<br />
2013-05-31 | 09h00 | M5-A9 | 110 | Algorithmes de placement en deux dimensions | François Clautiaux | François Clautiaux | Romain Windels |<br />
2013-05-31 | 10h15 | M5-A9 | 70 | Home Cloud Server | Cedric Dumoulin | Cedric Dumoulin | Lison Gallos | Arnaud Caulier<br />
2013-05-31 | 10h45 | M5-A9 | 27 | Framework de modélisation dans les Tablettes Android | Amine El Kouhen | Cédric Dumoulin | Malika Rakhaoui | Fatou-Laye Mbaye<br />
2013-05-31 | 11h15 | M5-A9 | 71 | Etude de la spécification des représentations arborescentes | Cedric Dumoulin | Cedric Dumoulin | Adrien Burillon | Thomas Camberlin<br />
2013-05-31 | 11h45 | M5-A9 | 72 | Generateur de GUI Android | Cedric Dumoulin | Cedric Dumoulin | Gerard Paligot |<br />
2013-05-31 | 14h00 | M5-A9 | 33 | Intégration du support multitouch dans Pharo | Stéphane Ducasse | Stéphane Ducasse | Francois Lepan | Benjamin V. Ryseghem<br />
2013-05-31 | 14h30 | M5-A9 | 50 | Interaction Kinect pour une application ludique | Samuel Degrande | Patricia Plenacoste | Thomas Crepel | Rémi Boens<br />
2013-05-31 | 15h00 | M5-A9 | 108 | Développement d'un plugin Eclipse de transformation et d'ana | Martin Monperrus | Benoit Cornu | Amina El-Mekky | Ouardia Ma-Z<br />
Lundi 3 juin<br />
Date Heure Salle Projet Titre Auteur Responsable Etudiant1 Etudiant2<br />
<strong>2013</strong>-06-03 09h00 M5-A8 138 [ALTERNANT] étudiant:Augustin Petre & entreprise:DecathlonJulien Mouchon Jean-Claude Tarby Augustin Petre<br />
<strong>2013</strong>-06-03 09h30 M5-A8 124 [ALTERNANT] étudiant:Olivier Debreu & entreprise:Noolitic Sylvain Deceuninck Gilles Grimaud Olivier Debreu<br />
<strong>2013</strong>-06-03 10h15 M5-A8 135 [ALTERNANT] étudiant:Alexandre Loywick & entreprise:GenesGaël Even Mikaël Salson Alexandre Loywick<br />
<strong>2013</strong>-06-03 10h45 M5-A8 123 [ALTERNANT] étudiant:Tristan Cavelier & entreprise:Nexedi Jean-Paul Smets Mikaël Salson Tristan Cavelier<br />
<strong>2013</strong>-06-03 11h45 M5-A8 141 [ALTERNANT] étudiant:Dominique Testelin & entreprise:Idees3Guillaume Palamin Fabrice Aubert Dominique Testelin<br />
<strong>2013</strong>-06-03 14h00 M5-A8 136 [ALTERNANT] étudiant:Nathanael Martin & entreprise:Unis Michaël Macquart Yves Roos<br />
Nathanael Martin<br />
<strong>2013</strong>-06-03 15h00 M5-A8 143 [ALTERNANT] étudiant:Donovan Watteau & entreprise:Cerise Gauthier M Dequidt Arnaud Liefooghe Donovan Watteau<br />
<strong>2013</strong>-06-03 14h00 M5-A7 52 Recherche de candidats/jobs sans contact Nabil Djarallah, Nicolas H Nabil Djarallah Gens Maxime Camille Riquier<br />
<strong>2013</strong>-06-03 14h30 M5-A7 53 API de contrôle de drones volants Nabil Djarallah, Nicolas P Nicolas Petitprez Mohamed Ouannane Jeremy Diaz<br />
<strong>2013</strong>-06-03 15h00 M5-A7 54 Petites annonces en réalité augmentée Nabil Djarallah, Nicolas P Nicolas Petitprez Alexandre Raulin Yann Duval<br />
<strong>2013</strong>-06-03 16h15 M5-A7 118 Intégration du uPnP dans le serveur embarqué SMEWS Gilles Grimaud Gilles Grimaud Edouard Berton Nicolas Ryckembusch<br />
<strong>2013</strong>-06-03 16h45 M5-A7 119 Interface graphique en python pour la commande de compilati Gilles Grimaud Gilles Grimaud Rabab Bouziane Narjes Jomaa<br />
<strong>2013</strong>-06-03 14h00 M5-A9 20 Capture de mouvement 3D avec une caméra Microsoft Kinect Hazem Wannous Hazem Wannous Derek Hendrickx Benjamin Makusa<br />
<strong>2013</strong>-06-03 14h30 M5-A9 107 Essayage 3D des lunettes virtuelles avec une caméra Microso Hazem Wannous Hazem Wannous Pierre Villoutreix Maxime Chaste<br />
<strong>2013</strong>-06-03 15h00 M5-A9 31 Robot lego machine de Turing Francesco De Comité Eric Wegrzynowski Matthieu Poudroux Ronan Dhellemmes<br />
<strong>2013</strong>-06-03 16h15 M5-A9 55 Extraction d'information textuelles multilingue à partir de flux s Luigi Lancieri Luigi Lancieri Shichen Zhao Amira Kamli<br />
<strong>2013</strong>-06-03 16h45 M5-A9 56 Analyse du buzz sur twitter Luigi Lancieri Luigi Lancieri Florian Michiel Alessio Trunfio<br />
Tuesday 4 June<br />
Date Time Room Project Title Author Supervisor Student1 Student2<br />
<strong>2013</strong>-06-04 13h30 M5-A7 85 Plugin de visualisation 3D pour la consommation énergétique Romain Rouvoy Romain Rouvoy Aurore Allart Benjamin Ruytoor<br />
<strong>2013</strong>-06-04 14h00 M5-A7 109 Réseau de neurones artificiels pour reconnaissance d'émotion Pierre Boulet Pierre Boulet Sanaa Mouatassim<br />
<strong>2013</strong>-06-04 14h30 M5-A7 10 IHM HTML5 pour un simulateur de marchés financiers Yann Secq Philippe Mathieu Thomas Buisine Romain Belmonte<br />
<strong>2013</strong>-06-04 15h00 M5-A7 106 Mise en place d'une application vidéo sur la carte xilinx Zynbq Jean-Luc Dekeyser Jean-Luc Dekeyser Quang-Tung Nguyen Antoine B. Kiatoko<br />
<strong>2013</strong>-06-04 15h30 M5-A7 117 Experimentation d'un codeur jpeg sur Homade : une approche Rabie Ben Atitallah Jean-Luc Dekeyser Aurelien Bertiaux<br />
<strong>2013</strong>-06-04 16h15 M5-A7 102 Ecosystèmes virtuels et programmation 3D : spécification et d Samuel Blanquart Samuel Blanquart Lois Arens Yoann Bouquet<br />
<strong>2013</strong>-06-04 09h00 M5-A8 131 [WORK-STUDY] student:Jules Ivanic & company:Gfi Thomas Ribeaucoup Jean-Christophe Routier Jules Ivanic<br />
<strong>2013</strong>-06-04 09h30 M5-A8 133 [WORK-STUDY] student:Sebastien Leclercq & company:Lifedo Herve Fourmeaux Jean-Christophe Routier Sebastien Leclercq<br />
<strong>2013</strong>-06-04 10h15 M5-A8 142 [WORK-STUDY] student:Valois Vander-Cruyssen & company:M Anthony Dhondt Jean-Luc Levaire Valois Vander-Cruyssen<br />
<strong>2013</strong>-06-04 10h45 M5-A8 121 [WORK-STUDY] student:Loic Allart & company:Vekia Vincent Wauters Laetitia Jourdan Loic Allart<br />
<strong>2013</strong>-06-04 11h15 M5-A8 126 [WORK-STUDY] student:Stefan Dochez & company:Alternative Guillaume Pellien Lionel Seinturier Stefan Dochez<br />
<strong>2013</strong>-06-04 11h45 M5-A8 130 [WORK-STUDY] student:Etienne Helluy-Lafont & company:Adv Jeremie Jourdin Pierre Boulet Etienne Helluy-Lafont<br />
<strong>2013</strong>-06-04 12h15 M5-A8 128 [WORK-STUDY] student:Thibaut Frain & company:Valipost Thierry Thibaut Philippe Marquet Thibaut Frain<br />
<strong>2013</strong>-06-04 14h00 M5-A8 129 [WORK-STUDY] student:Rémi Gosselin & company:J2S Jean-Yves Jourdain Samuel Hym Rémi Gosselin<br />
<strong>2013</strong>-06-04 14h30 M5-A8 139 [WORK-STUDY] student:Fabien Piette & company:Recisio Jean-Baptiste Defossez Samuel Hym Fabien Piette<br />
<strong>2013</strong>-06-04 15h00 M5-A8 125 [WORK-STUDY] student:Jérôme Desjardins & company:Stadlin Pascal Farange Marius Bilasco Jérôme Desjardins<br />
<strong>2013</strong>-06-04 15h30 M5-A8 140 [WORK-STUDY] student:Cesar Splete & company:Audaxis Vincent Hosatte Marius Bilasco Cesar Splete<br />
<strong>2013</strong>-06-04 16h15 M5-A8 120 [WORK-STUDY] student:Romuald Alapide & company:Cap Ge Jean-Yves Byhet Alexandre Sedoglavic Romuald Alapide<br />
Session chairs<br />
Laetitia Jourdan<br />
Anne-Cécile Caron<br />
Fabrice Aubert<br />
Gery Casiez<br />
Laurent Noé<br />
Mikael Salson
Friday 31 May<br />
Date Time Room Project Title Author Supervisor Student1 Student2<br />
<strong>2013</strong>-05-31 08h30 M5-A7 2 Reconstituer le puzzle : depuis des fragments ju Mikaël Salson Mikaël Salson Charles Husquin<br />
<strong>2013</strong>-05-31 09h00 M5-A7 4 Alecsia apprend à lire les ODT et PDF Mikaël Salson Mikaël Salson Anthony Tonglet<br />
<strong>2013</strong>-05-31 10h15 M5-A7 78 Evolution de l'application de suivi d'alternants et Marius Bilasco Marius Bilasco Ayoub Nejmeddine Sara El-Arbaoui<br />
<strong>2013</strong>-05-31 10h45 M5-A7 81 Take a photo for me Marius Bilasco Marius Bilasco Jérémie Samson Victor Paumier<br />
<strong>2013</strong>-05-31 11h15 M5-A7 82 Interagir avec votre ordinateur de la tête Marius Bilasco Marius Bilasco Mamadou Diop<br />
<strong>2013</strong>-05-31 11h45 M5-A7 84 Analyse contextuelle de collections de photos pr Marius Bilasco Marius Bilasco Benjamin Allaert Benjamin Flahauw<br />
<strong>2013</strong>-05-31 14h30 M5-A7 6 Frameworks PHP et back-offices pour application Jean-Claude Tarby Jean-Claude Tarby Omar Chahbouni Abderrahime El Idrissi<br />
<strong>2013</strong>-05-31 15h00 M5-A7 8 Intégration des ondes cérébrales dans la vie cou Jean-Claude Tarby Jean-Claude Tarby Mickaël Duruisseau Nicolas Coyard<br />
<strong>2013</strong>-05-31 16h15 M5-A7 68 Conception d'un Raspberry pi dédié aux présent Bruno Bogaert Bruno Bogaert Louis Billiet Sylvain Goulliart<br />
<strong>2013</strong>-05-31 16h45 M5-A7 69 Écosystème pour gestion d'emploi du temps heb Bruno Bogaert Bruno Bogaert Dhia Elhak Lakhal Sylvain Malfait<br />
<strong>2013</strong>-05-31 10h15 M5-A8 46 Intégration de Drone à une plateforme logicielle Gwenael Cattez Gwenael Cattez Ali Hedjaz Tony Tran<br />
<strong>2013</strong>-05-31 10h45 M5-A8 65 Moteur de scripts sous iOS Nicolas Haderer, Romain Romain Rouvoy Benjamin Digeon Florent David<br />
<strong>2013</strong>-05-31 11h15 M5-A8 66 Utiliser les téléphones mobiles pour l’estimation Nicolas Haderer Romain Rouvoy Julien Duribreux Justin Dufour<br />
<strong>2013</strong>-05-31 14h00 M5-A8 34 Interface de visualisation de molécules Maude Pupin Laurent Noé Antonia Ludunge<br />
<strong>2013</strong>-05-31 14h30 M5-A8 91 Pipeline d'analyse de régions de cassures Jean-Stéphane Varré Jean-Stéphane Varré Gauvain Marquet<br />
<strong>2013</strong>-05-31 15h00 M5-A8 30 Robot lego solveur de Sudoku Francesco De Comité Leopold Weinberg Oulamine Youssef El Achiqi Anas<br />
<strong>2013</strong>-05-31 16h15 M5-A8 37 Traitement semi-automatique des feuilles de pré Géry Casiez Géry Casiez Alexis Linke Maxence Gaudry<br />
<strong>2013</strong>-05-31 16h45 M5-A8 3 Conception d'un reseau social orienté vidéo Antoine Thomas Antoine Thomas Emmanuel Pede Thomas Besset<br />
<strong>2013</strong>-05-31 08h30 M5-A9 105 Suivi d'un capteur en 3D a l'aide d'une webcam Jean Rioult Sébastien Ambellouis Matthieu Fesselier Guillaume Huylebroeck<br />
<strong>2013</strong>-05-31 09h00 M5-A9 110 Algorithmes de placement en deux dimensions François Clautiaux François Clautiaux Romain Windels<br />
<strong>2013</strong>-05-31 10h15 M5-A9 70 Home Cloud Server Cedric Dumoulin Cedric Dumoulin Lison Gallos Arnaud Caulier<br />
<strong>2013</strong>-05-31 10h45 M5-A9 27 Framework de modélisation dans les Tablettes A Amine El Kouhen Cédric Dumoulin Malika Rakhaoui Fatou-Laye Mbaye<br />
<strong>2013</strong>-05-31 11h15 M5-A9 71 Etude de la spécification des représentations arb Cedric Dumoulin Cedric Dumoulin Adrien Burillon Thomas Camberlin<br />
<strong>2013</strong>-05-31 11h45 M5-A9 72 Generateur de GUI Android Cedric Dumoulin Cedric Dumoulin Gerard Paligot<br />
<strong>2013</strong>-05-31 14h00 M5-A9 33 Intégration du support multitouch dans Pharo Stéphane Ducasse Stéphane Ducasse Francois Lepan Benjamin V. Ryseghem<br />
<strong>2013</strong>-05-31 14h30 M5-A9 50 Interaction Kinect pour une application ludique Samuel Degrande Patricia Plenacoste Thomas Crepel Rémi Boens<br />
<strong>2013</strong>-05-31 15h00 M5-A9 108 Développement d'un plugin Eclipse de transform Martin Monperrus Benoit Cornu Amina El-Mekky Ouardia Maiz<br />
Monday 3 June<br />
Date Time Room Project Title Author Supervisor Student1 Student2<br />
<strong>2013</strong>-06-03 14h00 M5-A7 52 Recherche de candidats/jobs sans contact Nabil Djarallah, Nicolas H Nabil Djarallah Gens Maxime Camille Riquier<br />
<strong>2013</strong>-06-03 14h30 M5-A7 53 API de contrôle de drones volants Nabil Djarallah, Nicolas P Nicolas Petitprez Mohamed Ouannane Jeremy Diaz<br />
<strong>2013</strong>-06-03 15h00 M5-A7 54 Petites annonces en réalité augmentée Nabil Djarallah, Nicolas P Nicolas Petitprez Alexandre Raulin Yann Duval<br />
<strong>2013</strong>-06-03 16h15 M5-A7 118 Intégration du uPnP dans le serveur embarqué S Gilles Grimaud Gilles Grimaud Edouard Berton Nicolas Ryckembusch<br />
<strong>2013</strong>-06-03 16h45 M5-A7 119 Interface graphique en python pour la command Gilles Grimaud Gilles Grimaud Rabab Bouziane Narjes Jomaa<br />
<strong>2013</strong>-06-03 14h00 M5-A9 20 Capture de mouvement 3D avec une caméra Micr Hazem Wannous Hazem Wannous Derek Hendrickx Benjamin Makusa<br />
<strong>2013</strong>-06-03 14h30 M5-A9 107 Essayage 3D des lunettes virtuelles avec une ca Hazem Wannous Hazem Wannous Pierre Villoutreix Maxime Chaste<br />
<strong>2013</strong>-06-03 15h00 M5-A9 31 Robot lego machine de Turing Francesco De Comité Eric Wegrzynowski Matthieu Poudroux Ronan Dhellemmes<br />
<strong>2013</strong>-06-03 16h15 M5-A9 55 Extraction d'information textuelles multilingue à p Luigi Lancieri Luigi Lancieri Shichen Zhao Amira Kamli<br />
<strong>2013</strong>-06-03 16h45 M5-A9 56 Analyse du buzz sur twitter Luigi Lancieri Luigi Lancieri Florian Michiel Alessio Trunfio<br />
Tuesday 4 June<br />
Date Time Room Project Title Author Supervisor Student1 Student2<br />
<strong>2013</strong>-06-04 13h30 M5-A7 85 Plugin de visualisation 3D pour la consommation Romain Rouvoy Romain Rouvoy Aurore Allart Benjamin Ruytoor<br />
<strong>2013</strong>-06-04 14h00 M5-A7 109 Réseau de neurones artificiels pour reconnaissa Pierre Boulet Pierre Boulet Sanaa Mouatassim<br />
<strong>2013</strong>-06-04 14h30 M5-A7 10 IHM HTML5 pour un simulateur de marchés fina Yann Secq Philippe Mathieu Thomas Buisine Romain Belmonte<br />
<strong>2013</strong>-06-04 15h00 M5-A7 106 Mise en place d'une application vidéo sur la carte Jean-Luc Dekeyser Jean-Luc Dekeyser Quang-Tung Nguyen<br />
<strong>2013</strong>-06-04 15h30 M5-A7 117 Experimentation d'un codeur jpeg sur Homade : Rabie Ben Atitallah Jean-Luc Dekeyser Aurelien Bertiaux<br />
<strong>2013</strong>-06-04 16h15 M5-A7 102 Ecosystèmes virtuels et programmation 3D : spé Samuel Blanquart Samuel Blanquart Lois Arens Yoann Bouquet<br />
Session chairs<br />
Fabrice Aubert<br />
Gery Casiez<br />
Laurent Noé<br />
Mikael Salson
A few points<br />
Teaching:<br />
1. Bioinfo, Algo [1st semester]<br />
2. PDS, Networks, internship supervision (3), PJI my friends :-) [2nd semester]<br />
3. New curriculum<br />
Research:<br />
1. PEPS Sand (accepted), ANR BnB (um ... answer on the 17th), internship, recruitment, code (bugs), evaluations of all kinds ...<br />
2. Reviews (seeds again, lossless this time ...)<br />
3. →<br />
A few points on research<br />
1. Mappi: see Jenya's talk<br />
2. Peptide Matching: see Yoann's talk<br />
3. Seeds and Product (never-ending story) (draft)<br />
Spaced seed design on profile HMMs for precise HTS read-mapping<br />
efficient sliding window product on the matrix semi-group<br />
Laurent Noé<br />
May 28, <strong>2013</strong><br />
Abstract<br />
We propose a new method and its associated algorithm to efficiently compute seed sensitivity when<br />
High Throughput Sequencing reads are mapped along sub-parts of a known HMM alignment<br />
profile. This computation particularly makes sense with positioned spaced seeds. It combines<br />
automata theory (previous work [KNR06]) with a matrix product problem.<br />
Interestingly, it brings to light an interval product problem considered more than twenty years ago<br />
in [AS87], but here with a sliding window aspect: we propose an efficient algorithm to compute this<br />
sliding window set of products using a linear number of unit products on the matrix semi-group<br />
(associative, but non-commutative and non-invertible).<br />
This computational scheme is implemented in the ongoing 1.06 version of Iedera, which is available at<br />
http://bioinfo.lifl.fr/yass/iedera.php<br />
1 Introduction<br />
Spaced seed design remains an important but complex and challenging problem. Many papers have been<br />
devoted to this subject (mainly over the last decade), from the (at first counter-intuitive) idea that such seeds<br />
perform better [CR93, Buh02] and can be optimized [MTL02, BK01], to spaced seed sensitivity<br />
definition and computation [KLMT04], extended seed models and their computation [BBV05, Bro05,<br />
MGB06, CM07, YZ08, II09, KWS+11], and investigations of bounds and complexity [FCLCST05,<br />
NR08, MY09, EM11]. Several software tools are now publicly available to design spaced seeds [SB05, NGK10,<br />
IIMB11, DDDD+12, Nue11, MHKR12] (see footnote 1).<br />
High Throughput Sequencing (HTS) technologies have thrown a new light on the seed design process,<br />
because HTS reads are relatively short and quality-labelled. Some of the most sensitive<br />
algorithms to map such reads onto related genomes use spaced seeds (SHRiMP [RLD+09, DDL+11],<br />
ZOOM [LZZ+08], BFAST [HMN09], PerM [CSC09], LAST [KWS+11], SToRM [NGK10], ...).<br />
But most of the regular seeds designed within these tools assume that the mapped<br />
alignment profile remains “unknown”, and thus prefer an i.i.d. “randomly” generated profile. There are several<br />
(if not many) cases where this assumption can be dropped because a profile of what is searched [SB09]<br />
or filtered out is known (prior knowledge on the sequences being searched). However, an additional constraint comes<br />
from the fact that HTS reads are (most of the time) relatively short compared to the known profile and are<br />
thus aligned against any sub-profile extracted from the original profile.<br />
We thus propose in the main part of this paper an extended method to efficiently compute seed sensitivity<br />
or the lossless property when HTS reads are mapped on sub-profiles (overlapping windows) of<br />
a known HMM alignment profile, which is especially useful when designing positioned spaced seeds. This<br />
computation is known to rely on a dynamic programming algorithm applied to the automaton that<br />
recognizes the language matched by the seed, combined with the HMM model [KNR06]. Due to the<br />
sub-profile constraint, it also depends on a set of matrix products computed along overlapping intervals,<br />
which is the idea explored in this paper.<br />
1 Currently, more than one hundred references have been directly related to the spaced seeds problem, see for example<br />
http://www.lifl.fr/~noe/spaced_seeds.html<br />
The interval product problem has been considered in [AS87], where the authors provide an efficient solution<br />
in terms of preprocessing, in order to answer any query product with a given constant number of products.<br />
We consider this interval product problem with an incremental aspect, using a sliding window, and propose<br />
an efficient algorithm to compute it without preprocessing, using an amortized linear number of products<br />
on the associative, but non-commutative and non-invertible, matrix semi-group that stores the property being<br />
computed (probability, cost, score, ...), itself represented by a semi-ring.<br />
In part 2, we briefly recall the seed design principle, focusing on the seed sensitivity computation.<br />
We then state the (matrix) product problem in part 3, and propose a method to solve it. Finally, in part<br />
4, we give some measurements on a practical implementation included in the ongoing 1.06 version of Iedera<br />
(http://bioinfo.lifl.fr/yass/iedera.php), before concluding remarks in part 5.<br />
2 Seed design process<br />
Spaced seeds are now a frequently used hashing technique for biological sequence analysis. Their implementation<br />
(as a direct hashing method) is straightforward and brings higher sensitivity for the same theoretical<br />
selectivity compared to contiguous seeds of an equivalent weight. Interestingly, in practice, a slightly reduced<br />
computational cost can even be observed when using spaced seeds compared with contiguous seeds of the<br />
same weight.<br />
Spaced seeds have been generalized by several extended seed models (Vector seeds [BBV05], Indel<br />
seeds [MGB06], Subset seeds [KNR06, ZF07, YZ08], Neighbor seeds [CM07]). To increase the overall sensitivity,<br />
they can usually be designed jointly as multiple seeds [YWC+04, SB05], and (for example on quality-labelled<br />
sequences) as positioned seeds [LZZ+08, NGK10].<br />
In addition to the seed model, one needs a selection criterion for good seed shapes: this criterion is<br />
(almost always) established on a model of the alignments being matched (usually represented as words on<br />
a binary match/mismatch alphabet), itself weighted by a probabilistic/cost/score/... model (possibly any combination<br />
of such “semi-groups”). Here again, the initially proposed i.i.d. Bernoulli model [KLMT04]<br />
has been extended into Markov models [BKS05] and HMMs [BBV04], with several extensions set on their<br />
parametrization [MB07, CP10].<br />
In practice, the criterion considered to select good spaced seed shapes is “the probability to hit at least<br />
once” (sensitivity), or “the guarantee to always hit at least once” (lossless property). Such criteria can<br />
then be measured by a dynamic programming algorithm based on the decomposition of alignment word<br />
suffixes detected by the seed [KLMT04, BK03], or more directly on the regular language recognized by the<br />
seed, itself compiled into a deterministic finite automaton [BKS05, KNR06, HR08].<br />
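As a small illustration of the sensitivity criterion, the following sketch computes the hit probability of a single spaced seed by brute-force enumeration under an i.i.d. Bernoulli match model. It is not the dynamic programming or automaton-based method of the cited papers, and it is only practical for short alignments. The function names `seed_hits` and `sensitivity`, and the pattern notation ('1' for a required match, '*' for a don't-care position), are assumptions made for this example.

```python
from itertools import product


def seed_hits(seed, alignment):
    """Return True if the seed matches the 0/1 alignment at some position."""
    span = len(seed)
    for start in range(len(alignment) - span + 1):
        if all(alignment[start + k] == 1
               for k, symbol in enumerate(seed) if symbol == "1"):
            return True
    return False


def sensitivity(seed, length, p_match):
    """Probability that the seed hits a random alignment at least once."""
    total = 0.0
    # Enumerate every match/mismatch word of the given length.
    for alignment in product((0, 1), repeat=length):
        if seed_hits(seed, alignment):
            matches = sum(alignment)
            total += p_match ** matches * (1 - p_match) ** (length - matches)
    return total
```

For instance, `sensitivity("1*1", 3, 0.5)` sums the probabilities of the two alignments whose first and third positions are matches.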
3 Matrix product<br />
Given an automaton for the language recognized by the seed, and given a model (probabilistic/cost/score<br />
model) provided by a transducer, it is possible to compute properties (probabilities, costs, scores ...) of<br />
the initial language (see the illustrative example provided in Figure 1 for probabilities). In practice, the<br />
resulting matrices obtained from the model and the seed language are multiplied and/or powered; the<br />
computation “within matrices” is performed on “semi-rings” representing the properties. For example,<br />
language probabilities are computed on a classical semi-ring<br />
$(E = \mathbb{R}_{0 \le r \le 1},\ \oplus = +,\ \odot = \cdot,\ 0_\oplus = \epsilon_\odot = 0,\ 1_\odot = 1)$,<br />
whereas language costs are computed on a tropical semi-ring [Sim88, Pin98, MS09, Moh09]<br />
$(E = \mathbb{R},\ \oplus = \min,\ \odot = +,\ 0_\oplus = \epsilon_\odot = \infty,\ 1_\odot = 0)$<br />
(respectively $(E = \mathbb{R},\ \oplus = \max,\ \odot = +,\ 0_\oplus = \epsilon_\odot = -\infty,\ 1_\odot = 0)$ for scores).<br />
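To make the role of the semi-ring concrete, here is a minimal sketch of a matrix product in which the addition and multiplication are passed as parameters. The function `semiring_matmul`, the dictionaries `CLASSICAL` and `TROPICAL_MIN`, and the two small matrices are hypothetical and only serve the illustration; they are not taken from Iedera.

```python
import math


def semiring_matmul(A, B, add, mul, zero):
    """Multiply matrices A and B, with (add, mul) replacing (+, *)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[zero] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = zero
            for k in range(inner):
                acc = add(acc, mul(A[r][k], B[k][c]))
            C[r][c] = acc
    return C


# Classical semi-ring: probabilities are added and multiplied.
CLASSICAL = {"add": lambda x, y: x + y, "mul": lambda x, y: x * y, "zero": 0.0}
# Tropical (min, +) semi-ring: costs are minimised and added.
TROPICAL_MIN = {"add": min, "mul": lambda x, y: x + y, "zero": math.inf}

P = [[0.5, 0.5], [0.0, 1.0]]        # hypothetical transition probabilities
K = [[0.0, 1.0], [math.inf, 0.0]]   # hypothetical transition costs
P2 = semiring_matmul(P, P, **CLASSICAL)
K2 = semiring_matmul(K, K, **TROPICAL_MIN)
```

The same loop serves both frameworks: only the three semi-ring parameters change between the probability computation and the cost computation.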
In practice, for a set of seeds (and in general for any regular expression), the same algorithm [KNR06,<br />
MHKR12] can be applied on both classical and tropical semi-rings: it computes, for example, either the<br />
seed sensitivity on the classical semi-ring for what is commonly named the lossy seed design framework,<br />
or the minimal cost and thus the lossless property on the tropical semi-ring for the lossless seed<br />
design framework [NGK10]. Note also that it can be adapted to a score framework, provided that the<br />
problem is clearly defined (e.g. [KNP04]).<br />
Figure 1: Product of the seed 1*1 automaton with an ad hoc probabilistic model (seed automaton states q1 to q5, model states p1 and p2)<br />
In the lossy framework, HMMs are frequently used in biological sequence and alignment representation<br />
(for example as profile HMMs [Edd98]) (see footnote 2). They can thus easily be applied to seed sensitivity computation<br />
[BBV04, KNR06, HR08]: they give a set of probabilities (emission probabilities for each state, together<br />
with transition probabilities between states) that are computed out of a profile alignment. But when such<br />
HMMs have to be used with HTS reads to design seeds, one must face a new problem: the read can be<br />
any sub-part of the HMM (HMM local alignment), and thus the computation<br />
may start at any “position” on the alignment HMM. Seed design is then more challenging,<br />
since one needs to know precisely the hit probability of a set of (positioned) seeds for each window<br />
along the HMM.<br />
2 Notice also that Position Weight Matrices (PWM) with indels, such as the ones used for example in Prosite, can be seen as a<br />
rough equivalent of the profile HMM in the tropical semi-ring...<br />
3.1 Sliding window product<br />
Such a computation, translated into matrix form, requires computing, for a list of (non-invertible) matrices<br />
$M_0, M_1, M_2, \ldots, M_{n-1}$, a set of products in one of the two following forms:<br />
Problem. Compute<br />
$\prod_{u=i}^{i+w} M_u \quad \forall i \in [0..n-w-1] \qquad (1)$<br />
where $w$ is the length of the read, or more generally:<br />
Problem. Compute<br />
$\prod_{u=i(t)}^{j(t)} M_u \quad \forall t$ with $0 \le i(t) \le j(t)$ [...] k).<br />
Maintaining such matrices Uk for k ∈ [i..j] costs at most (in amortized analysis) one product per j-increase<br />
(see Appendix 7.2). Note that increasing i simply deletes the last Ui and thus does not cost any<br />
additional product on the Uk’s. A pseudo-code of the add right process (increment of j) is provided in<br />
Algorithm 1.<br />
Figure 2: Uk matrices: example when i = 0 and j = 24<br />
Even when no previous computation is kept, it is directly possible to compute the Mi..j<br />
product, as Mi × Mi+1 × ··· × Mj for any i, j (j > i), in O(log(j − i)) products using the updated Uk set of<br />
matrices for k ∈ [i..j] (see Appendix 7.1).<br />
But if the product is computed when i and j follow two monotonically increasing functions (each step<br />
adds +0 or +1), the<br />
number of products can be reduced to (amortized) constants for each i and j step-move (or for both moves).<br />
3.3.2 Middle m definition and Mi..j product update<br />
Figure 3: Uk matrices: example when i = 1 and j = 24 (the middle is m = 16)<br />
To split the computation when only i or j is moved, we need to define here the middle m of i and j.<br />
It is defined as the beginning position of the maximal (in size) U-block included in the interval i..j. If two<br />
equal-size maximal blocks are between i and j, we choose m as the one that is the most factorized by two,<br />
which corresponds (see footnote 3) to the beginning of the right maximal block (see Figure 3). This middle border makes it possible<br />
to split the computation in two parts when needed, which we will call left (colored in green in Figure 3) and<br />
right (red in Figure 3). Note that $m < \frac{1}{3} i + \frac{2}{3} j$. Note also that when there is only one maximal sized<br />
block, then $m < \frac{1}{2} i + \frac{1}{2} j$, and when there are two maximal sized blocks, then $m > \frac{2}{3} i + \frac{1}{3} j$.<br />
3 proof: the other choice would imply that the two maximal left and right blocks would be merged, which contradicts<br />
the “maximality” of the left block; thus only the right block can be increased in size. To conclude: for two contiguous blocks of<br />
equal size, the right block is at least one more power of two factorizable than the left block.<br />
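The definition of the middle can be sketched as follows, under the assumption (suggested by the Figure 3 example, where i = 1 and j = 24 give m = 16) that a U-block of size s is an aligned block starting at a multiple of s. The function name `middle` is hypothetical, and this direct search is not the `update` routine used later in Algorithm 2, which the text does not describe.

```python
def middle(i, j):
    """Start of the largest aligned power-of-two block inside [i..j].

    Assumption: a U-block of size s starts at a multiple of s. When two
    blocks of the maximal size fit, the right-most one is returned.
    """
    size = 1
    while size * 2 <= j - i + 1:
        size *= 2
    while size >= 1:
        first = -(-i // size) * size          # smallest multiple of size >= i
        last = ((j + 1) // size - 1) * size   # largest start with last+size-1 <= j
        if first <= last:
            return last
        size //= 2
    raise ValueError("empty interval")
```

With i = 9, the value returned for j = 18 is 12 and the value returned for j = 19 is 16, which matches the jump from odd×2^p to (odd+1)×2^p discussed for the j-increase case.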
Algorithm 1: add right : increments the right border j, and updates the set Ui..j using Mj<br />
Input:<br />
• M0,M1,M2,...,Mn−1 : original matrices.<br />
Global:<br />
• i,j : integers,<br />
• Ui,...,Uj : original and updated set of matrices.<br />
Local:<br />
• u,t,told : integers.<br />
/* a) only before the first increment */<br />
if j = 0 then<br />
U0 ← M0;<br />
/* b) increment j */<br />
inc(j);<br />
/* c) and process the subset of Uj−t matrices that have to be updated */<br />
Uj ← Mj;<br />
u ← j + 1; told ← 0; t ← 1;<br />
while u is even and j − t ≥ i do<br />
Uj−t ← Uj−t × Uj−told;<br />
told ← t; t ← 2·t + 1; u ← u/2;<br />
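The following runnable sketch mirrors Algorithm 1 on a generic associative product. Several choices here are assumptions made for the illustration: the class name `BlockProducts`, starting with j = −1 instead of special-casing j = 0, the `remove_left` helper, and string concatenation as a stand-in for a non-commutative, non-invertible matrix product. The `range_product` method is one possible greedy reading of the O(log(j − i)) query mentioned in the text, not the content of Appendix 7.1.

```python
class BlockProducts:
    """Aligned power-of-two block products U[k] over an associative product."""

    def __init__(self, items, mul):
        self.items = items    # M_0 .. M_{n-1}
        self.mul = mul        # associative, not assumed commutative
        self.i = 0            # left border
        self.j = -1           # right border (-1: nothing added yet)
        self.U = {}           # U[k]: product of the complete block starting at k
        self.products = 0     # unit products spent on maintaining U

    def add_right(self):
        """Increment j and merge the blocks completed by M_j (Algorithm 1)."""
        self.j += 1
        j = self.j
        self.U[j] = self.items[j]
        u, t_old, t = j + 1, 0, 1
        while u % 2 == 0 and j - t >= self.i:
            self.U[j - t] = self.mul(self.U[j - t], self.U[j - t_old])
            self.products += 1
            t_old, t, u = t, 2 * t + 1, u // 2

    def remove_left(self):
        """Increment i; the block starting at the old i is simply dropped."""
        self.U.pop(self.i, None)
        self.i += 1

    def block_size(self, k):
        """Size of the complete block currently stored in U[k]."""
        low = k & -k if k else 1 << (self.j + 1).bit_length()
        size = 1
        while size * 2 <= low and k + size * 2 - 1 <= self.j:
            size *= 2
        return size

    def range_product(self):
        """Product M_i x ... x M_j, hopping from block to block (needs i <= j)."""
        k = self.i
        result = self.U[k]
        k += self.block_size(k)
        while k <= self.j:
            result = self.mul(result, self.U[k])
            k += self.block_size(k)
        return result
```

A sliding window of width w can then be simulated by calling `add_right()` w times, and afterwards alternating `add_right()` and `remove_left()` while reading `range_product()` for each window.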
In the next part, we will compute Mi..m−1 and Mm..j as two separate parts, considering first the case when<br />
m is fixed, and then two cases when m is increased.<br />
middle unchanged: if we suppose that the middle m does not change during a computational step,<br />
the following can be observed:<br />
• when j is increased (so that j = jold + 1), updating the product Mm..j can be done with one product,<br />
considering that we keep the previous computation Mm..jold. Thus, considering that we also update<br />
the Uk’s values at the same time, an amortized single product must be added (Amortization on j: see<br />
Appendix 7.2). Joining Mi..m−1 with Mm..j will then cost one extra product, giving a total number of<br />
products of three.<br />
• when i is increased (i = iold + 1), the previous computation Miold..m−1 does not help and can be erased<br />
here. However, if we suppose that we keep all the previously computed products Mk..m−1 in a stack for<br />
all the blocks Uk visited before, reusing and updating this part can be done with one single amortized<br />
product (Amortization on m: see Appendix 7.3). Joining Mi..m−1 and Mm..j will then cost one extra<br />
product, giving a total number of products of two.<br />
At first glance, a {cost(i) ≤ i+m; cost(j) ≤ 3j} cost is applied when m does not change. Otherwise<br />
this computation has to be updated and this will be considered in the next part:<br />
middle changed: if we suppose that the middle m does change, then the previous computation cut<br />
in two parts Mi..m−1 and Mm..j is somehow “compromised”. Let’s now see when m changes, and moreover,<br />
why:<br />
• when m changes due to a j-increase: as m follows the beginning of the largest right-most Uk block, j can<br />
either increase the maximal block size by a factor of two without changing m (case handled before, corresponding<br />
to one single maximal block), or make m jump to the next power of two “potential” block,<br />
thus from $m_{old} = odd \times 2^p$ to $m = (odd+1) \times 2^p = \frac{odd+1}{2} \times 2^{p+1}$ (case not handled so far, which corresponds<br />
to two maximal blocks of equal size, the right-most being now the “m one”). This last case has no<br />
consequence on the product Mm..j, which is immediately computed by the update of the Uk’s values, as<br />
Mm..j corresponds to the right-most maximal block in Uk: it thus costs one single product here (and<br />
not two as shown before).<br />
However, moving m will obviously compromise the left stack of Mk..mold−1 previous computations, which<br />
will no longer help the computation of the next Mi..mold−1 on the next i-increase, since mold is now<br />
pushed to the next power of two m; this stack can be erased. This cost can however be bounded by $\log_2(\frac{\Delta m}{2})$,<br />
where ∆m represents the m increase (see Appendix 7.4).<br />
At the end, joining Mi..m−1 with Mm..j will cost one extra product.<br />
Using an amortization on m and j and combining the two j-increase cases (when m does change, or<br />
not) gives a cost(j) ≤ $3j + \frac{1}{8} m$ (see Appendix 7.4).<br />
• when i is increased so that i > m (thus i = m+1), m (which corresponds to the largest right-most block)<br />
can only “jump” to a next block of smaller size: the cost on the left stack [i..m − 1] is already<br />
paid, as it corresponds to a “legal” move of i that is amortized by one product as seen previously<br />
(Amortization on m, see Appendix 7.3).<br />
However, moving m will obviously compromise the right computation of Mmold..j, since mold is now<br />
pushed to the next (smaller) block; this product can be erased and recomputed. This cost can however be<br />
bounded by $\log_2(\frac{\Delta m}{2})$, where ∆m represents the m increase (see Appendix 7.5).<br />
At the end, joining Mi..m−1 and Mm..j will cost one extra product.<br />
Using an amortization on m and i and combining the two i-increase cases (when m does change, or<br />
not) gives a cost(i) ≤ $i + \frac{9}{8} m$ (see Appendix 7.5).<br />
To conclude, a {cost(i) ≤ $i + \frac{9}{8} m$; cost(j) ≤ $3j + \frac{1}{8} m$} cost is applied.<br />
3.3.3 Pseudocode<br />
To illustrate the previously described computation, an associated pseudocode is given in Algorithm 2. The<br />
proposed algorithm returns the Mi..j product (still defined as Mi × Mi+1 × ···× Mj). It can only be<br />
applied once the Ui..j matrices have been updated by Algorithm 1. The main global data structure used in<br />
Algorithm 2 is a stack of matrices left products to m stack that keeps a set of products Mk..m−1 (where k<br />
is i ≤ k pairs,<br />
• right product from m : < matrix,int > pair.<br />
Local:<br />
• Pleft,Pright : matrices,<br />
• kleft,kright : integers.<br />
Result:<br />
• the product Mi..j<br />
/* a) update m (update algorithm not described here) */<br />
mold ← m ; m ← update(m,i,j);<br />
/* a.1) reset all global variables when m changes */<br />
if mold ≠ m then<br />
left products to m stack ←∅;<br />
right product from m ← ;<br />
/* b) update the left stack products */<br />
/* b.1) remove stacked products that are not useful */<br />
while left products to m stack ≠ ∅ and top(left products to m stack).int < i do<br />
pop(left products to m stack);<br />
if left products to m stack ≠ ∅ then<br />
← top(left products to m stack);<br />
else<br />
← ;<br />
/* b.2) and compute / stack left products from i to m */<br />
while kleft >ido<br />
kleft ← kleft −size of block before(kleft,i);<br />
Pleft ← Ukleft ×Pleft;<br />
push(left products to m stack,< Pleft,kleft >);<br />
DRAFT<br />
/* c) compute the right product from m to j */<br />
← right product from m;<br />
while kright
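<br />
As a rough illustration of the two structures used above (a stack of left products ending at m, and a running right product starting at m), the following Python sketch maintains a sliding window product.<br />
It is a simplified stand-in and not Algorithm 2 itself: single matrices play the role of the Uk blocks, the pivot m is simply reset to j + 1 once i passes it, and the names SlidingProduct, matmul and identity are hypothetical.<br />

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[int(r == c) for c in range(n)] for r in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


class SlidingProduct:
    """Product M[i] x ... x M[j] for window bounds that never decrease."""

    def __init__(self, matrices):
        self.M = matrices
        self.n = len(matrices[0])
        self.m = 0                     # pivot between the left and right parts
        self.left_stack = []           # pairs (M[k] x ... x M[m-1], k), smallest k on top
        self.right = identity(self.n)  # M[m] x ... x M[kright-1]
        self.kright = 0

    def query(self, i, j):
        if i > self.m:
            # The right product would contain matrices before i: move the pivot
            # past j and rebuild every left product M[k..m-1] for k = j down to i.
            self.m = j + 1
            self.left_stack = []
            p = identity(self.n)
            for k in range(j, i - 1, -1):
                p = matmul(self.M[k], p)
                self.left_stack.append((p, k))
            self.right, self.kright = identity(self.n), self.m
        # b.1) drop stacked products that start before i
        while self.left_stack and self.left_stack[-1][1] < i:
            self.left_stack.pop()
        p_left = self.left_stack[-1][0] if self.left_stack else identity(self.n)
        # c) extend the right product from m up to j
        while self.kright <= j:
            self.right = matmul(self.right, self.M[self.kright])
            self.kright += 1
        # final join of the left and right parts: one extra product
        return matmul(p_left, self.right)
```

In this simplified form each matrix enters at most one left product and one right product, so the work per move stays bounded on average; the block-based version described in the text reduces the number of products further.<br />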
4 Experiments on seed design<br />
The previous algorithm has been implemented and tested in Iedera, where it can now be activated with the<br />
-ll option (see http://bioinfo.lifl.fr/yass/iedera.php).<br />
We designed spaced seeds on reads using an alignment model obtained from a profile HMM. On a<br />
typical example, we used a read/window length of 100 (respectively 200), which corresponds to an observed<br />
current Illumina single read length (respectively two merged reads), and a simplified (footnote 4) profile HMM alignment<br />
of size 1605 (from a 16S rRNA database). The number of products required for the full computation of the<br />
1605 − 100 + 1 = 1506 windows of length 100 was 5931 (footnote 5) (respectively 5720 products for the 1605 − 200 + 1 = 1406<br />
windows of length 200), to be compared with the number of products for the naive range algorithm,<br />
≈ 150000 (respectively ≈ 300000).<br />
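<br />
The orders of magnitude quoted above can be rechecked with a few lines of Python.<br />
The sketch assumes that the naive range algorithm performs about one product per position of each window, which is our reading of the ≈ 150000 and ≈ 300000 figures; the function name product_counts is hypothetical.<br />

```python
def product_counts(hmm_size, window, measured_products):
    """Return (number of windows, naive product count, measured products per window)."""
    n_windows = hmm_size - window + 1
    # Assumption: the naive range algorithm costs `window` products per window.
    naive_products = n_windows * window
    return n_windows, naive_products, measured_products / n_windows


for window, measured in ((100, 5931), (200, 5720)):
    n_windows, naive, per_window = product_counts(1605, window, measured)
    print(window, n_windows, naive, round(per_window, 2))
```

Under this assumption the naive counts are 150600 and 281200, while the measured counts correspond to roughly four products per window in both settings.<br />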
However, it must be noticed that the products required by the naive algorithm are less time consuming<br />
(matrix × vector products) than those of our algorithm (matrix × matrix products). We thus compared the<br />
execution time of both approaches under the conditions proposed above, for window lengths of both 100 and<br />
200. We conducted two experiments, one on spaced seeds, and the other on positioned spaced seeds. For<br />
the first experiment, the seeds were set at every position along the HMM, and the sensitivity was computed<br />
on all the windows along the HMM. For the second experiment, we additionally fixed the number of<br />
positions along the HMM (10, 20, 40, 80, 160, 320, 640, 1280) where seeds were set: seed positions were drawn,<br />
and the sensitivity was again computed on all the windows along the HMM.<br />
For both spaced seeds and positioned spaced seeds, we have chosen seeds of weight w ranging from 8 to<br />
12 (Figures 4 and 5: x-axis bottom label), span s ranging from w to 2×w (Figures 4 and 5: x-axis top label),<br />
and, for each pair (w,s), we have computed the sensitivity on 100 seeds and measured the time elapsed<br />
(Figures 4 and 5: y-axis label). Note that the set of seeds (respectively the set of positions for each seed)<br />
was identical for both methods being evaluated. The computation was carried out exclusively on an HP Z800<br />
computer (Intel(R) Xeon(R) CPU E5620 @ 2.40GHz) with 20 GB of RAM (in practice, not more than 20% of<br />
the RAM was used), using a single thread.<br />
The results obtained are illustrated in Figure 4 for one single seed, and in Figure 5 for a set of two<br />
seeds: they show a substantial improvement in almost all cases considered in the experiments.<br />
A twofold speedup is observed on the most time consuming problems: this happens for seeds of the<br />
largest span in the set. This is the worst case, in the sense that large and dense matrices are produced. In<br />
practice, for seeds of reasonable span/weight ratio (e.g. ≤ 1.8), the proposed algorithm is at least four times<br />
faster than the naive algorithm on non positioned seeds. The practical speedup for positioned seeds is less obvious on<br />
middle span seeds, but appears to increase if the seeds are of small or very large span, and when the number of<br />
positions increases. Finally, it must be noticed that on non positioned seeds, increasing the window length<br />
from 100 to 200 has a strong impact on the overall performance.<br />
5 Concluding remarks<br />
First, it is very likely that the bounds proposed at the end of Section 3 could be improved by a more precise<br />
analysis. However, going under a bound of 3.0 per move while computing, for each move, the window product<br />
is unlikely (at least without any initial amortized cost), since we have found at least one example such that<br />
the amortized number of products is $\frac{9253}{3081} \approx 3.0032457$ per move (footnote 6).<br />
Moreover, when j is increased by “runs” while i is fixed, the proposed algorithm can be enhanced with a<br />
greedy computation of the Mi..j product (that can be done quickly provided that i is fixed for a while). In<br />
practice, this implementation never requires more products than the proposed one, but<br />
it still has to be carefully analyzed.<br />
4 Only matching states are kept: insertion and deletion states are removed, but we keep track of transitions between matching<br />
↔ indel states to generate some break between contiguous blocks of matches. We thus keep the indel even without its length,<br />
and without any letter it may add here, since they are not supposed to match any seed.<br />
5 It must be noticed that each window needs a displacement both on i and j.<br />
6 1,2,−1,3..24,−2,−3,25..51,−4,52..72,−5,73..392,−6,393..441,−7,−8,442..577,−9,578..3071 where i-moves are given<br />
with a minus notation.<br />
Figure 4: Iedera speed improvement for one seed. Panels: positioned seeds and non positioned seeds, for window lengths 100 and 200. x-axis: weight (8 to 12, bottom) and span (top); y-axis: time (seconds). Legend: naive range algorithm, proposed algorithm, proposed algorithm slower, naive range algorithm faster; × 2 marks.<br />
Figure 5: Iedera speed improvement for two seeds. Panels: positioned seeds and non positioned seeds (two seeds), for window lengths 100 and 200. x-axis: weight (8 to 12, bottom) and span (top); y-axis: time (seconds). Legend: naive range algorithm, proposed algorithm, proposed algorithm slower, naive range algorithm faster; × 2 marks.<br />
Finally, it is also very likely that the “binary block division” algorithm may be modified and analyzed to<br />
get an even better bound. For example, some sorting algorithms such as Smoothsort [Dij82] use Leonardo<br />
numbers (closely related to Fibonacci numbers) to partition data, rather than following the “binary tree” of the classical Heap Sort. Another<br />
interesting point of view is to consider the size of each matrix (here unknown most of the time, and unfortunately<br />
square on trivial cases), in order to combine the sliding window problem with the classical chain matrix<br />
multiplication problem for the overall computation.<br />
From a more practical point of view, matrix products used within the algorithm, when applied on<br />
sparse/non sparse matrices, cannot be considered as a “constant” operation, but more likely as a “function<br />
of the sparsity”. However, such an implementation needs to know this “sparsity cost” for all the possible<br />
products, which, in practice on unknown automata, is not predictable: estimating it amounts to simulating the<br />
product, and thus costs as much as the product itself. In Iedera, we have chosen to represent<br />
matrices (and the associated product) with two possibilities: the algorithm chooses, for each matrix<br />
row, a sparse implementation if less than 20% of the cells are present, or a dense implementation otherwise.<br />
However, it is still possible here to get very high costs with the full matrix product: an alternative solution<br />
would be to combine elements of the naive range product with sub-computations from the proposed<br />
algorithm if such cases appear.<br />
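<br />
The per-row choice between a sparse and a dense representation can be sketched as follows in Python.<br />
This is only an illustration of the 20% rule stated above, not the Iedera code: the names pack_row and row_dot are hypothetical, and treating a zero cell as "absent" is an assumption.<br />

```python
SPARSE_THRESHOLD = 0.20  # fraction of present cells below which a row is stored sparsely


def pack_row(row):
    """Store a row as a {column: value} dict when it is sparse, as a list otherwise."""
    present = {col: val for col, val in enumerate(row) if val != 0}
    if len(present) < SPARSE_THRESHOLD * len(row):
        return ("sparse", present)
    return ("dense", list(row))


def row_dot(packed_row, vector):
    """Dot product of a packed row with a dense vector, whatever the storage."""
    kind, data = packed_row
    if kind == "sparse":
        # only the present cells are visited
        return sum(val * vector[col] for col, val in data.items())
    return sum(val * vector[col] for col, val in enumerate(data))
```

With such a layout, the cost of a product depends on the number of present cells of each row rather than on the matrix dimension alone, which is the "function of the sparsity" behaviour discussed above.<br />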
Finally, a last aspect that can be taken into account is to parallelize the block product carefully, since it<br />
heavily depends on separate calculations for the same window being considered. The naive algorithm<br />
is in fact more difficult to parallelize efficiently within each window, for at least two reasons: first, there<br />
is a flow dependency between the set of products; worse, within each product, synchronization is needed<br />
when accessing the post computation vector, unless one reverses the computation by considering it<br />
first, which implies reversing the matrix cell access from row-first to column-first.<br />
6 Acknowledgments<br />
This research was supported by the ANR project MAPPI (ANR-2010-COSI-004-02), LIFL (UMR CNRS<br />
8022 Université de Lille 1) and Inria Lille Nord-Europe. Project MAPPI is associated with the Tara Oceans<br />
expedition where the principal tasks involve the development of new software for mapping and assembling<br />
metagenomic and metatranscriptomic data.<br />
References<br />
[AS87] Noga Alon and Baruch Schieber. Optimal preprocessing for answering on-line product queries. Technical<br />
Report TR 71/87, Inst. of Comp. Science, Tel-Aviv Univ., 1987.<br />
[BBV04] Broňa Brejová, Daniel G. Brown, and Tomáš Vinař. Optimal spaced seeds for homologous coding<br />
regions. Journal of Bioinformatics and Computational Biology, 1(4):595–610, Jan 2004. (earlier version<br />
in CPM 2003). URL: http://www.worldscinet.com/jbcb/01/0104/S0219720004000326.html,<br />
doi:10.1142/S0219720004000326.<br />
[BBV05] Broňa Brejová, Daniel G. Brown, and Tomáš Vinař. Vector seeds: An extension to spaced seeds. Journal<br />
of Computer and System Sciences, 70(3):364–380, 2005. (earlier version in WABI 2003). URL:<br />
http://linkinghub.elsevier.com/retrieve/pii/S0022000004001527, doi:10.1016/j.jcss.2004.12.008.<br />
[BK01] Stefan Burkhardt and Juha Kärkkäinen. Better filtering with gapped q-grams. In Proceedings of<br />
the 12th Symposium on Combinatorial Pattern Matching (CPM), volume 2089 of Lecture Notes in<br />
Computer Science, pages 73–85. Springer, July 2001. URL:<br />
http://www.springerlink.com/content/gykw51mpjqnwrmqx, doi:10.1007/3-540-48194-X_6.<br />
[BK03] Stefan Burkhardt and Juha Kärkkäinen. Better filtering with gapped q-grams. Fundamenta Informaticae,<br />
56(1-2):51–70, 2003. Preliminary version in Combinatorial Pattern Matching 2001. URL:<br />
http://iospress.metapress.com/content/8ad9p3mqeday8vt5.<br />
[BKS05] Jeremy Buhler, Uri Keich, and Yanni Sun. Designing seeds for similarity search in genomic DNA.<br />
Journal of Computer and System Sciences, 70(3):342–363, 2005. (earlier version in RECOMB 2003).<br />
URL: http://linkinghub.elsevier.com/retrieve/pii/S0022000004001515, doi:10.1016/j.jcss.2004.12.003.<br />
[Bro05] Daniel G. Brown. Optimizing multiple seeds for protein homology search. IEEE/ACM Transactions<br />
on Computational Biology and Bioinformatics (TCBB), 2(1):29–38, january 2005. (earlier version in<br />
WABI 2004). URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsparnumber=1416848, doi:<br />
10.1109/tcbb.2005.13.<br />
[Buh02] Jeremy Buhler. Provably sensitive indexing strategies for biosequence similarity search. In RECOMB,<br />
Washington, DC (USA), pages 90–99. ACM Press, April 2002. URL:<br />
http://doi.acm.org/10.1145/565196.565208, doi:10.1145/565196.565208.<br />
[CM07] Miklós Csűrös and Bin Ma. Rapid homology search with neighbor seeds. Algorithmica, 48(2):187–<br />
202, Jun. 2007. (earlier version in COCOON 2005). URL:<br />
http://www.springerlink.com/content/45446712u14n0416, doi:10.1007/s00453-007-0062-y.<br />
[CP10] Won-Hyoung Chung and Seong-Bae Park. Hit integration for identifying optimal spaced seeds. BMC<br />
Bioinformatics - Selected articles from the 8th Asia-Pacific Bioinformatics Conference (APBC), 18-21<br />
january, Bangalore, India, 11(Suppl 1):S37, 2010. URL:<br />
http://www.biomedcentral.com/1471-2105/11/S1/S37, doi:10.1186/1471-2105-11-S1-S37.<br />
[CR93] Andrea Califano and Isidore Rigoutsos. Flash: A fast look-up algorithm for string homology. In<br />
Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB),<br />
pages 56–64, July 1993.<br />
[CSC09] Yangho Chen, Tate Souaiaia, and Ting Chen. PerM: efficient mapping of short sequencing reads<br />
with periodic full sensitive spaced seeds. Bioinformatics, 25(19):2514–2521, 2009. URL:<br />
http://bioinformatics.oxfordjournals.org/content/25/19/2514, doi:10.1093/bioinformatics/btp486.<br />
[DDDD+12] Dong Do Duc, Huy Q. Dinh, Thanh Hai Dang, Kris Laukens, and Xuan Huan Hoang. AcoSeeD: An<br />
ant colony optimization for finding optimal spaced seeds in biological sequence search. In Proceedings of<br />
the 8th International Conference on Swarm Intelligence (ANTS), Brussels (Belgium), volume 7461 of<br />
Lecture Notes in Computer Science, pages 204–211. Springer, 2012. URL:<br />
http://www.springerlink.com/content/n1476j612302410k/, doi:10.1007/978-3-642-32650-9_19.<br />
[DDL+11] Matei David, Misko Dzamba, Dan Lister, Lucian Ilie, and Michael Brudno. SHRiMP2: Sensitive yet<br />
practical short read mapping. Bioinformatics, 2011. doi:10.1093/bioinformatics/btr046.<br />
[Dij82] Edsger W. Dijkstra. Smoothsort, an alternative to sorting in situ. Sci. Comp. Progr., 1:223–233, 1982.<br />
[Edd98] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998.<br />
doi:10.1093/bioinformatics/14.9.755.<br />
[EM11] Lavinia Egidi and Giovanni Manzini. Spaced seeds design using perfect rulers. In Proceedings of the<br />
18th International Symposium on String Processing and Information Retrieval (SPIRE), Pisa (Italy),<br />
volume 7024 of Lecture Notes in Computer Science, pages 32–43. Springer, 2011. URL:<br />
http://www.springerlink.com/content/c18m78j1214h7k21/, doi:10.1007/978-3-642-24583-1_5.<br />
[FCLCST05] Martin Farach-Colton, Gad M. Landau, Süleyman Cenk Sahinalp, and Dekel Tsur. Optimal spaced<br />
seeds for faster approximate string matching. In Proceedings of the 32nd International Colloquium<br />
on Automata, Languages and Programming (ICALP’05), Lisboa (Portugal), volume 3580 of Lecture<br />
Notes in Computer Science, pages 1251–1262. Springer, 2005. URL:<br />
http://www.springerlink.com/content/815pej6c1kc09upj, doi:10.1007/11523468_101.<br />
[HMN09] Nils Homer, Barry Merriman, and Stanley F. Nelson. BFAST: An alignment tool for large scale genome<br />
resequencing. PLoS One, 4(11):e7767, 2009. doi:10.1371/journal.pone.0007767.<br />
[HR08] Inke Herms and Sven Rahmann. Computing alignment seed sensitivity with probabilistic arithmetic<br />
automata. In Proceedings of the 8th International Workshop on Algorithms in Bioinformatics<br />
(WABI), Karlsruhe (Germany), volume 5251 of Lecture Notes in Bioinformatics, pages 318–<br />
329. Springer, Sept. 2008. URL: http://www.springerlink.com/content/e8w1g39288144l56,<br />
doi:10.1007/978-3-540-87361-7_27.<br />
[II09] Lucian Ilie and Silvana Ilie. Fast computation of neighbor seeds. Bioinformatics, 25(6):822–<br />
823, 2009. URL: http://bioinformatics.oxfordjournals.org/content/25/6/822, doi:10.1093/<br />
bioinformatics/btp054.<br />
[IIMB11] Lucian Ilie, Silvana Ilie, and Anahita Mansouri Bigvand. SpEED: fast computation of sensitive spaced<br />
seeds. Bioinformatics, 2011. doi:10.1093/bioinformatics/btr368.<br />
[KLMT04] Uri Keich, Ming Li, Bin Ma, and John Tromp. On spaced seeds for similarity search. Discrete Applied<br />
Mathematics, 138(3):253–263, 2004. (preliminary version in 2002). doi:10.1016/S0166-218X(03)00382-2.<br />
[KNP04] Gregory Kucherov, Laurent Noé, and Yann Ponty. Estimating seed sensitivity on homogeneous<br />
alignments. In Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering<br />
(BIBE), May 19-21, 2004, Taichung (Taiwan), pages 387–394. IEEE Computer Society Press, April<br />
2004. URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsparnumber=1317369, arXiv:cs.OH/0603106,<br />
doi:10.1109/BIBE.2004.1317369.<br />
[KNR06] Gregory Kucherov, Laurent Noé, and Mikhail A. Roytberg. A unifying framework for seed sensitivity<br />
and its application to subset seeds. Journal of Bioinformatics and Computational Biology, 4(2):553–<br />
569, November 2006. URL: http://www.worldscinet.com/jbcb/04/0402/S0219720006001977.html,<br />
arXiv:cs.DS/0601116, doi:10.1142/S0219720006001977.<br />
[KWS+11] Szymon M. Kiełbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C. Frith. Adaptive seeds<br />
tame genomic sequence comparison. Genome Research, 21(3):487–493, 2011. URL:<br />
http://genome.cshlp.org/content/21/3/487, doi:10.1101/gr.113985.110.<br />
[LZZ+08] Hao Lin, Zefeng Zhang, Michael Q. Zhang, Bin Ma, and Ming Li. ZOOM! Zillions Of Oligos<br />
Mapped. Bioinformatics, 24(21):2431–2437, 2008. URL:<br />
http://bioinformatics.oxfordjournals.org/content/24/21/2431, doi:10.1093/bioinformatics/btn416.<br />
[MB07] Denise Y.F. Mak and Gary Benson. All hits all the time: parameter free calculation of seed sensitivity.<br />
In D. Sankoff, L. Wang, and F. Chin, editors, Proceedings of the 5th Asia Pacific Bioinformatics<br />
Conference (APBC), volume 5 of Advances in Bioinformatics and Computational Biology, pages 327–<br />
340. Imperial College Press, 2007. URL:<br />
http://eproceedings.worldscinet.com/9781860947995/9781860947995_0035.html, doi:10.1142/9781860947995_0035.<br />
[MGB06] Denise Y.F. Mak, Yevgeniy Gelfand, and Gary Benson. Indel seeds for homology search. Bioinformatics,<br />
22(14):e341–e349, 2006. URL: http://bioinformatics.oxfordjournals.org/content/22/14/e341,<br />
doi:10.1093/bioinformatics/btl263.<br />
[MHKR12] Tobias Marschall, Inke Herms, Hans-Michael Kaltenbach, and Sven Rahmann. Probabilistic arithmetic<br />
automata and their applications. IEEE/ACM Transactions on Computational Biology and Bioinformatics<br />
(TCBB), 9(6):1737–1750, 2012. URL: http://doi.ieeecomputersociety.org/10.1109/tcbb.2012.109,<br />
doi:10.1109/TCBB.2012.109.<br />
[Moh09] Mehryar Mohri. Handbook of Weighted Automata, chapter Weighted automata algorithms, pages 213–<br />
254. Springer, 2009. doi:10.1007/978-3-642-01492-5_6.<br />
[MS09] Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry. (draft book-in-progress), 2009.<br />
[MTL02] Bin Ma, John Tromp, and Ming Li. PatternHunter: Faster and more sensitive homology search.<br />
Bioinformatics, 18(3):440–445, 2002. URL:<br />
http://bioinformatics.oxfordjournals.org/content/18/3/440, doi:10.1093/bioinformatics/18.3.440.<br />
[MY09] Bin Ma and Hongyi Yao. Seed optimization for i.i.d. similarities is no easier than optimal Golomb<br />
ruler design. Information Processing Letters, 109(19):1120–1124, 2009. URL:<br />
http://linkinghub.elsevier.com/retrieve/pii/S0020019009002270, doi:10.1016/j.ipl.2009.07.008.<br />
[NGK10] Laurent Noé, Marta Gîrdea, and Gregory Kucherov. Designing efficient spaced seeds for SOLiD read<br />
mapping. Advances in Bioinformatics, 2010:ID 708501, July 2010. URL:<br />
http://www.hindawi.com/journals/abi/2010/708501/, doi:10.1155/2010/708501.<br />
[NR08] François Nicolas and Éric Rivals. Hardness of optimal spaced seed design. Journal of Computer and<br />
System Sciences, 74(5):831–849, Aug. 2008. (earlier version in CPM 2005). URL:<br />
http://linkinghub.elsevier.com/retrieve/pii/S0022000007001444, doi:10.1016/j.jcss.2007.10.001.<br />
[Nue11] Gregory Nuel. Bioinformatics - Trends and Methodologies, chapter Significance Score of Motifs in<br />
Biological Sequences. InTech, 2011. doi:10.5772/18448.<br />
[Pin98] Jean-Éric Pin. Tropical semirings. In J. Gunawardena, editor, Idempotency, volume 11 of Publ. Newton<br />
Inst., pages 50–69, Bristol, 1998. Cambridge Univ. Press.<br />
[RLD+09] Stephen M. Rumble, Phil Lacroute, Adrian V. Dalca, Marc Fiume, Arend Sidow, and Michael<br />
Brudno. SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol, 5(5):e1000386,<br />
05 2009. URL: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000386,<br />
doi:10.1371/journal.pcbi.1000386.<br />
[SB05] Yanni Sun and Jeremy Buhler. Designing multiple simultaneous seeds for DNA similarity search.<br />
Journal of Computational Biology, 12(6):847–861, 2005. (earlier version in RECOMB 2004). URL:<br />
http://www.liebertonline.com/doi/abs/10.1089/cmb.2005.12.847, doi:10.1089/cmb.2005.12.847.<br />
[SB09] Yanni Sun and Jeremy Buhler. Designing patterns and profiles for faster HMM search. IEEE/ACM<br />
Transactions on Computational Biology and Bioinformatics (TCBB), 6(2):232–243, 2009.<br />
doi:10.1109/tcbb.2008.14.<br />
[Sim88] Imre Simon. Recognizable sets with multiplicities in the tropical semiring. In Mathematical foundations<br />
of computer science, 1988 (Carlsbad, 1988), volume 324 of Lecture Notes in Comput. Sci., pages 107–<br />
120. Springer, Berlin, 1988. doi:10.1007/BFb0017135.<br />
[YWC+04] I-Hsuan Yang, Sheng-Ho Wang, Yang-Ho Chen, Pao-Hsian Huang, Liang Ye, Xiaoqiu Huang, and Kun-Mao<br />
Chao. Efficient methods for generating optimal single and multiple spaced seeds. In Proceedings<br />
of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE), Taichung (Taiwan), pages<br />
411–416. IEEE Computer Society Press, 2004. URL:<br />
http://ieeexplore.ieee.org/xpl/freeabs_all.jsparnumber=1317372, doi:10.1109/BIBE.2004.1317372.<br />
[YZ08] Jialiang Yang and Louxin Zhang. Run probabilities of seed-like patterns and identifying good transition<br />
seeds. Journal of Computational Biology, 15(10):1295–1313, Dec. 2008. (earlier version in APBC<br />
2008). URL: http://www.liebertonline.com/doi/abs/10.1089/cmb.2007.0209,<br />
doi:10.1089/cmb.2007.0209.<br />
[ZF07] Leming Zhou and Liliana Florea. Designing sensitive and specific spaced seeds for cross-species<br />
mRNA-to-genome alignment. Journal of Computational Biology, 14(2):113–130, Mar. 2007. URL:<br />
http://www.liebertonline.com/doi/abs/10.1089/cmb.2006.0130, doi:10.1089/cmb.2006.0130.<br />
7 Appendix<br />
Figure 6: Uk matrices and product: example when i = 9 and j = 23 (n = 14)<br />
Figure 7: Uk matrices and product: example when i = 9 and j = 19 (n = 10)<br />
7.1 Worst case number of products from i to j<br />
We denote by n the number of single matrices: n = j − i + 1 (n is thus the length of the block being<br />
computed with the help of the Uk matrices already given). We illustrate below how to obtain the smallest size n<br />
according to the number of products x.<br />
• If x is odd, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends, for<br />
$i \in \left[0..\frac{x-1}{2}\right]$ (see Figure 6 for x = 5):<br />
$n = 2\sum_{i=0}^{\frac{x-1}{2}} 2^i = 2\sqrt{2}\times 2^{\frac{x}{2}} - 2 \iff x = 2\log_2\left(\frac{n+2}{2\sqrt{2}}\right)$<br />
• If x is even, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends of a block<br />
of size $2^{\frac{x}{2}}$, for $i \in \left[0..\frac{x}{2}-1\right]$ (see Figure 7 for x = 4):<br />
$n = 2\sum_{i=0}^{\frac{x}{2}-1} 2^i + 2^{\frac{x}{2}} = 3\times 2^{\frac{x}{2}} - 2 \iff x = 2\log_2\left(\frac{n+2}{3}\right)$<br />
16
Figure 8: minimal n (for x even and odd) functions compared to 2×x. Curves: $2\sqrt{2}\cdot 2^{x/2} - 2$, $3\cdot 2^{x/2} - 2$ and $2x$, for x from 0 to 5.<br />
Figure 9: Uk matrices and Mi..m−1 product: example when i = 33 and j goes from 47 to 48 (m goes from mold = 36 to m = 40)<br />
Note that this integer sequence has its own OEIS sequence at http://oeis.org/A027383, defined here<br />
as a partial sum of http://oeis.org/A016116.<br />
Combining those two cases, it can be shown that when the number of products is set to x = 1, 2 or 3,<br />
then the minimal size is exactly 2×x (illustration on Figure 8), and also that when x > 3 (or x = 0) this<br />
minimal bound is never reached again.<br />
In other words, the number of products x is always $\le \frac{n}{2}$.<br />
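<br />
The two closed forms of Section 7.1 can be checked numerically with the short Python sketch below, which builds the minimal size n directly from the block sizes described in the two cases.<br />
The function name minimal_n is hypothetical.<br />

```python
def minimal_n(x):
    """Smallest window length n that requires x products, following Section 7.1."""
    half = x // 2
    if x % 2 == 1:
        # odd x: blocks of size 2**i on both ends, for i in 0..(x-1)/2
        return 2 * sum(2 ** i for i in range(half + 1))
    # even x: blocks of size 2**i on both ends (i in 0..x/2-1) around a block of size 2**(x/2)
    return 2 * sum(2 ** i for i in range(half)) + 2 ** half


print([minimal_n(x) for x in range(8)])
```

The values obtained for x = 4 and x = 5 are 10 and 14, as in Figures 7 and 6, and n = 2×x only for x = 1, 2 and 3.<br />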
7.2 Amortized analysis of Uk blocks when i = 0 and j ≥ 0<br />
Summing the number of products needed when computing Uk should give 2 on average, and not 1: a quick<br />
analysis shows that, indeed, if one product is done half of the time, two are done 1/4 of the time, three<br />
1/8 of the time, and so on, then $\sum_{u=1}^{\infty} \frac{u}{2^u} = 2$.<br />
However here, we will show that the amortized number of products when considering j is only 1. We use an<br />
amortized analysis by giving one coin each time j is increased (i is supposed to stay at 0, but this assumption<br />
can be lifted since it can be seen as a worst case when updating Uk) to show that any sub-block Uk will<br />
generate one extra coin; thus, when it is grouped with its neighbour block of the same size (itself generating one extra coin),<br />
the father block processed with those two also generates (1+1)−1 = one extra coin.<br />
• This is true for blocks of size 2, since they are built from blocks of size 1 that do not generate any product:<br />
the cost for such a block of size 2 is thus 1, and 1 extra coin remains.<br />
• This can be easily verified for blocks of size $2^p$ (p > 1), since by induction hypothesis the two sub-blocks<br />
of size $2^{p-1}$ each give one extra coin: the cost associated with joining the two sub-blocks then removes<br />
one coin, and one extra coin remains again.<br />
Note that this analysis can be set for any i ≥ 0 and any j > i provided that at first an extra number of<br />
j − i coins is provided.<br />
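<br />
The coin argument can be checked by direct counting, as in the Python sketch below.<br />
It assumes that every block of size 2^p (p ≥ 1) aligned on a multiple of 2^p is built by exactly one product joining its two halves, which is our reading of the two bullet points above; the function name block_products is hypothetical.<br />

```python
def block_products(n):
    """Products needed to build every aligned block of size 2**p (p >= 1) inside [0, n)."""
    total, size = 0, 2
    while size <= n:
        total += n // size  # number of complete aligned blocks of this size
        size *= 2
    return total


for n in (1, 2, 8, 100, 1605):
    print(n, block_products(n))
```

Under this assumption the total never reaches n, so one coin per increase of j is enough to pay for all the block products.<br />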
7.3 Amortized analysis of the left Mi..m−1 blocks when m fixed and i increased<br />
Summing the number of products needed when computing Mi..m−1 for any i from 0 to m gives 1 on average: a<br />
quick analysis shows indeed that if no product is done half of the time (when i is even), one product is done<br />
1/4 of the time, two 1/8 of the time, and so on, then $\sum_{u=0}^{\infty} \frac{u}{2^{u+1}} = 1$.<br />
But this does not guarantee that the total number of products paid when increasing i from any value (for<br />
example 0) to m is always less than m. Here we will show that the number of products (once m is fixed) for<br />
computing Mi..m−1 for any i from 0 to a given $m = 2^p$ is $\le 2^p - p - 1$:<br />
m = 2 → 0<br />
m = 4 → 1<br />
m = 8 → 4<br />
m = 16 → 11<br />
A method similar to that of Section 7.2 can be applied.<br />
First we consider the case when i = 0 and m has been increased to reach a given (and fixed) value $2^p$.<br />
• This is true when p = 1 (thus when m = 2) since, using Uk blocks, no product is needed to compute<br />
M0..1 and M1..1.<br />
• This can be verified for blocks of size $2^p$ (p > 1), since we can then use the two sub-blocks of size $2^{p-1}$:<br />
when i is within the first sub-block, as the product is done from m to i and stacked in such a way that any<br />
suffix Mk..m is kept, it costs the products produced by this sub-block ($2^{p-1} - (p-1) - 1$) added to the<br />
$\log_2\left(\frac{m}{2}\right) = p - 1$ extra products to cover the second sub-block of size $2^{p-1}$; when i is within the second<br />
sub-block, it costs exactly the number of products produced by this sub-block ($2^{p-1} - (p-1) - 1$). Thus, when<br />
summing these two quantities, the number of products is $\le 2\times\left(2^{p-1} - (p-1) - 1\right) + (p-1) = 2^p - p - 1$.<br />
Thus, increasing i from any value ≥ 0 to m and computing all the possible products (with the help of<br />
the blocks Uk) costs $\le m - \log_2(m) - 1$, thus less than m.<br />
Note that this analysis can be set for any i ≥ 0 and any m (not necessarily a strict power<br />
of 2, but written as $m = a \times 2^p$ such that $2^p$ is the maximal block size of Uk for k ∈ [i..j]).<br />
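<br />
The induction step above amounts to the recurrence T(p) = 2×T(p−1) + (p−1) with T(1) = 0, which the following Python sketch unrolls and compares with the closed form $2^p - p - 1$.<br />
The function name left_products is hypothetical.<br />

```python
def left_products(p):
    """Bound on the products paid while i sweeps a block of size 2**p (m fixed)."""
    if p == 1:
        return 0  # base case: m = 2 needs no product
    # two sub-blocks of size 2**(p-1), plus p-1 products to cover the second one
    return 2 * left_products(p - 1) + (p - 1)


print([left_products(p) for p in range(1, 5)])
```

The first values are 0, 1, 4 and 11, matching the small table given for m = 2, 4, 8 and 16.<br />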
7.4 Amortized analysis of the left Mi..m−1 blocks when m is increased (due to a<br />
j increase) and i is fixed<br />
When j increases while i is fixed, m may change to a new (and of course increased) value pointing to an<br />
equal (or twice larger) block: this happens when m goes from $m_{old} = 2^{p_{old}} \times a_{old}$ (with $a_{old}$ odd) to its new<br />
value $m = m_{old} + 2^{p_{old}} = (a_{old}+1) \times 2^{p_{old}} = a \times 2^p$ (with $a = \frac{a_{old}+1}{2}$ and $p = p_{old}+1$), as illustrated on Figure<br />
9.<br />
Here we are interested in the computation of Mi..m−1 due to this $\Delta m = m - m_{old} = 2^{p_{old}}$ increase.<br />
In practice, since m has changed, the full set of left stack matrices Mk..m−1 has to be recomputed for<br />
some k ∈ [i..m], and some products already done Mk..mold−1 (amortized in Section 7.3) unfortunately have to be redone<br />
a second time here.<br />
This twice-cost is at most $\log_2\left(\frac{m_{old}-i+1}{2}\right) \le \log_2\left(\frac{m-m_{old}}{2}\right) = \log_2\left(\frac{\Delta m}{2}\right)$ (mold −i
7.6 Analysis 1<br />
If the size of the ∆m block increase is given by $2^u$, the function f(u) that represents the amortized increase<br />
per j move (Appendix 7.4) is:<br />
$f(u) = \left(3 - \frac{1}{2^u}\right)[j] + \frac{u-1}{2^u}[m] \le g(u) = \left(3 - \frac{1}{2^u}\right)[j] + \frac{u-1}{2^u}\left[\frac{i+2\times j}{3}\right]$ since $m < \frac{i+2\times j}{3}$<br />
$f'(u) = \frac{\ln(2)}{2^u}[j] + \frac{1+\ln(2)-u\ln(2)}{2^u}[m]$ and $g'(u) = \frac{\ln(2)}{2^u}[j] + \frac{1+\ln(2)-u\ln(2)}{2^u}\left[\frac{i+2\times j}{3}\right]$<br />
Note that<br />
$g'(x) \ge 0$ if $x \le 1 + \frac{1}{\ln(2)} \approx 2.44$<br />
$g'(x) \le 0$ if $x \ge \frac{5}{2} + \frac{1}{\ln(2)} \approx 3.94$<br />
so the maximal g(int) candidate is one of g(2), g(3) or g(4). Since<br />
$g(3) - g(2) = \frac{1}{8}[j] \ge 0$<br />
$g(4) - g(3) = \frac{1}{48}([j]-[i]) \ge 0$<br />
then $g(4) = 3[j] + \frac{[i]+[j]}{16}$ is the maximal value. Thus, $f(u) \le 3[j] + \frac{[i]+[j]}{16}$.<br />
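<br />
The three identities used above can be verified with exact rational arithmetic, as in the Python sketch below.<br />
Reading the bracketed terms [i] and [j] as plain numbers is an assumption made for this check, and the function name g simply mirrors the notation of the text.<br />

```python
from fractions import Fraction


def g(u, i, j):
    """g(u) with the bracketed terms read as numbers and m replaced by (i + 2j)/3."""
    scale = Fraction(1, 2 ** u)
    return (3 - scale) * j + (u - 1) * scale * Fraction(i + 2 * j, 3)


i, j = 7, 20
print(g(3, i, j) - g(2, i, j))  # expected to equal j/8
print(g(4, i, j) - g(3, i, j))  # expected to equal (j - i)/48
print(g(4, i, j))               # expected to equal 3j + (i + j)/16
```

For any 0 ≤ i ≤ j tried this way, g(4) is also the largest value of g over the integers u, which agrees with the conclusion of the analysis.<br />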
Note also that<br />
$f(u) = \left(3 - \frac{1}{2^u}\right)[j] + \frac{u-1}{2^u}[m] \le 3[j] + \frac{u-2}{2^u}[m]$ since m