Quick review 2012-2013 - LIFL


<strong>Quick</strong> <strong>review</strong> <strong>2012</strong>-<strong>2013</strong><br />

Laurent<br />

<strong>LIFL</strong>, Université Lille 1 - INRIA<br />

Journées au vert (lab retreat)<br />

11 and 12 June <strong>2013</strong><br />

Laurent, Year <strong>2012</strong>-<strong>2013</strong>


But first ...<br />


<strong>Quick</strong> <strong>review</strong> 2011-<strong>2012</strong><br />

Laurent<br />

<strong>LIFL</strong>, Université Lille 1 - INRIA<br />

Journées au vert (lab retreat)<br />

13 and 14 June <strong>2012</strong><br />

Laurent, Year 2011-<strong>2012</strong>


Last year I ended with:<br />

Thanks for the many PJIs supervised this year!!<br />

don't hesitate to propose even more next year :-)<br />



So I can do it again, now adding:<br />

Thanks to this year's permanent and occasional session chairs!!<br />

don't hesitate to take on even more (module) next year :-)<br />


Friday 1 June<br />

Date Time Room # Title Author Supervisor Student1 Student2<br />

<strong>2012</strong>-06-01 08h00 M5-A7 124<br />

<strong>2012</strong>-06-01 08h30 M5-A7 71<br />

<strong>2012</strong>-06-01 09h00 M5-A7 96 Intégration d'un mécanisme de récompenses pour des Nicolas Haderer Romain Rouvoy Nacim Hamdad Benjamin Bertein<br />

<strong>2012</strong>-06-01 09h30 M5-A7 47 Conception d'une interface Web pour la visualisation Adel Boris Couturier<br />

Noureddine Romain Rouvoy Jonathan Decrocq<br />

<strong>2012</strong>-06-01 10h45 M5-A7 93 [AGIL-IT] Site de suivi des demandes de l'accueil RH Lionel Drain<br />

<strong>2012</strong>-06-01 11h15 M5-A7 92 Régis Servant<br />

[AGIL-IT] [http://www.mobitic.fr/] Développement et Patricia Plénacoste Loïc Daara Sébastien Poulmane<br />

<strong>2012</strong>-06-01 11h45 M5-A7 50 Vers un campus ubiquitaire et social Yvan Peter Yvan Peter<br />

<strong>2012</strong>-06-01 08h00 M5-A8 33 Classification des Historiques de Ventes en grande<br />

<strong>2012</strong>-06-01 08h30 M5-A8 45 Application web de consultations de données Jean-Christophe Routier Jean-Christophe Routier<br />

<strong>2012</strong>-06-01 09h00 M5-A8 128 Interface de saisie et de restitution de données Jean-Christophe Routier Jean-Christophe Routier Julien Milan<br />

<strong>2012</strong>-06-01 09h30 M5-A8 88 Application web de gestion du flux des achats Bruno Bogaert Bruno Bogaert Benjamin Jacquet<br />

<strong>2012</strong>-06-01 10h45 M5-A8 63 Souris 3D<br />

<strong>2012</strong>-06-01 11h15 M5-A8 1 Suivi multi-flux d'objets mobiles Chabane Djeraba Chabane Djeraba Alexandre Mandy<br />

<strong>2012</strong>-06-01 11h45 M5-A8 109 Comparaisons de séquences musicales symboliques Mathieu Giraud Mathieu Giraud Corentin Bertiaux Anthony Lerouge<br />

<strong>2012</strong>-06-01 14h00 M5-A7 86 Annotation de génomes Sylvain Denis<br />

Hélène Touzet Mikael Salson Pauline Wauquier<br />

<strong>2012</strong>-06-01 14h30 M5-A7 112 Mise en place d'une base de données des élus et mMikaël Salson Mikaël Salson Goulven Rozec Patience Ngami-Nana<br />

<strong>2012</strong>-06-01 15h00 M5-A7 66 distrSégolène Caboche Eric Piette<br />

Développement d'un outil de visualisation de la Mikael Salson Luigi Palmiero<br />

<strong>2012</strong>-06-01 15h30 M5-A7 52<br />

<strong>2012</strong>-06-01 16h45 M5-A7 11 Recherche dans des millions de courtes séquencesMikaël Salson<br />

Mikaël Salson<br />

Florian Recourt<br />

<strong>2012</strong>-06-01 17h15 M5-A7 94 [AGIL-IT] Création d’un site internet de promotion de Julien Bliart<br />

Laurent Noé<br />

<strong>2012</strong>-06-01 17h45 M5-A7 39<br />

<strong>2012</strong>-06-01 14h00 M5-A9 104<br />

MINY - Multimodality Is Nice for You! Xavier Le Pallec Xavier Le Pallec Alain Laraki<br />

<strong>2012</strong>-06-01 14h30 M5-A9 102 Clement Dufour<br />

<strong>2012</strong>-06-01 15h00 M5-A9 60 Jean Martinet<br />

Kinect et ZCam, le face à face Amel Aissaoui Ramy Arbid Antoni Pauchet<br />

<strong>2012</strong>-06-01 15h30 M5-A9 41 Extraction d'information de Twitter Ali Abbas<br />

Luigi Lancieri Eric Lepretre David Deroo<br />

desSamuel Blanquart Samuel Blanquart Benoît-Charles Detuncq<br />

<strong>2012</strong>-06-01 16h45 M5-A9 35 Développement d'une interface de visualisation Yannick Leroy<br />

<strong>2012</strong>-06-01 17h15 M5-A9 37 Développement de sites déployables pour la gestion Samuel Blanquart Samuel Blanquart Ismael Souissi Samuel Queniart<br />

<strong>2012</strong>-06-01 17h45 M5-A9 98<br />

Implémentation d'un jeu de tir à l'arc avec Kinect Thomas Pietrzak Thomas Pietrzak Guillaume Devos Joffrey Hochart<br />

Visage en relief avec Zcam : application à la détectiAfifa Dahmane Afifa Dahmane Bahare Shirazi<br />

Patricia Plénacoste<br />

Emmanuel Ardiot<br />

Simon Debaecke<br />

Christophe Leemans<br />

Sylvain Mongy Jean-Stéphane Varré Benjamin Fisset Taha Touati<br />

Kouami-Aderibgbe Adekambi<br />

Marc Duez<br />

Matthieu Calmels<br />

Cédric Montay<br />

Géry Casiez Géry Casiez Kevin Pollaert Axel Delahaye<br />

Alecsia : Aide à L'Evaluation et à la Correction SemMikaël Salson Mikaël Salson Ludovic Loridan<br />

Tony Proum<br />

Développement d’une base de données de glycosylAnne Harduin-Lepers Olgo Plechakova & Maria-CeciliaAnthony Tonglet Antoine Baluzolanga-Kiatoko<br />

Paint collaboratif par Smartphones Xavier Le Pallec Xavier Le Pallec Maxime Raverdy Damien Level<br />

Développement avec un framework PHP pour aiderVincent Vatelot Gilles Vanwormhoudt Adel Ben-Elkrizi Mamadou Cellou Dara Diallo


Monday 4 June<br />

Date Time Room # Title Author Supervisor Student1 Student2<br />

<strong>2012</strong>-06-04 08h00 M5-A7 90 [ATOS-IT] MOW (groupe 1 : étudiants 1 et 2)<br />

Lionel Seinturier Lionel Seinturier Gaetan Mallants<br />

<strong>2012</strong>-06-04 08h30 M5-A7 91 [ATOS-IT] MOW (groupe 1 : étudiants 3 et 4) Manuel Servais<br />

<strong>2012</strong>-06-04 09h00 M5-A7 121 Client WindowsPhone pour plate-forme d'échange de Pierre Kopaczewski Lionel Seinturier Matthias Mellouli Louis Dekeister<br />

<strong>2012</strong>-06-04 09h30 M5-A7 80 Cloud Security - Simulations d'attaques distribuées Damien Riquet<br />

Gilles Grimaud Ludovic Moreau<br />

<strong>2012</strong>-06-04 10h45 M5-A7 2 Recherche de sous-graphes dans un graphe, appliquée Laurent Noé<br />

Maude Pupin<br />

<strong>2012</strong>-06-04 11h15 M5-A7 22 Développement d'un outil de dessin facilitant la créat Valérie Leclère<br />

Maude Pupin<br />

<strong>2012</strong>-06-04 11h45 M5-A7 34 Implémentation d’une solution BI et analyse technolSylvain Maude Pupin<br />

Mongy Adil Ayar Morgan Auchede<br />

<strong>2012</strong>-06-04 08h00 M5-A8 75<br />

<strong>2012</strong>-06-04 08h30 M5-A8 72 bas Maria Cecilia Arias<br />

Développement d’interfaces graphiques pour une Olga Plechakova Gorgui-Djire Ndong Naby Gueye<br />

<strong>2012</strong>-06-04 09h00 M5-A8 18 Base de données Intranet du matériel biologique d'une Christophe Remi Duriez<br />

D'Hulst Olga Plechakova Benjamin Bellangeon<br />

<strong>2012</strong>-06-04 09h30 M5-A8 126 intégrat Jean-Frédéric Berthelot Jean-Frédéric Berthelot<br />

Développement d’extensions pour CMS pour Mickael Lemaitre Jerome Deboffles<br />

<strong>2012</strong>-06-04 10h45 M5-A8 127 permettant Jean-Frédéric Berthelot Jean-Frédéric Berthelot<br />

Développement d’un plugin pour digiKam Iliya Ivanov Nathan Damie<br />

<strong>2012</strong>-06-04 11h15 M5-A8 81<br />

<strong>2012</strong>-06-04 11h45 M5-A8 82<br />

<strong>2012</strong>-06-04 14h00 M5-A7 21 Algorithmes de recherche locale pour l’optimisation Arnaud Liefooghe Arnaud Liefooghe Yoann Dufresne<br />

<strong>2012</strong>-06-04 14h30 M5-A7 119<br />

<strong>2012</strong>-06-04 15h00 M5-A7 7 Site Web 2.0 communautaire d'appariement<br />

<strong>2012</strong>-06-04 16h15 M5-A7 62<br />

<strong>2012</strong>-06-04 16h45 M5-A7 55 Module de construction moléculaire 3D Fabrice Aubert<br />

Sébastien Canneaux Nadir Cherifi Mohamed El-Amrani<br />

<strong>2012</strong>-06-04 17h15 M5-A7 69 Fabrice Aubert Fabrice Aubert<br />

Lecteur web de vidéos 360 et WebSocket Nathanaël Deboeuf Pierre Denquin<br />

<strong>2012</strong>-06-04 17h45 M5-A7 95 Pierre-Hubert Olivier<br />

[SCOTLER] Scotler C&C WA Céline Kuttler Jamal-Dine Youlhajen<br />

<strong>2012</strong>-06-04 14h00 M5-A9 144 [ALTERNANT] étudiant:Kévin Labat & entreprise:Audax Vincent Cordonnier Maude Pupin<br />

<strong>2012</strong>-06-04 14h30 M5-A9 145 [ALTERNANT] étudiant:Valentin Lecerf & entreprise:Laurent Vansuypeene Pierre Boulet<br />

<strong>2012</strong>-06-04 15h00 M5-A9 136 [ALTERNANT] étudiant:Nicolas Cousin & entreprise: Céline Bilasco<br />

Céline Kuttler<br />

Nicolas Cousin<br />

<strong>2012</strong>-06-04 16h15 M5-A9 146 [ALTERNANT] étudiant:Florian Ledoux & entreprise: Nicolas Ruff<br />

Sophie Tison Florian Ledoux<br />

<strong>2012</strong>-06-04 16h45 M5-A9 147 [ALTERNANT] étudiant:Benoit Petit & entreprise:Quadr Sébastien Lucas Laetitia Jourdan Benoit Petit<br />

<strong>2012</strong>-06-04 17h15 M5-A9 140 [ALTERNANT] étudiant:Guillaume Gallant & entreprFrançois Pasquereau Laetitia Jourdan<br />

Lionel Seinturier Lionel Seinturier Nicolas Crappe Alexandre Dubus<br />

Evelyne Ferot<br />

Remi Degruson<br />

Clément Pasek<br />

Développement d'un outil interactif de contrôle de la Valérie Leclère Olga Plechakova Chaste Isabane Anais Ngo-Xuan-Coi<br />

Implementation of a web server for protein function Marc Lensink Guillaume Brysbaert Qiang Liu Roshanak Gharagozlou<br />

Development of an XML language for protein function Marc Lensink Guillaume Brysbaert Doga Ozturk<br />

Optimisation de ressources dans les "clouds" GooglFrançois Clautiaux Arnaud Liefooghe Charle-Edmond Bihr<br />

Maxime Morge Maxime Morge Valentine Maillart Tristan Bourgois<br />

Middleware DDS en environnement métro Christophe Gransart Christophe Gransart Seilendria Hadiwardoyo<br />

Kévin Labat<br />

Valentin Lecerf<br />

Guillaume Gallant


Tuesday 5 June<br />

Date Time Room # Title Author Supervisor Student1 Student2<br />

<strong>2012</strong>-06-05 08h00 M5-A9 122 Rubik Francesco Eric Wegrzynowski Oscar Gest<br />

Lego (A) De Comite<br />

Geoffrey Verhille<br />

<strong>2012</strong>-06-05 08h30 M5-A9 125 Rubik Lego (B) Francesco De Comite Eric Wegrzynowski<br />

Guillaume Macke <strong>2012</strong>-06-05 09h00 M5-A9 24 Kinect Jean-Claude Tarby Jean-Claude Tarby Cyprien Cuvillier<br />

Applications Jean-Claude Tarby Jean-Claude Tarby Claude Saint-Georges<br />

<strong>2012</strong>-06-05 09h30 M5-A9 58 pilotées par le cerveau<br />

<strong>2012</strong>-06-05 10h45 M5-A9 77 Migration d’une base de données relationnelles en bas Céline Anas El-Achiqi<br />

Bilasco Marius Bilasco<br />

Marius Bilasco Marius Bilasco<br />

Warren Moreau<br />

Web of Metadata <strong>2012</strong>-06-05 11h15 M5-A9 123<br />

<strong>2012</strong>-06-05 11h45 M5-A9 130 Reprise d'une application pour la génération du WHO Marius Frederic Bellano Bilasco Marius Bilasco Larbi Noufli<br />

[ALTERNANT] étudiant:Kévin Defives & entreprise:ORomain Lahoche<br />

Sophie Tison<br />

Kévin Defives<br />

<strong>2012</strong>-06-05 09h00 M5-A8 139<br />

entreprise: Jean-Jacques Decrucq Mikaël Salson<br />

<strong>2012</strong>-06-05 09h30 M5-A8 142 [ALTERNANT] étudiant:Geoffrey Hecht & Geoffrey Hecht<br />

<strong>2012</strong>-06-05 10h15 M5-A8 134 [ALTERNANT] étudiant:Christopher Coat & entreprisJonathan Christopher Coat<br />

Carpentier Patricia Plenacoste<br />

[ALTERNANT] étudiant:Guillaume Dauster & entrepr Jonathan Alexandre Sedoglavic Guillaume Dauster<br />

<strong>2012</strong>-06-05 11h15 M5-A8 137 Carpentier<br />

[ALTERNANT] étudiant:Vincent Herbulot & entreprisDesrumaux Romain Rouvoy Vincent Herbulot<br />

<strong>2012</strong>-06-05 11h45 M5-A8 143<br />

Sabine<br />

[ALTERNANT] étudiant:Henri Roussez & entreprise:Bertrand Hudzia Philippe Marquet Henri Roussez<br />

<strong>2012</strong>-06-05 14h00 M5-A8 148<br />

[ALTERNANT] étudiant:Amaury David & entreprise:OLaurent Decool<br />

Philippe Marquet<br />

Amaury David<br />

<strong>2012</strong>-06-05 14h30 M5-A8 138<br />

<strong>2012</strong>-06-05 15h30 M5-A8 135<br />

Maxime Colmant<br />

[ALTERNANT] étudiant:Maxime Colmant & entreprisRalida Azzi Xavier Le Pallec<br />

[ALTERNANT] étudiant:Kevin Guilbert & entreprise:Yann Marzack Xavier Le Pallec Kevin Guilbert<br />

<strong>2012</strong>-06-05 16h15 M5-A8 141<br />

Chair: Xavier Le Pallec / Laetitia Jourdan<br />

Chair: Fabrice Aubert<br />

Chair: Laurent Noé<br />

Chair: Mikael Salson


A few points on “admin”<br />

1. Licence Cybersécurité: <strong>2012</strong>-<strong>2013</strong> for <strong>2013</strong>-2014 ...<br />

2. UPMC: McF<br />

3. PJI, my friends :-)<br />

4. Reviews (nothing but seeds this year, or almost ...)<br />


A few points on research<br />

1. Mappi (mid-term slides + cf. Jenya's talk)<br />

2. Seeds and Product (draft)<br />

3. Peptide Matching → internship of Yoann Dufresne<br />






MAPPI<br />

5 tasks, those I will describe in the Lille context<br />

• Task 1: New index structures for approximate pattern matching<br />

• Task 2: Mapping for metagenomics and metatranscriptomics<br />

• Task 3: Assembly tools for NGS<br />

• Task 4: Guided assembly of metatranscriptomic data<br />

• Task 5: Bioinformatics pipeline




Task 1: New index structures for approximate pattern matching<br />

Context: Read Mapping<br />

Done:<br />

1. Port of the Wu-Manber algorithm to GPU<br />

[Bit-Parallel Multiple Pattern Matching. T. T. Tran, M. Giraud, J.-S. Varré. PPAM / PBC 2011.]<br />

2. Indexing of k-mer neighborhoods<br />

Goal: take advantage of the efficiency of the GPU/processor cache<br />

Two indexing methods considered:<br />

• direct indexing (sort the words → binary search)<br />

• perfect hashing<br />

+ not yet published, but some results:<br />

• OpenCL implementation (working on CPU and GPU)<br />

• performance gain between x10 and x60<br />

• read mapper prototype in progress<br />

[<strong>LIFL</strong>] Tuan Tu Tran, Mathieu Giraud, Jean-Stéphane Varré<br />

[LIAFA] Djamal Belazzougui, Mathieu Raffinot
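
As an illustration of the "direct indexing" option (sort the words, then binary search), here is a minimal Python sketch. It is not the OpenCL implementation mentioned on the slide, and it indexes exact k-mers only, not their neighborhoods. The function names `build_index` and `lookup` and the toy sequence are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right


def build_index(text, k):
    """Return a sorted list of (k-mer, position) pairs for every k-mer of text."""
    return sorted((text[p:p + k], p) for p in range(len(text) - k + 1))


def lookup(index, kmer):
    """Return all positions of kmer using binary search on the sorted index."""
    # (kmer, -1) sorts before every real entry for kmer; (kmer, inf) after.
    lo = bisect_left(index, (kmer, -1))
    hi = bisect_right(index, (kmer, float("inf")))
    return [pos for _, pos in index[lo:hi]]


if __name__ == "__main__":
    idx = build_index("ACGTACGTTACG", 3)
    print(lookup(idx, "ACG"))
```

Sorting once and searching by bisection keeps the index as one contiguous array, which is the kind of layout that tends to be cache friendly; whether this matches the layout used in the project is not stated on the slide.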


Task 4: Guided assembly of metatranscriptomic data<br />

Context: identification of ribosomal RNAs (16S/18S, 23S/28S...)<br />

Goals:<br />

• elimination<br />

• classification<br />

→ A new problem on metatranscriptomic data<br />

[Figure: ribosomal RNA secondary-structure diagram, 5' to 3', positions numbered 100 to 1800, with a legend giving the number of positions per information-content range (in bits)]<br />

created by the SSU-ALIGN package (http://eddylab.org/software.html)<br />

structure diagram derived from CRW database (http://www.rna.ccbb.utexas.edu/)


Task 4: Guided assembly of metatranscriptomic data<br />

Context: identification of ribosomal RNAs<br />

Done:<br />

• design of an efficient filter for selecting rRNA families (SortMeRNA)<br />

• work based on the Burst Trie and the Levenshtein automaton<br />
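
To give an idea of approximate matching over a trie, the sketch below walks a plain (non-burst) trie while carrying one row of the classical Levenshtein dynamic-programming table, and prunes every branch whose best cost already exceeds the error budget. This is a simplified stand-in and not SortMeRNA's burst trie or its Levenshtein automaton; `build_trie`, `search` and the short RNA words are illustrative assumptions.

```python
def build_trie(words):
    """Build a nested-dict trie; the key '$' marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word
    return root


def search(trie, pattern, max_err):
    """Return sorted (word, distance) pairs with edit distance <= max_err."""
    results = []

    def walk(node, prev_row):
        if "$" in node and prev_row[-1] <= max_err:
            results.append((node["$"], prev_row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the DP table by one row for the trie character ch.
            row = [prev_row[0] + 1]
            for c in range(1, len(pattern) + 1):
                cost = 0 if pattern[c - 1] == ch else 1
                row.append(min(row[c - 1] + 1,          # insertion
                               prev_row[c] + 1,         # deletion
                               prev_row[c - 1] + cost))  # match / substitution
            if min(row) <= max_err:  # prune hopeless branches
                walk(child, row)

    walk(trie, list(range(len(pattern) + 1)))
    return sorted(results)


if __name__ == "__main__":
    words = ["GGCUU", "GGUAU", "CAGC", "AUCU", "AGGC",
             "UUU", "CACG", "UGAG", "GUUU"]
    print(search(build_trie(words), "GGCUA", 1))
```

Sharing prefixes in the trie means the dynamic-programming rows for a common prefix are computed once for all words below it, which is the main reason to combine a trie with an edit-distance computation.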

[Figure: example burst trie over the alphabet {A, C, G, U}, with buckets holding the words GGCUU, GGUAU, CAGC, AUCU, AGGC, UUU, CACG, UGAG and GUUU, annotated with bit-vector patterns (e.g. 010(x), x11x) and state labels {I 0} to {I 7} and {M 8} to {M 13}]


Task 4: Guided assembly of metatranscriptomic data<br />

Context: identification of ribosomal RNAs<br />

In progress:<br />

• talk at the London Stringology Days<br />

• publication being submitted<br />

• visit to Génoscope planned for the transition from the end of Task 4 to the start of Task 2, 10-13 April<br />

[<strong>LIFL</strong>] Evguenia Kopylova (ANR), Laurent Noé, Hélène Touzet<br />

[GENOSCOPE] Olivier Jaillon


Spaced seed design for precise read-mapping on HMM profiles for<br />

NGS read-mapping<br />

efficient sliding window product on the matrix semi-group<br />

Laurent Noé<br />

May 10, <strong>2012</strong><br />

Abstract<br />

We propose a new method and an associated algorithm to efficiently compute seed sensitivity when considering that HTS reads are mapped along sub-parts of a known HMM alignment profile. This computation makes particular sense with positioned spaced seeds. It relies on automata theory (previous work [KNR06]) combined with a matrix product problem.<br />

Interestingly, it brings to light an “interval product problem” considered more than twenty years ago in [AS87], but in a “sliding window” form. We propose here an efficient algorithm to compute this sliding window product using a linear number of products on the (associative, but non-commutative and non-invertible) matrix semi-group.<br />

This computational scheme is implemented in the ongoing 1.06 version of Iedera http://bioinfo.lifl.fr/yass/iedera.php.<br />

1 Introduction<br />

Spaced seed design remains an important, but complex and challenging problem. Many papers have been devoted to this subject (mainly during the last decade), from the mere (but at first unintuitive) idea that such seeds perform better [CR93, Buh02] and can be optimized [MTL02, BK01], to spaced seed sensitivity definition and computation [KLMT04], extended models of seeds and their computation [BBV05, Bro05, MGB06, CM07, YZ08, II09, KWS+11], and bounds and complexity problems [FCLCST05, NR08, MY09, EM11]. Several software tools are now publicly available to design spaced seeds [SB05, NGK10, IIMB11] (see footnote 1).<br />

High throughput sequencing technologies have thrown a new light on the seed design process, mainly because the reads obtained are relatively short and quality labelled. Some of the most sensitive algorithms to map such reads onto related genomes use spaced seeds (SHRiMP [RLD+09], ZOOM [LZZ+08], BFAST [HMN09], PerM [CSC09], LAST [KWS+11], SToRM [NGK10], ...).<br />

But most of the regular seeds designed within these tools are based on the assumption that the mapped alignment profile remains “unknown”, thus preferring an i.i.d. “randomly” generated profile. There are several (if not many) cases where this assumption can be removed thanks to a known profile of what is searched / filtered out (prior knowledge on the sequences being searched).<br />

We propose in the main part of this paper an extended method to efficiently compute seed sensitivity or the lossless property when considering that short reads are mapped on substrings of a known HMM alignment profile. This computation is especially useful when designing positioned spaced seeds. It relies mainly on dynamic programming on automata, which can be computed by a set of matrix products along overlapping intervals.<br />


1 Currently, more than one hundred references have been directly related to the spaced seeds problem, see for example http://www.lifl.fr/~noe/spaced_seeds.html<br />

This “interval product problem” has been considered in [AS87], where the authors provide an efficient solution in terms of preprocessing, in order to be able to answer any product query with a given constant bound $k$ on the number of products. We propose here to consider this “interval product problem” with an incremental aspect, using a form of “sliding window”, and propose an efficient algorithm to compute it using a linear number of products on the (associative, but non-commutative and non-invertible) matrix semi-group.<br />

In part 2, we briefly recall the seed design principle, focusing on the seed sensitivity computation. We then state the (matrix) product problem in part 3, and show how it can be solved. Finally, in part 4, we give some measurements on a practical implementation included in the ongoing 1.06 version of Iedera http://bioinfo.lifl.fr/yass/iedera.php.<br />

2 Seed design process<br />

Spaced seeds are now a frequently used hashing technique for biological sequence analysis. Their implementation (as direct hashing) is straightforward and brings high sensitivity for the same theoretical selectivity. Interestingly, in practice, a slightly reduced computational cost can also be observed when using spaced seeds compared to contiguous seeds of the same weight.<br />

Spaced seeds have been generalized by several extended seed models (Vector seeds [BBV05], Indel seeds [MGB06], Subset seeds [KNR06, ZF07, YZ08], Neighbor seeds [CM07]). To increase the overall sensitivity, they can usually be designed jointly as multiple seeds [YWC+04, SB05], and (on quality labelled sequences) as positioned seeds [LZZ+08, NGK10].<br />

In addition to the seed model, one needs a selection criterion for good seed shapes: this criterion is (almost always) established on a model of the alignment being searched (usually a word on a match/mismatch binary alphabet), itself “weighted” by a probabilistic model. Here again the initially proposed i.i.d. Bernoulli model [KLMT04] has been extended into Markov models [BKS05] and HMM models [BBV04], with several extensions [MB07, CP10].<br />

In practice the criterion considered to select good spaced seed shapes is “the probability to hit at least once” (sensitivity), or the guarantee to always hit once (lossless property).<br />

Such a criterion can be measured by a dynamic programming algorithm on automata, with a probabilistic model (a probabilistic automaton, e.g. HMM (vinar)) - represented by regular expressions - computation involved -<br />

3 Matrices product<br />

Finite automata are frequently represented by matrices (obviously sparse matrices when DFA are used). In practice, matrices are multiplied or raised to powers, in such a way that properties of the initial languages of the finite automata are computed on “semi-rings”: for example, probabilities are computed on a classical semi-ring $(E = \mathbb{R}_{0 \le r \le 1}, \oplus = +, \odot = \cdot, 0_\oplus = 0, 1_\odot = 1)$, whereas costs are computed on a tropical semi-ring [Sim88, Pin98, MS09] $(E = \mathbb{N}, \oplus = \min, \odot = +, 0_\oplus = \infty, 1_\odot = 0)$. Sometimes (but not always), on tropical semi-rings, such costs are log probability ratios; in that case, the underlying problem one has to solve is to find the best path (if any) in terms of expected value.<br />

More generally, on both classical and tropical semi-rings, the same algorithm can be applied to compute seed sensitivity [KNR06] for (what is commonly named) the lossy (classical semi-ring) or lossless (tropical semi-ring) seed design framework.<br />

On the classical semi-ring, HMM models (HMM alignment models) are frequently used in language recognition and seed sensitivity computation [BBV04, KNR06, HR08]: they give a set of probabilities (emission probabilities for each state, together with transition probabilities between states) that are computed out of a “profile” alignment. But when such HMM models have to be used with NGS reads to design seeds, one has to face a new problem: taking into account the fact that the read can be any sub-string generated by the HMM alignment model, and thus that the computation may start at any “position” on the alignment HMM, which is in some way a more challenging problem.<br />
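
The remark that one algorithm serves both the classical and the tropical semi-ring can be made concrete with a matrix product that takes the two operations as parameters. The Python sketch below is only an illustration under that assumption: `mat_mul`, `PROBABILITY`, `TROPICAL` and the small matrices are hypothetical names and values, not taken from Iedera.

```python
import operator

INF = float("inf")

# A semi-ring is described here by (add, mul, zero): zero is neutral for add.
PROBABILITY = (operator.add, operator.mul, 0.0)  # (+, x, 0)
TROPICAL = (min, operator.add, INF)              # (min, +, infinity)


def mat_mul(a, b, add, mul, zero):
    """Multiply matrices a and b (lists of rows) over the given semi-ring."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[zero] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = zero
            for k in range(inner):
                acc = add(acc, mul(a[r][k], b[k][c]))
            result[r][c] = acc
    return result


if __name__ == "__main__":
    p = [[0.5, 0.5], [0.0, 1.0]]  # toy transition probabilities
    t = [[1, 2], [INF, 3]]        # toy transition costs
    print(mat_mul(p, p, *PROBABILITY))
    print(mat_mul(t, t, *TROPICAL))
```

With the probability semi-ring the product accumulates path probabilities; with the tropical one the same loops return the cheapest two-step cost between states.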



3.1 Sliding windows product<br />

Such a computation, translated into matrix form, requires computing, for an ordered set of (non-invertible) matrices $M_1, M_2, \ldots, M_n$, a set of products in one of the two following forms:<br />

either:<br />

Problem. compute<br />

$$\prod_{u=i}^{i+w} M_u \quad \forall i \in [1..n-w] \qquad (1)$$<br />

where $w$ is the length of the read,<br />

or more generally:<br />

Problem. compute<br />

$$\prod_{u=i(t)}^{j(t)} M_u \quad \forall t \text{ with } i(t) \le j(t) \qquad (2)$$<br />

such that $i(t)$ and $j(t)$ are two monotonically $\binom{+0}{+1}$-increasing functions.<br />

Definition (2) is particularly well suited when the length of the read is not fixed: for example with the 454 sequencing process, where homopolymers are read in a single step and thus give variable read lengths. In other words, definition (1) is just a special case of (2) where, after increasing $j$ up to $w$, a stepwise increment of both $i$ and $j$ is applied. We will thus consider the second definition (2) in the next parts.<br />
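
As a baseline for definition (2), the following Python sketch recomputes every window product from scratch, which costs j(t) - i(t) products per window. It only fixes the input/output behaviour that a faster method has to reproduce. The names `window_products` and `windows` are assumptions, indices are 0-based, and string concatenation stands in for an associative, non-commutative, non-invertible product.

```python
from functools import reduce
import operator


def window_products(mats, windows, op):
    """Naively compute op-products of mats[i..j] for each (i, j) in windows.

    windows must move each border by +0 or +1 per step, as in definition (2).
    """
    products = []
    previous = None
    for i, j in windows:
        if i > j:
            raise ValueError("empty window")
        if previous is not None:
            if i - previous[0] not in (0, 1) or j - previous[1] not in (0, 1):
                raise ValueError("borders must advance by 0 or 1")
        products.append(reduce(op, mats[i:j + 1]))
        previous = (i, j)
    return products


if __name__ == "__main__":
    letters = list("abcdef")
    fixed_width = [(0, 2), (1, 3), (2, 4), (3, 5)]  # special case (1)
    print(window_products(letters, fixed_width, operator.add))
```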

3.2 Previous work for the Online query product after preprocessing<br />

Alon and Schieber [AS87] have proposed an optimal online way to answer any (non-commutative) product query $\prod_{t=i}^{j} M_t$ for any $i$ and $j$ in a constant number $k$ of products, after a preprocessing in $\Theta(n \cdot \lambda(k,n))$, where $\lambda(k,n)$ is defined as the inverse of a certain function at the $\lfloor k/2 \rfloor$-th level of the primitive recursive hierarchy. For example $\lambda(0,n) = \lceil n/2 \rceil$, $\lambda(1,n) = \lceil \sqrt{n} \rceil$, $\lambda(2,n) = \log(n)$, $\lambda(3,n) = \log\log(n)$, $\lambda(4,n) = \log^{*}(n)$.<br />

This fits perfectly when the length of the window and its position are randomly drawn. But when there are dependencies between the positions of the windows, a sliding window product may be more appropriate.<br />

3.3 Algorithm proposed to compute the Sliding windows product<br />

In our case online queries are not required, so we can avoid doing both preprocessing and processing by using an “online sliding window product” that moves the two ends of the window separately or jointly: it costs an amortized constant number of products on problem 1 and problem 2.<br />

This process does not depend on the size of the sliding window in the second problem (which can otherwise be asymptotically improved using an approach similar to [AS87]). We are here able to move both the left and right ends $i$ and $j$ of the window step-wise, keeping a set of matrices, and computing the product for any of the windows obtained.<br />

U(k) definition and “pre”-processing: the main additional data consists of a set of block products $U(k)$ (for $k \in [i..j]$), preprocessed and kept as shown in Figure 1. $U(k)$ is defined as the product of a contiguous block of matrices of size $u(k,j)$ starting from $k$ (up to $k + u(k,j) - 1$). More precisely, $U(k) = \prod_{t=k}^{k+u(k,j)-1} M_t$, where $u(k,j)$ is defined as the largest possible value $2^p$ such that $k + u(k,j) - 1 \le j$ and $u(k,j)$ divides $k$. Such $U(k)$ blocks are thus of size $u(k,j) = 2^p$, and this size, once fixed, can only increase (by doubling) depending on the value of $j$, before disappearing (when $i > k$).<br />

Maintaining such matrices $U(k)$ for $k \in [i..j]$ costs at most (in amortized analysis) one product per increase of $j$ (see Appendix 6.2). Note that increasing $i$ simply deletes the last $U(i)$ and thus does not require any additional product on the $U(k)$'s. A pseudo-code of the add right process (increment of $j$) is provided in Algorithm 1.<br />

Figure 1: U(k) matrices: example when i = 0 and j = 24 (positions 00 to 29 shown, with blocks U[0] to U[8], ..., U[16] and U[24])<br />

Algorithm 1: add right: increments the right border $j$ by one, and updates the set $U_{i..j}$ using the matrix $M_j$<br />

Data: the set of matrices $M_1, M_2, \ldots, M_n$, the original set $U_{i..j}$<br />

Result: the updated set $U_{i..j}$<br />

/* a) only before the first increment */<br />

if $j = 0$ then<br />

&nbsp;&nbsp;$U_0 \leftarrow M_0$;<br />

/* b) increment j */<br />

inc($j$);<br />

/* c) and process the set of $U_{j-t}$ matrices that have to be updated */<br />

$U_j \leftarrow M_j$;<br />

$u \leftarrow j + 1$; $t_{old} \leftarrow 0$; $t \leftarrow 1$;<br />

while $u$ is even and $j - t \ge i$ do<br />

&nbsp;&nbsp;$U_{j-t} \leftarrow U_{j-t} \cdot U_{j-t_{old}}$;<br />

&nbsp;&nbsp;$t_{old} \leftarrow t$; $t \leftarrow 2 \cdot t + 1$; $u \leftarrow u/2$;<br />
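
A possible Python transcription of Algorithm 1 is sketched below. It makes several assumptions that are not in the pseudo-code: indices are 0-based, the product is a generic associative operation `op`, and the window starts empty (j = -1) so that the first call plays the role of step a). The class name `BlockProducts`, the `products` counter and `remove_left` are hypothetical helpers added for the example.

```python
class BlockProducts:
    """Maintain the block products U[k] for k in [i..j] (sketch of Algorithm 1)."""

    def __init__(self, items, op):
        self.items = items   # the sequence M_0, M_1, ...
        self.op = op         # associative, possibly non-commutative product
        self.i = 0           # left border
        self.j = -1          # right border; -1 means the window is empty
        self.U = {}          # block products, keyed by block start k
        self.products = 0    # number of calls to op, for amortized counting

    def add_right(self):
        """Increment j by one and update every block that ends at the new j."""
        self.j += 1
        j = self.j
        self.U[j] = self.items[j]
        u, t_old, t = j + 1, 0, 1
        while u % 2 == 0 and j - t >= self.i:
            # Two adjacent blocks of equal size merge into one of double size.
            self.U[j - t] = self.op(self.U[j - t], self.U[j - t_old])
            self.products += 1
            t_old, t, u = t, 2 * t + 1, u // 2

    def remove_left(self):
        """Increment i by one: the block starting at the old i is dropped."""
        del self.U[self.i]
        self.i += 1


if __name__ == "__main__":
    import string
    bp = BlockProducts(list(string.ascii_lowercase), lambda a, b: a + b)
    for _ in range(25):      # reach j = 24 with i = 0
        bp.add_right()
    print(bp.U[0], bp.U[16], bp.U[24], bp.products)
```

String concatenation is used as the product because it is associative but neither commutative nor invertible, so each `U[k]` directly spells out which positions the block covers.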

Without considering any previously kept computation, it is directly possible to compute the product $M_i \cdot M_{i+1} \cdots M_j$ for any $i, j$ ($j > i$) in $O(\log(j - i))$ products using the updated set of matrices $U(k)$ for $k \in [i..j]$ (see Appendix 6.1).<br />
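
One way to see the logarithmic bound is to list the blocks that a left-to-right greedy walk would use: start at k = i, take the block U(k), jump to the position after it, and repeat until j is passed. The Python sketch below computes the block sizes directly from the definition of u(k, j) rather than from maintained products; `block_size` and `cover` are hypothetical names, and the greedy walk is an assumption about how the product can be assembled, since Appendix 6.1 is not part of this text.

```python
def block_size(k, j):
    """Largest power of two 2^p that divides k and satisfies k + 2^p - 1 <= j."""
    size = 1
    while k % (2 * size) == 0 and k + 2 * size - 1 <= j:
        size *= 2
    return size


def cover(i, j):
    """Return the U-blocks, as (start, end) pairs, that tile the interval i..j."""
    blocks = []
    k = i
    while k <= j:
        size = block_size(k, j)
        blocks.append((k, k + size - 1))
        k += size
    return blocks


if __name__ == "__main__":
    print(cover(0, 24))
    print(cover(1, 24))
```

Block sizes first grow (limited by the alignment of k) and then shrink (limited by j), so each size appears at most twice and the number of products stays logarithmic in the window length.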

But when the product is computed while $i$ and $j$ follow the “increasing step” functions defined before, the number of products can be reduced to a constant for each step-move of $i$ and $j$ (or for both moves when the distance $w$ separating $i$ and $j$ is fixed):<br />


middle definition: we need to define here the middle $m$ of $i$ and $j$ as the beginning position of the maximal (in size) U-block included in the interval $i..j$. In other words, $m$ corresponds to the value between $i$ and $j$ that can be the “most factorized by 2”. If two maximal blocks lie between $i$ and $j$, we choose the beginning of the second block (see Figure 3.3), as it always corresponds to the value $m$ that can be the “most factorized by 2”. This middle border enables the computation to be split in two parts when needed, which we will call left (colored in green on Figure 3.3) and right (red on Figure 3.3). Note that $m < \frac{1}{3} i + \frac{2}{3} j$. Note also that when there is only one maximal sized block, $m < \frac{1}{2} i + \frac{1}{2} j$, and when there are two maximal sized blocks, $m > \frac{2}{3} i + \frac{1}{3} j$.<br />

In the next part, we will compute $M_{i..m-1}$ and $M_{m..j}$ in two separate parts, considering first the case when $m$ is fixed, and then two cases when $m$ is increased.<br />

Figure 2: U(k) matrices: example when i = 1 and j = 24

[Figure: blocks U[1] to U[8], U[16] and U[24] drawn over positions 00 to 29, with i = 1, m = 16 and j = 24]

middle unchanged: if we suppose that the middle $m$ does not change during a computational step, it can be observed that:

• when $j$ is increased (so that $j = j_{old} + 1$), updating the product $M_{m..j}$ can be done with one product, provided that we keep the previous computation $M_{m..j_{old}}$. Since we also update the $U(k)$ values at the same time, an amortized single product must be added (Amortization on $j$: see Appendix 6.2). Joining $M_{i..m-1}$ with $M_{m..j}$ then costs one extra product, giving a total number of products of three.

• when $i$ is increased ($i = i_{old} + 1$), the previous computation $M_{i_{old}..m-1}$ does not help and can be erased. However, if we keep all the previously computed products $M_{k..m-1}$ in a stack for all the blocks $U_k$ visited before, reusing and updating this part can be done with one single amortized product (Amortization on $m$: see Appendix 6.3). Joining $M_{i..m-1}$ and $M_{m..j}$ then costs one extra product, giving a total number of products of two.

At first sight, a $\{cost(i) \le i + m;\ cost(j) \le 3j\}$ cost is applied (when $m$ does not change). However, this computation has to be updated when $m$ changes; this is considered in the next part:

middle changed: if we suppose that the middle $m$ does change, the previous computation, cut in two parts $M_{i..m-1}$ and $M_{m..j}$, is somehow “compromised”. Let us now see when $m$ changes, and moreover why:

• when $m$ changes due to a $j$ increase: as $m$ follows the beginning of the largest right-most $U_k$ block, $j$ can increase the maximal block size by a factor of two, either without changing $m$ (case handled before), or by jumping to the next power-of-two block, thus from $m_{old} = odd \times 2^p$ to $m = (odd + 1) \times 2^p = \frac{odd+1}{2} \times 2^{p+1}$. The latter case has no consequence on the product $M_{m..j}$, which is immediately computed by the update of the $U(k)$ values since $M_{m..j}$ corresponds to a single maximal block in $U(k)$, thus in one single product here (and not two).

However, moving $m$ will obviously compromise the left stack of previous computations $M_{k..m_{old}-1}$: it will no longer help the computation of the next $M_{i..m_{old}-1}$ on the next increase of $i$, since $m_{old}$ is now pushed to the next power-of-two position $m$, and it can be erased.

This extra cost can however be amortized by $\frac{9}{8}$ of $\Delta m$, where $\Delta m$ represents the increase of $m$ (Amortization on $m$, see Appendix 6.4). Joining $M_{i..m-1}$ with $M_{m..j}$ then costs one extra product.

Finally, when $m$ changes due to a $j$ increase, a $\{cost(j) \le 2j + \frac{9}{8} m\}$ cost is applied.

• when $i$ is increased so that $i > m$ (thus $i = m + 1$), $m$ can only “jump” to a next block of smaller size: the cost on the left stack $[i..m-1]$ is already paid, as it corresponds to a “legal” move of $i$ that is amortized by one product as seen previously (Amortization on $m$: see Appendix 6.3).

However, moving $m$ will obviously compromise the right computation of $M_{m_{old}..j}$, since $m_{old}$ is now pushed to the next (smaller) block; this computation can be erased and recomputed.

This cost can however be amortized by $\frac{9}{8}$ of $\Delta m$, where $\Delta m$ is the increase of $m$ (Amortization on $m$, see Appendix 6.5). Joining $M_{i..m-1}$ and $M_{m..j}$ then costs one extra product.

Finally, when $m$ changes due to an $i$ increase, a $\{cost(i) \le i + \frac{9}{8} m\}$ cost is applied.

To conclude, a $\{cost(i) \le i + \frac{9}{8} m;\ cost(j) \le 3j + \frac{1}{8} m\}$ cost is applied.

4 Practical implementation

First, it is very likely that these bounds can be improved by a more precise analysis. However, going under a bound of 3.0 per move is unlikely, at least without any initial amortized costs, since we have found at least one example for which the number of products is 3.00325 per move (see footnote 2).

Moreover, when $j$ is increased by “runs” while $i$ is fixed, the proposed algorithm can be enhanced with a greedy computation of the $M_{i..j}$ product (which can be done quickly provided that $i$ stays fixed for a while). In practice, this implementation always gives fewer products than the proposed one, but it has not been carefully analysed so far.

On the other hand, some more practical considerations also show that, when applied to sparse matrices, such a product cannot be considered as a “constant” operation, but more likely as a “function of the sparsity”. Such an implementation however needs to know this “sparsity cost” for all the possible products which, in practice on unknown automata, is similar to simulating the product, and thus costs as much as the product itself...

5 Experiments

The previous algorithm has been implemented and tested in iedera where

Speedup in practice ... over the naive range product.

On a typical example, for a window of length 108 (which corresponds to the Illumina read length here) and a profile of size 1605 (16S rRNA), the number of products for the full computation of the 1605 − 108 + 1 = 1498 windows is 5933 (note that each window needs a displacement on both $i$ and $j$).
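To put the reported figure in perspective, the arithmetic can be replayed in a few lines of Python. The baseline used here is an assumption made for illustration only: it supposes that a naive approach rebuilds each window product from scratch, at a cost of one product less than the window length per window. The text does not state which naive method it has in mind.

```python
profile_size = 1605      # size of the profile (16S rRNA), from the text
window = 108             # window length (Illumina read length), from the text
reported = 5933          # number of products reported in the text

windows = profile_size - window + 1
# Assumption: the naive approach recomputes every window from scratch,
# which takes window - 1 products per window.
naive = windows * (window - 1)

print(windows)                        # number of windows
print(naive)                          # products for the assumed naive baseline
print(round(reported / windows, 2))   # reported products per window
print(round(naive / reported, 1))     # ratio between the two counts
```

Under this assumption the reported count comes to a little under four products per window, that is, about two per border move.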


Footnote 2: 1, 2, −1, 3..24, −2, −3, 25..51, −4, 52..72, −5, 73..392, −6, 393..441, −7, −8, 442..577, −9, 578..3071, where i-moves are given with a minus notation.



6 Appendix

6.1 Worst case number of products from i to j

We denote by $n$ the number of single matrices: $n = j - i + 1$ ($n$ is thus the length of the block being computed with the help of the $U_k$ matrices already given). We illustrate below how to obtain the smallest size $n$ according to the number of products $x$.

• If $x$ is odd, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends, for $i \in [0..\frac{x-1}{2}]$ (see Figure 3 for $x = 5$):

Figure 3: U(k) matrices and product: example when i = 9 and j = 23

[Figure: positions 00 to 26 with i = 9 and j = 23; n = 14, five products marked x]

$$n = 2 \sum_{i=0}^{\frac{x-1}{2}} 2^i = 2\sqrt{2} \times 2^{\frac{x}{2}} - 2, \qquad x = 2 \log_2\left(\frac{n+2}{2\sqrt{2}}\right)$$

• If $x$ is even, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends of a block of size $2^{\frac{x}{2}}$, for $i \in [0..\frac{x-2}{2}]$ (see Figure 4 for $x = 4$):

Figure 4: U(k) matrices and product: example when i = 9 and j = 19

[Figure: positions 00 to 29 with i = 9 and j = 19; n = 10, four products marked x]

$$n = 2 \sum_{i=0}^{\frac{x-2}{2}} 2^i + 2^{\frac{x}{2}} = 3 \times 2^{\frac{x}{2}} - 2, \qquad x = 2 \log_2\left(\frac{n+2}{3}\right)$$

Combining those two cases, it can be shown that when the number of products is set to $x = 1$, 2 or 3, the minimal size is exactly $2 \times x$, and that when $x > 3$ (or $x = 0$) this minimal bound is never reached again.

Figure 5: minimal n (for x even and odd) functions compared to 2 × x

In other words, the number of products $x$ is always $\le \frac{n}{2}$.
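The two closed forms can be cross-checked against a small brute-force search. The sketch below is only a sanity check under an assumption: it supposes that an interval i..j is always tiled greedily by the largest aligned power-of-two blocks that fit, as the figures suggest. The function names `products_needed`, `minimal_length` and `closed_form` are introduced here for illustration.

```python
def products_needed(i, j):
    """Products used to join the aligned power-of-two blocks tiling i..j."""
    blocks, k = 0, i
    while k <= j:
        s = 1
        while k % (2 * s) == 0 and k + 2 * s - 1 <= j:
            s *= 2                  # grow while the block stays aligned and inside i..j
        k += s
        blocks += 1
    return blocks - 1


def minimal_length(x):
    """Smallest n such that some interval of n matrices needs at least x products."""
    n = 1
    while True:
        if any(products_needed(i, i + n - 1) >= x for i in range(2 * n + 1)):
            return n
        n += 1


def closed_form(x):
    """Minimal n given by the two formulas (x odd, x even)."""
    if x % 2 == 1:
        return 2 * (2 ** ((x + 1) // 2) - 1)
    return 3 * 2 ** (x // 2) - 2


for x in range(1, 8):
    print(x, minimal_length(x), closed_form(x), 2 * x)
```

For x = 1, 2 and 3 the minimal length equals 2 × x; from x = 4 onwards it is strictly larger, so x never exceeds n/2.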

6.2 Amortized analysis of $U_k$ blocks when $i = 0$ and $j \ge 0$

The number of products needed when computing $U_k$ might seem to be 2 on average, and not 1: a quick analysis indeed suggests that if one product is done half of the time, two are done a quarter of the time, three an eighth of the time, and so on, then $\sum_{u=1}^{\infty} \frac{u}{2^u} = 2$.

However, we show here that the amortized number of products per $j$ increase is only 1. We use an amortized analysis that gives one coin each time $j$ is increased ($i$ is supposed to stay at 0, but this assumption can be dropped since it is a worst case when updating $U_k$), to show that any sub-block $U_k$ generates one extra coin; thus, when it is grouped with its neighbor block of the same size (itself generating one extra coin), the father block built from those two also generates $(1 + 1) - 1 =$ one extra coin.

• this is true for blocks of size 2, since they are built from blocks of size 1 that do not generate any product: the cost for such a block of size 2 is thus 1, and 1 extra coin remains.

• this can be easily verified for blocks of size $2^p$ ($p > 1$), since by the induction hypothesis the two sub-blocks of size $2^{p-1}$ each give one extra coin: the cost of joining the two sub-blocks then removes one coin, and one extra coin remains again.

Note that this analysis holds for any $i \ge 0$ and any $j > i$, provided that an extra number of $j - i$ coins is given at first.
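The coin argument can be checked numerically by counting the products triggered by each increment of $j$. The sketch below reuses the loop condition of Algorithm 1 without storing any matrix; the function names `products_for_increment` and `total_products` are hypothetical.

```python
def products_for_increment(j_new, i=0):
    """Number of block products performed when the right border reaches j_new."""
    u, t, count = j_new + 1, 1, 0
    while u % 2 == 0 and j_new - t >= i:
        count += 1
        t, u = 2 * t + 1, u // 2
    return count


def total_products(n_moves, i=0):
    """Total number of products after n_moves increments of j, starting from j = 0."""
    return sum(products_for_increment(j, i) for j in range(1, n_moves + 1))


for n_moves in (1, 3, 7, 15, 24):
    print(n_moves, total_products(n_moves))
```

In this count the total never exceeds the number of increments, which is the "one coin per increase of j" statement.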

6.3 Amortized analysis of the left $M_{i..m-1}$ blocks when $m$ is fixed and $i$ is increased

The number of products needed when computing $M_{i..m-1}$ for any $i$ from 0 to $m$ is 1 on average: a quick analysis indeed shows that if zero products are done half of the time (when $i$ is even), one product is done a quarter of the time, two an eighth of the time, and so on, then $\sum_{u=0}^{\infty} \frac{u}{2^{u+1}} = 1$.

But this does not guarantee that the total number of products paid when increasing $i$ from any value (for example 0) to $m$ is always less than $m$. Here we show that the number of products (once $m$ is fixed) for computing $M_{i..m-1}$ for any $i$ from 0 to a given $m = 2^p$ is $\le 2^p - p - 1$.

m = 2 → 0
m = 4 → 1
m = 8 → 4
m = 16 → 11

A method similar to that of Section 6.2 can be applied.

First we consider the case when $i = 0$ and $m$ has been increased to reach a given (and fixed) value $2^p$.

• this is true when $p = 1$ (thus when $m = 2$) since, using the $U_k$ blocks, no product is needed to compute $M_{0..1}$ and $M_{1..1}$.

• this can be verified for blocks of size $2^p$ ($p > 1$), since we can then use the two sub-blocks of size $2^{p-1}$: when $i$ is within the first sub-block, as the product is done from $m$ to $i$ and stacked in such a way that any suffix $M_{k..m}$ is kept, it costs the products produced by this sub-block ($2^{p-1} - (p-1) - 1$) added to the $\log_2(\frac{m}{2}) = p - 1$ extra products needed to cover the second sub-block of size $2^{p-1}$; when $i$ is within the second sub-block, it costs exactly the number of products produced by this sub-block ($2^{p-1} - (p-1) - 1$). Thus, when summing these two quantities, the number of products is $\le 2 \times \left(2^{p-1} - (p-1) - 1\right) + (p-1) = 2^p - p - 1$.

Thus, increasing $i$ from any value $\ge 0$ to $m$ and computing all the possible products (with the help of the blocks $U_k$) costs $\le m - \log_2(m) - 1$ products, and thus less than $m$.

Note that this analysis holds for any $i \ge 0$ and any $m$ (not necessarily a strict power of 2, but of the form $m = a \times 2^p$ such that $2^p$ is the maximal block size of $U_k$ for $k \in [i..j]$).
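The induction above amounts to a simple recurrence, which can be unrolled to recover the small table of values. The function name `left_stack_products` is chosen here for illustration; the sketch only replays the recurrence stated in the text and does not simulate the stack itself.

```python
def left_stack_products(p):
    """Bound on the number of products for a block of size 2**p (p >= 1)."""
    if p == 1:
        return 0                      # base case from the text: m = 2
    # two sub-blocks of size 2**(p-1), plus p - 1 products to cover the second one
    return 2 * left_stack_products(p - 1) + (p - 1)


for p in range(1, 5):
    print(2 ** p, left_stack_products(p), 2 ** p - p - 1)
```

The recurrence and the closed form $2^p - p - 1$ agree for every $p$ tested, giving 0, 1, 4 and 11 for m = 2, 4, 8 and 16.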

6.4 Amortized analysis of the left $M_{i..m-1}$ blocks when $m$ is increased (due to a $j$ increase) and $i$ is fixed

When $j$ increases while $i$ is fixed, $m$ may change to a new (and of course increased) value pointing to a block of equal size (or to a block twice as large): this happens when $m$ goes from $m_{old} = 2^{p_{old}} \times a_{old}$ (with $a_{old}$ odd) to its new value $m = m_{old} + 2^{p_{old}} = (a_{old} + 1) \times 2^{p_{old}} = a \times 2^p$ (with $a = \frac{a_{old}+1}{2}$ and $p = p_{old} + 1$), as illustrated on Figure 6.

Figure 6: U(k) matrices and $M_{i..m-1}$ product: example when i = 33 and j goes from 47 to 48

[Figure: positions 32 to 50 with i = 33, m_old = 36, m = 40, j_old = 47 and j = 48]

We are here interested in the computation of $M_{i..m-1}$ due to this $\Delta m = m - m_{old} = 2^{p_{old}}$ increase. In practice, since $m$ has changed, the full set of left stack matrices $M_{k..m-1}$ has to be recomputed for some $k \in [i..m]$, and some products already done for $M_{k..m_{old}-1}$ unfortunately have to be done a second time here.

This twice-cost is at most $\log_2(\frac{m_{old}-i+1}{2}) \le \log_2(\frac{m-m_{old}}{2}) = \log_2(\frac{\Delta m}{2})$ ($m_{old} - i$ [...]

[...] $i > m$ while $j$ is fixed, $m$ must change to a new (and of course increased) value pointing to a smaller block. This happens when $m$ goes from $m_{old} = 2^{p_{old}} \times a_{old}$ to its new value $m = m_{old} + 2^{p_{old}} = (a_{old} + 1) \times 2^{p_{old}}$, as illustrated on Figure 7.

Compared to the case previously seen in Appendix 6.4, there is no twice-cost on the left stack, so this part is still amortized as in Section 6.3. However, the right part now has to be recomputed.

Figure 7: U(k) matrices and $M_{m..j}$ product: example when j = 47 and i goes from 32 to 33

[Figure: positions 32 to 42 with i_old = 32, i = 33, m_old = 32 and m = 40]

We are here interested in the computation of $M_{m..j}$ due to this change of $m$ (red mark on Figure 7).

This cost is at most $\log_2(\frac{j-m+1}{2}) \le \log_2(\frac{m-m_{old}}{2}) = \log_2(\frac{\Delta m}{2})$ ($j - m$ [...]


Using Appendix 6.2 and Appendix 6.5 (in a similar way to 6.4), moving $i$ thus implies

• a 1 (per $i$ increase) + 1 (per “$m$ amortized increase”, see Section 6.3) cost when $m$ is fixed,

• a 1 (per $i$ increase) + $\log_2(\frac{\Delta m}{2})$ cost when $m$ is increased by $\Delta m = 2^p$.

A rapid analysis of the two cases combined shows that this cost can be bounded by $1 \times i + \frac{9}{8} \times m$ (this worst case can be produced when $\Delta m = 8$ or $\Delta m = 16$). This cost is thus $\le \frac{11}{8} \times i + \frac{3}{4} \times j$ (since $m \le \frac{2 \times j + i}{3}$, so that $i + \frac{9}{8} \times \frac{2 \times j + i}{3} = \frac{11}{8} \times i + \frac{3}{4} \times j$) and can be roughly bounded by $2\frac{1}{8}$ per $j$ increase (since $i \le j$).



Peptides to match


...


...




This time ... no last-minute excuse found

Thank you for the many PJIs supervised or chaired this year!!

1 We are looking for a new PJI chair for next year (Gery is leaving)

2 We are looking for someone to take over the module (someone who is in this room)


Tuesday 28 May

Date Time Room Project Title Author Supervisor Student1 Student2
2013-05-28 16h00 M5-A8 127 [ALTERNANT] étudiant:Stephane Drubay & entreprise:Odysys Pierre-Eric Marez Philippe Marquet Stephane Drubay
2013-05-28 16h30 M5-A8 132 [ALTERNANT] étudiant:Laura Leclercq & entreprise:Odysys Pierre-Eric Marez Philippe Marquet Laura Leclercq
2013-05-28 17h00 M5-A8 137 [ALTERNANT] étudiant:Kévin Moulart & entreprise:Proges Plu Philippe Viot Maude Pupin Kévin Moulart
2013-06-03 17h30 M5-A8 134 [ALTERNANT] étudiant:Laurent Leleux & entreprise:Proges Pl Maryvonne Viot Maude Pupin Laurent Leleux

Wednesday 29 May

Date Time Room Project Title Author Supervisor Student1 Student2
2013-05-29 09h00 M5-A9 22 Évolution de l'application de gestion du personnel de l'IEEA Jean-Christophe Routier Jean-Christophe Routier Mélissa Blain Jérôme Wyckaert
2013-05-29 09h30 M5-A9 89 Annuaire de l'association des anciens de la MIAGE de Lille Anne-Cécile Caron Anne-Cécile Caron Nicolas Vandemeulebrou Florian Bruffaert
2013-05-29 10h30 M5-A9 111 Outils de communication pour l'association AVERS Anne-Cécile Caron Anne-Cécile Caron Mamadou Bachir Bah
2013-05-29 11h00 M5-A9 90 Génération de documents pédagogiques Anne-Cécile Caron Anne-Cécile Caron Maxime Boucher Latifou Sano
2013-05-29 14h00 M5-A9 86 Amélioration d'un logiciel de visualisation d'orbite Florent Deleflie Francesco De Comité Romain Frangi Dimitri Descamps
2013-05-29 14h30 M5-A9 112 Concours Infotel Anne-Cécile Caron Anne-Cécile Caron Christopher Laethem Zakariae Azaroual
2013-05-29 15h00 M5-A9 113 Concours Infotel (suite) Anne-Cécile Caron Anne-Cécile Caron Nassim Hassaine Zouhair Makhout
2013-05-29 15h30 M5-A9 45 Analyse automatique de l'historique Git des logiciels Martin Monperrus Martin Monperrus Sylvain Magnier Maxence Montauzan
2013-05-29 16h00 M5-A9 95 Export de code source Python en XML Martin Monperrus Martin Monperrus Pierre Frayer Antoine Goubel

Thursday 30 May

Date Time Room Project Title Author Supervisor Student1 Student2
2013-05-30 09h30 M5-A9 93 Optimisation de flots François Clautiaux Marie-Emile Voge Irina Bakardzhieva Ophélie Debiève
2013-05-30 10h30 M5-A9 77 Suivi d'accueil des enfants dans un centre périscolaire - factur Periscope Marius Bilasco Rémi Kaczmarek Maxime Vanpeene
2013-05-30 11h00 M5-A9 115 Reprise application de gestion de listes de présences alternan Marius Bilasco Marius Bilasco Alexis Boutrouille Pierre Bailleul
2013-05-30 14h00 M5-A9 116 Application web de gestion de suivis de recherche de stage Patricia Plénacoste Maude Pupin Soufiane Agadr Thomas Aubry
2013-05-30 14h30 M5-A9 92 Création d’une base de données sur la glycosylation du poisso Yann Guerardel Olga Plechakova Karl Deleforterie Franck David
2013-05-30 15h30 M5-A9 23 Base de données et données géographiques Francis Bossut Francis Bossut Pierrick Lesage Alexandre Bienvenu
2013-05-30 16h00 M5-A9 99 Site web de la MDE Eric Bros Raphaël Marvie Djamel Amara

Friday 31 May

Date Time Room Project Title Author Supervisor Student1 Student2
2013-05-31 08h30 M5-A7 2 Reconstituer le puzzle : depuis des fragments jusqu'à l'ARN Mikaël Salson Mikaël Salson Charles Husquin
2013-05-31 09h00 M5-A7 4 Alecsia apprend à lire les ODT et PDF Mikaël Salson Mikaël Salson Anthony Tonglet
2013-05-31 10h15 M5-A7 78 Evolution de l'application de suivi d'alternants et stages Marius Bilasco Marius Bilasco Ayoub Nejmeddine Sara El-Arbaoui
2013-05-31 10h45 M5-A7 81 Take a photo for me Marius Bilasco Marius Bilasco Jérémie Samson Victor Paumier
2013-05-31 11h15 M5-A7 82 Interagir avec votre ordinateur de la tête Marius Bilasco Marius Bilasco Mamadou Diop
2013-05-31 11h45 M5-A7 84 Analyse contextuelle de collections de photos privées Marius Bilasco Marius Bilasco Benjamin Allaert Benjamin Flahauw
2013-05-31 14h30 M5-A7 6 Frameworks PHP et back-offices pour applications mobiles Jean-Claude Tarby Jean-Claude Tarby Omar Chahbouni Abderrahime El Idrissi
2013-05-31 15h00 M5-A7 8 Intégration des ondes cérébrales dans la vie courante Jean-Claude Tarby Jean-Claude Tarby Mickaël Duruisseau Nicolas Coyard
2013-05-31 16h15 M5-A7 68 Conception d'un Raspberry pi dédié aux présentations Bruno Bogaert Bruno Bogaert Louis Billiet Sylvain Goulliart
2013-05-31 16h45 M5-A7 69 Écosystème pour gestion d'emploi du temps hebdomadaire Bruno Bogaert Bruno Bogaert Dhia Elhak Lakhal Sylvain Malfait
2013-05-31 10h15 M5-A8 46 Intégration de Drone à une plateforme logicielle Gwenael Cattez Gwenael Cattez Ali Hedjaz Tony Tran
2013-05-31 10h45 M5-A8 65 Moteur de scripts sous iOS Nicolas Haderer, Romain Romain Rouvoy Benjamin Digeon Florent David
2013-05-31 11h15 M5-A8 66 Utiliser les téléphones mobiles pour l’estimation de la densité d Nicolas Haderer Romain Rouvoy Julien Duribreux Justin Dufour
2013-05-31 14h00 M5-A8 34 Interface de visualisation de molécules Maude Pupin Laurent Noé Antonia Ludunge
2013-05-31 14h30 M5-A8 91 Pipeline d'analyse de régions de cassures Jean-Stéphane Varré Jean-Stéphane Varré Gauvain Marquet
2013-05-31 15h00 M5-A8 30 Robot lego solveur de Sudoku Francesco De Comité Leopold Weinberg Oulamine Youssef El Achiqi Anas
2013-05-31 16h15 M5-A8 37 Traitement semi-automatique des feuilles de présence Géry Casiez Géry Casiez Alexis Linke Maxence Gaudry
2013-05-31 16h45 M5-A8 3 Conception d'un reseau social orienté vidéo Antoine Thomas Antoine Thomas Emmanuel Pede Thomas Besset
2013-05-31 08h30 M5-A9 105 Suivi d'un capteur en 3D a l'aide d'une webcam Jean Rioult Sébastien Ambellouis Matthieu Fesselier Guillaume Huylebroeck
2013-05-31 09h00 M5-A9 110 Algorithmes de placement en deux dimensions François Clautiaux François Clautiaux Romain Windels
2013-05-31 10h15 M5-A9 70 Home Cloud Server Cedric Dumoulin Cedric Dumoulin Lison Gallos Arnaud Caulier
2013-05-31 10h45 M5-A9 27 Framework de modélisation dans les Tablettes Android Amine El Kouhen Cédric Dumoulin Malika Rakhaoui Fatou-Laye Mbaye
2013-05-31 11h15 M5-A9 71 Etude de la spécification des représentations arborescentes Cedric Dumoulin Cedric Dumoulin Adrien Burillon Thomas Camberlin
2013-05-31 11h45 M5-A9 72 Generateur de GUI Android Cedric Dumoulin Cedric Dumoulin Gerard Paligot
2013-05-31 14h00 M5-A9 33 Intégration du support multitouch dans Pharo Stéphane Ducasse Stéphane Ducasse Francois Lepan Benjamin V. Ryseghem
2013-05-31 14h30 M5-A9 50 Interaction Kinect pour une application ludique Samuel Degrande Patricia Plenacoste Thomas Crepel Rémi Boens
2013-05-31 15h00 M5-A9 108 Développement d'un plugin Eclipse de transformation et d'ana Martin Monperrus Benoit Cornu Amina El-Mekky Ouardia Ma-Z

Monday 3 June<br />

Date Time Room Project Title Author Supervisor Student1 Student2<br />

<strong>2013</strong>-06-03 09h00 M5-A8 138 [ALTERNANT] étudiant:Augustin Petre & entreprise:DecathlonJulien Mouchon Jean-Claude Tarby Augustin Petre<br />

<strong>2013</strong>-06-03 09h30 M5-A8 124 [ALTERNANT] étudiant:Olivier Debreu & entreprise:Noolitic Sylvain Deceuninck Gilles Grimaud Olivier Debreu<br />

<strong>2013</strong>-06-03 10h15 M5-A8 135 [ALTERNANT] étudiant:Alexandre Loywick & entreprise:GenesGaël Even Mikaël Salson Alexandre Loywick<br />

<strong>2013</strong>-06-03 10h45 M5-A8 123 [ALTERNANT] étudiant:Tristan Cavelier & entreprise:Nexedi Jean-Paul Smets Mikaël Salson Tristan Cavelier<br />

<strong>2013</strong>-06-03 11h45 M5-A8 141 [ALTERNANT] étudiant:Dominique Testelin & entreprise:Idees3Guillaume Palamin Fabrice Aubert Dominique Testelin<br />

<strong>2013</strong>-06-03 14h00 M5-A8 136 [ALTERNANT] étudiant:Nathanael Martin & entreprise:Unis Michaël Macquart Yves Roos Nathanael Martin<br />

<strong>2013</strong>-06-03 15h00 M5-A8 143 [ALTERNANT] étudiant:Donovan Watteau & entreprise:Cerise Gauthier M Dequidt Arnaud Liefooghe Donovan Watteau<br />

<strong>2013</strong>-06-03 14h00 M5-A7 52 Recherche de candidats/jobs sans contact Nabil Djarallah, Nicolas HNabil Djarallah Gens Maxime Camille Riquier<br />

<strong>2013</strong>-06-03 14h30 M5-A7 53 API de contrôle de drones volants Nabil Djarallah, Nicolas PNicolas Petitprez Mohamed Ouannane Jeremy Diaz<br />

<strong>2013</strong>-06-03 15h00 M5-A7 54 Petites annonces en réalité augmentée Nabil Djarallah, Nicolas PNicolas Petitprez Alexandre Raulin Yann Duval<br />

<strong>2013</strong>-06-03 16h15 M5-A7 118 Intégration du uPnP dans le serveur embarqué SMEWS Gilles Grimaud Gilles Grimaud Edouard Berton Nicolas Ryckembusch<br />

<strong>2013</strong>-06-03 16h45 M5-A7 119 Interface graphique en python pour la commande de compilatiGilles Grimaud Gilles Grimaud Rabab Bouziane Narjes Jomaa<br />

<strong>2013</strong>-06-03 14h00 M5-A9 20 Capture de mouvement 3D avec une caméra Microsoft KinectHazem Wannous Hazem Wannous Derek Hendrickx Benjamin Makusa<br />

<strong>2013</strong>-06-03 14h30 M5-A9 107 Essayage 3D des lunettes virtuelles avec une caméra MicrosoHazem Wannous Hazem Wannous Pierre Villoutreix Maxime Chaste<br />

<strong>2013</strong>-06-03 15h00 M5-A9 31 Robot lego machine de Turing Francesco De Comité Eric Wegrzynowski Matthieu Poudroux Ronan Dhellemmes<br />

<strong>2013</strong>-06-03 16h15 M5-A9 55 Extraction d'information textuelles multilingue à partir de flux sLuigi Lancieri Luigi Lancieri Shichen Zhao Amira Kamli<br />

<strong>2013</strong>-06-03 16h45 M5-A9 56 Analyse du buzz sur twitter Luigi Lancieri Luigi Lancieri Florian Michiel Alessio Trunfio<br />

Tuesday 4 June<br />

Date Time Room Project Title Author Supervisor Student1 Student2<br />

<strong>2013</strong>-06-04 13h30 M5-A7 85 Plugin de visualisation 3D pour la consommation énergétique Romain Rouvoy Romain Rouvoy Aurore Allart Benjamin Ruytoor<br />

<strong>2013</strong>-06-04 14h00 M5-A7 109 Réseau de neurones artificiels pour reconnaissance d'émotion Pierre Boulet Pierre Boulet Sanaa Mouatassim<br />

<strong>2013</strong>-06-04 14h30 M5-A7 10 IHM HTML5 pour un simulateur de marchés financiers Yann Secq Philippe Mathieu Thomas Buisine Romain Belmonte<br />

<strong>2013</strong>-06-04 15h00 M5-A7 106 Mise en place d'une application vidéo sur la carte xilinx ZynbqJean-Luc Dekeyser Jean-Luc Dekeyser Quang-Tung Nguyen Antoine B. Kiatoko<br />

<strong>2013</strong>-06-04 15h30 M5-A7 117 Experimentation d'un codeur jpeg sur Homade : une approcheRabie Ben Atitallah Jean-Luc Dekeyser Aurelien Bertiaux<br />

<strong>2013</strong>-06-04 16h15 M5-A7 102 Ecosystèmes virtuels et programmation 3D : spécification et d Samuel Blanquart Samuel Blanquart Lois Arens Yoann Bouquet<br />

<strong>2013</strong>-06-04 09h00 M5-A8 131 [ALTERNANT] étudiant:Jules Ivanic & entreprise:Gfi Thomas Ribeaucoup Jean-Christophe RoutierJules Ivanic<br />

<strong>2013</strong>-06-04 09h30 M5-A8 133 [ALTERNANT] étudiant:Sebastien Leclercq & entreprise:LifedoHerve Fourmeaux Jean-Christophe RoutierSebastien Leclercq<br />

<strong>2013</strong>-06-04 10h15 M5-A8 142 [ALTERNANT] étudiant:Valois Vander-Cruyssen & entreprise:MAnthony Dhondt Jean-Luc Levaire Valois Vander-Cruyssen<br />

<strong>2013</strong>-06-04 10h45 M5-A8 121 [ALTERNANT] étudiant:Loic Allart & entreprise:Vekia Vincent Wauters Laetitia Jourdan Loic Allart<br />

<strong>2013</strong>-06-04 11h15 M5-A8 126 [ALTERNANT] étudiant:Stefan Dochez & entreprise:AlternativeGuillaume Pellien Lionel Seinturier Stefan Dochez<br />

<strong>2013</strong>-06-04 11h45 M5-A8 130 [ALTERNANT] étudiant:Etienne Helluy-Lafont & entreprise:Adv Jeremie Jourdin Pierre Boulet Etienne Helluy-Lafont<br />

<strong>2013</strong>-06-04 12h15 M5-A8 128 [ALTERNANT] étudiant:Thibaut Frain & entreprise:Valipost Thierry Thibaut Philippe Marquet Thibaut Frain<br />

<strong>2013</strong>-06-04 14h00 M5-A8 129 [ALTERNANT] étudiant:Rémi Gosselin & entreprise:J2S Jean-Yves Jourdain Samuel Hym Rémi Gosselin<br />

<strong>2013</strong>-06-04 14h30 M5-A8 139 [ALTERNANT] étudiant:Fabien Piette & entreprise:Recisio Jean-Baptiste Defossez Samuel Hym Fabien Piette<br />

<strong>2013</strong>-06-04 15h00 M5-A8 125 [ALTERNANT] étudiant:Jérôme Desjardins & entreprise:StadlinPascal Farange Marius Bilasco Jérôme Desjardins<br />

<strong>2013</strong>-06-04 15h30 M5-A8 140 [ALTERNANT] étudiant:Cesar Splete & entreprise:Audaxis Vincent Hosatte Marius Bilasco Cesar Splete<br />

<strong>2013</strong>-06-04 16h15 M5-A8 120 [ALTERNANT] étudiant:Romuald Alapide & entreprise:Cap GeJean-Yves Byhet Alexandre Sedoglavic Romuald Alapide<br />

Session chairs<br />

Laetitia Jourdan<br />

Anne-Cécile Caron<br />

Fabrice Aubert<br />

Gery Casiez<br />

Laurent Noé<br />

Mikael Salson


Friday 31 May<br />

Date Time Room Project Title Author Supervisor Student1 Student2<br />

<strong>2013</strong>-05-31 08h30 M5-A7 2 Reconstituer le puzzle : depuis des fragments juMikaël Salson Mikaël Salson Charles Husquin<br />

<strong>2013</strong>-05-31 09h00 M5-A7 4 Alecsia apprend à lire les ODT et PDF Mikaël Salson Mikaël Salson Anthony Tonglet<br />

<strong>2013</strong>-05-31 10h15 M5-A7 78 Evolution de l'application de suivi d'alternants et Marius Bilasco Marius Bilasco Ayoub Nejmeddine Sara El-Arbaoui<br />

<strong>2013</strong>-05-31 10h45 M5-A7 81 Take a photo for me Marius Bilasco Marius Bilasco Jérémie Samson Victor Paumier<br />

<strong>2013</strong>-05-31 11h15 M5-A7 82 Interagir avec votre ordinateur de la tête Marius Bilasco Marius Bilasco Mamadou Diop<br />

<strong>2013</strong>-05-31 11h45 M5-A7 84 Analyse contextuelle de collections de photos prMarius Bilasco Marius Bilasco Benjamin Allaert Benjamin Flahauw<br />

<strong>2013</strong>-05-31 14h30 M5-A7 6 Frameworks PHP et back-offices pour applicationJean-Claude Tarby Jean-Claude Tarby Omar Chahbouni Abderrahime El Idrissi<br />

<strong>2013</strong>-05-31 15h00 M5-A7 8 Intégration des ondes cérébrales dans la vie couJean-Claude Tarby Jean-Claude Tarby Mickaël Duruisseau Nicolas Coyard<br />

<strong>2013</strong>-05-31 16h15 M5-A7 68 Conception d'un Raspberry pi dédié aux présentBruno Bogaert Bruno Bogaert Louis Billiet Sylvain Goulliart<br />

<strong>2013</strong>-05-31 16h45 M5-A7 69 Écosystème pour gestion d'emploi du temps hebBruno Bogaert Bruno Bogaert Dhia Elhak Lakhal Sylvain Malfait<br />

<strong>2013</strong>-05-31 10h15 M5-A8 46 Intégration de Drone à une plateforme logicielle Gwenael Cattez Gwenael Cattez Ali Hedjaz Tony Tran<br />

<strong>2013</strong>-05-31 10h45 M5-A8 65 Moteur de scripts sous iOS Nicolas Haderer, RomainRomain Rouvoy Benjamin Digeon Florent David<br />

<strong>2013</strong>-05-31 11h15 M5-A8 66 Utiliser les téléphones mobiles pour l’estimation Nicolas Haderer Romain Rouvoy Julien Duribreux Justin Dufour<br />

<strong>2013</strong>-05-31 14h00 M5-A8 34 Interface de visualisation de molécules Maude Pupin Laurent Noé Antonia Ludunge<br />

<strong>2013</strong>-05-31 14h30 M5-A8 91 Pipeline d'analyse de régions de cassures Jean-Stéphane Varré Jean-Stéphane Varré Gauvain Marquet<br />

<strong>2013</strong>-05-31 15h00 M5-A8 30 Robot lego solveur de Sudoku Francesco De Comité Leopold Weinberg Oulamine Youssef El Achiqi Anas<br />

<strong>2013</strong>-05-31 16h15 M5-A8 37 Traitement semi-automatique des feuilles de pré Géry Casiez Géry Casiez Alexis Linke Maxence Gaudry<br />

<strong>2013</strong>-05-31 16h45 M5-A8 3 Conception d'un reseau social orienté vidéo Antoine Thomas Antoine Thomas Emmanuel Pede Thomas Besset<br />

<strong>2013</strong>-05-31 08h30 M5-A9 105 Suivi d'un capteur en 3D a l'aide d'une webcam Jean Rioult Sébastien Ambellouis Matthieu Fesselier Guillaume Huylebroeck<br />

<strong>2013</strong>-05-31 09h00 M5-A9 110 Algorithmes de placement en deux dimensions François Clautiaux François Clautiaux Romain Windels<br />

<strong>2013</strong>-05-31 10h15 M5-A9 70 Home Cloud Server Cedric Dumoulin Cedric Dumoulin Lison Gallos Arnaud Caulier<br />

<strong>2013</strong>-05-31 10h45 M5-A9 27 Framework de modélisation dans les Tablettes A Amine El Kouhen Cédric Dumoulin Malika Rakhaoui Fatou-Laye Mbaye<br />

<strong>2013</strong>-05-31 11h15 M5-A9 71 Etude de la spécification des représentations arbCedric Dumoulin Cedric Dumoulin Adrien Burillon Thomas Camberlin<br />

<strong>2013</strong>-05-31 11h45 M5-A9 72 Generateur de GUI Android Cedric Dumoulin Cedric Dumoulin Gerard Paligot<br />

<strong>2013</strong>-05-31 14h00 M5-A9 33 Intégration du support multitouch dans Pharo Stéphane Ducasse Stéphane Ducasse Francois Lepan Benjamin V. Ryseghem<br />

<strong>2013</strong>-05-31 14h30 M5-A9 50 Interaction Kinect pour une application ludique Samuel Degrande Patricia Plenacoste Thomas Crepel Rémi Boens<br />

<strong>2013</strong>-05-31 15h00 M5-A9 108 Développement d'un plugin Eclipse de transformMartin Monperrus Benoit Cornu Amina El-Mekky Ouardia Maiz<br />

Monday 3 June<br />

Date Time Room Project Title Author Supervisor Student1 Student2<br />

<strong>2013</strong>-06-03 14h00 M5-A7 52 Recherche de candidats/jobs sans contact Nabil Djarallah, Nicolas HNabil Djarallah Gens Maxime Camille Riquier<br />

<strong>2013</strong>-06-03 14h30 M5-A7 53 API de contrôle de drones volants Nabil Djarallah, Nicolas PNicolas Petitprez Mohamed Ouannane Jeremy Diaz<br />

<strong>2013</strong>-06-03 15h00 M5-A7 54 Petites annonces en réalité augmentée Nabil Djarallah, Nicolas PNicolas Petitprez Alexandre Raulin Yann Duval<br />

<strong>2013</strong>-06-03 16h15 M5-A7 118 Intégration du uPnP dans le serveur embarqué SGilles Grimaud Gilles Grimaud Edouard Berton Nicolas Ryckembusch<br />

<strong>2013</strong>-06-03 16h45 M5-A7 119 Interface graphique en python pour la commandGilles Grimaud Gilles Grimaud Rabab Bouziane Narjes Jomaa<br />

<strong>2013</strong>-06-03 14h00 M5-A9 20 Capture de mouvement 3D avec une caméra Micr Hazem Wannous Hazem Wannous Derek Hendrickx Benjamin Makusa<br />

<strong>2013</strong>-06-03 14h30 M5-A9 107 Essayage 3D des lunettes virtuelles avec une caHazem Wannous Hazem Wannous Pierre Villoutreix Maxime Chaste<br />

<strong>2013</strong>-06-03 15h00 M5-A9 31 Robot lego machine de Turing Francesco De Comité Eric Wegrzynowski Matthieu Poudroux Ronan Dhellemmes<br />

<strong>2013</strong>-06-03 16h15 M5-A9 55 Extraction d'information textuelles multilingue à pLuigi Lancieri Luigi Lancieri Shichen Zhao Amira Kamli<br />

<strong>2013</strong>-06-03 16h45 M5-A9 56 Analyse du buzz sur twitter Luigi Lancieri Luigi Lancieri Florian Michiel Alessio Trunfio<br />

Tuesday 4 June<br />

Date Time Room Project Title Author Supervisor Student1 Student2<br />

<strong>2013</strong>-06-04 13h30 M5-A7 85 Plugin de visualisation 3D pour la consommationRomain Rouvoy Romain Rouvoy Aurore Allart Benjamin Ruytoor<br />

<strong>2013</strong>-06-04 14h00 M5-A7 109 Réseau de neurones artificiels pour reconnaissaPierre Boulet Pierre Boulet Sanaa Mouatassim<br />

<strong>2013</strong>-06-04 14h30 M5-A7 10 IHM HTML5 pour un simulateur de marchés finaYann Secq Philippe Mathieu Thomas Buisine Romain Belmonte<br />

<strong>2013</strong>-06-04 15h00 M5-A7 106 Mise en place d'une application vidéo sur la carteJean-Luc Dekeyser Jean-Luc Dekeyser Quang-Tung Nguyen<br />

<strong>2013</strong>-06-04 15h30 M5-A7 117 Experimentation d'un codeur jpeg sur Homade : Rabie Ben Atitallah Jean-Luc Dekeyser Aurelien Bertiaux<br />

<strong>2013</strong>-06-04 16h15 M5-A7 102 Ecosystèmes virtuels et programmation 3D : spé Samuel Blanquart Samuel Blanquart Lois Arens Yoann Bouquet<br />

Session chairs<br />

Fabrice Aubert<br />

Gery Casiez<br />

Laurent Noé<br />

Mikael Salson


A few points

Teaching:

1. Bioinfo, Algo [1st semester]

2. PDS, Networks, internship supervision (3), PJI, my friends :-) [2nd semester]

3. New curriculum

Research:

1. PEPS Sand (accepted), ANR BnB (uh ... answer on the 17th), internship, recruitment, code (bugs), evaluations of all kinds ...

2. Reviews (seeds again, lossless this time ...)

3. →


A few points on research

1. Mappi (see Jenya's talk)

2. Peptide Matching (see Yoann's talk)

3. Seeds and Product (a never-ending story) (draft)


Spaced seed design on profile HMMs for precise HTS read-mapping

efficient sliding window product on the matrix semi-group

Laurent Noé

May 28, 2013

Abstract<br />

We propose a new method and its associated algorithm to efficiently compute seed sensitivity when High Throughput Sequencing reads are mapped along sub-parts of a known HMM alignment profile. This computation makes particular sense with positioned spaced seeds. It relies on both automata theory (previous work [KNR06]) and a matrix product problem.

Interestingly, it brings to light an interval product problem considered more than twenty years ago in [AS87], but here with a sliding window aspect: we propose an efficient algorithm to compute this sliding window set of products using a linear number of unit products on the (associative, but non-commutative and non-invertible) matrix semi-group.

This computational scheme is implemented in the ongoing 1.06 version of Iedera, which is available at http://bioinfo.lifl.fr/yass/iedera.php

1 Introduction<br />

Spaced seed design remains an important but complex and challenging problem. Many papers have been devoted to this subject (mainly during the last decade), from the (at first counter-intuitive) idea that such seeds perform better [CR93, Buh02] and can be optimized [MTL02, BK01], to spaced seed sensitivity definition and computation [KLMT04], extended seed models and their computation [BBV05, Bro05, MGB06, CM07, YZ08, II09, KWS+11], and the investigation of bounds and complexity problems [FCLCST05, NR08, MY09, EM11]. Several software tools are now publicly available to design spaced seeds [SB05, NGK10, IIMB11, DDDD+12, Nue11, MHKR12]¹.

High Throughput Sequencing (HTS) technologies have thrown new light on the seed design process, because HTS reads are relatively short and quality-labelled. Some of the most sensitive algorithms to map such reads onto related genomes use spaced seeds (SHRiMP [RLD+09, DDL+11], ZOOM [LZZ+08], BFAST [HMN09], PerM [CSC09], LAST [KWS+11], SToRM [NGK10], ...).

But most of the regular seeds designed within these tools are based on the assumption that the mapped alignment profile remains “unknown”, thus preferring an i.i.d. “randomly” generated profile. There are several (if not many) cases where this assumption can be removed because the profile of what is searched [SB09] or filtered out is known (prior knowledge on the sequences being searched). However, an additional constraint comes from the fact that HTS reads are (most of the time) relatively short compared to the known profile and are thus aligned against any sub-profile extracted from the original profile.

We thus propose in the main part of this paper an extended method to efficiently compute seed sensitivity or the lossless property when HTS reads are mapped on sub-profiles (overlapping windows) of a known HMM alignment profile, which is especially useful when designing positioned spaced seeds. This computation is first known to rely on a dynamic programming algorithm applied on the automaton that recognizes the language matched by the seed, combined with the HMM model [KNR06]. It also depends, due to the sub-profile constraint, on a set of matrix products done along overlapping intervals, which is the idea explored in this paper.

¹ Currently, more than one hundred references have been directly related to the spaced seeds problem; see for example http://www.lifl.fr/~noe/spaced_seeds.html

The interval product problem has been considered in [AS87], whose authors provide an efficient solution in terms of preprocessing, in order to answer any query product with a given constant number of products. We consider this interval product problem with an incremental aspect, using a sliding window, and propose an efficient algorithm to compute it without preprocessing, using an amortized linear number of products on the associative, but non-commutative and non-invertible, matrix semi-group that stores the property being computed (probability, cost, score, ...), itself represented by a semi-ring.

In part 2, we briefly recall the seed design principle, focusing on the seed sensitivity computation. We then state the (matrix) product problem in part 3 and propose a method to solve it. Finally, in part 4, we give some measurements on a practical implementation included in the ongoing 1.06 version of Iedera (http://bioinfo.lifl.fr/yass/iedera.php), before concluding remarks in part 5.

2 Seed design process<br />

Spaced seeds are now a frequently used hashing technique for biological sequence analysis. Their implementation (as a direct hashing method) is straightforward and brings higher sensitivity for the same theoretical selectivity compared to contiguous seeds of an equivalent weight. Interestingly, in practice, a slightly reduced computational cost can even be observed when using spaced seeds compared with contiguous seeds of the same weight.

Spaced seeds have been generalized by several extended seed models (Vector seeds [BBV05], Indel seeds [MGB06], Subset seeds [KNR06, ZF07, YZ08], Neighbor seeds [CM07]). To increase the overall sensitivity, they can usually be designed jointly as multiple seeds [YWC+04, SB05], and (for example on quality-labelled sequences) as positioned seeds [LZZ+08, NGK10].

In addition to the seed model, one needs a selection criterion for good seed shapes: this criterion is (almost always) established on a model of the alignments being matched (usually represented as words on a binary match/mismatch alphabet), itself weighted by a probabilistic/cost/score/... model (possibly any combination of such “semi-groups”). Here again, the initially proposed i.i.d. Bernoulli model [KLMT04] has been extended into Markov models [BKS05] and HMMs [BBV04], with several extensions of its parametrization [MB07, CP10].

In practice, the criterion considered to select good spaced seed shapes is “the probability to hit at least once” (sensitivity), or “the guarantee to always hit at least once” (lossless property). Such criteria can then be measured by a dynamic programming algorithm based on the decomposition of alignment word suffixes detected by the seed [KLMT04, BK03], or more directly on the regular language recognized by the seed, itself compiled into a deterministic finite automaton [BKS05, KNR06, HR08].
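To make the two criteria concrete, here is a minimal brute-force sketch in Python. It is not the dynamic programming or automaton-based method cited above: it simply enumerates every binary alignment of a given length, which is only practical for very short alignments. The function names, the `'1'`/`'*'` seed encoding and the i.i.d. Bernoulli parameter `p_match` are assumptions made for illustration.

```python
from itertools import product


def seed_hits(seed: str, alignment: str) -> bool:
    """Return True if the seed matches the alignment at some position.

    seed: '1' is a must-match position, '*' a don't-care position.
    alignment: word over {'1' (match), '0' (mismatch)}.
    """
    span = len(seed)
    for start in range(len(alignment) - span + 1):
        window = alignment[start:start + span]
        if all(s == '*' or a == '1' for s, a in zip(seed, window)):
            return True
    return False


def sensitivity(seed: str, length: int, p_match: float) -> float:
    """Probability to hit at least once under an i.i.d. Bernoulli model."""
    total = 0.0
    for bits in product('01', repeat=length):
        alignment = ''.join(bits)
        if seed_hits(seed, alignment):
            k = alignment.count('1')
            total += p_match ** k * (1 - p_match) ** (length - k)
    return total


def is_lossless(seed: str, length: int, max_mismatches: int) -> bool:
    """Guarantee to hit every alignment with at most max_mismatches mismatches."""
    return all(
        seed_hits(seed, ''.join(bits))
        for bits in product('01', repeat=length)
        if bits.count('0') <= max_mismatches
    )


if __name__ == "__main__":
    print(sensitivity("1*1", 3, 0.5))   # 0.25
    print(is_lossless("1*1", 5, 1))     # True
    print(is_lossless("1*1", 5, 2))     # False
```

The enumeration costs 2^length seed checks, which is exactly what the automaton-based computations avoid.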

3 Matrices product<br />

Given an automaton for the language recognized by the seed, and given a model (probabilistic/cost/score model) provided by a transducer, it is possible to compute properties (probabilities, costs, scores, ...) of the initial language (see the illustrative example provided in Figure 1 for probabilities). In practice, the resulting matrices obtained from the model and the seed language are multiplied and/or powered; the computation “within matrices” is performed on “semi-rings” representing the properties. For example, language probabilities are computed on a classical semi-ring $(E = \mathbb{R}_{0 \le r \le 1},\ \oplus = +,\ \odot = \cdot,\ 0_\oplus, \varepsilon_\odot = 0,\ 1_\odot = 1)$, whereas language costs (respectively scores) are computed on a tropical semi-ring [Sim88, Pin98, MS09, Moh09] $(E = \mathbb{R},\ \oplus = \min,\ \odot = +,\ 0_\oplus, \varepsilon_\odot = \infty,\ 1_\odot = 0)$ (respectively $(E = \mathbb{R},\ \oplus = \max,\ \odot = +,\ 0_\oplus, \varepsilon_\odot = -\infty,\ 1_\odot = 0)$ for scores).
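The following Python sketch shows how one matrix product routine can serve both the classical and the tropical (min, +) semi-rings, simply by swapping the two operations and the additive zero. The tuple layout `(add, mul, zero)` and the names `CLASSICAL`, `TROPICAL_MIN` and `mat_mul` are illustrative assumptions, not taken from Iedera.

```python
import math

# A semi-ring is described here by (add, mul, zero), where zero is the
# neutral element of add (and the absorbing element of mul).
CLASSICAL = (lambda a, b: a + b, lambda a, b: a * b, 0.0)
TROPICAL_MIN = (min, lambda a, b: a + b, math.inf)


def mat_mul(A, B, semiring):
    """Matrix product of A (n x m) and B (m x p) over the given semi-ring."""
    add, mul, zero = semiring
    n, m, p = len(A), len(B), len(B[0])
    C = [[zero] * p for _ in range(n)]
    for r in range(n):
        for c in range(p):
            acc = zero
            for k in range(m):
                acc = add(acc, mul(A[r][k], B[k][c]))
            C[r][c] = acc
    return C


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.0, 1.0]]              # transition probabilities
    print(mat_mul(P, P, CLASSICAL))           # [[0.25, 0.75], [0.0, 1.0]]
    W = [[1, 3], [math.inf, 2]]               # transition costs
    print(mat_mul(W, W, TROPICAL_MIN))        # [[2, 4], [inf, 4]]
```

In the classical case an entry accumulates the probability of all two-step paths; in the tropical case it keeps the cost of the cheapest two-step path, with `inf` marking an impossible transition.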

In practice, for a set of seeds (and in general for any regular expression), the same algorithm [KNR06, MHKR12] can be applied on both classical and tropical semi-rings: it computes, for example, either the seed sensitivity on the classical semi-ring for what is commonly named the lossy seed design framework, or the minimal cost, and thus the lossless property, on the tropical semi-ring for the lossless seed design framework [NGK10]. Note also that it can be adapted to a score framework, provided the problem is clearly defined (e.g. [KNP04]).

Figure 1: Product of the seed 1*1 automaton with an ad hoc probabilistic model. The seed automaton has states q1 to q5, the model has states p1 and p2, and the product automaton has states (q1×p1) to (q5×p2), with transition probabilities 1/7, 2/7, 4/7 on the p1 rows and 3/11, 1/11, 7/11 on the p2 rows.

In the lossy framework, HMMs are frequently used in biological sequence and alignment representation (for example as profile HMMs [Edd98])². They can thus be easily applied to seed sensitivity computation [BBV04, KNR06, HR08]: they give a set of probabilities (emission probabilities for each state, together with transition probabilities between states) that are computed out of a profile alignment. But when such HMMs have to be used with HTS reads to design seeds, one must face a new problem: the read can be any sub-part of the HMM (HMM local alignment), and thus the computation may start at any “position” on the alignment HMM. Designing seeds is in some way more challenging when one needs to know precisely the hit probability of a set of (positioned) seeds for each window along the HMM.

3.1 Sliding window product<br />

Such a computation, translated into matrix form, requires computing, for a list of (non-invertible) matrices $M_0, M_1, M_2, \ldots, M_{n-1}$, a set of products in one of the two following forms:

Problem. compute

$$\prod_{u=i}^{i+w} M_u \quad \forall i \in [0..n-w-1] \qquad (1)$$

where $w$ is the length of the read, or more generally:

Problem. compute

$$\prod_{u=i(t)}^{j(t)} M_u \quad \forall t \text{ with } 0 \le i(t) \le j(t)$$

² Notice also that Position Weight Matrices (PWM) with indels, such as those used for example in Prosite, can be seen as a rough equivalent of the profile HMM in the tropical semi-ring...

… k).
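As a point of comparison, problem (1) can be solved naively by recomputing every window from scratch. The Python sketch below is such a baseline, not the amortized scheme of this paper; it counts unit products to show the (n − w) · w cost that the sliding window algorithm is meant to avoid. The function name and the generic `mul` argument (any associative product) are assumptions for illustration; strings under concatenation stand in for matrices because concatenation is associative, non-commutative and non-invertible.

```python
def naive_window_products(M, w, mul):
    """Compute prod_{u=i}^{i+w} M[u] for every i in [0..n-w-1].

    Returns (list of window products, number of unit products used).
    """
    n = len(M)
    results, count = [], 0
    for i in range(n - w):
        acc = M[i]
        for u in range(i + 1, i + w + 1):
            acc = mul(acc, M[u])
            count += 1
        results.append(acc)
    return results, count


if __name__ == "__main__":
    M = list("abcdef")                         # stand-ins for M_0..M_5
    products, cost = naive_window_products(M, 2, lambda a, b: a + b)
    print(products)                            # ['abc', 'bcd', 'cde', 'def']
    print(cost)                                # 8
```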

Maintaining such matrices $U_k$ for $k \in [i..j]$ costs at most (in amortized analysis) one product per $j$-increase (see Appendix 7.2). Note that increasing $i$ simply deletes the last $U_i$ and thus does not cost any additional product on the $U_k$'s. A pseudo-code of the add right process (increment of $j$) is provided in Algorithm 1.

Figure 2: $U_k$ matrices: example when $i = 0$ and $j = 24$

Without assuming that any previous computation is kept, it is directly possible to compute the $M_{i..j}$ product, as $M_i \times M_{i+1} \times \cdots \times M_j$ for any $i, j$ ($j > i$), in $O(\log(j - i))$ products using the updated $U_k$ set of matrices for $k \in [i..j]$ (see Appendix 7.1).
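Appendix 7.1 is not reproduced here, so the following Python sketch is only one plausible reading of that query: walk from i to j, multiplying by the block stored at the current position and jumping over it. It assumes that U[k] holds the product of the largest aligned power-of-two block starting at k and ending no later than j, as Figures 2 and 3 suggest. The helper names `block_size`, `build_blocks` and `interval_product` are hypothetical, and `build_blocks` fills U naively just to keep the example self-contained.

```python
def block_size(k, j):
    """Size of the largest aligned power-of-two block starting at k within [k..j]."""
    size = 1
    while k % (2 * size) == 0 and k + 2 * size - 1 <= j:
        size *= 2
    return size


def build_blocks(M, j, mul):
    """U[k] = M[k] x ... x M[k + block_size(k, j) - 1] (naive construction)."""
    U = []
    for k in range(j + 1):
        acc = M[k]
        for u in range(k + 1, k + block_size(k, j)):
            acc = mul(acc, M[u])
        U.append(acc)
    return U


def interval_product(U, i, j, mul):
    """Return (M[i] x ... x M[j], number of products) by jumping block to block."""
    acc, used = U[i], 0
    k = i + block_size(i, j)
    while k <= j:
        acc = mul(acc, U[k])
        used += 1
        k += block_size(k, j)
    return acc, used


if __name__ == "__main__":
    letters = "abcdefghijklmnopqrstuvwxy"      # 25 stand-ins for M_0..M_24
    concat = lambda a, b: a + b
    U = build_blocks(list(letters), 24, concat)
    print(interval_product(U, 1, 24, concat))  # ('bcdefghijklmnopqrstuvwxy', 5)
```

With i = 1 and j = 24 the walk visits the blocks starting at 1, 2, 4, 8, 16 and 24: block sizes first grow and then shrink, which is where the logarithmic bound comes from.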

But if the product is computed when $i$ and $j$ follow two monotonically $\binom{+0}{+1}$-increasing functions, the number of products can be reduced to an (amortized) constant for each $i$ and $j$ step-move (or for both moves).

3.3.2 Middle m definition and Mi..j product update<br />

Figure 3: $U_k$ matrices: example when $i = 1$ and $j = 24$ (middle $m = 16$)

To split the computation when only $i$ or $j$ is moved, we need to define the middle $m$ of $i$ and $j$. It is defined as the beginning position of the maximal (in size) $U$-block included in the interval $i..j$. If two equal-size maximal blocks lie between $i$ and $j$, we choose $m$ as the one that is the most factorized by two, which corresponds³ to the beginning of the right maximal block (see Figure 3). This middle border makes it possible to split the computation in two parts when needed, which we will call left (colored in green in Figure 3) and right (red in Figure 3). Note that $m < \frac{1}{3}i + \frac{2}{3}j$. Note also that when there is only one maximal sized block, then $m < \frac{1}{2}i + \frac{1}{2}j$, and when there are two maximal sized blocks, then $m > \frac{2}{3}i + \frac{1}{3}j$.
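The definition of the middle can be checked on the two figures with a short Python sketch. It scans every start position in [i..j] instead of using whatever update rule the paper intends (that rule is not described in this part), and it assumes, as in Figures 2 and 3, that the block starting at k is the largest aligned power-of-two block ending no later than j. The helper names are hypothetical.

```python
def block_size(k, j):
    """Size of the largest aligned power-of-two block starting at k within [k..j]."""
    size = 1
    while k % (2 * size) == 0 and k + 2 * size - 1 <= j:
        size *= 2
    return size


def middle(i, j):
    """Start of the maximal U-block in [i..j]; ties go to the right-most block."""
    best_k, best_size = i, block_size(i, j)
    for k in range(i + 1, j + 1):
        size = block_size(k, j)
        if size >= best_size:          # '>=' keeps the right-most maximal block
            best_k, best_size = k, size
    return best_k


if __name__ == "__main__":
    print(middle(0, 24))   # 0  (single maximal block 0..15)
    print(middle(1, 24))   # 16 (two maximal blocks 8..15 and 16..23)
```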

³ Proof: the other choice would imply that the two maximal left and right blocks could be merged, which contradicts the “maximality” of the left block; thus only the right block can be increased in size. To conclude: for two contiguous blocks of equal size, the right block is factorizable by at least one more power of two than the left block.

Algorithm 1: add right: increments the right border j, and updates the set Ui..j using Mj

Input:
• M0, M1, M2, ..., Mn−1 : original matrices.

Global:
• i, j : integers,
• Ui, ..., Uj : original and updated set of matrices.

Local:
• u, t, told : integers.

/* a) only before the first increment */
if j = 0 then
    U0 ← M0;
/* b) increment j */
inc(j);
/* c) and process the subset of Uj−t matrices that have to be updated */
Uj ← Mj;
u ← j + 1; told ← 0; t ← 1;
while u is even and j − t ≥ i do
    Uj−t ← Uj−t × Uj−told;
    told ← t; t ← 2.t + 1; u ← u/2;
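Algorithm 1 translates almost line for line into Python. The sketch below is a transcription under a few assumptions: U is a dictionary indexed by position, the borders i and j are passed explicitly instead of being global, and `mul` is any associative product (string concatenation in the demo, as a non-commutative stand-in for matrices).

```python
def add_right(M, U, i, j, mul):
    """Increment the right border j and update the blocks U[i..j]; return the new j."""
    # a) only before the first increment
    if j == 0:
        U[0] = M[0]
    # b) increment j
    j += 1
    # c) process the U[j - t] blocks that have to be updated
    U[j] = M[j]
    u, t_old, t = j + 1, 0, 1
    while u % 2 == 0 and j - t >= i:
        U[j - t] = mul(U[j - t], U[j - t_old])
        t_old, t, u = t, 2 * t + 1, u // 2
    return j


if __name__ == "__main__":
    letters = "abcdefghijklmnopqrstuvwxy"      # stand-ins for M_0..M_24
    M, U, j = list(letters), {}, 0
    for _ in range(24):
        j = add_right(M, U, 0, j, lambda a, b: a + b)
    print(U[0], U[16], U[24])                  # abcdefghijklmnop qrstuvwx y
```

After 24 calls with i = 0 the blocks match Figure 2: U[0] covers positions 0 to 15, U[16] covers 16 to 23 and U[24] is the single position 24. The run uses 22 products for 24 increments, in line with the amortized bound of one product per j-increase.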

In the next part, we will compute in two separate parts Mi..m−1 and Mm..j, considering the case when<br />

m is fixed first, and then two cases when m is increased.<br />

middle unchanged: if we suppose that the middle $m$ does not change during a computational step, the following can be observed:

• when $j$ is increased (so that $j = j_{old} + 1$), updating the product $M_{m..j}$ can be done with one product, considering that we keep the previous computation $M_{m..j_{old}}$. Thus, considering that we also update the $U_k$'s values at the same time, an amortized single product must be added (amortization on $j$: see Appendix 7.2). Joining $M_{i..m-1}$ with $M_{m..j}$ will then cost one extra product, giving a total number of products of three.

• when $i$ is increased ($i = i_{old} + 1$), the previous computation $M_{i_{old}..m-1}$ does not help and can be erased here. However, if we suppose that we keep all the previously computed products $M_{k..m-1}$ in a stack for all the blocks $U_k$ visited before, reusing and updating this part can be done with one single amortized product (amortization on $m$: see Appendix 7.3). Joining $M_{i..m-1}$ and $M_{m..j}$ will then cost one extra product, giving a total number of products of two.
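The two observations above can be sketched in Python for a fixed middle m. This is a simplified illustration, not the paper's algorithm: the left stack is filled one matrix at a time rather than by jumping over U-blocks, and the class and method names are hypothetical. It keeps a stack of left products M[k..m−1] and a single running right product M[m..j], and joins them with one product per query.

```python
class FixedMiddleWindow:
    """Window [i..j] around a fixed middle m, for any associative product mul."""

    def __init__(self, M, i, m, mul):
        self.M, self.mul = M, mul
        self.i, self.m, self.j = i, m, m - 1   # right part starts empty
        self.left = []                         # left[-1] is M[i..m-1]
        acc = None
        for k in range(m - 1, i - 1, -1):      # build M[k..m-1] from right to left
            acc = M[k] if acc is None else mul(M[k], acc)
            self.left.append(acc)
        self.right = None                      # running product M[m..j]

    def inc_j(self):
        """Extend the right border: one product on the right part."""
        self.j += 1
        x = self.M[self.j]
        self.right = x if self.right is None else self.mul(self.right, x)

    def inc_i(self):
        """Advance the left border: pop the stack, no product needed."""
        if not self.left:
            raise ValueError("i would cross the middle m")
        self.left.pop()
        self.i += 1

    def product(self):
        """Join M[i..m-1] and M[m..j] with at most one extra product."""
        if not self.left:
            return self.right
        if self.right is None:
            return self.left[-1]
        return self.mul(self.left[-1], self.right)


if __name__ == "__main__":
    w = FixedMiddleWindow(list("abcdefgh"), i=1, m=4, mul=lambda a, b: a + b)
    for _ in range(3):
        w.inc_j()
    print(w.product())   # bcdefg
    w.inc_i()
    print(w.product())   # cdefg
```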


At first glance, a $\{cost(i) \le i + m;\ cost(j) \le 3j\}$ cost is applied when $m$ does not change. Otherwise this computation has to be updated, which is considered in the next part:

middle changed: if we suppose that the middle $m$ does change, then the previous computation cut in two parts $M_{i..m-1}$ and $M_{m..j}$ is somehow “compromised”. Let us now see when $m$ changes, and moreover, why:



• whenmchangesduetoa j-increase, asmfollowsthebeginningofthelargestright-mostUk block, j can<br />

increase the maximal block size by two without changing m (case handled before, corresponding<br />

to one single maximal block), or j can make m jump to the next power of two “potential” block,<br />

thus from mold = odd×2 p to m =(odd+1)×2 p = odd+1<br />

2 ×2 p+1 (case not handled, that corresponds<br />

to two maximal blocks of equal size, the right-most being now the “m one”): This last case has no<br />

consequence on the product Mm..j that is immediately computed by the update of the Uk’s values as<br />

Mm..j corresponds to the right-most maximal block in Uk,thusinone single product here (and<br />

not two as shown before).<br />

However, moving m will obviously compromise the left stack of Mk..mold−1 previous computations that<br />

will now not help the computation of the next Mi..mold−1 on the next i-increase, since mold is now<br />

pushed to the next power of two m, and can be erased. This cost can however be bound by a log2( ∆m 2 )<br />

where ∆m represents the m increase (see Appendix 7.4).<br />

At the end, joining Mi..m−1 with Mm..j will cost one extra product.<br />

Using an amortization on m and j and combining the two j-increase cases (when m does change, or<br />

not) gives a cost(j) ≤ 3j + 1 8m (see Appendix 7.4)<br />

• When i is increased so that i > m (thus i = m + 1): m (which corresponds to the largest right-most block) can only "jump" to a next block of smaller size. The cost on the left stack [i..m − 1] is already paid, as it corresponds to a "legal" move of i that is amortized by one product as seen previously (amortization on m, see Appendix 7.3).

However, moving m will obviously compromise the right computation of $M_{m_{old}..j}$, since $m_{old}$ is now pushed to the next (smaller) block: this product can be erased and must be recomputed. This cost can however be bounded by $\log_2(\frac{\Delta m}{2})$, where $\Delta m$ represents the m increase (see Appendix 7.5).

In the end, joining $M_{i..m-1}$ and $M_{m..j}$ costs one extra product.

Using an amortization on m and i and combining the two i-increase cases (whether m changes or not) gives $cost(i) \le i + \frac{9}{8} m$ (see Appendix 7.5).

To conclude, the costs $\{cost(i) \le i + \frac{9}{8} m;\ cost(j) \le 3j + \frac{1}{8} m\}$ apply.
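The jump of the middle on a j-increase, from $m_{old} = odd \times 2^p$ to $(odd+1) \times 2^p$, can be sketched with a few bit operations. This is only an illustration of the arithmetic: the function names `decompose` and `next_middle` are hypothetical and are not taken from Iedera, and the values 36 and 40 are the ones labelled on Figure 9.

```python
def decompose(m: int) -> tuple[int, int]:
    """Write m as odd * 2**p and return (odd, p). Assumes m > 0."""
    if m <= 0:
        raise ValueError("m must be a positive integer")
    p = (m & -m).bit_length() - 1  # exponent of the largest power of two dividing m
    return m >> p, p


def next_middle(m_old: int) -> int:
    """Jump from m_old = odd * 2**p to (odd + 1) * 2**p (hypothetical helper)."""
    odd, p = decompose(m_old)
    return (odd + 1) << p


m_old = 36                     # 9 * 2**2
m = next_middle(m_old)         # 10 * 2**2 = 5 * 2**3
delta_m = m - m_old            # size of the block that m jumped over
print(m_old, m, delta_m, decompose(m))
```

With these values, $\Delta m = 4 = 2^p$, and the new middle decomposes with a strictly larger power of two, as stated in the text.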

3.3.3 Pseudocode<br />

To illustrate the computation described above, an associated pseudocode is given in Algorithm 2. The proposed algorithm returns the $M_{i..j}$ product (still defined as $M_i \times M_{i+1} \times \cdots \times M_j$). It can only be applied once the $U_{i..j}$ matrices have been updated by Algorithm 1. The main global data structure used in Algorithm 2 is a stack of matrices, left products to m stack, that keeps a set of products $M_{k..m-1}$ (where k

is i ≤ k pairs,<br />

• right product from m : < matrix,int > pair.<br />

Local:<br />

• Pleft,Pright : matrices,<br />

• kleft,kright : integers.<br />

Result:<br />

• the product Mi..j<br />

/* a) update m (update algorithm not described here) */
m_old ← m ; m ← update(m,i,j);

/* a.1) reset all global variables when m changes */
if m_old ≠ m then
    left products to m stack ← ∅;
    right products from m ← ;

/* b) update the left stack products */
/* b.1) remove stacked products that are not useful */
while left products to m stack ≠ ∅ and top(left products to m stack).int < i do
    pop(left products to m stack);

if left products to m stack ≠ ∅ then
    < P_left, k_left > ← top(left products to m stack);
else
    ← ;

/* b.2) and compute / stack left products from i to m */
while k_left > i do
    k_left ← k_left − size of block before(k_left,i);
    P_left ← U_{k_left} × P_left;
    push(left products to m stack, < P_left, k_left >);

/* c) compute the right product from m to j */
< P_right, k_right > ← right product from m;

while kright


1<br />

4 Experiments on seed design<br />

The previous algorithm has been implemented and tested in Iedera, where it can now be activated with the -ll option (see http://bioinfo.lifl.fr/yass/iedera.php).

We designed spaced seeds on reads using an alignment model obtained from a profile HMM. On a typical example, for a read/window length of 100 (respectively 200), which corresponds to a currently observed Illumina single read length (respectively two merged reads), and a simplified⁴ profile HMM alignment of size 1605 (from a 16S rRNA database), the number of products required for the full computation of the 1605 − 100 + 1 = 1506 windows of length 100 was 5931⁵ (respectively 5720 products for the 1605 − 200 + 1 = 1406 windows of length 200). This must be compared to the number of products for the naive range algorithm, ≈ 150000 (respectively ≈ 300000).
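The orders of magnitude quoted above can be checked with a short computation. The sketch below assumes that the naive range algorithm performs about one matrix × vector product per position of each window (an assumption made here for illustration, consistent with the ≈ 150000 and ≈ 300000 figures); the function names are hypothetical.

```python
def window_count(model_length: int, window_length: int) -> int:
    """Number of windows of a given length sliding along the model."""
    return model_length - window_length + 1


def naive_products(model_length: int, window_length: int) -> int:
    """Assumed naive cost: one product per position of every window."""
    return window_count(model_length, window_length) * window_length


# Product counts reported in the text for the proposed algorithm.
reported = {100: 5931, 200: 5720}

for w, proposed in reported.items():
    naive = naive_products(1605, w)
    print(f"window {w}: {window_count(1605, w)} windows, "
          f"naive ~{naive} products, proposed {proposed} products, "
          f"ratio ~{naive / proposed:.1f}")
```

Keep in mind that the two counts are not directly comparable in time, since the naive products are matrix × vector products whereas the proposed ones are matrix × matrix products.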

However, it must be noted that the products required by the naive algorithm are less time consuming (matrix × vector products) than those of our algorithm (matrix × matrix products). We thus compared the execution time of both approaches under the conditions proposed above, for window lengths of both 100 and 200. We conducted two experiments, one on spaced seeds and the other on positioned spaced seeds. For the first experiment, the seeds were set at every position along the HMM, and the sensitivity was computed on all the windows along the HMM. For the second experiment, we additionally fixed the number of positions along the HMM (10, 20, 40, 80, 160, 320, 640, 1280) where seeds were set: seed positions were drawn, and the sensitivity was again computed on all the windows along the HMM.

For both spaced seeds and positioned spaced seeds, we have chosen seeds of weight w ranging from 8 to 12 (Figures 4 and 5: x-axis bottom label) and span s ranging from w to 2 × w (Figures 4 and 5: x-axis top label). For each pair (w, s), we have computed the sensitivity on 100 seeds and measured the elapsed time (Figures 4 and 5: y-axis label). Note that the set of seeds (respectively the set of positions for each seed) was identical for both methods being evaluated. The computation was carried out exclusively on an HP Z800 computer (Intel(R) Xeon(R) CPU E5620 @ 2.40GHz) with 20 GB of RAM (in practice, no more than 20% of the RAM was used), using a single thread.

The results are illustrated in Figure 4 for one single seed, and in Figure 5 for a set of two seeds: they show a substantial improvement in almost all the cases considered in the experiments.

A twofold speedup is observed in the most time consuming problems: this happens for seeds of the largest span in the set. This is the worst case, in the sense that large and dense matrices are produced. In practice, for seeds of reasonable span/weight ratio (e.g. ≤ 1.8), the speedup over the naive algorithm is at least fourfold on non positioned seeds. The practical speedup for positioned seeds is less obvious on middle span seeds, but appears to increase if the seeds are of small or very large span, and when the number of positions increases. Finally, it must be noted that on non positioned seeds, increasing the window length from 100 to 200 has a strong impact on the overall performance.

5 Concluding remarks<br />

First, it is very likely that the bounds proposed at the end of section 3 could be improved by a more precise analysis. However, going under a bound of 3.0 per move while computing the window product for each move is unlikely (at least without any initial amortized cost), since we have found at least one example such that the amortized number of products is $\frac{9253}{3081} \approx 3.0032457$ per move⁶.

Moreover, when j is increased by "runs" while i is fixed, the proposed algorithm can be enhanced with a greedy computation of the $M_{i..j}$ product (which can be done quickly provided that i stays fixed for a while). In practice, this implementation always requires at most as many products as the proposed one, but still has to be carefully analyzed.

⁴ Only matching states are kept: insertion and deletion states are removed, but we keep track of transitions between matching ↔ indel states to generate a break between contiguous blocks of matches. We thus keep the indel even without its length, and without any letter it may add here, since they are not supposed to match any seed.

⁵ It must be noted that each window needs a displacement on both i and j.

⁶ 1,2,−1,3..24,−2,−3,25..51,−4,52..72,−5,73..392,−6,393..441,−7,−8,442..577,−9,578..3071, where i-moves are given with a minus sign.

Figure 4: Iedera speed improvement for one seed. [Plots omitted. Four panels: positioned seeds and non positioned seeds, for window lengths 100 and 200. x-axis: weight (8 to 12, bottom label) and span (top label); y-axis: time (seconds), logarithmic scale. Legend: naive range algorithm, proposed algorithm, proposed algorithm slower / naive range algorithm faster; × 2 marks are drawn on the plots.]


Figure 5: Iedera speed improvement for two seeds. [Plots omitted. Four panels: positioned seeds and non positioned seeds (two seeds), for window lengths 100 and 200. x-axis: weight (8 to 12, bottom label) and span (top label); y-axis: time (seconds), logarithmic scale. Legend: naive range algorithm, proposed algorithm, proposed algorithm slower / naive range algorithm faster; × 2 marks are drawn on the plots.]

Finally, it is also very likely that the "binary block division" of the algorithm may be modified and analyzed to get an even better bound. For example, some sorting algorithms such as Smoothsort [Dij82] partition data using Leonardo numbers (close relatives of the Fibonacci numbers), rather than following the "binary tree" of the classical Heap Sort. Another interesting point of view is to consider the size of each matrix (here unknown most of the time, and unfortunately square in trivial cases), in order to combine the sliding window problem with the classical matrix chain multiplication problem for the overall computation.

From a more practical point of view, matrix products used within the algorithm, when applied on sparse/non sparse matrices, cannot be considered as a "constant" operation, but rather as a "function of the sparsity". However, an implementation exploiting this needs to know the "sparsity cost" of all the possible products, which, in practice on unknown automata, is not predictable: estimating it amounts to simulating the product, and thus costs as much as the product itself. In Iedera, we have chosen to represent matrices (and the associated product) with both possibilities: the algorithm chooses, for each matrix row, a sparse implementation if less than 20% of the cells are present, or a dense implementation otherwise. However, it is still possible to get very high costs with the full matrix product: an alternative solution would be to combine elements of the naive range product with sub-computations from the proposed algorithm when such cases appear.
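The per-row choice between a sparse and a dense representation can be sketched as follows. This is a minimal illustration and not the Iedera implementation: the names `encode_row` and `row_times_vector` are hypothetical, and "present" cells are assumed here to mean non-zero cells.

```python
SPARSE_THRESHOLD = 0.20  # below 20% of present cells, a row is stored sparsely


def encode_row(row):
    """Return ("sparse", [(column, value), ...]) or ("dense", [values...])."""
    present = [(j, v) for j, v in enumerate(row) if v != 0]
    if len(present) < SPARSE_THRESHOLD * len(row):
        return ("sparse", present)
    return ("dense", list(row))


def row_times_vector(encoded_row, vector):
    """Dot product of an encoded row with a dense vector."""
    kind, data = encoded_row
    if kind == "sparse":
        # Only the stored (column, value) pairs contribute.
        return sum(v * vector[j] for j, v in data)
    return sum(v * x for v, x in zip(data, vector))


sparse_row = encode_row([0, 0, 0, 0, 0, 3])   # 1 cell out of 6 is present
dense_row = encode_row([1, 0, 2, 0])          # 2 cells out of 4 are present
vector6 = [1, 2, 3, 4, 5, 6]
vector4 = [1, 2, 3, 4]
print(sparse_row[0], row_times_vector(sparse_row, vector6))
print(dense_row[0], row_times_vector(dense_row, vector4))
```

Deciding row by row keeps the test cheap (one pass over the row) while avoiding the need to predict the cost of a whole matrix product in advance.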

Finally, a last aspect that can be taken into account is to parallelize the block product carefully, since it relies heavily on separate calculations for the same window. The naive algorithm is in fact more difficult to parallelize efficiently within each window, for at least two reasons: first, there is a flow dependency between the products; worse, within each product, synchronization is needed when accessing the post-computation vector, unless the computation is reversed so that this vector is considered first, which implies switching the access to matrix cells from row-first to column-first.

6 Acknowledgments<br />

This research was supported by the ANR project MAPPI (ANR-2010-COSI-004-02), LIFL (UMR CNRS 8022 Université de Lille 1) and Inria Lille Nord-Europe. Project MAPPI is associated with the Tara Oceans expedition where the principal tasks involve the development of new software for mapping and assembling metagenomic and metatranscriptomic data.

References<br />

[AS87] Noga Alon and Baruch Schieber. Optimal preprocessing for answering on-line product queries. Technical Report TR 71/87, Inst. of Comp. Science, Tel-Aviv Univ., 1987.

[BBV04] Broňa Brejová, Daniel G. Brown, and Tomáš Vinař. Optimal spaced seeds for homologous coding regions. Journal of Bioinformatics and Computational Biology, 1(4):595–610, Jan 2004. (earlier version in CPM 2003). URL: http://www.worldscinet.com/jbcb/01/0104/S0219720004000326.html, doi:10.1142/S0219720004000326.

[BBV05] Broňa Brejová, Daniel G. Brown, and Tomáš Vinař. Vector seeds: An extension to spaced seeds. Journal of Computer and System Sciences, 70(3):364–380, 2005. (earlier version in WABI 2003). URL: http://linkinghub.elsevier.com/retrieve/pii/S0022000004001527, doi:10.1016/j.jcss.2004.12.008.

[BK01] Stefan Burkhardt and Juha Kärkkäinen. Better filtering with gapped q-grams. In Proceedings of the 12th Symposium on Combinatorial Pattern Matching (CPM), volume 2089 of Lecture Notes in Computer Science, pages 73–85. Springer, July 2001. URL: http://www.springerlink.com/content/gykw51mpjqnwrmqx, doi:10.1007/3-540-48194-X_6.

[BK03] Stefan Burkhardt and Juha Kärkkäinen. Better filtering with gapped q-grams. Fundamenta Informaticae, 56(1-2):51–70, 2003. Preliminary version in Combinatorial Pattern Matching 2001. URL: http://iospress.metapress.com/content/8ad9p3mqeday8vt5.

[BKS05] Jeremy Buhler, Uri Keich, and Yanni Sun. Designing seeds for similarity search in genomic DNA. Journal of Computer and System Sciences, 70(3):342–363, 2005. (earlier version in RECOMB 2003). URL: http://linkinghub.elsevier.com/retrieve/pii/S0022000004001515, doi:10.1016/j.jcss.2004.12.003.

[Bro05] Daniel G. Brown. Optimizing multiple seeds for protein homology search. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2(1):29–38, January 2005. (earlier version in WABI 2004). URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1416848, doi:10.1109/tcbb.2005.13.

[Buh02] Jeremy Buhler. Provably sensitive indexing strategies for biosequence similarity search. In RECOMB, Washington, DC (USA), pages 90–99. ACM Press, April 2002. URL: http://doi.acm.org/10.1145/565196.565208, doi:10.1145/565196.565208.

[CM07] Miklós Csűrös and Bin Ma. Rapid homology search with neighbor seeds. Algorithmica, 48(2):187–202, Jun. 2007. (earlier version in COCOON 2005). URL: http://www.springerlink.com/content/45446712u14n0416, doi:10.1007/s00453-007-0062-y.

[CP10] Won-Hyoung Chung and Seong-Bae Park. Hit integration for identifying optimal spaced seeds. BMC Bioinformatics - Selected articles from the 8th Asia-Pacific Bioinformatics Conference (APBC), 18-21 January, Bangalore, India, 11(Suppl 1):S37, 2010. URL: http://www.biomedcentral.com/1471-2105/11/S1/S37, doi:10.1186/1471-2105-11-S1-S37.

[CR93] Andrea Califano and Isidore Rigoutsos. Flash: A fast look-up algorithm for string homology. In Proceedings of the 1st International Conference on Intelligent Systems for Molecular Biology (ISMB), pages 56–64, July 1993.

[CSC09] Yangho Chen, Tate Souaiaia, and Ting Chen. PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds. Bioinformatics, 25(19):2514–2521, 2009. URL: http://bioinformatics.oxfordjournals.org/content/25/19/2514, doi:10.1093/bioinformatics/btp486.

[DDDD+12] Dong Do Duc, Huy Q. Dinh, Thanh Hai Dang, Kris Laukens, and Xuan Huan Hoang. AcoSeeD: An ant colony optimization for finding optimal spaced seeds in biological sequence search. In Proceedings of the 8th International Conference on Swarm Intelligence (ANTS), Brussels (Belgium), volume 7461 of Lecture Notes in Computer Science, pages 204–211. Springer, 2012. URL: http://www.springerlink.com/content/n1476j612302410k/, doi:10.1007/978-3-642-32650-9_19.

[DDL+11] Matei David, Misko Dzamba, Dan Lister, Lucian Ilie, and Michael Brudno. SHRiMP2: Sensitive yet practical short read mapping. Bioinformatics, 2011. doi:10.1093/bioinformatics/btr046.

[Dij82] Edsger W. Dijkstra. Smoothsort, an alternative to sorting in situ. Sci. Comp. Progr., 1:223–233, 1982.

[Edd98] Sean R. Eddy. Profile hidden Markov models. Bioinformatics, 14(9):755–763, 1998. doi:10.1093/bioinformatics/14.9.755.

[EM11] Lavinia Egidi and Giovanni Manzini. Spaced seeds design using perfect rulers. In Proceedings of the 18th International Symposium on String Processing and Information Retrieval (SPIRE), Pisa (Italy), volume 7024 of Lecture Notes in Computer Science, pages 32–43. Springer, 2011. URL: http://www.springerlink.com/content/c18m78j1214h7k21/, doi:10.1007/978-3-642-24583-1_5.

[FCLCST05] Martin Farach-Colton, Gad M. Landau, Süleyman Cenk Sahinalp, and Dekel Tsur. Optimal spaced seeds for faster approximate string matching. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming (ICALP'05), Lisboa (Portugal), volume 3580 of Lecture Notes in Computer Science, pages 1251–1262. Springer, 2005. URL: http://www.springerlink.com/content/815pej6c1kc09upj, doi:10.1007/11523468_101.

[HMN09] Nils Homer, Barry Merriman, and Stanley F. Nelson. BFAST: An alignment tool for large scale genome resequencing. PLoS One, 4(11):e7767, 2009. doi:10.1371/journal.pone.0007767.

[HR08] Inke Herms and Sven Rahmann. Computing alignment seed sensitivity with probabilistic arithmetic automata. In Proceedings of the 8th International Workshop on Algorithms in Bioinformatics (WABI), Karlsruhe (Germany), volume 5251 of Lecture Notes in Bioinformatics, pages 318–329. Springer, Sept. 2008. URL: http://www.springerlink.com/content/e8w1g39288144l56, doi:10.1007/978-3-540-87361-7_27.

[II09] Lucian Ilie and Silvana Ilie. Fast computation of neighbor seeds. Bioinformatics, 25(6):822–823, 2009. URL: http://bioinformatics.oxfordjournals.org/content/25/6/822, doi:10.1093/bioinformatics/btp054.

[IIMB11] Lucian Ilie, Silvana Ilie, and Anahita Mansouri Bigvand. SpEED: fast computation of sensitive spaced seeds. Bioinformatics, 2011. doi:10.1093/bioinformatics/btr368.

[KLMT04] Uri Keich, Ming Li, Bin Ma, and John Tromp. On spaced seeds for similarity search. Discrete Applied Mathematics, 138(3):253–263, 2004. (preliminary version in 2002). doi:10.1016/S0166-218X(03)00382-2.

[KNP04] Gregory Kucherov, Laurent Noé, and Yann Ponty. Estimating seed sensitivity on homogeneous alignments. In Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE), May 19-21, 2004, Taichung (Taiwan), pages 387–394. IEEE Computer Society Press, April 2004. URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1317369, arXiv:cs.OH/0603106, doi:10.1109/BIBE.2004.1317369.

[KNR06] Gregory Kucherov, Laurent Noé, and Mikhail A. Roytberg. A unifying framework for seed sensitivity and its application to subset seeds. Journal of Bioinformatics and Computational Biology, 4(2):553–569, November 2006. URL: http://www.worldscinet.com/jbcb/04/0402/S0219720006001977.html, arXiv:cs.DS/0601116, doi:10.1142/S0219720006001977.

[KWS+11] Szymon M. Kiełbasa, Raymond Wan, Kengo Sato, Paul Horton, and Martin C. Frith. Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3):487–493, 2011. URL: http://genome.cshlp.org/content/21/3/487, doi:10.1101/gr.113985.110.

[LZZ+08] Hao Lin, Zefeng Zhang, Michael Q. Zhang, Bin Ma, and Ming Li. ZOOM! Zillions Of Oligos Mapped. Bioinformatics, 24(21):2431–2437, 2008. URL: http://bioinformatics.oxfordjournals.org/content/24/21/2431, doi:10.1093/bioinformatics/btn416.

[MB07] Denise Y.F. Mak and Gary Benson. All hits all the time: parameter free calculation of seed sensitivity. In D. Sankoff, L. Wang, and F. Chin, editors, Proceedings of the 5th Asia Pacific Bioinformatics Conference (APBC), volume 5 of Advances in Bioinformatics and Computational Biology, pages 327–340. Imperial College Press, 2007. URL: http://eproceedings.worldscinet.com/9781860947995/9781860947995_0035.html, doi:10.1142/9781860947995_0035.

[MGB06] Denise Y.F. Mak, Yevgeniy Gelfand, and Gary Benson. Indel seeds for homology search. Bioinformatics, 22(14):e341–e349, 2006. URL: http://bioinformatics.oxfordjournals.org/content/22/14/e341, doi:10.1093/bioinformatics/btl263.

[MHKR12] Tobias Marschall, Inke Herms, Hans-Michael Kaltenbach, and Sven Rahmann. Probabilistic arithmetic automata and their applications. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 9(6):1737–1750, 2012. URL: http://doi.ieeecomputersociety.org/10.1109/tcbb.2012.109, doi:10.1109/TCBB.2012.109.

[Moh09] Mehryar Mohri. Handbook of Weighted Automata, chapter Weighted automata algorithms, pages 213–254. Springer, 2009. doi:10.1007/978-3-642-01492-5_6.

[MS09] Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry. (draft book-in-progress), 2009.

[MTL02] Bin Ma, John Tromp, and Ming Li. PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18(3):440–445, 2002. URL: http://bioinformatics.oxfordjournals.org/content/18/3/440, doi:10.1093/bioinformatics/18.3.440.

[MY09] Bin Ma and Hongyi Yao. Seed optimization for i.i.d. similarities is no easier than optimal Golomb ruler design. Information Processing Letters, 109(19):1120–1124, 2009. URL: http://linkinghub.elsevier.com/retrieve/pii/S0020019009002270, doi:10.1016/j.ipl.2009.07.008.

[NGK10] Laurent Noé, Marta Gîrdea, and Gregory Kucherov. Designing efficient spaced seeds for SOLiD read mapping. Advances in Bioinformatics, 2010:ID 708501, July 2010. URL: http://www.hindawi.com/journals/abi/2010/708501/, doi:10.1155/2010/708501.

[NR08] François Nicolas and Éric Rivals. Hardness of optimal spaced seed design. Journal of Computer and System Sciences, 74(5):831–849, Aug. 2008. (earlier version in CPM 2005). URL: http://linkinghub.elsevier.com/retrieve/pii/S0022000007001444, doi:10.1016/j.jcss.2007.10.001.

[Nue11] Gregory Nuel. Bioinformatics - Trends and Methodologies, chapter Significance Score of Motifs in Biological Sequences. InTech, 2011. doi:10.5772/18448.

[Pin98] Jean-Éric Pin. Tropical semirings. In J. Gunawardena, editor, Idempotency, volume 11 of Publ. Newton Inst., pages 50–69, Bristol, 1998. Cambridge Univ. Press.


[RLD+09] Stephen M. Rumble, Phil Lacroute, Adrian V. Dalca, Marc Fiume, Arend Sidow, and Michael Brudno. SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol, 5(5):e1000386, 05 2009. URL: http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000386, doi:10.1371/journal.pcbi.1000386.

[SB05] Yanni Sun and Jeremy Buhler. Designing multiple simultaneous seeds for DNA similarity search. Journal of Computational Biology, 12(6):847–861, 2005. (earlier version in RECOMB 2004). URL: http://www.liebertonline.com/doi/abs/10.1089/cmb.2005.12.847, doi:10.1089/cmb.2005.12.847.

[SB09] Yanni Sun and Jeremy Buhler. Designing patterns and profiles for faster HMM search. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 6(2):232–243, 2009. doi:10.1109/tcbb.2008.14.

[Sim88] Imre Simon. Recognizable sets with multiplicities in the tropical semiring. In Mathematical foundations of computer science, 1988 (Carlsbad, 1988), volume 324 of Lecture Notes in Comput. Sci., pages 107–120. Springer, Berlin, 1988. doi:10.1007/BFb0017135.

[YWC+04] I-Hsuan Yang, Sheng-Ho Wang, Yang-Ho Chen, Pao-Hsian Huang, Liang Ye, Xiaoqiu Huang, and Kun-Mao Chao. Efficient methods for generating optimal single and multiple spaced seeds. In Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE), Taichung (Taiwan), pages 411–416. IEEE Computer Society Press, 2004. URL: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1317372, doi:10.1109/BIBE.2004.1317372.

[YZ08] Jialiang Yang and Louxin Zhang. Run probabilities of seed-like patterns and identifying good transition seeds. Journal of Computational Biology, 15(10):1295–1313, Dec. 2008. (earlier version in APBC 2008). URL: http://www.liebertonline.com/doi/abs/10.1089/cmb.2007.0209, doi:10.1089/cmb.2007.0209.

[ZF07] Leming Zhou and Liliana Florea. Designing sensitive and specific spaced seeds for cross-species mRNA-to-genome alignment. Journal of Computational Biology, 14(2):113–130, Mar. 2007. URL: http://www.liebertonline.com/doi/abs/10.1089/cmb.2006.0130, doi:10.1089/cmb.2006.0130.


7 Appendix<br />

Figure 6: $U_k$ matrices and product: example when i = 9 and j = 23. [Diagram omitted; labels: i = 9, j = 23, n = 14.]

Figure 7: $U_k$ matrices and product: example when i = 9 and j = 19. [Diagram omitted; labels: i = 9, j = 19, n = 10.]

7.1 Worst case number of products from i to j

We denote by n the number of single matrices: n = j − i + 1 (n is thus the length of the block being computed with the help of the $U_k$ matrices already given). We illustrate below how to obtain the smallest size n according to the number of products x.

• If x is odd, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends, for $i \in [0..\frac{x-1}{2}]$ (see Figure 6 for x = 5):

$$n = 2 \sum_{i=0}^{\frac{x-1}{2}} 2^i = 2\sqrt{2} \times 2^{\frac{x}{2}} - 2, \qquad x = 2 \log_2\left(\frac{n+2}{2\sqrt{2}}\right)$$

• If x is even, the worst case is produced by a concatenation of blocks of size $2^i$ on both ends of a block of size $2^{\frac{x}{2}}$, for $i \in [0..\frac{x}{2}-1]$ (see Figure 7 for x = 4):

$$n = 2 \sum_{i=0}^{\frac{x-2}{2}} 2^i + 2^{\frac{x}{2}} = 3 \times 2^{\frac{x}{2}} - 2, \qquad x = 2 \log_2\left(\frac{n+2}{3}\right)$$


Figure 8: minimal n (for x even and odd) functions compared to 2 × x. [Plot omitted: curves $2\sqrt{2} \times 2^{x/2} - 2$, $3 \times 2^{x/2} - 2$ and $2x$, for x from 0 to 5.]

Figure 9: $U_k$ matrices and $M_{i..m-1}$ product: example when i = 33 and j goes from 47 to 48.

Note that this integer sequence has its own OEIS entry at http://oeis.org/A027383, defined here as a partial sum of http://oeis.org/A016116.

Combining these two cases, it can be shown that when the number of products is x = 1, 2 or 3, the minimal size is exactly 2 × x (see Figure 8), and that when x > 3 (or x = 0) this minimal bound is never reached again.

In other words, the number of products x is always $\le \frac{n}{2}$.
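The two worst-case constructions can be checked numerically. The sketch below builds the minimal size n directly from the block sizes described in section 7.1 (blocks of size $2^i$ on both ends, plus a central block when x is even) and compares it with the bound 2 × x; the function name `minimal_size` is hypothetical.

```python
def minimal_size(x: int) -> int:
    """Smallest n whose worst case requires x products (section 7.1)."""
    if x % 2 == 1:
        # x odd: blocks of size 2**i on both ends, for i in 0..(x-1)/2
        return 2 * sum(2**i for i in range((x - 1) // 2 + 1))
    # x even: blocks of size 2**i on both ends (i in 0..x/2-1)
    # around one central block of size 2**(x/2)
    return 2 * sum(2**i for i in range(x // 2)) + 2 ** (x // 2)


sizes = [minimal_size(x) for x in range(8)]
print(sizes)

# The bound n = 2x is reached only for x = 1, 2, 3.
reached = [x for x in range(20) if minimal_size(x) == 2 * x]
print(reached)
```

The values for x = 5 and x = 4 are 14 and 10, which are the n values labelled on Figures 6 and 7.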

7.2 Amortized analysis of $U_k$ blocks when i = 0 and j ≥ 0

At first sight, the number of products needed when computing $U_k$ should be 2 on average, and not 1: a quick analysis shows that, indeed, if one product is done half of the time, two are done 1/4 of the time, three 1/8 of the time, and so on, then $\sum_{u=1}^{\infty} \frac{u}{2^u} = 2$.

However, we show here that the amortized number of products when considering j is only 1. We use an amortized analysis that gives one coin each time j is increased (i is supposed to stay at 0, but this assumption can be lifted since it can be seen as a worst case when updating $U_k$) to show that any sub-block $U_k$ generates one extra coin; thus, when it is grouped with its neighbour block of the same size (itself generating one extra coin), the father block built from those two also generates (1+1)−1 = one extra coin.

• This is true for blocks of size 2, since they are built from blocks of size 1 that do not generate any product: the cost of such a block of size 2 is thus 1, and 1 extra coin remains.

• This can easily be verified for blocks of size $2^p$ (p > 1), since by the induction hypothesis the two sub-blocks of size $2^{p-1}$ each give one extra coin: the cost of joining the two sub-blocks then removes one coin, and one extra coin remains again.

Note that this analysis holds for any i ≥ 0 and any j > i, provided that an extra number of j − i coins is given at first.
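The coin argument can be replayed with a small recurrence. The sketch below assumes, as in the induction above, that a block of size 1 costs no product and that a block of size $2^p$ costs the products of its two halves plus one product to join them; the function names are hypothetical.

```python
def block_products(p: int) -> int:
    """Products needed to build one block of size 2**p from single matrices."""
    if p == 0:
        return 0                      # a single matrix needs no product
    return 2 * block_products(p - 1) + 1   # two halves, then one joining product


def spare_coins(p: int) -> int:
    """Coins left once a block of size 2**p is built (one coin per j-increase)."""
    coins_received = 2 ** p
    return coins_received - block_products(p)


for p in range(1, 6):
    print(p, 2 ** p, block_products(p), spare_coins(p))
```

Each block of size $2^p$ thus costs $2^p - 1$ products for $2^p$ coins received, so exactly one extra coin remains at every level.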

[Diagram of Figure 9 omitted: positions 32 to 51, with labels i = 33, m_old = 36, m = 40, j_old = 47 and j = 48.]

7.3 Amortized analysis of the left Mi..m−1 blocks when m fixed and i increased<br />

Summing the number of products needed when computing Mi..m−1 for any i from 0 to m is on average 1 : a<br />

quick analysis shows indeed that if no product is done half of the time (when i is even), one product is done<br />

each 1/4, two done each 1/8, and so on ... then ∑ ∞ u<br />

u=0 2u+1 = 1.<br />

But this does not guaranty that the total number of products paid when increasing i from any value (for<br />

example 0) to m is always less than m. Here we will show that the number of products (once m is fixed) for<br />

computing Mi..m−1 for any i from 0 to a given m =2 p is ≤ 2 p −p−1.<br />

m = 2 → 0<br />

m = 4 → 1<br />

m = 8 → 4<br />

m = 16 → 11<br />

A method similar to that of Section 7.2 can be applied.<br />

First we consider the case where i = 0 and m has been increased to reach a given (and fixed) value $2^p$.<br />

• This is true when p = 1 (thus when m = 2) since, using the $U_k$ blocks, no product is needed to compute $M_{0..1}$ and $M_{1..1}$.<br />

• This can be verified for blocks of size $2^p$ ($p>1$), since we can then use the two sub-blocks of size $2^{p-1}$: when i is within the first sub-block, as the product is done from m to i and stacked in such a way that every suffix $M_{k..m}$ is kept, the cost is the number of products produced by this sub-block, $2^{p-1}-(p-1)-1$, plus the $\log_2(\frac{m}{2}) = p-1$ extra products needed to cover the second sub-block of size $2^{p-1}$; when i is within the second sub-block, the cost is exactly the number of products produced by this sub-block, $2^{p-1}-(p-1)-1$. Summing these two quantities, the number of products is $\le 2\times\left(2^{p-1}-(p-1)-1\right)+(p-1) = 2^p-p-1$.<br />
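The induction above amounts to the recurrence $T(1) = 0$, $T(p) = 2\,T(p-1) + (p-1)$. A minimal sketch, with the hypothetical helper name `products_bound`, unrolls it and compares the result with the closed form $2^p - p - 1$:<br />

```python
def products_bound(p: int) -> int:
    """Bound on the number of products for m = 2**p, following the induction."""
    if p == 1:
        return 0  # base case: m = 2 needs no product
    # Two sub-blocks of size 2**(p-1), plus p-1 extra products
    # to cover the second sub-block from the first one.
    return 2 * products_bound(p - 1) + (p - 1)


for p in range(1, 5):
    m = 2 ** p
    print(m, products_bound(p), m - p - 1)
```

For p = 1 to 4 this gives the values listed for m = 2, 4, 8 and 16.<br />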


Thus, increasing i from any value $\ge 0$ to m and computing all the possible products (with the help of the blocks $U_k$) takes $\le m - \log_2(m) - 1$ products, and thus costs less than m.<br />

Note that this analysis holds for any $i \ge 0$ and any m (not necessarily a strict power of 2, but written as $m = a \times 2^p$ such that $2^p$ is the maximal block size of $U_k$ for $k \in [i..j]$).<br />

7.4 Amortized analysis of the left $M_{i..m-1}$ blocks when m is increased (due to a j increase) and i is fixed<br />

When j increases while i is fixed, m may change to a new (and of course larger) value pointing to a block of equal or twice the size: this happens when m goes from $m_{old} = 2^{p_{old}} \times a_{old}$ (with $a_{old}$ odd) to its new value $m = m_{old} + 2^{p_{old}} = (a_{old}+1)\times 2^{p_{old}} = a \times 2^p$ (with $a = \frac{a_{old}+1}{2}$ and $p = p_{old}+1$), as illustrated in Figure 9.<br />
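The update of m can be sketched as follows. The helper names `split_power_of_two` and `next_m` are hypothetical, and the sample values 36 and 40 are the $m_{old}$ and m labels visible in the figure. This only illustrates the arithmetic of the decomposition, not the stack algorithm itself.<br />

```python
def split_power_of_two(m: int) -> tuple:
    """Write m > 0 as a * 2**p with a odd and return (a, p)."""
    p = 0
    while m % 2 == 0:
        m //= 2
        p += 1
    return m, p


def next_m(m_old: int) -> int:
    """New value m = m_old + 2**p_old, where m_old = a_old * 2**p_old."""
    _, p_old = split_power_of_two(m_old)
    return m_old + 2 ** p_old


a_old, p_old = split_power_of_two(36)   # 36 = 9 * 2**2
m = next_m(36)                          # 36 + 2**2
a, p = (a_old + 1) // 2, p_old + 1      # representation used in the text
print(a_old, p_old, m, a, p, a * 2 ** p == m)
```

Here $a = \frac{a_{old}+1}{2}$ is an integer because $a_{old}$ is odd; it need not itself be odd, in which case the new block is larger still.<br />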



Here we are interested in the computation of $M_{i..m-1}$ due to this $\Delta m = m - m_{old} = 2^{p_{old}}$ increase.<br />

In practice, since m has changed, the full set of left stack matrices $M_{k..m-1}$ has to be recomputed for some $k \in [i..m]$, and some products already done for $M_{k..m_{old}-1}$ (amortized in Section 7.3) unfortunately have to be done a second time here.<br />

This twice-cost is at most $\log_2\left(\frac{m_{old}-i+1}{2}\right) \le \log_2\left(\frac{m-m_{old}}{2}\right) = \log_2\left(\frac{\Delta m}{2}\right)$ ($m_{old}-i$


7.6 Analysis 1<br />

If the size of the $\Delta m$ block increase is $2^u$, the function f(u) that represents the amortized increase per j move (Appendix 7.4) is:<br />

$$f(u) = \left(3-\frac{1}{2^u}\right)[j] + \frac{u-1}{2^u}[m] \;\le\; g(u) = \left(3-\frac{1}{2^u}\right)[j] + \frac{u-1}{2^u}\left[\frac{i+2\times j}{3}\right]$$<br />

since $m < \frac{i+2\times j}{3}$.<br />

$$f'(u) = \frac{\ln(2)}{2^u}[j] + \frac{1+\ln(2)-u\ln(2)}{2^u}[m] \qquad g'(u) = \frac{\ln(2)}{2^u}[j] + \frac{1+\ln(2)-u\ln(2)}{2^u}\left[\frac{i+2\times j}{3}\right]$$<br />

Note that<br />

$g'(x) \ge 0$ if $x \le 1+\frac{1}{\ln(2)} \approx 2.44$<br />

$g'(x) \le 0$ if $x \ge \frac{5}{2}+\frac{1}{\ln(2)} \approx 3.94$<br />

so the maximal value of g over integer arguments is one of g(2), g(3) or g(4). Since<br />

$g(3)-g(2) = \frac{1}{8}[j] \ge 0$<br />

$g(4)-g(3) = \frac{1}{48}\left([j]-[i]\right) \ge 0$<br />

then $g(4) = 3[j] + \frac{[i]+[j]}{16}$ is the maximal value. Thus, $f(u) \le 3[j] + \frac{[i]+[j]}{16}$.<br />
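These differences and thresholds can be checked numerically. The sketch below assumes that the bracketed quantities [i] and [j] can be read as plain numbers and that the bracket is linear, so that the bracket of $\frac{i+2\times j}{3}$ equals $\frac{[i]+2[j]}{3}$. The function name `g` mirrors the text, and the sample values i = 33 and j = 48 are the figure labels, used only as an illustration.<br />

```python
import math
from fractions import Fraction


def g(u: int, i: int, j: int) -> Fraction:
    """g(u) = (3 - 1/2**u) * j + (u - 1)/2**u * (i + 2*j)/3, in exact arithmetic."""
    return (3 - Fraction(1, 2 ** u)) * j + Fraction(u - 1, 2 ** u) * Fraction(i + 2 * j, 3)


i, j = 33, 48  # sample values with 0 <= i <= j (assumption for illustration)

d32 = g(3, i, j) - g(2, i, j)   # expected to equal j / 8
d43 = g(4, i, j) - g(3, i, j)   # expected to equal (j - i) / 48
g4 = g(4, i, j)                 # expected to equal 3*j + (i + j) / 16

# Thresholds on u at which the sign of g' is settled.
t_low = 1 + 1 / math.log(2)
t_high = 2.5 + 1 / math.log(2)

print(d32 == Fraction(j, 8), d43 == Fraction(j - i, 48))
print(g4 == 3 * j + Fraction(i + j, 16))
print(round(t_low, 2), round(t_high, 2))
```

Since both differences are non-negative whenever $[i] \le [j]$, g(4) dominates g(2) and g(3), in line with the conclusion above.<br />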

Note also that<br />

$f(u) = \left(3-\frac{1}{2^u}\right)[j] + \frac{u-1}{2^u}[m] \le 3[j] + \frac{u-2}{2^u}[m]$ since m
