11.04.2013 Views

BLAST and BLAT - Algorithms in Bioinformatics

BLAST and BLAT - Algorithms in Bioinformatics

BLAST and BLAT - Algorithms in Bioinformatics

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

32 Bio<strong>in</strong>formatics I, WS’12/13, D. Huson, October 30, 2012<br />

query sequence<br />

3.12.2 Nucleotide alignments<br />

1 2<br />

3<br />

5<br />

4<br />

6<br />

Database sequence<br />

Clump<strong>in</strong>g is the first part of the extension stage. In the case of nucleotide alignments, each clump is<br />

then processed as follows.<br />

• A hit list is generated between the query sequence q <strong>and</strong> the homologous region h <strong>in</strong> the database, look<strong>in</strong>g<br />

for smaller, perfect K-mers.<br />

• If a K-mer w <strong>in</strong> q matches multiple K-mers <strong>in</strong> h, then w is repeatedly extended by one until the match<br />

is unique or exceeds a certa<strong>in</strong> size.<br />

• The hits are extended as far as possible, without mismatches.<br />

• Overlapp<strong>in</strong>g hits are merged.<br />

• If there are gaps <strong>in</strong> the alignment <strong>in</strong> both the query <strong>and</strong> the database, then the algorithm recurses to fill<br />

<strong>in</strong> the gaps, us<strong>in</strong>g a smaller K.<br />

• Then extensions us<strong>in</strong>g <strong>in</strong>dels followed by matches are considered.<br />

• Large gaps <strong>in</strong> the query sequence often correspond to <strong>in</strong>trons <strong>and</strong> they are slid around to f<strong>in</strong>d the best<br />

GT/AG consensus sequence for the <strong>in</strong>tron ends.<br />

3.12.3 Prote<strong>in</strong> alignments<br />

In the case of am<strong>in</strong>o acid sequences, each clump is processed as follows:<br />

• All hits obta<strong>in</strong>ed <strong>in</strong> the seed stage are extended <strong>in</strong>to maximally scor<strong>in</strong>g ungapped alignments (HSPs)<br />

us<strong>in</strong>g a score function where a match is worth 2 <strong>and</strong> a mismatch is worth 1.<br />

• A graph is build with HSPs as nodes.<br />

• If HSP A starts before HSP B <strong>in</strong> both sequences, then an edge is put from A to B that is weighted by<br />

the score of B m<strong>in</strong>us a gap penalty based on the distances between A <strong>and</strong> B.<br />

• If A <strong>and</strong> B overlap, then an optimal crossover position x is determ<strong>in</strong>ed that maximizes the sum of score<br />

of A up to x <strong>and</strong> B start<strong>in</strong>g from x <strong>and</strong> the edge weight is set accord<strong>in</strong>gly.<br />

• A dynamic programm<strong>in</strong>g algorithm then extracts the maximal scor<strong>in</strong>g alignment by travers<strong>in</strong>g the graph.<br />

• The HSPs conta<strong>in</strong>ed <strong>in</strong> the path are removed <strong>and</strong> if any HSPs are left then the dynamic program is run<br />

aga<strong>in</strong>.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!