BLAST and BLAT - Algorithms in Bioinformatics
Bioinformatics I, WS’12/13, D. Huson, October 30, 2012
3.12.1 Clumping hits
(We skipped the details of how to perform the extend phase of the algorithm.)
The first step in alignment generation is to form clumps of hits that represent regions in the database
sequence that are homologous to the query sequence. Each clump is a chain of hits, containing more
than a given minimum number of hits, in which two consecutive hits are not too far apart from each
other and in which the gap size in either sequence does not exceed a given threshold.
Multiple hits are clumped together as follows:
• The hit list L is sorted by database coordinate.
• The list L is split into buckets of size 64 kb each, based on the database coordinate.
• Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position
minus query position.
• Hits that are within the gap limit are grouped together into “proto-clumps”.
• Hits within proto-clumps are then sorted by their database coordinate and put into real clumps
if they are within the window limit on the database coordinate.
• Clumps within 300 bp or 100 amino acids of each other in the database are merged, and then
500 bp are added to each end of a clump.
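The steps above can be sketched in Python as follows. This is a minimal illustration and not the actual BLAT code. The function names, the representation of a hit as a (query position, database position) pair, and the default values for gap_limit, window_limit and min_hits are assumptions made for this example. Only the 64 kb bucket size, the 300 bp merge distance and the 500 bp padding come from the description above. The sketch also treats "exceeds a minimum number of hits" as "at least min_hits" and does not handle clumps that straddle a bucket boundary.

```python
def group_consecutive(items, key, limit):
    """Split items (already sorted by key) into runs in which
    neighbouring items differ by at most `limit` in key."""
    groups = []
    for item in items:
        if groups and key(item) - key(groups[-1][-1]) <= limit:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups


def clump_hits(hits, bucket_size=64 * 1024, gap_limit=2, window_limit=100,
               min_hits=2, merge_dist=300, pad=500):
    """Return padded (start, end) database intervals, one per merged clump.

    hits: iterable of (query_pos, db_pos) pairs (hypothetical layout).
    gap_limit, window_limit, min_hits: illustrative defaults only.
    """
    def db(hit):
        return hit[1]

    def diag(hit):
        return hit[1] - hit[0]  # database position minus query position

    # Step 1: sort the hit list by database coordinate.
    hits = sorted(hits, key=db)

    # Step 2: split into buckets by database coordinate.
    buckets = {}
    for hit in hits:
        buckets.setdefault(db(hit) // bucket_size, []).append(hit)

    clumps = []
    for _, bucket in sorted(buckets.items()):
        # Step 3: sort each bucket along the diagonal.
        bucket.sort(key=diag)
        # Step 4: hits whose diagonals lie within the gap limit form proto-clumps.
        for proto in group_consecutive(bucket, diag, gap_limit):
            # Step 5: sort by database coordinate, split on the window limit.
            proto.sort(key=db)
            for clump in group_consecutive(proto, db, window_limit):
                if len(clump) >= min_hits:
                    clumps.append((db(clump[0]), db(clump[-1])))

    # Step 6: merge clumps that are close in the database, then pad both ends.
    clumps.sort()
    merged = []
    for start, end in clumps:
        if merged and start - merged[-1][1] <= merge_dist:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Clamping at 0 is an assumption; the database end is not checked here.
    return [(max(0, start - pad), end + pad) for start, end in merged]
```

For example, calling clump_hits on the hits (10, 1010), (21, 1021), (32, 1032), (15, 5000), (26, 5011) and (500, 9000) groups the first three hits on diagonal 1000 and the next two on diagonal 4985, and discards the isolated hit at database position 9000 because it does not reach min_hits.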
[Figure: six hits, numbered 1 to 6, plotted with the query sequence against the database sequence in three panels: "A list of hits", "Sorted by database coordinate", and "Sorted along the diagonal".]