11.04.2013 Views

BLAST and BLAT - Algorithms in Bioinformatics


Bioinformatics I, WS'12/13, D. Huson, October 30, 2012

3.12.1 Clumping hits

(We skipped the details of how to perform the extend phase of the algorithm.)

The first step in alignment generation is to form clumps of hits that represent regions in the database sequence that are homologous to the query sequence. Each clump consists of a number of hits, exceeding a given minimum, that form a chain in which consecutive hits are not too far apart from each other and in which the gap size in either sequence does not exceed a given threshold.
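The clump conditions can be written down as a small predicate. The following Python sketch is only an illustration: representing a hit as a `(query_pos, db_pos)` pair, the function name `is_clump`, and the parameter names `min_hits`, `window_limit` and `gap_limit` are assumptions made here. It also reads "not too far apart" as a bound on the database distance between consecutive hits, reads the "gap size" as the change in diagonal (database position minus query position), and treats the minimum as "at least `min_hits`".

```python
from typing import List, Tuple

# A hit as (query_pos, db_pos); this representation is an assumption.
Hit = Tuple[int, int]


def is_clump(hits: List[Hit], min_hits: int,
             window_limit: int, gap_limit: int) -> bool:
    """Check whether the hits form a chain that qualifies as a clump."""
    # Assumption: "exceeds a given minimum" is read as "at least min_hits".
    if len(hits) < min_hits:
        return False
    # Order the chain by database position.
    chain = sorted(hits, key=lambda h: h[1])
    for (q1, d1), (q2, d2) in zip(chain, chain[1:]):
        # Consecutive hits must not be too far apart in the database.
        if d2 - d1 > window_limit:
            return False
        # A gap in either sequence shows up as a change of diagonal.
        if abs((d2 - q2) - (d1 - q1)) > gap_limit:
            return False
    return True
```

For example, `is_clump([(10, 110), (20, 120), (30, 131)], 2, 50, 2)` accepts the chain, because the diagonals are 100, 100 and 101 and consecutive database positions differ by at most 11.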

Multiple hits are clumped together as follows:

• The hit list L is sorted by database coordinate.
• The list L is split into buckets of size 64 kb each, based on the database coordinate.
• Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position minus query position.
• Hits that are within the gap limit are grouped together into "proto-clumps".
• Hits within proto-clumps are then sorted by their database coordinate and put into real clumps, if they are within the window limit on the database coordinate.
• Clumps within 300 bp or 100 amino acids of each other in the database are merged, and then 500 bp are added to each end of a clump.
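The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual BLAT implementation: hits are `(query_pos, db_pos)` pairs, "64 kb" is read as 65536 database positions, and the names `clump_hits`, `merge_and_pad`, `split_runs`, `gap_limit`, `window_limit` and `min_hits` are hypothetical. Padded intervals are clamped at position 0, which the text does not specify.

```python
from itertools import groupby
from typing import Callable, List, Tuple

Hit = Tuple[int, int]     # (query_pos, db_pos); representation is an assumption
BUCKET_SIZE = 64 * 1024   # "64 kb" read as 65536 database positions (assumption)


def diagonal(hit: Hit) -> int:
    """Database position minus query position."""
    return hit[1] - hit[0]


def split_runs(items: List[Hit], key: Callable[[Hit], int],
               limit: int) -> List[List[Hit]]:
    """Split items (already sorted by key) wherever neighbours differ by more than limit."""
    runs: List[List[Hit]] = []
    for item in items:
        if runs and key(item) - key(runs[-1][-1]) <= limit:
            runs[-1].append(item)
        else:
            runs.append([item])
    return runs


def clump_hits(hits: List[Hit], gap_limit: int, window_limit: int,
               min_hits: int) -> List[List[Hit]]:
    clumps = []
    # Sort the hit list by database coordinate.
    by_db = sorted(hits, key=lambda h: h[1])
    # Split into buckets based on the database coordinate.
    for _, bucket in groupby(by_db, key=lambda h: h[1] // BUCKET_SIZE):
        # Sort each bucket along the diagonal.
        by_diag = sorted(bucket, key=diagonal)
        # Hits within the gap limit on the diagonal form proto-clumps.
        for proto in split_runs(by_diag, diagonal, gap_limit):
            # Re-sort by database coordinate and apply the window limit.
            proto.sort(key=lambda h: h[1])
            for clump in split_runs(proto, lambda h: h[1], window_limit):
                if len(clump) >= min_hits:  # assumption: "at least min_hits"
                    clumps.append(clump)
    return clumps


def merge_and_pad(clumps: List[List[Hit]], merge_dist: int = 300,
                  pad: int = 500) -> List[Tuple[int, int]]:
    """Merge clumps that are close in the database, then extend each end by pad."""
    spans = sorted((min(d for _, d in c), max(d for _, d in c)) for c in clumps)
    merged: List[List[int]] = []
    for start, end in spans:
        if merged and start - merged[-1][1] <= merge_dist:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Clamping at 0 is an assumption; the text only says 500 bp are added.
    return [(max(0, s - pad), e + pad) for s, e in merged]
```

The defaults `merge_dist=300` and `pad=500` correspond to the nucleotide case in the text; for protein databases the merge distance would be 100 amino acids instead.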

[Figure: six hits, numbered 1 to 6, plotted against the query sequence and the database sequence. Panels: "A list of hits", "Sorted by database coordinate", "Sorted along the diagonal".]
