21.01.2013 Views

Lecture Notes in Computer Science 4917

Lecture Notes in Computer Science 4917

Lecture Notes in Computer Science 4917

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

164 H. Vandierendonck et al.<br />

Table 1. Complexity of Clustal W <strong>in</strong> time and space, with N the number of sequences<br />

and L the typical sequence length<br />

Stage O(Space) O(Time)<br />

PW: Pairwise calculation N 2 + L 2<br />

GT: Guide tree N 2<br />

PA: Progressive alignment NL + L 2<br />

Total N 2 + L 2<br />

N 2 L 2<br />

N 4<br />

N 3 + NL 2<br />

N 4 + L 2<br />

each phase <strong>in</strong> terms of N and L (see Table 1). This theoretical analysis <strong>in</strong>dicates<br />

that the second stage of Clustal W will become more important as the number<br />

of sequences <strong>in</strong>creases, but it is <strong>in</strong>different to the length of these sequences. In<br />

the other stages both the number of sequences and the length are of importance.<br />

The time and space complexity <strong>in</strong>dicate how each phase scales with <strong>in</strong>creas<strong>in</strong>g<br />

problem size, but they does not tell us the absolute amount of time spent <strong>in</strong> each<br />

phase. In order to understand this, we analyze the program by randomly creat<strong>in</strong>g<br />

<strong>in</strong>put sets with preset number of sequences N and sequence length L. Statistical<br />

data of prote<strong>in</strong> databases [10] <strong>in</strong>dicates that the average length of sequences is<br />

366 am<strong>in</strong>o acids, while sequences with more than 2000 am<strong>in</strong>o acids are very rare.<br />

So we randomly created <strong>in</strong>put sets with a sequence length rang<strong>in</strong>g from 10 to<br />

1000 am<strong>in</strong>o acids and with a number of sequences <strong>in</strong> the same range.<br />

Figure 1 shows the percentage of execution time spent <strong>in</strong> each stage. Each<br />

bar <strong>in</strong>dicates a certa<strong>in</strong> configuration of the number of sequences (bottom Xvalue)<br />

and the sequence length (top X-value). The pairwise alignment becomes<br />

the dom<strong>in</strong>ant stage when the number of sequences is high, tak<strong>in</strong>g responsibility<br />

for more than 50% of execution time when there are at least 50 sequences. In<br />

contrast, when the number of sequences is small, then progressive alignment<br />

takes the larger share of the execution time.<br />

The guide tree only plays an important role when the <strong>in</strong>put set conta<strong>in</strong>s a large<br />

number of short sequences. In other cases this stage is only responsible for less<br />

than 5% of the execution time. In a previous study, G-prote<strong>in</strong> coupled receptor<br />

Percentage execution time<br />

100%<br />

80%<br />

60%<br />

40%<br />

20%<br />

0%<br />

10<br />

50<br />

100<br />

500<br />

1000<br />

10<br />

50<br />

100<br />

500<br />

1000<br />

10<br />

50<br />

100<br />

500<br />

1000<br />

10<br />

50<br />

100<br />

500<br />

1000<br />

10<br />

50<br />

100<br />

500<br />

1000<br />

N = 10 N = 50 N = 100 N = 500 N = 1000<br />

Number of sequences (lower) & Sequence length (upper)<br />

Fig. 1. Percentage of execution time of the three major stages <strong>in</strong> Clustal W<br />

PA<br />

GT<br />

PW

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!