13.07.2015 Views

The Genom of Homo sapiens.pdf


EULERIAN ASSEMBLY AND MULTIPLE ALIGNMENT 209

[Figure 5: two panels plotting Sum of Pair Distance Score and Aligning Alignment Distance Score against Number of Sequences (0 to 500).]

Figure 5. SP and AA scores (distance score) with respect to the number of sequences. The squares and triangles indicate different pair-wise similarities (square, 90%; triangle, 70%). Both SP and AA scores are computed from a single alignment test. Solid lines connect points from EulerAlign, and dashed lines connect points from ClustalW.

[...] performance of EulerAlign: (1) sum of pairs (SP) score, a popular and simple measure, and (2) aligning alignment (AA) score, a comparison of an alignment to the true alignment; by simulating sequences, the true alignment (rather than the mathematically optimal alignment) is known. We used ClustalW (Higgins and Sharp 1989; Thompson et al. 1994), a well-studied and popular MSA software, as the reference. Figure 5 shows the comparison between EulerAlign and ClustalW on sequence sets generated by the evolutionary model with different mutation rates: 5.2% and 16.4%, respectively, corresponding to 90% and 70% pair-wise sequence similarities. The comparison on the equidistance model is not shown, simply because EulerAlign is designed for that model and hence achieves a better result. The linear growth of the computational time with respect to the number of aligned sequences by EulerAlign is shown in Figure 6a, and the contrast with the quadratic growth by ClustalW is shown in Figure 6b. We used distance scores; hence, the smaller the score the better the result.
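As an illustration of the sum of pairs distance score described above, the following is a minimal sketch only. The unit mismatch and gap costs, the convention that a gap-gap column costs nothing, and the names `pair_distance` and `sp_distance` are assumptions made for this example; the text here does not state the scoring parameters actually used.

```python
from itertools import combinations

GAP = "-"

def pair_distance(row_a, row_b, mismatch=1, gap=1):
    """Distance between two rows of one alignment (hypothetical unit costs)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == GAP and b == GAP:
            continue  # assumption: a column that is a gap in both rows costs nothing
        if a == GAP or b == GAP:
            total += gap
        elif a != b:
            total += mismatch
    return total

def sp_distance(alignment, mismatch=1, gap=1):
    """Sum the pairwise distances over every pair of rows in the alignment."""
    return sum(
        pair_distance(a, b, mismatch, gap)
        for a, b in combinations(alignment, 2)
    )

# Toy alignment of three sequences (made up for illustration, not data from the study).
alignment = ["ACGT-A", "ACGTTA", "AC-TTA"]
print(sp_distance(alignment))
```

Because this is a distance, a smaller value indicates a better alignment, and the number of row pairs summed grows quadratically with the number of sequences.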
All tests were run on a SUN UltraSPARC 750MHz workstation.

Application on Arabidopsis Sequences

Arabidopsis thaliana is widely used as a model organism for genetic study in plant biology. As an application on real genomic sequences, we used EulerAlign to construct alignments for several sets of short specific sequences sampled from 96 Arabidopsis individuals by PCR experiments with certain primers. These alignments are then used to study the genetic variations and hence evolutionary relationships in the Arabidopsis population. Sequence data were kindly provided by M. Nordborg at USC. In the presence of base-calling errors, an accurate multiple alignment is crucial for efficiently distinguishing real genetic variations from sequencing errors. The main difference between genetic variations and sequencing errors is that sequencing errors are more independently and randomly distributed (although of course a function of position in the sequence fragment).

To reduce base-calling errors, each individual is sequenced from both forward and backward strands, and each base-call has a quality value assigned by Phred (Ew-

[Figure 6: two panels, (a) and (b), plotting time cost against Number of Sequences (0 to 500), with curves labeled 90%, 80%, and 70%.]

Figure 6. (a) Linear time cost (in seconds) by EulerAlign with respect to the number of sequences; three lines correspond to 90%, 80%, and 70% pair-wise similarities. (b) Comparison to the quadratic time cost by ClustalW (dashed lines). The tested numbers of sequences are 10, 15, 50, 100, 250, and 500.
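The Phred quality values mentioned above follow the standard definition Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The sketch below shows that conversion. The `pick_call` rule for reconciling the two strand reads is purely a hypothetical illustration and is not the procedure described in the paper; it also assumes the backward read has already been reverse-complemented so that both calls refer to the same position.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality value Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) to Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)

def pick_call(forward, backward):
    """Hypothetical rule: keep the (base, quality) call with the higher quality."""
    return forward if forward[1] >= backward[1] else backward

# Q = 20 corresponds to a 1-in-100 chance that the call is wrong.
print(phred_to_error_prob(20))
# Made-up calls at one position: forward read says A (Q35), backward says G (Q12).
print(pick_call(("A", 35), ("G", 12)))
```

A rule of this kind only arbitrates between two reads of the same individual; separating real variation from residual errors across individuals is the job of the multiple alignment.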
