11.07.2015 Views

Bioinformatics for DNA Sequence Analysis.pdf - Index of

Similarity Searching Using BLAST

made to the sequence since it was first submitted. Locus names (see Note 1) are older, less standardized identifiers whose original purpose was to group entries with similar sequences (10). The original locus format was intended to hold information about the organism and other common group characteristics (such as gene product). That ten-character format is no longer able to hold such information for the large number and variety of sequences now available, so the locus has become yet another unique identifier, often set to the same value as the accession number. Database identifiers are simply two- or three-character strings that indicate which database originally received and stored the information. The database identifier is the first value listed in the FASTA identifier syntax (Table 1.1).

When a sequence is first submitted to GenBank, it is submitted with several defined features associated with the sequence. Some include CDS (coding sequence), RBS (ribosome binding site), rep_origin (origin of replication), and tRNA (mature transfer RNA) information. A translation of protein-coding nucleotide sequences into amino acids is provided as part of the features section. Likewise, the labeling of different open reading frames, introns, etc., is part of the table of features. A list of features and their descriptions, formats, and conventions that were agreed upon by INSDC can be found in the Feature Table (see Section 2.1.2).

2.2. Smith–Waterman and Dynamic Programming

In 1970, Needleman and Wunsch adapted the idea of dynamic programming to the difficult problem of global sequence alignment (11).
In 1981, Smith and Waterman adapted this algorithm to local alignments (12). A global alignment attempts to align two sequences throughout their entire length, whereas a local alignment aligns regions of two sequences where high similarity is observed. Both methods involve initializing, scoring, and tracing a matrix where the rows and columns correspond to the bases or residues of the two sequences being aligned (Fig. 1.2). In the local alignment case, the first row and the first column are filled with zeroes. The remaining cells are filled with a metric value recursively derived from neighboring values:

$$
\max
\begin{cases}
0 \\
\text{left neighbor} + \text{gap penalty} \\
\text{top neighbor} + \text{gap penalty} \\
\text{top-left neighbor} + \text{match/mismatch score}
\end{cases}
$$

If the current cell corresponds to a match (identical bases), the match score is added to the value from the diagonal neighbor; otherwise the mismatch score is used. The gap penalty and mismatch scores are generally zero or a small negative number, while the match score is a positive number, larger in magnitude.
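
The local-alignment procedure described above (zero-filled first row and column, cells taken as the maximum of zero and the three neighbor-derived values, then a trace back through the matrix) can be illustrated with a minimal Python sketch. The function name `smith_waterman` and the default scores (match = 2, mismatch = -1, gap = -1) are assumptions chosen for illustration and are not taken from the text. The traceback rule used here (start at the highest-scoring cell, stop at the first zero cell) is the usual Smith–Waterman convention rather than something the passage spells out.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Rows of the matrix correspond to `a`, columns to `b`.
    The default scores are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Initialization: the first row and first column stay at zero.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Scoring: each cell is the max of 0 and the three neighbor-derived values.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,
                H[i][j - 1] + gap,      # left neighbor + gap penalty
                H[i - 1][j] + gap,      # top neighbor + gap penalty
                H[i - 1][j - 1] + s,    # top-left neighbor + match/mismatch
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Tracing: walk back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")           # gap in b
            i -= 1
        else:
            out_a.append("-")           # gap in a
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    score, top, bottom = smith_waterman("ACGTTGCA", "ACGTGCA")
    print(score)
    print(top)
    print(bottom)
```

When several alignments share the best score, this sketch reports only one of them, determined by the order in which the traceback checks the diagonal, top, and left neighbors. It fills the full matrix, so time and memory both grow with the product of the two sequence lengths.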
