15.12.2012 Views

Bioinformatics, Volume I Data, Sequence Analysis and Evolution


Managing Sequence Data

6. The GenBank Sequence Record

6.1. Definition, Accession, and Organism

A GenBank sequence record is most familiarly viewed as a flat file, with the information presented in specific fields. DDBJ and GenBank flat files are similar in design. The EMBL flat file format, however, differs in design but is identical in content to flat files from GenBank and DDBJ.

Here is the top of a GenBank flat file, showing the first seven fields.

LOCUS       AF123456     1510 bp    mRNA    linear   VRT 25-JUL-2000
DEFINITION  Gallus gallus doublesex and mab-3 related transcription factor 1
            (DMRT1) mRNA, partial cds.
ACCESSION   AF123456
VERSION     AF123456.2  GI:6633795
KEYWORDS    .
SOURCE      Gallus gallus (chicken)
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
            Phasianinae; Gallus
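Because the LOCUS and VERSION lines in the record above are whitespace-delimited, they can be split into named parts with a few lines of code. The following Python sketch is illustrative only: the function names `parse_locus_line` and `parse_version_line` and the dictionary keys are hypothetical, and the code assumes the eight-token LOCUS layout shown above (older or unusual records may not fit it). For real work, an established parser such as Biopython's `Bio.SeqIO` is usually the better choice.

```python
from datetime import date

# Month abbreviations as they appear in GenBank dates (e.g. 25-JUL-2000).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}


def parse_locus_line(line):
    """Split a LOCUS line into named fields.

    Assumes the layout shown in the sample record:
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, molecule, topology, division, date_text = tokens
    day, month, year = date_text.split("-")
    return {
        "locus_name": name,
        "length_bp": int(length),
        "molecule_type": molecule,
        "topology": topology,
        "division": division,
        "modified": date(int(year), MONTHS[month.upper()], int(day)),
    }


def parse_version_line(line):
    """Split a VERSION line into accession, version number, and gi (if present)."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError(f"unexpected VERSION layout: {line!r}")
    accession, _, version = tokens[1].partition(".")
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus_line(
    "LOCUS       AF123456     1510 bp    mRNA    linear   VRT 25-JUL-2000")
version = parse_version_line("VERSION     AF123456.2  GI:6633795")
print(locus["locus_name"], locus["length_bp"], locus["modified"])
print(version)
```

Splitting on whitespace keeps the sketch short; a stricter parser would validate each field against the values the flat file format allows.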

The first token of the LOCUS field is the locus name. At present, this locus name is the same as the accession number, but in the past, more descriptive names were used. For instance, HUMHBB is the locus name for the human beta-globin gene in the record with accession number U01317. With the increase in the number of redundant sequences over time, the generation of descriptive locus names was abandoned. Following the locus name are the length of the sequence, the molecule type, the topology (linear or circular), the GenBank taxonomic or functional division, and the date of the last modification. The DEFINITION line gives a brief description of the sequence, including information about the source organism, gene(s), and molecule. The ACCESSION is the database-assigned accession number, which has one of the following formats: two letters and six digits or one letter and five digits for INSD records; four letters and eight digits for WGS records; and two letters, an underscore, and six to eight digits for RefSeq records. The VERSION line contains the sequence version and the gi, a unique numerical identifier for that particular sequence. A GenBank record may or may not have KEYWORDS. Historically, the KEYWORDS field in the GenBank record was used as a summary of the information present in the record. It was a free text field and may have contained gene name, protein name, tissue
