14.06.2013 Views

Databases and Systems


There are other sets of data that can be grouped together. For example, sequences of the same gene from a group of related organisms comprise a population set, such as the actin sequences from related vertebrates. These and other more complicated sets are just beginning to be used and will not be discussed further in this chapter.

Sequence Identifiers.<br />

There are several reasons why the “id” blocks of Bioseqs hold multiple sequence identifiers. For example, consider an expressed sequence tag (EST) sequence. In its “native” database, dbEST[10], it is given an integer tag, an “est-id”, as its unique sequence identifier. If this sequence is to be used outside the context of that database, the string “dbEST” must be added to the integer so that the resulting composite identifier is unique. To appear in GenBank[11], however, the sequence record additionally needs both a LOCUS name and an ACCESSION number. Every sequence in GenBank is also given a “gi” number, which allows any sequence to be retrieved with a single integer key, for example by Entrez. Data from other sources also retain the identifiers assigned by those sources. So this single sequence record will have four to six different sequence identifiers. Different retrieval programs will each use a different identifier from this synonymous set.
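The idea of a synonymous set of identifiers can be sketched in a few lines of Python. This is only an illustration, not the actual NCBI data model: the class names `SeqId` and `Bioseq`, the function `build_index`, the namespace labels, and every identifier value are hypothetical. The sketch pairs each identifier with the name of its source, which is what makes a bare integer such as an est-id unique outside its native database.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SeqId:
    """One identifier: the source that assigned it plus the value it assigned."""
    namespace: str  # hypothetical labels such as "dbEST", "gi", "accession", "locus"
    value: str


@dataclass
class Bioseq:
    """A sequence record carrying several synonymous identifiers."""
    ids: List[SeqId] = field(default_factory=list)
    residues: str = ""


def build_index(bioseqs: List[Bioseq]) -> Dict[Tuple[str, str], Bioseq]:
    """Map every (namespace, value) pair to the record that carries it."""
    index: Dict[Tuple[str, str], Bioseq] = {}
    for seq in bioseqs:
        for seq_id in seq.ids:
            index[(seq_id.namespace, seq_id.value)] = seq
    return index


# Hypothetical EST record with four identifiers (all values are made up).
est = Bioseq(
    ids=[
        SeqId("dbEST", "12345"),
        SeqId("locus", "EXAMPLE1"),
        SeqId("accession", "T00001"),
        SeqId("gi", "12345"),
    ],
    residues="ACGTACGT",
)
index = build_index([est])
```

Under these assumptions, a retrieval program that knows only the gi and one that knows only the ACCESSION both reach the same record, and the integer `12345` is ambiguous until it is qualified with `"dbEST"` or `"gi"`.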

‘gi’s vs. ACCESSIONs

There has been some confusion about the role of the ‘gi’ and how it complements an ACCESSION. When a laboratory submits a sequence from a piece of DNA to a database, an ACCESSION number is assigned and is permanently associated with that piece of DNA. When the record containing that sequence is loaded into the ID database (see below), an initial ‘gi’ is assigned to that sequence. Further experiments over time may alter the best understanding of the true sequence of that DNA. When a new sequence for the same piece of DNA is submitted to NCBI, a new ‘gi’ is assigned. This leads to a “chain”, or series, of sequences and corresponding ‘gi’s for the same piece of DNA. When it is important to identify the piece of DNA, for example as a subclone at a particular location within some other clone, the ACCESSION is best used. When the particular sequence is most important, as in statements about sequence similarity or a conceptual translation, the ‘gi’ that points to the particular sequence intended is best used. Statements and experiments that use ‘gi’s can always be repeated, because the sequence identified by a particular ‘gi’ remains available even if it is no longer thought to be the accurate sequence. This relationship between ACCESSION and ‘gi’ is shown in Figure 2, below.

[Diagram: gi 100, citation change, gi 100, sequence change, gi 201]

Figure 2: A “chain” of ‘gi’s<br />
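The chain in Figure 2 can be modelled with a small sketch. This is a toy illustration under stated assumptions, not how the ID database is actually implemented: the class `IdDatabase`, its methods, the accession `X00001`, and the sequential gi counter are all hypothetical (real gi values are not consecutive within one chain, as the figure shows). The only rules taken from the text are that the ACCESSION stays fixed, that a change in the sequence yields a new ‘gi’, that a change elsewhere in the record (such as a citation) does not, and that every sequence ever assigned a ‘gi’ stays retrievable.

```python
from typing import Dict, List


class IdDatabase:
    """Toy model of ACCESSION and gi assignment (hypothetical, for illustration)."""

    def __init__(self, first_gi: int = 100) -> None:
        self._next_gi = first_gi
        self._sequence_by_gi: Dict[int, str] = {}          # entries are never deleted
        self._chain_by_accession: Dict[str, List[int]] = {}  # oldest gi first

    def load(self, accession: str, sequence: str) -> int:
        """Load a record and return the gi that identifies its sequence."""
        chain = self._chain_by_accession.setdefault(accession, [])
        if chain and self._sequence_by_gi[chain[-1]] == sequence:
            # Sequence unchanged (for example, only a citation changed): same gi.
            return chain[-1]
        # New or revised sequence: assign a fresh gi and extend the chain.
        gi = self._next_gi
        self._next_gi += 1
        self._sequence_by_gi[gi] = sequence
        chain.append(gi)
        return gi

    def sequence_for_gi(self, gi: int) -> str:
        """Return the exact sequence a gi refers to, even if it was superseded."""
        return self._sequence_by_gi[gi]

    def chain(self, accession: str) -> List[int]:
        """Return a copy of the accession's gi chain, oldest first."""
        return list(self._chain_by_accession[accession])


db = IdDatabase()
first = db.load("X00001", "ACGTACGT")     # initial submission
same = db.load("X00001", "ACGTACGT")      # citation change only
revised = db.load("X00001", "ACGTACGA")   # sequence change
```

In this sketch the ACCESSION `X00001` always names the piece of DNA, while `first` and `revised` name two specific sequences for it; an analysis that recorded `first` can still fetch exactly the sequence it used through `sequence_for_gi`.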
