14.06.2013 Views

Databases and Systems


2. Explicit declaration by the history in the incoming record. Since NCBI accepts
ASN.1 messages directly, the history slot can be used to explicitly declare the
relationship between protein Bioseqs. Of course, such declarations are only allowed
if the nucleic acid Bioseqs of the complex sets containing the proteins are also
properly related. Here, too, a ‘gi’ identifier or its equivalent can be used.

3. Matching by exact sequence identity. This actually happens quite frequently.

4. Matching by exact identity of location on the nucleic acid Bioseq. This can
happen without sequence identity because of a prior error in translation caused,
for example, by using an incorrect genetic code.

Figure 3: Deducing chains for protein ‘gi’s. Protein Bioseqs (curly lines) in the
new records (on the left) and old records (on the right) are ordered by their
position on the nucleic acid (solid vertical lines). Proteins matched by one of the
above rules are indicated by solid lines, while proteins matched by rule 5 (see the
main text) are indicated by light dotted lines. A protein that cannot be matched is
indicated by the absence of lines and a question mark.

5. Matching by position in the set of records (Figure 3). Generally, the rule is
that if a protein is bounded by either matched proteins or the end of the DNA
sequence, it is assumed to be in the same chain (that is, encoded by the same gene).
This algorithm will occasionally make mistakes. However, it is usually correct,
and the trail of ‘gi’s can be very useful.
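Rules 3 to 5 can be illustrated with a minimal Python sketch. It is not the ID database code: the `Protein` class, the `match_proteins` function, and the sample sequences and coordinates are hypothetical, and both protein lists are assumed to be already ordered by position on the nucleic acid. Rule 5 is interpreted here as pairing an unmatched new protein with an unmatched old protein only when each is the single unmatched protein between the same pair of anchors (matched proteins or the ends of the DNA sequence); the text does not spell out that detail.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Protein:
    """Hypothetical stand-in for a protein Bioseq."""
    sequence: str
    location: Tuple[int, int]  # (start, end) on the nucleic acid Bioseq


def match_proteins(
    new: List[Protein], old: List[Protein]
) -> List[Tuple[Optional[int], Optional[int]]]:
    """For each new protein, return (index of matched old protein, rule number).

    Unmatched proteins yield (None, None). Both lists are assumed to be
    sorted by position on the nucleic acid.
    """
    matches = {}      # new index -> (old index, rule number)
    used_old = set()  # old indices that are already matched

    def apply_rule(rule, key):
        for i, p in enumerate(new):
            if i in matches:
                continue
            for j, q in enumerate(old):
                if j not in used_old and key(p) == key(q):
                    matches[i] = (j, rule)
                    used_old.add(j)
                    break

    apply_rule(3, lambda p: p.sequence)  # rule 3: exact sequence identity
    apply_rule(4, lambda p: p.location)  # rule 4: exact location identity

    # Rule 5 (assumed interpretation): between two anchors, pair the single
    # unmatched new protein with the single unmatched old protein.
    anchors = sorted((i, j) for i, (j, _) in matches.items())
    bounds = [(-1, -1)] + anchors + [(len(new), len(old))]  # ends of the DNA
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        new_gap = list(range(i0 + 1, i1))
        old_gap = [j for j in range(j0 + 1, j1) if j not in used_old]
        if len(new_gap) == 1 and len(old_gap) == 1:
            matches[new_gap[0]] = (old_gap[0], 5)
            used_old.add(old_gap[0])

    return [matches.get(i, (None, None)) for i in range(len(new))]


# Made-up example data: one protein per rule, plus one that cannot be matched.
old_records = [
    Protein("MKT", (0, 9)),
    Protein("MAG", (22, 31)),
    Protein("MLLW", (40, 52)),
]
new_records = [
    Protein("MKT", (0, 9)),      # same sequence        -> rule 3
    Protein("MAGW", (20, 32)),   # between two anchors  -> rule 5
    Protein("MLLC", (40, 52)),   # same location only   -> rule 4
    Protein("MQQ", (60, 69)),    # no old counterpart   -> unmatched
]
result = match_proteins(new_records, old_records)
print(result)  # [(0, 3), (1, 5), (2, 4), (None, None)]
```

The last new protein corresponds to the question mark in Figure 3: it is bounded by a matched protein and the end of the sequence, but no unmatched old protein remains for it to inherit a chain from.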

Conversion to ‘gi’ Sequence Identifiers<br />

During loading into the ID database, sequence identifiers used as pointers to Bioseqs
are converted to ‘gi’ type sequence identifiers. This allows any subpiece of the
ASN.1 message to be used as an independent object by later software, even if
separate from its original Bioseq.

A consequence of converting all other pointer sequence identifiers to ‘gi’
identifiers is that if a record points to a sequence identifier not yet known to ID,
the sequence identifier cannot be converted. When such “sought” (currently unknown)
identifiers are defined by having their sequence loaded into ID, the original record
is altered to point to it. This provides some independence from the order in which
data are added to ID and guarantees that all the sequence identifiers that can be
converted to ‘gi’ identifiers have been converted, but at the cost of additional
computation and increased complexity in the processing code.
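The handling of “sought” identifiers might be sketched as follows. This is an illustration under stated assumptions, not the actual ID loader: the `IdStore` class, its attributes, the sequential integer ‘gi’ values, and the accession strings are all hypothetical.

```python
class IdStore:
    """Hypothetical toy model of pointer conversion in the ID database."""

    def __init__(self):
        self._next_gi = 1
        self.gi_by_seqid = {}  # known sequence identifier -> gi
        self.pointers = {}     # gi of a record -> list of its pointers
        self.sought = {}       # unknown identifier -> gis of waiting records

    def load(self, seqid, points_to=()):
        """Load a record; convert every pointer that can be converted to a gi."""
        gi = self._next_gi
        self._next_gi += 1
        self.gi_by_seqid[seqid] = gi

        converted = []
        for target in points_to:
            if target in self.gi_by_seqid:
                converted.append(self.gi_by_seqid[target])
            else:
                # Not yet known to ID: keep the original identifier for now
                # and remember that this record is waiting for it.
                converted.append(target)
                self.sought.setdefault(target, []).append(gi)
        self.pointers[gi] = converted

        # The new record may define an identifier that earlier records
        # sought; alter those records to point to the new gi.
        for waiting_gi in self.sought.pop(seqid, []):
            self.pointers[waiting_gi] = [
                gi if target == seqid else target
                for target in self.pointers[waiting_gi]
            ]
        return gi


store = IdStore()
first = store.load("M12345", points_to=["X99999"])  # X99999 is not yet known
second = store.load("X99999")                       # now it is defined
print(store.pointers[first] == [second])            # True
```

Every load has to check the table of sought identifiers and may rewrite earlier records, which is where the extra computation and code complexity mentioned above come from.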
