14.06.2013 Views

Databases and Systems


made available as part of the NCBI toolkit distribution. “asncode” takes ASN.1 message definitions and automatically generates both the C language structure definitions and the object loaders. This “asncode”-generated software is used in network communication for the Entrez and Blast systems and in the MMDB [9] structure-manipulation software, and could easily be used by software developers for most ASN.1 applications.

Sequence record types

In the “asn.all” definition, a “Bioseq” is a biological sequence that can contain information in addition to the sequence. For the purposes of this discussion, a Bioseq can be considered a collection of a “sequence” block, an “id” block, a “history” block, a “descriptor” block, and an “annotation” block. Each Bioseq can hold a set of synonymous sequence identifiers in its “id” block. The semantics of the definition in “asn.all” are that the sequence identifiers in this set are names for this Bioseq. Sequence identifiers that occur elsewhere in the record are pointers to the Bioseqs that contain those sequence identifiers in their “id” block. “Descriptors” provide information that can apply to the entire sequence, such as citations to the literature and taxonomic classifications for the source of the material that led to the sequence information. Feature “annotation” applies to a particular region of the sequence, such as a coding region. These feature annotations use a sequence identifier to point to the Bioseq containing the sequence being described.
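
The relationship between the “id” block and feature annotations can be illustrated with a minimal Python sketch. It is not the “asn.all” definition or any NCBI toolkit API: the class names, field names, and the identifier strings used with it are hypothetical, and the “history” block is left out for brevity. The sketch only shows identifiers acting as names in one place and as pointers everywhere else.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Annotation on a region of a sequence (hypothetical model)."""
    kind: str        # e.g. "coding region"
    target_id: str   # sequence identifier pointing at the annotated Bioseq
    start: int       # 0-based, inclusive
    stop: int        # 0-based, exclusive


@dataclass
class Bioseq:
    """Simplified stand-in for a Bioseq; not the real asn.all layout."""
    ids: set                                          # "id" block: synonymous names
    sequence: str = ""                                # "sequence" block
    descriptors: dict = field(default_factory=dict)   # applies to the whole sequence
    annotations: list = field(default_factory=list)   # Feature objects


def build_index(bioseqs):
    """Map every identifier in an "id" block to the Bioseq it names."""
    index = {}
    for bioseq in bioseqs:
        for seq_id in bioseq.ids:
            index[seq_id] = bioseq
    return index


def feature_sequence(feature, index):
    """Follow the feature's identifier and return the region it describes."""
    target = index[feature.target_id]
    return target.sequence[feature.start:feature.stop]
```

With a Bioseq named by two synonymous (made-up) identifiers, a feature may use either name to reach the same sequence, which is why the identifiers in an “id” block have to be treated as equivalent.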

Bioseq sets

In the “asn.all” definition, a sequence entry can be either a single Bioseq or a more complex set of Bioseqs. One example of a more complex set is the set of protein Bioseqs combined with the nucleic acid that encodes them. A Bioseq can either contain actual sequence or incorporate sequence information solely by reference to other Bioseqs that actually contain the sequence. An example of this incorporation by reference is the set of Bioseqs comprising the exons of a gene. This is the most common type of “segmented set”. In this case, the set begins with a Bioseq that points to the Bioseqs that contain the exon sequences. These pointers specify the order of the exons by referencing the names (sequence identifiers) of the Bioseqs containing the exon sequences. The Bioseqs containing the actual raw sequences for the exons are, in this case, part of the same Bioseq set as the Bioseq pointing to them. Significantly, this “pointer” Bioseq, which has no sequence of its own and only incorporates sequence by reference, can be processed by the NCBI software system in the same way as any other Bioseq.
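
The idea that a “pointer” Bioseq can be handled like any other Bioseq can be sketched as follows. This is an illustrative Python model under stated assumptions, not the NCBI implementation: the `Bioseq` class, its `segments` field, and the `get_sequence` helper are all hypothetical names. A single accessor returns raw sequence when it is present and otherwise assembles it by following the ordered sequence identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class Bioseq:
    """Simplified stand-in for a Bioseq (hypothetical fields)."""
    ids: set
    sequence: str = ""                            # raw sequence; empty for a pointer Bioseq
    segments: list = field(default_factory=list)  # ordered identifiers of referenced Bioseqs


def build_index(bioseq_set):
    """Map each identifier to the Bioseq in the set that carries it."""
    return {seq_id: bioseq for bioseq in bioseq_set for seq_id in bioseq.ids}


def get_sequence(bioseq, index):
    """Return the sequence of any Bioseq, following references where needed."""
    if bioseq.segments:
        # Pointer Bioseq: concatenate the referenced sequences in the stated order.
        return "".join(get_sequence(index[seg_id], index) for seg_id in bioseq.segments)
    return bioseq.sequence


# A segmented set: the pointer Bioseq comes first, followed by the exons it names.
exon1 = Bioseq(ids={"exon1"}, sequence="ATGGC")
exon2 = Bioseq(ids={"exon2"}, sequence="TTAA")
gene = Bioseq(ids={"gene"}, segments=["exon1", "exon2"])
index = build_index([gene, exon1, exon2])
```

Calling `get_sequence(gene, index)` and `get_sequence(exon1, index)` goes through the same function, so callers do not need to know whether a Bioseq holds its sequence directly or by reference.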

The entries in the Genomes division of Entrez are another example of how pointers incorporate sequence data by reference to other Bioseqs. However, they differ from the sequence entry of the segmented set in that the Bioseqs containing the raw sequence are not in the same entry. This makes it critical for the sequence identifiers to be used accurately and uniquely.
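
Why uniqueness matters once references cross entry boundaries can be seen in a small Python sketch. The entry layout, the function names, and the identifiers are assumptions made for illustration only and do not reflect how Entrez stores or resolves records. The index refuses to build if one identifier names Bioseqs in two places, because a pointer using that identifier would then be ambiguous.

```python
def build_global_index(entries):
    """Index Bioseqs across entries.

    `entries` maps an entry name to a list of (ids, sequence) pairs,
    where `ids` is a tuple of synonymous identifiers (hypothetical layout).
    """
    index = {}
    for entry_name, bioseqs in entries.items():
        for ids, sequence in bioseqs:
            for seq_id in ids:
                if seq_id in index:
                    # The same name on two Bioseqs makes every pointer to it ambiguous.
                    raise ValueError(
                        f"identifier {seq_id!r} is used in both "
                        f"{index[seq_id][0]!r} and {entry_name!r}"
                    )
                index[seq_id] = (entry_name, sequence)
    return index


def resolve(segment_ids, index):
    """Assemble sequence from Bioseqs that may live in other entries."""
    missing = [seg_id for seg_id in segment_ids if seg_id not in index]
    if missing:
        # An inaccurate identifier cannot be followed to any Bioseq.
        raise KeyError(f"unresolved sequence identifiers: {missing}")
    return "".join(index[seg_id][1] for seg_id in segment_ids)
```

An inaccurate identifier surfaces as a `KeyError` during resolution, and a reused one as a `ValueError` while indexing; both correspond to the failure modes the text warns about.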
