14.06.2013 Views

Databases and Systems


figure 1, and all flows into the ID database, which is described in a subsequent section.

Conversion to ASN.1

Transforming data into the same format allows them to be used by software independently of the original source, a major requirement for integration. Within this uniform format the accuracy of the identifiers used to link, or point, to other records determines the accuracy and actual meaning of the links themselves. For literature articles, NCBI has added an integral identifier, the PubMed identifier. For convenience, many articles contain both a MEDLINE unique identifier and a PubMed identifier. Genetic sequence identifiers are more complex and can occur in different forms and places within each data record. Before this integration and use of sequence identifiers can be explained, the way in which NCBI represents the data must be clear.

So that the databases and other later applications can use a common format and be insulated from parsing the input data, the diverse streams of data are converted to Abstract Syntax Notation One (ASN.1), an international standard [7]. When using ASN.1, one file describes the format to which other files (messages) of data must correspond. This file can be considered the definition of the message format. The data conforming to the message format can be understood by later application software, which is thereby insulated from details of the original input formats. The messages use a “tag-value” format, in that an identifier describes the value that follows; however, analogous to programming language record definitions, the ability to use both recursion and user-defined types in the definition of the message format allows for almost infinite complexity in the messages themselves. Since ASN.1 messages can thus become rather complex and are not intended to be human readable, other report formats, such as GenBank flatfile format, are used to display these data.
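The relationship between a message definition and the messages that must conform to it can be illustrated with a small sketch. It is not ASN.1 and does not use the NCBI toolkit; the type names (`Id`, `Record`), the field names, and the helper functions `conforms` and `render` are all hypothetical, chosen only to show a user-defined type, a recursive type, and a tag-value rendering.

```python
# Hypothetical message definition, loosely in the spirit of an ASN.1
# specification. None of these names are taken from "asn.all".
SPEC = {
    # A user-defined type with two primitive fields.
    "Id": {"db": str, "tag": int},
    # A recursive type: a Record contains a list of Records.
    "Record": {"id": "Id", "title": str, "parts": ["Record"]},
}


def conforms(value, typ):
    """Return True if `value` matches `typ` as described by SPEC."""
    if isinstance(typ, list):
        # A one-element list means "sequence of that element type".
        return isinstance(value, list) and all(conforms(v, typ[0]) for v in value)
    if isinstance(typ, str):
        # A string is a reference to a named type in SPEC.
        fields = SPEC[typ]
        return (
            isinstance(value, dict)
            and set(value) == set(fields)
            and all(conforms(value[name], t) for name, t in fields.items())
        )
    # Otherwise `typ` is a primitive Python type such as str or int.
    return isinstance(value, typ)


def render(value):
    """Render a message as text in which each tag precedes its value."""
    if isinstance(value, dict):
        return "{ " + ", ".join(f"{k} {render(v)}" for k, v in value.items()) + " }"
    if isinstance(value, list):
        return "{ " + ", ".join(render(v) for v in value) + " }"
    if isinstance(value, str):
        return '"' + value + '"'
    return str(value)


# A message that nests one Record inside another.
message = {
    "id": {"db": "example", "tag": 1},
    "title": "whole",
    "parts": [
        {"id": {"db": "example", "tag": 2}, "title": "part", "parts": []},
    ],
}

# A message that violates the definition: `tag` must be an integer.
bad_message = {"id": {"db": "example", "tag": "2"}, "title": "x", "parts": []}

if __name__ == "__main__":
    print(conforms(message, "Record"))
    print(conforms(bad_message, "Record"))
    print(render(message["id"]))
```

Because the checker reads only `SPEC`, application code that accepts a conforming message does not need to know which input format the data originally came from, which is the insulation described above.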

“asn.all” ASN.1 message definition

The particular format, or message definition, plays a central role. It is what describes the syntax into which all the sequence record information must be parsed. The original definition proposed by Jim Ostell [8] has been used with minor modifications for over five years. It is available as the file “asn.all” in the NCBI toolkit distribution (ftp:ncbi.nlm.nih.gov; directory toolbox/ncbi_tools) and is discussed in detail in the NCBI Programmers Toolkit documentation [8]. “asn.all” describes the format of both the literature and genetic sequence messages. NCBI also makes use of other ASN.1 message definitions.

So that the data in the ASN.1 messages can be used by software, C language structures that map fairly closely to the “asn.all” definitions were designed, as well as software (object loaders) that could write these messages from C language structures to files and read them back. These original structures and object loaders were hand crafted. More recently, the program “asncode” was written by the author and

