28.02.2013 Views

Bio-medical Ontologies Maintenance and Change Management

Bio-medical Ontologies Maintenance and Change Management

Bio-medical Ontologies Maintenance and Change Management

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Protein Data Integration Problem 59<br />

is difficult: The different database projects may use different terms to refer to the<br />

same concept <strong>and</strong> the same terms to refer to different concepts. Furthermore, these<br />

terms are typically not formally linked with each other in any way. GO seeks to<br />

reveal these underlying biological functionalities by providing a structured controlled<br />

vocabulary that can be used to describe gene products, <strong>and</strong> is shared between<br />

biological databases. This facilitates querying for gene products that share<br />

biologically meaningful attributes, whether from separate databases or within the<br />

same database.<br />

Association between ontology nodes <strong>and</strong> proteins, namely, protein annotation<br />

through gene ontology, is an integral application of GO. To efficiently annotate<br />

proteins, the GO Consortium developed a software platform, the GO Engine,<br />

which combines rigorous sequence homology comparison with text information<br />

analysis. During evolution, many new genes arose through mutation, duplication,<br />

<strong>and</strong> recombination of the ancestral genes. When one species evolved into another,<br />

the majority of orthologs retained very high levels of homology. The high<br />

sequence similarity between orthologs forms one of the foundations of the GO<br />

Engine.<br />

Text information related to individual genes or proteins is immersed in the vast<br />

ocean of bio<strong>medical</strong> literature. Manual review of the literature to annotate proteins<br />

presents a daunting task. Several recent papers described the development of various<br />

methods for the automatic extraction of text information (Jenssen et al., 2001,<br />

Li et al., 2000). However, the direct applications of these approaches in GO annotation<br />

have been minimal. A simple correlation of text information with specific<br />

GO nodes in the training data predicts GO association for unannotated proteins.<br />

The GO Engine combines homology information, a unique protein-clustering procedure,<br />

<strong>and</strong> text information analysis to create the best possible annotations.<br />

Recently, Protein Data Bank (PDB) also released versions of the PDB Exchange<br />

Dictionary <strong>and</strong> the PDB archival files in XML format collectively named<br />

PDBML (Westbrook et al., 2005). The representation of PDB data in XML builds<br />

from content of PDB Exchange Dictionary, both for assignment of data item<br />

names <strong>and</strong> defining data organization. PDB Exchange <strong>and</strong> XML Representations<br />

use the same logical data organization. A side effect of maintaining a logical correspondence<br />

with PDB Exchange representation is that PDBML lacks the hierarchical<br />

structure characteristic of XML data. A directed acyclic graph (DAG) based<br />

ontology induction tool (Mani et al., 2004) is also used to constructs a protein ontology<br />

including protein names found in MEDLINE abstracts <strong>and</strong> in UNIPROT. It<br />

is a typical example of text mining the literature <strong>and</strong> the data sources. It can’t be<br />

classified as protein ontology as it represents only the relationship between protein<br />

literatures <strong>and</strong> does not formalize knowledge about the protein synthesis process.<br />

Recently, the Genomes to Life Initiative was introduced (Frazier et al., 2003a,<br />

Frazier et al., 2003b) close to completion of Human Genome Project (HGP) which<br />

finished in April 2003 (Collins et al., 2003). Lessons learnt from HGP will guide<br />

ongoing management <strong>and</strong> coordination of GTL. GTL states an objective: “To correlate<br />

information about multiprotein machines with data in major protein databases<br />

to better underst<strong>and</strong> sequence, structure <strong>and</strong> function of protein machines.”<br />

This objective can be achieved to some extent only by creating a Generic Protein

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!