09.07.2015 Views

Ontology engineering


correspondence

© 2010 Nature America, Inc. All rights reserved.

[Figure 1 labels: axis from 0 to 14 bits; example terms ‘Biological process’, ‘Macromolecule metabolism’, ‘Defense response’, ‘Cell proliferation’.]

single genes 5,6. GO has become a primary provider of these gene sets, and researchers use its graphical structure to identify the specificity of a gene class so that they will compare classes of the same specificity 7. Previous studies have found that the structure of GO does not conform to expected intuitions regarding the structure and distributions of ontology terms 8,9. Gene enrichment methods typically use the structure of ontologies as a proxy for the specificity of a term 10,11 or, in some cases, use automated procedures to identify structural biases and to compensate for them in the analysis 7,8,12. Unfortunately, in some cases, even these compensative methods are unable to reach the same conclusions as a well-calibrated ontology (Supplementary Notes 1).

The approach we advocate here aims to solve the problem at its root by optimizing the structure of the ontology so that it will indeed be an accurate representation of the informational specificity of any term in the ontology. This approach would not only avoid the necessity to compensate for biases but also improve the semantic transparency of the ontology structure. To do so, we introduce an automated method for engineering the structure of GO based on the information content of each single term. The intuition behind this method is that ontologies are information systems and, as such, they can be optimized using the well-established mathematics of information theory. Given its mathematical nature, this optimization process can be automated, thus producing a principled and scalable architecture to engineer GO and, analogously, other biomedical ontologies.

Our approach starts from the quantification of information contained in the terms of the ontology.
The information content of a term is computed from the amount of annotation available for it relative to all other terms, and it is a measure of the surprise caused by labeling a gene with this term rather than with any other term (Supplementary Notes 2). For instance, if a term contains all genes, then it is not surprising for a given gene to be labeled with it, so this term does not contain much information. Thus, the more genes or gene products associated with a term, the less specific the term is and the less information is conveyed by it. This ‘surprise factor’ is called ‘self-information’, and information theory provides a formal definition for it 13 (Fig. 1).

Figure 1 Spectrum of GO terms: examples ranging from 1 to 14 bits. [Further example terms shown: ‘Heterocycle metabolism’, ‘Fructose metabolism’, ‘Terpene metabolism’, ‘Ganglion mother cell fate determination’.]

Using information theory, we analyzed the evolution of the information content of GO across time, examining 2 million genes across all the organisms encoded in the ontology annotations. This process highlighted information biases and inefficiencies that may affect the usage of GO and identified those areas of the ontology that were suboptimally organized. The analysis identified three types of information inefficiencies in the structure of GO.

The first type of inefficiency arises from the variability of the information content among the terms within a given ontology level.
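
Because this and the later inefficiencies are all stated in terms of information content, a small sketch of the self-information computation may help. It assumes the probability of a term is simply the fraction of annotated genes carrying that term; the exact normalization used by the authors is given in Supplementary Notes 2 and is not reproduced here. The function name and the gene counts are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def self_information_bits(genes_with_term: int, total_genes: int) -> float:
    """Self-information, in bits, of labeling a gene with a given term.

    Assumption (not from the source): p(term) is the fraction of all
    annotated genes that carry the term, so I(term) = log2(1 / p(term)).
    """
    if not 0 < genes_with_term <= total_genes:
        raise ValueError("need 0 < genes_with_term <= total_genes")
    return math.log2(total_genes / genes_with_term)


# Hypothetical annotation counts out of 1,024 annotated genes.
TOTAL_GENES = 1024
example_counts = {
    "term covering every gene": 1024,
    "term covering a quarter of the genes": 256,
    "term covering a single gene": 1,
}

for name, count in example_counts.items():
    print(f"{name}: {self_information_bits(count, TOTAL_GENES):.1f} bits")
```

Under this assumption a term that covers every gene carries 0 bits, and each halving of the annotated gene set adds one bit, which matches the intuition that broader terms convey less information.
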
By the principle of maximum entropy, an even a priori distribution of information (where all terms in a level are equally specific and hence equally informative) is most efficient because a random experiment is most informative if the probability distribution over outcomes is uniform 13. Furthermore, gene set enrichment methods often use GO level (that is, distance from the top of the graph) as a proxy for degree of specificity 7,10,11; this strategy implicitly relies on within-level uniformity of information content. Optimally, then, all the terms in a given level would have equal specificity and, therefore, the same information content. Our analysis revealed that the original version of GO contained a large degree of such intra-level variability of information content. For example, the term ‘pilus retraction’ was originally at level 2, at the same level as terms like ‘cell cycle’ and ‘cell development’ that are actually much more general.

The second type of structural inefficiency, inter-level variability, arises from deviations in information content between levels. In general, terms become more specific as the information content of a level increases with depth in the graph. In some areas of GO, however, the mean information content decreases from one level to the next, creating an information bottleneck. In this case, most of the annotation information of the previous level is transmitted to the next through only a few terms. The larger the decrease in information content, the more severe the bottleneck.
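
The two inefficiencies described so far can be illustrated with a toy calculation. The sketch below assumes each level is summarized by the information content (in bits) of its terms, uses the population variance within a level as a stand-in for intra-level variability, and flags a bottleneck wherever the mean information content drops from one level to the next. These are illustrative choices, not the metrics defined by the authors, and the level data are invented.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple

# Hypothetical information content (bits) of the terms at each level.
# Level 2 mixes two general terms with one very specific outlier,
# loosely mimicking a highly specific term placed near the top.
toy_levels: Dict[int, List[float]] = {
    1: [1.0, 1.0],
    2: [2.0, 2.0, 11.0],
    3: [4.0, 4.0],
    4: [6.0, 8.0],
}


def level_summary(levels: Dict[int, List[float]]) -> Dict[int, Tuple[float, float]]:
    """Map each level to (mean, population variance) of its information content."""
    return {lvl: (mean(bits), pvariance(bits)) for lvl, bits in levels.items()}


def find_bottlenecks(levels: Dict[int, List[float]]) -> List[Tuple[int, int, float]]:
    """Return (level, next_level, drop) wherever the mean decreases with depth."""
    ordered = sorted(levels)
    means = {lvl: mean(levels[lvl]) for lvl in ordered}
    drops = []
    for upper, lower in zip(ordered, ordered[1:]):
        drop = means[upper] - means[lower]
        if drop > 0:  # information content should grow with depth
            drops.append((upper, lower, drop))
    return drops


for lvl, (avg, var) in sorted(level_summary(toy_levels).items()):
    print(f"level {lvl}: mean={avg:.1f} bits, variance={var:.1f}")
print("bottlenecks:", find_bottlenecks(toy_levels))
```

In the toy data the outlier inflates both the variance and the mean of level 2, so the mean falls between levels 2 and 3 and that transition is reported as a bottleneck; a larger drop would correspond to a more severe one.
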
The presence of these

Figure 2 Three-dimensional evolution of GO over ten releases from 2005 to 2007 along the three dimensions of structural inefficiency. An ontology with no inefficiency across these metrics would be at the origin (0,0,0). [Axis labels recovered from the plot: ‘Topological metric’, ‘Inter-level metric’.]

nature biotechnology volume 28 number 2 february 2010 129
