What is Bioinformatics? A Proposed Definition and Overview of the ...


…ed from a common ancestral gene, and paralogues, proteins that are related by gene duplication within a genome [35]. Normally, orthologues retain the same function while paralogues evolve distinct, but related functions [36].

An important concept that arises from these observations is that of a finite “parts list” for different organisms [37-39]: an inventory of proteins contained within an organism, arranged according to different properties such as gene sequence, protein fold or function. Taking protein folds as an example, we mentioned that with a few exceptions, the tertiary structures of proteins adopt one of a limited repertoire of folds. As the number of different fold families is considerably smaller than the number of genes, categorising the proteins by fold provides a substantial simplification of the contents of a genome. Similar simplifications can be provided by other attributes such as protein function. As such, we expect this notion of a finite parts list to become increasingly common in future genomic analyses.

Clearly, an essential aspect of managing this large volume of data lies in developing methods for assessing similarities between different biomolecules and identifying those that are related. There are well-documented classifications for all of the main types of data we described earlier.
Although detailed descriptions of these classification systems are beyond the scope of the current review, they are of great importance as they ease comparisons between genomes and their products. Links to the major databases are available from our supplementary website.

3.2 Data Integration

The most profitable research in bioinformatics often results from integrating multiple sources of data [40]. For instance, the 3D coordinates of a protein are more useful if combined with data about the protein’s function, occurrence in different genomes, and interactions with other molecules. In this way, individual pieces of information are put in context with respect to other data. Unfortunately, it is not always straightforward to access and cross-reference these sources of information because of differences in nomenclature and file formats.

At a basic level, this problem is frequently addressed by providing external links to other databases. For example in PDBsum, web-pages for individual structures direct the user towards corresponding entries in the PDB, NDB, CATH, SCOP and SWISS-PROT databases.
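
As a rough illustration of the cross-referencing problem described above, the following minimal sketch merges records from two sources that use different identifier schemes, by way of a link table. All identifiers, field names and values here are invented for illustration; they do not reflect the actual contents or formats of PDBsum, SWISS-PROT or any other database.

```python
# Hypothetical records from a structure source, keyed by its own identifiers.
structures = {
    "STRUCT1": {"resolution_angstrom": 2.1},
    "STRUCT2": {"resolution_angstrom": 1.8},
}

# Hypothetical records from a sequence/function source, keyed differently.
functions = {
    "SEQ_A": {"function": "transcription factor"},
    "SEQ_B": {"function": "kinase"},
}

# Hypothetical link table relating the two naming schemes.
# STRUCT3/SEQ_C appears here but in neither source, to show a dangling link.
xrefs = {"STRUCT1": "SEQ_A", "STRUCT2": "SEQ_B", "STRUCT3": "SEQ_C"}


def integrate(structures, functions, xrefs):
    """Merge structure and function records that share a cross-reference."""
    merged = {}
    for structure_id, sequence_id in xrefs.items():
        # Keep only entries that are present in both sources.
        if structure_id in structures and sequence_id in functions:
            merged[structure_id] = {
                "sequence_id": sequence_id,
                **structures[structure_id],
                **functions[sequence_id],
            }
    return merged


integrated = integrate(structures, functions, xrefs)
for structure_id, record in sorted(integrated.items()):
    print(structure_id, record)
```

In this sketch the link table plays the role of the external links mentioned above: each source keeps its own nomenclature, and only the mapping between identifiers needs to be maintained.
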
At a more advanced level, there have been efforts to integrate access across several data sources. One is the Sequence Retrieval System, SRS [41], which allows flat-file databases to be indexed to each other; this allows the user to retrieve, link and access entries from nucleic acid, protein sequence, protein motif, protein structure and bibliographic databases. Another is the Entrez facility [42], which provides similar gateways to DNA and protein sequences, genome mapping data, 3D macromolecular structures and the PubMed bibliographic database [43]. A search for a particular gene in either database will allow smooth transitions to the genome it comes from, the protein sequence it encodes, its structure, bibliographic reference and equivalent entries for all related genes. In our own group, we have developed the SPINE [44] and PartsList [39] web resources; these databases integrate many types of experimental data and organise them using the concept of the finite “parts list” we described above.

4. “…UNDERSTAND and Organise the Information…”

Having examined the data, we can discuss the types of analyses that are conducted. As shown in Table 1, the broad subject areas in bioinformatics can be separated according to the type of information that is used.
For raw DNA sequences, investigations involve separating coding and non-coding regions, and identification of introns, exons and promoter regions for annotating genomic DNA [45, 46]. For protein sequences, analyses include developing algorithms for sequence comparisons [47], methods for producing multiple sequence alignments [48], and searching for functional domains from conserved sequence motifs in such alignments. Investigations of structural data include prediction of secondary and tertiary protein structures, producing methods for 3D structural alignments [49, 50], examining protein geometries using distance and angular measurements, calculations of surface and volume shapes and analysis of protein interactions with other subunits, DNA, RNA and smaller molecules. These studies have led to molecular simulation topics in which structural data are used to calculate the energetics involved in stabilising macromolecular structures, simulating movements within macromolecules, and computing the energies involved in molecular docking. The increasing availability of annotated genomic sequences has resulted in the introduction of computational genomics and proteomics – large-scale analyses of complete genomes and the proteins that they encode.
Research includes characterisation of protein content and metabolic pathways between different genomes, identification of interacting proteins, assignment and prediction of gene products, and large-scale analyses of gene expression levels. Some of these research topics will be demonstrated in our example analysis of transcription regulatory systems.

Other subject areas we have included in Table 1 are: development of digital libraries for automated bibliographical searches, knowledge bases of biological information from the literature, DNA analysis methods in forensics, prediction of nucleic acid structures, metabolic pathway simulations, and linkage analysis – linking specific genes to different disease traits.

In addition to finding relationships between different proteins, much of bioinformatics involves the analysis of one type of data to infer and understand the observations for another type of data. An example is the use of sequence and structural data to predict the secondary and tertiary structures of new protein sequences [51]. These methods, especially the former, are often based on statistical rules derived from structures, such as the propensity for certain amino acid sequences to produce

Method Inform Med 4/2001
