11.07.2015 Views

What is Bioinformatics? A Proposed Definition and Overview of the ...

What is Bioinformatics? A Proposed Definition and Overview of the ...

What is Bioinformatics? A Proposed Definition and Overview of the ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

347<strong>What</strong> <strong>is</strong> <strong>Bioinformatics</strong>?Fig. 1 Plot showing <strong>the</strong> growth <strong>of</strong> scientific publications in bioinformatics between 1973 <strong>and</strong> 2000. The h<strong>is</strong>togram bars(left vertical ax<strong>is</strong>) counts <strong>the</strong> total number <strong>of</strong> scientific articles relating to bioinformatics, <strong>and</strong> <strong>the</strong> black line (right verticalax<strong>is</strong>) gives <strong>the</strong> percentage <strong>of</strong> <strong>the</strong> annual total <strong>of</strong> articles relating to bioinformatics. The data are taken from PubMed.Th<strong>is</strong> needs more than just a simple textbasedsearch, <strong>and</strong> programs such as FASTA[8] <strong>and</strong> PSI-BLAST [9] must consider whatconstitutes a biologically significant match.Development <strong>of</strong> such resources dictatesexpert<strong>is</strong>e in computational <strong>the</strong>ory, as wellas a thorough underst<strong>and</strong>ing <strong>of</strong> biology.The third aim <strong>is</strong> to use <strong>the</strong>se tools to analyse<strong>the</strong> data <strong>and</strong> interpret <strong>the</strong> results in abiologically meaningful manner. Traditionally,biological studies examined individualsystems in detail, <strong>and</strong> frequently compared<strong>the</strong>m with a few that are related. Inbioinformatics, we can now conduct globalanalyses <strong>of</strong> all <strong>the</strong> available data with <strong>the</strong>aim <strong>of</strong> uncovering common principles thatapply across many systems <strong>and</strong> highlightnovel features.In th<strong>is</strong> review, we provide a systematicdefinition <strong>of</strong> bioinformatics as shown inBox 1. We focus on <strong>the</strong> first <strong>and</strong> third aimsjust described, with particular reference to<strong>the</strong> keywords: information, informatics,organ<strong>is</strong>ation, underst<strong>and</strong>ing, large-scale<strong>and</strong> practical applications. Specifically, wed<strong>is</strong>cuss <strong>the</strong> range <strong>of</strong> data that are currentlybeing examined, <strong>the</strong> databases into which<strong>the</strong>y are organ<strong>is</strong>ed, <strong>the</strong> types <strong>of</strong> analysesthat are being conducted using transcriptionregulatory systems as an example, <strong>and</strong>finally some <strong>of</strong> <strong>the</strong> major practical applications<strong>of</strong> bioinformatics.2. “…<strong>the</strong> INFORMATIONassociated with <strong>the</strong>seMolecules…”Table 1 l<strong>is</strong>ts <strong>the</strong> types <strong>of</strong> data that areanalysed in bioinformatics <strong>and</strong> <strong>the</strong> range <strong>of</strong>topics that we consider to fall within <strong>the</strong>field. Here we take a broad view <strong>and</strong> includesubjects that may not normally bel<strong>is</strong>ted. We also give approximate valuesdescribing <strong>the</strong> sizes <strong>of</strong> data being d<strong>is</strong>cussed.We start with an overview <strong>of</strong> <strong>the</strong> sources<strong>of</strong> information. Most bioinformatics analysesfocus on three primary sources <strong>of</strong> data:DNA or protein sequences, macromolecularstructures <strong>and</strong> <strong>the</strong> results <strong>of</strong> functionalgenomics experiments. Raw DNA sequencesare strings <strong>of</strong> <strong>the</strong> four base-letterscompr<strong>is</strong>ing genes, each typically 1,000 baseslong. The GenBank [2] repository <strong>of</strong>nucleic acid sequences currently holds atotal <strong>of</strong> 12.5 billion bases in 11.5 millionentries (all database figures as <strong>of</strong> April2001).At <strong>the</strong> next level are protein sequencescompr<strong>is</strong>ing strings <strong>of</strong> 20 amino acidletters.At present <strong>the</strong>re are about 400,000known protein sequences [3], with a typicalbacterial protein containing approximately300 amino acids. Macromolecular structuraldata represents a more complex form<strong>of</strong> information. There are currently 15,000entries in <strong>the</strong> Protein Data Bank, PDB[6, 7], containing atomic structures <strong>of</strong> proteins,DNA <strong>and</strong> RNA solved by x-raycrystallography <strong>and</strong> NMR. A typical PDBfile for a medium-sized protein contains <strong>the</strong>xyz-coordinates <strong>of</strong> approximately 2,000atoms.Scientific euphoria has recently centredon whole genome sequencing. As with <strong>the</strong>raw DNA sequences, genomes cons<strong>is</strong>t <strong>of</strong>strings <strong>of</strong> base-letters, ranging from 1.6million bases in Haemophilus influenzae[10] to 3 billion in humans [11, 12]. TheEntrez database [13] currently has completesequences for nearly 300 archaeal,bacterial <strong>and</strong> eukaryotic organ<strong>is</strong>ms. Inaddition to producing <strong>the</strong> raw nucleotidesequence, a lot <strong>of</strong> work <strong>is</strong> involved inprocessing th<strong>is</strong> data. An important aspect<strong>of</strong> complete genomes <strong>is</strong> <strong>the</strong> d<strong>is</strong>tinctionbetween coding regions <strong>and</strong> non-codingregions -‘junk’ repetitive sequences makingup <strong>the</strong> bulk <strong>of</strong> base sequences especially ineukaryotes. Within <strong>the</strong> coding regions,genes are annotated with <strong>the</strong>ir translatedprotein sequence, <strong>and</strong> <strong>of</strong>ten with <strong>the</strong>ircellular function.<strong>Bioinformatics</strong> – a <strong>Definition</strong> 1(Molecular) bio – informatics: bioinformatics<strong>is</strong> conceptual<strong>is</strong>ing biology interms <strong>of</strong> molecules (in <strong>the</strong> sense <strong>of</strong> Physicalchem<strong>is</strong>try) <strong>and</strong> applying “informaticstechniques” (derived from d<strong>is</strong>ciplinessuch as applied maths, computerscience <strong>and</strong> stat<strong>is</strong>tics) to underst<strong>and</strong> <strong>and</strong>organ<strong>is</strong>e <strong>the</strong> information associatedwith <strong>the</strong>se molecules, on a large scale. Inshort, bioinformatics <strong>is</strong> a managementinformation system for molecular biology<strong>and</strong> has many practical applications.1As submitted to <strong>the</strong> Oxford Engl<strong>is</strong>hDictionary.Method Inform Med 4/2001


348Luscombe, Greenbaum, GersteinTable 1 Sources <strong>of</strong> data used in bioinformatics, <strong>the</strong> quantity <strong>of</strong> each type <strong>of</strong> data that <strong>is</strong> currently (April 2001) available,<strong>and</strong> bioinformatics subject areas that utilize th<strong>is</strong> data.tities <strong>of</strong> data when experiments are conductedfor larger organ<strong>is</strong>ms <strong>and</strong> at moretime-points.Fur<strong>the</strong>r genomic-scale data includebiochemical information on metabolicpathways, regulatory networks, proteinproteininteraction data from two-hybridexperiments, <strong>and</strong> systematic knockouts <strong>of</strong>individual genes to test <strong>the</strong> viability <strong>of</strong> anorgan<strong>is</strong>m.<strong>What</strong> <strong>is</strong> apparent from th<strong>is</strong> l<strong>is</strong>t <strong>is</strong> <strong>the</strong>diversity in <strong>the</strong> size <strong>and</strong> complexity <strong>of</strong> differentdatasets. There are invariably moresequence-based data than o<strong>the</strong>rs because<strong>of</strong> <strong>the</strong> relative ease with which <strong>the</strong>y can beproduced.Th<strong>is</strong> <strong>is</strong> partly related to <strong>the</strong> greatercomplexity <strong>and</strong> information-content <strong>of</strong>individual structures or gene expressionexperiments compared to individual sequences.While more biological informationcan be derived from a single structurethan a protein sequence, <strong>the</strong> lack <strong>of</strong> depthin <strong>the</strong> latter <strong>is</strong> compensated by analysinglarger quantities <strong>of</strong> data.Method Inform Med 4/2001More recent sources <strong>of</strong> data have beenfrom functional genomics experiments, <strong>of</strong>which <strong>the</strong> most common are gene expressionstudies.We can now determine expressionlevels <strong>of</strong> almost every gene in a givencell on a whole-genome level, however<strong>the</strong>re <strong>is</strong> currently no central repository forth<strong>is</strong> data <strong>and</strong> public availability <strong>is</strong> limited.These experiments measure <strong>the</strong> amount <strong>of</strong>mRNA that <strong>is</strong> produced by <strong>the</strong> cell [14-18]under different environmental conditions,different stages <strong>of</strong> <strong>the</strong> cell cycle <strong>and</strong> differentcell types in multi-cellular organ<strong>is</strong>ms.Much <strong>of</strong> <strong>the</strong> effort has so far focused on <strong>the</strong>yeast [19-24] <strong>and</strong> human genomes [25, 26].One <strong>of</strong> <strong>the</strong> largest dataset for yeast hasmade approximately 20 time-point measurementsfor 6,000 genes [19]. However,<strong>the</strong>re <strong>is</strong> potential for much greater quan-3. “… ORGANISE <strong>the</strong> Informationon a LARGE SCALE…”3.1 Redundancy <strong>and</strong> Multiplicity<strong>of</strong> DataA concept that underpins most researchmethods in bioinformatics <strong>is</strong> that much <strong>of</strong><strong>the</strong> data can be grouped toge<strong>the</strong>r based onbiologically meaningful similarities. Forexample, sequence segments are <strong>of</strong>tenrepeated at different positions <strong>of</strong> genomicDNA [27]. Genes can be clustered intothose with particular functions (eg enzymaticactions) or according to <strong>the</strong> metabolicpathway to which <strong>the</strong>y belong [28],although here, single genes may actuallypossess several functions [29]. Goingfur<strong>the</strong>r, d<strong>is</strong>tinct proteins frequently havecomparable sequences – organ<strong>is</strong>ms <strong>of</strong>tenhave multiple copies <strong>of</strong> a particular genethrough duplication <strong>and</strong> different specieshave equivalent or similar proteins thatwere inherited when <strong>the</strong>y diverged fromeach o<strong>the</strong>r in evolution. At a structurallevel, we predict <strong>the</strong>re to be a finite number<strong>of</strong> different tertiary structures – estimatesrange between 1,000 <strong>and</strong> 10,000 folds[30, 31] – <strong>and</strong> proteins adopt equivalentstructures even when <strong>the</strong>y differ greatly insequence [32]. As a result, although <strong>the</strong>number <strong>of</strong> structures in <strong>the</strong> PDB hasincreased exponentially, <strong>the</strong> rate <strong>of</strong> d<strong>is</strong>covery<strong>of</strong> novel folds has actually decreased.There are common terms to describe <strong>the</strong>relationship between pairs <strong>of</strong> proteins or<strong>the</strong> genes from which <strong>the</strong>y are derived:analogous proteins have related folds, butunrelated sequences, while homologousproteins are both sequentially <strong>and</strong> structurallysimilar. The two categories can sometimesbe difficult to d<strong>is</strong>tingu<strong>is</strong>h especially if<strong>the</strong> relationship between <strong>the</strong> two proteins<strong>is</strong> remote [33, 34]. Among homologues, it <strong>is</strong>useful to d<strong>is</strong>tingu<strong>is</strong>h between orthologues,proteins in different species that have evolv-


349<strong>What</strong> <strong>is</strong> <strong>Bioinformatics</strong>?ed from a common ancestral gene, <strong>and</strong>paralogues, proteins that are related bygene duplication within a genome [35].Normally, orthologues retain <strong>the</strong> samefunction while paralogues evolve d<strong>is</strong>tinct,but related functions [36].An important concept that ar<strong>is</strong>es from<strong>the</strong>se observations <strong>is</strong> that <strong>of</strong> a finite “partsl<strong>is</strong>t” for different organ<strong>is</strong>ms [37-39]: aninventory <strong>of</strong> proteins contained within anorgan<strong>is</strong>m, arranged according to differentproperties such as gene sequence, proteinfold or function. Taking protein folds as anexample, we mentioned that with a fewexceptions, <strong>the</strong> tertiary structures <strong>of</strong> proteinsadopt one <strong>of</strong> a limited repertoire<strong>of</strong> folds. As <strong>the</strong> number <strong>of</strong> different foldfamilies <strong>is</strong> considerably smaller than <strong>the</strong>number <strong>of</strong> genes, categor<strong>is</strong>ing <strong>the</strong> proteinsby fold provides a substantial simplification<strong>of</strong> <strong>the</strong> contents <strong>of</strong> a genome. Similar simplificationscan be provided by o<strong>the</strong>r attributessuch as protein function. As such, weexpect th<strong>is</strong> notion <strong>of</strong> a finite parts l<strong>is</strong>t tobecome increasingly common in futuregenomic analyses.Clearly, an essential aspect <strong>of</strong> managingth<strong>is</strong> large volume <strong>of</strong> data lies in developingmethods for assessing similarities betweendifferent biomolecules <strong>and</strong> identifyingthose that are related. There are well-documentedclassifications for all <strong>of</strong> <strong>the</strong> maintypes <strong>of</strong> data we described earlier. Althoughdetailed descriptions <strong>of</strong> <strong>the</strong>se classificationsystems are beyond <strong>the</strong> scope <strong>of</strong><strong>the</strong> current review, <strong>the</strong>y are <strong>of</strong> great importanceas <strong>the</strong>y ease compar<strong>is</strong>ons betweengenomes <strong>and</strong> <strong>the</strong>ir products. Links to <strong>the</strong>major databases are available from oursupplementary website.3.2 Data IntegrationThe most pr<strong>of</strong>itable research in bioinformatics<strong>of</strong>ten results from integrating multiplesources <strong>of</strong> data [40]. For instance, <strong>the</strong>3D coordinates <strong>of</strong> a protein are more usefulif combined with data about <strong>the</strong> protein’sfunction, occurrence in different genomes,<strong>and</strong> interactions with o<strong>the</strong>r molecules. Inth<strong>is</strong> way, individual pieces <strong>of</strong> informationare put in context with respect to o<strong>the</strong>rdata. Unfortunately, it <strong>is</strong> not alwaysstraightforward to access <strong>and</strong> crossreference<strong>the</strong>se sources <strong>of</strong> information because<strong>of</strong> differences in nomenclature <strong>and</strong>file formats.At a basic level, th<strong>is</strong> problem <strong>is</strong> frequentlyaddressed by providing externallinks to o<strong>the</strong>r databases. For example inPDBsum, web-pages for individual structuresdirect <strong>the</strong> user towards correspondingentries in <strong>the</strong> PDB, NDB, CATH, SCOP<strong>and</strong> SWISS-PROT databases. At a moreadvanced level, <strong>the</strong>re have been efforts tointegrate access across several data sources.One <strong>is</strong> <strong>the</strong> Sequence Retrieval System, SRS[41], which allows flat-file databases to beindexed to each o<strong>the</strong>r; th<strong>is</strong> allows <strong>the</strong> userto retrieve, link <strong>and</strong> access entries fromnucleic acid, protein sequence, proteinmotif, protein structure <strong>and</strong> bibliographicdatabases. Ano<strong>the</strong>r <strong>is</strong> <strong>the</strong> Entrez facility[42], which provides similar gateways toDNA <strong>and</strong> protein sequences, genomemapping data, 3D macromolecular structures<strong>and</strong> <strong>the</strong> PubMed bibliographic database[43].A search for a particular gene in ei<strong>the</strong>rdatabase will allow smooth transitions to<strong>the</strong> genome it comes from, <strong>the</strong> proteinsequence it encodes, its structure, bibliographicreference <strong>and</strong> equivalent entries forall related genes. In our own group, we havedeveloped <strong>the</strong> SPINE [44] <strong>and</strong> PartsL<strong>is</strong>t[39] web resources; <strong>the</strong>se databases integratemany types <strong>of</strong> experimental data <strong>and</strong>organ<strong>is</strong>e <strong>the</strong>m using <strong>the</strong> concept <strong>of</strong> <strong>the</strong>finite “parts l<strong>is</strong>t” we described above.4. “…UNDERSTAND <strong>and</strong>Organ<strong>is</strong>e <strong>the</strong> Information…”Having examined <strong>the</strong> data, we can d<strong>is</strong>cuss<strong>the</strong> types <strong>of</strong> analyses that are conducted.Asshown in Table 1, <strong>the</strong> broad subject areas inbioinformatics can be separated accordingto <strong>the</strong> type <strong>of</strong> information that <strong>is</strong> used. Forraw DNA sequences, investigations involveseparating coding <strong>and</strong> non-coding regions,<strong>and</strong> identification <strong>of</strong> introns, exons <strong>and</strong>promoter regions for annotating genomicDNA [45, 46]. For protein sequences, analysesinclude developing algorithms forsequence compar<strong>is</strong>ons [47], methods forproducing multiple sequence alignments[48], <strong>and</strong> searching for functional domainsfrom conserved sequence motifs in suchalignments. Investigations <strong>of</strong> structuraldata include prediction <strong>of</strong> secondary <strong>and</strong>tertiary protein structures, producingmethods for 3D structural alignments [49,50], examining protein geometries usingd<strong>is</strong>tance <strong>and</strong> angular measurements, calculations<strong>of</strong> surface <strong>and</strong> volume shapes <strong>and</strong>analys<strong>is</strong> <strong>of</strong> protein interactions with o<strong>the</strong>rsubunits, DNA, RNA <strong>and</strong> smaller molecules.These studies have lead to molecularsimulation topics in which structural dataare used to calculate <strong>the</strong> energetics involvedin stabil<strong>is</strong>ing macromolecular structures,simulating movements within macromolecules,<strong>and</strong> computing <strong>the</strong> energiesinvolved in molecular docking. The increasingavailability <strong>of</strong> annotated genomicsequences has resulted in <strong>the</strong> introduction<strong>of</strong> computational genomics <strong>and</strong> proteomics– large-scale analyses <strong>of</strong> complete genomes<strong>and</strong> <strong>the</strong> proteins that <strong>the</strong>y encode. Researchincludes character<strong>is</strong>ation <strong>of</strong> proteincontent <strong>and</strong> metabolic pathways betweendifferent genomes, identification <strong>of</strong> interactingproteins, assignment <strong>and</strong> prediction <strong>of</strong>gene products, <strong>and</strong> large-scale analyses <strong>of</strong>gene expression levels. Some <strong>of</strong> <strong>the</strong>se researchtopics will be demonstrated in ourexample analys<strong>is</strong> <strong>of</strong> transcription regulatorysystems.O<strong>the</strong>r subject areas we have included inTable 1 are: development <strong>of</strong> digital librariesfor automated bibliographical searches,knowledge bases <strong>of</strong> biological informationfrom <strong>the</strong> literature, DNA analys<strong>is</strong> methodsin forensics, prediction <strong>of</strong> nucleic acid structures,metabolic pathway simulations, <strong>and</strong>linkage analys<strong>is</strong> – linking specific genes todifferent d<strong>is</strong>ease traits.In addition to finding relationships betweendifferent proteins, much <strong>of</strong> bioinformaticsinvolves <strong>the</strong> analys<strong>is</strong> <strong>of</strong> one type<strong>of</strong> data to infer <strong>and</strong> underst<strong>and</strong> <strong>the</strong> observationsfor ano<strong>the</strong>r type <strong>of</strong> data. An example<strong>is</strong> <strong>the</strong> use <strong>of</strong> sequence <strong>and</strong> structuraldata to predict <strong>the</strong> secondary <strong>and</strong> tertiarystructures <strong>of</strong> new protein sequences [51].These methods, especially <strong>the</strong> former, are<strong>of</strong>ten based on stat<strong>is</strong>tical rules derivedfrom structures, such as <strong>the</strong> propensity forcertain amino acid sequences to produceMethod Inform Med 4/2001


350Luscombe, Greenbaum, GersteinParadigm shifts during <strong>the</strong> past couple <strong>of</strong> decades have taken much <strong>of</strong> biology away from <strong>the</strong>laboratory bench <strong>and</strong> have allowed <strong>the</strong> integration <strong>of</strong> o<strong>the</strong>r scientific d<strong>is</strong>ciplines, specificallycomputing. The result <strong>is</strong> an expansion <strong>of</strong> biological research in breadth <strong>and</strong> depth. The vertical ax<strong>is</strong>demonstrates how bioinformatics can aid rational drug design with minimal work in <strong>the</strong> wet lab.Starting with a single gene sequence, we can determine with strong certainty, <strong>the</strong> proteinsequence. From <strong>the</strong>re, we can determine <strong>the</strong> structure using structure prediction techniques. Withgeometry calculations, we can fur<strong>the</strong>r resolve <strong>the</strong> protein’s surface <strong>and</strong> through molecularsimulation determine <strong>the</strong> force fields surrounding <strong>the</strong> molecule. Finally docking algorithms canprovide predictions <strong>of</strong> <strong>the</strong> lig<strong>and</strong>s that will bind on <strong>the</strong> protein surface, thus paving <strong>the</strong> way for <strong>the</strong>design <strong>of</strong> a drug specific to that molecule. The horizontal ax<strong>is</strong> shows how <strong>the</strong> influx <strong>of</strong> biologicaldata <strong>and</strong> advances in computer technology have broadened <strong>the</strong> scope <strong>of</strong> biology. Initially with a pair<strong>of</strong> proteins, we can make compar<strong>is</strong>ons between <strong>the</strong> between sequences <strong>and</strong> structures <strong>of</strong>evolutionary related proteins. With more data, algorithms for multiple alignments <strong>of</strong> severalproteins become necessary. Using multiple sequences, we can also create phylogenetic trees totrace <strong>the</strong> evolutionary development <strong>of</strong> <strong>the</strong> proteins in question. Finally, with <strong>the</strong> deluge <strong>of</strong> data wecurrently face, we need to construct large databases to store, view <strong>and</strong> deconstruct <strong>the</strong>information. Alignments now become more complex, requiring soph<strong>is</strong>ticated scoring schemes <strong>and</strong><strong>the</strong>re <strong>is</strong> enough data to compile a genome census – a genomic equivalent <strong>of</strong> a population census –providing comprehensive stat<strong>is</strong>tical accounting <strong>of</strong> protein features in genomes.Fig. 2Organizing <strong>and</strong> underst<strong>and</strong>ing biological datadifferent secondary structural elements.Ano<strong>the</strong>r example <strong>is</strong> <strong>the</strong> use <strong>of</strong> structuraldata to underst<strong>and</strong> a protein’s function;here studies have investigated <strong>the</strong> relationshipdifferent protein folds <strong>and</strong> <strong>the</strong>irfunctions [52, 53] <strong>and</strong> analysed similaritiesbetween different binding sites in <strong>the</strong> absence<strong>of</strong> homology [54]. Combined withsimilarity measurements, <strong>the</strong>se studies provideus with an underst<strong>and</strong>ing <strong>of</strong> how muchbiological information can be accuratelyMethod Inform Med 4/2001transferred between homologous proteins[55].4.1 The <strong>Bioinformatics</strong> SpectrumFig. 2 summar<strong>is</strong>es <strong>the</strong> main points wera<strong>is</strong>ed in our d<strong>is</strong>cussions <strong>of</strong> organ<strong>is</strong>ing<strong>and</strong> underst<strong>and</strong>ing biological data – <strong>the</strong>development <strong>of</strong> bioinformatics techniqueshas allowed an expansion <strong>of</strong> biologicalanalys<strong>is</strong> in two dimension, depth <strong>and</strong>breadth. The first <strong>is</strong> represented by <strong>the</strong>vertical ax<strong>is</strong> in <strong>the</strong> figure <strong>and</strong> outlines apossible approach to <strong>the</strong> rational drugdesign process. The aim <strong>is</strong> to take a singlegene <strong>and</strong> follow through an analys<strong>is</strong> thatmaxim<strong>is</strong>es our underst<strong>and</strong>ing <strong>of</strong> <strong>the</strong>protein it encodes. Starting with a genesequence, we can determine <strong>the</strong> proteinsequence with strong certainty. From <strong>the</strong>re,prediction algorithms can be used to calcu-


351<strong>What</strong> <strong>is</strong> <strong>Bioinformatics</strong>?late <strong>the</strong> structure adopted by <strong>the</strong> protein.Geometry calculations can define <strong>the</strong>shape <strong>of</strong> <strong>the</strong> protein’s surface <strong>and</strong> molecularsimulations can determine <strong>the</strong> forcefields surrounding <strong>the</strong> molecule. Finally,using docking algorithms, one couldidentify or design lig<strong>and</strong>s that may bind<strong>the</strong> protein, paving <strong>the</strong> way for designing adrug that specifically alters <strong>the</strong> protein’sfunction. In pract<strong>is</strong>e, <strong>the</strong> intermediate stepsare still difficult to achieve accurately, <strong>and</strong><strong>the</strong>y are best combined with experimentalmethods to obtain some <strong>of</strong> <strong>the</strong> data, forexample character<strong>is</strong>ing <strong>the</strong> structure <strong>of</strong> <strong>the</strong>protein <strong>of</strong> interest.The aim <strong>of</strong> <strong>the</strong> second dimension, <strong>the</strong>breadth in biological analys<strong>is</strong>, <strong>is</strong> to comparea gene or gene product with o<strong>the</strong>rs. Initially,simple algorithms can be used tocompare <strong>the</strong> sequences <strong>and</strong> structures <strong>of</strong> apair <strong>of</strong> related proteins. With a larger number<strong>of</strong> proteins, improved algorithms can beused to produce multiple alignments, <strong>and</strong>extract sequence patterns or structuraltemplates that define a family <strong>of</strong> proteins.Using th<strong>is</strong> data, it <strong>is</strong> also possible to constructphylogenetic trees to trace <strong>the</strong> evolutionarypath <strong>of</strong> proteins. Finally, with evenmore data, <strong>the</strong> information must be storedin large-scale databases. Compar<strong>is</strong>onsbecome more complex, requiring multiplescoring schemes, <strong>and</strong> we are able to conductgenomic scale censuses that providecomprehensive stat<strong>is</strong>tical accounts <strong>of</strong>protein features, such as <strong>the</strong> abundance <strong>of</strong>particular structures or functions in differentgenomes. It also allows us to buildphylogenetic trees that trace <strong>the</strong> evolution<strong>of</strong> whole organ<strong>is</strong>ms.5. “… applying INFORMATICSTECHNIQUES…”The d<strong>is</strong>tinct subject areas we mentionrequire different types <strong>of</strong> informatics techniques.Briefly, for data organ<strong>is</strong>ation, <strong>the</strong>first biological databases were simple flatfiles. However with <strong>the</strong> increasing amount<strong>of</strong> information, relational databasemethods with Web-page interfaces havebecome increasingly popular. In sequenceanalys<strong>is</strong>, techniques include string compar<strong>is</strong>onmethods such as text search <strong>and</strong> onedimensionalalignment algorithms. Motif<strong>and</strong> pattern identification for multiplesequences depend on machine learning,clustering <strong>and</strong> data-mining techniques. 3Dstructural analys<strong>is</strong> techniques include Euclideangeometry calculations combinedwith basic application <strong>of</strong> physical chem<strong>is</strong>try,graphical representations <strong>of</strong> surfaces<strong>and</strong> volumes, <strong>and</strong> structural compar<strong>is</strong>on<strong>and</strong> 3D matching methods. For molecularsimulations, Newtonian mechanics, quantummechanics, molecular mechanics <strong>and</strong>electrostatic calculations are applied. Inmany <strong>of</strong> <strong>the</strong>se areas, <strong>the</strong> computationalmethods must be combined with goodstat<strong>is</strong>tical analyses in order to provide anobjective measure for <strong>the</strong> significance <strong>of</strong><strong>the</strong> results.6. Transcription Regulation –a Case Study in <strong>Bioinformatics</strong>DNA-binding proteins have a central rolein all aspects <strong>of</strong> genetic activity within anorgan<strong>is</strong>m, participating in processes such astranscription, packaging, rearrangement,replication <strong>and</strong> repair. In th<strong>is</strong> section, wefocus on <strong>the</strong> studies that have contributedto our underst<strong>and</strong>ing <strong>of</strong> transcriptionregulation in different organ<strong>is</strong>ms. Throughth<strong>is</strong> example, we demonstrate how bioinformaticshas been used to increase ourknowledge <strong>of</strong> biological systems <strong>and</strong> alsoillustrate <strong>the</strong> practical applications <strong>of</strong> <strong>the</strong>different subject areas that were brieflyoutlined earlier. We start by consideringstructural analyses <strong>of</strong> how DNA-bindingproteins recogn<strong>is</strong>e particular base sequences.Later, we review several genomicstudies that have character<strong>is</strong>ed <strong>the</strong> nature<strong>of</strong> transcription factors in different organ<strong>is</strong>ms,<strong>and</strong> <strong>the</strong> methods that have been usedto identify regulatory binding sites in <strong>the</strong>upstream regions. Finally, we provide anoverview <strong>of</strong> gene expression analyses thathave been recently conducted <strong>and</strong> suggestfuture uses <strong>of</strong> transcription regulatory analysesto rational<strong>is</strong>e <strong>the</strong> observations madein gene expression experiments. All <strong>the</strong>results that we describe have been foundthrough computational studies.6.1 Structural StudiesAs <strong>of</strong> April 2001, <strong>the</strong>re were 379 structures<strong>of</strong> protein-DNA complexes in <strong>the</strong> PDB.Analyses <strong>of</strong> <strong>the</strong>se structures have providedvaluable insight into <strong>the</strong> stereochemicalprinciples <strong>of</strong> binding, including how particularbase sequences are recognized<strong>and</strong> how <strong>the</strong> DNA structure <strong>is</strong> quite <strong>of</strong>tenmodified on binding.A structural taxonomy <strong>of</strong> DNA-bindingproteins, similar to that presented in SCOP<strong>and</strong> CATH, was first proposed by Harr<strong>is</strong>on[56] <strong>and</strong> periodically updated to accommodatenew structures as <strong>the</strong>y are solved[57]. The classification cons<strong>is</strong>ts <strong>of</strong> a two-tiersystem: <strong>the</strong> first level collects proteins intoeight groups that share gross structuralfeatures for DNA-binding, <strong>and</strong> <strong>the</strong> secondcompr<strong>is</strong>es 54 families <strong>of</strong> proteins that arestructurally homologous to each o<strong>the</strong>r.Assembly <strong>of</strong> such a system simplifies <strong>the</strong>compar<strong>is</strong>on <strong>of</strong> different binding methods; ithighlights <strong>the</strong> diversity <strong>of</strong> protein-DNAcomplex geometries found in nature, butalso underlines <strong>the</strong> importance <strong>of</strong> interactionsbetween -helices <strong>and</strong> <strong>the</strong> DNAmajor groove, <strong>the</strong> main mode <strong>of</strong> binding inover half <strong>the</strong> protein families. While <strong>the</strong>number <strong>of</strong> structures represented in <strong>the</strong>PDB does not necessarily reflect <strong>the</strong> relativeimportance <strong>of</strong> <strong>the</strong> different proteins in<strong>the</strong> cell, it <strong>is</strong> clear that helix-turn-helix,zinc-coordinating <strong>and</strong> leucine zipper motifsare used repeatedly.These provide compactframeworks to present <strong>the</strong> -helix on <strong>the</strong>surfaces <strong>of</strong> structurally diverse proteins. Ata gross level, it <strong>is</strong> possible to highlight <strong>the</strong>differences between transcription factordomains that “just” bind DNA <strong>and</strong> thoseinvolved in catalys<strong>is</strong> [58]. Although <strong>the</strong>reare exceptions, <strong>the</strong> former typicallyapproach <strong>the</strong> DNA from a single face <strong>and</strong>slot into <strong>the</strong> grooves to interact with baseedges. The latter commonly envelope <strong>the</strong>substrate, using complex networks <strong>of</strong>secondary structures <strong>and</strong> loops.Focusing on proteins with -helices, <strong>the</strong>structures show many variations, both inamino acid sequences <strong>and</strong> detailed geometry.They have clearly evolved independentlyin accordance with <strong>the</strong> requirements<strong>of</strong> <strong>the</strong> context in which <strong>the</strong>y are found.While achieving a close fit between <strong>the</strong>Method Inform Med 4/2001


352Luscombe, Greenbaum, Gerstein-helix <strong>and</strong> major groove, <strong>the</strong>re <strong>is</strong> enoughflexibility to allow both <strong>the</strong> protein <strong>and</strong>DNA to adopt d<strong>is</strong>tinct conformations.However, several studies that analysed <strong>the</strong>binding geometries <strong>of</strong> -helices demonstratedthat most adopt fairly uniform conformationsregardless <strong>of</strong> protein family.They are commonly inserted in <strong>the</strong> majorgroove sideways, with <strong>the</strong>ir lengthw<strong>is</strong>e ax<strong>is</strong>roughly parallel to <strong>the</strong> slope outlined by<strong>the</strong> DNA backbone. Most start with <strong>the</strong>N-terminus in <strong>the</strong> groove <strong>and</strong> extend out,completing two to three turns withincontacting d<strong>is</strong>tance <strong>of</strong> <strong>the</strong> nucleic acid [59,60].Given <strong>the</strong> similar binding orientations, it<strong>is</strong> surpr<strong>is</strong>ing to find that <strong>the</strong> interactionsbetween each amino acid position along<strong>the</strong> -helices <strong>and</strong> nucleotides on <strong>the</strong> DNAvary considerably between different proteinfamilies. However, by classifying <strong>the</strong>amino acids according to <strong>the</strong> sizes <strong>of</strong> <strong>the</strong>irside chains, we are able to rational<strong>is</strong>e <strong>the</strong>different interactions patterns. The rules <strong>of</strong>interactions are based on <strong>the</strong> simple prem<strong>is</strong>ethat for a given residue position on-helices in similar conformations, smallamino acids interact with nucleotides thatare close in d<strong>is</strong>tance <strong>and</strong> large amino acidswith those that are fur<strong>the</strong>r [60, 61]. Equi-valentstudies for binding by o<strong>the</strong>r structuralmotifs, like -hairpins, have also been conducted[62]. When considering <strong>the</strong>seinteractions, it <strong>is</strong> important to rememberthat different regions <strong>of</strong> <strong>the</strong> protein surfacealso provide interfaces with <strong>the</strong> DNA.Th<strong>is</strong> brings us to look at <strong>the</strong> atomic levelinteractions between individual aminoacid-base pairs. Such analyses are based on<strong>the</strong> prem<strong>is</strong>e that a significant proportion <strong>of</strong>specific DNA-binding could be rational<strong>is</strong>edby a universal code <strong>of</strong> recognition betweenamino acids <strong>and</strong> bases, ie whe<strong>the</strong>r certainprotein residues preferably interact withparticular nucleotides regardless <strong>of</strong> <strong>the</strong>type <strong>of</strong> protein-DNA complex [63]. Studieshave considered hydrogen bonds, van derWaals contacts <strong>and</strong> water-mediated bonds[64-66]. Results showed that about 2/3 <strong>of</strong> allinteractions are with <strong>the</strong> DNA backbone<strong>and</strong> that <strong>the</strong>ir main role <strong>is</strong> one <strong>of</strong>sequence-independent stabil<strong>is</strong>ation. Incontrast, interactions with bases d<strong>is</strong>playsome strong preferences, including <strong>the</strong>Method Inform Med 4/2001interactions <strong>of</strong> arginine or lysine withguanine, asparagine or glutamine withadenine <strong>and</strong> threonine with thymine. Suchpreferences were explained through examination<strong>of</strong> <strong>the</strong> stereochem<strong>is</strong>try <strong>of</strong> <strong>the</strong> aminoacid side chains <strong>and</strong> base edges. Alsohighlighted were more complex types <strong>of</strong>interactions where single amino acidscontact more than one base-step simultaneously,thus recogn<strong>is</strong>ing a short DNAsequence. These results suggested thatuniversal specificity, one that <strong>is</strong> observedacross all protein-DNA complexes, indeedex<strong>is</strong>ts. However, many interactions that arenormally considered to be non-specific,such as those with <strong>the</strong> DNA backbone, canalso provide specificity depending on <strong>the</strong>context in which <strong>the</strong>y are made.Armed with an underst<strong>and</strong>ing <strong>of</strong>protein structure, DNA-binding motifs <strong>and</strong>side chain stereochem<strong>is</strong>try, a major applicationhas been <strong>the</strong> prediction <strong>of</strong> bindingei<strong>the</strong>r by proteins known to contain a particularmotif, or those with structures solvedin <strong>the</strong> uncomplexed form. Most commonare predictions for -helix-major grooveinteractions – given <strong>the</strong> amino acid sequence,what DNA sequence would itrecogn<strong>is</strong>e [61, 67]. In a different approach,molecular simulation techniques have beenused to dock whole proteins <strong>and</strong> DNAs on<strong>the</strong> bas<strong>is</strong> <strong>of</strong> force-field calculations around<strong>the</strong> two molecules [68, 69].The reason that both methods havebeen met with limited success <strong>is</strong> becauseeven for apparently simple cases like -helix-binding, <strong>the</strong>re are many o<strong>the</strong>r factorsthat must be considered. Compar<strong>is</strong>onsbetween bound <strong>and</strong> unbound nucleic acidstructures show that DNA-bending <strong>is</strong> acommon feature <strong>of</strong> complexes formed withtranscription factors [58, 70].Th<strong>is</strong> <strong>and</strong> o<strong>the</strong>rfactors such as electrostatic <strong>and</strong> cationmediatedinteractions ass<strong>is</strong>t indirectrecognition <strong>of</strong> <strong>the</strong> nucleotide sequence,although <strong>the</strong>y are not well understood yet.Therefore, it <strong>is</strong> now clear that detailed rulesfor specific DNA-binding will be familyspecific, but with underlying trends such as<strong>the</strong> arginine-guanine interactions.6.2 Genomic StudiesDue to <strong>the</strong> wealth <strong>of</strong> biochemical data thatare available, genomic studies in bioinformaticshave concentrated on modelorgan<strong>is</strong>ms, <strong>and</strong> <strong>the</strong> analys<strong>is</strong> <strong>of</strong> regulatorysystems has been no exception. Identification<strong>of</strong> transcription factors in genomes invariablydepends on similarity search strategies,which assume a functional <strong>and</strong> evolutionaryrelationship between homologousproteins. In E. coli, studies have so farestimated a total <strong>of</strong> 300 to 500 transcriptionregulators [71] <strong>and</strong> PEDANT [72], a database<strong>of</strong> automatically assigned gene functions,shows that typically 2-3% <strong>of</strong> prokaryotic<strong>and</strong> 6-7% <strong>of</strong> eukaryotic genomescompr<strong>is</strong>e DNA-binding proteins.As assignmentswere only complete for 40-60% <strong>of</strong>genomes as <strong>of</strong> August 2000, <strong>the</strong>se figuresmost likely underestimate <strong>the</strong> actual number.None<strong>the</strong>less, <strong>the</strong>y already represent alarge quantity <strong>of</strong> proteins <strong>and</strong> it <strong>is</strong> clear that<strong>the</strong>re are more transcription regulatorsin eukaryotes than o<strong>the</strong>r species. Th<strong>is</strong> <strong>is</strong>unsurpr<strong>is</strong>ing, considering <strong>the</strong> organ<strong>is</strong>mshave developed a relatively soph<strong>is</strong>ticatedtranscription mechan<strong>is</strong>m.From <strong>the</strong> conclusions <strong>of</strong> <strong>the</strong> structuralstudies, <strong>the</strong> best strategy for character<strong>is</strong>ingDNA-binding <strong>of</strong> <strong>the</strong> putative transcriptionfactors in each genome <strong>is</strong> to group <strong>the</strong>mby homology <strong>and</strong> to analyse <strong>the</strong> individualfamilies. Such classifications are providedin <strong>the</strong> secondary sequence databasesdescribed earlier <strong>and</strong> also those thatspecial<strong>is</strong>e in regulatory proteins such asRegulonDB [73] <strong>and</strong> TRANSFAC [74].Of even greater use <strong>is</strong> <strong>the</strong> prov<strong>is</strong>ion <strong>of</strong>structural assignments to <strong>the</strong> proteins;given a transcription factor, it <strong>is</strong> helpful toknow <strong>the</strong> structural motif that it uses forbinding, <strong>the</strong>refore providing us with abetter underst<strong>and</strong>ing <strong>of</strong> how it recogn<strong>is</strong>es<strong>the</strong> target sequence. Structural genomicsthrough bioinformatics assigns structuresto <strong>the</strong> protein products <strong>of</strong> genomes bydemonstrating similarity to proteins <strong>of</strong>known structure [75]. These studies haveshown that prokaryotic transcription factorsmost frequently contain helix-turnhelixmotifs [71, 76] <strong>and</strong> eukaryotic factorscontain homeodomain type helix-turnhelix,zinc finger or leucine zipper motifs.


353<strong>What</strong> <strong>is</strong> <strong>Bioinformatics</strong>?From <strong>the</strong> protein classifications in eachgenome, it <strong>is</strong> clear that different types <strong>of</strong>regulatory proteins differ in abundance <strong>and</strong>families significantly differ in size. A studyby Huynen <strong>and</strong> van Nimwegen [77] hasshown that members <strong>of</strong> a single family havesimilar functions, but as <strong>the</strong> requirements<strong>of</strong> th<strong>is</strong> function vary over time, so does<strong>the</strong> presence <strong>of</strong> each gene family in <strong>the</strong>genome.Most recently, using a combination <strong>of</strong>sequence <strong>and</strong> structural data, we examined<strong>the</strong> conservation <strong>of</strong> amino acid sequencesbetween related DNA-binding proteins,<strong>and</strong> <strong>the</strong> effect that mutations have onDNA sequence recognition. The structuralfamilies described above were exp<strong>and</strong>ed toinclude proteins that are related by sequencesimilarity, but whose structures remainunsolved. Again, members <strong>of</strong> <strong>the</strong> samefamily are homologous, <strong>and</strong> probably derivefrom a common ancestor.Amino acid conservations were calculatedfor <strong>the</strong> multiple sequence alignments<strong>of</strong> each family [78]. Generally, alignmentpositions that interact with <strong>the</strong> DNA arebetter conserved than <strong>the</strong> rest <strong>of</strong> <strong>the</strong> proteinsurface, although <strong>the</strong> detailed patterns<strong>of</strong> conservation are quite complex. Residuesthat contact <strong>the</strong> DNA backbone are highlyconserved in all protein families, providinga set <strong>of</strong> stabil<strong>is</strong>ing interactions that arecommon to all homologous proteins. Theconservation <strong>of</strong> alignment positions thatcontact bases, <strong>and</strong> recogn<strong>is</strong>e <strong>the</strong> DNAsequence, are more complex <strong>and</strong> could berational<strong>is</strong>ed by defining a three-class modelfor DNA-binding. First, protein familiesthat bind non-specifically usually containseveral conserved base-contacting residues;without exception, interactions are made in<strong>the</strong> minor groove where <strong>the</strong>re <strong>is</strong> littled<strong>is</strong>crimination between base types. Thecontacts are commonly used to stabil<strong>is</strong>edeformations in <strong>the</strong> nucleic acid structure,particularly in widening <strong>the</strong> DNA minorgroove. The second class compr<strong>is</strong>e familieswhose members all target <strong>the</strong> samenucleotide sequence; here, base-contactingpositions are absolutely or highly conservedallowing related proteins to target <strong>the</strong>same sequence.The third, <strong>and</strong> most interesting, classcompr<strong>is</strong>es families in which binding <strong>is</strong> alsospecific but different members bind d<strong>is</strong>tinctbase sequences. Here protein residuesundergo frequent mutations, <strong>and</strong> familymembers can be divided into subfamiliesaccording to <strong>the</strong> amino acid sequencesat base-contacting positions; those in <strong>the</strong>same subfamily are predicted to bind <strong>the</strong>same DNA sequence <strong>and</strong> those <strong>of</strong> differentsubfamilies to bind d<strong>is</strong>tinct sequences. On<strong>the</strong> whole, <strong>the</strong> subfamilies correspondedwell with <strong>the</strong> proteins’ functions <strong>and</strong> members<strong>of</strong> <strong>the</strong> same subfamilies were found toregulate similar transcription pathways.The combined analys<strong>is</strong> <strong>of</strong> sequence <strong>and</strong>structural data described by th<strong>is</strong> study providedan insight into how homologousDNA-binding scaffolds achieve differentspecificities by altering <strong>the</strong>ir amino acidsequences. In doing so, proteins evolvedd<strong>is</strong>tinct functions, <strong>the</strong>refore allowingstructurally related transcription factors toregulate expression <strong>of</strong> different genes.Therefore, <strong>the</strong> relative abundance <strong>of</strong> transcriptionregulatory families in a genomedepends, not only on <strong>the</strong> importance <strong>of</strong> aparticular protein function, but also in <strong>the</strong>adaptability <strong>of</strong> <strong>the</strong> DNA-binding motifs torecogn<strong>is</strong>e d<strong>is</strong>tinct nucleotide sequences.Th<strong>is</strong>, in turn, appears to be best accommodatedby simple binding motifs, such as <strong>the</strong>zinc fingers.Given <strong>the</strong> knowledge <strong>of</strong> <strong>the</strong> transcriptionregulators that are contained in eachorgan<strong>is</strong>m, <strong>and</strong> an underst<strong>and</strong>ing <strong>of</strong> how<strong>the</strong>y recogn<strong>is</strong>e DNA sequences, it <strong>is</strong> <strong>of</strong>interest to search for <strong>the</strong>ir potential bindingsites within genome sequences [79].For prokaryotes, most analyses have involvedcompiling data on experimentallyknown binding sites for particular proteins<strong>and</strong> building a consensus sequence that incorporatesany variations in nucleotides.Additional sites are found by conductingword-matching searches over <strong>the</strong> entiregenome <strong>and</strong> scoring c<strong>and</strong>idate sites bysimilarity [80-83]. Unsurpr<strong>is</strong>ingly, most <strong>of</strong><strong>the</strong> predicted sites are found in non-codingregions <strong>of</strong> <strong>the</strong> DNA [80] <strong>and</strong> <strong>the</strong> results <strong>of</strong><strong>the</strong> studies are <strong>of</strong>ten presented in databasessuch as RegulonDB [73]. The consensussearch approach <strong>is</strong> <strong>of</strong>ten complemented bycomparative genomic studies searchingupstream regions <strong>of</strong> orthologous genes inclosely related organ<strong>is</strong>ms. Through such anapproach, it was found that at least 27% <strong>of</strong>known E. coli DNA-regulatory motifs areconserved in one or more d<strong>is</strong>tantly relatedbacteria [84].The detection <strong>of</strong> regulatory sites ineukaryotes poses a more difficult problembecause consensus sequences tend to bemuch shorter, variable, <strong>and</strong> d<strong>is</strong>persed oververy large d<strong>is</strong>tances. However, initial studiesin S. cerev<strong>is</strong>iae provided an interestingobservation for <strong>the</strong> GATA protein in nitrogenmetabol<strong>is</strong>m regulation. While <strong>the</strong> 5base-pair GATA consensus sequence <strong>is</strong>found almost everywhere in <strong>the</strong> genome, asingle <strong>is</strong>olated binding site <strong>is</strong> insufficient toexert <strong>the</strong> regulatory function [85]. Thereforespecificity <strong>of</strong> GATA activity comesfrom <strong>the</strong> repetition <strong>of</strong> <strong>the</strong> consensus sequencewithin <strong>the</strong> upstream regions <strong>of</strong> controlledgenes in multiple copies. An initialstudy has used th<strong>is</strong> observation to predictnew regulatory sites by searching for overrepresentedoligonucleotides in non-codingregions <strong>of</strong> yeast <strong>and</strong> worm genomes [86,87].Having detected <strong>the</strong> regulatory bindingsites, <strong>the</strong>re <strong>is</strong> <strong>the</strong> problem <strong>of</strong> defining <strong>the</strong>genes that are actually regulated, commonlytermed regulons. Generally, binding sitesare assumed to be located directly upstream<strong>of</strong> <strong>the</strong> regulons; however <strong>the</strong>re are differentproblems associated with th<strong>is</strong> assumptiondepending on <strong>the</strong> organ<strong>is</strong>m. For prokaryotes,it <strong>is</strong> complicated by <strong>the</strong> presence <strong>of</strong>operons; it <strong>is</strong> difficult to locate <strong>the</strong> regulatedgene within an operon since it can lieseveral genes downstream <strong>of</strong> <strong>the</strong> regulatorysequence. It <strong>is</strong> <strong>of</strong>ten difficult to predict <strong>the</strong>organ<strong>is</strong>ation <strong>of</strong> operons [88], especially todefine <strong>the</strong> gene that <strong>is</strong> found at <strong>the</strong> head,<strong>and</strong> <strong>the</strong>re <strong>is</strong> <strong>of</strong>ten a lack <strong>of</strong> long-range conservationin gene order between relatedorgan<strong>is</strong>ms [89]. The problem in eukaryotes<strong>is</strong> even more severe; regulatory sites <strong>of</strong>tenact in both directions, binding sites areusually d<strong>is</strong>tant from regulons because <strong>of</strong>large intergenic regions, <strong>and</strong> transcriptionregulation <strong>is</strong> usually a result <strong>of</strong> combinedaction by multiple transcription factors in acombinatorial manner.Despite <strong>the</strong>se problems, <strong>the</strong>se studieshave succeeded in confirming <strong>the</strong> transcriptionregulatory pathways <strong>of</strong> well-characterizedsystems such as <strong>the</strong> heat shock responseMethod Inform Med 4/2001


354Luscombe, Greenbaum, Gersteinsystem [83]. In addition, it <strong>is</strong> feasible toexperimentally verify any predictions, mostnotably using gene expression data.6.3 Gene Expression StudiesMany expression studies have so farfocused on dev<strong>is</strong>ing methods to clustergenes by similarities in expression pr<strong>of</strong>iles.Th<strong>is</strong> <strong>is</strong> in order to determine <strong>the</strong> proteinsthat are expressed toge<strong>the</strong>r under differentcellular conditions. Briefly, <strong>the</strong> most commonmethods are hierarchical clustering,self-organ<strong>is</strong>ing maps, <strong>and</strong> K-means clustering.Hierarchical methods originally derivedfrom algorithms to construct phylogenetictrees, <strong>and</strong> group genes in a “bottomup”fashion; genes with <strong>the</strong> most similarexpression pr<strong>of</strong>iles are clustered first, <strong>and</strong>those with more diverse pr<strong>of</strong>iles areincluded iteratively [90-92]. In contrast, <strong>the</strong>self-organ<strong>is</strong>ing map [93, 94] <strong>and</strong> K-meansmethods [95, 96] employ a “top-down”approach in which <strong>the</strong> user pre-defines <strong>the</strong>number <strong>of</strong> clusters for <strong>the</strong> dataset. Theclusters are initially assigned r<strong>and</strong>omly, <strong>and</strong><strong>the</strong> genes are regrouped iteratively until<strong>the</strong>y are optimally clustered.Given <strong>the</strong>se methods, it <strong>is</strong> <strong>of</strong> interest torelate <strong>the</strong> expression data to o<strong>the</strong>r attributessuch as structure, function <strong>and</strong> subcellularlocal<strong>is</strong>ation <strong>of</strong> each gene product.Mapping <strong>the</strong>se properties provides aninsight into <strong>the</strong> character<strong>is</strong>tics <strong>of</strong> proteinsthat are expressed toge<strong>the</strong>r, <strong>and</strong> also suggestsome interesting conclusions about <strong>the</strong>overall biochem<strong>is</strong>try <strong>of</strong> <strong>the</strong> cell. In yeast,shorter proteins tend to be more highlyexpressed than longer proteins, probablybecause <strong>of</strong> <strong>the</strong> relative ease with which <strong>the</strong>yare produced [97]. Looking at <strong>the</strong> aminoacid content, highly expressed genes aregenerally enriched in alanine <strong>and</strong> glycine,<strong>and</strong> depleted in asparagine; <strong>the</strong>se arethought to reflect <strong>the</strong> requirements <strong>of</strong>amino acid usage in <strong>the</strong> organ<strong>is</strong>m, wheresyn<strong>the</strong>s<strong>is</strong> <strong>of</strong> alanine <strong>and</strong> glycine are energeticallyless expensive than asparagine.Turningto protein structure, expression levels <strong>of</strong><strong>the</strong> TIM barrel <strong>and</strong> NTP hydrolase foldsare highest, while those for <strong>the</strong> leucinezipper, zinc finger <strong>and</strong> transmembranehelix-containing folds are lowest. Th<strong>is</strong>Method Inform Med 4/2001relates to <strong>the</strong> functions associated with<strong>the</strong>se folds; <strong>the</strong> former are commonly involvedin metabolic pathways <strong>and</strong> <strong>the</strong> latterin signalling or transport processes [98].Th<strong>is</strong> <strong>is</strong> also reflected in <strong>the</strong> relationshipwith subcellular local<strong>is</strong>ations <strong>of</strong> proteins,where expression <strong>of</strong> cytoplasmic proteins <strong>is</strong>high, but nuclear <strong>and</strong> membrane proteinstend to be low [99, 100].More complex relationships have alsobeen assessed. Conventional w<strong>is</strong>dom <strong>is</strong> thatgene products that interact with each o<strong>the</strong>rare more likely to have similar expressionpr<strong>of</strong>iles than if <strong>the</strong>y do not [101, 102]. However,a recent study showed that th<strong>is</strong> relationship<strong>is</strong> not so simple [103]. While expressionpr<strong>of</strong>iles are similar for gene productsthat are permanently associated, forexample in <strong>the</strong> large ribosomal subunit,pr<strong>of</strong>iles differ significantly for productsthat are only associated transiently, includingthose belonging to <strong>the</strong> same metabolicpathway.As described below, one <strong>of</strong> <strong>the</strong> maindriving forces behind expression analys<strong>is</strong>has been to analyse cancerous cell lines[104]. In general, it has been shown that differentcell lines (eg epi<strong>the</strong>lial <strong>and</strong> ovariancells) can be d<strong>is</strong>tingu<strong>is</strong>hed on <strong>the</strong> bas<strong>is</strong> <strong>of</strong><strong>the</strong>ir expression pr<strong>of</strong>iles, <strong>and</strong> that <strong>the</strong>sepr<strong>of</strong>iles are maintained when cells aretransferred from an in vivo to an in vitroenvironment [105]. The bas<strong>is</strong> for <strong>the</strong>ir physiologicaldifferences were apparent in <strong>the</strong>expression <strong>of</strong> specific genes; for example,expression levels <strong>of</strong> gene products necessaryfor progression through <strong>the</strong> cell cycle,especially ribosomal genes, correlated wellwith variations in cell proliferation rate.Comparative analys<strong>is</strong> can be extended totumour cells, in which <strong>the</strong> underlyingcauses <strong>of</strong> cancer can be uncovered bypinpointing areas <strong>of</strong> biological variationscompared to normal cells. For example inbreast cancer, genes related to cell proliferation<strong>and</strong> <strong>the</strong> IFN-regulated signal transductionpathway were found to be upregulated[25, 106]. One <strong>of</strong> <strong>the</strong> difficulties incancer treatment has been to target specific<strong>the</strong>rapies to pathogenetically d<strong>is</strong>tinct tumourtypes, in order to maxim<strong>is</strong>e efficacy<strong>and</strong> minim<strong>is</strong>e toxicity. Thus, improvementsin cancer classifications have been centralto advances in cancer treatment. Although<strong>the</strong> d<strong>is</strong>tinction between different forms <strong>of</strong>cancer – for example subclasses <strong>of</strong> acuteleukaemia – has been well establ<strong>is</strong>hed, it <strong>is</strong>still not possible to establ<strong>is</strong>h a clinical diagnos<strong>is</strong>on <strong>the</strong> bas<strong>is</strong> <strong>of</strong> a single test. In arecent study, acute myeloid leukaemia <strong>and</strong>acute lymphoblastic leukaemia were successfullyd<strong>is</strong>tingu<strong>is</strong>hed based on <strong>the</strong> expressionpr<strong>of</strong>iles <strong>of</strong> <strong>the</strong>se cells [26]. As <strong>the</strong>approach does not require prior biologicalknowledge <strong>of</strong> <strong>the</strong> d<strong>is</strong>eases, it may provide ageneric strategy for classifying all types <strong>of</strong>cancer.Clearly, an essential aspect <strong>of</strong> underst<strong>and</strong>ingexpression data lies in underst<strong>and</strong>ing<strong>the</strong> bas<strong>is</strong> <strong>of</strong> transcription regulation.However, analys<strong>is</strong> in th<strong>is</strong> area <strong>is</strong> still limitedto preliminary analyses <strong>of</strong> expression levelsin yeast mutants lacking key components <strong>of</strong><strong>the</strong> transcription initiation complex [19,107].7 “… many PRACTICALAPPLICATIONS…”Here, we describe some <strong>of</strong> <strong>the</strong> major uses<strong>of</strong> bioinformatics.7.1 Finding HomologuesAs described earlier, one <strong>of</strong> <strong>the</strong> drivingforces behind bioinformatics <strong>is</strong> <strong>the</strong> searchfor similarities between different biomolecules.Apartfrom enabling systematic organ<strong>is</strong>ation<strong>of</strong> data, identification <strong>of</strong> proteinhomologues has some direct practical uses.The most obvious <strong>is</strong> transferring informationbetween related proteins. For example,given a poorly character<strong>is</strong>ed protein, it <strong>is</strong>possible to search for homologues that arebetter understood <strong>and</strong> with caution, applysome <strong>of</strong> <strong>the</strong> knowledge <strong>of</strong> <strong>the</strong> latter to <strong>the</strong>former. Specifically with structural data,<strong>the</strong>oretical models <strong>of</strong> proteins are usuallybased on experimentally solved structures<strong>of</strong> close homologues [108]. Similar techniquesare used in fold recognition in whichtertiary structure predictions depend onfinding structures <strong>of</strong> remote homologues<strong>and</strong> checking whe<strong>the</strong>r <strong>the</strong> prediction <strong>is</strong>energetically viable [109]. Where biochemicalor structural data are lacking, studies


355<strong>What</strong> <strong>is</strong> <strong>Bioinformatics</strong>?Fig. 3 Above <strong>is</strong> a schematic outlining how scient<strong>is</strong>ts can use bioinformatics to aid rational drug d<strong>is</strong>covery. MLH1 <strong>is</strong> a human gene encoding a m<strong>is</strong>match repair protein (mmr) situated on <strong>the</strong>short arm <strong>of</strong> chromosome 3. Through linkage analys<strong>is</strong> <strong>and</strong> its similarity to mmr genes in mice, <strong>the</strong> gene has been implicated in nonpolypos<strong>is</strong> colorectal cancer. Given <strong>the</strong> nucleotide sequence,<strong>the</strong> probable amino acid sequence <strong>of</strong> <strong>the</strong> encoded protein can be determined using translation s<strong>of</strong>tware. Sequence search techniques can be used to find homologues in model organ<strong>is</strong>ms, <strong>and</strong>based on sequence similarity, it <strong>is</strong> possible to model <strong>the</strong> structure <strong>of</strong> <strong>the</strong> human protein on experimentally character<strong>is</strong>ed structures. Finally, docking algorithms could design molecules thatcould bind <strong>the</strong> model structure, leading <strong>the</strong> way for biochemical assays to test <strong>the</strong>ir biological activity on <strong>the</strong> actual protein.could be made in low-level organ<strong>is</strong>ms likeyeast <strong>and</strong> <strong>the</strong> results applied to homologuesin higher-level organ<strong>is</strong>ms such ashumans, where experiments are moredem<strong>and</strong>ing.An equivalent approach <strong>is</strong> also employedin genomics. Homologue-finding <strong>is</strong> extensivelyused to confirm coding regions innewly sequenced genomes <strong>and</strong> functionaldata <strong>is</strong> frequently transferred to annotateindividual genes. On a larger scale, it alsosimplifies <strong>the</strong> problem <strong>of</strong> underst<strong>and</strong>ingcomplex genomes by analysing simpleorgan<strong>is</strong>ms first <strong>and</strong> <strong>the</strong>n applying <strong>the</strong>same principles to more complicated ones –th<strong>is</strong> <strong>is</strong> one reason why early structuralgenomics projects focused on Mycoplasmagenitalium [75].Ironically, <strong>the</strong> same idea can be appliedin reverse. Potential drug targets are quicklyd<strong>is</strong>covered by checking whe<strong>the</strong>r homologues<strong>of</strong> essential microbial proteins arem<strong>is</strong>sing in humans. On a smaller scale,structural differences between similar proteinsmay be harnessed to design drugmolecules that specifically bind to onestructure but not ano<strong>the</strong>r.7.2 Rational Drug DesignOne <strong>of</strong> <strong>the</strong> earliest medical applications <strong>of</strong>bioinformatics has been in aiding rationalMethod Inform Med 4/2001


356Luscombe, Greenbaum, Gersteindrug design. Fig. 3 outlines <strong>the</strong> commonlycited approach, taking <strong>the</strong> MLH1 gene productas an example drug target. MLH1 <strong>is</strong> ahuman gene encoding a m<strong>is</strong>match repairprotein (mmr) situated on <strong>the</strong> short arm <strong>of</strong>chromosome 3 [110]. Through linkage analys<strong>is</strong><strong>and</strong> its similarity to mmr genes in mice,<strong>the</strong> gene has been implicated in nonpolypos<strong>is</strong>colorectal cancer [111]. Given <strong>the</strong>nucleotide sequence, <strong>the</strong> probable aminoacid sequence <strong>of</strong> <strong>the</strong> encoded protein canbe determined using translation s<strong>of</strong>tware.Sequence search techniques can <strong>the</strong>n beused to find homologues in model organ<strong>is</strong>ms,<strong>and</strong> based on sequence similarity, it<strong>is</strong> possible to model <strong>the</strong> structure <strong>of</strong> <strong>the</strong>human protein on experimentally characterizedstructures. Finally, docking algorithmscould design molecules that could bind <strong>the</strong>model structure, leading <strong>the</strong> way for biochemicalassays to test <strong>the</strong>ir biologicalactivity on <strong>the</strong> actual protein.7.3 Large-scale CensusesAlthough databases can efficiently store all<strong>the</strong> information related to genomes, structures<strong>and</strong> expression datasets, it <strong>is</strong> useful tocondense all th<strong>is</strong> information into underst<strong>and</strong>abletrends <strong>and</strong> facts that users canreadily underst<strong>and</strong>. Broad general<strong>is</strong>ationshelp identify interesting subject areas forfur<strong>the</strong>r detailed analys<strong>is</strong>, <strong>and</strong> place new observationsin a proper context. Th<strong>is</strong> enablesone to see whe<strong>the</strong>r <strong>the</strong>y are unusual in anyway.Through <strong>the</strong>se large-scale censuses, onecan address a number <strong>of</strong> evolutionary, biochemical<strong>and</strong> biophysical questions. Forexample, are specific protein folds associatedwith certain phylogenetic groups? Howcommon are different folds within particularorgan<strong>is</strong>ms? And to what degree arefolds shared between related organ<strong>is</strong>ms?Does th<strong>is</strong> extent <strong>of</strong> sharing parallel measures<strong>of</strong> relatedness derived from traditionalevolutionary trees? Initial studies showthat <strong>the</strong> frequency <strong>of</strong> folds differs greatlybetween organ<strong>is</strong>ms <strong>and</strong> that <strong>the</strong> sharing <strong>of</strong>folds between organ<strong>is</strong>ms does in fact followtraditional phylogenetic classifications[37, 112, 113]. We can also integrate data onprotein functions; given that <strong>the</strong> particularMethod Inform Med 4/2001protein folds are <strong>of</strong>ten related to specificbiochemical functions [52, 53], <strong>the</strong>se findingshighlight <strong>the</strong> diversity <strong>of</strong> metabolicpathways in different organ<strong>is</strong>ms [36, 89].As we d<strong>is</strong>cussed earlier, one <strong>of</strong> <strong>the</strong> mostexciting new sources <strong>of</strong> genomic information<strong>is</strong> <strong>the</strong> expression data. Combining expressioninformation with structural <strong>and</strong> functionalclassifications <strong>of</strong> proteins we can askwhe<strong>the</strong>r <strong>the</strong> high occurrence <strong>of</strong> a proteinfold in a genome <strong>is</strong> indicative <strong>of</strong> high expressionlevels [97]. Fur<strong>the</strong>r genomic scaledata that we can consider in large-scale surveysinclude <strong>the</strong> subcellular local<strong>is</strong>ations <strong>of</strong>proteins <strong>and</strong> <strong>the</strong>ir interactions with eacho<strong>the</strong>r [114-116]. In conjunction with structuraldata, we can <strong>the</strong>n begin to compile amap <strong>of</strong> all protein-protein interactions inan organ<strong>is</strong>m.8. ConclusionsWith <strong>the</strong> current deluge <strong>of</strong> data, computationalmethods have become ind<strong>is</strong>pensableto biological investigations. Originallydeveloped for <strong>the</strong> analys<strong>is</strong> <strong>of</strong> biological sequences,bioinformatics now encompassesa wide range <strong>of</strong> subject areas includingstructural biology, genomics <strong>and</strong> gene expressionstudies. In th<strong>is</strong> review, we providedan introduction <strong>and</strong> overview <strong>of</strong> <strong>the</strong> currentstate <strong>of</strong> field. In particular, we d<strong>is</strong>cussed<strong>the</strong> types <strong>of</strong> biological information <strong>and</strong>databases that are commonly used, examinedsome <strong>of</strong> <strong>the</strong> studies that are being conducted– with reference to transcriptionregulatory systems – <strong>and</strong> finally looked atseveral practical applications <strong>of</strong> <strong>the</strong> field.Two principal approaches underpin all studiesin bioinformatics. First <strong>is</strong> that <strong>of</strong> comparing<strong>and</strong> grouping <strong>the</strong> data according tobiologically meaningful similarities <strong>and</strong> second,that <strong>of</strong> analysing one type <strong>of</strong> data toinfer <strong>and</strong> underst<strong>and</strong> <strong>the</strong> observations forano<strong>the</strong>r type <strong>of</strong> data. These approaches arereflected in <strong>the</strong> main aims <strong>of</strong> <strong>the</strong> field,which are to underst<strong>and</strong> <strong>and</strong> organ<strong>is</strong>e <strong>the</strong>information associated with biologicalmolecules on a large scale. As a result,bioinformatics has not only provided greaterdepth to biological investigations, butadded <strong>the</strong> dimension <strong>of</strong> breadth as well. Inth<strong>is</strong> way, we are able to examine individualsystems in detail <strong>and</strong> also compare <strong>the</strong>mwith those that are related in order to uncovercommon principles that apply acrossmany systems <strong>and</strong> highlight unusual featuresthat are unique to some.AcknowledgmentsWe thank Patrick McGarvey for comments on<strong>the</strong> manuscript.References1. Reichhardt T. It’s sink or swim as a tidal wave<strong>of</strong> data approaches. Nature 1999. 399 (6736):517-20.2. Benson DA, et al. GenBank. Nucleic AcidsRes 2000; 28 (1): 15-8.3. Bairoch A, Apweiler R. The SWISS-PROTprotein sequence database <strong>and</strong> its supplementTrEMBL in 2000. Nucleic Acids Res 2000; 28(1): 45-8.4. Fle<strong>is</strong>chmann RD, et al. Whole-genome r<strong>and</strong>omsequencing <strong>and</strong> assembly <strong>of</strong> Haemophilusinfluenzae Rd. Science 1995; 269(5223): 496-512.5. Drowning in data. The Econom<strong>is</strong>t 1999 (26June 1999).6. Bernstein FC, et al. The Protein Data Bank. Acomputer-based archival file for macromolecularstructures. Eur J Biochem 1977; 80 (2):319-24.7. Berman HM, et al. The Protein Data Bank.Nucleic Acids Res 2000; 28 (1): 235-42.8. Pearson WR, Lipman DJ. Improved tools forbiological sequence compar<strong>is</strong>on. Proc NatlAcad Sci USA 1988; 85 (8): 2444-8.9. Altschul SF, et al. Gapped BLAST <strong>and</strong> PSI-BLAST: a new generation <strong>of</strong> protein databasesearch programs. Nucleic Acids Res 1997; 25(17): 3389-402.10. Fle<strong>is</strong>chmann RD, et al. Whole-genome r<strong>and</strong>omsequencing <strong>and</strong> assembly <strong>of</strong> Haemophilusinfluenzae Rd. Science 1995; 269(5223): 496-512.11. L<strong>and</strong>er ES, et al. Initial sequencing <strong>and</strong> analys<strong>is</strong><strong>of</strong> <strong>the</strong> human genome. Nature 2001; 409:860-921.12. Venter JC, et al. The sequence <strong>of</strong> <strong>the</strong> humangenome. Science 2001; 291 (5507): 1304-51.13. Tatusova TA, Karsch-Mizrachi I, Ostell JA.Complete genomes in WWW Entrez: datarepresentation <strong>and</strong> analys<strong>is</strong>. <strong>Bioinformatics</strong>1999; 15 (7-8): 536-43.14. E<strong>is</strong>en MB, Brown PO. DNA arrays for analys<strong>is</strong><strong>of</strong> gene expression. Methods Enzymol,1999; 303: 179-205.15. Cheung VG, et al. Making <strong>and</strong> reading microarrays.Nat Genet 1999; 21 (1 Suppl): 15-9.16. Duggan DJ, et al. Expression pr<strong>of</strong>iling usingcDNA microarrays. Nat Genet 1999. 21(1 Suppl): 10-4.17. Lipshutz RJ, et al. High density syn<strong>the</strong>ticoligonucleotide arrays. Nat Genet 1999; 21 (1):20-4.


357<strong>What</strong> <strong>is</strong> <strong>Bioinformatics</strong>?18. Velculescu VE, et al. Serial Analys<strong>is</strong> <strong>of</strong> GeneExpression. Detailed Protocol 1999.19. Holstege FC, et al. D<strong>is</strong>secting <strong>the</strong> regulatorycircuitry <strong>of</strong> a eukaryotic genome. Cell 1998; 95(5): 717-28.20. Roth FP, Estep PW, Church GM. FindingDNA regulatory motifs within unaligned noncodingsequences clustered by whole-genomemRNA quantitation. Nat Biotech 1998; 16(10): 939-45.21. Jelinsky SA, Samson LD. Global response <strong>of</strong>Saccharomyces cerev<strong>is</strong>iae to an alkylatingagent. Proc Natl Acad Sci USA 1999; 96 (4):1486-91.22. Cho RJ, et al. A genome-wide transcriptionalanalys<strong>is</strong> <strong>of</strong> <strong>the</strong> mitotic cell cycle. Mol Cell1998; 2 (1): 65-73.23. DeR<strong>is</strong>i JL, Iyer VR, Brown PO. Exploring <strong>the</strong>metabolic <strong>and</strong> genetic control <strong>of</strong> gene expressionon a genomic scale. Science 1997; 278(5338): 680-6.24. Winzeler EA, et al. Functional characterization<strong>of</strong> <strong>the</strong> S. cerev<strong>is</strong>iae genome by genedeletion <strong>and</strong> parallel analys<strong>is</strong>. Science 1999;285 (5429): 901-6.25. Perou CM, et al. Molecular portraits <strong>of</strong> humanbreast tumours. Nature 2000; 406 (6797):747-52.26. Golub TR, et al. Molecular classification <strong>of</strong>cancer: class d<strong>is</strong>covery <strong>and</strong> class prediction bygene expression monitoring. Science 1999; 286(5439): 531-7.27. Pedersendagger AG, et al. A DNA structuralatlas for Escherichia coli. J Mol Biol 2000; 299(4): 907-30.28. Kaneh<strong>is</strong>a M; Goto S. KEGG: kyoto encyclopedia<strong>of</strong> genes <strong>and</strong> genomes. Nucleic AcidsRes 2000; 28 (1): 27-30.29. Jeffery CJ. Moonlighting proteins. TIBS 1999;24 (1): 8-11.30. Chothia, C. Proteins. One thous<strong>and</strong> familiesfor <strong>the</strong> molecular biolog<strong>is</strong>t. Nature 1992; 357(6379): 543-4.31. Orengo CA, Jones DT, Thornton JM. Proteinsuperfamilies <strong>and</strong> domain superfolds. Nature1994; 372 (6507): 631-4.32. Lesk AM, Chothia C. How different aminoacid sequences determine similar proteinstructures: <strong>the</strong> structure <strong>and</strong> evolutionarydynamics <strong>of</strong> <strong>the</strong> globins. J Mol Biol 1980; 136(3): 225-70.33. Russell RB, et al. Recognition <strong>of</strong> analogous<strong>and</strong> homologous protein folds: analys<strong>is</strong> <strong>of</strong>sequence <strong>and</strong> structure conservation. J MolBiol 1997; 269 (3): 423-39.34. Russell RB, et al. Recognition <strong>of</strong> analogous<strong>and</strong> homologous protein folds – assessment <strong>of</strong>prediction success <strong>and</strong> associated alignmentaccuracy using empirical substitution matrices.Protein Eng 1998; 11 (1): 1-9.35. Fitch WM. D<strong>is</strong>tingu<strong>is</strong>hing homologous fromanalogous proteins. Syst Zool 1970; 19: 99-110.36. Tatusov RL, Koonin EV, Lipman DJ. A genomicperspective on protein families. Science1997; 278 (5338): 631-7.37. Gerstein M, Hegyi H. Comparing genomes interms <strong>of</strong> protein structure: surveys <strong>of</strong> a finiteparts l<strong>is</strong>t. FEMS Microbiol Rev 1998; 22 (4):277-304.38. Skolnick J, Fetrow JS. From genes to proteinstructure <strong>and</strong> function: novel applications <strong>of</strong>computational approaches in <strong>the</strong> genomic era.Trends Biotech 2000; 18: 34-9.39. Qian J, et al. PartsL<strong>is</strong>t: a web-based system fordynamically ranking protein folds based ond<strong>is</strong>parate attributes, including whole-genomeexpression <strong>and</strong> interaction information.Nucleic Acids Res 2001; 29 (8): 1750-64.40. Gerstein M. Integrative database analys<strong>is</strong> instructural genomics. Nat Struct Biol 2000; 7Suppl: 960-3.41. Etzold T, Ulyanov A, Argos P. SRS: informationretrieval system for molecular biology databanks. Methods Enzymol 1996; 266: 114-28.42. Schuler GD, et al. Entrez: molecular biologydatabase <strong>and</strong> retrieval system. MethodsEnzymol 1996; 266: 141-62.43. Wade K. Searching Entrez PubMed <strong>and</strong>uncover on <strong>the</strong> internet. Aviat Space EnvironMed 2000; 71 (5): 559.44. Bertone P, et al. SPINE: An integratedtracking database <strong>and</strong> datamining approachfor high-throughput structural proteomics,enabling <strong>the</strong> determination <strong>of</strong> <strong>the</strong> properties<strong>of</strong> readily characterized proteins. NucleicAcids Res. In Press.45. Zhang MQ. Promoter analys<strong>is</strong> <strong>of</strong> co-regulatedgenes in <strong>the</strong> yeast genome. Comput Chem1999; 23 (3-4): 233-50.46. Boguski MS. Biosequence exeges<strong>is</strong>. Science1999; 286 (5439): 453-5.47. Miller C, Gurd J, Brass A. A RAPIDalgorithm for sequence database compar<strong>is</strong>ons:application to <strong>the</strong> identification <strong>of</strong> vectorcontamination in <strong>the</strong> EMBL databases. <strong>Bioinformatics</strong>1999; 15 (2): 111-21.48. Gonnet GH, Korostensky C, Brenner S.Evaluation measures <strong>of</strong> multiple sequencealignments. J Comput Biol 2000; 7 (1-2):261-76.49. Orengo CA, Taylor WR. SSAP: sequentialstructure alignment program for protein structurecompar<strong>is</strong>on. Methods Enzymol 1996; 266:617-35.50. Orengo CA. CORA – topological fingerprintsfor protein structural families. Protein Sci1999; 8 (4): 699-715.51. Russell RB, Sternberg MJ. Structure prediction.How good are we? Curr Biol 1995; 5 (5):488-90.52. Martin AC, et al. Protein folds <strong>and</strong> functions.Structure 1998; 6 (7): 875-84.53. Hegyi H, Gerstein M. The relationship betweenprotein structure <strong>and</strong> function: a comprehensivesurvey with application to <strong>the</strong>yeast genome. J Mol Biol 1999; 288 (1): 147-64.54. Russell RB, Sasieni PD, Sternberg MJE.Supersites within superfolds. Binding sitesimilarity in <strong>the</strong> absence <strong>of</strong> homology. J MolBiol 1998; 282 (4): 903-18.55. Wilson CA, Kreychman J, Gerstein M. Assessingannotation transfer for genomics:quantifying <strong>the</strong> relations between proteinsequence, structure <strong>and</strong> function throughtraditional <strong>and</strong> probabil<strong>is</strong>tic scores. J Mol Biol2000; 297 (1): 233-49.56. Harr<strong>is</strong>on SC. A structural taxonomy <strong>of</strong> DNAbindingdomains. Nature 1991; 353 (6346):715-9.57. Luscombe NM, et al.An overview <strong>of</strong> <strong>the</strong> structures<strong>of</strong> protein-DNA complexes. GenomeBiology 2000; 1 (1): 1-37.58. Jones S, et al. Protein-DNA interactions: Astructural analys<strong>is</strong>. J Mol Biol 1999; 287 (5):877-96.59. Suzuki M, Gerstein M. Binding geometry <strong>of</strong>alpha-helices that recognize DNA. Proteins1995; 23 (4): 525-35.60. Luscombe NM, Thornton JM. Protein-DNAinteractions: a 3D analys<strong>is</strong> <strong>of</strong> alpha-helixbindingin <strong>the</strong> major groove. Manuscript inpreparation.61. Suzuki M, et al. DNA recognition code <strong>of</strong>transcription factors. Protein Eng 1995; 8 (4):319-28.62. Suzuki M. DNA recognition by a -sheet.Protein Eng 1995; 8 (1): 1-4.63. Seeman NC, Rosenberg JM, Rich A. Sequencespecific recognition <strong>of</strong> double helical nucleicacids by proteins. Proc Natl Acad Sci USA1976; 73: 804-8.64. Suzuki M. A framework for <strong>the</strong> DNA-proteinrecognition code <strong>of</strong> <strong>the</strong> probe helix intranscription factors: <strong>the</strong> chemical <strong>and</strong> stereochemicalrules. Structure 1994; 2 (4): 317-26.65. M<strong>and</strong>el-Gutfreund Y, Schueler O, Margalit H.Comprehensive analys<strong>is</strong> <strong>of</strong> hydrogen bonds inregulatory protein-DNA complexes: in search<strong>of</strong> common principles. J Mol Biol 1995; 253(2): 370-82.66. Luscombe NM, Laskowski RA, Thornton JM.Protein-DNA interactions: a 3D analys<strong>is</strong> <strong>of</strong>amino acid-base interactions. Nucleic AcidsRes. In Press.67. M<strong>and</strong>el-Gutfreund Y, Margalit H. Quantitativeparameters for amino acid-base interaction:inplications for prediction <strong>of</strong> protein-DNA binding sites. Nucleic Acids Res 1998;26: 2306-12.68. Sternberg MJ, Gabb HA, Jackson RM. Predictivedocking <strong>of</strong> protein-protein <strong>and</strong> protein-DNA complexes. Curr Opin Struct Biol 1998;8 (2): 250-6.69. Aloy P, et al. Modelling repressor proteinsdocking to DNA. Proteins 1998; 33 (4):535-49.70. Dickerson RE. DNA-binding: <strong>the</strong> prevalence<strong>of</strong> kinkiness <strong>and</strong> <strong>the</strong> virtues <strong>of</strong> normality.Nucleic Acids Res 1998; 26 (8): 1906-26.71. Perez-Rueda E, Collado-Vides J. The repertoire<strong>of</strong> DNA-binding transcriptional regulatorsin Escherichia coli K-12. Nucleic AcidsRes 2000; 28 (8): 1838-47.72. Mewes HW, et al. MIPS: a database for genomes<strong>and</strong> protein sequences. Nucleic Acids Res2000; 28 (1): 37-40.73. Salgado H, et al. RegulonDB (version 3.0):transcriptional regulation <strong>and</strong> operon organizationin Escherichia coli K-12. NucleicAcids Res 2000; 28 (1): 65-7.Method Inform Med 4/2001


358Luscombe, Greenbaum, Gerstein74. Wingender E, et al. TRANSFAC: an integratedsystem for gene expression regulation. NucleicAcids Res 2000; 28 (1): 316-9.75. Teichmann SA, Chothia C, Gerstein M.Advances in structural genomics. Curr OpinStruct Biol 1999; 9 (3): 390-9.76. Aravind L, Koonin EV. DNA-binding proteins<strong>and</strong> evolution <strong>of</strong> transcription regulationin <strong>the</strong> archaea. Nucleic Acids Res 1999; 27(23): 4658-70.77. Huynen MA, van Nimwegen E.The frequencyd<strong>is</strong>tribution <strong>of</strong> gene family sizes in completegenomes. Mol Biol Evol 1998; 15 (5): 583-9.78. Luscombe NM, Thornton JM. Protein-DNAinteractions: an analys<strong>is</strong> <strong>of</strong> amino acid conservation<strong>and</strong> <strong>the</strong> effect on binding specificity.Manuscript in preparation.79. Gelf<strong>and</strong> MS. Prediction <strong>of</strong> function in DNAsequence analys<strong>is</strong>. J Comp Biol 1995; 1:87-115.80. Rob<strong>is</strong>on K, McGuire AM, Church GM. Acomprehensive library <strong>of</strong> DNA-binding sitematrices for 55 proteins applied to <strong>the</strong>complete Escherichia coli K-12 genome. J MolBiol 1998; 284 (2): 241-54.81. Thieffry D, et al. Prediction <strong>of</strong> transcriptionalregulatory sites in <strong>the</strong> complete genomesequence <strong>of</strong> Escherichia coli K-12. <strong>Bioinformatics</strong>1998; 14 (5): 391-400.82. Mironov AA. et al. Computer analys<strong>is</strong> <strong>of</strong>transcription regulatory patterns in completelysequenced bacterial genomes. Nucleic AcidsRes 1999; 27 (14): 2981-9.83. Gelf<strong>and</strong> MS, Koonin EV, Mironov AA.Prediction <strong>of</strong> transcription regulatory sites inArchaea by a comparative genomic approach.Nucleic Acids Res 2000; 28 (3): 695-705.84. McGuire AM, Hughes JD, Church GM.Conservation <strong>of</strong> DNA regulatory motifs <strong>and</strong>d<strong>is</strong>covery <strong>of</strong> new motifs in microbial genomes.Genome Res 2000; 10 (6): 744-57.85. Bysani N, Daugherty JR, Cooper TG. Saturationmutagenes<strong>is</strong> <strong>of</strong> <strong>the</strong> UASNTR (GATAA)responsible for nitrogen catabolite repressionsensitivetranscriptional activation <strong>of</strong> <strong>the</strong>allantoin pathway genes in Saccharomycescerev<strong>is</strong>iae. J Bacteriol 1991; 173 (16): 4977-82.86. Clarke ND, Berg JM. Zinc fingers in Caenorhabdit<strong>is</strong>elegans: finding families <strong>and</strong> probingpathways. Science 1998; 282 (5396): 2018-22.87. van Helden J, Andre B, Collado-Vides J.Extracting regulatory sites from <strong>the</strong> upstreamregion <strong>of</strong> yeast genes by computationalanalys<strong>is</strong> <strong>of</strong> oligonucleotide frequencies. J MolBiol 1998; 281 (5): 827-42.88. Salgado H, et al. Operons in Escherichia coli:genomic analyses <strong>and</strong> predictions. Proc NatlAcad Sci USA, 2000; 97 (12): 6652-7.89. Tatusov RL, et al. Metabol<strong>is</strong>m <strong>and</strong> evolution<strong>of</strong> Haemophilus influenzae deduced from awhole-genome compar<strong>is</strong>on with Escherichiacoli. Curr Biol 1996; 6 (3): 279-91.90. E<strong>is</strong>en MB, et al. Cluster analys<strong>is</strong> <strong>and</strong> d<strong>is</strong>play<strong>of</strong> genome-wide expression patterns. ProcNatl Acad Sci USA 1998; 95 (25): 14863-8.91. Wen X, et al. Large-scale temporal gene expressionmapping <strong>of</strong> central nervous systemdevelopment. Proc Natl Acad Sci USA 1998;95 (1): 334-9.92. Alon U, et al. Broad patterns <strong>of</strong> gene expressionrevealed by clustering analys<strong>is</strong> <strong>of</strong>tumor <strong>and</strong> normal colon t<strong>is</strong>sues probed byoligonucleotide arrays. Proc Natl Acad SciUSA 1999; 96 (12): 6745-50.93. Tamayo P, et al. Interpreting patterns <strong>of</strong> geneexpression with self-organizing maps: methods<strong>and</strong> application to hematopoieticdifferentiation. Proc Natl Acad Sci USA 1999;96 (6): 2907-12.94. Toronen P, et al. Analys<strong>is</strong> <strong>of</strong> gene expressiondata using self-organizing maps. FEBS Lett1999; 451 (2): 142-6.95. Tavazoie S, et al. Systematic determination <strong>of</strong>genetic network architecture. Nat Genet 1999;22 (3): 281-5.96. Subrahmanyam YV, et al. RNA expressionpatterns change dramatically in human neutrophilsexposed to bacteria. Blood 2001; 97(8): 2457-68.97. Jansen R, Gerstein M. Analys<strong>is</strong> <strong>of</strong> <strong>the</strong> yeasttranscriptome with structural <strong>and</strong> functionalcategories: characterizing highly expressed proteins.Nucleic Acids Res 2000; 28 (6): 1481-8.98. Gerstein M, Jansen R. The current excitmentin bioinformatics, analys<strong>is</strong> <strong>of</strong> whole-genomeexpression data: how does it relate to proteinstructure <strong>and</strong> function. Curr Opin Struct Biol2000; 10: 574-84.99. Drawid A, Gerstein M. A Bayesian SystemIntegrating Expression Data with SequencePatterns for Localizing Proteins: ComprehensiveApplication to <strong>the</strong> Yeast Genome. JMol Biol 2000; 301:1059-75.100. Drawid A, Jansen R, Gerstein M. Genomwideanalys<strong>is</strong> relating expression level withprotein subcellular local<strong>is</strong>ation. Trends Genet2000; 16: 426-30.101. Marcotte EM, et al. Detecting protein function<strong>and</strong> protein-protein interactions fromgenome sequences. Science 1999; 285 (5428):751-3.102. E<strong>is</strong>enberg D, et al. Protein function in <strong>the</strong>post-genomic era. Nature 2000; 405 (6788):823-6.103. Jansen R, Greenbaum D, Gerstein M. Relatingwhole-genome expression data withprotein-protein interactions. Manuscript inpreparation.104. Marx J. DNA arrays reveal cancer in its manyforms. Science 2000; 289 (5485): 1670-2.105. Ross DT, et al. Systematic variation in geneexpression patterns in human cancer cell lines.Nat Genet 2000; 24 (3): 227-35.106. Perou CM, et al. D<strong>is</strong>tinctive gene expressionpatterns in human mammary epi<strong>the</strong>lial cells<strong>and</strong> breast cancers. Proc Natl Acad Sci USA1999; 96 (16): 9212-7.107. Livesey FJ, et al. Microarray analys<strong>is</strong> <strong>of</strong> <strong>the</strong>transcriptional network controlled by <strong>the</strong>photoreceptor homeobox gene Crx. Curr Biol2000; 10 (6): 301-10.108. Sali A, Blundell TL. Comparative proteinmodelling by sat<strong>is</strong>faction <strong>of</strong> spatial restraints.Journal <strong>of</strong> Molecular Biology 1993; 234 (3):779-815.109. Jones DT, Taylor WR, Thornton JM. A newapproach to protein fold recognition. Nature1992; 358 (6381): 86-9.110. Kok K, Naylor SL, Buys CH. Deletions <strong>of</strong> <strong>the</strong>short arm <strong>of</strong> chromosome 3 in solid tumors<strong>and</strong> <strong>the</strong> search for suppressor genes.Advancesin Cancer Research 1997; 71: 27-92.111. Syngal S, et al. Sensitivity <strong>and</strong> specificity <strong>of</strong>clinical criteria for hereditary non-polypos<strong>is</strong>colorectal cancer associated mutations inMSH2 <strong>and</strong> MLH1. Journal Med Genet 2000;37 (9): 641-5.112. Lin J, Gerstein M. Whole-genome trees basedon <strong>the</strong> occurrence <strong>of</strong> folds <strong>and</strong> orthologs:implications for comparing genomes ondifferent levels. Genome Res 2000; 10 (6):808-18.113. Harr<strong>is</strong>on PM, Echols N, Gerstein MB. Diggingfor dead genes: an analys<strong>is</strong> <strong>of</strong> <strong>the</strong> character<strong>is</strong>tics<strong>of</strong> <strong>the</strong> pseudogene population in <strong>the</strong>Caenorhabdit<strong>is</strong> elegans genome. NucleicAcids Res 2001; 29 (3): 818-30.114. Uetz P, et al. A comprehensive analys<strong>is</strong> <strong>of</strong>protein-protein interactions in Saccharomycescerev<strong>is</strong>iae. Nature 2000; 403 (6770): 623-7.115. Ross-Macdonald P, et al. Transposon mutagenes<strong>is</strong>for <strong>the</strong> analys<strong>is</strong> <strong>of</strong> protein production,function, <strong>and</strong> localization. Methods Enzymol1999; 303: 512-32.116. Mewes HW, et al. MIPS: a database forgenomes <strong>and</strong> protein sequences. NucleicAcids Res 1999; 27 (1): 44-8.Correspondence to:Mark GersteinDepartment <strong>of</strong> Molecular Biophysics <strong>and</strong> Biochem<strong>is</strong>tryYale University, 266 Whitney AvenuePO Box 208114, New Haven CT 06520-8114, USAE-Mail: mark.gerstein@yale.eduMethod Inform Med 4/2001

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!