13.07.2015 Views

The Genom of Homo sapiens.pdf

286 OVCHARENKO AND LOOTS

modulation of gene expression is achieved through the complex interaction of regulatory proteins (trans-factors) with specific DNA regions (cis-acting regulatory sequences) (Krivan and Wasserman 2001). Intensive experimental efforts over several decades have identified numerous regulatory proteins, transcription factors (TF), and their DNA-binding specificities. The sequence-specific DNA-binding activity of TFs is central to transcriptional regulation and regulatory networks. For most of the known TFs, the DNA-binding sequence motifs were found to be short (6–12 bp) and highly degenerate, and such specificities are cataloged in the TRANSFAC databases (http://www.biobase.de/) (Wingender et al. 2001; Matys et al. 2003). Pattern-recognition programs (MATCH or MatInspector) (Quandt et al. 1995) use this database to carry out the reverse-engineering task of predicting significant motif matches in DNA sequences, which could serve as an in silico strategy for detecting transcription factor binding sites (TFBSs). Due to their highly degenerate nature, TF motifs occur very frequently in short genomic intervals, and only a very small fraction of the predicted sites are biologically significant. This fact limits the use of TFBS databases for sequence-based discovery of transcriptional regulatory elements (Fickett and Wasserman 2000).

Multispecies comparative sequence analysis or phylogenetic footprinting has been suggested as a strategy to counter the large numbers of TFBS false positives derived from the analysis of a single sequence (Gumucio et al. 1996; Duret and Bucher 1997; Levy et al. 2001). In general, it is believed that TFBSs are the building blocks of gene regulation.
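To make the false-positive problem concrete, the following back-of-the-envelope sketch estimates how often a short degenerate motif would match purely by chance. It assumes random DNA with equal base frequencies and independent positions, which real genomes do not satisfy, and the consensus strings are hypothetical examples written in IUPAC codes rather than TRANSFAC entries. The function names are likewise illustrative.

```python
# Sketch: expected number of chance matches of a degenerate consensus motif
# in random DNA. Assumes independent positions with equal base frequencies
# (an idealization); the motifs in the demo are hypothetical examples.

# Number of bases permitted by each IUPAC nucleotide code.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(consensus: str) -> float:
    """Probability that a random k-mer matches the consensus."""
    prob = 1.0
    for code in consensus.upper():
        prob *= IUPAC_DEGENERACY[code] / 4.0
    return prob


def expected_chance_matches(consensus: str, interval_length: int,
                            both_strands: bool = True) -> float:
    """Expected chance matches in an interval of the given length.

    Each strand is counted separately, so a palindromic motif is
    counted once per strand at the same location.
    """
    positions = max(interval_length - len(consensus) + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * match_probability(consensus)


if __name__ == "__main__":
    for motif in ("TGACGTCA", "WGATAR", "NNCANNTGNN"):
        n = expected_chance_matches(motif, 10_000)
        print(f"{motif}: about {n:.1f} chance matches per 10 kb")
```

Under these assumptions, a 6 bp motif with two twofold-degenerate positions is expected to match about 20 times per 10 kb by chance alone, which illustrates why single-sequence predictions are dominated by false positives.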
Several studies have carefully shown that transcriptional regulatory elements are evolutionarily conserved, supporting the use of genomic comparisons for the de novo discovery of gene regulatory elements (Hardison et al. 1997; Oeltjen et al. 1997; Loots et al. 2000). In addition, in complex organisms, gene expression results from the cooperative action of many different proteins that are simultaneously required to activate and modulate it (Berman et al. 2002). Therefore, a potential avenue for improving the discovery of functional regulatory elements is to identify multiple TFBSs that are specifically clustered together (Kel et al. 1999; Zhu et al. 2002). This strategy has been implemented successfully in the analysis of regulatory regions involved in muscle-specific (Wasserman and Fickett 1998) and liver-specific (Krivan and Wasserman 2001) gene expression.

To facilitate the efficient and accurate prediction of biologically functional regulatory sequences present in large genomic intervals, we have developed a computational tool, Regulatory VISTA or rVISTA (http://rvista.dcode.org/), that enriches for functional TFBSs using evolutionary conservation (Loots et al. 2002). The rVISTA tool combines TFBS motif recognition, orthologous sequence alignments, and TFBS cluster analysis to overcome some of the limitations associated with TFBS predictions of sequences derived from a single organism. The analysis proceeds in four steps: (1) identification of TFBS matches in the individual sequences, (2) identification of locally aligned noncoding TFBSs, (3) calculation of local conservation extending upstream and downstream from each orthologous TFBS, and (4) visualization of individual or clustered noncoding TFBSs.
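Steps (1) to (3) can be illustrated with a small sketch. It is not the rVISTA implementation: the function names, the alignment representation (two equal-length strings with "-" for gaps), and the toy sequences are assumptions made for illustration. Motifs are matched as IUPAC consensus strings on the forward strand only, rather than with TRANSFAC weight matrices, and coding-sequence annotation is ignored. The 20 bp window and 80% threshold are the values this section reports for rVISTA.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_sites(aligned_seq, consensus):
    """Step 1 (simplified): (start, end) alignment columns of consensus
    matches. Gaps in the row interrupt a match; forward strand only."""
    body = "".join(IUPAC[code] for code in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")  # lookahead: overlapping hits
    return [(m.start(), m.start() + len(consensus))
            for m in pattern.finditer(aligned_seq.upper())]


def window_identity(ref, other, start, width):
    """Fraction of identical, non-gap columns in one window."""
    columns = zip(ref[start:start + width], other[start:start + width])
    return sum(a == b and a != "-" for a, b in columns) / width


def max_conservation(ref, other, site, width=20):
    """Step 3 (simplified): best identity over every window of `width`
    columns that contains the whole site."""
    start, end = site
    first = max(0, end - width)             # leftmost window covering the site
    last = min(start, len(ref) - width)     # rightmost window covering the site
    if last < first:                        # site or alignment does not fit
        return 0.0
    return max(window_identity(ref, other, s, width)
               for s in range(first, last + 1))


def conserved_sites(ref, other, consensus, width=20, threshold=0.8):
    """Steps 1-3: sites matched at the same columns in both rows (step 2,
    assumed definition of 'aligned') that pass the conservation filter."""
    shared = sorted(set(find_sites(ref, consensus))
                    & set(find_sites(other, consensus)))
    return [site for site in shared
            if max_conservation(ref, other, site, width) >= threshold]


if __name__ == "__main__":
    # Toy alignment: the site is shared, but only the left flank is conserved.
    ref = "C" * 14 + "AGATAA" + "C" * 14
    other = "C" * 14 + "AGATAA" + "G" * 14
    print(conserved_sites(ref, other, "WGATAR"))
```

Because the window is allowed to shift, a site with one conserved flank can still pass the filter, whereas a site embedded in diverged sequence on both sides is discarded even though the motif itself is present in both species.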
Alignments generated by the zPicture (http://zpicture.dcode.org/) program can be processed by rVISTA by following the rVISTA link provided in the results page. The user can also submit alignment files generated by PipMaker along with the corresponding sequence annotations (optional) to identify conserved TFBS matches present only in noncoding genomic intervals. Pre-computed matrices imported from the TRANSFAC database or user-defined consensus sequences can be used to identify TFBS motifs in the input sequences. The alignment and annotation files are used next to identify all the aligned TFBSs present in noncoding DNA and to calculate the degree of DNA conservation encompassing each TFBS. rVISTA calculates the maximum DNA conservation surrounding each aligned binding site in a dynamically shifting window, 20 bp in length, and filters out the sites present in regions that are less than 80% conserved (Fig. 2) (Loots et al. 2002).

The data generated by rVISTA are compiled into two types of outputs: (1) static data files and (2) a dynamic Web-interactive graphical user interface that maps TFBS on top of the conservation plot. The static text files include data tables with detailed statistics for all aligned and conserved TFBSs, alignment files depicting the text for each TFBS match in a different color, and files with the numerical position of each TFBS within the reference sequence. The visualization module allows the user to customize the data and graphically visualize TFBS together with the alignment conservation plot. Since regulatory regions in higher eukaryotes are represented by conglomerates of multiple TFBSs that act in concordance to directly modulate the expression patterns of the linked genes (Pilpel et al. 2001), rVISTA calculates the distance between all neighboring TFBSs and allows the user to perform customized clustering of individual or multiple unique transcription factors. One clustering module allows the user to selectively cluster two or more sites of the same TF present in regions of user-defined lengths, and a second clustering module allows the user to identify groups of multiple sites specific for different TFs, to predict DNA regions with unique regulatory signatures (Fig. 2) (Loots et al. 2002).

Recently, the ECR Browser has included an rVISTA portal that takes advantage of the available precomputed whole-genome pair-wise alignments, eliminating the need to create an alignment file prior to TFBS analysis while using the rVISTA tool. Users now have the option to browse the human genome and to perform rVISTA analysis on individual highly conserved noncoding elements by using the button, or on long genomic intervals containing several blocks of conservation by using the link. rVISTA analysis can then be conducted on any available pair-wise alignments by pushing the button. From this point on, TFBS analysis proceeds as described in Figure 2B–E, and is processed by the rVISTA software.

Annotating the noncoding portion of the human genome still remains one of the greatest challenges post-
