Comparative Genome Analysis


Key point… Homolog vs. Paralog vs. Ortholog
(Diagram labels: Ortholog, Homolog, Paralog)
• Likely mechanism of evolutionary ‘creation’ event.
• Useful distinction in extrapolating function.

Pictorial view of ortholog v. paralog
References
• Fitch, W M. 1970. Distinguishing homologous from analogous proteins. Systematic Zoology 19, no. 2 (June): 99-113.
• Fitch, Walter M. 2000. Homology: a personal view on some of the problems. Trends in Genetics 16, no. 5 (May 1): 227-231.
• Jensen, Roy. 2001. Orthologs and paralogs - we need to get it right. Genome Biology 2, no. 8: interactions1002.1-interactions1002.3.

Distant paralogs create protein families.
Rubin, G M, M D Yandell, J R Wortman, G L Gabor Miklos, C R Nelson, I K Hariharan, et al. 2000. Comparative genomics of the eukaryotes. Science (New York, N.Y.) 287, no. 5461 (March 24): 2204-15.

Recap other Key Points
• … is a good source of whole-genome analysis, esp. from a comparative point of view.
• BLINK from NCBI is a good tool for looking at similar pairs of genes (although the different forms of BLAST still have some good uses).
• The best understood/annotated part of the genome is its protein-coding potential; the other features that control when and where that protein or RNA is made are less understood.

BLINK, BLAST with database
• A path there… Go to NCBI Entrez Gene, go to a RefSeq protein of interest; in the right corner there is a “BLINK” hyperlink. The BLINK hyperlink is found in protein sequence records.
• Combines BLAST with database information…
• (Still, there are some special cases where the BLAST that you covered before is quite useful.)

• gi|4506103 eukaryotic translation initiation factor 2-alpha kinase 2 [Homo sapiens]

• Exploration… How to get paralogous sequences? How to get orthologous sequences? How to get sequences that share a common sequence domain? How to get 3-D structures? How to get dotplots without extra work?
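
As a reminder of what a dotplot shows, here is a minimal Python sketch. The function name `dotplot` and the toy sequences are illustrative assumptions; this is not how BLINK or any NCBI tool draws its plots, only the basic idea of marking positions where two sequences match.

```python
def dotplot(seq_a: str, seq_b: str, window: int = 1) -> list:
    """Return one text row per position in seq_a.

    A '*' at row i, column j means the `window`-long words starting at
    seq_a[i] and seq_b[j] are identical; '.' means they differ.
    """
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Toy example: a sequence compared with itself shows the main diagonal,
    # plus off-diagonal marks wherever a residue is repeated.
    for line in dotplot("ACGA", "ACGA"):
        print(line)
```

A larger `window` suppresses chance single-residue matches, so diagonals that remain are more likely to reflect real similarity.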

BLAST Variants

Gene Trees, more accurate reconstruction of evolution and prediction of function than pairwise “homology” alone
• Protein trees are calculated using the longest peptide of all the Ensembl protein-coding genes. Proteins are clustered based on Best-Reciprocal Hits and BLAST Score Ratios. Each cluster of proteins is aligned using MUSCLE. PHYML is used to get a gene tree from each multiple alignment. The gene tree is reconciled with the species tree using RAP to root the tree and to call duplication events.
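
The clustering step relies on best-reciprocal hits and BLAST score ratios. The sketch below illustrates both ideas in Python under stated assumptions: the protein names and bit scores are made up, the score ratio is taken as the hit score divided by the query's self-score (one common definition), and none of this is Ensembl's actual pipeline code.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical bit scores: scores[(query, subject)] = BLAST bit score.
# Self-hits are included so that score ratios can be computed.
Scores = Dict[Tuple[str, str], float]


def best_hit(query: str, targets: List[str], scores: Scores) -> Optional[str]:
    """Return the target with the highest score for `query`, or None."""
    candidates = [(scores[(query, t)], t) for t in targets if (query, t) in scores]
    return max(candidates)[1] if candidates else None


def reciprocal_best_hits(species_a: List[str], species_b: List[str],
                         scores: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) where a's best hit in B is b and b's best hit in A is a."""
    pairs = []
    for a in species_a:
        b = best_hit(a, species_b, scores)
        if b is not None and best_hit(b, species_a, scores) == a:
            pairs.append((a, b))
    return pairs


def blast_score_ratio(query: str, subject: str, scores: Scores) -> float:
    """Score of query vs. subject divided by the query's self-score."""
    return scores[(query, subject)] / scores[(query, query)]


# Made-up example data: two human and two fly proteins.
human = ["hA", "hB"]
fly = ["fA", "fB"]
scores: Scores = {
    ("hA", "hA"): 200, ("hB", "hB"): 180, ("fA", "fA"): 190, ("fB", "fB"): 170,
    ("hA", "fA"): 150, ("hA", "fB"): 60, ("hB", "fA"): 70, ("hB", "fB"): 40,
    ("fA", "hA"): 150, ("fA", "hB"): 70, ("fB", "hA"): 60, ("fB", "hB"): 40,
}

if __name__ == "__main__":
    print(reciprocal_best_hits(human, fly, scores))
    print(blast_score_ratio("hA", "fA", scores))
```

In this toy data only hA and fA pick each other as best hits; hB's best fly hit is fA, but fA prefers hA, so hB is left unpaired. That is the situation in which a gene tree, rather than pairwise scores alone, is needed to decide whether hB is a paralog.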

A gene tree for FOXJ3 from Ensembl
Hubbard et al. 2007. Ensembl 2007. Nucl. Acids Res. 35, no. suppl_1 (January 12): D610-617.

Overall Conclusions….

Human Disease Genes: Models for analysis
Fly, Worm, Yeast

Characteristics of a genome… (Human Genome)
Assembly: NCBI 36, Oct 2005
Genebuild: Ensembl, Aug 2006
Database version: 46.36h
Known genes: 21,667
Novel genes: 1,013
Pseudogenes: 1,040
RNA genes: 4,150
Immunoglobulin/T-cell receptor gene segments: 388
Genscan gene predictions: 69,185
Gene exons: 269,405
Gene transcripts: 44,340
SNPs: 11,772,162
Base pairs: 3,253,037,807
Golden Path Length: 3,093,120,360

Top 40 InterPro domains in human (protein families)

The big question, though…
What genotype change makes the phenotype change?
(Let us ask this for the phenotype change of body plan created by development in all Bilaterian animals over the last billion years or so.)

Very Different Body Plans, yet remarkably similar protein-coding

Very Different Body Plans/Phenotypes

Key Genome Data & Idea
Yet, remarkably similar gene products…
Developmental genes in body plans are among the most conserved.

Δ Cellular signaling networks
Δ Gene regulatory networks
Δ protein coding

Δ Cellular signaling networks
Protein-coding genes inherited from the last ancestor common to Bilaterian animals are rather similar, including the genes for development of the body plan… it seems that other changes, outside protein coding, are more important. Regulation of genes is the key difference.
Δ Regulatory networks
Δ protein coding

Timing and relatedness of events in trees
• Molecular clock
• Exact relations in species trees, and the estimates of divergence age that constrain both gene and species trees, often result from molecular sequence analysis itself and its assumptions.
• Some fossil events do provide timing milestones, but only for a few events.
• Numbers on branches are MYA (million years ago)
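
The molecular clock idea can be sketched in a few lines of Python. This assumes a strict clock (one constant substitution rate on every lineage), which is exactly the kind of assumption noted above; the distances and the 100 MYA calibration point are invented numbers, not values from any published tree.

```python
def calibrate_rate(distance: float, divergence_mya: float) -> float:
    """Substitutions per site per million years, from one dated split.

    The factor 2 accounts for change accumulating along both lineages
    since they diverged.
    """
    return distance / (2.0 * divergence_mya)


def estimate_divergence_mya(distance: float, rate: float) -> float:
    """Divergence time (MYA) implied by a pairwise distance and a rate."""
    return distance / (2.0 * rate)


if __name__ == "__main__":
    # Hypothetical calibration: two species with a fossil-dated split at
    # 100 MYA differ by 0.2 substitutions per site.
    rate = calibrate_rate(0.2, 100.0)
    # A second, undated pair differs by 0.5 substitutions per site,
    # which under this clock implies a split about 250 MYA.
    print(estimate_divergence_mya(0.5, rate))
```

Because the rate comes from a single calibration point, every date in the tree inherits the error in that point and in the constant-rate assumption.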

What can we conclude
• SURPRISING to many:
• Much more conservation in protein coding than many thought
• Far fewer human protein-coding genes than originally thought… not much different from other species.

Despite what we think we know from whole genome analysis…
• ENCODE
• Encyclopedia of DNA Elements (ENCODE) Project
• A project to intensively investigate 1% of the human genome.
• ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, no. 7146 (June 14): 799-816.
• Genome Research, June 2007 issue: many papers
• Following are key conclusions drawn directly from the paper.

• Experimental and comparative (evolutionary, across mammals) analysis in 44 regions

A few of their conclusions: transcription is “everywhere”…
• Transcription is more complex than expected, with many non-coding transcripts intercalating with standard protein-coding genes. However, there was little evidence for protein-coding genes outside of established sets.
• There are many more Transcription Start Sites (TSS) than expected, around 10-fold more than the number of protein-coding genes.
• “The human genome is pervasively transcribed, such that the majority of its bases are associated with at least one primary transcript and many transcripts link distal regions to established protein-coding loci.”
• “Many novel non-protein-coding transcripts have been identified, with many of these overlapping protein-coding loci and others located in regions of the genome previously thought to be transcriptionally silent.”

Start sites in many places.
• “Numerous previously unrecognized transcription start sites have been identified, many of which show chromatin structure and sequence-specific protein-binding properties similar to well-understood promoters.”
• “Regulatory sequences that surround transcription start sites are symmetrically distributed, with no bias towards upstream regions.”
• “Chromatin accessibility and histone modification patterns are highly predictive of both the presence and activity of transcription start sites.”

But the real surprise of ENCODE…
• Comparative analysis across mammals at greater specificity within these 44 regions, combined with functional information…
• 40% of constrained regions show no evidence of function in the assays performed so far.
• Many regions shown by assays to have biochemical function show no evolutionary constraint… For most types of non-coding functional elements, roughly 50% of the individual elements seemed to be unconstrained across all mammals.

Evolutionary Constraint (rejection of mutations by purifying selection).
• “A total of 5% of the bases in the genome can be confidently identified as being under evolutionary constraint in mammals; for approximately 60% of these constrained bases, there is evidence of function on the basis of the results of the experimental assays performed to date.”
• “Although there is general overlap between genomic regions identified as functional by experimental assays and those under evolutionary constraint, not all bases within these experimentally defined regions show evidence of constraint.”
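
The percentages quoted above combine as follows. This is only a worked piece of arithmetic in Python using the two figures from the quotation (5% constrained, roughly 60% of those with evidence of function); the variable names are illustrative and the results are as approximate as the inputs.

```python
from fractions import Fraction

# Figures taken from the quotation above.
constrained = Fraction(5, 100)            # share of genome bases under constraint
with_evidence = Fraction(60, 100)         # share of constrained bases with assay evidence

# Share of the whole genome that is constrained AND has evidence of function.
constrained_with_evidence = constrained * with_evidence

# Share of the whole genome that is constrained but has no assay evidence yet.
constrained_without_evidence = constrained - constrained_with_evidence

if __name__ == "__main__":
    print(float(constrained_with_evidence * 100), "% of genome")
    print(float(constrained_without_evidence * 100), "% of genome")
```

So about 3% of the genome is both constrained and experimentally supported, and about 2% is constrained with no function detected so far, which is the 40% of constrained sequence mentioned earlier.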

Neutrality, bigger than even most “neutralists” may have thought likely.
• “Surprisingly, many functional elements are seemingly unconstrained across mammalian evolution. This suggests the possibility of a large pool of neutral elements that are biochemically active but provide no specific benefit to the organism. This pool may serve as a 'warehouse' for natural selection, potentially acting as the source of lineage-specific elements and functionally conserved but non-orthologous elements between species.”
