Eukaryotic Genomes
September 15, 2011
Patrik Medstrand
Perspective and pitfalls
One of the broadest goals of biology is to understand the nature of each species: what are its mechanisms of development, metabolism, homeostasis, reproduction, and behavior? Sequencing a genome does not answer these questions directly. After genome annotation, we try to interpret the function of the genome’s constituents in the context of various physiological processes.
The field of bioinformatics needs continued development of algorithms to find genes, repetitive sequences, genome duplications and other features, as well as tools to identify conserved regions. We also need to realize that biology is a function of time, which has shaped organism functions as we see them today. Only then may we generate and test hypotheses about genome function.
Introduction to the eukaryotes
Eukaryotes are single-celled or multicellular organisms. They can roughly be divided into the following groups (U = unicellular; M = multicellular):
- “Protozoans” (U)
- Plants (M), algae (U/M)
- Amoebas (U)
- Metazoans (M; vertebrates and invertebrates)
- Fungi (U/M)
We will explore the eukaryotes using a phylogenetic tree by Baldauf et al. (Science, 2000). This tree was made by concatenating four protein sequences: elongation factor 1α, actin, α-tubulin, and β-tubulin.
General features of the eukaryotes
Some of the general features of eukaryotes that distinguish them from prokaryotes are:
• Eukaryotes include many multicellular organisms, in addition to unicellular organisms.
• Eukaryotes have a membrane-bound nucleus, intracellular organelles, and a cytoskeleton.
• Most eukaryotes undergo sexual reproduction.
• The genome size of eukaryotes spans a wider range than that of most prokaryotes.
• Eukaryotic genomes have a lower density of genes.
• Prokaryotes are haploid; eukaryotes have varying ploidy.
• Eukaryotic genomes tend to be organized into linear chromosomes with a centromere and telomeres, and genes are organized in exons separated by introns.
C value paradox: why eukaryotic genome sizes vary
The haploid genome size of eukaryotes, called the C value, varies enormously.
Small genomes include:
- Encephalitozoon cuniculi (2.9 Mb)
- A variety of fungi (10-40 Mb)
- Takifugu rubripes (pufferfish) (365 Mb; about the same number of genes as other fish or as the human genome, but 1/10th the size)
Large genomes include:
- Pinus resinosa (Canadian red pine) (68 Gb)
- Protopterus aethiopicus (marbled lungfish) (140 Gb)
- Amoeba dubia (amoeba) (690 Gb; about 200 times larger than the human genome)
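The size ratios quoted on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary and function names are made up for this example, and the human genome size of 2.9 Gb is the figure quoted on the mouse slide later in this lecture, not a value given here.

```python
# Haploid genome sizes (C values) in megabases, taken from the slide.
# Assumption: human = 2,900 Mb, the figure quoted on the mouse slide.
C_VALUES_MB = {
    "Encephalitozoon cuniculi": 2.9,
    "Takifugu rubripes": 365,
    "Homo sapiens": 2_900,
    "Pinus resinosa": 68_000,
    "Protopterus aethiopicus": 140_000,
    "Amoeba dubia": 690_000,
}


def fold_vs_human(species: str) -> float:
    """Return the genome size of `species` relative to the human genome."""
    return C_VALUES_MB[species] / C_VALUES_MB["Homo sapiens"]


# Print the genomes from smallest to largest with their size relative to human.
for name, size in sorted(C_VALUES_MB.items(), key=lambda kv: kv[1]):
    print(f"{name:28s} {size:>10,.1f} Mb  {fold_vs_human(name):8.3f}x human")
```

With these inputs the amoeba genome comes out at roughly 240 times the human genome and the pufferfish genome at roughly one eighth, in line with the rounded "x200" and "1/10th" on the slide.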
C value paradox: why eukaryotic genome sizes vary
The range in C values does not correlate well with the complexity of the organism. This phenomenon is called the C value paradox.
[Figure: genome composition, with segments labeled exons, simple repeats, "mobile" elements, and other]
The solution to this “paradox” is that genomes are filled with large tracts of noncoding, often repetitive (mobile) DNA sequences.
Eukaryotic genomes are organized into chromosomes
Genomic DNA is organized in chromosomes. The diploid number of chromosomes is constant in each species (e.g. 32 in S. cerevisiae, i.e. 16 per haploid set; 46 in human, i.e. 22 pairs of autosomes and 1 pair of sex chromosomes). Chromosomes are distinguished by a centromere and telomeres.
The chromosomes are routinely visualized by karyotyping (imaging the chromosomes during metaphase, when each chromosome is a pair of sister chromatids).
Eukaryotic chromosomes can be dynamic
Chromosomes can be highly dynamic, in several ways.
• Whole genome duplication (autopolyploidy) can occur, as in yeast (Chapter 15) and some plants.
• The genomes of two distinct species can merge, as in the mule (male donkey, 2n = 62, and female horse, 2n = 64).
• An individual can acquire an extra copy of a chromosome (e.g. Down syndrome, trisomy 13, trisomy 18).
• Chromosomes can fuse; e.g. human chromosome 2 derives from a fusion of two ancestral primate chromosomes.
• Chromosomal regions can be inverted (e.g. in hemophilia A).
• Portions of chromosomes can be deleted (e.g. Δ11q syndrome).
• Segmental and other duplications occur.
• Chromatin diminution can occur (Ascaris).
Broad genomic landscape: GC content
The overall GC content of the human genome is 41%. A plot of GC content versus number of 20 kb windows shows a broad profile with skewing to the right.
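The windowed GC profile described above can be sketched as follows. This is a minimal illustration, not the method used to produce the plot: the function names are hypothetical, windows are non-overlapping, and ambiguous bases such as N are simply ignored.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def windowed_gc(seq: str, window: int = 20_000) -> list[float]:
    """GC fraction of each non-overlapping window; a short trailing piece is dropped."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


# Toy 60 bp sequence: a GC-rich half followed by an AT-only half.
toy = "ATGCGC" * 5 + "ATATAT" * 5
print(windowed_gc(toy, window=30))
```

A histogram of the values returned for a real chromosome, with the default 20 kb window, would correspond to the kind of distribution the slide refers to.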
Broad genomic landscape: CpG islands
Dinucleotides of CpG are under-represented in genomic DNA, occurring at one fifth the expected frequency.
CpG dinucleotides are often methylated on cytosine (and may subsequently be deaminated to thymine).
Methylated CpG residues are often associated with house-keeping genes in the promoter and exonic regions.
Methyl-CpG binding proteins recruit histone deacetylases and are thus responsible for transcriptional repression. They have roles in gene silencing, genomic imprinting, and X-chromosome inactivation.
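The under-representation of CpG can be expressed as an observed/expected ratio. The sketch below assumes that the expected number of CpG dinucleotides is estimated from the C and G counts of the sequence itself (a commonly used convention, not something this slide specifies); the function name is made up for illustration.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    cpg = seq.count("CG")
    return cpg * len(seq) / (c * g)


print(cpg_obs_exp("CGCGCGCG"))   # CpG-rich toy sequence
print(cpg_obs_exp("CATG" * 10))  # toy sequence with C and G but no CpG
```

On this scale, a ratio near 1 means CpG occurs about as often as base composition predicts, and the genome-wide depletion described above corresponds to a ratio of roughly 0.2.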
Access to the human and other genomes
UCSC Genome Browser – http://genome.ucsc.edu
[Figure: chromosome 22]
Classes of repetitive DNA
1. Interspersed (transposable) repeats
- Retrovirus-like
- LINEs
- SINEs (Alu)
- DNA transposons
2. Tandem and simple sequence repeats
- Microsatellites (~1-5 bp units)
- Minisatellites (~10-500 bp units)
- Blocks (telomeric and centromeric repeats)
3. Processed genes (usually pseudogenes)
4. Segmental and tandem gene duplications (usually pseudogenes)
1. Repetitive DNA: Interspersed Repeats
Interspersed repeats (transposon-derived repeats) constitute ~45% of the human genome. They involve RNA intermediates (retroelements) or DNA intermediates (DNA transposons).
• Long-terminal repeat transposons (RNA-mediated)
• Long interspersed elements (LINEs); encode a reverse transcriptase
• Short interspersed elements (SINEs) (RNA-mediated); these include Alu repeats
• DNA transposons (encode a transposase)
2. Repetitive DNA: Tandem and simple sequence repeats
A. Microsatellites: from one to a dozen base pairs
Examples: (A)n, (CA)n, (CGG)n
These may be formed by replication slippage.
B. Minisatellites: a dozen to 500 base pairs
Simple sequence repeats of a particular length and composition occur preferentially in different species.
In humans, an expansion of triplet repeats such as CAG is associated with at least 14 disorders (including Huntington’s disease). Approximately 3% of the human genome sequence is made up of micro- and minisatellites.
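Perfect microsatellites such as (CA)n or (CGG)n can be located with a regular expression that uses a backreference. This is a rough sketch only: the function name, the maximum unit length of 6 bp, and the minimum of 4 copies are illustrative assumptions, and dedicated repeat finders also handle imperfect repeats, which this pattern does not.

```python
import re


def find_microsatellites(seq: str, max_unit: int = 6, min_copies: int = 4):
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    The lazy quantifier makes the regex try the shortest repeat unit first,
    so a run of A's is reported with unit "A" rather than "AA".
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence containing (CA)5 and (CGG)4.
toy = "GGTCACACACACATTCGGCGGCGGCGGAT"
print(find_microsatellites(toy))
```

Start positions are 0-based, and the repeat unit is reported in the phase in which it is first encountered when scanning from left to right.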
2. Repetitive DNA: Example of simple sequence repeat (CCCA or TGGG) in human genomic DNA
2. Repetitive DNA: Tandem and simple sequence repeats
C. Blocks of tandem repeats
These include telomeric repeats (e.g. TTAGGG repeated thousands of times in humans) and centromeric repeats (e.g. a 171 base pair α satellite DNA repeat spanning 1 to 4 Mb in humans).
Such repetitive DNA is often species-specific.
2. Repetitive DNA: Example of telomeric repeats (TTAGGG)
3. Repetitive DNA: Processed genes
A. Functional genes
Examples include retrotransposed genes that lack introns, such as:
Gene     RefSeq accession   Location   Original gene on
ADAM20   NM_003814          14q        8p
Cetn1    NM_004066          18p        Xq
Glud2    NM_012084          Xq         10q
Pdha2    NM_005390          4q         Xp
3. Repetitive DNA: Processed genes
B. Processed pseudogenes (non-functional)
These are not transcribed and/or translated. They may have a stop codon or frameshift mutation and do not encode a functional protein. They arise from retrotransposition. Approximately 8000 processed pseudogenes have been identified in the human genome.
Mark Gerstein’s pseudogene website: http://www.pseudogene.org
Repetitive DNA: Segmental and tandem duplications
A. Segmental duplications
These are blocks of about 1 kb to 300 kb that are copied intra- or interchromosomally. Segmentally duplicated blocks may contain non-genic regions, or complete or partial gene regions.
B. Tandem duplications
These are defined as segmental duplications (usually genic) that are located next to each other, in tandem or in clusters, on the same chromosome.
As an example, consider a group of lipocalin genes on human chromosome 9.
Repetitive DNA: Successive tandem gene duplications
A. Observed today (4 genes, 7/8 exons; 1 pseudogene)
B. Ancestral state (1 gene; 7 exons) and first duplication event
Software to detect repetitive DNA
It is essential to identify repetitive DNA in eukaryotic genomes. RepBase Update is a database of known repeats and low-complexity regions.
RepeatMasker is a program that searches DNA using the RepBase repeat sequences.
We will discuss the importance of using RepeatMasker when we are trying to predict genes.
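Once a sequence has been masked, the repeat content can be summarized directly from the masked sequence. The sketch below does not run RepeatMasker itself. It assumes a "soft-masked" sequence in which repeats appear in lowercase (the convention used, for example, in UCSC genome downloads) or a "hard-masked" one in which they are replaced by N or X; the function names are hypothetical.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of positions that are masked: lowercase (soft) or N/X (hard)."""
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base in "NX")
    return masked / len(seq)


def hard_mask(seq: str) -> str:
    """Convert a soft-masked sequence to a hard-masked one (lowercase -> N)."""
    return "".join("N" if base.islower() else base for base in seq)


toy = "ACGTacgtacgtGGCCNN"   # 8 soft-masked and 2 hard-masked positions out of 18
print(masked_fraction(toy))
print(hard_mask(toy))
```

Hard-masking before gene prediction is one way to keep a gene finder from calling exons inside transposable elements, at the cost of hiding any genuine exon that overlaps a repeat.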
Lecture part II
General features of eukaryotic genomes (including humans) – part II
Strategies for finding eukaryotic gene and promoter regions
Computer lab: Finding genes and promoters: sequence-based and by prediction.
Genes in eukaryotic DNA
Two of the biggest challenges in understanding any eukaryotic genome are:
• defining what a gene is, and
• identifying genes within genomic DNA
Types of genes include:
• protein-coding genes
• functional RNA genes
- tRNA: transfer RNA
- rRNA: ribosomal RNA
- snoRNA: small nucleolar RNA
- snRNA: small nuclear RNA
- miRNA: microRNA
- “asRNA”: anti-sense RNA
Location – Organization of protein-coding genes
Genes are found at different densities along a chromosome, and some chromosomes are more gene dense than others.
Genes are sometimes associated with higher GC content.
Human chr.   Genes/Mb   GC content
All          7.5        0.41
Chr. 1       9.4        0.41
Chr. 19      24.4       0.47
Chr. Y       2.1        0.39
Genes can be located in introns of other genes.
Genes can be part of multigene families.
The number of genes in humans is ~30,000+.
Structure of protein-coding genes
Multiexon or single exon
Feature            Size (mean)
Promoter region    500 bp ?
5’ UTR             300 bp
3’ UTR             770 bp
Internal exon      145 bp
Exon number        8.8
Coding sequence    1340 bp / 447 aa
Introns            3365 bp
Genomic extent     27 kb
Human genome: protein coding genes
As part of the sequencing effort, the most common protein families, domains, and motifs have been catalogued. This also permits comparative proteomic analyses.
Overall, 40% of predicted human proteins could be placed in InterPro functional categories. A blastp search of every human protein revealed that 74% had significant matches to known proteins.
Families include: GPCR, KRAB, ion transport
Domains include: zinc fingers, Ig/MHC
Individual eukaryotic genomes: the fruitfly Drosophila (2000)
Drosophila’s distinguishing features: short life cycle, varied phenotypes, model organism in genetics.
Genome size: 180 Mb
Chromosomes: 5
Genes: about 13,000 (spanning 27% of genome)
Website: http://www.fruitfly.org
-- At the time, the largest genome to which whole genome shotgun sequencing had been applied.
-- Each genome annotation improves the gene models.
Individual eukaryotic genomes: the mosquito Anopheles gambiae (2002)
A. gambiae was the second insect genome sequenced.
Distinguishing features: It is the malaria parasite vector.
Genome size: 278 Mb (twice the size of Drosophila)
Chromosomes: 3
Genes: about 14,000
Website: http://www.ensembl.org/Anopheles_gambiae/
-- Diverged from Drosophila 250 MYA (average amino acid sequence identity of orthologs is 56%). Compare human and pufferfish (diverged 400 MYA, 61% identity): insect proteins diverge at a faster rate.
-- High degree of genetic variation
Individual eukaryotic genomes: the fish Fugu rubripes (2002)
Fugu is a pufferfish (also called Takifugu rubripes).
Distinguishing features: Diverged from humans 450 MYA; has a comparable number of genes in a compact genome.
Genome size: 365 Mb (1/10th of the human genome)
Genes: about 30,000
Website: http://genome.jgi-psf.org/fugu6/fugu6.info.html
-- Only 2.7% of the genome is interspersed repeats (compare 45% in human), based on RepeatMasker.
-- Introns are relatively short. 75% of Fugu introns are
Individual eukaryotic genomes: the mouse Mus musculus (2002)
M. musculus is the second mammal to have its genome sequenced. Mouse diverged from human 75 MYA.
Distinguishing features: only 300 of 30,000 annotated genes have no human orthologs
Genome size: 2.5 Gb (euchromatic portion) (cf. 2.9 Gb human)
Chromosomes: 21
Genes: about 30,000
Website: http://www.informatics.jax.org
-- Dozens of mouse-specific expansions occurred, such as the olfactory receptor gene family.
-- 40% of the mouse genome can be aligned to the human genome at the nucleotide level.
Major differences between human and mouseare located in the non-genic regions
Comparison of eukaryotic DNA: PipMaker and VISTA
We studied pairwise sequence alignment at the beginning of the course. In studying genomes, it is important to align large segments of DNA.
PipMaker and VISTA are two tools for sequence alignment and visualization. They show conserved segments, including the order and orientation of conserved elements. They also display large-scale genomic changes (inversions, rearrangements, duplications).
VISTA (http://www-gsd.lbl.gov/vista)
PipMaker (http://bio.cse.psu.edu/pipmaker)
Protein-coding genes in eukaryotic DNA: a new paradox
The C value paradox is answered by the presence of noncoding DNA.
Why is the number of protein-coding genes about the same for worms, flies, plants, and humans?
This has been called the N value paradox or the G value paradox (both names refer to the number of genes).
Human genome: proteome complexity
Although the human genome encodes a number of proteins comparable to that of other genomes, the human proteome may display greater complexity.
• There are relatively more domains and protein families in humans.
• The human genome includes relatively more paralogs, potentially yielding more functional diversity.
• Domain architectures tend to be more complex.
• Transcriptional/translational regulation may be more complex.
• Alternative promoters and alternative splicing may be more extensive in humans.
Alternative splicing
[Diagram: gene model with promoter (P), UTR, and CDS]
40-60% of human genes are alternatively spliced.
May give rise to protein isoforms.
Alternative promoter usage
[Diagram: gene model with two promoters (P, P), UTR, and CDS]
Approximately 20% of human genes use multiple promoters.
Give rise to mRNAs with different 5’ UTRs; may affect translational efficiency.
Antisense transcripts
[Diagram: gene model with two promoters (P, P), UTR, and CDS]
~20% of human genes have overlapping antisense (AS) transcripts.
AS transcripts may interfere with sense transcripts and may induce methylation; they may represent another way of gene regulation.
microRNA
A special type of small noncoding RNA that binds to complementary regions of mRNA. The interaction may repress translation or degrade the mRNA.
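The idea of a microRNA binding to a complementary region of an mRNA can be illustrated with a simple string search. The sketch below makes assumptions that go beyond the slide: it looks only for perfect complementarity to a short "seed" (nucleotides 2-8 of the microRNA), the sequences are toy examples, and all names are hypothetical. Real target prediction tools use considerably more elaborate criteria.

```python
# RNA base pairing: A-U and C-G.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna: str) -> str:
    """Reverse complement of an RNA sequence."""
    return rna.upper().translate(COMPLEMENT)[::-1]


def seed_sites(mirna: str, mrna: str, seed_start: int = 1, seed_end: int = 8):
    """0-based positions in mrna that are perfectly complementary to the seed.

    The default slice [1:8] corresponds to microRNA nucleotides 2-8 (an assumption).
    """
    seed = mirna.upper()[seed_start:seed_end]
    target = reverse_complement(seed)
    mrna = mrna.upper()
    sites = []
    pos = mrna.find(target)
    while pos != -1:
        sites.append(pos)
        pos = mrna.find(target, pos + 1)
    return sites


# Toy sequences, made up for illustration.
toy_mirna = "UGAGGUAGUAGG"
toy_mrna = "AAACUACCUCAAAGGGCUACCUCUU"
print(reverse_complement(toy_mirna[1:8]))
print(seed_sites(toy_mirna, toy_mrna))
```

The same reverse-complement search applies to antisense transcripts in general, since they too act through base pairing with the sense strand.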