Hap Map Project - Bgbunict.it

The International HapMap 

Project 

Academic Year 2009/2010 

Dr. Laura Rita Duro

Most common diseases, such as diabetes, 

cancer, stroke, heart disease, depression, and 

asthma, are affected by combinations of multiple 

genetic and environmental factors

Genetic and environmental contributions to monogenic and 

complex disorders 

(A) Monogenic disease. A variant in a single gene is the primary determinant of a 

monogenic disease or trait, responsible for most of the disease risk or trait variation 

(dark blue sector), with possible minor contributions of modifier genes (yellow 

sectors) or environment (light blue sector). 

(B) Complex disease. Many variants of small effect (yellow sectors) contribute to 

disease risk or trait variation, along with many environmental factors (blue sector).

More than a thousand genes for rare, highly heritable ‘mendelian’ disorders 

have been identified, in which variation in a single gene is both necessary 

and sufficient to cause disease. 

Complex diseases, in contrast, have proven much more challenging to study, 

as they are thought to be due to the combined effect of many different 

susceptibility DNA variants interacting with environmental factors.

Discovering these genetic factors will provide 

fundamental new insights into the pathogenesis, 

diagnosis and treatment of human disease

Although any two unrelated people are the 

same at about 99.9% of their DNA sequences, 

the remaining 0.1% is important because it 

contains the genetic variants that influence 

how people differ in their risk of disease or 

their response to drugs. 

Discovering the DNA sequence variants that 

contribute to common disease risk offers one 

of the best opportunities for understanding the 

complex causes of disease in humans.

Human Genetic Variations 

Primarily two types of genetic mutation events create all 

forms of variations: 

Single base mutation which substitutes 

one nucleotide for another 

-Single Nucleotide Polymorphisms (SNP) 

Insertion or deletion of one or more 

nucleotide(s) 

-Tandem Repeat Polymorphisms 

-Insertion/Deletion Polymorphisms

Tandem Repeat Polymorphisms 

Tandem repeats or variable number of tandem repeats (VNTR) are a very 

common class of polymorphism, consisting of variable length of sequence 

motifs that are repeated in tandem in a variable copy number. 

VNTRs are subdivided into two subgroups based on the size of the 

tandem repeat units. 

Microsatellites or Short Tandem Repeat (STR) 

repeat unit: 1-6 (dinucleotide repeat: CACACACACACA) 

Minisatellites 

repeat unit: 10-100

SNPs 

Sites in the genome where the 

DNA sequences of many 

individuals differ by a single 

base are called single 

nucleotide polymorphisms 

(SNPs) 

For example, some people 

may have a chromosome with 

an A at a particular site where 

others have a chromosome 

with a G 

Each form is called an allele

Variation Or Mutation ? 

Terminology for variation at a single 

nucleotide position is defined by allele 

frequency

Polymorphism 

A sequence variation that occurs at least 1 

percent of the time (≥ 1%) 

90% of variations are SNPs 

Mutation 

If the variation is present less than 

1 percent of the time (< 1%)

Transitions and Transversions 

SNPs include single base substitutions such as: 

Transitions: 

change of one purine (A,G) for a purine, 

or a pyrimidine (C,T) for a pyrimidine 

A→G, G→A, C→T, T→C 

Transversions: 

change of a purine (A,G) for a pyrimidine (C,T), 

or vice versa 

A→C, A→T, G→C, G→T, C→A, C→G, T→A, T→G
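
The purine/pyrimidine rule above can be written as a short check. This is a minimal sketch for illustration only; the function name `classify_substitution` is an assumption and does not come from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical, so this is not a substitution")
    # A transition stays within one chemical class (purine or pyrimidine);
    # a transversion crosses from one class to the other.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
        print(f"{ref}>{alt}: {classify_substitution(ref, alt)}")
```

Of the twelve possible ordered substitutions, four are transitions and eight are transversions, matching the two lists on this slide.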

In principle, SNPs could be bi-, tri-, or tetra-allelic 

polymorphisms 

However, in humans, tri-allelic and tetra-allelic 

SNPs are rare almost to the point of 

non-existence, and so SNPs are sometimes 

simply referred to as bi-allelic markers

Non-coding SNPs: 

5’ and 3’ UTRs 

Introns 

Intergenic Spaces 

Non-synonymous Coding 

SNPs: 

when single base substitutions 

cause a change in the resultant 

amino acid 

Synonymous Coding SNPs: 

when single base substitutions do 

not cause a change in the 

resultant amino acid

Non-coding SNPs 

Example: Regulatory SNPs (rSNPs) 

Two allelic variants of the same gene are transcribed in different 

amounts as a consequence of an adjacent polymorphism. In this 

example, allele G, located upstream of the gene, has a higher 

transcript level than does allele T.

Coding SNPs 

Example: Synonymous, the mutation does not change 

the amino acid.

Coding SNPs 

Example: Non-synonymous, the mutation changes 

the amino acid.
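
The distinction between synonymous and non-synonymous coding SNPs can be checked by translating the codon before and after the substitution. The sketch below assumes the standard genetic code and DNA (not RNA) codons; the function name `classify_coding_snp` and the 0-based `position` argument are illustrative choices, not part of the slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (third position varying fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change inside a codon.

    position is 0, 1 or 2 (0-based index within the codon).
    A change to or from a stop codon is reported as non-synonymous here.
    """
    codon, alt_base = codon.upper(), alt_base.upper()
    if codon not in CODON_TABLE or alt_base not in BASES or position not in (0, 1, 2):
        raise ValueError("expected a DNA codon, a position in 0..2 and a base in ACGT")
    if codon[position] == alt_base:
        raise ValueError("alternative base equals the reference base")
    mutated = codon[:position] + alt_base + codon[position + 1:]
    same_amino_acid = CODON_TABLE[codon] == CODON_TABLE[mutated]
    return "synonymous" if same_amino_acid else "non-synonymous"


if __name__ == "__main__":
    print(classify_coding_snp("GAA", 2, "G"))  # GAA and GAG both encode Glu
    print(classify_coding_snp("GAG", 1, "T"))  # GAG (Glu) becomes GTG (Val)
```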

SNPs 

It has been estimated that, in the world’s human population, 

about 10 million sites (that is, one variant per 300 bases on 

average) vary such that both alleles are observed at a 

frequency of > 1%, and that these 10 million common SNPs 

constitute 90% of the variation in the population. 

The remaining 10% is due to a vast array of variants that are 

each rare in the population. 

The presence of particular SNP alleles in an individual is 

determined by testing (‘genotyping’) a genomic DNA sample. 

NATURE |VOL 426 | 18/25 DECEMBER 2003

A particular combination of alleles along a 

chromosome is termed a haplotype 

A haplotype is a set of SNPs on a single chromatid 

that are statistically associated

The coinheritance of SNP alleles on these haplotypes 

leads to associations between these alleles in the 

population 

(known as linkage disequilibrium, LD)

Linkage disequilibrium 

 

Situation in which some combinations of alleles or genetic 

markers occur more or less frequently in a population than 

would be expected from a random formation of haplotypes 

from alleles based on their frequencies. 

Non-random associations between polymorphisms at 

different loci are measured by the degree of linkage 

disequilibrium (LD).

The LD between many neighboring SNPs generally persists because meiotic recombination 

does not occur at random, but is concentrated in recombination hot spots. 

Adjacent SNPs that lack a hot spot between them are likely to be in strong LD. 

r² = 1: two SNPs are perfectly correlated (allele A of SNP1 is always observed with 

allele C of SNP2, and vice versa). 

r² = 0: allele A of SNP1 provides no information at all about which allele of SNP4 is 

present. 

Complete independence of these 6 SNPs would predict the possibility of 64 different 

haplotypes (because n biallelic SNPs could generate 2^n haplotypes), but in reality just 4 

haplotypes comprise 90% of observed chromosomes, indicating that LD is present. 
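
As an illustration of the r² values quoted above, the sketch below computes r² for two biallelic SNPs from a list of phased haplotypes, using the standard definition r² = D² / (pA(1 − pA) pB(1 − pB)) with D = pAB − pA·pB. The haplotype strings are invented toy data and the function name `r_squared` is an assumption for this example.

```python
def r_squared(haplotypes, i, j):
    """Return r^2 between biallelic sites i and j of phased haplotype strings."""
    n = len(haplotypes)
    # Take the allele seen on the first haplotype as the reference at each
    # site; for biallelic sites r^2 does not depend on this choice.
    a = haplotypes[0][i]
    b = haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


# Toy data: each string is one chromosome, one character per SNP.
PERFECT_LD = ["AC", "AC", "GT", "GT"]   # A always travels with C
INDEPENDENT = ["AC", "AT", "GC", "GT"]  # every combination equally frequent

# Upper bound on distinct haplotypes for n biallelic SNPs.
N_SNPS = 6
MAX_HAPLOTYPES = 2 ** N_SNPS

if __name__ == "__main__":
    print(r_squared(PERFECT_LD, 0, 1))
    print(r_squared(INDEPENDENT, 0, 1))
    print(MAX_HAPLOTYPES)
```

With these toy inputs the first pair gives r² = 1, the second gives r² = 0, and six biallelic SNPs allow at most 64 haplotypes.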

Because of the strong associations 

among the SNPs in most chromosomal 

regions, only a few carefully chosen 

SNPs (known as tag SNPs) need to be 

typed to predict the likely variants at the 

rest of the SNPs in each region 

SNP1, SNP2, and SNP3 are strongly correlated, and SNP4, SNP5, and SNP6 

are strongly correlated, so that any of SNP1–SNP3 (or SNP4–SNP6) could 

serve as tags for the other 2 SNPs in each group.
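
One simple way to pick tag SNPs from such correlated groups is a greedy search over pairwise r². The sketch below is not the method used by the HapMap Project; it is an illustrative approach under stated assumptions: the r² threshold of 0.8 is an assumed cut-off, the haplotypes are invented toy data, and `greedy_tag_snps` is a hypothetical helper name.

```python
def r_squared(haplotypes, i, j):
    """r^2 between biallelic sites i and j; 0.0 if either site is monomorphic."""
    n = len(haplotypes)
    a, b = haplotypes[0][i], haplotypes[0][j]
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    return (p_ab - p_a * p_b) ** 2 / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNPs (0-based indices) so every SNP is a tag or is captured
    by a tag at r^2 >= threshold. The threshold is an assumption."""
    n_sites = len(haplotypes[0])
    # captures[i] = SNPs that SNP i would tag, including itself.
    captures = {
        i: {j for j in range(n_sites)
            if i == j or r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_sites)
    }
    untagged = set(range(n_sites))
    tags = []
    while untagged:
        # Choose the SNP that captures the most still-untagged SNPs.
        best = max(range(n_sites), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Toy data: SNP1-SNP3 move together, SNP4-SNP6 move together,
# and the two groups are independent of each other.
HAPLOTYPES = ["ACACAC", "ACATGT", "GTGCAC", "GTGTGT"]

if __name__ == "__main__":
    print([f"SNP{i + 1}" for i in greedy_tag_snps(HAPLOTYPES)])
```

On this toy input the search returns one tag per correlated group (SNP1 and SNP4), so two genotyped SNPs stand in for all six.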

Many empirical studies have shown highly significant levels of LD, and 

often strong associations between nearby SNPs, in the human genome. 

Because the likelihood of recombination between two SNPs increases 

with the distance between them, on average such associations between 

SNPs decline with distance. 

Average linkage disequilibrium, |D|, vs. 

distance between SNPs for 2597 

genes in which accurate distances 

were available. 

Lower values indicate a stronger effect 

of recombination and recurrent 

mutation. 

LD decreases with distance. 

B.A. Salisbury et al. Mutation Research 2003

The strong 

associations between 

SNPs in a region have 

a practical value 

Genotyping only a few, carefully chosen SNPs in the region will provide enough 

information to predict much of the information about the remainder of the common 

SNPs in that region. As a result, only a few of these ‘tag’ SNPs are required to 

identify each of the common haplotypes in a region. 

On the basis of empirical studies, it has been estimated that most of 

the information about genetic variation represented by the 10 million 

common SNPs in the population could be provided by genotyping 

200,000 to 1,000,000 tag SNPs across the genome. 

These observations are the conceptual and empirical foundation for 

developing a haplotype map of the human genome, the ‘HapMap’.

The International HapMap Project is a partnership of scientists 

and funding agencies from Canada, China, Japan, Nigeria, the 

United Kingdom and the United States to develop a public resource 

that will help researchers find genes associated with human 

disease and response to pharmaceuticals. 

An initial meeting to discuss the scientific and ethical issues associated 

with developing a human haplotype map was held in Washington in 2001. 

The International HapMap Project was then formally initiated in 2002.

The goal of the International HapMap Project is to develop a 

haplotype map of the human genome, the HapMap, 

which will describe the common patterns of human DNA 

sequence variation. 

The HapMap is expected to be a key resource for researchers 

to use to find genes affecting health, disease, and responses 

to drugs and environmental factors. 

The information produced by the Project is freely available 

(www.hapmap.org) 


The HapMap was designed to determine the frequencies and 

patterns of association among roughly 3 million common 

SNPs in four populations, for use in genetic association 

studies 

The HapMap project focuses only on common SNPs, those 

where each allele occurs in at least 1% of the population

The project studied a total of 270 DNA samples: 

 

90 samples from a US Utah population with 

Northern and Western European ancestry 

(samples collected in 1980 by the Centre 

d’Etude du Polymorphisme Humain (CEPH) 

and used for other human genetic maps) 

new samples collected from 90 Yoruba 

people in Ibadan, Nigeria 

 

45 unrelated Japanese in Tokyo, Japan 

 

45 unrelated Han Chinese in Beijing, China

The International HapMap Consortium decided to include several 

populations from different ancestral geographic locations to ensure that the 

HapMap would include most of the common variation and some of the less 

common variation in different populations. 


Human Genome Project 

vs 

International HapMap Project 

In its scope and potential consequences, the International HapMap Project 

has much in common with the Human Genome Project, which sequenced the 

human genome. 

Both projects have been scientifically ambitious and technologically 

demanding, have involved intense international collaboration, have been 

dedicated to the rapid release of data into the public domain, and promise to 

have profound implications for our understanding of human biology and 

human health. 

Whereas the sequencing project covered the entire genome, including the 

99.9% of the genome where we are all the same, the HapMap will 

characterize the common patterns within the 0.1% where we differ from each 

other.

The project became practical through the confluence of the following: 

the availability of the human genome sequence; 

databases of common SNPs (subsequently enriched by this 

project) from which genotyping assays could be designed; 

insights into human LD; 

development of inexpensive, accurate technologies for high-throughput 

SNP genotyping; 

web-based tools for storing and sharing data. 

The International HapMap Consortium NATURE October 2005

HapMap Project comprises two phases 

The complete data obtained 

in Phase I were published 

in October 2005. 

The analysis of the Phase II 

dataset was published in 

October 2007.

The Phase I HapMap 

Phase I of the HapMap Project set as a 

goal genotyping at least one common 

SNP every 5 kb across the genome in 

each of 269 DNA samples. 

For the sake of practicality, and motivated 

by the allele frequency distribution of 

variants in the human genome, a minor 

allele frequency (MAF) of 0.05 or greater 

was targeted for study. 

Minor Allele Frequency (MAF) : The frequency at which the less abundant 

(or minor) allele of a SNP is present in a population. The MAF for a SNP to 

be considered common is usually above 1%.
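
The MAF definition above can be made concrete with a small calculation from diploid genotype calls. This is a minimal sketch: the genotype lists are invented, the function name `minor_allele_frequency` is an assumption, and missing calls are not handled.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF of a biallelic SNP from diploid genotypes such as 'AG'."""
    # Each genotype contributes two alleles (one per chromosome).
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) > 2:
        raise ValueError("expected a biallelic SNP")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    if len(counts) < 2:
        return 0.0  # monomorphic in this sample
    return min(counts.values()) / total


# Invented example data.
SAMPLE_A = ["AA", "AG", "AA", "GG", "AG"]  # 6 A alleles, 4 G alleles
SAMPLE_B = ["AA"] * 19 + ["AG"]            # 39 A alleles, 1 G allele

if __name__ == "__main__":
    for name, sample in [("A", SAMPLE_A), ("B", SAMPLE_B)]:
        maf = minor_allele_frequency(sample)
        # 0.01 is the 'common SNP' cut-off and 0.05 the Phase I target
        # quoted on this slide.
        print(name, maf, maf >= 0.01, maf >= 0.05)
```

Sample B illustrates a SNP that counts as common under the 1% definition (MAF 0.025) but falls below the 0.05 threshold targeted in Phase I.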

The project required a dense map of SNPs, ideally containing 

information about validation and frequency of each candidate SNP. 

When the project started, the public SNP 

database (dbSNP) contained 2.6 million 

candidate SNPs, few of which were 

annotated with the required information. 

The HapMap Project contributed about 6 

million new SNPs to dbSNP. As of October 

2005, dbSNP contained 9.2 million 

candidate human SNPs.

To study patterns of genetic variation, ten 500-kb regions were selected 

from the ENCODE (Encyclopedia of DNA Elements) Project. 

These ten regions were chosen to approximate the genome-wide 

average for G+C content, recombination rate, percentage of sequence 

conserved relative to mouse sequence, and gene density. 

Each 500-kb region was sequenced in 48 individuals, and all SNPs in 

these regions (discovered or in dbSNP) were genotyped in the complete 

set of 269 DNA samples.

Using the data provided by HapMap, a team of scientists at Harvard Medical 

School and the Broad Institute has discovered a new genetic variant 

associated with age-related macular degeneration (AMD), the leading cause 

of blindness in people over 60 years of age, as well as confirming previously 

reported variants 

Nature Genetics - 38, 1055 - 1059 (2006)

They estimate that genotypes related to just five variants in three different genes can 

explain 50% of the risk of developing AMD 

The new genetic common variant identified was found in a non-coding region of the 

Complement Factor H (CFH) gene, other variants of which were recently shown to 

be associated with the risk of developing AMD. 

In addition to CFH on chromosome 1, the associated variants lie in: 

the complement factor B (BF) gene on chromosome 6; 

the complement component 2 (C2) gene on chromosome 6; 

the hypothetical gene LOC387715 on chromosome 10 (a common variant, A69S). 

Interestingly, these three genes do not appear to interact directly, but instead 

contribute to the risk of AMD independently.

Phase II HapMap characterizes over 3.1 million human 

SNPs genotyped in 270 individuals from four 

geographically diverse populations

Genotyping in phase II was attempted for about 4.4 million 

distinct SNPs, of which roughly 1.3 million either could not be typed, 

were not polymorphic in any of the populations, or did not pass 

genotyping quality control filters. 

 

Certain regions of the genome were recognized as being 

challenging to study, such as centromeres, telomeres, 

gaps in genome sequence, and segmental duplications, 

regions declared to be not HapMapable.

The resulting HapMap has an SNP density of approximately 

one per kilobase and is estimated to contain approximately 

25–35% of all the 9–10 million common SNPs in the 

assembled human genome

Variation in SNP density within the Phase II HapMap 

Phase I 

Phase II 

Example of the fine-scale structure of SNP density for a 100-kb region on chromosome 

17 showing polymorphic Phase I SNPs in the consensus data set (red triangles) and 

polymorphic Phase II SNPs in the consensus data set (blue triangles) 

The Phase II HapMap differs from the Phase I HapMap also in minor allele frequency 

(MAF) distribution. SNPs added in Phase II have lower MAF. Phase II HapMap 

includes a better representation of rare variation than the Phase I HapMap

Advances in technology for high-throughput SNP genotyping 

Advances in genotyping technology have vastly increased the number 

of variants that can be typed and decreased the per-sample costs 

These advances have made possible the 

dense genotyping needed to capture the 

majority of SNP variation within an individual 

at a sufficiently low cost to allow the large 

sample sizes needed for comparison of 

individuals with and without disease

Studies in additional populations have shown that the tag 

SNPs chosen using the HapMap are generally transferable 

across other populations, but there are some limitations. 

So additional samples from the populations used to develop 

the HapMap as well as from seven more populations have 

recently been genotyped across the genome. 

 

 

 

 

 

 

 

Luhya from Webuye, Kenya 

Maasai from Kenya 

Tuscans from Italy 

Indian-Americans (Gujarati) from Houston, TX 

Han Chinese from Denver 

Mexican-Americans from Los Angeles 

Americans of African Descent from the SW USA

It is now clear that the HapMap can be 

a useful resource for the design and 

analysis of disease association studies 

in populations across the world

APPLICATION OF THE HAPMAP TO 

COMMON DISEASE

The technological advances directly stimulated or 

indirectly facilitated by the HapMap have had a 

profound impact on the study of the genetics of 

common diseases

The history of high-density GWA 

scanning to date has 

demonstrated the striking 

success of this approach in 

finding genetic variants 

associated with disease. 

Variants or regions associated 

with nearly 40 complex diseases 

have been identified in diverse 

population samples.

Major Autism Gene Found with Help of HapMap 

Using data from the HapMap, along with DNA samples collected from many 

families who have affected children, researchers have discovered a genetic 

variation linked to autism, one of the most heritable mental health 

conditions. 

They found a variation in the sequence of a gene - the “MET receptor 

tyrosine kinase gene” - that is associated with autism. This gene is involved 

in brain development, immune function, and digestive system repair. 

The MET promoter variant rs1858830 allele "C" is strongly associated with 

ASD and results in reduced gene transcription. MET protein levels were 

significantly decreased in ASD cases compared with control subjects. 

People who have the variation are more than twice as likely as others to 

have “autism spectrum disorders” 

Campbell DB et al, Ann Neurol. 2007

A genome-wide association study identifies novel risk loci for 

type 2 diabetes 

Type 2 diabetes mellitus results from the interaction of environmental factors 

with a combination of genetic variants. 

A systematic search for these variants was recently made possible by the 

development of high-density arrays that permit the genotyping of hundreds of 

thousands of polymorphisms. 

Researchers tested 392,935 SNPs in a French case–control cohort. 

Markers with the most significant difference in genotype frequencies between 

cases of type 2 diabetes and controls were fast-tracked for testing in a second 

cohort. 

This identified four loci containing variants that confer type 2 diabetes risk, in 

addition to confirming the known association with the TCF7L2 gene. 

These loci include a non-synonymous polymorphism in the zinc transporter 

SLC30A8, which is expressed exclusively in insulin-producing β-cells, and two 

linkage disequilibrium blocks that contain genes potentially involved in β-cell 

development or function (IDE–KIF11–HHEX and EXT2–ALX4). 

Sladek R et al. Nature 445, 881-885 (2007)
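
The case-control comparison described above can be illustrated with a per-SNP allelic chi-square test on a 2×2 table of allele counts. This is a minimal sketch and not the statistical procedure of the cited study: the allele counts are invented, the function name is an assumption, and the Bonferroni threshold is shown only as one common way to account for testing 392,935 SNPs.

```python
import math


def allelic_chi_square(case_counts, control_counts):
    """Pearson chi-square (1 degree of freedom) for a 2x2 allele-count table.

    Each argument is (minor_allele_count, major_allele_count).
    Returns (chi2, p_value).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        raise ValueError("every row and column of the table needs a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / math.prod(margins)
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: 500 cases and 500 controls, two alleles per person.
CASES = (300, 700)
CONTROLS = (200, 800)

# Assumed multiple-testing correction for the number of SNPs tested.
N_TESTS = 392_935
BONFERRONI_ALPHA = 0.05 / N_TESTS

if __name__ == "__main__":
    chi2, p = allelic_chi_square(CASES, CONTROLS)
    print(f"chi2 = {chi2:.2f}, p = {p:.2e}, threshold = {BONFERRONI_ALPHA:.2e}")
```

Because hundreds of thousands of SNPs are tested at once, a marker has to reach a very small p-value before it is carried forward, which is one reason large cohorts and a second replication cohort are needed.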

Future of the HapMap Project 

Additional samples from the populations used to develop 

the initial HapMap, as well as samples from seven additional 

populations, will be sequenced and genotyped extensively to extend 

the HapMap, providing information on rarer variants and helping to 

enable genome-wide association studies in additional populations. 

There are also ongoing efforts by many groups to characterize 

additional forms of genetic variation, such as structural variation, and 

molecular phenotypes in the HapMap samples. Finally, in the future, 

whole-genome sequencing will provide a natural convergence of 

technologies to type both SNP and structural variation. 

Nevertheless, until that point the HapMap Project data will provide an 

invaluable resource for understanding the structure of human genetic 

variation and its link to phenotype.

Beyond SNPs: 

Copy Number Variants and Other 

Structural Variation

Current generation high-throughput 

genotyping platforms are 

extraordinarily efficient at genotyping 

SNPs, but they are less effective at 

genotyping structural variants, such 

as insertions, deletions, inversions, 

and copy number variants

Although not as common as SNPs, these variants also occur 

commonly in the human genome 

The distribution of copy number variation in the human genome among 270 HapMap samples

A copy number variant (CNV) is 

a segment of DNA in which copy-number 

differences have been 

found by comparison of two or more 

genomes. 

CNVs, in which stretches of genomic 

sequence of roughly 1 kb to 3 Mb in 

size are deleted or duplicated in 

varying numbers, have gained 

increasing attention because of their 

apparent ubiquity and potential 

dosage effect on gene expression.

In 2004, the interrogation of genomic variability by array 

hybridization methods clearly demonstrated the existence of copy 

number variants. 

Intense analysis of this type of genomic variability followed, and 

the current conservative estimate from studies in a few hundred 

individuals is that at least 10% of the genome is subject to copy 

number variation

Although a typical SNP affects only a single nucleotide 

pair, the genomic abundance of SNPs (over 10 million) makes 

them the most frequent source of polymorphic changes 

By contrast, CNVs are far less numerous but can affect 

from one kilobase to several megabases of DNA per 

event, adding up to a significant fraction of the genome

It is now recognized that the genomes of any two 

individuals in the human population differ more at the 

structural level than at the nucleotide sequence level 

NATURE GENETICS SUPPLEMENT | VOLUME 39 | JULY 2007

Much of what was previously known about the role of CNVs in disease 

comes from a rich literature on ‘genomic disorders’. 

 

Genomic disorders are defined as a diverse group of genetic diseases that 

are each caused by an alteration in DNA copy number. 

These mutations can be relatively large, microscopically visible 

imbalances, such as in Prader-Willi syndrome, or they may be much 

smaller, requiring higher resolution detection methods, such as in Williams 

Syndrome. 

 

Genomic disorders are typically sporadic in nature because the CNV in 

most cases is a de novo mutation with nearly complete penetrance, and 

because the affected individuals have severe developmental problems and 

are unlikely to have offspring. 

However, there are notable examples of mendelian disease traits 

associated with CNVs. For example, duplications of the gene for peripheral 

myelin protein 22 (PMP22) cause the dominant neuropathy Charcot-Marie 

Tooth disease type 1A, and deletions of the α-globin gene cluster cause 

the recessive anemia α-thalassemia.

References 

The International HapMap Consortium. The International HapMap Project. 

NATURE. 426: 18/25, December 2003. 

Deloukas P, Bentley D. The HapMap project and its application to genetic 

studies of drug response. The Pharmacogenomics Journal. 4, 88–90 (2004). 

The International HapMap Consortium. A haplotype map of the human 

genome. NATURE. 437: 27, October 2005. 

Manolio TA, Brooks LD, Collins FS. A HapMap harvest of insights into the 

genetics of common disease. The Journal of Clinical Investigation. 118: 5, 

May 2008. 

The International HapMap Consortium. A second generation human 

haplotype map of over 3.1 million SNPs. NATURE. 449: 18, October 2007. 

 

Maller J, George S, Purcell S, Fagerness J, Altshuler D, Daly MJ, Seddon JM. 

Common variation in three genes, including a noncoding variant in CFH, 

strongly influences risk of age-related macular degeneration. Nat Genet. 38:9 

(1055-9), Sep 2006.
