QMSim - CGIL - University of Guelph
QMSim: A flexible large scale genome simulator for livestock
Mehdi Sargolzaei and Flavio S. Schenkel
Centre for Genetic Improvement of Livestock, Animal and Poultry Science Department,
University of Guelph, Guelph, ON, Canada
Dairy Cattle Breeding and Genetics Committee Meeting, September 18, 2008.

INTRODUCTION

Linkage disequilibrium (LD) and linkage analyses have been used extensively to identify quantitative trait loci (QTL) in humans and livestock. Owing to recent developments in genotyping technologies, dense marker maps are now a reality for humans and some livestock species. Currently, however, marker density in livestock is low compared to that in humans. Even though genotyping costs have substantially declined, large scale genome-wide association studies are still very costly. For this reason, most studies in livestock suffer from small sample size or from low marker density. However, simulation of molecular markers and QTL has allowed statisticians to answer a wide variety of questions in genomics.

There are several main differences between human and livestock genome-wide association studies. For instance, in humans the main focus is on identifying susceptibility genes for complex diseases, whereas in livestock, finding QTL for production traits under a polygenic model is of interest. More importantly, human populations have been experiencing an expansion in effective population size (Ne), while Ne in livestock populations has decreased due to intense selection. Consequently, LD in livestock extends over longer distances than in humans. Moreover, combined LD and linkage QTL mapping has attracted more attention in livestock due to strong family structure. Therefore, population structure is crucial to identify and correctly interpret the associations between functional and molecular diversity (Pritchard and Rosenberg, 1999; Buckler and Thornsberry, 2002).

Several software packages have been developed for genome simulation as a cost-effective means of testing new algorithms and methods, especially in humans (e.g., Hudson, 2002; Schaffner et al., 2005; Li and Li, 2008). While a few software packages for simulating livestock genomes are available, they do not provide the functionality required for many studies. The objective of this paper is to introduce software called QMSim, which has been designed to simulate general genomes with arbitrary marker and QTL maps on a simulated pedigree under a polygenic genetic model mimicking livestock populations. QMSim is a family-based simulator that can generate complex pedigrees with predefined evolutionary features such as LD, bottlenecks and expansions.

METHODS AND DESCRIPTION OF THE ALGORITHMS

Population structure: QMSim simulation is performed in two steps: a coalescent step and a forward-time step. In the first step, a historical population is simulated to create initial LD, considering equal numbers for both sexes, discrete generations, random mating, no selection and no migration. Expansion and contraction of the historical population size are allowed. In the second, forward-time step, recent population structures are simulated. Animals from the last historical generation can be chosen as founders of the recent populations. In the case of multiple populations, however, founders can also come from other recent populations. In the forward-time step, selection based on different criteria may be performed for a single trait with predefined heritability and phenotypic variance. QMSim is flexible enough to simulate a wide range of population structures. For example, in livestock, some QTL mapping designs involve line crosses produced from inbred lines with divergent phenotypes. In this case, the associated mutations are expected to have high frequencies in opposite directions. Owing to the object-oriented programming, it is easy to simulate multiple populations with different structures and selection criteria and cross them to mimic livestock populations.
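The assumptions listed for the historical step (equal numbers of both sexes, discrete generations, random mating, no selection, no migration) can be illustrated with a minimal single-locus sketch. This is not QMSim's code: the function names, the single biallelic locus and the constant population size are assumptions made purely for illustration.

```python
import random


def next_generation(parents, size, rng):
    """Produce `size` offspring by random mating of the given parents.

    Each individual is a tuple (sex, (allele_1, allele_2)) for one locus.
    """
    sires = [genotype for sex, genotype in parents if sex == "M"]
    dams = [genotype for sex, genotype in parents if sex == "F"]
    offspring = []
    for i in range(size):
        sire = rng.choice(sires)          # random mating: any sire ...
        dam = rng.choice(dams)            # ... with any dam, no selection
        genotype = (rng.choice(sire), rng.choice(dam))  # one allele from each parent
        sex = "M" if i % 2 == 0 else "F"  # equal numbers of both sexes
        offspring.append((sex, genotype))
    return offspring


def simulate_historical(n_generations, size, rng):
    """Return the last historical generation and the allele-1 frequency path."""
    population = [
        ("M" if i % 2 == 0 else "F", (rng.randrange(2), rng.randrange(2)))
        for i in range(size)
    ]
    freqs = []
    for _ in range(n_generations):
        # Discrete generations: offspring fully replace their parents.
        population = next_generation(population, size, rng)
        alleles = [a for _, genotype in population for a in genotype]
        freqs.append(sum(alleles) / len(alleles))
    return population, freqs


if __name__ == "__main__":
    last_generation, path = simulate_historical(50, 100, random.Random(1))
    print(len(last_generation), path[-1])
```

The individuals returned as the last generation play the role of the pool from which founders of the recent populations would be drawn.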
Genome: A wide range of options can be specified for simulating the genome, such as mutation rate, crossover interference, number of chromosomes, markers and QTL, location of markers and QTL, number of alleles, allelic frequencies and missing marker rate. This permits more realistic and flexible simulation of a genome.
Crossover is the key factor that gradually breaks down allelic associations, or LD. According to the Haldane mapping function, crossovers occur at random and independently of each other, and the number of crossovers follows a Poisson distribution. Therefore, the number of crossover events for each chromosome is sampled from a Poisson distribution with mean equal to the length of the chromosome in Morgans. The locations of the crossovers along the chromosome are then assigned at random. Moreover, a simple algorithm is applied to account for crossover interference.
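The crossover sampling described above can be sketched as follows. This is an illustrative sketch, not QMSim's implementation: the function names are hypothetical, the Poisson sampler is a textbook method chosen to keep the example dependency-free, and crossover interference is deliberately left out because the text does not describe the algorithm used for it.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw a Poisson variate by Knuth's multiplication method.

    Adequate for small means such as chromosome lengths in Morgans.
    """
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_crossovers(length_morgans, rng):
    """Number of crossovers ~ Poisson(length); positions uniform on the chromosome."""
    n = sample_poisson(length_morgans, rng)
    return sorted(rng.uniform(0.0, length_morgans) for _ in range(n))


def make_gamete(hap_a, hap_b, positions, length_morgans, rng):
    """Build one gamete from a parent's two haplotypes.

    `positions` holds the sorted locus positions in Morgans; the gamete
    switches parental strand at every sampled crossover.
    """
    crossovers = sample_crossovers(length_morgans, rng)
    strands = (hap_a, hap_b)
    current = rng.randrange(2)   # start on a random parental strand
    gamete, k = [], 0
    for locus, pos in enumerate(positions):
        while k < len(crossovers) and crossovers[k] < pos:
            current = 1 - current
            k += 1
        gamete.append(strands[current][locus])
    return gamete


if __name__ == "__main__":
    rng = random.Random(42)
    print(make_gamete([0] * 5, [1] * 5, [0.1, 0.3, 0.5, 0.7, 0.9], 1.0, rng))
```

With a 1 Morgan chromosome the expected number of crossovers per gamete is one, so loci at opposite ends of the chromosome are frequently separated while closely spaced loci tend to be inherited together.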
To establish mutation-drift equilibrium in the historical generations, either an infinite mutation model or a recurrent mutation model is used. The infinite mutation model assumes that every mutation creates a new allele, while the recurrent mutation model assumes that a mutation changes one allelic state into another and does not necessarily create a new allele. In the recurrent model, the probabilities of transition from one allelic state to any other are equal.
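The difference between the two mutation models can be made concrete with a small sketch. The function name, the integer coding of allelic states and the per-copy mutation test are assumptions for illustration only and do not reflect QMSim's internal representation.

```python
import random


def mutate(alleles, rate, model, rng, n_states=2):
    """Return a mutated copy of `alleles` (integer allelic states at one locus).

    model = "infinite":  each mutation creates an allele not yet present.
    model = "recurrent": each mutation moves to one of the other
                         n_states - 1 states, all equally likely.
    """
    mutated = list(alleles)
    next_new = max(mutated, default=-1) + 1   # first unused allele label
    for i, allele in enumerate(mutated):
        if rng.random() >= rate:
            continue                          # this allele copy does not mutate
        if model == "infinite":
            mutated[i] = next_new
            next_new += 1
        elif model == "recurrent":
            others = [s for s in range(n_states) if s != allele]
            mutated[i] = rng.choice(others)
        else:
            raise ValueError(f"unknown mutation model: {model}")
    return mutated


if __name__ == "__main__":
    rng = random.Random(0)
    print(mutate([0, 1, 0, 1], 0.5, "recurrent", rng))
    print(mutate([0, 1, 0, 1], 0.5, "infinite", rng))
```

Under the recurrent model a biallelic marker stays biallelic however many mutations occur, whereas under the infinite model the number of distinct alleles can only grow through mutation and is reduced only by drift.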
QMSim MAIN FEATURES

Simulates historical generations to create linkage disequilibrium.
Establishes mutation-drift equilibrium.
Recombination is appropriately modeled. Interference is allowed.
Multiple chromosomes, each with different or similar density of markers and QTL maps, can be generated.
Very dense marker maps can be simulated.
Markers can be either SNP or microsatellites.
Additive QTL effects can be simulated with different distributions, such as gamma, normal or uniform.
In addition to QTL effects, a polygenic effect can be included.
The 16-bit version allows for assigning unique alleles to each founder.
Calculates LD in specified generations.
Population expansion or bottleneck is allowed for both historical and recent populations.
Selection and culling of the breeding population can be carried out based on different criteria, such as EBV or genomic EBV.
More than one litter size with predefined probabilities can be considered.
Multiple populations or lines can be simulated. Crossing between populations or lines is allowed.
Multiple populations can be analyzed jointly for estimating breeding values and computing inbreeding.
Creates detailed output files. Outputs can be customized to avoid saving unwanted data.
Equipped with fast and high-quality pseudo-random number generators.
Allows a flexible input parameter file.
Computationally efficient in terms of both time and memory.

COMPUTATIONAL EFFICIENCY

The computational efficiency of QMSim in terms of memory requirement is achieved by the memory optimization methods implemented in it. We have successfully simulated a 500K SNP panel on a large population (100 discrete generations, each with 1,000 individuals). In this case, the maximum RAM (Random Access Memory) requirement was around 3 GB. Computing time for simulating a 500K SNP panel on 1,000 individuals for 22 discrete generations (22,000 individuals in total) on an AMD Opteron server running at 2.6 GHz with 16 GB RAM was 23 minutes. For 50K and 10K SNP panels the corresponding times were 138 and 14 seconds, respectively. One strength of QMSim is that it provides the user with various and detailed output files while providing options for managing them. When simulating large marker panels or large populations with many replicates, large output files might become an issue. In this situation, one may alter the output options to avoid saving unwanted outputs.
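To put the reported figures on a common scale, the short sketch below normalises the three timings by panel size and counts the allele copies held by one generation. The timings and population sizes are those quoted in the text; the helper names are hypothetical, and nothing here says how QMSim actually stores genotypes.

```python
# Reported run times in seconds for 22 generations of 1,000 individuals.
REPORTED_TIMES = {500_000: 23 * 60, 50_000: 138, 10_000: 14}


def per_snp_ms(n_snp, seconds):
    """Run time per simulated SNP, in milliseconds."""
    return 1000.0 * seconds / n_snp


def alleles_per_generation(n_snp, n_individuals, ploidy=2):
    """Number of allele copies carried by one generation (diploid by default)."""
    return n_snp * n_individuals * ploidy


if __name__ == "__main__":
    for n_snp, seconds in REPORTED_TIMES.items():
        print(f"{n_snp:>7} SNP: {per_snp_ms(n_snp, seconds):.2f} ms per SNP")
    print(alleles_per_generation(500_000, 1_000), "allele copies per generation")
```

The 500K and 50K runs both work out to about 2.76 ms per SNP, which suggests roughly linear scaling with panel size over that range, while the 10K run is cheaper per SNP at 1.4 ms. One generation of 1,000 individuals with 500K SNP carries 10^9 allele copies, so 100 generations amount to 10^11 copies; a peak of around 3 GB therefore implies that not every generation is held in memory at a byte or more per allele.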
COMPUTING ENVIRONMENT
The code is written in C++ using object-oriented techniques, and the application runs on Windows and Linux platforms.
In conclusion, QMSim is a user-friendly tool for simulating large scale genomic data in livestock, which helps validate new methods for fine mapping and genomic selection. It integrates efficient and fast algorithms. Further development of QMSim will improve the selection schemes and incorporate non-additive effects.

ACKNOWLEDGEMENTS
This research was financially supported by L’Alliance Boviteq Inc. (Semex, Canada) and the Ontario Centre for Agricultural Genomics (Challenge Fund).

REFERENCES
Buckler,E.S. and Thornsberry,J. (2002) Plant molecular diversity and applications to genomics. Curr. Opin. Plant Biol., 5, 107-111.
Falconer,D.S. and Mackay,T.F.C. (1996) Introduction to Quantitative Genetics, edn. 4. Longmans Green, Harlow, Essex, UK.
Hudson,R.R. (2002) Generating samples under a Wright-Fisher neutral model. Bioinformatics, 18, 337-338.
Li,C. and Li,M. (2008) GWAsimulator: A rapid whole genome simulation program. Bioinformatics, 24, 140-142.
Matsumoto,M. and Nishimura,T. (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul., 8, 3-30.
Pritchard,J.K. and Rosenberg,N.A. (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet., 65, 220-228.
Schaffner,S.F. et al. (2005) Calibrating a coalescent simulation of human genome sequence variation. Genome Res., 15, 1576-1583.

INPUT PARAMETER FILE
The program requires a parameter file in which the parameters for the simulation are specified. The parameter file consists of five main sections. The first section describes global parameters, the second describes the historical generations, the third describes parameters for subpopulations and generations, the fourth contains genome parameters and the fifth contains the output options. The order of commands within each section is normally not important. An example of a parameter file is given below.

/*******************************
** Global parameters **
*******************************/
title = "Simulating two divergent lines - 50k SNP panel";
nr = 1; //Number of replicates
h2 = 0.2; //Overall heritability
qtlh2 = 0.2; //QTL heritability
phvar = 1.0; //Phenotypic variance
/*******************************
** Historical generations **
*******************************/
begin_hist; //Historical generations
nhg = 200; //Number of historical generations
sfhg = 420; //Size of the first historical generation
sihg = 420; //Size of intermediate historical generation
ihg = 200; //Intermediate historical generation number
slhg = 420; //Size of the last historical generation
nmlhg = 20; //Number of male in the last historical generation
end_hist;
/*******************************
** Populations **
*******************************/
begin_pop = "Line 1";
begin_founder;
male [n = 20, pop = "hp"];
female [n = 400, pop = "hp"];
end_founder;
ls = 1 2 [0.05]; //Litter size
pmp = 0.5 /fix; //Proportion of male progeny (random, fix or fixwf)
ng = 10; //Number of generations
md = random; //Mating design
sr = 0.4; //Replacement ratio for sires
dr = 0.2; //Replacement ratio for dams
sd = phenotypic /h; //Selection design
cd = age; //Culling design
begin_popoutput;
ld /bin 10 /maf 0.1 /gen 0;
data;
genotype /snp_code;
allele_freq /gen 10;
end_popoutput;
end_pop;
begin_pop = "Line 2";
begin_founder;
male [n = 20, pop = "hp"];
female [n = 400, pop = "hp"];
end_founder;
ls = 1 2 [0.05]; //Litter size
pmp = 0.5 /fix; //Proportion of male progeny (random, fix or fixwf)
ng = 10; //Number of generations
md = random; //Mating design
sr = 0.4; //Replacement ratio for sires
dr = 0.2; //Replacement ratio for dams
sd = phenotypic /l; //Selection design
cd = age; //Culling design
begin_popoutput;
data;
genotype /snp_code;
allele_freq /gen 10;
end_popoutput;
end_pop;
/*******************************
** Genome **
*******************************/
begin_genome;
mmutr = 2.5e-5 /recurrent; //Marker mutation rate
qmutr = 2.5e-5; //QTL mutation rate
interference = 25;
begin_chr = 30;
chrlen = 100; //Chromosome length
nmloci = 1667; //Number of markers
mpos = r; //Marker positions
nma = snp; //Number of marker alleles
maf = e; //Marker allele frequencies
nqloci = 25; //Number of QTL
qpos = r; //QTL positions
nqa = r 2 3 4; //Number of QTL alleles
qaf = e; //QTL allele frequencies
qae = grand 0.4; //QTL allele effects
end_chr;
rpos /mrk /qtl;
end_genome;
/*******************************
** Output options **
*******************************/
begin_output;
linkage_map;
allele_effect;
end_output;
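
As a sanity check on the example, the following sketch derives a few summary quantities from the values in the parameter file. Several readings are assumptions rather than facts stated in the text: that `begin_chr = 30` defines 30 identical chromosomes, that `chrlen` is given in centimorgans, and that the polygenic variance is the part of the additive variance not explained by QTL, i.e. (h2 - qtlh2) x phvar. The function name is hypothetical.

```python
def summarize(n_chr, chrlen_cm, nmloci, nqloci, h2, qtlh2, phvar):
    """Derive genome totals and variance components from the example values."""
    return {
        "markers": n_chr * nmloci,                  # total marker loci
        "qtl": n_chr * nqloci,                      # total QTL
        "genome_morgans": n_chr * chrlen_cm / 100,  # assumes chrlen is in cM
        "qtl_var": qtlh2 * phvar,                   # variance explained by QTL
        "polygenic_var": (h2 - qtlh2) * phvar,      # assumed remainder of additive variance
        "residual_var": (1.0 - h2) * phvar,         # non-genetic variance
    }


if __name__ == "__main__":
    summary = summarize(n_chr=30, chrlen_cm=100, nmloci=1667, nqloci=25,
                        h2=0.2, qtlh2=0.2, phvar=1.0)
    for name, value in summary.items():
        print(f"{name}: {value}")
```

Under these assumptions the example defines 50,010 markers, which matches the "50k SNP panel" in the title, 750 QTL and a 30 Morgan genome. Because h2 equals qtlh2, the QTL account for all of the additive variance and no polygenic variance remains. The 20 male and 400 female founders requested for each line also add up to the 420 animals of the last historical generation (slhg), 20 of which are male (nmlhg).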