QMSim - CGIL - University of Guelph

QMSim: A flexible large scale genome simulator for livestock 

Mehdi Sargolzaei and Flavio S. Schenkel 

Centre for Genetic Improvement of Livestock, Animal and Poultry Science Department, 

University of Guelph, Guelph, ON, Canada 

INTRODUCTION 

Linkage disequilibrium (LD) and linkage analyses have been used extensively to identify quantitative trait loci (QTL) in humans and livestock. Owing to recent developments in genotyping technologies, dense marker maps are now a reality for humans and some livestock species. Currently, however, marker density in livestock is low compared to that in humans. Even though genotyping costs have declined substantially, large-scale genome-wide association studies are still very costly. For this reason, most studies in livestock suffer from small sample sizes or low marker density. Simulation of molecular markers and QTL, however, has allowed statisticians to answer a wide variety of questions in genomics.

There are several main differences between human and livestock genome-wide association studies. For instance, in humans the main focus is on identifying susceptibility genes for complex diseases, whereas in livestock the interest lies in finding QTL for production traits under a polygenic model. More importantly, human populations have been experiencing an expansion in effective population size (Ne), while Ne in livestock populations has decreased due to intense selection. Consequently, LD in livestock extends over longer distances than in humans. Moreover, combined LD and linkage QTL mapping has attracted more attention in livestock because of the strong family structure. Population structure is therefore crucial for identifying and correctly interpreting associations between functional and molecular diversity (Pritchard and Rosenberg, 1999; Buckler and Thornsberry, 2002).

Several software packages have been developed for genome simulation as a cost-effective means of testing new algorithms and methods, especially in humans (e.g., Hudson, 2002; Schaffner et al., 2005; Li and Li, 2008). While a few software packages for simulating livestock genomes are available, they do not provide the functionality required for many studies. The objective of this paper is to introduce a software package called QMSim, which has been designed to simulate general genomes with arbitrary marker and QTL maps on simulated pedigrees under a polygenic genetic model mimicking livestock populations. QMSim is a family-based simulator that can generate complex pedigrees with predefined evolutionary features such as LD, bottlenecks and expansions.

METHODS AND DESCRIPTION OF THE ALGORITHMS

Population structure: A QMSim simulation is performed in two steps: a coalescent step and a forward-time step. In the first step, a historical population is simulated to create initial LD, assuming equal numbers of both sexes, discrete generations, random mating, no selection and no migration. Expansion and contraction of the historical population size are allowed. In the second, forward-time step, recent population structures are simulated. Animals from the last historical generation can be chosen as founders of the recent populations. In the case of multiple populations, however, founders can also come from other recent populations. In the forward-time step, selection based on different criteria may be performed for a single trait with predefined heritability and phenotypic variance. QMSim is flexible enough to simulate a wide range of population structures. For example, some QTL mapping designs in livestock involve line crosses produced from inbred lines with divergent phenotypes. In this case, the associated mutations are expected to have high frequencies in opposite directions. Owing to its object-oriented design, it is easy to simulate multiple populations with different structures and selection criteria and to cross them to mimic livestock populations.

Dairy Cattle Breeding and Genetics Committee Meeting, September 18, 2008.
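The historical step described above can be illustrated with a toy pedigree generator. This is only a sketch under the stated assumptions (equal numbers of both sexes, discrete generations, random mating, no selection, no migration). The function names, the list of generation sizes and the use of 0 for unknown parents are hypothetical choices made for illustration and are not taken from QMSim.

```python
import random

def next_generation(males, females, size, first_id, rng):
    """Create `size` offspring by random mating; sexes alternate so both
    sexes end up with equal numbers when `size` is even.

    Returns (offspring_males, offspring_females, pedigree_rows), where each
    pedigree row is (id, sire, dam).
    """
    new_males, new_females, rows = [], [], []
    for k in range(size):
        animal = first_id + k
        sire = rng.choice(males)      # random mating: every sire equally likely
        dam = rng.choice(females)     # no selection: every dam equally likely
        rows.append((animal, sire, dam))
        (new_males if k % 2 == 0 else new_females).append(animal)
    return new_males, new_females, rows

def simulate_history(sizes, rng):
    """Simulate discrete historical generations with the given (even) sizes."""
    base = list(range(1, sizes[0] + 1))
    males, females = base[0::2], base[1::2]
    pedigree = [(a, 0, 0) for a in base]   # 0 marks an unknown parent
    next_id = sizes[0] + 1
    for size in sizes[1:]:
        males, females, rows = next_generation(males, females, size, next_id, rng)
        pedigree.extend(rows)
        next_id += size
    return males, females, pedigree

rng = random.Random(2008)
# Hypothetical sizes: a contraction (bottleneck) followed by an expansion.
males, females, pedigree = simulate_history([100, 40, 40, 200], rng)
print(len(males), len(females), len(pedigree))  # 100 100 380
```

The animals of the last generation returned here play the role of candidate founders for the forward-time step; genotypes, selection and culling are deliberately left out of this sketch.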

Genome: A wide range of options can be specified for simulating the genome, such as mutation rate, crossover interference, number of chromosomes, number of markers and QTL, location of markers and QTL, number of alleles, allelic frequencies and missing marker rate. This permits more realistic and flexible simulation of a genome.

Crossover is the key factor that gradually breaks down allelic associations, or LD. According to the Haldane mapping function, crossovers occur at random and independently of each other, and the number of crossovers follows a Poisson distribution. Therefore, the number of crossover events for each chromosome is sampled from a Poisson distribution with mean equal to the length of the chromosome in Morgans. The locations of the crossovers along the chromosome are then assigned at random. In addition, a simple algorithm is applied to account for crossover interference.
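The crossover sampling described above can be sketched as follows. This is a minimal illustration, not QMSim's code: the function names are hypothetical, the Poisson draw uses Knuth's multiplication method (chosen only because the Python standard library has no Poisson sampler), and crossover interference is not modeled because the text does not specify the algorithm used for it.

```python
import math
import random

def sample_poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method.

    Adequate for small means such as chromosome lengths in Morgans.
    """
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_crossovers(length_morgans, rng):
    """Return sorted crossover positions (in Morgans) for one chromosome.

    The number of crossovers is Poisson with mean equal to the chromosome
    length in Morgans; positions are uniform along the chromosome.
    """
    n_crossovers = sample_poisson(length_morgans, rng)
    return sorted(rng.uniform(0.0, length_morgans) for _ in range(n_crossovers))

rng = random.Random(42)
positions = sample_crossovers(1.0, rng)   # a chromosome of 1 Morgan (100 cM)
print(positions)
```

Averaged over many meioses, a chromosome of 1 Morgan receives about one crossover under this scheme, which is what the Haldane assumption implies.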

To establish mutation-drift equilibrium in the historical generations, either an infinite mutation model or a recurrent mutation model is used. The infinite mutation model assumes that a mutation creates a new allele, whereas the recurrent mutation model assumes that a mutation changes one allelic state to another and does not necessarily create a new allele. In the recurrent model, the probabilities of transition from one allelic state to any other are equal.
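The difference between the two mutation models can be shown with a short sketch. The function names and the integer coding of allelic states are assumptions made for illustration; they are not QMSim identifiers.

```python
import random

def mutate_infinite(next_allele_id):
    """Infinite model: each mutation produces an allele not seen before.

    Returns the new allele and the updated counter of used allele ids.
    """
    return next_allele_id, next_allele_id + 1

def mutate_recurrent(allele, n_states, rng):
    """Recurrent model: move to one of the other existing states.

    Every other state is chosen with equal probability, so no new allele
    is created. Requires n_states >= 2.
    """
    other_states = [state for state in range(n_states) if state != allele]
    return rng.choice(other_states)

rng = random.Random(1)
print(mutate_recurrent(0, 2, rng))   # always 1 for a biallelic locus coded 0/1
new_allele, counter = mutate_infinite(5)
print(new_allele, counter)           # 5 6
```

For a biallelic marker such as a SNP, the recurrent model simply flips the allele, while the infinite model would keep adding alleles to the locus.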

QMSim MAIN FEATURES

Simulates historical generations to create linkage disequilibrium.

Establishes mutation-drift equilibrium.

Recombination is appropriately modeled. Interference is allowed.

Multiple chromosomes, each with different or similar density of markers and QTL maps, can be generated.

Very dense marker maps can be simulated.

Markers can be either SNP or microsatellites.

Additive QTL effects can be simulated with different distributions, such as gamma, normal or uniform.

In addition to QTL effects, a polygenic effect can be included.

The 16-bit version allows for assigning unique alleles to each founder.

Calculates LD in specified generations.

Population expansion or bottleneck is allowed for both historical and recent populations.

Selection and culling of the breeding population can be carried out based on different criteria, such as EBV or genomic EBV.

More than one litter size with predefined probabilities can be considered.

Multiple populations or lines can be simulated. Crossing between populations or lines is allowed.

Multiple populations can be analyzed jointly for estimating breeding values and computing inbreeding.

Creates detailed output files. Outputs can be customized to avoid saving unwanted data.

Equipped with fast and high-quality pseudo-random number generators.

Allows a flexible input parameter file.

Computationally efficient in terms of both time and memory.

COMPUTATIONAL EFFICIENCY

The memory efficiency of QMSim is achieved through the memory optimization methods implemented in it. We have successfully simulated a 500K SNP panel on a large population (100 discrete generations of 1,000 individuals each). In this case, the maximum RAM (Random Access Memory) requirement was around 3 GB. The computing time for simulating 500K SNP on 1,000 individuals for 22 discrete generations (22,000 individuals in total) on an AMD Opteron server running at 2.6 GHz with 16 GB RAM was 23 minutes. For 50K and 10K SNP panels, the corresponding times were 138 and 14 seconds, respectively. One strength of QMSim is that it provides the user with various detailed output files while providing options for managing them. When simulating large marker panels or large populations with many replicates, large output files might become an issue. In this situation, one may alter the output options to avoid saving unwanted outputs.
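The reported timings can be put on a common scale with a few lines of arithmetic. The sketch below only restates the figures given in the text (23 minutes, 138 seconds and 14 seconds for 22,000 individuals); the function name is hypothetical.

```python
def per_marker_ms(n_snp, seconds):
    """Wall-clock time per simulated marker, in milliseconds."""
    return 1000.0 * seconds / n_snp

# Timings reported in the text for 22 generations of 1,000 individuals.
timings_seconds = {500_000: 23 * 60, 50_000: 138, 10_000: 14}

for n_snp, seconds in timings_seconds.items():
    print(f"{n_snp:>7} SNP: {seconds:>5} s, {per_marker_ms(n_snp, seconds):.2f} ms per marker")
```

On these figures the 500K and 50K panels both cost about 2.76 ms per marker, which suggests roughly linear scaling in the number of markers over that range, while the 10K panel costs about 1.40 ms per marker.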

COMPUTING ENVIRONMENT 

The code is written in C++ using object-oriented techniques, and the application runs on Windows and Linux platforms.

In conclusion, QMSim is a user-friendly tool for simulating large-scale genomic data in livestock, which helps to validate new methods for fine mapping and genomic selection. It integrates efficient and fast algorithms. Future developments of QMSim will improve the selection schemes and incorporate non-additive effects.

ACKNOWLEDGEMENTS 

This research was financially supported by L’Alliance Boviteq Inc. (Semex, Canada) and the Ontario Centre for Agricultural Genomics (Challenge Fund).

REFERENCES 

Buckler,E.S. and Thornsberry,J. (2002) Plant molecular diversity and applications to genomics. Curr. Opin. Plant Biol., 5, 107-111.

Falconer,D.S. and Mackay,T.F.C. (1996) Introduction to Quantitative Genetics, edn. 4. Longmans Green, Harlow, Essex, UK.

Hudson,R.R. (2002) Generating samples under a Wright-Fisher neutral model. Bioinformatics, 18, 337-338.

Li,C. and Li,M. (2008) GWAsimulator: A rapid whole genome simulation program. Bioinformatics, 24, 140-142.

Matsumoto,M. and Nishimura,T. (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul., 8, 3-30.

Pritchard,J.K. and Rosenberg,N.A. (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet., 65, 220-228.

Schaffner,S.F. et al. (2005) Calibrating a coalescent simulation of human genome sequence variation. Genome Res., 15, 1576-1583.


INPUT PARAMETER FILE 

The program requires a parameter file in which the various parameters for the simulation are specified. The parameter file consists of five main sections: the first describes global parameters, the second describes historical generations, the third describes parameters for subpopulations and generations, the fourth contains genome parameters and the fifth relates to the output options. The order of commands within each section is normally not important. An example parameter file is given below.

/*******************************
 **     Global parameters     **
 *******************************/
title = "Simulating two divergent lines - 50k SNP panel";
nr = 1;                  //Number of replicates
h2 = 0.2;                //Overall heritability
qtlh2 = 0.2;             //QTL heritability
phvar = 1.0;             //Phenotypic variance

/*******************************
 **   Historical generations  **
 *******************************/
begin_hist;              //Historical generations
   nhg = 200;            //Number of historical generations
   sfhg = 420;           //Size of the first historical generation
   sihg = 420;           //Size of intermediate historical generation
   ihg = 200;            //Intermediate historical generation number
   slhg = 420;           //Size of the last historical generation
   nmlhg = 20;           //Number of male in the last historical generation
end_hist;

/*******************************
 **        Populations        **
 *******************************/
begin_pop = "Line 1";
   begin_founder;
      male [n = 20, pop = "hp"];
      female [n = 400, pop = "hp"];
   end_founder;
   ls = 1 2 [0.05];      //Litter size
   pmp = 0.5 /fix;       //Proportion of male progeny (random, fix or fixwf)
   ng = 10;              //Number of generations
   md = random;          //Mating design
   sr = 0.4;             //Replacement ratio for sires
   dr = 0.2;             //Replacement ratio for dams
   sd = phenotypic /h;   //Selection design
   cd = age;             //Culling design
   begin_popoutput;
      ld /bin 10 /maf 0.1 /gen 0;
      data;
      genotype /snp_code;
      allele_freq /gen 10;
   end_popoutput;
end_pop;

begin_pop = "Line 2";
   begin_founder;
      male [n = 20, pop = "hp"];
      female [n = 400, pop = "hp"];
   end_founder;
   ls = 1 2 [0.05];      //Litter size
   pmp = 0.5 /fix;       //Proportion of male progeny (random, fix or fixwf)
   ng = 10;              //Number of generations
   md = random;          //Mating design
   sr = 0.4;             //Replacement ratio for sires
   dr = 0.2;             //Replacement ratio for dams
   sd = phenotypic /l;   //Selection design
   cd = age;             //Culling design
   begin_popoutput;
      data;
      genotype /snp_code;
      allele_freq /gen 10;
   end_popoutput;
end_pop;

/*******************************
 **          Genome           **
 *******************************/
begin_genome;
   mmutr = 2.5e-5 /recurrent;   //Marker mutation rate
   qmutr = 2.5e-5;              //QTL mutation rate
   interference = 25;
   begin_chr = 30;
      chrlen = 100;      //Chromosome length
      nmloci = 1667;     //Number of markers
      mpos = r;          //Marker positions
      nma = snp;         //Number of marker alleles
      maf = e;           //Marker allele frequencies
      nqloci = 25;       //Number of QTL
      qpos = r;          //QTL positions
      nqa = r 2 3 4;     //Number of QTL alleles
      qaf = e;           //QTL allele frequencies
      qae = grand 0.4;   //QTL allele effects
   end_chr;
   rpos /mrk /qtl;
end_genome;

/*******************************
 **      Output options       **
 *******************************/
begin_output;
   linkage_map;
   allele_effect;
end_output;
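A few quantities implied by the example parameter file can be checked with simple arithmetic. The sketch below is an illustration under stated assumptions, not a description of how QMSim interprets the file: it assumes that `begin_chr = 30` defines 30 chromosomes with identical settings, that `chrlen` is given in centimorgans, and that the QTL heritability `qtlh2` is the part of the overall heritability `h2` explained by QTL. The function name and dictionary keys other than the parameter names are hypothetical.

```python
def derived_quantities(p):
    """Compute totals and variance components implied by the example values."""
    additive_var = p["h2"] * p["phvar"]          # total additive genetic variance
    qtl_var = p["qtlh2"] * p["phvar"]            # assumed: part explained by QTL
    return {
        "total_markers": p["n_chr"] * p["nmloci"],
        "total_qtl": p["n_chr"] * p["nqloci"],
        "genome_morgans": p["n_chr"] * p["chrlen"] / 100.0,  # assumes chrlen in cM
        "founders_per_line": p["n_male"] + p["n_female"],
        "additive_var": additive_var,
        "polygenic_var": additive_var - qtl_var,
        "residual_var": p["phvar"] - additive_var,
    }

# Values copied from the example parameter file.
example = {
    "h2": 0.2, "qtlh2": 0.2, "phvar": 1.0,
    "n_chr": 30, "chrlen": 100, "nmloci": 1667, "nqloci": 25,
    "n_male": 20, "n_female": 400,
}

for name, value in derived_quantities(example).items():
    print(f"{name}: {value}")
```

Under these assumptions the file describes 30 × 1667 = 50,010 markers, matching the "50k SNP panel" in the title, 750 QTL, and 420 founders per line, which equals the size of the last historical generation (`slhg = 420`, with `nmlhg = 20` males). Because `qtlh2` equals `h2`, no variance is left for a separate polygenic effect in this example.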
