
QMSim: A flexible large scale genome simulator for livestock

Mehdi Sargolzaei and Flavio S. Schenkel

Centre for Genetic Improvement of Livestock, Animal and Poultry Science Department, University of Guelph, Guelph, ON, Canada

Dairy Cattle Breeding and Genetics Committee Meeting, September 18, 2008.

INTRODUCTION

Linkage disequilibrium (LD) and linkage analyses have been used extensively to identify quantitative trait loci (QTL) in humans and livestock. Owing to recent developments in genotyping technologies, dense marker maps are now a reality for humans and some livestock species. Currently, however, the density of markers in livestock is low compared to that in humans. Even though genotyping costs have declined substantially, large scale genome-wide association studies are still very costly. For this reason, most studies in livestock suffer from small sample size or from low marker density. Simulation of molecular markers and QTL, however, has allowed statisticians to answer a wide variety of questions in genomics.

There are several main differences between human and livestock genome-wide association studies. For instance, in humans most of the focus is on identifying susceptibility genes for complex diseases, whereas in livestock, finding QTL for production traits under a polygenic model is of interest. More importantly, human populations have been experiencing an expansion in effective population size (Ne), while Ne in livestock populations has decreased due to intense selection. Consequently, LD in livestock extends over longer distances than in humans. Moreover, combined LD and linkage QTL mapping has attracted more attention in livestock due to strong family structure. Therefore, population structure is crucial to identify and correctly interpret the associations between functional and molecular diversity (Pritchard and Rosenberg 1999; Buckler and Thornsberry 2002).

Several software packages have been developed for genome simulation as a cost-effective means of testing new algorithms and methods, especially in humans (e.g., Hudson, 2002; Schaffner et al., 2005; Li and Li, 2007). While a few software packages for simulating livestock genomes are available, they do not provide the functionality required for many studies. The objective of this paper is to introduce a software package called QMSim, which has been designed to simulate general genomes with arbitrary marker and QTL maps on simulated pedigrees under a polygenic genetic model mimicking livestock populations. QMSim is a family-based simulator that can generate complex pedigrees with predefined evolutionary features such as LD, bottlenecks and expansions.

METHODS AND DESCRIPTION OF THE ALGORITHMS

Population structure: QMSim simulation is performed in two steps: a coalescent step and a forward-time step. In the first step, a historical population is simulated to create initial LD, assuming equal numbers of both sexes, discrete generations, random mating, no selection and no migration. Expansion and contraction of the historical population size are allowed. In the second, forward-time step, recent population structures are simulated. Animals from the last historical generation can be chosen as founders of the recent populations. In the case of multiple populations, however, founders can also come from other recent populations. In the forward-time step, selection based on different criteria may be performed for a single trait with predefined heritability and phenotypic variance. QMSim is flexible enough to simulate a wide range of population structures. For example, in livestock, some QTL mapping designs involve line crosses produced from inbred lines with divergent phenotypes. In this case, the associated mutations are expected to have high frequencies in opposite directions. Owing to the object-oriented programming, it is easy to simulate multiple populations with different structures and selection criteria and cross them to mimic livestock populations.
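As a rough illustration of selection in the forward-time step, the sketch below draws phenotypes for a trait with a predefined heritability and phenotypic variance and then picks parents for a high and a low line by phenotype. It is not QMSim code: the function names, the population size of 2,000 and the selected proportion of 0.4 are assumptions made for this example, and animals are treated as unrelated.

```python
import random
import statistics

def simulate_phenotypes(n, h2, phvar, rng):
    """Return (tbv, phenotype) lists for n unrelated animals.

    True breeding values get variance h2 * phvar and residuals get
    (1 - h2) * phvar, so phenotypes have expected variance phvar.
    """
    sd_a = (h2 * phvar) ** 0.5
    sd_e = ((1.0 - h2) * phvar) ** 0.5
    tbv = [rng.gauss(0.0, sd_a) for _ in range(n)]
    phen = [a + rng.gauss(0.0, sd_e) for a in tbv]
    return tbv, phen

def select(phen, proportion, direction="h"):
    """Indices of selected animals: 'h' keeps the highest phenotypes, 'l' the lowest."""
    n_keep = max(1, round(proportion * len(phen)))
    order = sorted(range(len(phen)), key=phen.__getitem__,
                   reverse=(direction == "h"))
    return order[:n_keep]

rng = random.Random(2008)  # fixed seed so the run is repeatable
tbv, phen = simulate_phenotypes(n=2000, h2=0.2, phvar=1.0, rng=rng)

high = select(phen, 0.4, "h")  # parents of a hypothetical high line
low = select(phen, 0.4, "l")   # parents of a hypothetical low line

mean_tbv_high = statistics.fmean(tbv[i] for i in high)
mean_tbv_low = statistics.fmean(tbv[i] for i in low)
print(f"mean TBV, high line parents: {mean_tbv_high:.3f}")
print(f"mean TBV, low line parents:  {mean_tbv_low:.3f}")
```

Selecting in opposite directions on the same trait is what drives the two lines apart over generations, which is the situation the divergent-line example describes.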

Genome: A wide range of options can be specified for simulating the genome, such as mutation rate, crossover interference, number of chromosomes, markers and QTL, location of markers and QTL, number of alleles, allelic frequencies and missing marker rate. This permits more realistic and flexible simulation of a genome.

Crossover is the key factor that gradually breaks down allelic associations, or LD. According to the Haldane mapping function, crossovers occur at random and independently of each other, and the number of crossovers follows a Poisson distribution. Therefore, the number of crossover events for each chromosome is sampled from a Poisson distribution with mean equal to the length of the chromosome in Morgans. The locations of crossovers along the chromosome are then assigned at random. In addition, a simple algorithm is applied to account for crossover interference.
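The crossover sampling described above can be sketched as follows. This is a minimal illustration under the Haldane assumptions only: the interference adjustment is not modeled because the text does not specify it, and the function names and the ten-locus example are assumptions made for this sketch rather than QMSim internals.

```python
import math
import random

def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def crossover_positions(length_morgans, rng):
    """Sample crossover sites: Poisson count, then uniform locations (in Morgans)."""
    n = poisson(length_morgans, rng)
    return sorted(rng.uniform(0.0, length_morgans) for _ in range(n))

def make_gamete(hap_a, hap_b, locus_pos, length_morgans, rng):
    """Build one gamete from two parental haplotypes, switching strand at each crossover.

    locus_pos holds the sorted map positions (in Morgans) of the loci.
    """
    sites = crossover_positions(length_morgans, rng)
    strands = (hap_a, hap_b)
    current = rng.randrange(2)  # starting strand chosen at random
    gamete, next_site = [], 0
    for i, pos in enumerate(locus_pos):
        # Switch strand once for every crossover located before this locus.
        while next_site < len(sites) and sites[next_site] < pos:
            current = 1 - current
            next_site += 1
        gamete.append(strands[current][i])
    return gamete

rng = random.Random(1)
positions = [0.1 * k for k in range(10)]  # 10 loci on a 1 Morgan chromosome
gamete = make_gamete([0] * 10, [1] * 10, positions, 1.0, rng)
print(gamete)
```

With one parental haplotype coded all 0 and the other all 1, each change of value along the printed gamete marks a crossover between two adjacent loci.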

To establish mutation-drift equilibrium in the historical generations, either the infinite mutation model or the recurrent mutation model is used. The infinite mutation model assumes that a mutation creates a new allele, while the recurrent mutation model assumes that a mutation changes one allelic state to another and does not necessarily create a new allele. In the recurrent model, the probabilities of transition from one allelic state to another are equal.
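The two mutation models can be contrasted with a short sketch. It assumes alleles are coded as integers; the names `mutate_recurrent`, `InfiniteAlleles` and `mutate_gamete` are hypothetical and chosen for this example, and a mutation rate of 1.0 is used only to make the effect visible (the example parameter file uses 2.5e-5).

```python
import random

def mutate_recurrent(allele, n_states, rng):
    """Recurrent model: move to one of the other existing states, each equally likely."""
    other = rng.randrange(n_states - 1)
    return other if other < allele else other + 1

class InfiniteAlleles:
    """Infinite model: every mutation yields an allele never seen before."""

    def __init__(self, n_initial):
        self.next_allele = n_initial  # first unused integer code

    def mutate(self, allele):
        new = self.next_allele
        self.next_allele += 1
        return new

def mutate_gamete(gamete, rate, mutate, rng):
    """Apply `mutate` independently to each locus with probability `rate`."""
    return [mutate(a) if rng.random() < rate else a for a in gamete]

rng = random.Random(7)
snp = [0, 1, 1, 0, 1]

# With two states and rate 1.0, the recurrent model flips every allele.
flipped = mutate_gamete(snp, 1.0, lambda a: mutate_recurrent(a, 2, rng), rng)
print(flipped)  # [1, 0, 0, 1, 0]

# The infinite model hands out a fresh code for every mutation.
infinite = InfiniteAlleles(n_initial=2)
print(mutate_gamete(snp, 1.0, infinite.mutate, rng))  # [2, 3, 4, 5, 6]
```

For biallelic markers the recurrent model reduces to flipping between the two states, whereas the infinite model keeps increasing the number of distinct alleles.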

QMSim MAIN FEATURES

- Simulates historical generations to create linkage disequilibrium.
- Establishes mutation-drift equilibrium.
- Recombination is appropriately modeled. Interference is allowed.
- Multiple chromosomes, each with different or similar density of markers and QTL maps, can be generated.
- Very dense marker maps can be simulated.
- Markers can be either SNP or microsatellites.
- Additive QTL effects can be simulated with different distributions, such as gamma, normal or uniform.
- In addition to QTL effects, a polygenic effect can be included.
- The 16-bit version allows assigning unique alleles to each founder.
- Calculates LD in specified generations.
- Population expansion or bottleneck is allowed for both historical and recent populations.
- Selection and culling of the breeding population can be carried out based on different criteria, such as EBV or genomic EBV.
- More than one litter size with predefined probabilities can be considered.
- Multiple populations or lines can be simulated. Crossing between populations or lines is allowed.
- Multiple populations can be analyzed jointly for estimating breeding values and computing inbreeding.
- Creates detailed output files. Outputs can be customized to avoid saving unwanted data.
- Equipped with fast and high-quality pseudo-random number generators.
- Allows a flexible input parameter file.
- Computationally efficient in terms of both time and memory.

COMPUTATIONAL EFFICIENCY

The computational efficiency of QMSim in terms of memory requirement is achieved by the memory optimization methods implemented in it. We have successfully simulated a 500K SNP panel on a large population (100 discrete generations, each with 1,000 individuals). In this case, the maximum RAM (Random Access Memory) requirement was around 3 GB. The computing time for simulating a 500K panel on 1,000 individuals for 22 discrete generations (22,000 individuals in total) on an AMD Opteron server running at 2.6 GHz with 16 GB RAM was 23 minutes. For 50K and 10K SNP panels the corresponding times were 138 and 14 seconds, respectively. One strength of QMSim is that it provides the user with varied and detailed output files while providing options for managing them. When simulating large marker panels or large populations with many replicates, large output files might become an issue. In this situation, one may alter the output options to avoid saving unwanted outputs.

COMPUTING ENVIRONMENT

The code is written in C++ using object-oriented techniques, and the application runs on Windows and Linux platforms.

In conclusion, QMSim is a user-friendly tool for simulating large scale genomic data in livestock, which helps validate new methods for fine mapping and genomic selection. It integrates efficient and fast algorithms. Future developments of QMSim will improve the selection schemes and incorporate non-additive effects.

ACKNOWLEDGEMENTS

This research was financially supported by L’Alliance Boviteq Inc. (Semex, Canada) and the Ontario Centre for Agricultural Genomics (Challenge Fund).

REFERENCES

Buckler,E.S. and Thornsberry,J. (2002) Plant molecular diversity and applications to genomics. Curr. Opin. Plant Biol., 5, 107-111.

Falconer,D.S. and Mackay,T.F.C. (1996) Introduction to Quantitative Genetics, edn. 4. Longmans Green, Harlow, Essex, UK.

Hudson,R.R. (2002) Generating samples under a Wright-Fisher neutral model. Bioinformatics, 18, 337-338.

Li,C. and Li,M. (2008) GWAsimulator: A rapid whole genome simulation program. Bioinformatics, 24, 140-142.

Matsumoto,M. and Nishimura,T. (1998) Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul., 8, 3-30.

Pritchard,J.K. and Rosenberg,N.A. (1999) Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet., 65, 220-228.

Schaffner,S.F. et al. (2005) Calibrating a coalescent simulation of human genome sequence variation. Genome Res., 15, 1576-1583.



INPUT PARAMETER FILE

The program requires a parameter file in which the various parameters for the simulation are specified. The parameter file consists of five main sections. The first describes global parameters, the second describes historical generations, the third describes parameters for subpopulations and generations, the fourth contains genome parameters and the fifth relates to the output options. The order of commands within each section is normally not important. An example parameter file is given below.

/*******************************
**     Global parameters     **
*******************************/
title = "Simulating two divergent lines - 50k SNP panel";
nr = 1;                      //Number of replicates
h2 = 0.2;                    //Overall heritability
qtlh2 = 0.2;                 //QTL heritability
phvar = 1.0;                 //Phenotypic variance

/*******************************
**   Historical generations  **
*******************************/
begin_hist;                  //Historical generations
   nhg = 200;                //Number of historical generations
   sfhg = 420;               //Size of the first historical generation
   sihg = 420;               //Size of intermediate historical generation
   ihg = 200;                //Intermediate historical generation number
   slhg = 420;               //Size of the last historical generation
   nmlhg = 20;               //Number of males in the last historical generation
end_hist;

/*******************************
**        Populations        **
*******************************/
begin_pop = "Line 1";
   begin_founder;
      male [n = 20, pop = "hp"];
      female [n = 400, pop = "hp"];
   end_founder;
   ls = 1 2 [0.05];          //Litter size
   pmp = 0.5 /fix;           //Proportion of male progeny (random, fix or fixwf)
   ng = 10;                  //Number of generations
   md = random;              //Mating design
   sr = 0.4;                 //Replacement ratio for sires
   dr = 0.2;                 //Replacement ratio for dams
   sd = phenotypic /h;       //Selection design
   cd = age;                 //Culling design
   begin_popoutput;
      ld /bin 10 /maf 0.1 /gen 0;
      data;
      genotype /snp_code;
      allele_freq /gen 10;
   end_popoutput;
end_pop;

begin_pop = "Line 2";
   begin_founder;
      male [n = 20, pop = "hp"];
      female [n = 400, pop = "hp"];
   end_founder;
   ls = 1 2 [0.05];          //Litter size
   pmp = 0.5 /fix;           //Proportion of male progeny (random, fix or fixwf)
   ng = 10;                  //Number of generations
   md = random;              //Mating design
   sr = 0.4;                 //Replacement ratio for sires
   dr = 0.2;                 //Replacement ratio for dams
   sd = phenotypic /l;       //Selection design
   cd = age;                 //Culling design
   begin_popoutput;
      data;
      genotype /snp_code;
      allele_freq /gen 10;
   end_popoutput;
end_pop;

/*******************************
**          Genome           **
*******************************/
begin_genome;
   mmutr = 2.5e-5 /recurrent; //Marker mutation rate
   qmutr = 2.5e-5;           //QTL mutation rate
   interference = 25;
   begin_chr = 30;
      chrlen = 100;          //Chromosome length
      nmloci = 1667;         //Number of markers
      mpos = r;              //Marker positions
      nma = snp;             //Number of marker alleles
      maf = e;               //Marker allele frequencies
      nqloci = 25;           //Number of QTL
      qpos = r;              //QTL positions
      nqa = r 2 3 4;         //Number of QTL alleles
      qaf = e;               //QTL allele frequencies
      qae = grand 0.4;       //QTL allele effects
   end_chr;
   rpos /mrk /qtl;
end_genome;

/*******************************
**       Output options      **
*******************************/
begin_output;
   linkage_map;
   allele_effect;
end_output;
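A few quantities implied by the example parameter file can be checked with simple arithmetic. The sketch below copies values from the file; the variable names are chosen for this example. Two points are assumptions rather than statements from the text: that chrlen is given in centimorgans, and that any part of the overall heritability not explained by QTL is attributed to the polygenic effect.

```python
# Values copied from the example parameter file; names mirror its keys.
h2, qtlh2, phvar = 0.2, 0.2, 1.0
n_chr, chrlen_cm, nmloci, nqloci = 30, 100, 1667, 25
n_male_founders, n_female_founders, slhg = 20, 400, 420

total_markers = n_chr * nmloci        # markers over the whole genome
total_qtl = n_chr * nqloci            # QTL over the whole genome
genome_cm = n_chr * chrlen_cm         # genome length, assuming chrlen is in cM
mean_spacing_cm = chrlen_cm / nmloci  # average distance between adjacent markers

# Assumed variance decomposition: QTL + polygenic = additive, rest is residual.
var_additive = h2 * phvar
var_qtl = qtlh2 * phvar
var_polygenic = var_additive - var_qtl
var_residual = phvar - var_additive

# Founders of each line add up to the last historical generation.
founders = n_male_founders + n_female_founders

print(f"markers: {total_markers}, QTL: {total_qtl}, genome: {genome_cm} cM")
print(f"mean marker spacing: {mean_spacing_cm:.3f} cM")
print(f"variances (QTL, polygenic, residual): "
      f"{var_qtl:.2f}, {var_polygenic:.2f}, {var_residual:.2f}")
print(f"founders per line: {founders} (slhg = {slhg})")
```

The marker total of 50,010 matches the "50k SNP panel" in the title line, and with qtlh2 equal to h2 the assumed polygenic variance is zero, so all additive variance in this example comes from the QTL.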
