22.07.2013 Views

gebv: Genomic breeding value estimator for livestock - CGIL ...


gebv: Genomic breeding value estimator for livestock

Mehdi Sargolzaei 1,2,*, Flavio S. Schenkel 2 and Paul M. VanRaden 3

1 L’Alliance Boviteq Inc., 1425 grand rang St-François, Saint-Hyacinthe, Quebec, Canada

2 Department of Animal and Poultry Science, University of Guelph, Guelph, Ontario, Canada

3 Animal Improvement Programs Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA

Summary: This paper introduces the gebv software, a genomic evaluation tool for livestock. gebv estimates genomic breeding values using dense SNP maps and a single-trait linear model. Genomic breeding values can be estimated using either ridge regression or an equivalent model. The most time-consuming tasks, matrix multiplication and inversion, were efficiently implemented. gebv can also run parallel jobs, which can significantly reduce computing time.

1 INTRODUCTION

Genomic evaluation has been developed to predict breeding values using dense marker maps (Nejati-Javaremi et al., 1997; Meuwissen et al., 2001). The introduction of high-throughput single nucleotide polymorphism (SNP) genotyping methods has cleared the way for implementation of genomic selection. Several studies have shown that genomic selection is significantly more accurate than traditional selection of young animals, especially for low-heritability traits (Meuwissen et al., 2001; Habier et al., 2007; VanRaden et al., 2009). This has created a strong need for flexible and efficient software for genomic evaluation in livestock. Methods commonly used to estimate genomic breeding values (GEBV) are best linear unbiased prediction from mixed model analysis using a genomically estimated relationship matrix (G-BLUP), random regression BLUP (R-BLUP) and various nonlinear methods. For most economically important traits in livestock, linear models were shown to be as accurate as nonlinear methods, or even more accurate. Nonlinear methods were more accurate only for traits that are lowly heritable and controlled by a few large QTL (VanRaden et al., 2009). Therefore, the use of linear methods, which are computationally less demanding than nonlinear methods, is justified.


One of the advantages of G-BLUP over R-BLUP is that an individual reliability for each GEBV can be obtained. However, G-BLUP is computationally much more demanding than R-BLUP, which has driven many studies to use simplified approximations. Computational strategies for G-BLUP have already been described (VanRaden, 2008). From the implementation point of view, efficient computer programs and algorithms should be developed continuously to meet the future demands of the genomic era. This paper introduces the gebv software, which is an efficient implementation of G-BLUP and R-BLUP for genomic selection in livestock.

2 METHODS AND DESCRIPTION OF THE ALGORITHMS

To compute the GEBV, the following model is used:

$$y_i = \mu + \sum_k x_{ik} + e_i,$$

where $y_i$ is the observation of the $i$th animal, $\mu$ is the overall mean, $x_{ik}$ is the effect of the $k$th SNP for the $i$th animal and $e_i$ is the residual. It is assumed that all SNP markers have the same variance.

R-BLUP: Regression coefficients are obtained from the solution of the following set of mixed model equations (Xu, 2003):

$$\begin{pmatrix} \mathbf{1}'\mathbf{R}^{-1}\mathbf{1} & \mathbf{1}'\mathbf{R}^{-1}\mathbf{X} \\ \mathbf{X}'\mathbf{R}^{-1}\mathbf{1} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} + \mathbf{I} \end{pmatrix} \begin{pmatrix} \hat{\mu} \\ \hat{\mathbf{u}} \end{pmatrix} = \begin{pmatrix} \mathbf{1}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \end{pmatrix},$$


where $\mathbf{X}$ is the genotypic coefficient matrix of order $n \times p$ ($n$ is the number of observations and $p$ is the number of SNP), $\mathbf{R}$ is a diagonal matrix with elements $R_{ii} = (1/Rel^{*}) - 1$, where $Rel^{*}$ is the reliability of the daughter deviation (DD), and $\mathbf{y}$ is the overall mean + 2×DD. The GEBV are obtained as $\mathbf{X}\hat{\mathbf{u}}$. A preconditioned conjugate gradient solver is applied to obtain $\hat{\mathbf{u}}$.
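The R-BLUP equations and their iterative solution can be sketched in a few lines of pure Python. This is a minimal illustration, not the gebv implementation: the function names (`build_mme`, `pcg`), the 0/1/2 genotype coding, the toy data and the Jacobi (diagonal) preconditioner are all assumptions, since the paper does not say which preconditioner is used. The shrinkage parameter `lam` is also hypothetical; `lam=1.0` reproduces the identity matrix shown in the equations above.

```python
def build_mme(X, y, r_diag, lam=1.0):
    """Build the mixed model equations for the unknowns [mu, u_1, ..., u_p]."""
    n = len(X)
    w = [1.0 / r for r in r_diag]                        # diagonal of R^-1
    W = [[1.0] + [float(v) for v in row] for row in X]   # design matrix [1 | X]
    size = len(W[0])
    C = [[sum(W[i][a] * w[i] * W[i][b] for i in range(n))
          for b in range(size)] for a in range(size)]
    for k in range(1, size):
        C[k][k] += lam                                   # shrink SNP effects only
    rhs = [sum(W[i][a] * w[i] * y[i] for i in range(n)) for a in range(size)]
    return C, rhs


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(C, rhs, tol=1e-10, max_iter=1000):
    """Conjugate gradient with a diagonal (Jacobi) preconditioner (assumed)."""
    size = len(rhs)
    m_inv = [1.0 / C[k][k] for k in range(size)]
    x = [0.0] * size
    r = rhs[:]                                           # residual for x = 0
    z = [m * v for m, v in zip(m_inv, r)]
    d = z[:]
    rz = dot(r, z)
    stop = tol * dot(rhs, rhs) ** 0.5                    # relative stopping rule
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= stop:
            break
        Cd = matvec(C, d)
        alpha = rz / dot(d, Cd)
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * ci for ri, ci in zip(r, Cd)]
        z = [m * v for m, v in zip(m_inv, r)]
        rz_new = dot(r, z)
        d = [zi + (rz_new / rz) * di for zi, di in zip(z, d)]
        rz = rz_new
    return x


# Toy data (hypothetical): 5 animals, 3 SNP coded 0/1/2.
X = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1], [0, 0, 2]]
rel = [0.9, 0.8, 0.7, 0.85, 0.6]                         # reliabilities of DD
r_diag = [1.0 / v - 1.0 for v in rel]                    # R_ii = 1/Rel* - 1
y = [1.2, 0.4, -0.3, 0.9, 0.1]

C, rhs = build_mme(X, y, r_diag)
sol = pcg(C, rhs)
mu_hat, u_hat = sol[0], sol[1:]
gebv = matvec(X, u_hat)                                  # GEBV = X u_hat
```

A production solver would avoid forming the coefficient matrix explicitly and would apply the products with X and X' directly, which is what makes the iterative approach attractive when the number of SNP is large.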

G-BLUP: GEBV are obtained using selection index theory by solving the following set of equations (Nejati-Javaremi et al., 1997; VanRaden et al., 2009):

$$\hat{\mathbf{a}} = \mathbf{G}(\mathbf{G} + \mathbf{R})^{-1}(\mathbf{y} - \hat{\mu}),$$

where Z is the incidence matrix relating animals to the observations and $\mathbf{G}$ is the genomic relationship matrix.
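The selection index equation can be illustrated with a small dense example. The sketch below is an assumption-laden illustration rather than the gebv code: the function names (`solve_dense`, `gblup`) and the toy matrices are hypothetical, `mu_hat` is simply passed in, and a linear solve by Gaussian elimination stands in for the explicit inverse of G + R described in the paper.

```python
def solve_dense(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]     # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                         # back substitution
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def gblup(G, r_diag, y, mu_hat):
    """Return a_hat = G (G + R)^-1 (y - mu_hat) for a diagonal R."""
    n = len(y)
    GR = [[G[i][j] + (r_diag[i] if i == j else 0.0) for j in range(n)]
          for i in range(n)]
    v = solve_dense(GR, [yi - mu_hat for yi in y])       # (G + R) v = y - mu_hat
    return [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]


# Toy example (hypothetical values): three animals.
G = [[1.0, 0.5, 0.25],
     [0.5, 1.0, 0.5],
     [0.25, 0.5, 1.0]]
r_diag = [0.25, 0.5, 1.0]
y = [1.0, 2.0, 3.0]
a_hat = gblup(G, r_diag, y, mu_hat=2.0)
```

With G equal to the identity and every diagonal element of R equal to 1, the function halves each deviation from the mean, which is a quick sanity check on the shrinkage.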

Four steps are involved in G-BLUP: (1) calculation of base allele frequencies; (2) calculation of the traditional relationship matrix (A) and the genomic relationship matrix (G); (3) solving the selection index equations, which requires direct inversion of the G+R matrix; and (4) blending direct GEBV with parental average or EBV. The first three steps are computationally intensive and therefore pose challenges for implementation. These steps were optimized for overall speed and memory requirements.

Calculation of base allele frequencies: Base allele frequencies are required for unbiased estimation of inbreeding. Base allele frequencies were estimated according to Gengler et al. (2007) using an animal model. This method is simple and practical for large pedigrees, with almost the same accuracy as the alternative peeling method. One advantage of the method is that pedigree and genotyping errors can be accounted for. The mixed model equations are solved for one SNP at a time using the preconditioned conjugate gradient method. Parallel processing was implemented to reduce computing time: markers are distributed across parallel jobs.
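One simple way to distribute markers across jobs is to give each job a contiguous range of SNP indices of nearly equal size. The helper below is a hypothetical sketch (`split_markers` is not a gebv function, and the paper does not describe how its ranges are formed); it only illustrates the bookkeeping.

```python
def split_markers(n_markers, n_jobs):
    """Return half-open (start, end) index ranges, one per job."""
    base, extra = divmod(n_markers, n_jobs)
    ranges = []
    start = 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)   # first `extra` jobs get one more
        ranges.append((start, start + size))
        start += size
    return ranges


# Example: 10 markers over 4 jobs.
print(split_markers(10, 4))   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```

Because each SNP is solved independently, the jobs need no communication until the per-marker frequencies are collected.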

Calculation of traditional and genomic relationship matrices: The numerator relationship matrix is efficiently calculated by Colleau’s indirect method (Colleau, 2002). The genomic relationship matrix is calculated as

$$\mathbf{G} = \frac{\mathbf{X}\mathbf{X}'}{2\sum_i p_i(1 - p_i)},$$

where $p_i$ is the allele frequency of the $i$th SNP. Constructing G with a conventional matrix multiplication algorithm can be very time-consuming because X is dense and very large. For multiplication or inversion of large, dense matrices, the computational bottleneck is memory access time. To speed up matrix multiplication, a blocking technique was implemented. The X matrix was divided into 8×8 sub-matrices. This technique increases temporal locality of memory access by keeping sub-matrices in fast memory (cache). The optimum block size was determined empirically. Furthermore, distributed-memory parallel processing was implemented, in which an equal number of SNP is assigned to each parallel job. To save memory, elements of G are stored as 4-byte variables (float), but to avoid rounding error, calculations are carried out using 8-byte variables (double).
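The loop structure of a blocked product XX' can be sketched as follows. This is an illustration under stated assumptions, not the gebv code: `blocked_xxt` and `genomic_relationship` are hypothetical names, the default block size of 8 simply mirrors the 8×8 sub-matrices mentioned above, and X is taken as already holding the genotypic coefficients. Pure Python will not show the cache benefit; the point is the order in which tiles are visited, plus the use of symmetry so that only the upper triangle is computed.

```python
def blocked_xxt(X, block=8):
    """Compute X X' tile by tile, filling the upper triangle and mirroring it."""
    n, m = len(X), len(X[0])
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(i0, n, block):            # upper-triangular tiles only
            for k0 in range(0, m, block):         # one tile of SNP columns
                k1 = min(k0 + block, m)
                for i in range(i0, min(i0 + block, n)):
                    xi = X[i]
                    for j in range(max(j0, i), min(j0 + block, n)):
                        xj = X[j]
                        s = 0.0
                        for k in range(k0, k1):
                            s += xi[k] * xj[k]
                        out[i][j] += s
    for i in range(n):                            # mirror into the lower triangle
        for j in range(i):
            out[i][j] = out[j][i]
    return out


def genomic_relationship(X, p, block=8):
    """G = X X' / (2 * sum_i p_i (1 - p_i))."""
    denom = 2.0 * sum(pi * (1.0 - pi) for pi in p)
    xxt = blocked_xxt(X, block)
    return [[v / denom for v in row] for row in xxt]


# Toy example (hypothetical values): 3 animals, 4 SNP.
X = [[1.0, 0.0, -1.0, 2.0],
     [0.0, 1.0, 1.0, -1.0],
     [2.0, 1.0, 0.0, 0.0]]
p = [0.5, 0.25, 0.5, 0.1]
G = genomic_relationship(X, p, block=2)
```

In a compiled language the same tiling keeps the working set of each inner loop small enough to stay in cache, and the accumulator `s` would be a double even when the stored elements of G are floats.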

Solving selection index equations: Selection index equations are solved twice, once incorporating A and once incorporating G. Here, the most intensive operations are matrix multiplication and matrix inversion. When the A matrix is incorporated, the matrix multiplication is done by the indirect method, which saves a substantial amount of time. With G, block matrix multiplication is applied as previously explained. Matrix inversion is another important computational bottleneck. The same blocking technique used for matrix multiplication is used to speed up the inversion. The symmetry of the system of equations further reduces the number of operations involved in the inversion. When more than one trait is evaluated, each trait can be assigned to a single job, allowing for parallel processing.



3 FEATURES

gebv has a number of important features, some of which are highlighted below:

• Is memory efficient and relatively fast;

• Uses distributed-memory parallel processing;

• Identifies and calculates the required elements of the A and G matrices for multiple traits only once;

• Can estimate the effect of each SNP using R-BLUP;

• Can quickly obtain the GEBV for newly genotyped animals from previous SNP solutions;

• Uses an indirect method (Colleau, 2002) to compute the relationship matrix and for fast matrix multiplication;

• Uses the preconditioned conjugate gradient method to solve for base allele frequencies and SNP effects;

• Uses fast block matrix multiplication;

• Uses fast block matrix inversion;

• Can filter SNP based on minor allele frequencies or a predefined list of SNP.

The behavior of the program can be controlled by changing parameters in the control file. Details of the parameters are given in the user’s guide. The number of parallel jobs can be specified by the user, so computer resources can be managed for optimal use.

4 IMPLEMENTATION AND EFFICIENCY

The program is written in C++ and is portable to multiple operating systems. Executable files are currently available for Windows and Linux platforms. The program was tested on a data set consisting of 69 traits, 38,416 SNP, 21,961 genotyped animals (for most traits, ~30% of the animals in the estimation group and ~70% in the prediction group) and 78,699 animals in the pedigree file. Computing time for the genomic evaluation using G-BLUP on an AMD Opteron server running at 2.6 GHz with 4 processors and 16 GB RAM was less than 8 hours. The random access memory requirement was around 10 GB.

5 CONCLUSIONS

With the rapid progress and decreasing cost of high-throughput genotyping techniques in livestock, genomic selection is expected to become an important tool for selection of young animals, traits with low heritability and traits that are difficult to measure. For G-BLUP, matrix inversion and multiplication seem to be the main computational bottlenecks. In the future, with an increased number of genotyped animals, this method may become very time consuming and genomic evaluations may need days to complete. By implementing parallel processing, block matrix inversion and block matrix multiplication (which use high-speed memory efficiently) in the gebv software, genomic evaluations using direct matrix inversion can be run within a reasonable time if the number of genotyped animals in the estimation group is not too large. The costs of matrix inversion and multiplication are proportional to $n^3$. Therefore, when the estimation group doubles, about 8 times more CPU time is needed. If $n$ grows to be sufficiently large, the equivalent method might become computationally infeasible. However, GEBV for a large number of animals can still be estimated efficiently using R-BLUP, but an approximate method to obtain the genomic reliabilities will be needed.
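As a worked illustration of the cubic scaling, assume the cost is $c\,n^3$ for some machine-dependent constant $c$ (the constant is an assumption introduced here and is not given in the paper). Doubling the estimation group then gives

$$\frac{\text{cost}(2n)}{\text{cost}(n)} = \frac{c\,(2n)^3}{c\,n^3} = 2^3 = 8,$$

and, by the same argument, tripling it would multiply the cost by $3^3 = 27$.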



gebv is currently limited to a linear model, but the plan is to extend the application to include nonlinear models.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge financial support from L’Alliance Boviteq Inc. (SEMEX Alliance, Canada) and the Natural Sciences and Engineering Research Council of Canada (Collaborative Research and Development grant).

REFERENCES

Colleau, J.J. (2002) An indirect approach to the extensive calculation of relationship coefficients. Genet. Sel. Evol. 34, 409-421.

Habier, D., Fernando, R.L. and Dekkers, J.C.M. (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177, 2389-2397.

Meuwissen, T.H.E., Hayes, B.J. and Goddard, M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819-1829.

Nejati-Javaremi, A., Smith, C. and Gibson, J.P. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J. Anim. Sci. 75, 1738-1745.

VanRaden, P.M., Van Tassell, C.P., Wiggans, G.R., Sonstegard, T.S., Schnabel, R.D., Taylor, J.F. and Schenkel, F.S. (2009) Invited review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92, 16-24.

Xu, S. (2003) Estimating polygenic effects using markers of the entire genome. Genetics 163, 789-801.
