gebv: Genomic breeding value estimator for livestock - CGIL ...

gebv: Genomic breeding value estimator for livestock 

Mehdi Sargolzaei 1,2,* , Flavio S. Schenkel 2 and Paul M. VanRaden 3 

1 L’Alliance Boviteq Inc. 1425 grand rang St-François, Saint-Hyacinthe, Quebec, Canada 

2 Department of Animal and Poultry Science, University of Guelph, Guelph, Ontario, Canada 

3 Animal Improvement Programs Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA 

Summary: This paper introduces the gebv software, a genomic evaluation tool for livestock. gebv estimates genomic breeding values using dense SNP maps and a single-trait linear model. Genomic breeding values can be estimated based on ridge regression and an equivalent model. The most time-consuming tasks, involving matrix multiplication and inversion, were efficiently implemented. Features provided by gebv include the ability to run parallel jobs that can significantly reduce computing time.

1 INTRODUCTION 

Genomic evaluation has been developed to predict breeding values using dense marker maps (Nejati-Javaremi et al., 1997; Meuwissen et al., 2001). The introduction of high-throughput single nucleotide polymorphism (SNP) genotyping methods has cleared the way for implementation of genomic selection. Several studies have shown that genomic selection is significantly more accurate than traditional selection of young animals, especially for low-heritability traits (Meuwissen et al., 2001; Habier et al., 2007; VanRaden et al., 2009). This has led to a great need for flexible and efficient software for genomic evaluation in livestock. Methods commonly used to estimate genomic breeding values (GEBV) are best linear unbiased prediction from mixed model analysis using a genomically estimated relationship matrix (G-BLUP), random regression BLUP (R-BLUP) and various nonlinear methods. For most of the economically important traits in livestock, linear models were shown to be as accurate as nonlinear methods, or even more accurate. Only for traits that have low heritability and are controlled by a few large QTL were the nonlinear methods more accurate (VanRaden et al., 2009). Therefore, the use of linear methods, which are computationally less demanding than nonlinear methods, is justified.

One of the advantages of G-BLUP over R-BLUP is that individual reliabilities for GEBV can be obtained. However, G-BLUP is computationally much more demanding than R-BLUP, which has driven many studies to use simplified approximations. Computational strategies for G-BLUP have already been described (VanRaden, 2008). From the implementation point of view, efficient computer programs and algorithms need to be developed continuously to meet the future demands of the genomic era. This paper introduces the gebv software, which is an efficient implementation of G-BLUP and R-BLUP for genomic selection in livestock.

2 METHODS AND DESCRIPTION OF THE ALGORITHMS

To compute the GEBV, the following model is used:

$$ y_i = \mu + \sum_k x_{ik} + e_i , $$

where $y_i$ is the observation of the ith animal, $\mu$ is the overall mean, $x_{ik}$ is the effect of the kth SNP for the ith animal and $e_i$ is the residual. It is assumed that all SNP markers have the same variance.

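R-BLUP: Regression coefficients are obtained from the solution of the following set of mixed model equations (Xu, 2003):

$$
\begin{pmatrix}
\mathbf{1}'\mathbf{R}^{-1}\mathbf{1} & \mathbf{1}'\mathbf{R}^{-1}\mathbf{X} \\
\mathbf{X}'\mathbf{R}^{-1}\mathbf{1} & \mathbf{X}'\mathbf{R}^{-1}\mathbf{X} + \mathbf{I}
\end{pmatrix}
\begin{pmatrix} \hat{\mu} \\ \hat{\mathbf{u}} \end{pmatrix}
=
\begin{pmatrix} \mathbf{1}'\mathbf{R}^{-1}\mathbf{y} \\ \mathbf{X}'\mathbf{R}^{-1}\mathbf{y} \end{pmatrix} ,
$$

where X is the genotypic coefficient matrix of order n×p (n is the number of observations and p is the number of SNP), R is a diagonal matrix with elements $R_{ii} = (1/\mathrm{Rel}^{*}) - 1$, where Rel* is the reliability of daughter deviation (DD), and y is the overall mean + 2×DD. The GEBV are obtained as $\mathbf{X}\hat{\mathbf{u}}$. A preconditioned conjugate gradient solver is applied to obtain $\hat{\mathbf{u}}$.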
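The R-BLUP equations above can be solved without ever forming the coefficient matrix. The following is a minimal sketch, not the gebv implementation: it assumes a Jacobi (diagonal) preconditioner, which the paper does not specify, and the function name `rblup_pcg`, its arguments and the tolerance are hypothetical.

```python
def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def rblup_pcg(X, y, rel, tol=1e-10, max_iter=1000):
    """Solve the R-BLUP mixed model equations by preconditioned CG.

    X   : n x p genotypic coefficients (list of rows)
    y   : n observations
    rel : n reliabilities in (0, 1); R_ii = 1/rel_i - 1
    Returns (mu_hat, u_hat). The preconditioner is an assumption of this sketch.
    """
    n, p = len(X), len(X[0])
    rinv = [1.0 / (1.0 / ri - 1.0) for ri in rel]  # diagonal of R^-1

    def apply_lhs(v):
        # Left-hand side times v, computed as W' R^-1 (W v) + [0, u], W = [1 X]
        fitted = [v[0] + sum(X[i][k] * v[k + 1] for k in range(p)) for i in range(n)]
        w = [rinv[i] * fitted[i] for i in range(n)]
        return [sum(w)] + [
            sum(X[i][k] * w[i] for i in range(n)) + v[k + 1] for k in range(p)
        ]

    wy = [rinv[i] * y[i] for i in range(n)]
    rhs = [sum(wy)] + [sum(X[i][k] * wy[i] for i in range(n)) for k in range(p)]
    # Diagonal of the coefficient matrix, used as the preconditioner
    diag = [sum(rinv)] + [
        sum(rinv[i] * X[i][k] ** 2 for i in range(n)) + 1.0 for k in range(p)
    ]

    sol = [0.0] * (p + 1)
    res = rhs[:]
    z = [ri / di for ri, di in zip(res, diag)]
    d = z[:]
    rz = _dot(res, z)
    stop = tol * tol * _dot(rhs, rhs)
    for _ in range(max_iter):
        if _dot(res, res) <= stop:  # relative residual small enough
            break
        q = apply_lhs(d)
        alpha = rz / _dot(d, q)
        sol = [s + alpha * di for s, di in zip(sol, d)]
        res = [ri - alpha * qi for ri, qi in zip(res, q)]
        z = [ri / di for ri, di in zip(res, diag)]
        rz_new = _dot(res, z)
        d = [zi + (rz_new / rz) * di for zi, di in zip(z, d)]
        rz = rz_new
    return sol[0], sol[1:]
```

Each iteration costs two passes over X, so memory use stays at the size of X plus a few vectors of length p + 1.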

G-BLUP: GEBV are obtained using selection index theory by solving the following set of equations (Nejati-Javaremi et al., 1997; VanRaden et al., 2009):

$$ \hat{\mathbf{a}} = \mathbf{G}(\mathbf{G} + \mathbf{R})^{-1}(\mathbf{y} - \hat{\mu}) , $$

where Z is the incidence matrix relating animals to the observations and G is the genomic relationship matrix.
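As an illustration of the selection index equation, the sketch below solves (G + R)v = y − μ̂ and returns Gv. It is not the gebv code: it takes μ̂ as given, uses plain Gaussian elimination in place of the blocked inversion described later, and the names `solve` and `gblup` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented copy
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def gblup(G, rel, y, mu_hat):
    """Return a_hat = G (G + R)^-1 (y - mu_hat), with R_ii = 1/rel_i - 1.

    mu_hat is supplied by the caller (an assumption of this sketch).
    """
    n = len(y)
    lhs = [
        [G[i][j] + ((1.0 / rel[i] - 1.0) if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    v = solve(lhs, [yi - mu_hat for yi in y])
    return [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
```

For X = [[1, 0], [0, 1], [-1, -1]], G = XX′, all reliabilities 0.5, y = (1, 2, 3) and μ̂ = 2, this gives â = (−0.625, −0.125, 0.75), which equals Xû for the ridge regression solution û = (−0.625, −0.125). This is the equivalence between the two models on a toy example.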

Four steps are involved in G-BLUP: (1) calculation of base allele frequencies; (2) calculation of the traditional relationship matrix (A) and the genomic relationship matrix (G); (3) solving the selection index equations, which requires direct inversion of the G+R matrix; and (4) blending direct GEBV with parental average or EBV. The first three steps are computationally intensive and therefore challenging to implement. These steps were optimized for overall speed and memory requirements.

Calculation of base allele frequencies: Base allele frequencies are required for unbiased estimation of inbreeding. Base allele frequencies were estimated according to Gengler et al. (2007) using an animal model. This method is simple and practical for large pedigrees, with almost the same accuracy as the alternative peeling method. One advantage of the method is that pedigree and genotyping errors can be accounted for. The mixed model equations are solved one SNP at a time using the preconditioned conjugate gradient method. Parallel processing was implemented to reduce computing time: markers are distributed across parallel jobs.
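Because each SNP is solved independently, distributing markers across jobs amounts to partitioning the marker indices. A minimal sketch of one such partition follows; the function name `split_markers` and the contiguous-range layout are assumptions, since the paper does not say how gebv assigns markers.

```python
def split_markers(n_markers, n_jobs):
    """Split marker indices 0..n_markers-1 into n_jobs contiguous ranges
    whose sizes differ by at most one (hypothetical helper)."""
    base, extra = divmod(n_markers, n_jobs)
    ranges, start = [], 0
    for job in range(n_jobs):
        size = base + (1 if job < extra else 0)  # first `extra` jobs get one more
        ranges.append(range(start, start + size))
        start += size
    return ranges
```

With the 38,416 SNP and 4 processors of the test described in Section 4, each job would receive 9,604 markers.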

Calculation of traditional and genomic relationship matrices: The numerator relationship matrix is efficiently calculated by Colleau's indirect method (Colleau, 2002). The genomic relationship matrix is calculated as

$$ \mathbf{G} = \frac{\mathbf{X}\mathbf{X}'}{2\sum_i p_i(1 - p_i)} , $$

where $p_i$ is the allele frequency of the ith SNP. Constructing G with a conventional matrix multiplication algorithm can be very time-consuming because X is dense and very large. For large, dense matrix multiplication or inversion, the computational bottleneck is memory access time. To expedite matrix multiplication, a blocking technique was implemented: the X matrix was divided into 8×8 sub-matrices. This technique increases temporal locality of memory access by keeping sub-matrices in fast memory (cache). The optimum block size was determined empirically. Furthermore, distributed-memory parallel processing was implemented, in which an equal number of SNP is assigned to each parallel job. To save memory, elements of G are stored as 4-byte variables (float), but to avoid rounding error, calculations are carried out using 8-byte variables (double).
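The loop structure of a blocked computation of G can be sketched as follows. This is an illustration under stated assumptions rather than the gebv code: X is assumed to hold the already coded genotype coefficients, the default tile size of 8 mirrors the 8×8 sub-matrices mentioned above, and the function name is hypothetical. In pure Python the interpreter overhead hides any cache benefit, so the sketch shows only the order of operations and the float/double split.

```python
from array import array


def genomic_relationship(X, freqs, block=8):
    """Compute G = X X' / (2 * sum p_i (1 - p_i)) tile by tile.

    Tiles are accumulated in double precision (Python floats) and stored
    in 4-byte floats (array type code 'f').
    """
    n, p = len(X), len(X[0])
    scale = 2.0 * sum(f * (1.0 - f) for f in freqs)
    G = [array("f", [0.0] * n) for _ in range(n)]
    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        for j0 in range(i0, n, block):  # G is symmetric: upper tiles only
            j1 = min(j0 + block, n)
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, p, block):  # walk along the SNP dimension
                k1 = min(k0 + block, p)
                for i in range(i0, i1):
                    xi, row = X[i], acc[i - i0]
                    for j in range(j0, j1):
                        xj = X[j]
                        s = 0.0
                        for k in range(k0, k1):
                            s += xi[k] * xj[k]
                        row[j - j0] += s
            for i in range(i0, i1):
                for j in range(j0, j1):
                    val = acc[i - i0][j - j0] / scale
                    G[i][j] = val  # rounded to 4 bytes on storage
                    G[j][i] = val
    return G
```

In a compiled language the three inner loops operate on sub-matrices small enough to stay in cache, which is where the speed-up described above comes from.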

Solving selection index equations: Selection index equations are solved twice, once incorporating A and once incorporating G. Here, the most intensive operations are matrix multiplication and matrix inversion. When the A matrix is incorporated, the matrix multiplication is done by the indirect method, which saves a substantial amount of time. With G, block matrix multiplication is applied as previously explained. Matrix inversion is another important computational bottleneck. A blocking technique similar to that used for matrix multiplication speeds up the inversion. The symmetry of the system of equations further reduces the number of operations involved in the inversion. When more than one trait is analyzed, each trait can be assigned to a separate job, allowing for parallel processing.
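The paper does not name the inversion algorithm, so the following sketch swaps in a standard Cholesky-based inverse to show how symmetry reduces the work: only the lower triangle is read, factored and multiplied, and the result is mirrored. It is unblocked, assumes the matrix is symmetric positive definite (as G + R is when all reliabilities are below 1), and the function name is hypothetical.

```python
import math


def spd_inverse(A):
    """Invert a symmetric positive definite matrix via A = L L'."""
    n = len(A)
    # Cholesky factor, reading only the lower triangle of A
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Inverse of the lower triangular factor by forward substitution
    Linv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        Linv[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            s = sum(L[i][k] * Linv[k][j] for k in range(j, i))
            Linv[i][j] = -s / L[i][i]
    # A^-1 = Linv' Linv: compute the lower triangle and mirror it
    inv = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            v = sum(Linv[k][i] * Linv[k][j] for k in range(i, n))
            inv[i][j] = inv[j][i] = v
    return inv
```

A blocked variant would apply the same three stages to sub-matrices instead of scalars so that each sub-matrix stays in cache while it is being used.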

3 FEATURES 

gebv has a number of important features, some of which are highlighted below:

• Is memory efficient and relatively fast; 

• Uses distributed memory parallel processing; 

• The required elements of the A and G matrices for multiple traits are identified and calculated once;

• The effect of each SNP can be estimated using R-BLUP; 

• The GEBV for newly genotyped animals can be obtained quickly from previous SNP solutions;

• Uses an indirect method (Colleau, 2002) to compute the relationship matrix and for fast matrix multiplication;

• Uses the preconditioned conjugate gradient method to solve for base allele frequencies and SNP effects;

• Uses fast block matrix multiplication; 

• Uses fast block matrix inversion; 

• SNP can be filtered based on minor allele frequencies or a predefined list of SNP. 
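Two of the features listed above lend themselves to a short illustration: predicting GEBV for newly genotyped animals from stored SNP solutions (as Xû), and filtering SNP by minor allele frequency. The function names and the threshold argument are hypothetical and do not correspond to gebv control-file parameters.

```python
def predict_gebv(genotypes, u_hat):
    """GEBV for new animals as X u_hat, reusing previously estimated SNP effects."""
    return [sum(x * u for x, u in zip(row, u_hat)) for row in genotypes]


def keep_by_maf(freqs, min_maf):
    """Indices of SNP whose minor allele frequency is at least min_maf."""
    return [k for k, p in enumerate(freqs) if min(p, 1.0 - p) >= min_maf]
```

Prediction for a new animal is a single dot product over p SNP, so no equations need to be solved again until the SNP effects are re-estimated.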

The behavior of the program can be controlled by changing parameters in the control file. Details of the parameters are given in the user's guide. The number of parallel jobs can be specified by the user and, therefore, computer resources can be managed for optimal use.

4 IMPLEMENTATION AND EFFICIENCY 

The program is written in C++ and is portable to multiple operating systems. Executable files are currently available for Windows and Linux platforms. The program was tested on a data set consisting of 69 traits, 38,416 SNP, 21,961 genotyped animals (for most traits, ~30% of the animals in the estimation group and ~70% in the prediction group) and 78,699 animals in the pedigree file. Computing time for the genomic evaluation using G-BLUP on an AMD Opteron server running at 2.6 GHz with 4 processors and 16 GB RAM was less than 8 hours. The random access memory requirement was around 10 GB.

5 CONCLUSIONS 

With the rapid progress and decreasing cost of high-throughput genotyping techniques in livestock, genomic selection is expected to become an important tool for selection of young animals, for traits with low heritability and for traits that are difficult to measure. For G-BLUP, matrix inversion and multiplication appear to be the main computational bottlenecks. In the future, as the number of genotyped animals increases, this method may become very time-consuming and genomic evaluations may need days to complete. By implementing parallel processing, block matrix inversion and block matrix multiplication (which use high-speed memory efficiently) in the gebv software, genomic evaluations using direct matrix inversion can be run within a reasonable time if the number of genotyped animals in the estimation group is not too large. The costs of matrix inversion and multiplication are proportional to $n^3$. Therefore, when the estimation group doubles, about 8 times more CPU time is needed. If n grows sufficiently large, the equivalent model (G-BLUP) might become computationally infeasible. However, GEBV for a large number of animals can still be estimated efficiently using R-BLUP, although an approximate method to obtain the genomic reliabilities will be needed.
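The factor of 8 follows directly from the cubic cost. Writing the cost as $c\,n^3$ for some constant c (a symbol introduced here only for illustration):

```latex
\frac{c\,(2n)^3}{c\,n^3} = 2^3 = 8 ,
\qquad
\frac{c\,(kn)^3}{c\,n^3} = k^3 \quad \text{for any growth factor } k .
```

A tenfold increase in the estimation group would therefore raise the cost of inversion and multiplication about a thousandfold.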

gebv is currently limited to a linear model, but the future plan is to extend the application to include nonlinear models.

ACKNOWLEDGEMENTS 

The authors gratefully acknowledge financial support from L'Alliance Boviteq Inc. (SEMEX Alliance, Canada) and the Natural Sciences and Engineering Research Council of Canada (Collaborative Research and Development grant).

REFERENCES 

Colleau, J.J. (2002) An indirect approach to the extensive calculation of relationship coefficients. Genet. Sel. Evol. 34, 409-421.

Habier, D., Fernando, R.L. and Dekkers, J.C.M. (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177, 2389-2397.

Meuwissen, T.H.E., Hayes, B.J. and Goddard, M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819-1829.

Nejati-Javaremi, A., Smith, C. and Gibson, J.P. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J. Anim. Sci. 75, 1738-1745.

VanRaden, P.M., Van Tassell, C.P., Wiggans, G.R., Sonstegard, T.S., Schnabel, R.D., Taylor, J.F. and Schenkel, F.S. (2009) Invited review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92, 16-24.

Xu, S. (2003) Estimating polygenic effects using markers of the entire genome. Genetics 163, 789-801.
