gebv: Genomic breeding value estimator for livestock - CGIL ...
gebv: Genomic breeding value estimator for livestock
Mehdi Sargolzaei 1,2,*, Flavio S. Schenkel 2 and Paul M. VanRaden 3
1 L’Alliance Boviteq Inc., 1425 grand rang St-François, Saint-Hyacinthe, Quebec, Canada
2 Department of Animal and Poultry Science, University of Guelph, Guelph, Ontario, Canada
3 Animal Improvement Programs Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA
Summary: This paper introduces the gebv software, a genomic evaluation tool for livestock. gebv estimates genomic breeding values using dense SNP maps and a single-trait linear model. Genomic breeding values can be estimated by ridge regression or by an equivalent model. The most time-consuming tasks, matrix multiplication and inversion, are implemented efficiently. gebv can also run parallel jobs, which can significantly reduce computing time.
1 INTRODUCTION
Genomic evaluation has been developed to predict breeding values using dense marker maps (Nejati-Javaremi et al., 1997; Meuwissen et al., 2001). The introduction of high-throughput single nucleotide polymorphism (SNP) genotyping methods has cleared the way for implementation of genomic selection. Several studies have shown that genomic selection is significantly more accurate than traditional selection of young animals, especially for low-heritability traits (Meuwissen et al., 2001; Habier et al., 2007; VanRaden et al., 2009). This has created a strong need for flexible and efficient software for genomic evaluation in livestock. Methods commonly used to estimate genomic breeding values (GEBV) are best linear unbiased prediction from mixed model analysis using a genomically estimated relationship matrix (G-BLUP), random regression BLUP (R-BLUP) and various nonlinear methods. For most economically important traits in livestock, linear models were shown to be as accurate as nonlinear methods, or even more accurate. Nonlinear methods were more accurate only for traits that are lowly heritable and controlled by a few large QTL (VanRaden et al., 2009). Therefore, the use of linear methods, which are computationally less demanding than nonlinear methods, is justified.
One of the advantages of G-BLUP over R-BLUP is that individual reliabilities for GEBV can be obtained. However, G-BLUP is computationally much more demanding than R-BLUP, which has driven many studies to use simplified approximations. Computational strategies for G-BLUP have already been described (VanRaden, 2008). From the implementation point of view, efficient computer programs and algorithms must continue to be developed to meet the future demands of the genomic era. This paper introduces the gebv software, an efficient implementation of G-BLUP and R-BLUP for genomic selection in livestock.
2 METHODS AND DESCRIPTION OF THE ALGORITHMS
To compute the GEBV, the following model is used:

$$y_i = \mu + \sum_k x_{ik} + e_i,$$

where $y_i$ is the observation of the ith animal, $\mu$ is the overall mean, $x_{ik}$ is the effect of the kth SNP for the ith animal and $e_i$ is the residual. It is assumed that all SNP markers have the same variance.
R-BLUP: Regression coefficients are obtained from the solution of the following set of mixed model equations (Xu, 2003):

$$\begin{pmatrix} \mathbf{1}'R^{-1}\mathbf{1} & \mathbf{1}'R^{-1}X \\ X'R^{-1}\mathbf{1} & X'R^{-1}X + I \end{pmatrix} \begin{pmatrix} \hat{\mu} \\ \hat{u} \end{pmatrix} = \begin{pmatrix} \mathbf{1}'R^{-1}y \\ X'R^{-1}y \end{pmatrix},$$

where X is the genotypic coefficient matrix of order n×p (n is the number of observations and p is the number of SNP), R is a diagonal matrix with elements $R_{ii} = (1/\mathrm{Rel}^{*}) - 1$, where $\mathrm{Rel}^{*}$ is the reliability of daughter deviation (DD), and y is the overall mean + 2×DD. The GEBV are obtained as $X\hat{u}$. A preconditioned conjugate gradient solver is applied to obtain $\hat{u}$.
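
The R-BLUP step can be illustrated with a small, unoptimized sketch. It builds the mixed model equations densely and solves them with a Jacobi-preconditioned conjugate gradient; the choice of a diagonal preconditioner, the function names and the `lam` parameter (the multiplier of the identity block, 1.0 to match the equations as printed) are assumptions made for illustration and are not taken from gebv.

```python
def r_from_reliability(rel):
    """Diagonal element of R from a daughter-deviation reliability: 1/Rel* - 1."""
    return 1.0 / rel - 1.0


def pcg_solve(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A.

    Conjugate gradient with a Jacobi (diagonal) preconditioner.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    m_inv = [1.0 / A[i][i] for i in range(n)]     # inverse of the diagonal of A
    z = [m_inv[i] * r[i] for i in range(n)]
    p = list(z)
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        if sum(v * v for v in r) ** 0.5 < tol:    # converged
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x


def rblup(X, y, r_diag, lam=1.0):
    """Return (mu_hat, u_hat, gebv) for the R-BLUP mixed model equations.

    X      : n x p list of lists of genotypic coefficients
    y      : n observations
    r_diag : n diagonal elements of R
    lam    : multiplier of the identity block (assumed; 1.0 as printed)
    """
    n, p = len(X), len(X[0])
    w = [1.0 / r for r in r_diag]                 # diagonal of R^-1
    W = [[1.0] + list(row) for row in X]          # [1 X]
    size = p + 1
    lhs = [[sum(W[i][a] * w[i] * W[i][b] for i in range(n))
            for b in range(size)] for a in range(size)]
    for k in range(1, size):
        lhs[k][k] += lam                          # identity block on the SNP part
    rhs = [sum(W[i][a] * w[i] * y[i] for i in range(n)) for a in range(size)]
    sol = pcg_solve(lhs, rhs)
    mu_hat, u_hat = sol[0], sol[1:]
    gebv = [sum(X[i][k] * u_hat[k] for k in range(p)) for i in range(n)]
    return mu_hat, u_hat, gebv
```

A production solver would avoid forming the coefficient matrix explicitly and would iterate on the data instead; the dense form is used here only to keep the example short.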
G-BLUP: GEBV are obtained using selection index theory by solving the following set of equations (Nejati-Javaremi et al., 1997; VanRaden et al., 2009):

$$\hat{a} = G(G + R)^{-1}(y - \hat{\mu}),$$

where Z is the incidence matrix relating animals to the observations and G is the genomic relationship matrix.
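
As a minimal sketch of the selection index equation, the following code computes $\hat{a} = G(G + R)^{-1}(y - \hat{\mu})$ for a small example. It solves the linear system $(G + R)v = y - \hat{\mu}$ by Gaussian elimination rather than forming the explicit inverse, which is a simplification for illustration; the function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gblup(G, r_diag, y, mu_hat):
    """Return a_hat = G (G + R)^-1 (y - mu_hat) for diagonal R."""
    n = len(y)
    lhs = [[G[i][j] + (r_diag[i] if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    dev = [y[i] - mu_hat for i in range(n)]       # observations minus the mean
    v = solve_linear(lhs, dev)
    return [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
```

For two animals with a genomic relationship of 0.5, `gblup([[1.0, 0.5], [0.5, 1.0]], [1.0, 1.0], [3.0, 1.0], 2.0)` shrinks the deviations of +1 and -1 to about +1/3 and -1/3.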
Four steps are involved in G-BLUP: (1) calculation of base allele frequencies; (2) calculation of the traditional relationship matrix (A) and the genomic relationship matrix (G); (3) solving the selection index equations, which requires direct inversion of the G+R matrix; and (4) blending direct GEBV with parental average or EBV. The first three steps are computationally intensive and therefore challenging to implement. These steps were optimized for overall speed and memory requirements.
Calculation of base allele frequencies: Base allele frequencies are required for unbiased estimation of inbreeding. They were estimated according to Gengler et al. (2007) using an animal model. This method is simple and practical for large pedigrees, with almost the same accuracy as the alternative peeling method. One advantage of the method is that pedigree and genotyping errors can be accounted for. The mixed model equations are solved for one SNP at a time using the preconditioned conjugate gradient method. Parallel processing was implemented to reduce computing time: markers are distributed across parallel jobs.
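
The distribution of markers across jobs can be sketched as follows. This is only an illustration of the partitioning idea: threads stand in for gebv's parallel jobs, genotypes are assumed to be coded as 0, 1 or 2 copies of an allele, and the per-SNP task is a placeholder (the observed allele frequency) rather than the animal-model solution of Gengler et al. (2007). All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_markers(n_markers, n_jobs):
    """Contiguous (start, stop) marker ranges whose sizes differ by at most one."""
    base, extra = divmod(n_markers, n_jobs)
    ranges, start = [], 0
    for j in range(n_jobs):
        stop = start + base + (1 if j < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def observed_frequency(genotypes, k):
    """Placeholder per-SNP task: observed allele frequency of SNP k.

    Assumes genotypes are coded 0/1/2; the real task would solve one set of
    mixed model equations per SNP.
    """
    return sum(row[k] for row in genotypes) / (2.0 * len(genotypes))


def frequencies_in_parallel(genotypes, n_jobs):
    """Run the per-SNP task for every marker, one marker range per job."""
    n_markers = len(genotypes[0])

    def run_job(bounds):
        start, stop = bounds
        return [observed_frequency(genotypes, k) for k in range(start, stop)]

    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        parts = list(pool.map(run_job, split_markers(n_markers, n_jobs)))
    return [f for part in parts for f in part]   # results in marker order
```

Because each SNP is processed independently, the jobs need no communication until their results are concatenated.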
Calculation of traditional and genomic relationship matrices: The numerator relationship matrix is efficiently calculated by Colleau’s indirect method (Colleau, 2002). The genomic relationship matrix is calculated as

$$G = \frac{XX'}{2\sum_i p_i(1 - p_i)},$$

where $p_i$ is the allele frequency of the ith SNP. Constructing G with a conventional matrix multiplication algorithm can be very time-consuming because X is dense and very large. For multiplication or inversion of large, dense matrices, the computational bottleneck is memory access time. To expedite matrix multiplication, a blocking technique was implemented: the X matrix was divided into 8×8 sub-matrices. This technique increases temporal locality of memory access by keeping sub-matrices in fast memory (cache). The optimum block size was determined empirically. Furthermore, distributed-memory parallel processing was implemented, in which an equal number of SNP is assigned to each parallel job. To save memory, elements of G are stored as 4-byte variables (float), but to avoid rounding error, calculations are carried out using 8-byte variables (double).
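
The blocking and mixed-precision ideas can be sketched in a few lines. The code below computes $XX'/(2\sum_i p_i(1-p_i))$ tile by tile, accumulates each tile in double precision (Python floats) and stores the result in an `array('f')`, whose elements are C floats (4 bytes on common platforms). It is an assumption-laden illustration: X is taken as given (whether and how it is centred is not specified here), pure Python gains nothing from cache blocking, and the function name and row-major layout are hypothetical.

```python
from array import array


def genomic_relationship(X, freqs, block=8):
    """Return G = X X' / (2 * sum p_i (1 - p_i)) as a flat row-major float array.

    X     : n x p list of lists of genotypic coefficients
    freqs : p allele frequencies
    block : tile edge length (8 here; the optimum is machine dependent)
    """
    n, p = len(X), len(X[0])
    scale = 2.0 * sum(f * (1.0 - f) for f in freqs)
    G = array('f', [0.0]) * (n * n)               # single-precision storage
    for i0 in range(0, n, block):
        i1 = min(i0 + block, n)
        for j0 in range(i0, n, block):            # upper triangle of tiles only
            j1 = min(j0 + block, n)
            # Double-precision accumulator for the current tile.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, p, block):         # walk along the SNP dimension
                k1 = min(k0 + block, p)
                for i in range(i0, i1):
                    xi = X[i]
                    for j in range(j0, j1):
                        xj = X[j]
                        s = 0.0
                        for k in range(k0, k1):
                            s += xi[k] * xj[k]
                        acc[i - i0][j - j0] += s
            for i in range(i0, i1):
                for j in range(j0, j1):
                    val = acc[i - i0][j - j0] / scale
                    G[i * n + j] = val            # rounded to float on store
                    G[j * n + i] = val            # mirror by symmetry
    return G
```

Only tiles on or above the diagonal are computed and the rest is filled by symmetry. The same loop over SNP blocks is where a distributed version could split the work, since partial products over disjoint sets of SNP simply add up.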
Solving selection index equations: The selection index equations are solved twice, once incorporating A and once incorporating G. Here, the most intensive operations are matrix multiplication and matrix inversion. When the A matrix is incorporated, the matrix multiplication is done by the indirect method, which saves a substantial amount of time. With G, block matrix multiplication is applied as explained above. Matrix inversion is another important computational bottleneck. The same blocking technique used for matrix multiplication is used to speed up the inversion. The symmetry of the system of equations further reduces the number of operations involved in the inversion. When more than one trait is run, each trait can be assigned to a single job, allowing for parallel processing.
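
One common way to exploit symmetry when inverting a matrix such as G+R is a Cholesky factorization, which needs roughly half the arithmetic of a general LU factorization. The sketch below is an unblocked illustration of that idea and is not claimed to be the algorithm used in gebv; it assumes the matrix is symmetric positive definite, and the function names are hypothetical.

```python
import math


def cholesky_lower(S):
    """Lower-triangular L with L L' = S; only the lower triangle of S is read."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][i] = math.sqrt(s)            # ValueError if S is not positive definite
            else:
                L[i][j] = s / L[j][j]
    return L


def spd_inverse(S):
    """Inverse of a symmetric positive definite matrix via its Cholesky factor."""
    n = len(S)
    L = cholesky_lower(S)
    inv = [[0.0] * n for _ in range(n)]
    for c in range(n):
        # Forward substitution: L z = e_c (column c of the identity).
        z = [0.0] * n
        for i in range(n):
            rhs = 1.0 if i == c else 0.0
            z[i] = (rhs - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
        # Back substitution: L' x = z gives column c of the inverse.
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
        for r in range(n):
            inv[r][c] = x[r]
    return inv
```

A blocked variant would apply the same recurrences to sub-matrices instead of scalars so that each tile stays in cache while it is being used.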
3 FEATURES
gebv has a number of important features, some of which are highlighted below:
• Is memory efficient and relatively fast;
• Uses distributed-memory parallel processing;
• Identifies the required elements of the A and G matrices for multiple traits and calculates them once;
• Estimates the effect of each SNP using R-BLUP;
• Obtains GEBV for newly genotyped animals quickly from previous SNP solutions;
• Uses an indirect method (Colleau, 2002) to compute the relationship matrix and for fast matrix multiplication;
• Uses the preconditioned conjugate gradient method to solve for base allele frequencies and SNP effects;
• Uses fast block matrix multiplication;
• Uses fast block matrix inversion;
• Filters SNP based on minor allele frequency or a predefined list of SNP.
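
Two of the listed features are simple enough to sketch: filtering SNP by minor allele frequency, and obtaining a direct genomic value for a newly genotyped animal as the product of its genotypic coefficients and previously estimated SNP effects. The function names, the threshold argument and the data layout are hypothetical and do not reflect gebv's control-file parameters.

```python
def filter_by_maf(freqs, min_maf):
    """Indices of SNP whose minor allele frequency is at least min_maf."""
    return [k for k, f in enumerate(freqs) if min(f, 1.0 - f) >= min_maf]


def gebv_from_snp_solutions(x_new, u_hat):
    """Direct genomic value of one animal: sum over SNP of x_k * u_hat_k.

    x_new : genotypic coefficients of the new animal, coded like the rows of X
    u_hat : SNP effects from an earlier R-BLUP run
    """
    return sum(x * u for x, u in zip(x_new, u_hat))
```

For example, `filter_by_maf([0.5, 0.02, 0.97, 0.9], 0.05)` keeps SNP 0 and 3, and no equations need to be solved again to score a new animal.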
The behavior of the program can be controlled by changing parameters in the control file. Details of the parameters are given in the user’s guide. The number of parallel jobs can be specified by the user, so computer resources can be managed for optimal use.
4 IMPLEMENTATION AND EFFICIENCY
The program is written in C++ and is portable to multiple operating systems. Executable files are currently available for Windows and Linux platforms. The program was tested on a data set consisting of 69 traits, 38,416 SNP, 21,961 genotyped animals (for most traits, ~30% of the animals in the estimation group and ~70% in the prediction group) and 78,699 animals in the pedigree file. Computing time for the genomic evaluation using G-BLUP on an AMD Opteron server running at 2.6 GHz with 4 processors and 16 GB RAM was less than 8 hours. The random access memory requirement was around 10 GB.
5 CONCLUSIONS
With the rapid progress and decreasing cost of high-throughput genotyping techniques in livestock, genomic selection is expected to become an important tool for selection of young animals, traits with low heritability and traits that are difficult to measure. For G-BLUP, matrix inversion and multiplication seem to be the main computational bottlenecks. In the future, as the number of genotyped animals increases, this method may become very time-consuming and genomic evaluations may need days to complete. By implementing parallel processing, block matrix inversion and block matrix multiplication (which use high-speed memory efficiently) in the gebv software, genomic evaluations using direct matrix inversion can be run within a reasonable time if the number of genotyped animals in the estimation group is not too large. The costs of matrix inversion and multiplication are proportional to $n^3$. Therefore, when the estimation group doubles, about 8 times more CPU time is needed. If n grows sufficiently large, the equivalent method might become computationally infeasible. However, GEBV for a large number of animals can still be estimated efficiently using R-BLUP, but an approximate method to obtain the genomic reliabilities will be needed.
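
The factor of 8 follows directly from the cubic cost. Writing the cost as $c\,n^3$, where $c$ is an unspecified machine-dependent constant introduced here only for illustration, doubling the size of the estimation group gives

$$\frac{c\,(2n)^3}{c\,n^3} = 2^3 = 8.$$

More generally, multiplying n by a factor k multiplies the cost of these steps by about $k^3$, regardless of the value of $c$.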
gebv is currently limited to a linear model, but the future plan is to extend the application to include nonlinear models.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge financial support from L’Alliance Boviteq Inc. (SEMEX Alliance, Canada) and the Natural Sciences and Engineering Research Council of Canada (Collaborative Research and Development grant).
REFERENCES
Colleau, J.J. (2002) An indirect approach to the extensive calculation of relationship coefficients. Genet. Sel. Evol. 34, 409-421.
Habier, D., Fernando, R.L., Dekkers, J.C.M. (2007) The impact of genetic relationship information on genome-assisted breeding values. Genetics 177, 2389-2397.
Meuwissen, T.H.E., Hayes, B.J., Goddard, M.E. (2001) Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819-1829.
Nejati-Javaremi, A., Smith, C., Gibson, J.P. (1997) Effect of total allelic relationship on accuracy of evaluation and response to selection. J. Anim. Sci. 75, 1738-1745.
VanRaden, P.M., Van Tassell, C.P., Wiggans, G.R., Sonstegard, T.S., Schnabel, R.D., Taylor, J.F., Schenkel, F.S. (2009) Invited review: Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92, 16-24.
Xu, S. (2003) Estimating polygenic effects using markers of the entire genome. Genetics 163, 789-801.