10.07.2015 Views

Asymptotic properties of the likelihood ratio test statistics test ...

Asymptotic properties of the likelihood ratio test statistics test ...

Asymptotic properties of the likelihood ratio test statistics test ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

The Canadian Journal <strong>of</strong> StatisticsVol. 34, No. 4, 2007, Pages ???-???La revue canadienne de statistique<strong>Asymptotic</strong> <strong>properties</strong> <strong>of</strong> <strong>the</strong> <strong>likelihood</strong> <strong>ratio</strong><strong>test</strong> <strong>statistics</strong><strong>test</strong> <strong>statistics</strong> in affected-sib-pair analysisZeny Z. Feng ∗ , Jiahua Chen and Mary E. ThompsonKey words and phrases: ASP method, identical by descent (IBD), <strong>likelihood</strong> <strong>ratio</strong> <strong>test</strong> (LRT),linkage analysis, possible triangle constraint.MSC 2000 : Primary 92D10; secondary 92D30.Abstract: In an Affected-Sib-Pair (ASP) genetic linkage analysis, IBD data for affected sib pairs are routinelycollected at a large number <strong>of</strong> markers along chromosomes. Under very general genetic assumptions,<strong>the</strong> IBD distribution at each marker satisfies <strong>the</strong> possible triangle constraint. Statistical analysis <strong>of</strong> IBDdata hence should utilize this information to improve efficiency. At <strong>the</strong> same time, this constraint renders<strong>the</strong> usual regularity conditions for <strong>likelihood</strong> based statistical methods unsatisfied. In this paper, authorsstudy <strong>the</strong> asymptotic <strong>properties</strong> <strong>of</strong> <strong>the</strong> <strong>likelihood</strong> <strong>ratio</strong> <strong>test</strong> under <strong>the</strong> possible triangle constraint. Thelimiting distribution <strong>of</strong> <strong>the</strong> LRT statistic based on data from a single locus is derived. The precision <strong>of</strong> <strong>the</strong>limiting distribution and <strong>the</strong> power <strong>of</strong> <strong>the</strong> <strong>test</strong> are investigated by simulation. Fur<strong>the</strong>r, authors study <strong>the</strong><strong>test</strong> based on <strong>the</strong> supremum <strong>of</strong> <strong>the</strong> LRT <strong>statistics</strong> over <strong>the</strong> markers distributed throughout a chromosome.Instead <strong>of</strong> deriving a limiting distribution <strong>the</strong>oretically, a mixture <strong>of</strong> chisquare distributions to is used toapproximate <strong>the</strong> true distribution. The simulation results show that this approach has desirable simplicityand satisfactory precision.Title in French: we can supply thisRésumé : Insérer votre résumé ici. In an Affected-Sib-Pair (ASP) genetic linkage analysis, IBD data foraffected sib pairs are routinely collected at a large number <strong>of</strong> markers along chromosomes. Under verygeneral genetic assumptions, <strong>the</strong> IBD distribution at each marker satisfies <strong>the</strong> possible triangle constraint.Statistical analysis <strong>of</strong> IBD data hence should utilize this information to improve efficiency. At <strong>the</strong> sametime, this constraint renders <strong>the</strong> usual regularity conditions for <strong>likelihood</strong> based statistical methods unsatisfied.In this paper, authors study <strong>the</strong> asymptotic <strong>properties</strong> <strong>of</strong> <strong>the</strong> <strong>likelihood</strong> <strong>ratio</strong> <strong>test</strong> under <strong>the</strong>possible triangle constraint. The limiting distribution <strong>of</strong> <strong>the</strong> LRT statistic based on data from a singlelocus is derived. The precision <strong>of</strong> <strong>the</strong> limiting distribution and <strong>the</strong> power <strong>of</strong> <strong>the</strong> <strong>test</strong> are investigated bysimulation. Fur<strong>the</strong>r, authors study <strong>the</strong> <strong>test</strong> based on <strong>the</strong> supremum <strong>of</strong> <strong>the</strong> LRT <strong>statistics</strong> over <strong>the</strong> markersdistributed throughout a chromosome. Instead <strong>of</strong> deriving a limiting distribution <strong>the</strong>oretically, a mixture<strong>of</strong> chisquare distributions to is used to approximate <strong>the</strong> true distribution. The simulation results showthat this approach has desirable simplicity and satisfactory precision.1. INTRODUCTIONThe sib-pair method is a non-parametric statistical method for linkage analysis introduced inPenrose (1935). The analysis is based on genetic information on sib pairs and <strong>the</strong>ir parents. In1


The paper is organized as follows. In Section 2, we derive <strong>the</strong> limiting distribution <strong>of</strong> <strong>the</strong> LRTstatistic under <strong>the</strong> constraint at a single locus. Simulation results for assessing <strong>the</strong> precision and<strong>the</strong> power follow. In Section 3, we study <strong>the</strong> empirical distribution <strong>of</strong> <strong>the</strong> supremum <strong>of</strong> <strong>the</strong> LRT<strong>statistics</strong>.2. Limiting distribution <strong>of</strong> <strong>the</strong> LRT statisticIn this section, we derive <strong>the</strong> limiting distribution <strong>of</strong> <strong>the</strong> LRT statistic. The precision <strong>of</strong> <strong>the</strong>limiting distribution is examined by a simulation study. The power <strong>of</strong> <strong>the</strong> LRT with possibletriangle constraint is also investigated via simulation.2.1 The limiting distribution <strong>of</strong> <strong>the</strong> LRT statistic under <strong>the</strong> possible triangle constraintSuppose N independent sib pairs affected with a certain disease are recruited and <strong>the</strong> IBD valuesat a given marker are collected. Let N 0 , N 1 and N 2 be observed frequencies <strong>of</strong> IBD values equaling0, 1 and 2. The random variables N 0 and N 1 are jointly multinomially distributed with parametersπ 0 , π 1 and π 2 = 1 − π 0 − π 1 . The log-<strong>likelihood</strong> function <strong>of</strong> π 0 and π 1 isl(π 0 , π 1 ) = N 0 log π 0 + N 1 log π 1 + (N − N 0 − N 1 ) log(1 − π 0 − π 1 ).It is well known that ˆπ 0 = N 0N and ˆπ 1 = N 1Nare <strong>the</strong> unique maximum <strong>likelihood</strong> estimates (MLE) <strong>of</strong>π 0 and π 1 with no constraints imposed. To <strong>test</strong> <strong>the</strong> null hypo<strong>the</strong>sis <strong>of</strong> no linkage, <strong>the</strong> log-<strong>likelihood</strong><strong>ratio</strong> statistic is given byΛ N = 2N[ˆπ 0 log(4ˆπ 0 ) + ˆπ 1 log(2ˆπ 1 ) + (1 − ˆπ 0 − ˆπ 1 ) log(4 − 4ˆπ 0 − 4ˆπ 1 )]. (1)By a classical result in Wilks (1938), Λ N in (1) converges to χ 2 2 in distribution as N → ∞ under<strong>the</strong> null hypo<strong>the</strong>sis <strong>of</strong> no linkage.As we pointed out earlier, <strong>the</strong> IBD distribution <strong>of</strong> an affected sib pair satisfies <strong>the</strong> possibletriangle constraint. It is advantageous to reject <strong>the</strong> null hypo<strong>the</strong>sis only if <strong>the</strong> deviation <strong>of</strong> <strong>the</strong> IBDdistribution is in <strong>the</strong> direction <strong>of</strong> <strong>the</strong> triangle region. Thus, <strong>the</strong> <strong>test</strong> for linkage becomes a <strong>test</strong> <strong>of</strong>H 0 : (π 0 , π 1 , π 2 ) = (1/4, 1/2, 1/4)vsH A : 2π 0 ≤ π 1 ≤ 1/2.The problem <strong>of</strong> <strong>the</strong> computation <strong>of</strong> <strong>the</strong> restricted MLE, denoted as ˜π, under <strong>the</strong> possible triangleconstraint was studied in Holmans (1993). The numerical solution can be easily obtained followinga simple procedure. The <strong>likelihood</strong> <strong>ratio</strong> statistic under <strong>the</strong> possible triangle constraint is given by˜Λ N = 2N[ˆπ 0 log(4˜π 0 ) + ˆπ 1 log(2˜π 1 ) + (1 − ˆπ 0 − ˆπ 1 ) log(4 − 4˜π 0 − 4˜π 1 )]. (2)Due to <strong>the</strong> violation <strong>of</strong> regularity conditions, <strong>the</strong> limiting distribution <strong>of</strong> ˜Λ N differs from <strong>the</strong> usualchisquare. We derive its limiting distribution as follows.For simplicity, we orthogonalize parameters by transforming from <strong>the</strong> (π 0 , π 1 ) to <strong>the</strong> (w 0 , w 1 )space, wherew 0 = √ 2(2π 0 + π 1 − 1), w 1 = 2π 1 − 1.See Figure 1 for illust<strong>ratio</strong>n. By this transformation, <strong>the</strong> MLEs ˆπ and <strong>the</strong> restricted MLEs ˜π aretransformed to ŵ and ˜w respectively. Under <strong>the</strong> null hypo<strong>the</strong>sis, it is easy to show thatvar(ˆπ 0 ) = 316N , var(ˆπ 1) = 14N , cov(ˆπ 0, ˆπ 1 ) = − 18N .3


w 1−1.414II000000001111111100000000111111110000000011111111 φ000000001111111100000000111111110000000011111111 ρ000000001111111100000000111111110000000011111111Ι0000000011111111 .0000000011111111(−1.414,−1) (w 0 , w 1 )( w 0 , w 1)IIIw 1 = -0.707w0w 0w0ΙΙΙw 1. Iρφ000000000111111111000000000111111111000000000111111111000000000111111111000000000111111111ΙΙFigure 2: Rotate <strong>the</strong> (w 0 , w 1 ) plane such that Cone I is <strong>the</strong> first quadrant.For x > 0, we work for an expression <strong>of</strong> P (˜Λ N ≥ x) under <strong>the</strong> null hypo<strong>the</strong>sis. Let θ be<strong>the</strong> angle between <strong>the</strong> positive w 0 axis and <strong>the</strong> vector (ŵ 0 , ŵ 1 ). Note that under <strong>the</strong> normalityassumption on ( √ Nŵ 0 , √ Nŵ 1 ), θ has a uniform distribution in [0, 2π] and is independent <strong>of</strong> <strong>the</strong>norm ρ = √ N(ŵ0 2 + ŵ2 1 ). Decomposing <strong>the</strong> probability by conditioning on <strong>the</strong> position <strong>of</strong> θ, wemay writeP (˜Λ N ≥ x) = P (θ ∈ Cone I) P (˜Λ N ≥ x|θ ∈ Cone I)+ P (θ ∈ Cone II) P (˜Λ N ≥ x|θ ∈ Cone II)+ P (θ ∈ Cone III) P (˜Λ N ≥ x|θ ∈ Cone III)+ P (θ ∈ Cone IV) P (˜Λ N ≥ x|θ ∈ Cone IV)Suppose (ŵ 0 , ŵ 1 ) falls in <strong>the</strong> triangle region. In this case, we have ˜Λ N = N(ŵ 2 0 + ŵ 2 1) = ρ 2 .Since ρ and θ are independent, we haveP (˜Λ N ≥ x|θ ∈ Cone IV) = P (χ 2 2 ≥ x),where χ 2 2 stands for a random variable with a χ 2 2 distribution. This takes care <strong>of</strong> <strong>the</strong> fourth termin (4).For (ŵ 0 , ŵ 1 ) falling in Cone I, ( ˜w 0 , ˜w 1 ) is <strong>the</strong> projection <strong>of</strong> (ŵ 0 , ŵ 1 ) on <strong>the</strong> line w 1 = √ 22 w 0.See Figure 2. Let φ be <strong>the</strong> angle between <strong>the</strong> line w 1 = √ 22 w 0 and <strong>the</strong> vector (ŵ 0 , ŵ 1 ). Then˜Λ N = ρ 2 cos 2 φ,and ρ 2 cos 2 φ has a χ 2 1 distribution.Rotate <strong>the</strong> (w 0 , w 1 ) plane as we show in Figure 2, such that Cone I is <strong>the</strong> first quadrant (Q 1 )<strong>of</strong> <strong>the</strong> rotated plane. By circular symmetry <strong>of</strong> <strong>the</strong> distribution <strong>of</strong> (ŵ 0 , ŵ 1 ) and <strong>the</strong> periodicity<strong>properties</strong> <strong>of</strong> cos φ,P (˜Λ N ≥ x|θ ∈ Cone I) = P (ρ 2 cos 2 φ ≥ x|φ ∈ Q 1 ) = P (χ 2 1 ≥ x).Similarly, P (˜Λ N ≥ x|θ ∈ Cone II) = P (χ 2 1 ≥ x). For (ŵ 0 , ŵ 1 ) falling in Cone III, <strong>the</strong> corresponding( ˜w 0 , ˜w 1 ) = 0. Thus we have P (˜Λ N ≥ x|θ ∈ Cone III) = 0.In summary, <strong>the</strong> expression <strong>of</strong> ˜Λ N can be written as:⎧ √ √21N( ⎪⎨ 3ŵ0 + 3ŵ1) 2 + o p (1), when (ŵ 0 , ŵ 1 ) ∈ Cone I.˜Λ N = Nŵ0 2 + o p (1),when (ŵ 0 , ŵ 1 ) ∈ Cone II.(5)o ⎪⎩ p (1),when (ŵ 0 , ŵ 1 ) ∈ Cone III.N(ŵ0 2 + ŵ1) 2 + o p (1), when (ŵ 0 , ŵ 1 ) ∈ Cone IV,(4)5


Table 3: The power <strong>of</strong> LRT <strong>of</strong> restricted and unrestricted model.N=100 N=200Size <strong>of</strong> Test Restricted Unrestricted Restricted Unrestricted0.05 0.860 0.680 0.988 0.9510.01 0.669 0.486 0.949 0.8590.005 0.576 0.399 0.912 0.8160.001 0.396 0.217 0.816 0.6800.0001 0.190 0.098 0.621 0.449<strong>statistics</strong> Λ N without constraint and ˜Λ N with constraint. The powers at a number <strong>of</strong> significancelevels are computed based on <strong>the</strong> outcomes <strong>of</strong> 10,000 sets <strong>of</strong> sample. The powers <strong>of</strong> <strong>the</strong> <strong>test</strong> withpossible triangle constraint (restricted) and without possible triangle constraint (unrestricted) aresummarized in Table 3. The simulation result reveals <strong>the</strong> expected power gain by <strong>the</strong> constrained<strong>likelihood</strong> <strong>ratio</strong> <strong>test</strong>, which is particularly striking when N = 100.3. Approximating <strong>the</strong> sample distribution <strong>of</strong> <strong>the</strong> supremum <strong>of</strong> <strong>the</strong> LRT <strong>statistics</strong>.With <strong>the</strong> advance <strong>of</strong> modern technology, IBD data <strong>of</strong> ASPs are usually collected at a largenumber <strong>of</strong> markers over stretches <strong>of</strong> chromosomes. The IBD values are related to each o<strong>the</strong>rthrough a crossover process along <strong>the</strong> chromosomes. Thus, ˜Λ N values computed at <strong>the</strong>se markersform a stochastic process indexed by <strong>the</strong> location <strong>of</strong> <strong>the</strong>se markers on <strong>the</strong> chromosome. We maywrite <strong>the</strong> process as {˜Λ N (t), 0 ≤ t ≤ T } with t being <strong>the</strong> marker locus in cM and T being <strong>the</strong>total genetic length <strong>of</strong> a chromosome segment under investigation. Here,“cM” is an abbreviationfor centiMorgan, a unit <strong>of</strong> genetic length. During meiosis (<strong>the</strong> sex cell division process), a pair <strong>of</strong>homologous (same type, same length) chromosomes will duplicate and exchange genetic materialby crossing-over. The number <strong>of</strong> crossovers between two loci depends on <strong>the</strong> distance between<strong>the</strong>m. Thus <strong>the</strong> expected number <strong>of</strong> crossovers occurring in a chromosomal interval serves as ameasure <strong>of</strong> chromosomal length. In a genetic map, if two loci are 1cM apart, that means, among100 gametes (eggs or sperms) that pass from randomly selected parents to <strong>of</strong>fspring, it is expectedto observe 1 crossover between <strong>the</strong> two loci. If a chromosome segment under investigation containsa disease gene, an unusually large value <strong>of</strong> ˜Λ N (t) is expected at t near <strong>the</strong> disease locus. Thus, <strong>the</strong>maximum value <strong>of</strong> ˜Λ N (t) over t ∈ [0, T ] is a natural <strong>test</strong> statistic for linkage. The next problemis to determine <strong>the</strong> sample distribution <strong>of</strong> <strong>the</strong> supremum <strong>of</strong> ˜Λ N (t) under <strong>the</strong> null hypo<strong>the</strong>sis <strong>of</strong>no linkage. Given <strong>the</strong> threshold value, say x, <strong>the</strong> type I error <strong>of</strong> <strong>the</strong> <strong>test</strong> for each individualchromosome isP 0 { max ˜Λ N (t) ≥ x},0≤t≤Twhere P 0 represents <strong>the</strong> probability under <strong>the</strong> null hypo<strong>the</strong>sis.At each locus t, ˜Λ N (t) is determined by <strong>the</strong> value <strong>of</strong> unrestricted processed ŵ 0 (t) and ŵ 1 (t)defined in (5). The limiting distribution <strong>of</strong> ˜Λ N (t) is a mixture <strong>of</strong> χ 2 1, χ 2 2 and a point mass at 0. Itis noted that, for large N, √ Nŵ 0 (t) and √ Nŵ 1 (t) are asymptotically two independent stationaryGaussian processes. Thus, <strong>the</strong> process ˜Λ N is built up from <strong>the</strong> underlying stationary Gaussianprocesses. The limiting distribution <strong>of</strong> max 0≤t≤T ˜ΛN (t) is difficult to obtain <strong>the</strong>oretically. Thus, wedo not have a limiting distribution to be used to compute approximate sample quantiles. Instead,we use simulation to find a simple method to approximate <strong>the</strong> sample distribution.We placed markers at points <strong>of</strong> a 1cM grid, 3cM grid, 5cM grid, and 10cM grid along ahypo<strong>the</strong>tical chromosome <strong>of</strong> some length. We let ∆t denote <strong>the</strong> marker density, where, for example,∆t = 1cM means that markers are placed at points <strong>of</strong> a 1cM grid. A crossover process was <strong>the</strong>nsimulated as a pure Poisson process according to <strong>the</strong> Haldane mapping function (Haldane, 1919).7


Figure 3: Proportions <strong>of</strong> zeroes for different chromosomal length (T ) and marker density (∆t).markers are 1cM apartmarkers are 3cM apartproportion <strong>of</strong> 00.0 0.2 0.4proportion <strong>of</strong> 00.0 0.2 0.420 60 100 140chromosomal length20 60 100 140chromosomal lengthmarkers are 5cM apartmarkers are 10cM apartproportion <strong>of</strong> 00.0 0.2 0.4proportion <strong>of</strong> 00.0 0.2 0.420 60 100 140chromosomal length20 60 100 140chromosomal lengthMarker data were consequently determined along <strong>the</strong> chromosome. We set <strong>the</strong> overall chromosomelength equal to 150cM, as <strong>the</strong> longest mean genetic length (mean <strong>of</strong> male and female) <strong>of</strong> a humanchromosome arm is about 150cM (Morton, 1991). The IBD data at each marker for each sibpair were <strong>the</strong>n obtained, with markers assumed fully informative. We set <strong>the</strong> number <strong>of</strong> sib pairsat 500 and generated 10,000 samples for each setting <strong>of</strong> marker density. In applications, variouslengths <strong>of</strong> chromosome segments might be investigated. We aim at providing precise approximatesample quantiles <strong>of</strong> max 0≤t≤T ˜ΛN (t) for T = 10cM, 20cM, . . . , 150cM, along with different markerdensities. The sample quantile approximation will be discussed later for T values in between. Forconvenience, we denote X(T, ∆t) = max 0≤t≤T ˜ΛN (t, ∆t). Notation <strong>of</strong> ∆t only comes in <strong>the</strong> casewhere multiple markers on <strong>the</strong> same chromosomal arm are <strong>test</strong>ed.Let X i (T, ∆t) be <strong>the</strong> observed X(T, ∆t) based on <strong>the</strong> ith simulated sample. For each givenvalue <strong>of</strong> T and ∆t , we may write <strong>the</strong> cumulative distribution function <strong>of</strong> X(T, ∆t) as:G T,∆t (x) = P {X(T, ∆t) ≤ x} = wF T,∆t (x) + (1 − w), for x > 0, (7)where 1 − w is <strong>the</strong> point mass <strong>of</strong> X i (T, ∆t) at 0. As <strong>the</strong> chromosomal length T increases, 1 − wdecreases. The empty dots in Figure 3 show <strong>the</strong> empirical proportions <strong>of</strong> zeroes as a function <strong>of</strong> Tand ∆t.Using <strong>the</strong> simulated X i (T, ∆t) for each given T and ∆t, <strong>the</strong> corresponding quantiles <strong>of</strong> <strong>the</strong>distribution <strong>of</strong> X(T, ∆t) can be approximated by <strong>the</strong> sample quantiles. In applications, we mayprovide a table <strong>of</strong> simulated quantiles for various combinations <strong>of</strong> T and ∆t and a number <strong>of</strong> prechosensignificance levels. This is not very convenient whenever a new significance level is required.A more useful solution is to find a simple family <strong>of</strong> parametric distributions that provide accurateapproximations to <strong>the</strong> distributions <strong>of</strong> X(T, ∆t) with parameter values being smooth functions <strong>of</strong>8


Figure 4: Proportions <strong>of</strong> zeroes for different chromosomal length (T ) and marker density ∆t = 2cM.Marksers are 2cM apartproportion <strong>of</strong> 00.0 0.1 0.2 0.3 0.420 40 60 80 100 120 140chromosomal lengthT and ∆t. Thus, for any given values <strong>of</strong> T and ∆t in an application and <strong>the</strong> observed value <strong>of</strong>X(T, ∆t), one can use this distribution family to compute an approximate p-value conveniently.Recall that when T = 0, <strong>the</strong> marginal limiting distribution <strong>of</strong> <strong>the</strong> LRT statistic is a mixture <strong>of</strong>χ 2 1 and χ 2 2 distributions, and a distribution concentrated at 0. Based on simulation results at <strong>the</strong>selected T and ∆t, we propose to use <strong>the</strong> following distribution family:whereF T,∆t (x) =∫ x0f(u; p, d 1 , d 2 , T, ∆t)duf(x; p, d 1 , d 2 , T, ∆t) = pf(x; d 1 , T, ∆t) + (1 − p)f(x; d 2 , T, ∆t),with f(x; d, T, ∆t) being a χ 2 ddensity, and p being <strong>the</strong> mixing proportion, to approximate <strong>the</strong>sample distribution <strong>of</strong> <strong>the</strong> non-zero portion <strong>of</strong> X(T, ∆t). When d is not an integer, we interpretf(x; d, T, ∆t) as a Gamma distribution with parameters (d/2, 2). Our task now is to find smoothfunctions <strong>of</strong> T and ∆t for parameters p, d 1 and d 2 . Based on <strong>the</strong> simulated data, with <strong>the</strong>conside<strong>ratio</strong>n <strong>of</strong> simplicity, we propose <strong>the</strong> following function to be used:logit(p) = a 1 + a 2 T + a 3√∆t,d 1 = b 1 + b 2 T + b 3√T + b4√∆t,d 2 = c 1 + c 2 T + c 3√T + c4√∆t,(8)where logit(p) = log(p/(1 − p)). The values <strong>of</strong> a = (a 1 , a 2 , a 3 ), b = (b 1 , b 2 , b 3 , b 4 ) and c =(c 1 , c 2 , c 3 , c 4 ) will be chosen to best fit <strong>the</strong> simulated sample distributions <strong>of</strong> X(T, ∆t). The probability<strong>of</strong> w = 1 − P {X(T, ∆t) = 0} will be modeled later.We now regard <strong>the</strong> simulated nonzero values <strong>of</strong> X i (T, ∆t), i = 1, 2, . . . , n = 10, 000 for T =0, 10, . . . , 150 and ∆t = 1, 3, 5, 10 as a random sample from density f(x; p, d 1 , d 2 , T, ∆t) and considerestimating a, b and c by maximum <strong>likelihood</strong>. At each given T and ∆t, <strong>the</strong> log-<strong>likelihood</strong> function9


is given by:l(a, b, c; T, ∆t) =n∑log{pf(x i ; d 1 , T, ∆t) + (1 − p)f(x i ; d 2 , T, ∆t)}.i=1Note that p, d 1 and d 2 are functions <strong>of</strong> a, b and c for each given T and ∆t. Due to <strong>the</strong> verycomplex relationship between X(T 1 , ∆t 1 ) and X(T 2 , ∆t 2 ) for any T 1 ≠ T 2 or ∆t 1 ≠ ∆t 2 , wemaximize, instead <strong>of</strong> <strong>the</strong> joint log-<strong>likelihood</strong> function, <strong>the</strong> following “pseudo” log-<strong>likelihood</strong>:l(a, b, c) = ∑ ∑l(a, b, c; T, ∆t)T ∆twith summation over T = 0, 10, · · · , 150 and ∆t = 1, 3, 5, 10. Since when T = 0 and ∆t = 0, <strong>the</strong>limiting distribution <strong>of</strong> X(0, 0) is known to have p = 0.836, d 1 = 1 and d 2 = 2, our maximizationwill be done under <strong>the</strong> constraints: a 1 = 1.629, b 1 = 1 and c 1 = 2.We now use <strong>the</strong> EM algorithm (Dempster, Laird and Rubin, 1977) for <strong>the</strong> numerical computation.Let Z i be a latent variable representing <strong>the</strong> mixture component <strong>of</strong> <strong>the</strong> ith observation. Thecomplete log-<strong>likelihood</strong> function is:l c (a, b, c) =n∑ ∑ ∑n∑z i log f(x i ; d 1 ; T, ∆t) + − z i )i=1 T ∆ti=1(1 ∑ Tn∑n∑+{ z i } log(p) + {n − z i } log(1 − p).i=1i=1∑log f(x i ; d 2 ; T, ∆t)∆tGiven initial values <strong>of</strong> parameters a, b and c, we compute <strong>the</strong> conditional expected values <strong>of</strong> Z i in<strong>the</strong> above set up in <strong>the</strong> E-step. In <strong>the</strong> M-step, we update <strong>the</strong> parameter values a, b and c as <strong>the</strong>maximizer <strong>of</strong> <strong>the</strong> complete <strong>likelihood</strong>. The E-step and M-step are iterated until convergence.The E-M algorithm converges at â 2 = −0.048, â 3 = 0.590, ˆb 2 = −0.016, ˆb 3 = 0.241, ˆb 4 =−0.115, ĉ 2 = −0.016, ĉ 3 = 0.442 and ĉ 2 = −0.519. With (8), we find a set <strong>of</strong> parameters <strong>of</strong> p, d 1and d 2 , for any given value <strong>of</strong> T and ∆t. Naturally, <strong>the</strong> corresponding distribution may not be agood approximation for <strong>the</strong> distribution <strong>of</strong> X(T, ∆t) when T is much larger than 150cM.We approximate <strong>the</strong> proportion <strong>of</strong> 0 observations 1 − w by a function <strong>of</strong> <strong>the</strong> form1 − w(T, ∆t) = 0.402 exp{α 1 T + α 2 ∆t + α 3 T · ∆t}.This function is motivated by <strong>the</strong> requirement <strong>of</strong> w(0, 0) = 0.598 (<strong>the</strong> total weight <strong>of</strong> <strong>the</strong> twochisquare components for T = 0 and ∆t = 0) and <strong>the</strong> general trend <strong>of</strong> <strong>the</strong> observed proportions.By fitting this model to <strong>the</strong> observed zero proportions, we getw(T, ∆t) = 1 − 0.402 exp{−0.052T + 0.018∆t + 0.002T · ∆t}.The solid line in Figure 3 shows <strong>the</strong> fitted values. Here, we fixed <strong>the</strong> coefficient <strong>of</strong> <strong>the</strong> exponentialterm to be 0.402, which is <strong>the</strong> portion <strong>of</strong> zeroes when T = 0 and ∆t = 0 given by our derivedmarginal limiting distribution <strong>of</strong> <strong>the</strong> LRT statistic (6). Figure 3 shows that, although <strong>the</strong> fit isnot quite as good when T = 10cM, this simple model gives a very good fit when T ≥ 20cM forvarious ∆t. In Figure 4, we plot <strong>the</strong> empirical proportion <strong>of</strong> 0 (empty dots) when ∆t = 2cM foreach T = 10, . . . , 150cM. The solid line is <strong>the</strong> predicted proportion <strong>of</strong> 0 for each each T based onour fitted model. We see that, our model approximates <strong>the</strong> proportion <strong>of</strong> 0 very well.In summary, we use <strong>the</strong> following probability function to approximate <strong>the</strong> distribution <strong>of</strong> <strong>the</strong>supremum statistic for any given T and ∆t: for any x ≥ 0,P{X(T, ∆t) < x} = w[pP(χ 2 d 1< x) + (1 − p)P(χ 2 d 2< x)] + (1 − w), (9)10


withw(T, ∆t) = 1 − 0.402 exp{−0.052T + 0.018∆t + 0.002T · ∆t};p(T, ∆t) = exp{1.629 − 0.049T + 0.590 √ ∆t}/[1 + exp{1.629 − 0.048T + 0.590 √ ∆t}];d 1 (T, ∆t) = 1 − 0.016T + 0.241 √ T − 0.115 √ ∆t;d 2 (T, ∆t) = 2 − 0.016T + 0.442 √ T − 0.519 √ ∆t.For example, when T = 10 and ∆t = 1, we have w = 0.752, p = 0.849, d 1 = 1.487 and d 2 = 2.719.To assess <strong>the</strong> precision <strong>of</strong> <strong>the</strong> approximations, we use Q-Q plots to compare <strong>the</strong> empiricalquantiles with <strong>the</strong> quantiles from our mixed χ 2 for each size <strong>of</strong> T and each marker density <strong>of</strong> ∆t.In Figure 5, 6, 7 and 8, we give <strong>the</strong> Q-Q plots for T =20cM, 40cM, 60cM, 80cM, 100cM and 120cMand ∆t = 1cM, 3cM, 5cM and 10cM. It is seen that our mixed χ 2 distribution approximates <strong>the</strong>distribution <strong>of</strong> <strong>the</strong> supremum <strong>statistics</strong> very well for various values <strong>of</strong> T and ∆t. Our approximationprovides an approximation conveniently for any values <strong>of</strong> T and ∆t. To see this, we generated10,000 sample sets. In each set, we have <strong>the</strong> supremum <strong>statistics</strong> for T = 20cM, 40cM, 60cM,80cM, 100cM and 120cM with ∆t = 2cM. Values <strong>of</strong> w’s, p’s, d 1 ’s and d 2 ’s are computed by <strong>the</strong>equation (9). Figure 9 contains <strong>the</strong> Q-Q plots between <strong>the</strong> empirical quantiles and <strong>the</strong> quantilesfrom <strong>the</strong> mixture <strong>of</strong> χ 2 distributions for each T . It is seen that <strong>the</strong> quantiles computed fromour mixture <strong>of</strong> χ 2 distributions match <strong>the</strong> quantiles <strong>of</strong> <strong>the</strong> <strong>test</strong> data very well. Our method isspecifically for <strong>the</strong> problem <strong>of</strong> <strong>test</strong>ing on a large number <strong>of</strong> markers. In human genome, <strong>the</strong>reare 22 pairs <strong>of</strong> autosomes. In a genome-wide scan, since we obtain a supremum LRT <strong>statistics</strong>for each chromosome, <strong>the</strong>re are 22 supremum <strong>statistics</strong> to be <strong>test</strong>ed. Alternatively, if we considercrossover processes to be independent from chromosome arm to chromosome arm (usually, <strong>the</strong>reare two arms for each chromosome: a longer arm and a shorter arm), <strong>the</strong>re are total 44 supremum<strong>statistics</strong> to be <strong>test</strong>ed. If we want to control <strong>the</strong> family-wise error rate at a level <strong>of</strong> 0.05, <strong>the</strong> fit<strong>of</strong> <strong>the</strong> distribution for p-value=0.001 is sufficient for each chromosome arm. It can be shown from<strong>the</strong> simulation results that <strong>the</strong> distribution approximations give p-values which are accurate when<strong>the</strong>y are near 0.05, 0.01 and 0.001 regardless <strong>of</strong> <strong>the</strong> values <strong>of</strong> T and ∆t. See Table 4.In summary, our mixture χ 2 distribution <strong>of</strong> <strong>the</strong> formP{X(T, ∆t) < x} = w[pP(χ 2 d 1< x) + (1 − p)P(χ 2 d 2< x)] + (1 − w),is an adaptable approximation to <strong>the</strong> limiting distribution <strong>of</strong> <strong>the</strong> supremum <strong>statistics</strong> max 0≤t≤T ˜ΛN (t).With any given size <strong>of</strong> T within our range <strong>of</strong> simulation, <strong>the</strong> parameters p, d 1 and d 2 can be obtainedsuch that <strong>the</strong> threshold value for a given size <strong>of</strong> <strong>test</strong> can be easily determined by our probabilityfunction.4. DiscussionsIn this paper, <strong>the</strong> limiting distribution <strong>of</strong> <strong>the</strong> LRT statistic at any given marker under <strong>the</strong> possibletriangle constraint is obtained via a technique similar to that used in Chern<strong>of</strong>f (1954) and Self& Liang (1987) for deriving <strong>the</strong> limiting distribution <strong>of</strong> <strong>the</strong> LRT <strong>statistics</strong> under nonstandardconditions. Our approach is straightforward and simple and consistent with Holmans’ result.Holmans (1993) showed that with a given genetic model, e.g. allele frequencies, <strong>the</strong> weight <strong>of</strong>χ 2 2 can be calculated through a complicated method. He did not appear to notice that, whenmarker is fully informative, <strong>the</strong> weight <strong>of</strong> χ 2 2 is fixed. When <strong>the</strong> marker is not fully informative,<strong>the</strong> number <strong>of</strong> alleles shared IBD by a sib pair may not be determined unambiguously. Generallyin this case a conditional IBD distribution can be obtained given observed genotype data. AnEM algorithm can be implemented in order to compute <strong>the</strong> correct log <strong>likelihood</strong> <strong>ratio</strong> statistic.Under <strong>the</strong> null hypo<strong>the</strong>sis, twice <strong>the</strong> correct log <strong>likelihood</strong> <strong>ratio</strong> statistic should be asymptoticallychi-squared with 2 degrees <strong>of</strong> freedom. Thus, for a sufficiently large sample size, we expect <strong>the</strong>empirical distribution <strong>of</strong> <strong>the</strong> unrestricted LRT statistic using an incomplete informative marker11


Figure 5: Q-Q plots <strong>of</strong> approximating mixed χ 2 distributions and empirical distributions <strong>of</strong> supremumLRT <strong>statistics</strong> (max 0≤t≤T ˜λN ) for T =20cM, 40cM, 60cM, 80cM, 100cM and 120cM and∆t = 1cM.Fitted mixed chi−square dist0 5 15T=20cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=40cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=60cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=80cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=100cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=120cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>12


Figure 6: Q-Q plots <strong>of</strong> approximating mixed χ 2 distributions and empirical distributions <strong>of</strong> supremumLRT <strong>statistics</strong> (max 0≤t≤T ˜λN ) for T =20cM, 40cM, 60cM, 80cM, 100cM and 120cM and∆t = 3cM.Fitted mixed chi−square dist0 5 15T=20cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 25T=40cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=60cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=80cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=100cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=120cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>13


Figure 7: Q-Q plots <strong>of</strong> approximating mixed χ 2 distributions and empirical distributions <strong>of</strong> supremumLRT <strong>statistics</strong> (max 0≤t≤T ˜λN ) for T =20cM, 40cM, 60cM, 80cM, 100cM and 120cM and∆t = 5cM.Fitted mixed chi−square dist0 5 15T=20cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=40cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=60cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 20T=80cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 25T=100cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>Fitted mixed chi−square dist0 10 25T=120cM0 5 10 15 20empirical supremum LRT <strong>statistics</strong>14

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!