
Looking Elsewhere and Looking Everywhere
(a.k.a. Multiple Testing, Simultaneous Confidence Bands, ...)

Victor M. Panaretos
Institute of Mathematics
Swiss Federal Institute of Technology, Lausanne
victor.panaretos@epfl.ch

SLAC – June 2012


The Look Elsewhere Effect

Two "definitions" found in an (admittedly limited) search:

"When searching for a new resonance somewhere in a possible mass range, the significance of observing a local excess of events must take into account the probability of observing such an excess anywhere in the range."

OR

"In experiments that are aimed at detecting astrophysical sources [...] one usually performs a search over a continuous parameter space [...] looking for the most significant deviation from the background hypothesis. Such a procedure inherently involves a 'look elsewhere effect', namely, the possibility for a signal-like fluctuation to appear anywhere within the search range."


The Look Elsewhere Effect

One possible mathematical description is as follows:

- We are interested in detecting the presence of a signal µ(x_t), t = 1, ..., T, over a discretised domain {x_1, ..., x_T}, on the basis of noisy measurements.
- This is to be detected against some known background, say 0.
- Say, for the moment, that we are not specifically interested in detecting the presence of the signal at some particular location x_t, but in detecting whether a signal is present anywhere in the domain.

Formally:

  Does there exist a t ∈ {1, ..., T} such that µ(x_t) ≠ 0?

as opposed to

  Is µ(x_t) ≠ 0 for some specific t ∈ {1, ..., T}?

(but we might also be interested in "for which t is µ(x_t) ≠ 0")


The Look Elsewhere Effect as Hypothesis Testing

More formally: observe

  Y_t = µ(x_t) + ε_t,  t = 1, ..., T,

with the ε_t assumed exchangeable.¹

We wish to test, at some significance level α:

  H0 : µ(x_t) = 0 for all t ∈ {1, ..., T},
  HA : µ(x_t) ≠ 0 for some t ∈ {1, ..., T}.

(Alternatively, we may wish to construct a confidence set for all µ(x_t) simultaneously.)

¹ i.e. their joint distribution is invariant under permutation.
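To see the effect numerically, here is a minimal sketch (illustrative, not from the talk) that simulates the pure-background case µ ≡ 0 with i.i.d. N(0,1) noise and checks how often *some* location looks significant at the pointwise 5% level:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_sim = 100, 10_000
z_local = 1.645  # one-sided 5% critical value for a single N(0,1) statistic

# fraction of pure-noise realisations in which at least one W_t exceeds
# the *pointwise* threshold: the "looking everywhere" false-alarm rate
false_alarm = np.mean([(rng.standard_normal(T) > z_local).any()
                       for _ in range(n_sim)])
print(false_alarm)  # close to 1 - 0.95**100 ≈ 0.994, nowhere near 0.05
```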


(Some) Aspects to be Considered

The approach to the problem can be influenced by:

- Are the individual tests independent or dependent?
- What is the geometry of the domain of the signal?
- Is µ(·) estimated with a biased estimator (e.g. via smoothing)?


Bonferroni Method

Consider initially the case where no specific assumptions are made, but we have a means of testing each individual hypothesis at some level:

- Set H_{0,t} to be the hypothesis µ(x_t) = 0.
- Assume that we have a test for H_{0,t} at any level α_t.
- How can we construct a test for the global hypothesis H0 at level α?

Bonferroni
1. Test the individual hypotheses separately at level α_t = α/T.
2. Reject H0 if at least one of the {H_{0,t}}_{t=1}^T is rejected.

The global level is bounded as follows:

  P[reject H0 | H0] = P[ ⋃_{t=1}^T {reject H_{0,t}} | H0 ] ≤ Σ_{t=1}^T P[reject H_{0,t} | H0] = T · (α/T) = α.
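A minimal p-value implementation of the Bonferroni rule (a sketch; `bonferroni_reject` is an illustrative name, not a library function):

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni: reject the global H0 iff some individual p-value
    falls below alpha/T; the union bound caps the global level at alpha."""
    p = np.asarray(p_values)
    return bool((p < alpha / p.size).any())

# Example: 20 null p-values plus one strong signal
p = np.concatenate([np.random.default_rng(1).uniform(size=20), [1e-4]])
print(bonferroni_reject(p))  # True, since 1e-4 < 0.05/21
```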


WARNING: TEST STATISTICS VS P-VALUES

In what follows, we assume that, under H_{0,t}, the corresponding test statistic W_t has a distribution F that is the same for all t (e.g. all individual tests are t-tests with the same degrees of freedom, or χ² tests with the same degrees of freedom).

- This is typically the case in the signal/background problem.
- Therefore, arranging these test statistics in decreasing order corresponds to arranging them from most to least significant.
- If the test statistics are not identically distributed under H0, then we need to work with p-values instead (arranging p-values in increasing order corresponds to arranging them from most to least significant).


Holm-Bonferroni Method

Advantage: works for any (discrete-domain) setup!
Disadvantage: too conservative when T is large.

Holm's modification increases the average number of hypotheses rejected at level α (but does not increase power for overall rejection of H0).

Holm's Procedure
1. Suppose we reject H_{0,t} for large values of a test statistic W_t.
2. Order the individual test statistics, largest to smallest: W_{(T)} ≥ ... ≥ W_{(1)}.
3. Starting from t = T and going down, reject all H_{0,(t)} such that W_{(t)} is supercritical at level α/t. Stop rejecting at the first subcritical W_{(t)}.

A genuine improvement over Bonferroni if we want to detect as many signals as possible, not just the existence of some signal.

Both Holm and Bonferroni reject the global H0 if and only if sup_t W_t is supercritical at level α/T w.r.t. the distribution of the corresponding W under H0.
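A sketch of the step-down procedure in p-value form (equivalent to the statistic-based description above when all the W_t share the null distribution F; the function name is illustrative):

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Step-down Holm: returns a boolean mask of rejected H_{0,t},
    in the original order of the hypotheses."""
    p = np.asarray(p_values)
    T = p.size
    order = np.argsort(p)                  # most significant first
    reject = np.zeros(T, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (T - rank):   # thresholds alpha/T, alpha/(T-1), ...
            reject[idx] = True
        else:
            break                          # stop at the first failure
    return reject
```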


Taking Advantage of Structure

The Bonferroni and Holm methods have the advantage of holding for:

- any dependence structure between the test statistics (or data);
- any signal structure: µ can be totally arbitrary;
- any geometry of the signal domain.

Indeed, both methods would work for a collection of totally unrelated hypotheses that need to be tested simultaneously!

↪ Price to pay: low rejection power for the global H0.

Can we take advantage of prior knowledge of additional structure to increase power for H0 while maintaining the level?


Taking Advantage of Structure: Independence

In the (special) case where the individual test statistics are independent, one may use Simes' (in)equality,

  P[ W_{(j)} supercritical at level α(T − j + 1)/T, for some j = 1, ..., T | H0 ] = α

(strict equality requires continuous test statistics; otherwise ≤ α).

This yields Simes' procedure (assuming independence):
1. Suppose we reject H_{0,j} for large values of a test statistic W_j.
2. Order the test statistics from largest to smallest: W_{(T)} ≥ ... ≥ W_{(1)}.
3. If, for some j = 1, ..., T, the statistic W_{(j)} is supercritical at level α(T − j + 1)/T, then reject H0.

This provides a test for the global hypothesis H0, but does not "localise" the signal at a particular x_t.
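A sketch of the global Simes test, stated in p-value form (with p_(1) ≤ ... ≤ p_(T), comparing p_(j) with jα/T is the same rule as comparing the j-th largest statistic with the level α(T − j + 1)/T above):

```python
import numpy as np

def simes_reject_global(p_values, alpha=0.05):
    """Simes test of the global H0 (valid under independence):
    reject iff p_(j) <= j*alpha/T for some j."""
    p = np.sort(np.asarray(p_values))
    T = p.size
    return bool((p <= alpha * np.arange(1, T + 1) / T).any())
```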


Taking Advantage of Structure: Independence

One can, however, devise a sequential procedure to "localise" Simes' procedure, at the expense of lower power for the global hypothesis H0:

Hochberg's procedure (assuming independence)
1. Suppose we reject H_{0,j} for large values of a test statistic W_j.
2. Order the test statistics from largest to smallest: W_{(T)} ≥ ... ≥ W_{(1)}.
3. Starting from j = 1, 2, ... and going up, accept all H_{0,(j)} such that W_{(j)} is subcritical at level α/j.
4. Stop accepting at the first j such that W_{(j)} is supercritical at level α/j, and reject all the remaining ordered hypotheses past that j.

A genuine improvement over Holm-Bonferroni, both overall (H0) and in terms of signal localisation:
1. Rejects "more" individual hypotheses than Holm-Bonferroni.
2. Power for the overall H0 is "weaker" than Simes' (for T > 2), much "stronger" than Holm's (for T > 1).
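A step-up sketch in p-value form (scanning from least to most significant; `hochberg_reject` is an illustrative name):

```python
import numpy as np

def hochberg_reject(p_values, alpha=0.05):
    """Step-up Hochberg: find the largest k with p_(k) <= alpha/(T - k + 1)
    and reject the k most significant hypotheses.
    Returns a boolean mask in the original order."""
    p = np.asarray(p_values)
    T = p.size
    order = np.argsort(p)                     # increasing p-values
    reject = np.zeros(T, dtype=bool)
    for j in range(T, 0, -1):                 # scan least significant first
        if p[order[j - 1]] <= alpha / (T - j + 1):
            reject[order[:j]] = True          # reject it and all more significant
            break
    return reject
```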


Taking Advantage of Structure: Independence

[Figure: rejection levels (y-axis: "Rejection Level", 0.01 to 0.05) for the Bonferroni, Hochberg, and Simes procedures, plotted against the rank of the ordered hypotheses (x-axis: "Minimal to Maximal Significance", up to 25).]

Taking Advantage of Structure: Dependence

Though independence provides some improvement over the Holm and Bonferroni methods, in some sense it complicates things rather than simplifying them.

Dependence can complicate the analysis, but can improve power considerably! It manifests itself in either of the following two ways (or a combination):

1. Dependence between the test statistics for each H_{0,t}
   ↪ Can arise if the measurement errors are correlated, or if the datasets on which the individual test statistics W_t are based are not mutually exclusive for different values of t = 1, ..., T.
2. Logical dependence between the hypotheses H_{0,t}
   ↪ It might be that rejecting a hypothesis H_{0,t1} would necessarily imply the rejection of another, H_{0,t2}, or, more generally, that not all combinations of rejections of individual hypotheses are feasible (effective number of individual tests smaller than T).


Exploiting Dependence Between Test Statistics

For the moment, focus on the global hypothesis only (H0).

- Under no assumptions at all, we used sup_t W_t compared to the α/T critical value of the corresponding W under H0.
- (Independence means the cdf of W* is the product of the individual cdfs.)
- If there is dependence, then sup_t W_t should have special behaviour, and maybe it is too conservative to apply such a strong correction when comparing it with the critical value of the corresponding W under H0.
- Instead, perhaps we should use the distribution of sup_t W_t itself?
- Under a completely general dependence structure this is unknown – but it can be simulated!

How? Focus again on our prototype signal search example:

  Y_t = µ(x_t) + ε_t,  t = 1, ..., T,

with the ε_t assumed exchangeable; test H0 : µ(x_t) = 0 ∀ t ∈ {1, ..., T}.


Exploiting Dependence Between Test Statistics

Westfall-Young Single-Step Permutation Procedure

- Under the complete H0, the joint distribution of (Y_1, ..., Y_T) is the same as the distribution of g(Y_1, ..., Y_T) for any permutation g,

    g(Y_1, ..., Y_T) = (Y_{π(1)}, ..., Y_{π(T)}),

  where π(·) : {1, ..., T} → {1, ..., T} is a bijection.
- Therefore, under H0, the distribution of W* = sup_t W_t can be approximated by considering all possible permutations of (Y_1, ..., Y_T) (re-calculating the supremum of the individual test statistics for each permutation g in the set of all T! permutations, say G).
- This gives us approximate quantiles for the distribution of W* = sup_t W_t under H0, upon which we can base testing:

    Q_{1−α} ≈ Q̂_{1−α} := inf{ w ∈ ℝ : (1/T!) Σ_{g∈G} 1{W*_g ≤ w} ≥ 1 − α }.
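A Monte-Carlo sketch of the single-step quantile Q̂_{1−α}, using a random subset of permutations since all T! is infeasible (`stat_fn`, mapping the data vector to the statistics (W_1, ..., W_T), is a placeholder for whatever individual statistics are in use):

```python
import numpy as np

def westfall_young_quantile(y, stat_fn, alpha=0.05, n_perm=10_000, seed=0):
    """Approximate the (1-alpha)-quantile of W* = sup_t W_t under the
    complete H0 by recomputing the max statistic over random permutations
    of the exchangeable data (a Monte-Carlo stand-in for all of G)."""
    rng = np.random.default_rng(seed)
    sup_stats = np.array([stat_fn(rng.permutation(y)).max()
                          for _ in range(n_perm)])
    return np.quantile(sup_stats, 1 - alpha)

# Example: W_t = local 5-point averages of the data
stat_fn = lambda y: np.convolve(y, np.ones(5) / 5, mode="same")
y = np.random.default_rng(1).standard_normal(200)
q_hat = westfall_young_quantile(y, stat_fn)
print((stat_fn(y) > q_hat).any())  # global test; per-t comparisons localise it
```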


Exploiting Dependence Between Test Statistics

The Westfall-Young procedure has the following properties:

- Maintains the global level α for H0.
- Can also be "localised": simply reject all H_{0,j} such that W_j > Q̂_{1−α}.
  ↪ To maintain level α under any set of null hypotheses, additional assumptions are needed (e.g. "subset pivotality"). The previous procedures maintained this without extra assumptions.
- Has a step-down version (like Holm) which is more powerful than Holm (under subset pivotality).
- Asymptotic "oracle" optimality as T → ∞ if only a small proportion (sparsity) of the individual nulls are false.
- Computationally intensive for large T. Problem: often T → ∞, e.g. when the discretisation approximates a continuous domain...

What if more is known about the dependence structure of {W_t}_{t=1}^T?
↪ Could we hope to obtain a mathematical result on the distribution of W* = sup_t W_t under H0 (at least approximately)?


Exploiting Dependence over "Metric" Continuous Domains

Recall our special problem:

  H0 : µ(x_t) = 0 for all t ∈ {1, ..., T},
  HA : µ(x_t) ≠ 0 for some t ∈ {1, ..., T}.

- The T test statistics arise through the discrete approximation of a signal on a "metric" continuous domain, say [0, S]:

    0 = x_1 < x_2 < ... < x_{T−1} < x_T = S.

- We thus expect T to become very large as we increase our resolution.
- It might be worth "embedding" the problem in the continuous domain.
- Rather than a collection of T test statistics {W_t : t = 1, ..., T}, consider an uncountable collection indexed by the interval [0, S].
- Take advantage of the metric structure and continuity of the domain (could they allow more structured dependence to be further exploited?).
- View the collection {W(x) : x ∈ [0, S]} as a random function (stochastic process).


Exploiting Dependence over "Metric" Continuous Domains

If the dependence among the {W(x) : x ∈ [0, S]} is structured enough, we may hope to determine useful² (approximate) bounds for

  P[W* > u],  where W* = sup_{x∈[0,S]} W(x)

(and hence obtain global critical values) without resorting to simulation.

Suppose (for example) that:

- {W(x) : x ∈ [0, S]} is a zero-mean stationary Gaussian process under H0.

Structured dependence could exploit the "metric" domain through smoothness:

- When |x − y| is small, then |W(x) − W(y)| should be small too.
- But W(x) − W(y) ∼ N(0, 2[C(0) − C(x − y)]), where C(x − y) = Cov[W(x), W(y)].
- So if C is smooth near 0, we expect W(x) to be smooth.

² Not so general that they provide no power.


Exploiting Smoothness and Geometric Domains

Disclaimer: we will not attempt to discuss regularity conditions.

A truly remarkable result:

- Take W(x) to be Gaussian and smooth.
- Let x range over a "nice" geometric domain D.

Extrema and Expected Euler Characteristics (Taylor & Adler)

  | P[ sup_{x∈D} W(x) ≥ u ] − E{φ(A_u(W, D))} | < O( exp(−Cu²/(2σ²)) )

- A_u(W, D) = {y ∈ D : W(y) > u} is the excursion set of W at level u.
- φ(D) is the Euler characteristic of the index set D.
  ↪ Characterises basic topological features of the set (e.g. for a polyhedron, φ = [#Vertices] − [#Edges] + [#Faces]).
- The expectation is computable for Gaussian and related processes (χ², t, F).
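In one dimension the expected Euler characteristic has a closed form: for a zero-mean, unit-variance stationary process on D = [0, S] it is the probability of starting above u plus the expected number of up-crossings (Rice's formula). A sketch of the resulting approximate global p-value (here λ₂ = −C″(0) is the second spectral moment; this is the standard 1-D formula, not something quoted on the slide):

```python
import numpy as np
from scipy.stats import norm

def ec_global_pvalue(u, S, lambda2):
    """Approximate P(sup_{x in [0,S]} W(x) >= u) for a smooth, zero-mean,
    unit-variance stationary Gaussian process via the expected Euler
    characteristic:
        E[phi(A_u)] = P(W(0) > u) + (S/(2*pi)) * sqrt(lambda2) * exp(-u^2/2),
    which is sharp in the tail (large u)."""
    return norm.sf(u) + (S / (2 * np.pi)) * np.sqrt(lambda2) * np.exp(-u**2 / 2)

print(ec_global_pvalue(u=4.0, S=100.0, lambda2=1.0))
# ≈ 5.4e-3 globally, versus the pointwise tail norm.sf(4) ≈ 3.2e-5
```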


Taking Advantage of Structure

- Are the individual tests independent or dependent?
- What is the geometry of the domain of the signal?
- Is µ(·) estimated with a biased estimator (e.g. via smoothing)?


Tests Based on Biased Estimators

Often the test statistics W(t) are based on a biased estimator.

- This means that W(t) may not be centred under H0.
  ↪ Example: nonparametric estimation of µ(x), assuming it is smooth, but without assuming any specific parametric form.

E.g. assume µ(·) : [0, 1] → R has a Lipschitz second derivative, and observe

  Y_t = µ(x_t) + ε_t,  t = 1, ..., T,

with the ε_t assumed i.i.d. with variance σ². Test H0 : µ(x) = 0 ∀ x ∈ [0, 1].

A classical estimator of µ is a kernel estimator (convolution estimator):

  µ̂_λ(x) = (1/(λT)) Σ_{t=1}^T Y_t K((x − x_t)/λ),  K a centred symmetric pdf on [−1, 1].
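A sketch of this estimator with the Epanechnikov kernel (one concrete centred symmetric pdf on [−1, 1]; the slide leaves K generic):

```python
import numpy as np

def kernel_estimate(x_eval, x, y, lam):
    """Convolution kernel estimator
       mu_hat_lam(x) = (1/(lam*T)) * sum_t Y_t * K((x - x_t)/lam),
    with the Epanechnikov kernel K(u) = 0.75*(1 - u^2) on [-1, 1]."""
    K = lambda u: 0.75 * (1 - u**2) * (np.abs(u) <= 1)
    u = (x_eval[:, None] - x[None, :]) / lam   # (n_eval, T) scaled distances
    return (K(u) @ y) / (lam * y.size)

# Pure-background data on [0, 1]; the estimate should hover around 0
x = np.linspace(0, 1, 200)
y = np.random.default_rng(2).standard_normal(200)
mu_hat = kernel_estimate(np.linspace(0.1, 0.9, 50), x, y, lam=0.1)
```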


Tests Based on Biased Estimators

- The estimator uses a local weighted average to estimate µ.
- It sacrifices unbiasedness to reduce the overall IMSE:

    ∫ E(µ̂_λ(x) − µ(x))² dx = ∫ (Eµ̂_λ(x) − µ(x))² dx  [bias²]  +  ∫ E(µ̂_λ(x) − Eµ̂_λ(x))² dx  [variance]

- λ is chosen to "optimise" the bias-variance tradeoff.
- Note that Eµ̂_λ(x) = (K_λ ∗ µ)(x) (a smeared version of µ).

Assume we use a t-statistic (appealing to asymptotic normality); then

  W_λ(x) = (µ̂_λ(x) − µ(x)) / √Var[µ̂_λ(x)]
         = (µ̂_λ(x) − Eµ̂_λ(x)) / √Var[µ̂_λ(x)] + (Eµ̂_λ(x) − µ(x)) / √Var[µ̂_λ(x)],

i.e. a centred term plus a standardised bias term.


Tests Based on Biased Estimators

Let µ and K be "smooth" and let λ̂ be a data-tuned balancer of bias and variance with (λ̂ − λ*)/λ* = O_P(T^(−1/10)) (λ* the true one). Let Q̂_{1−α} be

  Q̂_{1−α} = σ √( ∫K²(x)dx / (T λ̂) ) · { √(−2 log λ̂) + [ log( (1/(2π)) √( ∫K̇²(y)dy / ∫K²(y)dy ) ) − log( −log(1−α)/2 ) ] / √(−2 log λ̂) }.

Then, under {H0 : µ(x) = 0 ∀ x ∈ [0, 1]}, and with b̂_λ̂(x) = (λ̂²/2) µ̂″(x) ∫ y²K(y)dy,

  P[ sup_{x∈[0,1]} | µ̂_λ̂(x) − b̂_λ̂(x) | > Q̂_{1−α} ] → α  as  T → ∞.

- b̂_λ̂(x) is a point-wise asymptotic bias correction.
- Bonferroni would be conservative, even though it fails to account for bias! (Still, this result is asymptotic in T, the number of observations.)
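A direct transcription of Q̂_{1−α} as displayed above (a sketch under this reconstruction of the slide's formula; the kernel integrals ∫K² and ∫K̇² are passed in as numbers):

```python
import numpy as np

def band_critical_value(alpha, lam_hat, T, sigma, int_K2, int_Kdot2):
    """Q_hat_{1-alpha} for the simultaneous band: a scale factor times an
    extreme-value-type correction in sqrt(-2*log(lam_hat))."""
    L = np.sqrt(-2 * np.log(lam_hat))        # lam_hat < 1, so -log(lam_hat) > 0
    scale = sigma * np.sqrt(int_K2 / (T * lam_hat))
    shift = (np.log(np.sqrt(int_Kdot2 / int_K2) / (2 * np.pi))
             - np.log(-np.log(1 - alpha) / 2)) / L
    return scale * (L + shift)

# e.g. Epanechnikov kernel: int K^2 = 3/5, int (K')^2 = 3/2
print(band_critical_value(0.05, lam_hat=0.1, T=1000, sigma=1.0,
                          int_K2=0.6, int_Kdot2=1.5))
```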


Final Remarks

- Many different approaches, depending on the formulation and the level of detail in specifying the relations between hypotheses.
- The statistical literature on the topic is vast.
- The more specific we can be, the more powerful we can be.
- The methods can use p-values equally easily.
- The methods can be adapted when logical relations exist between hypotheses (e.g. when the invalidity of H_{0,1}, say, logically implies the invalidity of H_{0,2}).
- We focused here on strong control of the family-wise error rate, which has dominated the (overwhelmingly frequentist) literature.
- However, a wealth of other powerful methods obtained via other control criteria are becoming very popular (and successful):
  - Per-comparison error rate: E[number of true hypotheses rejected]/T.
  - Expected error rate: E[number of wrong decisions]/T.
  - False discovery rate: E[#(falsely rejected hypotheses)/#(rejected hypotheses)] – see Benjamini & Hochberg (1995).


References

Adler, R.J. & Taylor, J.E. (2007). Random Fields and Geometry. Springer.

Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57: 289–300.

Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing and Prediction. Institute of Mathematical Statistics Monographs.

Eubank, R.L. & Speckman, P.L. (1993). Confidence bands in nonparametric regression. J. Amer. Statist. Assoc., 88: 1287–1301.

Hall, P. (1993). On Edgeworth expansion and bootstrap confidence bands in nonparametric curve estimation. J. R. Statist. Soc. B, 55: 291–304.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75: 800–802.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scand. J. Statist., 6: 65–70.

Simes, R.J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73: 751–754.

Westfall, P.H. & Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley Series in Probability and Statistics.
