24.02.2013 Views

Optimality

Massive multiple hypotheses testing

The factor $\alpha_0$ serves asymptotically as a calibrator of the adaptive significance threshold to the Bonferroni threshold in the least favorable scenario $\pi_0 = 1$, i.e., when all null hypotheses are true. Analysis of the asymptotic ERR of the $\mathrm{HT}(\alpha^*_{\mathrm{cal}})$ procedure suggests a few choices of $\alpha_0$ in practice.

4.2. Asymptotic ERR of $\mathrm{HT}(\alpha^*_{\mathrm{cal}})$

Recall from (2.7) that

\[
\mathrm{ERR}(\alpha) = \frac{\pi_0 \alpha}{F_m(\alpha)} \Pr(P_{1:m} \le \alpha).
\]

The probability $\Pr(P_{1:m} \le \alpha)$ is not tractable in general, but an upper bound can be obtained under a reasonable assumption on the set $\mathcal{P}_m$ of the $m$ P values.
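One way to see why an assumption on the joint behavior of the P values can yield such a bound, assuming $P_{1:m}$ denotes the smallest of the $m$ P values (the usual order-statistic notation, which is not restated in this section): the event $\{P_{1:m} \le \alpha\}$ is the complement of the event that every P value exceeds $\alpha$. Any lower bound on the probability that all P values exceed $\alpha$ therefore gives an upper bound on $\Pr(P_{1:m} \le \alpha)$. The product form in the second line is only an illustration; it requires the stated condition and is not a general fact.

```latex
\begin{align*}
\Pr(P_{1:m} \le \alpha)
  &= 1 - \Pr(P_1 > \alpha, \ldots, P_m > \alpha) \\
  &\le 1 - \prod_{i=1}^{m} \Pr(P_i > \alpha)
  \quad \text{if } \Pr(P_1 > \alpha, \ldots, P_m > \alpha)
        \ge \prod_{i=1}^{m} \Pr(P_i > \alpha).
\end{align*}
```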

Massive multiple tests are mostly applied in exploratory studies to produce “inference-guided discoveries” that are either subject to further confirmation and validation, or helpful for developing new research hypotheses. For this reason, often all the alternative hypotheses are two-sided, and hence so are the tests. It is instructive to first consider the case of $m$ two-sample $t$ tests. Conceptually, the data consist of $n_1$ i.i.d. observations on $\mathbb{R}^m$, $X_i = [X_{i1}, X_{i2}, \ldots, X_{im}]$, $i = 1, \ldots, n_1$, in the first group, and $n_2$ i.i.d. observations $Y_i = [Y_{i1}, Y_{i2}, \ldots, Y_{im}]$, $i = 1, \ldots, n_2$, in the second group. The hypothesis pair $(H_{0k}, H_{Ak})$ is tested by the two-sided two-sample $t$ statistic $T_k = |T(X_k, Y_k, n_1, n_2)|$ based on the data $X_k = \{X_{1k}, \ldots, X_{n_1 k}\}$ and $Y_k = \{Y_{1k}, \ldots, Y_{n_2 k}\}$. Often in biological applications that study gene signaling pathways (see, e.g., Kuo et al. [18], and the simulation model in Section 5), $X_{ik}$ and $X_{ik'}$ ($i = 1, \ldots, n_1$) are either positively or negatively correlated for certain $k \ne k'$, and the same holds for $Y_{ik}$ and $Y_{ik'}$ ($i = 1, \ldots, n_2$). Such dependence in the data induces positive association between the two-sided test statistics $T_k$ and $T_{k'}$, so that $\Pr(T_k \le t \mid T_{k'} \le t) \ge \Pr(T_k \le t)$, implying $\Pr(T_k \le t, T_{k'} \le t) \ge \Pr(T_k \le t)\Pr(T_{k'} \le t)$, $t \ge 0$. The P values in turn satisfy $\Pr(P_k > \alpha, P_{k'} > \alpha) \ge \Pr(P_k > \alpha)\Pr(P_{k'} > \alpha)$, $\alpha \in [0,1]$. It is straightforward to generalize this type of dependency to more than two tests. Alternatively, a direct model for the P values can be constructed.
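The positive association between two-sided statistics can be checked numerically. The following is a minimal simulation sketch, not the simulation model of Section 5. It assumes bivariate normal data with unit variances under both null hypotheses, a pooled-variance $t$ statistic, and the illustrative values `rho = ±0.9`, `t = 1.0`, and `n1 = n2 = 5`; all function names are hypothetical. It estimates the joint probability $\Pr(T_k \le t, T_{k'} \le t)$ and the product of the two marginal probabilities, for both positively and negatively correlated data.

```python
import math
import random
from statistics import mean, variance


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic (an assumed choice of T)."""
    n1, n2 = len(x), len(y)
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))


def correlated_pairs(n, rho, rng):
    """Draw n pairs of standard normals with correlation rho."""
    first, second = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        first.append(z1)
        second.append(rho * z1 + math.sqrt(1.0 - rho * rho) * z2)
    return first, second


def simulate_association(rho, t=1.0, n1=5, n2=5, reps=4000, seed=0):
    """Estimate Pr(T_k <= t, T_k' <= t) and Pr(T_k <= t) * Pr(T_k' <= t)."""
    rng = random.Random(seed)
    both = hit_k = hit_kp = 0
    for _ in range(reps):
        xk, xkp = correlated_pairs(n1, rho, rng)   # group 1, variables k and k'
        yk, ykp = correlated_pairs(n2, rho, rng)   # group 2, variables k and k'
        tk = abs(two_sample_t(xk, yk))             # two-sided statistic T_k
        tkp = abs(two_sample_t(xkp, ykp))          # two-sided statistic T_k'
        hit_k += tk <= t
        hit_kp += tkp <= t
        both += (tk <= t) and (tkp <= t)
    joint = both / reps
    product = (hit_k / reps) * (hit_kp / reps)
    return joint, product


for rho in (0.9, -0.9):
    joint, product = simulate_association(rho)
    print(f"rho={rho:+.1f}  joint={joint:.3f}  product={product:.3f}")
```

Under these assumptions the estimated joint probability should exceed the product for either sign of `rho`, since taking absolute values removes the sign of the correlation between the underlying statistics.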

Example 4.1. Let $\mathcal{J} \subseteq \{1, \ldots, m\}$ be a nonempty set of indices. Assume $P_j = P_0^{X_j}$, $j \in \mathcal{J}$, where $P_0$ follows a distribution $F_0$ on $[0, 1]$, and the $X_j$'s are i.i.d. continuous random variables following a distribution $H$ on $[0, \infty)$, and are independent of the P values. Assume that the $P_i$'s for $i \notin \mathcal{J}$ are either independent or related to each other in the same fashion. This model mimics the effect of an activated gene signaling pathway that results in gene differential expression as reflected by the P values: the set $\mathcal{J}$ represents the genes involved in the pathway, $P_0$ represents the underlying activation mechanism, and $X_j$ represents the noisy response of gene $j$ resulting in $P_j$. Because $P_j > \alpha$ if and only if $X_j < \log\alpha / \log P_0$, direct calculations using independence of the $X_j$'s show that

\[
\Pr\Bigl(\bigcap_{j \in \mathcal{J}} \{P_j > \alpha\}\Bigr)
= \int_0^1 \Pr\Bigl(\bigcap_{j \in \mathcal{J}} \Bigl\{X_j < \frac{\log\alpha}{\log t}\Bigr\}\Bigr)\, dF_0(t)
= E\Bigl[H\Bigl(\frac{\log\alpha}{\log P_0}\Bigr)^{|\mathcal{J}|}\Bigr],
\]

where $|\mathcal{J}|$ is the cardinality of $\mathcal{J}$. Next,

\[
\prod_{j \in \mathcal{J}} \Pr(P_j > \alpha)
= \prod_{j \in \mathcal{J}} \int_0^1 H\Bigl(\frac{\log\alpha}{\log t}\Bigr)\, dF_0(t)
= \Bigl[E\, H\Bigl(\frac{\log\alpha}{\log P_0}\Bigr)\Bigr]^{|\mathcal{J}|}.
\]
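The two expressions above can be compared by Monte Carlo. The sketch below is illustrative only: it assumes $F_0$ is Uniform(0, 1) and $H$ is Exponential(1), with $|\mathcal{J}| = 5$ and $\alpha = 0.05$; none of these choices comes from the example itself, and the function names are hypothetical. It evaluates both formulas on the same draws of $P_0$ and also simulates $P_j = P_0^{X_j}$ directly as a cross-check on the joint probability. By Jensen's inequality the expectation of the $|\mathcal{J}|$-th power is at least the $|\mathcal{J}|$-th power of the expectation, so the first estimate should not fall below the second.

```python
import math
import random


def h_cdf(x):
    """CDF of the assumed response distribution H = Exponential(1)."""
    return 1.0 - math.exp(-x) if x > 0.0 else 0.0


def example_4_1(alpha=0.05, size_j=5, reps=20000, seed=1):
    """Return (E[H^|J|], (E[H])^|J|, direct estimate of Pr(all P_j > alpha))."""
    rng = random.Random(seed)
    h_vals = []       # H(log(alpha) / log(P0)) for each draw of P0
    all_exceed = 0    # counts draws where P_j > alpha for every j in J
    for _ in range(reps):
        p0 = rng.random()          # assumed F0 = Uniform(0, 1)
        while p0 == 0.0:           # avoid log(0)
            p0 = rng.random()
        h_vals.append(h_cdf(math.log(alpha) / math.log(p0)))
        # Direct simulation of P_j = P0 ** X_j with X_j ~ Exponential(1).
        p_values = [p0 ** rng.expovariate(1.0) for _ in range(size_j)]
        all_exceed += all(p > alpha for p in p_values)
    joint_formula = sum(h ** size_j for h in h_vals) / reps
    product_formula = (sum(h_vals) / reps) ** size_j
    joint_direct = all_exceed / reps
    return joint_formula, product_formula, joint_direct


joint_formula, product_formula, joint_direct = example_4_1()
print(f"E[H^|J|]            = {joint_formula:.3f}")
print(f"(E[H])^|J|          = {product_formula:.3f}")
print(f"direct joint (sim.) = {joint_direct:.3f}")
```

The direct estimate and the formula-based estimate of the joint probability should agree up to Monte Carlo error, and both should be at least as large as the product of the marginals.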
