

Swiss Federal Institute of Technology Zurich
Seminar for Statistics
Department of Mathematics

Master Thesis Summer 2011

Emmanuel Payebto Zoua

Subsampling estimates of the Lasso distribution.

Submission Date: September 09th 2011

Adviser: Prof. Dr. Peter Bühlmann


Acknowledgement

First and foremost, I want to thank my supervisor Prof. Dr. Peter Bühlmann for the excellent guidance and for introducing me to this fascinating topic. The interest he demonstrated for my work has been a great source of motivation.

I want to express my gratitude to Prof. Dr. Sara Van de Geer of the Seminar for Statistics and to Dr. Michel Baes of the Institute for Operational Research for taking the time to answer my questions. Also, I thank David Lamparter for many interesting discussions on the Lasso.

Finally, writing this thesis would not have been possible without the friendly atmosphere at the Seminar for Statistics. I would like to thank all staff members and the assistants for helping me with mathematical or editing issues. To Sarah Gerster and Jürg Schelldorfer, a huge thanks for the hospitality in the office.


Abstract

We investigate the possibilities offered by subsampling to estimate the distribution of the Lasso estimator and to construct confidence intervals and hypothesis tests. Despite being inferior to the bootstrap in terms of higher-order accuracy in situations where the latter is consistent, subsampling offers the advantage of working under very weak assumptions. Thus, building upon Knight and Fu (2000), we first study the asymptotics of the Lasso estimator in a low dimensional setting and prove that under an orthogonal design assumption, the finite sample component distributions converge to a limit in a mode allowing for consistency of subsampling confidence intervals. We give hints that this result holds in greater generality. In a high dimensional setting, we study the adaptive Lasso under the partial orthogonality assumption introduced by Huang, Ma, and Zhang (2008) and use the partial oracle result in distribution to argue that subsampling should provide valid confidence intervals for nonzero parameters. Simulation studies confirm the validity of subsampling for constructing confidence intervals, testing null hypotheses and controlling the FWER through subsampled p-values in a low dimensional setting. In the high dimensional setting, confidence intervals for nonzero coefficients are slightly anticonservative and false positive rates are shown to be conservative.


Contents

Notation

1 Introduction

2 Minimizers of convex processes
2.1 Convergence in probability
2.2 Convergence in distribution
2.2.1 Weak convergence in metric spaces
2.2.2 Bounded and locally bounded functions
2.3 A continuous mapping theorem for argmin functionals

3 Application to the Lasso estimator
3.1 Limit in probability
3.2 Limit in distribution
3.2.1 Limiting distribution of components
3.2.2 Uniform convergence in the orthogonal case

4 The adaptive Lasso in a high dimensional setting
4.1 Variable selection consistency
4.2 Partial asymptotic normality
4.3 Marginal regressors as initial estimates

5 Subsampling
5.1 Pointwise consistency for distribution estimation
5.2 Uniform consistency for quantile approximation
5.2.1 Statement of the general result
5.2.2 Proof

6 Numerical results
6.1 Low dimensional setting
6.1.1 Confidence intervals
6.1.2 Hypothesis testing
6.1.3 FWER
6.2 High dimensional setting

7 Concluding remarks

Bibliography

A R Codes
A.1 Simulation code for the Lasso in a low dimensional setting
A.2 Simulation code for the adaptive Lasso in a high dimensional setting


List of Figures

3.1 Monte Carlo estimates of the distribution of the root $\sqrt{n}(\hat{\beta}_j - \beta_j)$, $j = 7, \ldots, 10$, with penalization parameter $\lambda_n = 2\sqrt{n}$.

5.1 In red, the subsampling distribution estimates for the roots $\sqrt{n}(\hat{\beta}_j - \beta_j)$, $j = 1, \ldots, 9$; $n = 10000$, $b = n^{0.65} \approx 400$, $B = 2000$, with penalization parameter $\lambda_n = 2\sqrt{n}$.

6.1 Subsampling confidence intervals for single scenarios of the models A and A' ($n = 250$). Red triangles stand for the true parameters.

6.2 Subsampling confidence intervals for single scenarios of the models B and B' ($n = 250$). Red triangles stand for the true parameters.

6.3 Subsampling confidence intervals for single scenarios of the models C and C' ($n = 250$). Red triangles stand for the true parameters.

6.4 Coverage rates of the two sided confidence intervals $I_2$ for the adaptive Lasso in high dimension. Green triangles correspond to relevant variables. Black dots correspond to noise variables.


List of Tables

6.1 Model A and A'. Empirical coverage/false positive rates for the two sided interval $I_2$.

6.2 Model B and B'. Empirical coverage/false positive rates for the two sided interval $I_2$.

6.3 Model C and C'. Empirical coverage/false positive rates for the two sided interval $I_2$.

6.4 Model A and A'. FWER and empirical power for nonzero coefficients.

6.5 Model B and B'. FWER and empirical power for nonzero coefficients.

6.6 Model C and C'. FWER and empirical power for nonzero coefficients.


Notation

Spaces and sets

$\mathcal{A}$: A $\sigma$-algebra.
$\mathcal{B}^o$: The Borel $\sigma$-algebra on $[0, 1]$.
$B_{loc}(\mathbb{R}^p)$: Set of locally bounded functions on $\mathbb{R}^p$.
$B(x, \varepsilon)$: Open ball of radius $\varepsilon$ around a point $x$ in a metric space.
$C_b(D)$: Bounded continuous functions on a metric space $D$.
$C_{min}(\mathbb{R}^p)$: Set of continuous functions uniquely minimized on $\mathbb{R}^p$.
$K^\delta$: $\delta$-enlargement of a compact set $K$ in a metric space.
$l^\infty(T)$: Space of uniformly bounded functions on a set $T$.
$l^\infty(T_1, T_2, \ldots)$: Space of functions uniformly bounded on each $T_i$, $i \in \mathbb{N}$.
$\overline{\mathbb{R}}$: The extended real line.
$\Omega$: Sample space.

Modes of convergence

$\to$: Deterministic or almost sure convergence.
$\to_P$: Convergence in probability.
$\to_{as*}$: Outer almost sure convergence.
$\to_{au}$: Almost uniform convergence.
$\rightsquigarrow$: Weak convergence.

O notation

$X_n = o_P(a_n)$: Convergence in probability of $\{X_n/a_n\}_n$ to zero.
$X_n = O_P(a_n)$: Uniform tightness of $\{X_n/a_n\}_n$.

Further notation

$\Delta J(x)$: Jump height at $x$ of a right continuous function $J$ with left limits.
$E, E^*, E_*$: Expectation, outer expectation and inner expectation, respectively.
$f_{|T}$: Restriction of a function $f$ to a subset $T$.
$P, P^*, P_*$: Probability, outer probability and inner probability, respectively.
$T^*$: Minimal measurable majorant of a random map $T$.
$T_*$: Maximal measurable minorant of a random map $T$.
$\|X\|_{\psi_d}$: Orlicz norm of a random variable $X$.


Chapter 1

Introduction

In this thesis we consider the linear regression model
$$Y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \ldots, n,$$
and focus on the Least Absolute Shrinkage and Selection Operator, or Lasso, as estimation method for the parameter $\beta$. The Lasso, defined as
$$\arg\min_{\phi \in \mathbb{R}^p} \sum_{i=1}^{n} (x_i'\phi - Y_i)^2 + \lambda \sum_{j=1}^{p} |\phi_j|,$$

was introduced by Tibshirani (1996) and has become very popular over the last years. There are two main reasons for this gain in popularity. The first is that, in the context of high dimensional data analysis, where the dimension of the parameter is very large, introducing an $l_1$-norm penalty in addition to the squared loss allows the Lasso to induce sparsity on the estimated model while maintaining its prediction ability. That is, many coefficients will be set to zero, which is a desirable property if one believes that only few parameters are relevant. The second reason is that the solution can be computed efficiently by convex optimization. Efron, Hastie, Johnstone, and Tibshirani (2004) introduced the LARS algorithm, which computes the entire solution path, that is, the estimates corresponding to all values $\lambda > 0$, at the cost of a single least squares regression.
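To make the computational point concrete, here is a minimal R sketch on simulated data. It is not taken from the thesis' Appendix A; it assumes the add-on package glmnet, which solves the problem above with the squared loss rescaled by $1/(2n)$.

```r
## Minimal sketch: Lasso solution path on simulated sparse data (assumes 'glmnet').
library(glmnet)

set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2))     # sparse truth: two relevant coefficients
y <- drop(x %*% beta) + rnorm(n)

fit <- glmnet(x, y, alpha = 1)      # alpha = 1 gives the pure l1 penalty
plot(fit, xvar = "lambda")          # the entire solution path over lambda
coef(fit, s = 0.5)                  # estimate at one lambda; most entries are exactly 0
```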

For the sole purpose of prediction, the Lasso proves to be very efficient. Several authors showed under various assumptions that it enjoys a so-called oracle property; see for instance Zou (2006), Bunea, Tsybakov, and Wegkamp (2007) or Van De Geer and Bühlmann (2009). Roughly speaking, an oracle result states that prediction based on the Lasso solution will be as accurate as if the true model were known. Variable selection properties of the Lasso are by now also quite well understood, and it was shown that consistency in this mode can be characterized by a so-called irrepresentable condition; see for instance Zhao and Yu (2006) or Meinshausen and Bühlmann (2006).

Nevertheless, the Lasso theory still needs to bridge a major gap with traditional statistical inference. Indeed, it is to this date difficult to assign a measure of uncertainty to its estimates. One consequence is that the data analyst who goes beyond the goal of prediction and is interested in selecting the exact generating model will typically face many noise variables for which no statistical tests are available. Also, there are no confidence intervals for the estimated coefficients.

Assigning confidence intervals to an estimator or testing a null hypothesis typically involves understanding its distributional properties. In situations where an analytic solution to this problem doesn't seem to be in sight, as is the case when the estimator results from numerical optimization, one usually resorts to resampling techniques, of which Efron's bootstrap is a prime example. However, proving the consistency (or the inconsistency) of the bootstrap often proves to be a hard task. Meanwhile, subsampling, or resampling without replacement, will yield consistency under very weak assumptions: roughly speaking, the estimator, or an associated root, only needs to have a limiting distribution. In this thesis we investigate this direction.
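To fix ideas, the following R sketch illustrates the subsampling scheme studied later (Chapter 5 and Figure 5.1): approximate the distribution of the root $\sqrt{n}(\hat{\beta}_j - \beta_j)$ by roots computed on subsamples of size $b \ll n$ drawn without replacement. It is not the thesis' simulation code from Appendix A; it assumes the glmnet package and a heuristic rescaling of the penalty $\lambda_n = 2\sqrt{n}$ to glmnet's $1/(2n)$-scaled objective.

```r
## Minimal sketch: subsampling confidence interval for one Lasso coefficient.
## Assumptions: 'glmnet'; penalty lambda_n = 2*sqrt(n) on the sum-of-squares
## scale, which is roughly 1/sqrt(n) on glmnet's 1/(2n) scale.
library(glmnet)

set.seed(1)
n <- 1000; p <- 9
x <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, rep(0, p - 2))
y <- drop(x %*% beta) + rnorm(n)

lasso_coef <- function(x, y) {
  m <- nrow(x)
  as.vector(coef(glmnet(x, y, lambda = 1 / sqrt(m))))[-1]  # drop intercept
}

beta_n <- lasso_coef(x, y)            # full-sample estimate
b <- floor(n^0.65); B <- 1000         # subsample size and number of subsamples
roots <- replicate(B, {
  idx <- sample(n, b)                 # subsample drawn without replacement
  sqrt(b) * (lasso_coef(x[idx, ], y[idx]) - beta_n)
})

## Two sided 95% subsampling interval for beta_1 (cf. the intervals I_2 of Chapter 6)
q <- quantile(roots[1, ], c(0.975, 0.025))
beta_n[1] - q / sqrt(n)
```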

Chapter 2 reviews the asymptotic theory for minimizers of convex processes. In Chapter 3, the results obtained are applied to the Lasso estimator in a low dimensional setting in the vein of Knight and Fu (2000), and continuity properties of the limiting distribution are outlined. In Chapter 4, following Huang et al. (2008), the adaptive Lasso is studied in a sparse high dimensional setting and a partial oracle property (in distribution) is presented. In Chapter 5, the theory of subsampling is introduced and consistency results are presented, following the expositions of Politis, Romano, and Wolf (1999) and Romano and Shaikh (2010). Finally, a simulation study is conducted in Chapter 6 to assess the performance of subsampling applied to the aforementioned problems.


Chapter 2

Minimizers of convex processes

This chapter is devoted to the asymptotic theory of minimizers of convex processes. The general message we hope to convey is that, both in probability and in distribution, the determination of the limit for such minimizers can be reduced to the determination of the pointwise (or marginal) limit of the processes, provided that some regularity conditions are met and that this pointwise limiting process is uniquely minimized.
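Schematically, and anticipating the precise statements below (Corollary 2.1.0.3 for the limit in probability and the argmin continuous mapping theorem of Section 2.3 for the limit in distribution), the message reads
$$X_n \text{ convex}, \quad X_n(t) \rightsquigarrow X(t) \text{ for every } t, \quad X \text{ uniquely minimized} \;\Longrightarrow\; \arg\min X_n \rightsquigarrow \arg\min X,$$
under the regularity conditions made explicit in those results.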

2.1 Convergence in probability

We first show that, in probability, pointwise convergence of convex processes on a dense subset implies uniform convergence on compacts. This result will be derived from the following deterministic version by a subsequencing argument.

Theorem 2.1.0.1. (Rockafellar, 1970, Theorem 10.8) Let $C \subseteq \mathbb{R}^p$ be open and let $\{f_n : C \to \mathbb{R}\}_n$ be a sequence of finite convex functions on $C$. Suppose that $\{f_n\}_n$ converges pointwise on a dense subset $C_0$ of $C$, that is,
$$\lim_{n\to\infty} f_n(x)$$
exists for every $x \in C_0$. Then $\{f_n\}_n$ converges pointwise on the whole set $C$ and the limit function
$$f(x) := \lim_{n\to\infty} f_n(x)$$
is finite and convex on $C$. Moreover, $\{f_n\}_{n\in\mathbb{N}}$ converges uniformly to $f$ on every compact subset $K \subset C$, that is,
$$\lim_{n\to\infty} \sup_{x\in K} |f_n(x) - f(x)| = 0.$$

Theorem 2.1.0.2. (Andersen and Gill, 1982, Theorem 2.1) Let $C \subseteq \mathbb{R}^p$ be open and let $\{f_n\}_n$ be a sequence of random convex functions on $C$ such that for every $x \in C$,
$$f_n(x) \to_P f(x),$$
where $f$ is a real (random) function on $C$. Then $f$ is convex and for every compact subset $K \subset C$, it holds that
$$\sup_{x\in K} |f_n(x) - f(x)| \to_P 0.$$

Proof. Recall that convergence in probability implies almost sure convergence along some subsequence. Consider a dense sequence $\{x_i\}_i \subset C$. Since $f_n(x_i) \to_P f(x_i)$ for every $i \in \mathbb{N}$, there exists a subsequence $\{j^1_n\}_n$ for which
$$f_{j^1_n}(x_1) \to f(x_1) \quad (2.1.0.1)$$
almost everywhere. This subsequence contains a further subsequence $\{j^2_n\}_n$ for which, almost everywhere,
$$f_{j^2_n}(x_2) \to f(x_2)$$
holds in addition to (2.1.0.1). Repeating this argument, we obtain for every $k \in \mathbb{N}$ a subsequence $\{j^k_n\}_n$ satisfying
$$f_{j^k_n}(x_i) \to f(x_i) \quad \text{for } i = 1, \ldots, k,$$
almost everywhere. Then for the sequence $\{j^*_n\}_n$ of diagonal members, i.e. $j^*_n = j^n_n$, we have
$$f_{j^*_n}(x_i) \to f(x_i) \quad \text{for every } i \in \mathbb{N}$$
almost surely. Next, Theorem 2.1.0.1 implies that $f$ is almost everywhere convex on $C$ and that
$$\sup_{x\in K} \left| f_{j^*_n}(x) - f(x) \right| \to 0$$
almost surely for every compact $K \subset C$. We have shown, more generally, that every subsequence of $\{\sup_{x\in K} |f_n(x) - f(x)|\}_n$ has a further subsequence which converges almost surely to zero. This implies convergence in probability to zero along the whole sequence. Indeed, suppose that this is not the case. Then there exists a subsequence $\{j_n\}_n$ and constants $M, \varepsilon > 0$ such that $P(\sup_{x\in K} |f_{j_n}(x) - f(x)| > M) > \varepsilon$ for all $n \in \mathbb{N}$. For every further subsequence $\{j'_n\}_n$, it holds that
$$P\left( \sup_{x\in K} |f_{j'_n}(x) - f(x)| \to 0 \right) \le P\left( \bigcup_{N=1}^{\infty} \bigcap_{n \ge N} \left\{ \sup_{x\in K} |f_{j'_n}(x) - f(x)| \le M \right\} \right) \le \lim_{N\to\infty} P\left( \sup_{x\in K} |f_{j'_N}(x) - f(x)| \le M \right) \le 1 - \varepsilon,$$
which stands in contradiction with the previous conclusion. This argument is known as the subsequence criterion (Kallenberg, 2002, Lemma 3.2).

Remark. Note that a more direct and elementary proof of the previous result was given by (Pollard, 1991, Convexity Lemma).

Next, we use this result to infer about the minimizers themselves.

Corollary 2.1.0.3. Let $C \subseteq \mathbb{R}^p$ be open and let $\{f_n\}_n$ be a sequence of random convex functions on $C$ such that
$$f_n(x) \to_P f(x)$$
for every $x \in C$, where $f$ is a real (random) function on $C$. Further, assume that $f$ is uniquely minimized at $\alpha \in C$ and let $\{\alpha_n\}_n$ be a sequence of minimizers of $\{f_n\}_n$, i.e.
$$\alpha_n \in \arg\min_{x\in C} f_n(x), \qquad n \in \mathbb{N}.$$
Then,
$$\alpha_n \to_P \alpha.$$

Proof. The proof rests on (Hjort and Pollard, 1993, Lemma 2) and subsequent remarks. For arbitrary $\delta > 0$, define
$$\Delta_n(\delta) = \sup_{\|x-\alpha\|\le\delta} |f_n(x) - f(x)|$$
and
$$h(\delta) = \inf_{\|x-\alpha\|=\delta} f(x) - f(\alpha).$$
First, we show that
$$P(\|\alpha_n - \alpha\| \ge \delta) \le P\left( \Delta_n(\delta) \ge \tfrac{1}{2} h(\delta) \right). \quad (2.1.0.2)$$
For every $x \in \mathbb{R}^p$ with $\|x - \alpha\| > \delta$ there exist some $c$ with $\|c - \alpha\| = \delta$ and $t \in [0, 1]$ such that $c = tx + (1-t)\alpha$. By convexity of the functions $f_n$, we have
$$f_n(c) \le t f_n(x) + (1-t) f_n(\alpha)$$
for every $n \in \mathbb{N}$. Writing $r_n(x) = f(x) - f_n(x)$, we obtain
$$t(f_n(x) - f_n(\alpha)) \ge f_n(c) - f_n(\alpha) = f(c) - f(\alpha) + r_n(c) - r_n(\alpha) \ge h(\delta) - 2\Delta_n(\delta).$$
In particular,
$$\left\{ \Delta_n(\delta) < \tfrac{1}{2} h(\delta) \right\} \subseteq \{ f_n(x) > f_n(\alpha) \}$$
for every $x$ with $\|x - \alpha\| > \delta$ and every $n \in \mathbb{N}$. Since $\alpha_n$ minimizes $f_n$, this implies that
$$\{ \|\alpha_n - \alpha\| \ge \delta \} \subseteq \left\{ \Delta_n(\delta) \ge \tfrac{1}{2} h(\delta) \right\}$$
and (2.1.0.2) follows.

Next, we show that
$$\Delta_n(\delta) = o_P(1). \quad (2.1.0.3)$$
Let $\varepsilon > 0$. For arbitrary $M > 0$ we have
$$P(\Delta_n(\delta) > \varepsilon) = P(\Delta_n(\delta) > \varepsilon; \|\alpha\| \le M) + P(\Delta_n(\delta) > \varepsilon; \|\alpha\| > M) \le P\left( \sup_{\|x\| \le \delta + M} |f_n(x) - f(x)| > \varepsilon \right) + P(\|\alpha\| > M) = o(1) + P(\|\alpha\| > M)$$
by Theorem 2.1.0.2. Letting $M$ tend to infinity completes the argument.

Now, we can prove
$$\alpha_n \to_P \alpha.$$
Let $\delta > 0$. For fixed arbitrary $M > 0$, we have
$$P(\|\alpha_n - \alpha\| > \delta) \le P\left( \Delta_n(\delta) \ge \tfrac{1}{2} h(\delta) \right) = P\left( \Delta_n(\delta) \ge \tfrac{1}{2} h(\delta); \tfrac{1}{h(\delta)} \le M \right) + P\left( \Delta_n(\delta) \ge \tfrac{1}{2} h(\delta); \tfrac{1}{h(\delta)} > M \right) \le P\left( \Delta_n(\delta) \ge \tfrac{1}{2M} \right) + P\left( \tfrac{1}{h(\delta)} > M \right).$$
For fixed $M$, the first term tends to zero as $n$ tends to infinity by (2.1.0.3). Then, the second term tends to zero as $M$ tends to infinity. Indeed, $h(\delta)^{-1}$ is almost surely finite by uniqueness of $\alpha$. This completes the proof.
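For orientation, here is the prototypical instance of this corollary, anticipating Chapter 3 (a sketch; the assumptions $n^{-1}\sum_i x_i x_i' \to C$ positive definite and $\lambda_n/n \to \lambda_0 \ge 0$ are those of Knight and Fu (2000)). The rescaled Lasso objective satisfies, pointwise in $\phi$ by the law of large numbers,
$$f_n(\phi) = \frac{1}{n}\sum_{i=1}^{n} (x_i'\phi - Y_i)^2 + \frac{\lambda_n}{n}\sum_{j=1}^{p} |\phi_j| \;\to_P\; f(\phi) = \sigma^2 + (\phi-\beta)'C(\phi-\beta) + \lambda_0\sum_{j=1}^{p} |\phi_j|;$$
each $f_n$ is convex, and $f$ is uniquely minimized since $C$ is positive definite. The corollary then yields consistency of the Lasso estimator, $\hat{\beta}_n \to_P \arg\min f$, which equals $\beta$ whenever $\lambda_0 = 0$.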

2.2 Convergence in distribution

The goal of this section is to derive conditions under which a sequence of minimizers of convex objective functions converges in distribution, and to provide means to determine this limit. At first, the concept of weak convergence must be revisited to encompass not necessarily measurable maps defined on probability spaces; this is a feature typically exhibited by argmin functionals. This wider concept of weak convergence was originally introduced by Hofmann-Jørgensen but first exposited in Dudley (1985) and Pollard (1990). However, in the present section we follow the more mature exposition of Van der Vaart and Wellner (1996). Indeed, they showed that even in this general setting, most important results from weak convergence theory, from the portmanteau theorem through Prohorov's theorem to the almost sure representation theorem, remain valid, provided one makes necessary, but essentially minor, modifications. The section ends with the argmin continuous mapping theorem, which will be applied to the Lasso estimator in the next chapter.

Definition 2.2.0.4. Let $(\Omega, \mathcal{A}, P)$ be a probability space and $T : \Omega \to \overline{\mathbb{R}}$ be an arbitrary map.

(i) The outer expectation and the inner expectation of $T$ with respect to $P$ are defined as
$$E^*(T) = \inf\{E(U) \mid U \ge T,\; U : \Omega \to \overline{\mathbb{R}} \text{ measurable and } E(U) \text{ exists}\} \quad (2.2.0.4)$$
and
$$E_*(T) = \sup\{E(U) \mid U \le T,\; U : \Omega \to \overline{\mathbb{R}} \text{ measurable and } E(U) \text{ exists}\}, \quad (2.2.0.5)$$
respectively.

(ii) The outer probability and the inner probability of an arbitrary set $B \subset \Omega$ are defined as
$$P^*(B) = \inf\{P(A) \mid A \supset B,\; A \in \mathcal{A}\} \quad (2.2.0.6)$$
and
$$P_*(B) = \sup\{P(A) \mid A \subset B,\; A \in \mathcal{A}\}, \quad (2.2.0.7)$$
respectively.
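Two standard consequences of these definitions (stated here for orientation; they follow directly from complementation of measurable covers) may help fix ideas:
$$E^*(1_B) = P^*(B), \qquad P^*(B) + P_*(B^c) = 1 \quad \text{for every } B \subset \Omega.$$
In particular, outer and inner probability coincide with $P$ on $\mathcal{A}$, so they can differ only on nonmeasurable sets.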

It turns out that outer/inner integrals and outer/inner probabilities are indeed attained at measurable maps and sets, respectively, as stated in

Lemma 2.2.0.5. (Van der Vaart and Wellner, 1996, Lemma 1.2.1) Let $(\Omega, \mathcal{A}, P)$ be a probability space. For an arbitrary map $T : \Omega \to \overline{\mathbb{R}}$, there exist measurable functions $T^*, T_* : \Omega \to \overline{\mathbb{R}}$ with

(i) $T^* \ge T$;
(ii) $T^* \le U$ a.s. for every measurable $U : \Omega \to \overline{\mathbb{R}}$ with $U \ge T$ a.s.;
(iii) $T_* \le T$;
(iv) $T_* \ge U$ a.s. for every measurable $U : \Omega \to \overline{\mathbb{R}}$ with $U \le T$ a.s.

For every such $T^*$ and $T_*$, it holds that $E^*(T) = E(T^*)$ and $E_*(T) = E(T_*)$, respectively, provided that $E(T^*)$, respectively $E(T_*)$, exists.

We call $T^*$ and $T_*$ the minimal measurable majorant and the maximal measurable minorant of $T$, respectively.

2.2.1 Weak convergence in metric spaces

In the remainder, let $(D, d)$ and $(E, e)$ denote metric spaces.

Definition 2.2.1.1. (Weak convergence) Let $(\Omega_n, \mathcal{A}_n, P_n)$, $n \in \mathbb{N}$, be probability spaces and let $\{X_n : \Omega_n \to D\}_n$ be a sequence of arbitrary random maps. $\{X_n\}_n$ converges weakly to a Borel measure $L$ if
$$E^*(f(X_n)) \to \int f \, dL \quad \text{for every } f \in C_b(D).$$
This is denoted by $X_n \rightsquigarrow L$.


Remark. Weak convergence is also defined for convergence toward a Borel measurable map $X$ defined on a probability space $(\Omega, \mathcal{A}, P)$ by taking $L := P \circ X^{-1}$; this is denoted by $X_n \rightsquigarrow X$. Note that, while measurability of the maps $X_n$ has been dropped, it remains mandatory for the limit $X$.

The portmanteau theorem and the continuous mapping theorem are essential tools in asymptotic statistics. They remain valid in this general framework. Their proofs are omitted but can be found in the references.

Theorem 2.2.1.2. (Van der Vaart and Wellner, 1996, Theorem 1.3.4) Let $\{X_n\}_n$ be a sequence of arbitrary random maps and let $L$ be a Borel measure. The following statements are equivalent:

(i) $X_n \rightsquigarrow L$;
(ii) $\liminf P_*(X_n \in G) \ge L(G)$ for every open $G$;
(iii) $\limsup P^*(X_n \in F) \le L(F)$ for every closed $F$;
(iv) $\lim P^*(X_n \in B) = \lim P_*(X_n \in B) = L(B)$ for every Borel $L$-continuous set $B$, that is, $B$ with $L(\partial B) = 0$;
(v) $\liminf E_* f(X_n) \ge \int f \, dL$ for every bounded, Lipschitz continuous, nonnegative function $f$.

Theorem 2.2.1.3. (Van der Vaart and Wellner, 1996, Theorem 1.3.6) Let $g : D \to E$ be continuous at every point in a set $D_0 \subset D$. If $X_n \rightsquigarrow X$ and $X$ takes its values in $D_0$, then $g(X_n) \rightsquigarrow g(X)$.

Definition 2.2.1.4. A Borel probability measure $L$ is called tight if for every $\varepsilon > 0$ there exists a compact set $K \subset D$ with $L(K) \ge 1 - \varepsilon$. A Borel measurable map $X : \Omega \to D$ is called tight if its law $P \circ X^{-1}$ is tight.

Next, we want to state Prohorov's theorem, which is a fundamental result in weak convergence theory: it relates sequential compactness (with respect to weak convergence) to the concept of tightness, hence giving conditions for the existence of a weak limit. In the present setting, the statement of the result requires introducing the concept of asymptotic measurability first.

Definition 2.2.1.5. (Asymptotic measurability and tightness) A sequence of arbitrary random maps $\{X_n : \Omega_n \to D\}_n$ is asymptotically measurable if and only if
$$E^* f(X_n) - E_* f(X_n) \to 0$$
for every $f \in C_b(D)$.

The sequence $\{X_n\}_n$ is called asymptotically tight if for every $\varepsilon > 0$ there exists a compact set $K$ such that
$$\liminf_{n\to\infty} P_*(X_n \in K^\delta) \ge 1 - \varepsilon \quad \text{for every } \delta > 0.$$
Here, $K^\delta = \{y \in D \mid d(y, K) < \delta\}$ is the $\delta$-enlargement around $K$, an open set.

Remark. One can show that for a sequence $\{X_n\}_n$ of Borel measurable maps, asymptotic tightness and the usual uniform tightness, that is, the existence for every $\varepsilon > 0$ of a compact set $K \subset D$ satisfying $P(X_n \in K) \ge 1 - \varepsilon$ for every $n \in \mathbb{N}$, are equivalent.
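In the Euclidean case this can be made concrete (a standard reformulation, supplied here for orientation): for Borel measurable maps $X_n$ into $\mathbb{R}^k$, asymptotic tightness amounts to
$$\lim_{M\to\infty} \limsup_{n\to\infty} P(\|X_n\| > M) = 0,$$
i.e. to $X_n = O_P(1)$ in the notation of the Notation section, since the compact sets of $\mathbb{R}^k$ are exactly the closed bounded sets.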


Lemma 2.2.1.6. (Van der Vaart and Wellner, 1996, Lemma 1.3.8)

(i) If $X_n \rightsquigarrow X$, then $\{X_n\}_n$ is asymptotically measurable.
(ii) If $X_n \rightsquigarrow X$, then $\{X_n\}_n$ is asymptotically tight if and only if $X$ is tight.

Definition 2.2.1.7.

(i) A vector lattice $\mathcal{F} \subset C_b(D)$ is a vector space that is closed under taking positive parts, that is, if $f \in \mathcal{F}$ then $f^+ = f \vee 0 \in \mathcal{F}$.
(ii) An algebra $\mathcal{F} \subset C_b(D)$ is a vector space that is closed under taking products, i.e. if $f, g \in \mathcal{F}$, then $fg : x \mapsto f(x)g(x) \in \mathcal{F}$.
(iii) A set $\mathcal{F}$ of functions on $D$ is said to separate points of $D$ if, for every pair $x \ne y$, there is an $f \in \mathcal{F}$ with $f(x) \ne f(y)$.

Lemma 2.2.1.8. (Van der Vaart and Wellner, 1996, Lemma 1.3.12)

(i) Let $L_1$ and $L_2$ be finite Borel measures on $D$. If $\int f \, dL_1 = \int f \, dL_2$ for every $f \in C_b(D)$, then $L_1 = L_2$.
(ii) Let $L_1$ and $L_2$ be tight Borel probability measures on $D$. If $\int f \, dL_1 = \int f \, dL_2$ for every $f$ in a vector lattice $\mathcal{F} \subset C_b(D)$ that contains the constant functions and separates points of $D$, then $L_1 = L_2$.

Lemma 2.2.1.9. (Van der Vaart and Wellner, 1996, Lemma 1.3.13) Let $\{X_n : \Omega_n \to D\}_n$ be a sequence of arbitrary maps. Suppose that $\{X_n\}_n$ is asymptotically tight and that
$$E^* f(X_n) - E_* f(X_n) \to 0 \quad (2.2.1.1)$$
for every $f$ in a subalgebra $\mathcal{F} \subset C_b(D)$ that separates points of $D$. Then $\{X_n\}_n$ is asymptotically measurable.

Lemma 2.2.1.10. (Van der Vaart and Wellner, 1996, Lemma 1.4.3) Sequences $\{X_n : \Omega_n \to D\}_n$ and $\{Y_n : \Omega_n \to E\}_n$ of arbitrary random maps are asymptotically tight if and only if the same is true for $(X_n, Y_n) : \Omega_n \to D \times E$.

Lemma 2.2.1.11. (Van der Vaart and Wellner, 1996, Lemma 1.4.4) Asymptotically tight sequences $\{X_n : \Omega_n \to D\}_n$ and $\{Y_n : \Omega_n \to E\}_n$ are asymptotically measurable if and only if the same is true for $(X_n, Y_n) : \Omega_n \to D \times E$.

Remark. Both previous results remain true for finitely many or countably many sequences of maps.

Theorem 2.2.1.12. (Prohorov's theorem; Van der Vaart and Wellner, 1996, Theorem 1.3.9) Let $\{X_n : \Omega_n \to D\}_n$ be a sequence of arbitrary random maps. If $\{X_n\}_n$ is asymptotically tight and asymptotically measurable, then it has a subsequence $\{X_{j_n}\}_n$ that converges weakly to a tight Borel law.
converges weakly to a tight Borel law.


2.2.2 Bounded and locally bounded functions

Uniformly bounded functions

We are ultimately interested in convex functions, which are locally bounded, so from now on we will first focus on
$$(D, d) = (l^\infty(T), \|\cdot\|_T),$$
that is, the space of bounded functions $z : T \to \mathbb{R}$ on an arbitrary set $T$ with the supremum norm $\|z\|_T = \sup_{t\in T} |z(t)|$. Later, we will see that asymptotic tightness of locally bounded processes can be characterized in terms of asymptotic tightness of restrictions on compacta.

The next result reformulates in this context the abstract characterization of asymptotic measurability and uniqueness of Borel measures given in Lemmas 2.2.1.9 and 2.2.1.8. They give a first hint at the important role played by marginal distributions.

Lemma 2.2.2.1. (Van der Vaart and Wellner, 1996, Lemma 1.5.2 and Lemma 1.5.3)

(i) Let $\{X_n : \Omega_n \to l^\infty(T)\}_n$ be asymptotically tight. Then it is asymptotically measurable if and only if $\{X_n(t)\}_n$ is asymptotically measurable for every $t \in T$.
(ii) Let $X$ and $Y$ be tight Borel measurable maps into $l^\infty(T)$. Then $X$ and $Y$ are equal in Borel law if and only if all corresponding marginals of $X$ and $Y$ are equal in law.

Proof. One easily verifies that the set $\mathcal{F} \subset C_b(l^\infty(T))$ of continuous functions $f$ of the form $f(z) = g(z(t_1), \ldots, z(t_k))$ forms an algebra and a vector lattice, that it contains the constant functions and that it separates points of $l^\infty(T)$.

(i) Note that for $M > 0$, $\{\|X_n\|_T \le M\} \subset \{|X_n(t)| \le M\}$ for every $t \in T$. In particular, asymptotic tightness of a sequence $\{X_n : \Omega_n \to l^\infty(T)\}_n$ implies asymptotic tightness of $\{X_n(t)\}_n$ for every $t \in T$. Hence, the necessity of asymptotic measurability of $\{X_n(t)\}_n$ for every $t \in T$ follows directly from the fact that the coordinate maps $\pi_t : l^\infty(T) \to \mathbb{R}$, $t \in T$, are continuous. For the sufficiency, note first that, given the assumption and the fact that $\mathcal{F}$ defined above forms a vector lattice containing constants and separating points, it follows from Lemma 2.2.1.11 that every sequence $\{(X_n(t_1), \ldots, X_n(t_k)) : \Omega_n \to \mathbb{R}^k\}_n$ is asymptotically measurable. The claim is then a direct consequence of Lemma 2.2.1.9 with $\mathcal{F}$ given above.

(ii) If $X$ and $Y$ have equal Borel laws on $l^\infty(T)$, then their marginals are also equal in law, since the coordinate maps $\pi_{t_1,\ldots,t_k} : l^\infty(T) \to \mathbb{R}^k$ are Borel measurable. The converse statement follows from Lemma 2.2.1.8.

Theorem 2.2.2.2. (Van der Vaart and Wellner, 1996, Lemma 1.5.4) Let $\{X_n : \Omega_n \to l^\infty(T)\}_n$ be a sequence of arbitrary random maps. Then $\{X_n\}_n$ converges weakly to a tight limit if and only if $\{X_n\}_n$ is asymptotically tight and, for every finite subset $t_1, \ldots, t_k$ of $T$, the sequence of marginals $\{(X_n(t_1), \ldots, X_n(t_k))\}_n$ converges weakly to a tight limit. If $\{X_n\}_n$ is asymptotically tight and its marginals converge to the marginals $(X(t_1), \ldots, X(t_k))$ of a stochastic process $X$, then there is a version of $X$ with uniformly bounded sample paths and $X_n \rightsquigarrow X$.

Proof. Coordinate maps $\pi_{t_1,\ldots,t_k}$ are continuous, so it follows from the continuous mapping Theorem 2.2.1.3 that for every subsequence $\{j_n\}_n$ and every finite subset $t_1, \ldots, t_k$ of $T$, $X_{j_n} \rightsquigarrow L_{j_n}$ implies $(X_{j_n}(t_1), \ldots, X_{j_n}(t_k)) \rightsquigarrow L_{j_n} \circ \pi^{-1}_{(t_1,\ldots,t_k)}$. In particular, if the whole sequence converges weakly to a tight limit, the same is true for every sequence of marginals; asymptotic tightness follows from Lemma 2.2.1.6.

Conversely, suppose that $\{X_n\}_n$ is asymptotically tight and that the marginals converge. Then it follows from Lemma 2.2.2.1, through Lemma 2.2.1.11, that $\{X_n\}_n$ is asymptotically measurable. Asymptotic measurability and tightness are kept under subsequencing; hence, by Prohorov's theorem 2.2.1.12, every subsequence $\{X_{j_n}\}_n$ converges weakly to a tight limit, say $L_{j_n}$, along a further subsequence. It remains to show that these limits are the same. Invoke the continuous mapping theorem again to show that $X_{j_n} \rightsquigarrow L_{j_n}$ implies $(X_{j_n}(t_1), \ldots, X_{j_n}(t_k)) \rightsquigarrow L_{j_n} \circ \pi^{-1}_{(t_1,\ldots,t_k)}$. By assumption, the marginals $(X_{j_n}(t_1), \ldots, X_{j_n}(t_k))$ share the same weak limit, namely the weak limit of $(X_n(t_1), \ldots, X_n(t_k))$; uniqueness now follows from the second part of Lemma 2.2.2.1 applied to the measures $L_{j_n}$. Finally, if $\{X_n\}_n$ is asymptotically tight and its marginals converge to the marginals of some stochastic process $X$, then $\{X_n\}_n$ converges to some Borel measurable process $Y : \Omega_Y \to l^\infty(T)$ which has the same marginal distributions as $X$, again by the second part of Lemma 2.2.2.1. This completes the proof.

Definition 2.2.2.3. A sequence $\{X_n : \Omega_n \to l^\infty(T)\}_n$ of arbitrary random maps is asymptotically uniformly $\rho$-equicontinuous in probability if for every $\varepsilon, \eta > 0$ there exists a $\delta > 0$ such that
$$\limsup_{n\to\infty} P^*\left( \sup_{\rho(s,t)<\delta} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta.$$

Theorem 2.2.2.4. (Van der Vaart and Wellner, 1996, Theorem 1.5.7)

(i) A sequence $\{X_n : \Omega_n \to l^\infty(T)\}_n$ of arbitrary random maps is asymptotically tight if and only if $\{X_n(t)\}_n$ is asymptotically tight in $\mathbb{R}$ for every $t$ and there exists a semimetric $\rho$ on $T$ such that $(T, \rho)$ is totally bounded and $\{X_n\}_n$ is asymptotically uniformly $\rho$-equicontinuous in probability.
(ii) If, moreover, $X_n \rightsquigarrow X$, then almost all paths $t \mapsto X(t, \omega)$ are uniformly $\rho$-continuous; and the semimetric $\rho$ can without loss of generality be taken equal to any semimetric $\rho$ for which this is true and $(T, \rho)$ is totally bounded.

Proof. We prove sufficiency of the conditions in (i); see the reference for the other direction and for the second part. Suppose that $\{X_n(t)\}_n$ is asymptotically tight in $\mathbb{R}$ for every $t \in T$ and that there exists a semimetric $\rho$ on $T$ such that $(T, \rho)$ is totally bounded and $\{X_n\}_n$ is asymptotically uniformly $\rho$-equicontinuous in probability. Using total boundedness of $(T, \rho)$, then disjointification of sets, we conclude that for every $\varepsilon, \eta > 0$ there is a finite partition $T = \cup_{i=1}^{k} T_i$ such that
$$\limsup_{n\to\infty} P^*\left( \sup_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| > \varepsilon \right) < \eta. \quad (2.2.2.1)$$


For such a partition and an arbitrary choice of points $t_i \in T_i$, $i = 1, \ldots, k$, we have
$$\liminf_{n\to\infty} P_*\left( \|X_n\|_T \le \max_{i=1,\ldots,k} |X_n(t_i)| + \varepsilon \right) \ge \liminf_{n\to\infty} P_*\left( \sup_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon \right) \ge 1 - \eta.$$
The maximum of finitely many asymptotically tight sequences of real valued maps is asymptotically tight (consider the finite union of compacts), so it follows that $\{\|X_n\|_T\}_n$ is asymptotically tight in $\mathbb{R}$.

Fix $\zeta > 0$ and consider a sequence $\varepsilon_m \searrow 0$. Take a constant $M > 0$ with
$$\limsup_{n\to\infty} P^*\left( \|X_n\|_T > M \right) < \zeta$$
by invoking asymptotic tightness, and for each $\varepsilon = \varepsilon_m$ and $\eta = 2^{-m}\zeta$, take a partition $T = \cup_{i=1}^{k_m} T_i$ satisfying (2.2.2.1). Fix $m$ for the moment and consider the set
$$Z_m = \left\{ z \in l^\infty(T) \;\middle|\; \text{for each } i = 1, \ldots, k_m,\; z_{|T_i} \equiv j\varepsilon_m \text{ for some } j \in \{-\lfloor M/\varepsilon_m \rfloor, \ldots, \lfloor M/\varepsilon_m \rfloor\} \right\}.$$
$Z_m$ is obviously finite; denote its elements by $z_1, \ldots, z_p$ and define
$$K_m = \bigcup_{i=1,\ldots,p} B(z_i, \varepsilon_m).$$
Then the two conditions
$$\|X_n\|_T \le M \quad \text{and} \quad \sup_i \sup_{s,t\in T_i} |X_n(s) - X_n(t)| \le \varepsilon_m$$
imply that $X_n \in K_m$: an explicit $z \in Z_m$ with $\|X_n - z\|_T \le \varepsilon_m$ is given by setting $z_{|T_i} \equiv c$ for some $c \in \left( \inf_{s\in T_i} |X_n(s)|, \sup_{s\in T_i} |X_n(s)| \right) \cap \{0, \pm\varepsilon_m, \ldots, \pm\lfloor M/\varepsilon_m \rfloor\varepsilon_m\}$. This is true for every $m$. Let $K = \cap_{m=1}^{\infty} K_m$. $K$ is closed, and it is totally bounded since $\varepsilon_m \searrow 0$; we are working in a metric space, so it follows that $K$ is compact. Now we show that for every $\delta > 0$ there is an $m$ with $K^\delta \supset \cap_{i=1}^{m} K_i$. Indeed, if this is not true, then there exists a sequence $\{z_m\}_m$ not contained in $K^\delta$ but with $z_m \in \cap_{i=1}^{m} K_i$ for every $m$. This sequence has a first subsequence contained in one of the balls making up $K_1$, then a further subsequence contained in one of the balls making up $K_2$. By the usual diagonal argument it follows that there is a subsequence which is contained in a single ball of radius $\varepsilon_m$ for every $m$, hence Cauchy. Since $K$ is closed, its limit is contained in $K$, which stands in contradiction to the fact that $\{z_m\}_m \subset l^\infty(T) \setminus K^\delta$. It follows that
$$\{X_n \notin K^\delta\} \subset \left\{ X_n \notin \bigcap_{i=1}^{m} K_i \right\}$$
for some fixed $m$. For this $m$, we have
$$\limsup_{n\to\infty} P^*\left( X_n \notin K^\delta \right) \le \limsup_{n\to\infty} P^*\left( X_n \notin \bigcap_{i=1}^{m} K_i \right) \le \limsup_{n\to\infty} P^*(\|X_n\|_T > M) + \sum_{i=1}^{m} P^*\left( \sup_j \sup_{s,t\in T_j} |X_n(s) - X_n(t)| > \varepsilon_i \right) \le \zeta + \sum_{i=1}^{m} \zeta 2^{-i} < 2\zeta,$$
which concludes the proof.


Locally bounded functions

Next, for an arbitrary index set $T$ and a covering sequence $T_1 \subset T_2 \subset \ldots$ of subsets ($T = \cup_{i=1}^{\infty} T_i$), denote by $l^\infty(T_1, T_2, \ldots)$ the set of functions $z$ uniformly bounded on each $T_i$ and equip it with the metric
$$d(z_1, z_2) = \sum_{i=1}^{\infty} \left( \|z_1 - z_2\|_{T_i} \wedge 1 \right) 2^{-i}.$$
It turns out that weak convergence with respect to this topology can be characterized in terms of weak convergence of the restrictions to the sets $T_i$.

Remark. When $T = \mathbb{R}^p$ and the sets $T_i$ are chosen equal to closed balls $B(0, i)$ or cubes $[-i, i]^p$ ($i \in \mathbb{N}$), $d$ induces the topology of uniform convergence on compacta. We will consider this case at the end of this chapter.
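A routine check (supplied here for concreteness) shows why this metric captures exactly uniform convergence on every $T_i$: each summand is bounded by $2^{-i}$, so the tail of the series can be made arbitrarily small, and hence
$$d(z_n, z) \to 0 \iff \|z_n - z\|_{T_i} \to 0 \text{ for every } i \in \mathbb{N}.$$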

Theorem 2.2.2.5. (Van der Vaart and Wellner, 1996, Theorem 1.6.1) Let $\{X_n : \Omega_n \to l^\infty(T_1, T_2, \ldots)\}_n$ be a sequence of arbitrary maps. $\{X_n\}_n$ converges weakly to a tight limit if and only if each of the sequences of restrictions $\{X_{n|T_i} : \Omega_n \to l^\infty(T_i)\}_n$ ($i \in \mathbb{N}$) converges weakly to a tight limit.

Proof. The restriction maps
$$l^\infty(T_1, T_2, \ldots) \to l^\infty(T_i), \qquad z \mapsto z_{|T_i},$$
$i \in \mathbb{N}$, are continuous, so the necessity of convergence for all sequences of restrictions follows from the continuous mapping theorem.

Conversely, suppose that for every $i \in \mathbb{N}$, $\{X_{n|T_i} : \Omega_n \to l^\infty(T_i)\}_n$ converges weakly to a tight limit. Recall that weak convergence implies asymptotic tightness. For fixed $\varepsilon > 0$, find for every $i$ a compact set $K_i \subset l^\infty(T_i)$ such that $\limsup_{n\to\infty} P^*\left( X_{n|T_i} \notin K_i^\delta \right) < \varepsilon 2^{-i}$ for every $\delta > 0$. Then:

(i) Define
$$K = \left\{ z : T \to \mathbb{R} \;\middle|\; z_{|(T_i \setminus T_{i-1})} \in (K_i)_{|T_i \setminus T_{i-1}} \text{ for every } i \in \mathbb{N} \right\} \subset l^\infty(T_1, T_2, \ldots).$$

$K$ is compact. Indeed, let $\{z_n\}_n$ be a sequence in $K$. For every $i \in \mathbb{N}$ there is a sequence $\{z_n^{(i)}\}_n$ in $K_i$ such that $(z_n)_{|T_i \setminus T_{i-1}} = (z_n^{(i)})_{|T_i \setminus T_{i-1}}$ for all $n \in \mathbb{N}$. By compactness of $K_i$, every subsequence of $\{z_n^{(i)}\}_n$ has a convergent subsequence. First, find a convergent subsequence $\{j_n^1\}_n$ of $\{z_n^{(1)}\}_n$, then extract a further subsequence $\{j_n^2\}_n$ for which $\{z_n^{(2)}\}_n$ also converges. Proceed further in this way with the remaining $i \ge 2$. We verify that the diagonal subsequence $\{z_{j_n^n}\}_n$ is convergent. To do so, first note that for deterministic functions $\{z_n\}_n \subseteq l^\infty(T_1, T_2, \ldots)$, the convergence of every sequence of restrictions in $l^\infty(T_i)$ implies the convergence in $l^\infty(T_1, T_2, \ldots)$. Indeed, for $n, m, i \in \mathbb{N}$ it holds that
$$\|z_{j_n^n} - z_{j_m^m}\|_{T_i} \le \|(z_{j_n^n} - z_{j_m^m})_{|T_1}\|_{T_1} + \sum_{l=2}^{i} \|(z_{j_n^n} - z_{j_m^m})_{|T_l \setminus T_{l-1}}\|_{T_l \setminus T_{l-1}} = \|(z^{(1)}_{j_n^n} - z^{(1)}_{j_m^m})_{|T_1}\|_{T_1} + \sum_{l=2}^{i} \|(z^{(l)}_{j_n^n} - z^{(l)}_{j_m^m})_{|T_l \setminus T_{l-1}}\|_{T_l \setminus T_{l-1}}.$$
Since every sequence $\{z^{(l)}_{j_n^n}\}_n$ is Cauchy in $l^\infty(T_l)$ by construction and
$$\|(z^{(l)}_{j_n^n} - z^{(l)}_{j_m^m})_{|T_l \setminus T_{l-1}}\|_{T_l \setminus T_{l-1}} \le \|z^{(l)}_{j_n^n} - z^{(l)}_{j_m^m}\|_{T_l}$$
for $l = 1, \ldots, i$, it follows that $\{(z_{j_n^n})_{|T_i}\}_n$ is Cauchy in $l^\infty(T_i)$.

(ii) Using compactness of the sets $K_i$ one easily verifies that for arbitrary $\delta > 0$, it holds that
$$\left\{ z \in l^\infty(T_1, T_2, \ldots) \;\middle|\; z_{|T_i} \in K_i^\delta \text{ for every } i \in \mathbb{N} \right\} \subseteq K^\delta.$$
This implies that
$$\limsup_{n\to\infty} P^*\left( X_n \notin K^\delta \right) \le \sum_{i=1}^{\infty} \limsup_{n\to\infty} P^*\left( X_{n|T_i} \notin K_i^\delta \right) < \varepsilon$$
for every $\delta > 0$; that is, $\{X_n\}_n$ is asymptotically tight in $l^\infty(T_1, T_2, \ldots)$.

Then, consider the set $\mathcal{F}$ of continuous functions $f : l^\infty(T_1, T_2, \ldots) \to \mathbb{R}$ of the form $f(z) = g(z(t_1), \ldots, z(t_k))$ with $g \in C_b(\mathbb{R}^k)$, $t_1, \ldots, t_k \in T$ and $k \in \mathbb{N}$. Obviously $\mathcal{F}$ is an algebra that separates points in $l^\infty(T_1, T_2, \ldots)$. Lemma 2.2.1.9 now implies that $\{X_n\}_n$ is asymptotically measurable in $l^\infty(T_1, T_2, \ldots)$.

(iii) Finally, Prohorov's theorem 2.2.1.12 asserts that every subsequence of $\{X_n\}_n$ has a further subsequence which converges weakly to a tight limit. This limit is unique in view of Lemma 2.2.1.8.

Corollary 2.2.2.6. (Weak convergence of convex processes) Let $\{X_n\}_n$ be a sequence of stochastic processes indexed by a convex, open subset $C$ of $\mathbb{R}^p$ such that every sample path $t \mapsto X_n(\omega, t)$ is convex on $C$. If $\{X_n\}_n$ converges marginally in distribution to a limit, then it converges in distribution to a tight limit in the space $l^\infty(K_1, K_2, \ldots)$ for any sequence of compact sets $K_1 \subset K_2 \subset \cdots \subset C$. In particular, it converges to a tight limit with respect to the topology of uniform convergence on compacta.

Proof.

(i) We show that for every compact $K \subset C$, there exists an $\varepsilon > 0$ such that $K^\varepsilon \subset K^{2\varepsilon} \subset C$. In particular, $\|f\|_{K^\varepsilon} = \sup_{x\in K^\varepsilon} |f(x)|$ is well defined for every convex function $f : C \to \mathbb{R}$. Indeed, for each $x \in K$ one can find an $\varepsilon_x$ such that $B(x, 2\varepsilon_x) \subset C$. Let $\cup_{i\in I} B(x_i, 2\varepsilon_{x_i})$ be a finite covering of $K$ with such balls. Set $\varepsilon = \min_{i\in I} \varepsilon_{x_i}$. One easily verifies that $K^{2\varepsilon} \subset \cup_{i\in I} B(x_i, 4\varepsilon)$ then holds.

(ii) Next, we show that a sequence $\{x_n : C \to \mathbb{R}\}_n$ of deterministic convex functions such that $\{x_n(t)\}_n$ is bounded for every $t$ is automatically uniformly bounded over $K^\varepsilon$, i.e. $\sup_n \|x_n\|_{K^\varepsilon} < \infty$. For such functions, $t \mapsto \sup_n |x_n(t)|$ is finite and convex on $C$, hence uniformly continuous on every compact subset of $C$. This implies that for arbitrary $\varepsilon' > 0$ there exists some $\delta > 0$ such that for every $t_0 \in K^\varepsilon$, $\sup_n |x_n(\cdot)| \le \sup_n |x_n(t_0)| + \varepsilon'$ on $B(t_0, \delta)$. Cover $K^\varepsilon$ by finitely many balls $B(t_i, \delta)$, $i = 1, \ldots, N$. Then it holds that $\sup_n \|x_n\|_{K^\varepsilon} \le \max_i \sup_n |x_n(t_i)| + \varepsilon'$. We can then conclude that the sequence $\{\|X_n\|_{K^\varepsilon}\}_n$ is asymptotically tight.


(iii) A bounded convex function $x$ on $K^\varepsilon$ is automatically Lipschitz on $K$ with Lipschitz constant $(2/\varepsilon)\|x\|_{K^\varepsilon}$. Indeed, for arbitrary $y_1, y_2 \in K$, set $z = y_1 + \eta(y_1 - y_2)/\|y_1 - y_2\|$ with $\eta < \varepsilon$. Then $z \in K^\varepsilon$ and $y_1 = \lambda z + (1 - \lambda)y_2$, $\lambda = \|y_1 - y_2\|/(\|y_1 - y_2\| + \eta) \in [0, 1]$. By convexity of $x$, we have $x(y_1) - x(y_2) \le \lambda(x(z) - x(y_2)) \le 2\lambda\|x\|_{K^\varepsilon}$, and $\lambda \le \|y_1 - y_2\|/\eta$; letting $\eta$ tend to $\varepsilon$ yields the claim.

(iv) We can now show that $\{X_n\}_n$ is asymptotically equicontinuous in $l^\infty(K)$. For arbitrary $\varepsilon_{equi} > 0$, $\eta_{equi} > 0$, choose $M > 0$ with
$$\limsup_{n\to\infty} P(\|X_n\|_{K^\varepsilon} > M) < \eta_{equi}$$
by invoking asymptotic tightness of $\{\|X_n\|_{K^\varepsilon}\}_n$. On $\{\|X_n\|_{K^\varepsilon} \le M\}$ it holds that $|X_n(t) - X_n(s)| \le (2/\varepsilon)M\|t - s\|$. Set $\delta = \varepsilon_{equi}((2/\varepsilon)M)^{-1}$. Then,
$$\left\{ \sup_{\|t-s\|<\delta} |X_n(t) - X_n(s)| > \varepsilon_{equi} \right\} \subseteq \{\|X_n\|_{K^\varepsilon} > M\}.$$
Asymptotic tightness of the restrictions $\{X_{n|K_i}\}_n$ now follows from Theorem 2.2.2.4, their weak convergence from Theorem 2.2.2.2 together with the assumed marginal convergence, and the claim from Theorem 2.2.2.5.

2.3 A continuous mapping theorem for argmin functionals

Definition 2.3.0.8. Let $X_n, X : \Omega \to D$, $n \in \mathbb{N}$, be arbitrary maps.

(i) $X_n$ converges outer almost surely to $X$ if $d(X_n, X)^* \to 0$ almost surely; this is denoted $X_n \to_{as*} X$.

(ii) $X_n$ converges almost uniformly to $X$ if for every $\varepsilon > 0$, there exists a measurable set $A$ with $P(A) \ge 1 - \varepsilon$ and $d(X_n, X) \to 0$ uniformly on $A$; this is denoted $X_n \to_{au} X$.

Almost uniform convergence is more convenient to prove, so we will make use of

Lemma 2.3.0.9. (Van der Vaart and Wellner, 1996, Lemma 1.9.2) Let $X$ be Borel measurable. Then $X_n \to_{au} X$ if and only if $X_n \to_{as*} X$.

Theorem 2.3.0.10. (Dudley's almost sure representation theorem) For probability spaces $(\Omega_n, \mathcal{A}_n, P_n)$, $n \in \mathbb{N} \cup \{\infty\}$, let $X_n : \Omega_n \to D$ be arbitrary maps and let $X_\infty : \Omega_\infty \to D$ be Borel measurable and separable. If $X_n \rightsquigarrow X_\infty$, then there exist a probability space $(\tilde{\Omega}, \tilde{\mathcal{A}}, \tilde{P})$ and maps $\tilde{X}_n : \tilde{\Omega} \to D$, $n \in \mathbb{N} \cup \{\infty\}$, with

(i) $\tilde{X}_n \to_{as*} \tilde{X}_\infty$;
(ii) $E^*(f(\tilde{X}_n)) = E^*(f(X_n))$ for every bounded $f : D \to \mathbb{R}$ and every $n \in \mathbb{N} \cup \{\infty\}$.

Moreover, each $\tilde{X}_n$ can be chosen as $\tilde{X}_n = X_n \circ \phi_n$ with $\phi_n$ measurable and perfect, and $P_n = \tilde{P} \circ \phi_n^{-1}$.

Proof. Call a set $B \subset D$ a continuity set if $P(X_\infty \in \partial B) = 0$.

(i) We first show that for every $\varepsilon > 0$, there exists a partition of $D$ into finitely many disjoint continuity sets $B_0^{(\varepsilon)}, B_1^{(\varepsilon)}, \ldots, B_{k_\varepsilon}^{(\varepsilon)}$ satisfying
$$P\left( X_\infty \in B_0^{(\varepsilon)} \right) < \varepsilon^2 \quad (2.3.0.2)$$
and
$$\mathrm{diam}(B_i^{(\varepsilon)}) < \varepsilon \quad (2.3.0.3)$$
for $i = 1, \ldots, k_\varepsilon$. Let $C \subset D$ be a separable subset for which $P(X_\infty \in C) = 1$ and let $\{s_i\}_i$ be a dense sequence in $C$. For each $i \in \mathbb{N}$, there are at most countably many values $r > 0$ for which the open ball $B(s_i, r)$ is a discontinuity set (every $\sigma$-finite measure space has at most countably many disjoint sets with positive measure). So, choose $\varepsilon/3 < r_i < \varepsilon/2$ such that $B(s_i, r_i)$ is a continuity set. Then for every $i$, set $B_i^{(\varepsilon)} = B(s_i, r_i) \setminus \bigcup_{j<i} B(s_j, r_j)$; these are disjoint continuity sets of diameter smaller than $\varepsilon$ which cover $C$. Choose $k_\varepsilon$ with $P\left( X_\infty \in \bigcup_{i \le k_\varepsilon} B_i^{(\varepsilon)} \right) > 1 - \varepsilon^2$, so set $B_0^{(\varepsilon)} = D \setminus \bigcup_{i \le k_\varepsilon} B_i^{(\varepsilon)}$.

(ii) There is a sequence $\varepsilon_n \to 0$, taking values $1/m$ ($m \in \mathbb{N}$) only, such that
$$P_{n*}\left( X_n \in B_i^{(\varepsilon_n)} \right) \ge (1 - \varepsilon_n)\, P_\infty\left( X_\infty \in B_i^{(\varepsilon_n)} \right), \qquad i = 1, \ldots, k_{\varepsilon_n}. \quad (2.3.0.4)$$
By the Portmanteau theorem, for fixed $\varepsilon > 0$ and $i = 1, \ldots, k_\varepsilon$, it holds that $P_{n*}\left( X_n \in B_i^{(\varepsilon)} \right) \to P_\infty\left( X_\infty \in B_i^{(\varepsilon)} \right)$. So, for $m \in \mathbb{N}$ choose $n(m)$ such that for $i = 1, \ldots, k_{1/m}$ and $n \ge n(m)$,
$$P_{n*}\left( X_n \in B_i^{(1/m)} \right) \ge (1 - 1/m)\, P_\infty\left( X_\infty \in B_i^{(1/m)} \right). \quad (2.3.0.5)$$
Next, assume without loss of generality that $1 = n(1) < n(2) < \ldots$, set $\gamma(n) = \max\{k : n(k) \le n\}$ and define $\varepsilon_n = 1/\gamma(n)$. The sequence $\varepsilon_n$ then satisfies (2.3.0.4) since $\gamma(n) \to \infty$ and $n(\gamma(n)) \le n$ by definition.

(iii) For i = 1, . . . , k εn , let A n i<br />

and set<br />

⊆ {X n ∈ B (εn)<br />

i } be measurable with<br />

(<br />

P n (A n i ) = P n∗ X n ∈ B (εn) )<br />

A n 0 = Ω n \<br />

k εn<br />

⋃<br />

i=1<br />

A n i .<br />

For each n ∈ N define a probability measure µ n on (Ω n , A n ) as<br />

µ n (A) = 1<br />

k<br />

∑ εn<br />

P n (A|A n i ) ( P n (A n i ) − (1 − ε n ) P ∞ (X ∞ ∈ Bi εn )) .<br />

ε n<br />

i=1<br />

where P n (·|A n i ) is <strong>the</strong> P n-conditional measure given A n i . Finally, define <strong>the</strong> probability<br />

space (˜Ω, Ã, ˜P ) as follows :<br />

⎛<br />

⎞<br />

˜Ω = Ω ∞ × ∏ k<br />

∏ εn<br />

⎝Ω n × A n ⎠<br />

i × [0, 1],<br />

n<br />

i=0<br />

⎛<br />

⎞<br />

à = A ∞ × ∏ k<br />

∏ εn<br />

⎝A × A n ∩ A n ⎠<br />

i × B o ,<br />

n i=0<br />

⎛<br />

⎞<br />

˜P = P ∞ × ∏ k<br />

∏ εn<br />

⎝µ n × P n (·|A n i ) ⎠ × λ.<br />

n<br />

Here, B o is <strong>the</strong> Borel σ-algebra and λ is <strong>the</strong> Lebesgue measure on [0, 1], respectively.<br />

i=0<br />

(iv) We now define <strong>the</strong> maps φ n and verify that ˜P ◦ φ −1<br />

n<br />

Define<br />

For A ∈ Ã, we have:<br />

˜ω = ( ω ∞ , . . . , ω, ω n0 , . . . , ω kɛn n, . . . , ξ ) .<br />

i<br />

= P n . Write elements ˜ω <strong>of</strong> ˜Ω as<br />

φ ∞ = ω ∞<br />

{<br />

ωn , if ξ > 1 − ε n ,<br />

φ n =<br />

ω ni , if ξ ≤ 1 − ε and X ∞ (ω ∞ ) ∈ B (ε)<br />

i .<br />

˜P (φ n ∈ A) = ˜P (φ n ∈ A; ξ > 1 − ε n ) + ˜P (φ n ∈ A; ξ ≤ 1 − ε n )<br />

k εn<br />

= ˜P<br />

∑ (<br />

(ω n ∈ A; ξ > 1 − ε n ) + ˜P ω ni ∈ A; ξ ≤ 1 − ε n ; X ∞ ∈ B (εn) )<br />

i<br />

i=0<br />

k<br />

∑ εn<br />

(<br />

= µ n (A) ε n + P n (A|A n i ) (1 − ε n )P ∞ X ∞ ∈ B (εn) )<br />

i<br />

i=0<br />

k εn<br />

∑<br />

= P n (A|A n i ) P n (A n i )<br />

i=0<br />

= P n (A)


18 Minimizers <strong>of</strong> convex processes<br />

(v) We show that ˜X n → au ˜X∞ . It follows from <strong>the</strong> construction <strong>of</strong> <strong>the</strong> maps ˜X n =<br />

X n ◦ φ n and <strong>of</strong> <strong>the</strong> balls B (εn)<br />

i<br />

that d( ˜X n , ˜X ∞ ) on {˜ω : X ∞ /∈ B o<br />

(εn) , ξ ≤ 1 − ε}. Set<br />

A k = ∪ m≥k {˜ω : X ∞ ∈ B 1/m<br />

0 or ξ > 1 − 1/k}. Note that ˜P (A k ) ≤ 1/k, so for every<br />

ε > 0 <strong>the</strong>re exists some k with ˜P (A k ) ≤ ε. For ˜ω ∈ ˜Ω\A k and n such that ε n ≤ 1/k,<br />

it holds that d( ˜X n , ˜X ∞ ) ≤ ε n . This completes <strong>the</strong> argument.<br />

(vi) Let T : Ω n → R be bounded, and let T ∗ be its minimal measurable cover with respect<br />

to P n . Write<br />

k<br />

∑ εn<br />

T ◦ φ n = 1 πξ ≤1−ε n<br />

T |A n<br />

i<br />

i=0<br />

◦ π n,i 1 X∞◦π ∞∈B (εn)<br />

i<br />

+ 1 πξ >1−ε n<br />

T ◦ π n .<br />

The coordinate projections are perfect so that <strong>the</strong> minimal measurable cover <strong>of</strong> (T ◦<br />

φ n ) with respect to ˜P can be computed as follows<br />

k<br />

(T ◦ φ n ) ∗ ∑ εn<br />

= 1 πξ ≤1−ε n<br />

(T |A n<br />

i<br />

◦ π n,i ) ∗ 1 X∞◦π∞∈B (εn) + 1 πξ >1−ε n<br />

(T ◦ π n ) ∗<br />

i=0<br />

k<br />

∑ εn<br />

= 1 πξ ≤1−ε n<br />

(T |A n<br />

i<br />

) ∗Pn[·|An i ] ◦ π n,i 1 X∞◦π∞∈B (εn) + 1 πξ >1−ε n<br />

(T ) ∗µn ◦ π n<br />

i=0<br />

Now note that P n and P n (·|A n i ) are equivalent on (An i , A n∩A n i ), P n and µ n are equivalent<br />

on (Ω n , A n ), hence <strong>the</strong> measurable covers in <strong>the</strong> last formula can be computed<br />

under P n . It follows that (T ◦ φ n ) ∗ = T ∗ ◦ φ n . This completes <strong>the</strong> pro<strong>of</strong>.<br />

i<br />

i<br />

In <strong>the</strong> following, denote by B loc (R p ) <strong>the</strong> space <strong>of</strong> locally bounded functions on R p equipped<br />

with <strong>the</strong> topology <strong>of</strong> uniform convergence on compacta; denote by C min (R p ) <strong>the</strong> separable<br />

subset <strong>of</strong> continuous functions z(·) with minimum achieved at a unique point in R p and<br />

satisfying z(t) → ∞ as |t| → ∞.<br />

Dudley’s Almost sure representation <strong>the</strong>orem is used to prove<br />

Theorem 2.3.0.11. (Argmin continuous mapping <strong>the</strong>orem)<br />

Let {X n : Ω n → B loc (R p )} n and {t n : Ω n → R p } n be sequences <strong>of</strong> random maps. If<br />

(i) X n X for X Borel measureable and concentrated on C min (R p );<br />

(ii) t n = O P (1);<br />

(iii) X n (t n ) ≤ inf t∈R p X n (t) + α n for random variables {α n } n <strong>of</strong> order o P (1);<br />

<strong>the</strong>n t n arg min(X).<br />

Pro<strong>of</strong>. Invoke Dudley’s almost sure representation Theorem 2.3.0.10 to find a probability<br />

space (˜Ω, Ã, ˜P ) and maps ˜X n : ˜Ω → B loc (R p ), ˜X : ˜Ω → Bloc (R p ) satisfying<br />

˜X n → as∗ ˜X (2.3.0.6)


2.3 A continous mapping <strong>the</strong>orem for argmin functionals 19<br />

and<br />

E ∗ (f( ˜X n )) = E ∗ (f(X n )) (2.3.0.7)<br />

for every f ∈ C b (B loc (R p )) and every n ∈ N. According to <strong>the</strong> same <strong>the</strong>orem, we can<br />

assume that <strong>the</strong> maps ˜X n take <strong>the</strong> form<br />

˜X n = X n ◦ φ n (2.3.0.8)<br />

for measurable perfect maps φ n : ˜Ω → Ω satisfying P n = ˜P ◦ φ −1<br />

n .<br />

Next, consider <strong>the</strong> maps ˜t n , ˜t and ˜α n obtained after composittion with φ n . Denote by ˜t<br />

<strong>the</strong> unique minimizer <strong>of</strong> ˜X, using a density argument, one can show that ˜t is measurable,<br />

hence has <strong>the</strong> same law as arg min(X). We need to prove that<br />

E ∗ f(t n ) − Ẽf(˜t) → 0 (2.3.0.9)<br />

for every f ∈ C b (R p ). By perfectness <strong>of</strong> <strong>the</strong> maps φ n , <strong>the</strong> difference is equal to<br />

Ẽ ∗ ( f(˜t n ) − f(˜t) ) ≤ Ẽ∗ (∣∣ f( ˜ t n ) − f(˜t) ∣ ∣ ) .<br />

To prove 2.3.0.9, first define for abitrary δ > 0<br />

{ }<br />

˜∆(δ) = inf ˜X(t) | ‖t − ˜t‖ > δ − ˜X(˜t).<br />

Then, for an arbitrary ε > 0, use tightness <strong>of</strong> {˜t n } n and uniqueness <strong>of</strong> ˜t to find R > 0,<br />

η > 0 satisfying :<br />

lim sup<br />

n→∞<br />

˜P ∗ ( ‖˜t n ‖ > R ) < ε<br />

and<br />

˜P<br />

(<br />

‖˜t‖ > R or ˜∆(δ)<br />

)<br />

< η < ε.<br />

The restriction <strong>of</strong> f on <strong>the</strong> closed ball B(0; R) is uniformly continuous by Heine-Borel’s<br />

Theorem. Since<br />

E ∗ (f(˜t n ) − f(˜t)) ≤ E ∗ (f(˜t n ) − f(˜t))1 { ‖˜t n ‖, ‖˜t‖ ≤ R } + 2 · 2‖f‖ |B(0;R) ε<br />

for n sufficiently large, it is sufficent to show that<br />

˜P ∗ ( ‖˜t n − ˜t‖ > δ; ‖˜t n ‖, ‖˜t‖ ≤ R ) → 0<br />

for each δ > 0.<br />

According to <strong>the</strong> Representation <strong>the</strong>orem, it holds that<br />

2 −R sup<br />

‖t‖≤R<br />

(<br />

‖ ˜X n (t) − ˜X(t)‖<br />

) ∗ (<br />

≤ d( ˜X ˜X)) ∗<br />

n , ≤ ε,<br />

for n sufficently large. Fur<strong>the</strong>rmore, it follows from<br />

˜X n (˜t) = ˜X n (t) + ˜X(t) − ˜X n (t)<br />

+ ˜X(˜t) − ˜X n (˜t)<br />

+ ˜X(˜t) − ˜X(t)


20 Minimizers <strong>of</strong> convex processes<br />

that<br />

{<br />

‖˜t‖, ‖t‖ ≤ R; ‖t − ˜t‖ > δ; ˜∆(δ)<br />

} {<br />

≤ η ⊆ ˜Xn (˜t) ≤ ˜X<br />

}<br />

n (t) + 2 R+1 ε − η . (2.3.0.10)<br />

The event<br />

{<br />

}<br />

η > ˜α n + 2 R+1 ε<br />

has probability tending to one as n → ∞, since ˜α n = o ˜P<br />

(1). Note that by perfectness <strong>of</strong><br />

<strong>the</strong> maps φ n , one also has<br />

˜X n (˜t n ) ≤ inf<br />

t<br />

˜X n (t) + ˜α n<br />

for every n ∈ N. Thus, 2.3.0.10 forces ˜t n , ei<strong>the</strong>r to be at a distance greater than R from<br />

zero, or to stay within a δ neighborhood <strong>of</strong> ˜t. We finally obtain<br />

˜P ∗ ( ‖˜t n − ˜t‖ > δ; ‖˜t n ‖, ‖˜t‖ ≤ R ) ≤ 3ε<br />

for n large enough. This completes <strong>the</strong> pro<strong>of</strong>.


Chapter 3<br />

Application to <strong>the</strong> <strong>Lasso</strong> estimator<br />

In this chapter, we first apply Theorem 2.1.0.2 and 2.3.0.11 to determine <strong>the</strong> limit in<br />

probability and in <strong>distribution</strong> <strong>of</strong> <strong>the</strong> <strong>Lasso</strong> estimator, defined as ˆβ n = arg min φ∈R p Z n (φ)<br />

for<br />

for a linear regression model<br />

Z n (φ) = 1 n∑<br />

(Y i − x ′<br />

n<br />

iφ) 2 + λ n<br />

p∑<br />

|φ j |. (3.0.0.1)<br />

n<br />

i=1<br />

j=1<br />

Y i = x ′ iβ + ε i , i = 1, . . . , n. (3.0.0.2)<br />

Then some continuity properties <strong>of</strong> <strong>the</strong> limit <strong>distribution</strong>s are derived and it is shown<br />

under assumption <strong>of</strong> an orthogonal design that <strong>the</strong> convergence in <strong>distribution</strong> is actually<br />

uniform.<br />

Throughout this chapter we make <strong>the</strong> following assumption on <strong>the</strong> model 3.0.0.2.<br />

and<br />

C n = 1 n X′ nX n → C (3.0.0.3)<br />

1<br />

n max<br />

1≤i≤n x′ ix i → 0 (3.0.0.4)<br />

3.1 Limit in probability<br />

To determine <strong>the</strong> limit in probability, <strong>the</strong> following law <strong>of</strong> large numbers will be needed.<br />

Theorem 3.1.0.12. (Kallenberg, 2002, Corollary 4.22) Let ξ 1 , ξ 2 , . . . be independent random<br />

variables with mean 0 satisfying<br />

∞∑<br />

n −2c E(ξi 2 ) < ∞<br />

i=1<br />

21


22 Application to <strong>the</strong> <strong>Lasso</strong> estimator<br />

for some c > 0.Then<br />

almost surely.<br />

1<br />

n<br />

n∑<br />

ξ i → 0<br />

i=1<br />

Theorem 3.1.0.13. (Convergence in probability) If C is nonsingular and λ n /n →<br />

λ 0 ≥ 0, <strong>the</strong>n<br />

where<br />

ˆβ n → P arg min(Z)<br />

p∑<br />

Z(φ) = (φ − β) ′ C(φ − β) ′ + λ 0 |φ j |.<br />

j=1<br />

Thus if λ n = o(n), arg min(Z) = β and so ˆβ n is consistent.<br />

Pro<strong>of</strong>. Maps Z n defined in 3.0.0.1 have convex sample paths and are minimized at ˆβ n .<br />

Sample paths <strong>of</strong> Z are strictly convex since C is nonsingular, hence have unique minimizers.<br />

Following Corollary 2.1.0.3, it is sufficient to show that Z n (φ) → P Z(φ) + σ 2 for every<br />

point φ ∈ R p .<br />

We have<br />

Set ξ i = ε i (β − φ) ′ x i with<br />

Z n (φ) = 1 n∑<br />

(Y i − x ′<br />

n<br />

iφ) 2 + λ n<br />

p∑<br />

|φ j |<br />

n<br />

i=1<br />

j=1<br />

= 1 n∑<br />

(x ′<br />

n<br />

iβ + ε i − x ′ iφ) 2 + λ n<br />

p∑<br />

|φ j |<br />

n<br />

i=1<br />

j=1<br />

( )<br />

1<br />

n∑<br />

= (β − φ) ′ x i x ′ i (β − φ) + 1 n∑<br />

ε i<br />

n<br />

n<br />

i=1<br />

i=1<br />

+ 2 n∑<br />

ε i (β − φ) ′ x i + λ n<br />

p∑<br />

|φ j |<br />

n<br />

n<br />

i=1<br />

j=1<br />

i=1<br />

.<br />

E(ξ 2 i ) = σ 2 |〈β − φ, x i 〉| 2<br />

≤ σ 2 ‖(β − φ)‖ 2 ‖x i ‖ 2 .<br />

Under assumption 3.0.0.4 i −1 ‖x i ‖ 2 is asymptotically bounded above by i −δ for some δ > 0,<br />

thus we obtain<br />

∞∑ 1<br />

∞<br />

i 2 E(ξ2 i ) ≤ σ 2 ‖(β − φ)‖ 2 ∑<br />

( ) 1 2<br />

i ‖x i‖ < ∞.<br />

Now it follows from <strong>the</strong>orem 3.1.0.12 that<br />

2<br />

n<br />

i=1<br />

n∑<br />

ε i (β − φ) ′ x i → 0<br />

i=1


3.2 Limit in <strong>distribution</strong> 23<br />

almost surely. Finally we have<br />

Z n (φ) → P (φ − β) ′ C(φ − β) ′ + σ 2 + λ 0<br />

as claimed.<br />

p∑<br />

|φ j |.<br />

j=1<br />

3.2 Limit in <strong>distribution</strong><br />

Asymptotic normality <strong>of</strong> independent, but not necessarily identically distributed random<br />

variables is usually proved by verifying <strong>the</strong> so called Lindeberg-Feller condition.<br />

Theorem 3.2.0.14. (Van der Vaart, 2000, Proposition 2.27) For each n ∈ N, let<br />

ξ n,1 , . . . , ξ n,kn<br />

be independent random vectors with finite variances Σ n,1 , . . . , Σ n,kn<br />

such that<br />

k n ∑<br />

i=1<br />

(<br />

)<br />

E ‖ξ n,i ‖ 2 1 {‖ξn,i ‖>ε} → 0 for every ε > 0<br />

and<br />

k n ∑<br />

i=1<br />

Σ n,i → Σ<br />

Then<br />

k n ∑<br />

i=1<br />

(<br />

)<br />

ξ n,i − E(ξ n,i ) N (0, Σ)<br />

The Lindeberg-Feller condition will also be used later to prove a partial asymptotic normality<br />

result for <strong>the</strong> adaptive <strong>Lasso</strong>.<br />

Theorem 3.2.0.15. (Convergence in <strong>distribution</strong>) If λ n / √ n → λ 0 ≥ 0 and C is<br />

nonsingular <strong>the</strong>n<br />

√ n( ˆβ − β) → arg min(V )<br />

where<br />

V (u) = −2u ′ W + u ′ Cu + λ 0<br />

for W ∼ N (0, σ 2 C).<br />

p∑<br />

(sgn(β j )I(β j ≠ 0) + |u j |I(β j = 0))<br />

j=1<br />

Pro<strong>of</strong>. Define V n : R p → R as<br />

V n (u) =<br />

n∑<br />

i=1<br />

(<br />

(ε 2 i − u ′ x i / √ ) p∑<br />

n) 2 − ε 2 (<br />

i + λ n |βj + u j / √ n| − |β j | ) ,<br />

j=1


24 Application to <strong>the</strong> <strong>Lasso</strong> estimator<br />

a convex function minimized at √ ( )<br />

n ˆβ n − β . We have<br />

n∑<br />

i=1<br />

(<br />

(ε 2 i − u ′ x i / √ )<br />

n) 2 − ε 2 i =<br />

n∑<br />

( 1<br />

n u′ x i x ′ iu − 2ε i u ′ x i / √ n)<br />

i=1<br />

= 1 ( n<br />

)<br />

∑<br />

n∑<br />

n u′ x i x ′ i u − −2ε i u ′ x i / √ n<br />

i=1<br />

i=1<br />

Set ξ i = ε i x i / √ n, under <strong>the</strong> assumption 3.0.0.4, it holds that<br />

n∑<br />

i=1<br />

(<br />

)<br />

E ‖ξ i ‖ 2 1 {‖ξi ‖>ε} =<br />

=<br />

≤<br />

n∑<br />

i=1<br />

n∑<br />

i=1<br />

n∑<br />

i=1<br />

(<br />

)<br />

E ‖ξ i ‖ 2 1{‖ξ i ‖ > ε}<br />

1 (<br />

n ‖x i‖ 2 E |ε i | 2 1{|x i ‖|ε i |/ √ )<br />

n > ε}<br />

1<br />

n ‖x i‖ 2 E<br />

= tr(C n ) E<br />

(<br />

|ε i | 2 1{|ε i | max ‖x i‖/ √ )<br />

n > ε}<br />

1≤i≤n<br />

(<br />

|ε 1 | 2 1{|ε 1 | max<br />

1≤i≤n ‖x i||/ √ n > ε}<br />

This converges to zero by Lebesgue’s dominated convergence <strong>the</strong>orem since tr(C n ) = p.<br />

The Lindeberg-Feller <strong>the</strong>orem 3.2.0.14 now implies that<br />

n∑<br />

i=1<br />

(<br />

(ε 2 i − u ′ x i / √ )<br />

n) 2 − ε 2 i −2u ′ Wu + u ′ Cu<br />

with W ∼ N (0, C) for every u ∈ R p .<br />

On <strong>the</strong> o<strong>the</strong>r hand we have<br />

p∑ ( |βj + u j / √ n| − |β j | ) p∑<br />

→ λ 0 (sgn(β j )I(β j ≠ 0) + |u j |I(β j = 0)) .<br />

λ n<br />

j=1<br />

j=1<br />

Now weak convergence <strong>of</strong> <strong>the</strong> marginals <strong>of</strong> V n to <strong>the</strong> marginals <strong>of</strong> V , both viewed as random<br />

maps into R p follows directly from <strong>the</strong> Cramer-Wold device. Since C is nonsingular <strong>the</strong><br />

process V is strictly convex, hence takes values in C min (R p ). In view <strong>of</strong> <strong>the</strong> Argmin<br />

continuous mapping <strong>the</strong>orem 2.3.0.11, this completes <strong>the</strong> pro<strong>of</strong>.<br />

)<br />

.<br />

3.2.1 Limiting <strong>distribution</strong> <strong>of</strong> components<br />

In this section, we fur<strong>the</strong>r investigate <strong>the</strong> asymptotic behavior in <strong>distribution</strong> <strong>of</strong> single<br />

components <strong>the</strong> root √ n( ˆβ n − β), our goal being <strong>the</strong> construction <strong>of</strong> confidence intervals<br />

and hypo<strong>the</strong>sis tests for individual coefficients.<br />

In <strong>the</strong> next proposition, we show that, in <strong>the</strong> limit, <strong>the</strong> subvector <strong>of</strong> <strong>the</strong> root corresponding<br />

to zero parameters can take value zero with positive probability, this probability is<br />

quantified. Also, we show that <strong>the</strong> limiting <strong>distribution</strong> <strong>of</strong> each component has at most a<br />

discontinuity at <strong>the</strong> point zero, that it is o<strong>the</strong>rwise Gaussian.<br />

For notational convenience, in <strong>the</strong> remaining, we assume without loss <strong>of</strong> generality that<br />

β 1 , . . . , β r are nonzero and that β r+1 = · · · = β p = 0, this yields<br />

⎛<br />

⎞<br />

r∑<br />

p∑<br />

V (u) = −2u ′ W + 2u ′ Cu + λ 0 ⎝ sgn(β j )u j + |u j | ⎠ .<br />

j=1<br />

j=r+1


3.2 Limit in <strong>distribution</strong> 25<br />

We also adopt <strong>the</strong> following partitioning <strong>of</strong> C, W, and u:<br />

( )<br />

C11 C<br />

C =<br />

12<br />

,<br />

C 21 C 22<br />

where C 11 is a r × r matrix, C 22 is a (p − r) × (p − r) matrix and C 12 = C ′ 21 ;<br />

W =<br />

( )<br />

W1<br />

;<br />

W 2<br />

u =<br />

(<br />

u1<br />

)<br />

,<br />

u 2<br />

where W 1 and u 1 are r-vectors. Finally, denote û = arg min u V (u).<br />

Proposition 3.2.1.1. If λ n / √ n → λ 0 ≥ 0 and C is nonsingular <strong>the</strong>n<br />

(i)<br />

P (û 2 = 0) = P<br />

(<br />

− λ 0<br />

2 1 ≤ C 21C −1<br />

11 (W 1 − λ o sgn(β))/2 − W 2 ≤ λ 0<br />

2 1 )<br />

where W 1 ∼ N (0, σ 2 C 11 ), W 2 ∼ N (0, σ 2 C 22 ) and <strong>the</strong> inequality is to be interpreted<br />

componentwise.<br />

(ii) û 1 has a Gaussian <strong>distribution</strong> and <strong>the</strong> <strong>distribution</strong> <strong>of</strong> every component <strong>of</strong> û 2 is <strong>the</strong><br />

mixture <strong>of</strong> a Gaussian <strong>distribution</strong> with a point mass at zero.<br />

Pro<strong>of</strong>. The subdifferential ∂V (û) at û contains 0, i.e.<br />

− 2<br />

= 0<br />

( ) ( ) ( ) ⎛<br />

W1 C11 C<br />

+ 2<br />

12 û1<br />

+ λ<br />

W 2 C 21 C 22 û 0 ⎝<br />

2<br />

sgn(β)<br />

(sgn(û j 2 )I(ûj 2 ) ≠ 0 + e jI(û j 2 ) ) p−r<br />

for some values e j ∈ [0, 1], j = 1, . . . , p − r. In particular, <strong>the</strong> following relations between<br />

C, W, û 1 and û 2 always hold :<br />

û 1 = C −1<br />

11 (W 1 − C 12 û 2 − λ 0 sgn(β)/2)<br />

j=1<br />

⎞<br />

⎠<br />

Toge<strong>the</strong>r, <strong>the</strong>y imply<br />

− λ 0<br />

2 1 ≤ −W 2 + C 21 û 1 + C 22 û 2 ≤ λ 0<br />

2 1.<br />

− λ (<br />

)<br />

0<br />

2 1 ≤ −W 2 + C 21 C −1<br />

11 (W 1 − λ 0 sgn(β)) + C 22 − C 21 C −1<br />

11 C 12 û 2 ≤ λ 0<br />

2 1 (3.2.1.1)<br />

Next, suppose that û 2 = 0. Then it follows readily from 3.2.1.1 that<br />

− λ 0<br />

2 1 ≤ −W 2 + C 21 C −1<br />

11 (W 1 − λ 0 sgn(β)) ≤ λ 0<br />

2 1 (3.2.1.2)


26 Application to <strong>the</strong> <strong>Lasso</strong> estimator<br />

Conversely, suppose that 3.2.1.2 holds, we show that <strong>the</strong> minimizer necessarily satisfies<br />

û 2 = 0. Indeed, if û 2 ≠ 0 <strong>the</strong>n <strong>the</strong> equality is attained in 3.2.1.2 at every line corresponding<br />

to nonzero components <strong>of</strong> û 2 , that is,<br />

⎧<br />

(<br />

(<br />

) ) ⎪⎨<br />

−W 2 + C 21 C −1<br />

11 (W 1 − λ 0 sgn(β)) + C 22 − C 21 C −1<br />

11 C 12 û 2 = − λ 0<br />

2 , if ûj 2 > 0<br />

j ⎪ ⎩<br />

λ 0<br />

2 , if ûj 2 < 0<br />

In particular, it follows from 3.2.1.2 that<br />

{<br />

((<br />

) )<br />

C 22 − C 21 C −1<br />

11 C 12 û 2 ∈ [−λ0 , 0], if û j 2 > 0<br />

j [0, λ 0 ] if û j 2 < 0 (3.2.1.3)<br />

Note that (C 22 − C 21 C −1<br />

11 C 12), <strong>the</strong> Schur complement <strong>of</strong> C, is SPD since C is. Let<br />

D be <strong>the</strong> matrix which results from (C 22 − C 21 C −1<br />

11 C 12) after removal <strong>of</strong> <strong>the</strong> lines and<br />

columns corresponding to <strong>the</strong> zero components <strong>of</strong> û 2 , D is also SPD. Let û≠0 2 be <strong>the</strong> vector<br />

which results from û 2 after removal <strong>of</strong> <strong>the</strong> zero components. Then 3.2.1.3 implies that<br />

û≠0 T<br />

2 Dû≠0 2 ≤ 0 which contradicts <strong>the</strong> fact that D is SPD, this complete <strong>the</strong> argument.<br />

Finally, we show that every component <strong>of</strong> <strong>the</strong> limit <strong>distribution</strong> has a discontinuity at <strong>the</strong><br />

point zero only, and is o<strong>the</strong>rwise Gaussian. Consider after an eventual permutation <strong>of</strong> <strong>the</strong><br />

last p − r covariates a fur<strong>the</strong>r partitioning <strong>of</strong> W 2 , u 2 and C <strong>of</strong> <strong>the</strong> form<br />

( )<br />

W2,1<br />

W 2 =<br />

W 2,0<br />

u 2 =<br />

(<br />

u2,1<br />

u 2,0<br />

)<br />

,<br />

and<br />

C =<br />

⎛<br />

⎜<br />

⎝<br />

⎞<br />

C 11 C 12 C 13<br />

⎟<br />

C 21 C 22 C 23 ⎠<br />

C 31 C 32 C 33<br />

respectively, where W 2,1 and u 2,1 are r ′ -vectors for some 0 < r ′ < p − r. For notational<br />

convenience, denote<br />

( )<br />

C11 C ˜C = 12<br />

.<br />

C 21 C 22<br />

Next, consider an arbitrary sign pattern s ′ ∈ {−1, +1} r′ . Then, given <strong>the</strong> event<br />

{ sgn(u2,1 ) = s ′ ; u 2,0 = 0 } ,<br />

(u 1 , u 2,1 ) ′ has <strong>the</strong> following <strong>distribution</strong> :<br />

( ) (<br />

( ))<br />

u1<br />

∼ N σ 2 λ 0 ˜C, −<br />

u 2,1 2 ˜C −1 sgn(β)<br />

s ′<br />

This completes <strong>the</strong> pro<strong>of</strong>.


3.2 Limit in <strong>distribution</strong> 27<br />

3.2.2 Uniform convergence in <strong>the</strong> orthogonal case<br />

More can be said about <strong>the</strong> convergence in <strong>distribution</strong> <strong>of</strong> single components <strong>of</strong> <strong>the</strong> <strong>Lasso</strong><br />

estimator when <strong>the</strong> matrix X n is orthogonal. We show that in this case, under <strong>the</strong> assumptions<br />

<strong>of</strong> Theorem 3.2.0.15, <strong>the</strong> finite sample <strong>distribution</strong> <strong>of</strong> each component converges<br />

uniformly ei<strong>the</strong>r to a normal <strong>distribution</strong> or to <strong>the</strong> mixture <strong>of</strong> a normal ditribution with a<br />

point mass at zero, depending <strong>the</strong>reon, whe<strong>the</strong>r <strong>the</strong> corresponding parameter is equal to<br />

zero or not.<br />

Assumption A. X n is an othogonal matrix for every n ∈ N. In particular C n = 1 n X′ nX n<br />

is diagonal.<br />

The following generalization <strong>of</strong> Polya’s <strong>the</strong>orem will be needed, it allows for discontinuities<br />

in <strong>the</strong> limiting <strong>distribution</strong>, provided that jump heights in <strong>the</strong> finite sample <strong>distribution</strong><br />

converge precisely to <strong>the</strong> jump height in <strong>the</strong> limit,<br />

Theorem 3.2.2.1. (Chow and Teicher, 1978, Chapter 8, Lemma 3)<br />

Let {F n } n , F be probability <strong>distribution</strong> functions. If F n F and ∆F n (x 0 ) → ∆F (x 0 ) at<br />

every discontinuity point x 0 <strong>of</strong> F , <strong>the</strong>n F n onverges uniformly to F , that is<br />

sup |F n (x) − F (x)| → 0<br />

x∈R<br />

In <strong>the</strong> following, let J n,j (·) denote <strong>the</strong> probability <strong>distribution</strong> function <strong>of</strong> <strong>the</strong> j-th component<br />

<strong>of</strong><br />

√ n( ˆβ n − β)<br />

and let J j (·) denote <strong>the</strong> probability <strong>distribution</strong> function <strong>of</strong> <strong>the</strong> j-th component <strong>of</strong><br />

V (·) as in <strong>the</strong>orem 3.2.0.15.<br />

arg min V (u),<br />

u∈R p<br />

Theorem 3.2.2.2. Assume that assumption A holds. Then<br />

for every j ∈ {1, . . . , p}.<br />

sup |J n,j (x) − J j (x)| → 0<br />

x∈R<br />

Pro<strong>of</strong>. Proposition 3.2.1.1 establishes <strong>the</strong> continuity <strong>of</strong> J j (·) for every j with β j ≠ 0, it<br />

also states that for j with β j ≠ 0, <strong>the</strong>re is a discontinuity at <strong>the</strong> point zero only. Hence,<br />

in view <strong>of</strong> Polya’s <strong>the</strong>orem 3.2.2.1, it is sufficient to show that for such j<br />

holds.<br />

Define<br />

∆J n,j (0) → ∆J j (0) (3.2.2.1)<br />

û n = √ n( ˆβ − β) = arg min V n (û).<br />

u∈R p


28 Application to <strong>the</strong> <strong>Lasso</strong> estimator<br />

From subdifferential calculus, it follows that <strong>the</strong> subdifferential ∂V n (û n ) <strong>of</strong> V n at û n contains<br />

<strong>the</strong> point zero, that is,<br />

− 2<br />

n∑ (<br />

εi − û ′ nx i / √ n ) x i / √ n<br />

i=1<br />

(<br />

+ λ n sgn(βj + û j / √ n)1/ √ n1(β j + û j / √ n ≠ 0) + e j / √ n1(β j + û j / √ n = 0) ) n<br />

n∑<br />

= −2 ε i x i / √ ( )<br />

1<br />

n∑<br />

n + 2 x i x ′ i û n<br />

n<br />

i=1<br />

i=1<br />

(<br />

+ λ n sgn(βj + û j / √ n)1/ √ n1(β j + û j / √ n ≠ 0) + e j / √ n1(β j + û j / √ n = 0) ) n<br />

n∑<br />

= −2 ε i x i / √ n + 2 C n û n<br />

i=1<br />

+ λ n<br />

( sgn(βj + û j / √ n)1/ √ n1(β j + û j / √ n ≠ 0) + e j / √ n1(β j + û j / √ n = 0) ) n<br />

j=1<br />

= 0<br />

for some values e j ∈ [−1, 1], j = 1, . . . , p. Assumption A allows for decoupling <strong>of</strong> <strong>the</strong><br />

components <strong>of</strong> û n , so writing w n,j = β j + û j / √ n we obtain :<br />

(<br />

û j n = (C n ) −1 ∑ n<br />

jj ε i x ij / √ n − λ n<br />

2 √ n (sgn(w n,j)/ √ n1(w n,j ≠ 0) + e j / √ )<br />

n1(w n,j = 0)) .<br />

i=1<br />

In particular,<br />

where<br />

− λ n<br />

2 √ n ≤ (C n) −1<br />

jj û j n −<br />

n∑<br />

ε i x ij / √ n ≤<br />

i=1<br />

n∑<br />

ε i x ij / √ n N (0, σ 2 C jj ).<br />

i=1<br />

λ n<br />

2 √ n<br />

One easily verifies that for j ∈ {r + 1, . . . , p}, û j n = 0 if and only if<br />

− λ n<br />

n<br />

2 √ n ≤ ∑<br />

ε i x ij / √ n ≤<br />

i=1<br />

λ n<br />

2 √ n<br />

j=1<br />

j=1<br />

(3.2.2.2)<br />

since an equality is attained in <strong>the</strong> inequality 3.2.2.2 if û j n is non-zero and (C n ) jj is<br />

positive. Since <strong>the</strong> <strong>distribution</strong> <strong>of</strong> <strong>the</strong> errors is assumed continuous, <strong>the</strong> Portmanteau<br />

<strong>the</strong>orem implies that<br />

∆J n,j (0) → ∆J j (0)<br />

for j ∈ {r + 1, . . . , p}. This completes <strong>the</strong> pro<strong>of</strong>.<br />

Remark. This result on uniform convergence toward <strong>the</strong> limit will be key to justify <strong>the</strong> use<br />

<strong>of</strong> subsampling to construct asymptotically valid confidence interval. Altough we proved<br />

it under <strong>the</strong> orthogonality assumption A only, we conjecture that it holds for a larger<br />

class <strong>of</strong> symmetric positive definite matrices, if not for all such matrices. However, <strong>the</strong><br />

argument seems to be tedious in greater generality. This fact can be illustrated by mean<br />

<strong>of</strong> Monte-Carlo simulations as shown in Figure 3.1 for a linear model with parameter β =<br />

(1.5, −1.5, 0.75, −1.5, 1.5, −3, 0) ′ ∈ R 20 , error ε i = N (0, √ 20) and covariates multivariate<br />

normal with mean zero and Toeplitz covariance matrix with parameter ρ = 0.99.


3.2 Limit in <strong>distribution</strong> 29<br />

Figure 3.1: Monte Carlo <strong>estimates</strong> <strong>of</strong> <strong>the</strong> <strong>distribution</strong> <strong>of</strong> <strong>the</strong> root √ n( ˆβ j −β j ), j = 7, . . . , 10,<br />

with penalization parameter λ n = 2 √ n.


30 Application to <strong>the</strong> <strong>Lasso</strong> estimator


Chapter 4<br />

The adaptive <strong>Lasso</strong> in a high<br />

dimensional setting<br />

In a high dimensional setting, where <strong>the</strong> use <strong>of</strong> <strong>the</strong> <strong>Lasso</strong> is most justified, due to its<br />

sparsity inducing property, <strong>the</strong>re are to this date no asymptotic results similar to those <strong>of</strong><br />

Knight and Fu (2000). Hence we turn to a variant, <strong>the</strong> adaptive <strong>Lasso</strong> and presents <strong>the</strong><br />

results <strong>of</strong> Huang et al. (2008) who studied it under saprsity asssumptions and a fur<strong>the</strong>r a<br />

partial orthogonality assumption between relevant and noise covariates. Their approach<br />

<strong>of</strong>fers <strong>the</strong> advantage to provide an asymptotic normality result for <strong>estimates</strong> corresponding<br />

to nonzero-coefficients.<br />

One typically resorts to a triangular array to model high dimensionality in linear models,<br />

that is, one assumes that<br />

Y i = x ′ iβ n0 + ε i , i = 1, . . . , n (4.0.2.1)<br />

where <strong>the</strong> parameter β n0 has a dimension p n allowed to grow faster than n. For observations<br />

(Y i , x i ), i = 1, . . . , n drawn from 4.0.2.1, <strong>the</strong> adaptive <strong>Lasso</strong> is defined as<br />

arg min<br />

φ∈R p<br />

L n ( φ) = arg min<br />

φ∈R p<br />

n ∑<br />

i=1<br />

(Y i − x ′ iφ) 2 ∑p n<br />

+ λ n w nj |φ j | (4.0.2.2)<br />

where λ n is <strong>the</strong> usual penalty parameter and are {w nj } j are in general strictly positive<br />

weights. However, here we make <strong>the</strong> assumption that <strong>the</strong>y are computed using an initial<br />

estimator ˜β nj <strong>of</strong> β nj , that is :<br />

j=1<br />

w nj = | ˜β nj | −1 , (4.0.2.3)<br />

for j = 1, · · · , n. For notational convenience, in <strong>the</strong> remaining, we drop <strong>the</strong> subscript n<br />

for β n0 , yet dependence on n is implicitely assumed.<br />

Next, we introduce fur<strong>the</strong>r notation.<br />

First, assume without loss <strong>of</strong> generality that β 0 takes <strong>the</strong> form<br />

β 0 = (β ′ 1, β ′ 2) ′<br />

31


32 The adaptive <strong>Lasso</strong> in a high dimensional setting<br />

where β 1 is a k n × 1 vector, has each <strong>of</strong> its components different from zero and β 2 is a<br />

m n × 1 vector equal to zero. Let x i = (x i1 , . . . , x ipn ) ′ be <strong>the</strong> covariate vector <strong>of</strong> <strong>the</strong> i-th<br />

observation, denote by u i its first part, corresponding to non-zero coefficients, and by z i<br />

<strong>the</strong> part corresponding to zero coefficients, i.e.<br />

The empirical covariance matrix<br />

x i = (u ′ i, z ′ i) ′ .<br />

C n = 1 n X′ nX n<br />

is also partitioned according to zero and non-zero coefficients. For<br />

X ′ n1 = (u 1 , · · · , u kn ) ′ ,<br />

set<br />

C n1 = 1 n X′ n1X n1 ,<br />

a k n ×k n matrix. Finally let ρ n1 and τ n1 be <strong>the</strong> smallest respectively <strong>the</strong> largest eigenvalue<br />

<strong>of</strong> τ n1 <strong>of</strong> C n1 .<br />

Results on <strong>the</strong> variable selection consistency and <strong>the</strong> partial normality <strong>of</strong> <strong>the</strong> adaptive<br />

<strong>Lasso</strong> will be derived under <strong>the</strong> following assumptions:<br />

Assumption B.<br />

B.1 {ε i } i is a sequence <strong>of</strong> i.i.d. random variables with mean zero andc <strong>the</strong>re are some<br />

constants 1 ≤ d ≤ 2, C > 0 and K > 0, such that <strong>the</strong> tail probability satisfies<br />

for every x ≥ 0.<br />

P (|ε i | > x) ≤ K exp<br />

(−Cx d)<br />

B.2 The initial estimators ˜β nj are r n -consistent for <strong>the</strong> estimation <strong>of</strong> unknown constants<br />

η nj depending on β, that is<br />

r n<br />

The constants η nj satisfy<br />

∣ ∣ ∣∣ ∣∣<br />

max ˜βnj − η nj = OP (1), r n → ∞.<br />

1≤j≤p n<br />

max<br />

k n


33<br />

B.3 (Adaptive irrepresentable condition) For s n1 defined as<br />

s n1 =<br />

(<br />

) ′<br />

|η n1 | −1 sgn(β 01 ), . . . , |η nkn | −1 sgn(β 0kn )<br />

and some constant κ < 1, it holds that<br />

∣<br />

∣x ′ j X n1C −1<br />

n<br />

n1 s n1∣<br />

≤<br />

κ<br />

|η nj | ,<br />

for every j ∈ {k n + 1, . . . , p n }.<br />

B.4 The constants k n , m n , λ n , M n1 , M n2 and b n1 satisfy<br />

(<br />

log(n) 1{d=1} n −1/2 (log(k n)) 1/d<br />

(<br />

+ n 1/2 λ −1<br />

n (log(m n )) 1/d M n2 + 1 ) ) + M n1λ n<br />

b n1<br />

r n b n1 n → 0<br />

B.5 There is a constant τ 1 > 0 such that τ n1 ≥ τ 1 for all n.<br />

B.6<br />

n −1/2 max<br />

1≤i≤n u′ iu i → 0<br />

Next, we introduce <strong>the</strong> Orlicz norm and some <strong>of</strong> its properties which be used later to<br />

bound tail probabilities.<br />

Definition 4.0.2.3. (Orlicz norm) Let ψ d = exp(x d )−1 for d ≥ 1. The ψ d -Orlicz norm<br />

||X|| ψd <strong>of</strong> a random variable X is defined as<br />

{<br />

( ( ))<br />

∣ |X|<br />

||X|| ψd = inf C > 0∣E<br />

ψ d<br />

C<br />

}<br />

≤ 1<br />

Lemma 4.0.2.4. (Van der Vaart and Wellner, 1996, Lemma 2.2.1) Let X be a random<br />

variable with P (|X| > x) ≤ K exp(−Cx d ) for every x, for constants K and C and for<br />

d ≥ 1. Then its Orlicz norm sasitfies<br />

||X|| ψd ≤ ((1 + K)/C) 1/d<br />

In <strong>the</strong> next Lemma, let ‖X‖ P,d denote <strong>the</strong> d-moment <strong>of</strong> a random variable X and S n <strong>the</strong><br />

(partial) sum <strong>of</strong> <strong>the</strong> first n random variables in a sequence.<br />

Lemma 4.0.2.5. (Van der Vaart and Wellner, 1996, Proposition A.1.6) Let X 1 , . . . , X n<br />

be independent, mean zero random variables indexed by an arbitrary index set T . Then<br />

(i)<br />

(ii)<br />

‖S n ‖ P,d ≤ K<br />

d<br />

[<br />

∥ ]<br />

‖S n ‖ P,1 +<br />

log(d)<br />

∥ max ∥ ∥ ∥∥∥P,d<br />

∥X i , (d > 1).<br />

1≤i≤n<br />

∥ ]<br />

‖S n ‖ ψd ≤ K p<br />

[‖S n ‖ P,1 +<br />

∥ max ∥ ∥ ∥∥∥ψd<br />

∥X i , (0 < d ≤ 1).<br />

1≤i≤n


34 The adaptive <strong>Lasso</strong> in a high dimensional setting<br />

(iii)<br />

‖S n ‖ ψ,d ≤ K d<br />

⎡<br />

⎣‖S n ‖ P,1 +<br />

( n ∑<br />

i=1<br />

∥ ∥ ) ⎤ 1/d ′<br />

∥∥ ∥X i d ′<br />

⎦ , (1 < d ≤ 2).<br />

ψ d<br />

Here, 1/d + 1/d ′ = 1, K is a universal constant and K d is a constant depending on d only.<br />

Lemma 4.0.2.6. (Van der Vaart and Wellner, 1996, Lemma 2.2.2) Let ψ be a convex,<br />

nondecreasing, nonzero function with ψ(0) = 0 and<br />

lim sup ψ(x)ψ(y)/ψ(cxy) < ∞ (4.0.2.4)<br />

x,y→∞<br />

for some constant c. Then for arbitrary random variables X 1 , . . . , X m , it holds that<br />

for a constant K depending on ψ only.<br />

∥ max X i<br />

1≤i≤m ∥ ≤ Kψ −1 (m) max ‖X i‖<br />

ψ 1≤i≤m<br />

Lemma 4.0.2.7. Let {ε i } i be a sequence <strong>of</strong> i.i.d. random variables with mean zero and<br />

finite variance. Suppose that <strong>the</strong>ir tail probability satisfies<br />

P (|ε i > x|) ≤ K exp(−Cx d ),<br />

i ∈ N<br />

for constants C and K, and for 1 ≤ d ≤ 2. Then for all constants a i with ∑ n<br />

i=1 a 2 i<br />

holds that<br />

∥ n∑ ∥∥∥∥ψd a<br />

∥ i ε i ≤<br />

i=1<br />

{ (<br />

Kd σ + (1 + K) 1/d C −1/d) , 1 < d ≤ 2,<br />

K 1 (σ + (1 + K)C log(n)) , d = 1<br />

= 1, it<br />

where K d is a constant depending on d only. Consequently,<br />

q ∗ n(t) =<br />

sup P<br />

a 2 1 +···+a2 n =1<br />

(<br />

∑ n<br />

)<br />

⎧<br />

⎪⎨<br />

a i ε i > t ≤<br />

i=1<br />

⎪⎩<br />

( )<br />

exp − td<br />

, 1 < d ≤ 2<br />

M<br />

(<br />

)<br />

t<br />

exp −<br />

, d = 1,<br />

M(1 + log(n))<br />

for some constant M depending on d, K and C only.<br />

Pro<strong>of</strong>. {ε i } i satisfies P (|ε i | > x) ≤ K exp(−Cx d ), so it follows from 4.0.2.4 that<br />

‖ε i ‖ ψd ≤ [(1 + K)/C] 1/d .<br />

Let d ′ be <strong>the</strong> conjugate <strong>of</strong> d, that is 1/d + 1/d ′ = 1. Setting X i = a i ε i , by 4.0.2.7, <strong>the</strong>re


35<br />

exists a constant K d depending on d only such that<br />

∥ ⎡ (∣ ∥∥∥∥ψd ∣∣∣∣ ∑<br />

a i ε i ≤ K d<br />

⎣E<br />

n [<br />

∑ n<br />

n∑<br />

∥<br />

i=1<br />

= K d<br />

⎡<br />

⎣E<br />

≤ K d<br />

⎡<br />

⎢<br />

≤ K d<br />

⎡<br />

⎣E<br />

∣) ∣∣∣∣<br />

a i ε i +<br />

i=1<br />

(∣( ∣∣∣∣ ∑ n<br />

)<br />

a i ε i · 1<br />

∣<br />

i=1<br />

⎛( ∑ n<br />

⎣E ⎝<br />

(( n ∑<br />

) ⎞ 2<br />

a i ε i ⎠<br />

i=1<br />

a 2 i ε 2 i<br />

i=1<br />

] ⎤ 1/d ′<br />

‖a i ε i ‖ d′ ⎦<br />

ψ d<br />

i=1<br />

) [<br />

∣ ∑ n<br />

1/2<br />

+<br />

] ⎤ 1/d ′<br />

|a i | d′ ‖ε i ‖ d′ ⎦<br />

ψ d<br />

i=1<br />

[ n<br />

] ⎤<br />

+ (1 + K) 1/d C −1/d ∑ 1/d ′<br />

|a i | d′ ⎥ ⎦<br />

i=1<br />

)) 1/2<br />

+ (1 + K) 1/d C −1/d [ n ∑<br />

⎡<br />

[ n<br />

] ⎤<br />

=≤ K d<br />

⎣σ + (1 + K) 1/d C −1/d ∑ 1/d ′<br />

|a i | d′ ⎦<br />

i=1<br />

] ⎤ 1/d ′<br />

|a i | d′ ⎦<br />

i=1<br />

Here, Hölder’s inequality and Lemma 4.0.2.4 have been used in <strong>the</strong> second inequality. For<br />

1 < d ≤ 2, d ′ = d/(d − 1) ≥ 2, which implies<br />

It follows that<br />

(<br />

n∑<br />

n<br />

)<br />

∑ d ′ /2<br />

|a i | d′ ≤ |a i | 2 = 1<br />

i=1<br />

i=1<br />

∥ n∑ ∥∥∥∥ψd (<br />

a<br />

∥ i ε i ≤ K d σ + (1 + K) 1/d C −1/d)<br />

i=1<br />

For d = 1, by Lemma 4.0.2.7, <strong>the</strong>re exists some constant K 1 such that<br />

∥ (∣ n∑ ∥∥∥∥ψ1 ∣∣∣∣ a<br />

∥ i ε i ≤ K 1<br />

[E<br />

n ∣) ]<br />

∑ ∣∣∣∣<br />

a i ε i + ‖ max |a iε i |‖ ψ1<br />

1≤i≤n<br />

i=1<br />

i=1<br />

]<br />

≤ K 1<br />

[σ + K ′ log(n) max ‖a iε i ‖ ψ1<br />

1≤i≤n<br />

]<br />

≤ K 1<br />

[σ + K ′ (1 + K)C −1 log(n) max |a i|<br />

1≤i≤n<br />

]<br />

≤ K 1<br />

[σ + K ′ (1 + K)C −1 log(n)<br />

where Hölder’s inequality and Lemma 4.0.2.6 have been used in <strong>the</strong> second inequality.<br />

Finally, note that an arbitrary random variable X, <strong>the</strong> following holds<br />

P (X > t‖X‖ ψd ) ≤ (1 + ψ d (t)) −1 (1 + E (ψ d (|X|/‖X‖ ψd )))<br />

≤ 2 exp(−t d )<br />

for all t > 0, by Markov’s inequality and by definition <strong>of</strong> <strong>the</strong> ψ d -Orlicz norm.<br />

(<br />

Lemma 4.0.2.8. Let ˜s n1 = | ˜β<br />

′<br />

nj | −1 sgn(β 0j ))<br />

and s n1 = ( | η˜<br />

nj | −1 sgn(β 0j ) ) ′<br />

j∈J j∈J n1<br />

.<br />

n1<br />

Suppose that assumption B.2 holds. Then,<br />

‖˜s n1 ‖ = (1 + o P (1)) M n1


36 The adaptive <strong>Lasso</strong> in a high dimensional setting<br />

and<br />

∥ ∥ ∥∥| ∥∥<br />

max ˜βnj |˜s n1 − |η nj |s n1 = oP (1)<br />

j /∈J n1<br />

Pro<strong>of</strong>. By assumption B.2, we have<br />

∣∣ ∣ ∣∣∣∣ ∣∣∣∣ ˜βnj ∣∣∣∣ max − 1∣<br />

1≤j≤k n η nj<br />

∣ =<br />

∣ ∣ ∣∣∣∣ max 1 ∣∣∣∣ ∣ ∣∣| ˜βnj | − |η nj | ∣ ≤ O P (1/r n )M n1 = o P (1)<br />

1≤j≤k n η nj<br />

∣∣ ∣ ∣∣∣∣ ∣∣∣∣ η ∣∣∣∣ nj<br />

We also have max 1≤j≤kn − 1<br />

˜β nj<br />

∣ = o P (1). Indeed, note that<br />

1<br />

| ˜β nj | = 1<br />

| ˜β nj | − |η nj | + |η nj | ≤ 1<br />

|η nj |<br />

1<br />

(<br />

∣<br />

∣∣<br />

∣1 + ˜βnj /η nj − 1)∣<br />

≤ M n1<br />

1<br />

|1 + o P (1)| = M n1O P (1) = o P (r n ),<br />

for every 1 ≤ j ≤ k n , it follows that<br />

∣ ∣∣∣∣ η nj<br />

max − 1<br />

1≤j≤k n ˜β nj<br />

∣ = max 1/| ˜β ∣ nj | ∣|η nj | − | ˜β nj | ∣<br />

1≤j≤k n<br />

( ∣ ∣∣|ηnj<br />

≤ o P (r n ) max | − | ˜β )<br />

nj | ∣<br />

1≤j≤k n<br />

≤ o P (r n )o P (1/r n ) = o P (1)<br />

Now we can prove <strong>the</strong> first part <strong>of</strong> <strong>the</strong> claim,<br />

‖˜s n1 ‖ = √<br />

=<br />

≤<br />

k n ∑<br />

1<br />

j=1<br />

˜β nj<br />

2 ∑k n<br />

√<br />

j=1<br />

∑k n<br />

√<br />

j=1<br />

( (<br />

1<br />

1 + |η )) 2<br />

nj|<br />

|η nj | |β nj | − 1<br />

( ) 2<br />

1<br />

|η nj | (1 + o P (1))<br />

≤ (1 + o P (1)) √<br />

k n ∑<br />

j=1<br />

≤ (1 + o P (1)) M n1<br />

1<br />

|η nj | 2


4.1 Variable selection consistency 37<br />

For <strong>the</strong> second part <strong>of</strong> <strong>the</strong> claim, first note that<br />

max ‖|η nj |˜s n1 − |η nj |s n1 ‖ 2 ∑k n ∣<br />

= max ∣|η nj |/| ˜β ni | − |η nj |/|η ni | ∣ 2 ∑k n ≤ M 2 |η nj | − | ˜β ni |<br />

n2<br />

j>k n j>k n<br />

∣<br />

i=1<br />

i=1<br />

| ˜β nj ||η ni | ∣<br />

(<br />

) 2<br />

≤ max || η nj | − | ˜β<br />

k n 2<br />

∑<br />

nj ||<br />

M n2<br />

1≤j≤k n ∣|η ni |<br />

(1 2 + | ˜β<br />

)<br />

ni |/|η ni | − 1 ∣<br />

(<br />

=<br />

max || η nj | − | ˜β nj ||<br />

1≤j≤k n<br />

i=1<br />

) 2 ∑k n ∣<br />

i=1<br />

= o P (1/r 2 n) O P (1) 2 M 2 n1 = o P (1).<br />

M n2<br />

|η ni | 2 (1 + o P (1)) ∣<br />

Next, we have<br />

∣ ∣∣| max ˜βnj | − |η nj | ∣ ‖˜s n1 ‖ = o P (1/r n )(1 + o P (1))M n1 = o P (1)<br />

j>k n<br />

The second part <strong>of</strong> <strong>the</strong> claim now follows from <strong>the</strong> triangle inequality.<br />

2<br />

2<br />

4.1 Variable selection consistency<br />

Theorem 4.1.0.9. (Huang et al., 2008, Theorem 1)(Variable selection consistency<br />

<strong>of</strong> <strong>the</strong> adaptive <strong>Lasso</strong>) Suppose that assumptions B.1 to B.5 are valid, <strong>the</strong>n<br />

( )<br />

P ˆβn = s β 0 → 1.<br />

Pro<strong>of</strong>. From subdifferential calculus, we know that a vector ˆβ n ∈ R pn is a solution to <strong>the</strong><br />

adaptive <strong>Lasso</strong> problem if <strong>the</strong> <strong>the</strong> subdifferential ∂L n ( ˆβ n ) <strong>of</strong> <strong>the</strong> objective function at ˆβ n<br />

contains <strong>the</strong> point zero, that is<br />

{ x<br />

′ j (y − X ˆβ n ) = λ n w nj sgn( ˆβ nj ), if ˆβ nj ≠ 0<br />

∣<br />

∣x ′ j (y − X) ˆβ<br />

∣ ∣∣<br />

n < λn w nj , if ˆβ (4.1.0.5)<br />

nj = 0<br />

Fur<strong>the</strong>rmore, if <strong>the</strong> vector family<br />

solution. Define<br />

˜s n1 =<br />

{<br />

x j , ˆβ<br />

}<br />

nj ≠ 0 is linearly independent, <strong>the</strong>n ˆβ n is unique<br />

(<br />

| ˜β n1 | −1 sgn(β 01 ), . . . , | ˜β<br />

) ′<br />

nkn | −1 sgn(β 0kn )<br />

and<br />

ˆβ n1 = ( X ′ n1X n1<br />

) −1 ( X<br />

′<br />

n1 y − λ n˜s n1<br />

)<br />

= β 01 + 1 (<br />

n C−1 n1 X<br />

′ )<br />

1 ε − λ n˜s n1<br />

If ˆβ n1 = s β 01 <strong>the</strong>n 4.1 holds for ˆβ<br />

(<br />

n = ˆβ′ n1, 0 ′) ′<br />

. For this particular ˆβn we have X ˆβ n =<br />

X 1 ˆβ n1 . The family {x j , 1 ≤ j ≤ k n } is <strong>the</strong>n linearly independent, hence X ′ n1 X n1 regular<br />

and we obtain that ˆβ n = s β 0 if<br />

{<br />

∣ ˆβn1 = s β 01<br />

∣x ′ j (y − X n1 ˆβ (4.1.0.6)<br />

n1 ) ∣ < λ n w nj , k n < j ≤ p n


38 The adaptive <strong>Lasso</strong> in a high dimensional setting<br />

Now, let H n = I n − 1 n X n1C −1<br />

n1 X′ n1 be <strong>the</strong> projection matrix on <strong>the</strong> nullspace <strong>of</strong> X′ 1 . By<br />

definition <strong>of</strong> ˆβ n1 , we have<br />

(<br />

y − X n1 ˆβn1 = ε + X 1 β 01 − ˆβ<br />

)<br />

n1<br />

= ε + 1 n X n1C −1 (<br />

n1 X<br />

′ )<br />

1 ε − λ n˜s n1<br />

= H n ε + λ n<br />

n X n1C −1<br />

n1 ˜s n1<br />

Following condition 4.1.0.6 we have that ˆβ n = s β 0 if<br />

⎧<br />

⎨ sgn (β 0j )<br />

(β 0j − ˆβ<br />

)<br />

nj < |β 0j | 1 ≤ j ≤ k n<br />

(<br />

)∣<br />

⎩ ∣<br />

∣x ′ j H n ε + λn n X n1C −1<br />

n1 ˜s ∣∣ (4.1.0.7)<br />

n1 < λn w nj , k n < j ≤ p n<br />

Next, denote by e j , <strong>the</strong> j-th standard unit vector and choose an arbitrary 0 < κ < κ + δ <<br />

1. One verifies that according to condition 4.1.0.7, ˆβn ≠ s β 0 implies <strong>the</strong> realization <strong>of</strong><br />

ei<strong>the</strong>r <strong>of</strong> <strong>the</strong> following events :<br />

B n1 =<br />

B n2 =<br />

B n3 =<br />

B n4 =<br />

k n ⋃<br />

j=1<br />

k n ⋃<br />

j=1<br />

⋃p n<br />

j=k n+1<br />

⋃p n<br />

j=k n+1<br />

{<br />

}<br />

n −1 |e ′ jC −1<br />

n1 X 1ε| ≥ |β 0j |/2<br />

{<br />

}<br />

|e j C −1<br />

n1 ˜s n1| ≥ |β 0j |/2<br />

{∣ ∣∣x ′<br />

j H n ε∣<br />

∣ ≥ (1 − κ − δ)λ n w nj<br />

}<br />

{<br />

n −1 ∣ ∣ ∣x<br />

′<br />

j X 1 C −1<br />

n1 ˜s ∣<br />

n1<br />

}<br />

∣ ≥ (κ + δ) .<br />

We show that <strong>the</strong>y each have probability tending to zero as n → ∞. For B n1 , first note<br />

that <strong>the</strong> matrix C n1 has root n 1/2 X n1 (X ′ n1 X n1) −1 . Thus we obtain<br />

∥ ∥∥∥ ( ) ∥<br />

n −1 e ′ jCn1 −1 ′ ∥∥∥ X′ 1 = n −1/2 ‖n −1/2 X n1 C −1<br />

n1 e j‖<br />

≤ n −1/2 ‖C −1/2<br />

n1 ‖ ‖e j ‖<br />

≤ (nτ n1 ) −1/2<br />

by <strong>the</strong> spectral decomposition <strong>the</strong>orem. It follow that<br />

⎛<br />

⋃k n {<br />

} ⎞<br />

P (B n1 ) = P ⎝ ‖e ′ jC −1<br />

n1 X′ 1ε‖/n ≥ |β 0j |/2 ⎠<br />

≤ k n q ∗ n<br />

1=1<br />

( √ ) bn1 τn1 n<br />

2<br />

with q n (·) given in Lemma 4.0.2.7. Now, assumptions B.1, B.4 and B.5 imply that<br />

P (B n1 ) → 0.<br />

For <strong>the</strong> set B n2 , note that Lemma 4.0.2.6, <strong>the</strong>n assumptions B.4 and B.5 imply that<br />

λ n ∣<br />

∣e j C −1<br />

n1<br />

n ˜s n1∣ ≤ λ ( )<br />

n‖˜s n1 ‖ λn M n1<br />

= O P = o P (b n1 ).<br />

nτ n1<br />

nτ n1


4.2 Partial asymptotic normality 39<br />

Finally assumption B.5 yields P (B n2 ) = o(1).<br />

For B n3 , first note that<br />

1<br />

w nj<br />

= | ˜β nj | ≤ M n2 + O P (1/r n ), j = k n + 1, . . . , p n<br />

Fur<strong>the</strong>rmore, ‖(x j H n ) ′ ‖ ≤ √ n, so it follows that<br />

⎛<br />

⋃p n {<br />

P (B n3 ) ≤ P ⎝ |x ′ jH n ε| ≥ (1 − κ − δ)λ n C −1 (M n2 + 1/r n ) −1}⎞ ⎠ + o P (1)<br />

≤ m n q ∗ n<br />

j=k n+1<br />

((1 − κ − δ)λ n n −1/2 C −1 (M n2 + 1/r n ) −1)<br />

for C large enough. Lemma 4.0.2.7 and assumption B.4 now imply that P (B n3 ) → 0.<br />

Finally, for <strong>the</strong> set B n4 , recall that ‖x j ‖/n = 1, so Lemma 4.0.2.6 and assumption B.5<br />

toge<strong>the</strong>r imply<br />

{∣ ∣∣x ′<br />

max j X 1 C −1<br />

k nk n<br />

≤ τ −1/2<br />

n1 o P (1) = o P (1).<br />

∣<br />

Now, by assumption B.3, it holds that ∣η nj x ′ j X 1C −1<br />

0. Thich completes <strong>the</strong> pro<strong>of</strong>.<br />

∣<br />

∣ /(nw nj ) − ∣η nj x ′ jX 1 C −1<br />

) ∥ ′ ∥∥∥ ∥ }<br />

∥∥| n1 ˜βnj˜s n1 − η nj s n1 | ∥<br />

n1 s ∣<br />

n1<br />

}<br />

n1 s n1∣<br />

∣ ≤ κ, so we indeed obtain P (B n4 ) →<br />

4.2 Partial asymptotic normality<br />

The pro<strong>of</strong> <strong>of</strong> <strong>the</strong> following result on partial asymptotic normality builds upon <strong>the</strong> fact that<br />

<strong>estimates</strong> <strong>of</strong> relevant coefficients stay away from zero with high probability as n tends to<br />

infinity. This is indeed a conclusion <strong>of</strong> Theorem 4.1.0.9 which asserts variable selection<br />

consistency. Note however that this result is too strong for this purpose and a consistency<br />

result for an l q -norm, q > 1 would actually be sufficient.<br />

Theorem 4.2.0.10. (Huang et al., 2008, Theorem 2)(Asymptotic normality for nonzero<br />

coefficients) Suppose that assumptions B.1 to B.6 are valid. For an arbitrary k n ×1-vector<br />

α n with ‖α n ‖ ≤ 1, let<br />

If M n1 λ n n −1/2 → 0, <strong>the</strong>n<br />

)<br />

n 1/2 s −1<br />

n α ′ n<br />

(ˆβn − β 0 = n −1/2 s −1<br />

n<br />

s 2 n = σ 2 α ′ nC −1<br />

n1 α′ n<br />

n∑<br />

ε i α ′ nC ′ n1u i + o P (1) N (0, 1).<br />

i=1<br />

where o P (1) is a term that converges to zero in probability uniformly with respect to α n .<br />

Pro<strong>of</strong>. Under assumptions B.1 to B.5, one has variable selection consistency according<br />

to Theorem 4.1.0.9, in particular ˆβ n1 has no zero component on a set with probability<br />

converging to one. On this set, one has ∂/∂β 1 L n ( ˆβ n1 , ˆβ n2 ) = 0, that is,<br />

−2<br />

n∑<br />

i=1<br />

(y i − u i ˆβn1 − z i ˆβn2<br />

)<br />

u i + 2λ n ψ n = 0 (4.2.0.8)


40 The adaptive <strong>Lasso</strong> in a high dimensional setting<br />

where ψ n =<br />

written as<br />

(<br />

w nj sgn( ˆβ<br />

)<br />

n1j ) . Since β n2 = 0 and ε i = Y i − u i β 10 , this can be<br />

1≤j≤k n<br />

n∑<br />

i=1<br />

(ε i − u i ( ˆβ n1 − β 01 ) − z i ˆβn2<br />

)<br />

u i + λ n ψ n = 0<br />

Using C n1 = n −1 X ′ n1 X n1, we first obtain<br />

<strong>the</strong>n<br />

n 1/2 α ′ n<br />

( )<br />

C n1 ˆβn1 − β 10 = 1 n<br />

( ˆβ n1 − β 10<br />

)<br />

= n −1/2 n ∑<br />

i=1<br />

n∑<br />

i=1<br />

ε i u i − λ n<br />

n ψ n − 1 n<br />

n∑<br />

z ′ ˆβ i n2 w i ,<br />

i=1<br />

ε i α ′ nC −1<br />

n1 u i − n −1/2 λ n α ′ nC −1<br />

n1 ψ n − n −1/2<br />

n ∑<br />

i=1<br />

z ′ i ˆβ n2 u i .<br />

By Theorem 4.1.0.9, <strong>the</strong> set { ˆβ n2 = 0} has probability tending to one, thus <strong>the</strong> last term<br />

converges to zero in probability. By <strong>the</strong> spectral decomposition Theorem and assumption<br />

B.5 one has<br />

‖C −1<br />

n1 ‖ = τ n1<br />

−1 ≤ τ 1 −1 .<br />

Cauchy-Schwarz inequality <strong>the</strong>n yields<br />

∣<br />

∣n −1/2 λ n α ′ nC −1<br />

∣ ≤ n −1/2 λ n ‖α ′ n‖ ‖C −1<br />

n1 ‖ ‖˜s n1‖<br />

n1 ψ n<br />

≤ n −1/2 λ n τ −1<br />

1 (1 + o P (1))M n1<br />

= o P (1)<br />

Here we used Lemma 4.0.2.8 in <strong>the</strong> first inequality and <strong>the</strong> assumption n −1/2 λ n M n1 → 0<br />

in <strong>the</strong> equality. So we obtain as claimed<br />

n 1/2 s −1<br />

n<br />

= n −1/2 s −1<br />

n<br />

n∑<br />

i=1<br />

ε i α ′ nC −1<br />

n1 u i + o P (1).<br />

To prove asymptotic normality, we verify that <strong>the</strong> Lindeberg-Feller condition holds. Set<br />

and<br />

v i = n −1/2 s −1<br />

n α n C −1<br />

n1 x n1<br />

ξ i = ε i v i .<br />

Then we have<br />

var<br />

( n ∑<br />

)<br />

ξ i<br />

i=1<br />

= σ 2 s −2<br />

n α nC ′ −1<br />

n1 α n = 1.<br />

For arbitrary δ > 0,<br />

n∑<br />

i=1<br />

(<br />

E ξi 2 1{|ξ| > δ}<br />

)<br />

= σ 2 n ∑<br />

i=1<br />

(<br />

)<br />

vi 2 E ε 2 i 1{|ε i v i | > δ}


4.3 Marginal regressors as initial <strong>estimates</strong> 41<br />

holds, and one easily verifies that σ 2 ∑ n<br />

i=1 vi 2 = 1. So by <strong>the</strong> Lebesgue’s dominated convergence<br />

<strong>the</strong>orem it is sufficient to show that<br />

max |v i| = max<br />

1≤i≤n i≤i≤n n−1/2 s −1 ∣<br />

n ∣α nC ′ −1<br />

n1 u i∣ → 0.<br />

This claim follows from assumptions B.5 and B.6 as<br />

max<br />

i≤i≤n n−1/2 s −1 ∣<br />

n<br />

∣α ′ nC −1<br />

n1 u ∣<br />

i<br />

∣ ≤ max<br />

1≤i≤n n−1/2 s −1<br />

n<br />

≤= σ −1 n −1/2 max<br />

1≤i≤n<br />

( )<br />

α ′ nC −1 1/2 ( )<br />

n1 α′ n u ′ iC −1 1/2<br />

n1 u i<br />

( )<br />

u ′ iC −1 1/2<br />

n1 u i<br />

≤ n −1/2 σ −1 ‖C −1<br />

n1 ‖ max<br />

1≤i≤n (u′ iu i ) 1/2<br />

≤ σ −1 τ −1<br />

1 n−1/2 max<br />

1≤i≤n (u′ iu i ) 1/2 → 0.<br />

4.3 Marginal regressors as initial <strong>estimates</strong><br />

The validity <strong>of</strong> both Theorem 4.1.0.9 and 4.2.0.10 requires <strong>the</strong> existence <strong>of</strong> an initial<br />

estimator ˜β n which satisfies assumption B.2, that is, is r n -consistent for <strong>the</strong> estimation<br />

<strong>of</strong> a proxy η n <strong>of</strong> <strong>the</strong> true parameters β 0 . In this section we show that under a weak<br />

correlation assumption between relavant and noise variables, this assumption is satisfied<br />

by marginal regressors which also <strong>of</strong>fer <strong>the</strong> advantage to be computationally attrative.<br />

The marginal regressor ˜β n is defined as<br />

and <strong>the</strong> proxies η nj are chosen as<br />

˜β nj = x′ j Y<br />

n , j = 1, . . . , p n (4.3.0.9)<br />

η nj = E( ˜β nj ) = x′ j Xβ 0<br />

n<br />

(4.3.0.10)<br />

The weak correlation assumption is formally stated as:<br />

Assumption C.<br />

C.1 Condition B.1 holds.<br />

C.2 For 1 ≤ j ≤ k n and k n < k ≤ p n , it holds that<br />

∣ 1<br />

n∑ ∣∣∣∣ ∣ x<br />

∣ ij x ik =<br />

1 ∣∣∣<br />

n<br />

∣n x jx k ≤ ρ n , 1 ≤ j ≤ k n , k n < k ≤ p n<br />

i=1<br />

with ρ n satisfying<br />

for some 0 < κ < 1.<br />

(<br />

) ⎛<br />

c n = max |η nj | ⎝<br />

k n


42 The adaptive <strong>Lasso</strong> in a high dimensional setting<br />

C.3 The minimum ˜b n1 = min 1≤j≤kn |η nj | satisfies<br />

for<br />

r n =<br />

kn<br />

1/2 (1 + c n )˜b n1 rn −1 → 0<br />

n 1/2<br />

log(m n ) 1/d log(n) 1{d=1}<br />

Theorem 4.3.0.11. (Huang et al., 2008, Theorem 3) Suppose that assumptions C.1 to<br />

C.3 hold. Then β nj defined in 4.3.0.9 is r n consistent for η nj defined in 4.3.0.10 and <strong>the</strong><br />

adaptive irrepresentable condition holds.<br />

Pro<strong>of</strong>. Let<br />

Then<br />

µ 0 = E(Y) =<br />

˜β nj = x′ j Y<br />

n<br />

p n<br />

∑<br />

j=1<br />

x j β 0j .<br />

= η nj + x′ j ε<br />

n<br />

where η nj = x ′ j µ 0/n. The covariates are assumed to be standardized, that is ‖x j ‖/n = 1,<br />

so for arbitrary δ > 0, Lemma 4.0.2.7 implies that<br />

( {<br />

P r n max | ˜β<br />

} ) ( { } )<br />

nj − η nj | > δ = P r n max |x ′ jε|/n > δ<br />

1≤j≤k n 1≤j≤k n<br />

( )<br />

≤ p n qn<br />

∗ n 1/2 rn<br />

−1 δ = o P (1).<br />

Here, r n log(p n ) log(n) 1{d=1} n 1/2 = o P (1) have been used in <strong>the</strong> last equality. This proves<br />

<strong>the</strong> first part <strong>of</strong> assumption B.2. The second part follows from assumption C.3:<br />

(<br />

∑k n 1<br />

j=1<br />

For asumption B.3, note that<br />

η 2 nj<br />

+ M 2 n2<br />

η 4 nj<br />

)<br />

≤ k n<br />

˜b2 nj<br />

(1 + c 2 n) = o(r 2 n).<br />

and<br />

∥<br />

∥X ′ ∥<br />

n1x j 2 ∑k n<br />

( = x<br />

′ ) 2<br />

i x j ≤ kn n 2 ρ 2 n<br />

i=1<br />

|η nj |‖s n1 ‖ ≤ kn<br />

1/2 c n<br />

for every k n < j ≤ p n . Now, assumption C.2 yields<br />

n −1 ∣<br />

|η nj | ∣x ′ jX n1 C −1<br />

n1 s n1∣ ≤ τn1 −1 c nk n ρ n .<br />

for such j. This completes <strong>the</strong> pro<strong>of</strong>.<br />

The general message <strong>of</strong> Theorem 4.3.0.11 is that marginals regressors can be used as initial<br />

<strong>estimates</strong> if correlations between relevant and noise variables are believed to be weak.


Chapter 5<br />

<strong>Subsampling</strong><br />

In his seminal paper already, Tibshirani (1996) noted that bootstrap variance estimates for the lasso could take the value zero. Knight and Fu (2000) showed by heuristic arguments that residual bootstrap estimates of the lasso distribution are inconsistent; formal arguments for this inconsistency were finally given by Chatterjee and Lahiri (2010), who proved that the conditional residual bootstrap distribution given the data converges to a random measure. In the subsequent paper Chatterjee and Lahiri (2011), they propose a consistent modification of the Lasso which also allows for consistency of residual bootstrap estimates. However, their modification, which consists in setting small estimates to zero according to a threshold value, can be problematic in finite samples in the presence of small parameters; furthermore, it involves choosing the value of the threshold parameter. In general, proving consistency of the bootstrap is a difficult task: it typically involves proving that the distribution of the root of interest is continuous in the sampling distribution, usually the empirical distribution. In this chapter we introduce subsampling, a resampling method without replacement which offers the advantage of being consistent under very weak assumptions.

In the remainder, we denote by $X^{(n)}$ a sample of $n$ i.i.d. random variables drawn from some distribution $P$ belonging to a family $\mathcal P$. For a real-valued parameter function $\theta: \mathcal P \to \mathbb R$ and a root $R_n(X^{(n)}, \theta(P))$, we denote the distribution of $R_n$ under $P$ by $J_n(\cdot, P)$.

5.1 Pointwise consistency for distribution estimation

In this section, we restrict ourselves to the case
$$R_n(X^{(n)}, \theta(P)) = \tau_n\left(\hat\theta_n(X^{(n)}) - \theta(P)\right),$$
where $\hat\theta_n$ is an estimator of the parameter $\theta$ and $\tau_n$ is a scaling factor tending to infinity. For a sample $X^{(n)}$ of size $n$ and a value $b = b(n) < n$, denote by $\{X_{n,b,i}\}_i$ the set of all $N_n := \binom{n}{b}$ subsamples of size $b$. The idea is to approximate the distribution of $\tau_b(\hat\theta_b - \theta(P))$ instead, using the subsamples $X_{n,b,i}$ at hand (which are true samples of size $b$ drawn from $P$) and replacing $\theta$ by $\hat\theta_n$. This results in the following estimator:
$$\hat L_{n,b}(x) := \frac{1}{N_n}\sum_{i=1}^{N_n} 1\left\{\tau_b\big(\hat\theta_{n,b,i} - \hat\theta_n\big) \le x\right\}. \qquad (5.1.0.1)$$
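Since evaluating all $N_n = \binom{n}{b}$ subsamples is infeasible, in practice one approximates $\hat L_{n,b}$ by $B$ random subsamples drawn without replacement. A minimal sketch (Python/NumPy; names are our own):

```python
import numpy as np

def subsampling_roots(data, estimator, b, tau, B=1000, rng=None):
    """Monte Carlo approximation of the subsampling estimator L_hat_{n,b}
    in 5.1.0.1: returns B values tau(b) * (theta_hat_{n,b,i} - theta_hat_n);
    their empirical distribution function approximates L_hat_{n,b}."""
    rng = np.random.default_rng(rng)
    n = len(data)
    theta_n = estimator(data)                       # full-sample estimate
    roots = np.empty(B)
    for i in range(B):
        idx = rng.choice(n, size=b, replace=False)  # subsample w/o replacement
        roots[i] = tau(b) * (estimator(data[idx]) - theta_n)
    return roots

# Example: roots sqrt(b) * (mean_b - mean_n) for the sample mean
# data = np.random.default_rng(0).normal(size=1000)
# roots = subsampling_roots(data, np.mean, b=80, tau=np.sqrt, B=500)
```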



The consistency of $\hat L_{n,b}$ as an estimator of $J_n(\cdot, P)$ is derived under

Assumption D. There exists a limit law $J(P)$ such that $J_n(P)$ converges weakly to $J(P)$ as $n \to \infty$.

Theorem 5.1.0.12. (Politis et al., 1999, Theorem 2.2.1) Assume that assumption D holds. Also assume $\tau_b/\tau_n \to 0$, $b \to \infty$, and $b/n \to 0$ as $n \to \infty$. Then the following are true:

(i) If $x$ is a continuity point of $J(\cdot, P)$, then
$$\hat L_{n,b}(x) \to_P J(x, P).$$

(ii) If $J(\cdot, P)$ is continuous, then
$$\sup_{x\in\mathbb R}\left|\hat L_{n,b}(x) - J_n(x, P)\right| \to_P 0.$$

(iii) Let
$$c_{n,b}(1-\alpha) = \inf\{x \in \mathbb R \mid \hat L_{n,b}(x) \ge 1-\alpha\}.$$
Correspondingly, define
$$c(1-\alpha, P) = \inf\{x \in \mathbb R \mid J(x, P) \ge 1-\alpha\}.$$
Then
$$1 - \alpha - \Delta J\big(c(1-\alpha, P)\big) \le \liminf_{n\to\infty} P\left(\tau_n(\hat\theta_n - \theta(P)) \le c_{n,b}(1-\alpha)\right) \le \limsup_{n\to\infty} P\left(\tau_n(\hat\theta_n - \theta(P)) \le c_{n,b}(1-\alpha)\right) \le 1 - \alpha,$$
where $\Delta J(x)$ denotes the jump height of $J(\cdot, P)$ at $x$. If $J(\cdot, P)$ is continuous at $c(1-\alpha, P)$, then
$$\lim_{n\to\infty} P\left(\tau_n(\hat\theta_n - \theta(P)) \le c_{n,b}(1-\alpha)\right) = 1 - \alpha.$$

(iv) Assume that $\tau_b(\hat\theta_n - \theta(P)) \to 0$ almost surely and that
$$\sum_{n=1}^\infty \exp\{-d(n/b)\} < \infty$$
for every $d > 0$. Then the convergences in (i) and (ii) hold with probability one.

Proof. Define
$$U_n(x) = U_n(x, P) = N_n^{-1}\sum_{i=1}^{N_n} 1\left\{\tau_b\big(\hat\theta_{n,b,i} - \theta(P)\big) \le x\right\}.$$
To prove (i), we first argue that it suffices to show that $U_n(x)$ converges in probability to $J(x, P)$ for every continuity point $x$ of $J(\cdot, P)$. Note that
$$\hat L_{n,b}(x) = N_n^{-1}\sum_{i=1}^{N_n} 1\left\{\tau_b\big(\hat\theta_{n,b,i} - \theta(P)\big) + \tau_b\big(\theta(P) - \hat\theta_n\big) \le x\right\}.$$
For arbitrary $\varepsilon > 0$, set
$$E_n(\varepsilon) = \left\{\tau_b\big|\theta(P) - \hat\theta_n\big| \le \varepsilon\right\}.$$
One then verifies that
$$U_n(x-\varepsilon)\,1\{E_n(\varepsilon)\} \le \hat L_{n,b}(x)\,1\{E_n(\varepsilon)\} \le U_n(x+\varepsilon)\,1\{E_n(\varepsilon)\}$$
holds. Further, note that the assumption $\tau_b/\tau_n \to 0$ implies $P(E_n(\varepsilon)) \to 1$, so for fixed $\varepsilon$, with probability tending to one, we obtain
$$U_n(x-\varepsilon) \le \hat L_{n,b}(x) \le U_n(x+\varepsilon).$$
If $U_n(x\pm\varepsilon) \to_P J(x\pm\varepsilon, P)$, we obtain
$$J(x-\varepsilon, P) - \varepsilon \le \hat L_{n,b}(x) \le J(x+\varepsilon, P) + \varepsilon$$
with probability tending to one. Hence, taking a sequence $\varepsilon_n \to 0$ such that $x \pm \varepsilon_n$ are continuity points of $J(\cdot, P)$ yields $\hat L_{n,b}(x) \to_P J(x, P)$. Thus, it is sufficient to show that $U_n(x) \to_P J(x, P)$ for every continuity point $x$ of $J(\cdot, P)$.

For every $1 \le i \le N_n$, $\hat\theta_{n,b,i}$ is a statistic based on a sample of size $b$ drawn from the distribution $P$; hence $U_n(x)$ is a U-statistic of degree $b$ with
$$0 \le U_n(x) \le 1 \quad\text{and}\quad E\big(U_n(x)\big) = J_b(x, P).$$
By Hoeffding's inequality (Serfling, 1980, Theorem A, Section 5.6), it follows that
$$P\big(U_n(x) - J_b(x, P) \ge t\big) \le \exp\left(-2\lfloor n/b\rfloor t^2\right)$$
for every $t > 0$. A similar inequality is obtained for $t < 0$ by considering $-U_n(x)$. So it follows that $U_n(x) - J_b(x, P) \to_P 0$; for continuity points $x$ of $J(\cdot, P)$ this yields $U_n(x) \to_P J(x, P)$, since for such $x$, $J_b(x, P) \to J(x, P)$ by the portmanteau theorem.

To prove (ii), we use the subsequence criterion. Following (i), given an arbitrary subsequence $\{j_n\}_n$, for every continuity point $x$ of $J(\cdot, P)$ one can extract a further subsequence $\{k_{j_n}\}_n$ such that $\hat L_{k_{j_n}}(x) \to J(x, P)$ almost surely. By a diagonal argument, one can assume that $\hat L_{k_{j_n}}(x) \to J(x, P)$ almost surely for every $x$ in a countable dense subset of the real line. So we obtain that $\hat L_{k_{j_n}}$ converges weakly to $J(\cdot, P)$. By continuity of $J(\cdot, P)$, it follows from Polya's theorem that
$$\sup_{x\in\mathbb R}\left|\hat L_{k_{j_n}}(x) - J(x, P)\right| \to 0$$
almost surely, which completes the argument.

To prove (iii), for $\alpha \in (0,1)$, define
$$c_L(1-\alpha) = \inf\{x \in \mathbb R \mid J(x, P) \ge 1-\alpha\}$$
and
$$c_U(1-\alpha) = \sup\{x \in \mathbb R \mid J(x, P) \le 1-\alpha\}.$$
Then choose $\varepsilon > 0$ such that $c_L(1-\alpha) - \varepsilon$ and $c_U(1-\alpha) + \varepsilon$ are both continuity points of $J(\cdot, P)$. Following (i), we have both
$$\hat L_{n,b}\big(c_L(1-\alpha) - \varepsilon\big) \to_P J\big(c_L(1-\alpha) - \varepsilon, P\big)$$
and
$$\hat L_{n,b}\big(c_U(1-\alpha) + \varepsilon\big) \to_P J\big(c_U(1-\alpha) + \varepsilon, P\big).$$
Hence, the sets
$$\left\{\hat L_{n,b}\big(c_L(1-\alpha)-\varepsilon\big) < 1-\alpha \le \hat L_{n,b}\big(c_U(1-\alpha)+\varepsilon\big)\right\} \subseteq \left\{c_L(1-\alpha)-\varepsilon < c_{n,b}(1-\alpha) \le c_U(1-\alpha)+\varepsilon\right\}$$
have probability tending to one as $n \to \infty$. It follows that
$$P\left(\tau_n(\hat\theta_n - \theta(P)) \le c_{n,b}(1-\alpha)\right) \le J_n\big(c_U(1-\alpha)+\varepsilon, P\big) + o(1)$$
and
$$P\left(\tau_n(\hat\theta_n - \theta(P)) \le c_{n,b}(1-\alpha)\right) \ge J_n\big(c_L(1-\alpha)-\varepsilon, P\big) + o(1).$$
Letting $n$ tend to infinity first and then $\varepsilon$ tend to zero yields the claimed inequalities, together with the portmanteau theorem.

Finally, (iv) can be proved similarly to (i) and (ii) using the Borel-Cantelli lemma.

Remark.

(i) Note that point (iii) also holds for the root $U_{n,b}$. Indeed, the proof for $\hat L_{n,b}(\cdot)$ rests solely on the convergence in probability of $\hat L_{n,b}(x)$ to $J(x, P)$ for every continuity point $x$ of $J(\cdot, P)$. As seen in the proof of (i), this is a property shared by $U_{n,b}(\cdot)$ as well, and this without even requiring $\tau_b/\tau_n \to 0$; the assumption $b/n \to 0$ is sufficient. Obviously, the price to pay is larger confidence intervals.

(ii) The conclusion of point (iii) can also be stated for two-sided confidence intervals, with obvious changes in the assumptions.

In the regular situation where $\tau_n = \sqrt n$, the choice $b = n^\delta$ for some $0 < \delta < 1$ satisfies the conditions of Theorem 5.1.0.12.
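As an illustration of part (iii), the following sketch (our own code, with hypothetical names) computes the empirical quantile $c_{n,b}(1-\alpha)$ of the subsampling roots and the resulting one-sided interval $[\hat\theta_n - c_{n,b}(1-\alpha)/\tau_n,\ \infty)$ in the regular case $\tau_n = \sqrt n$, $b = n^\delta$:

```python
import numpy as np

def subsampling_ci_lower(data, estimator, delta=0.7, alpha=0.05, B=1000, rng=0):
    """One-sided subsampling confidence interval, Theorem 5.1.0.12 (iii),
    for the root sqrt(n) * (theta_hat_n - theta(P)), with b = n^delta."""
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** delta)                       # b -> infinity, b/n -> 0
    theta_n = estimator(data)
    roots = np.empty(B)
    for i in range(B):
        idx = rng.choice(n, size=b, replace=False)
        roots[i] = np.sqrt(b) * (estimator(data[idx]) - theta_n)
    c = np.quantile(roots, 1 - alpha)         # c_{n,b}(1 - alpha)
    return theta_n - c / np.sqrt(n), np.inf
```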

In view of our goal, constructing confidence intervals for Lasso estimates, the message conveyed by Theorem 5.1.0.12 is the following: if the $1-\alpha$ quantile happens to be a discontinuity point, and this can indeed happen if the corresponding parameter is equal to zero (cf. Theorem 3.2.1.1), the subsampling confidence interval asymptotically carries an error which is, in the worst case, equal to the jump height at the quantile. However, as we will see in the next section, this conclusion is too pessimistic: some form of uniform convergence is all we need to achieve consistency.

5.2 Uniform consistency for quantile approximation

The present section focuses on the use of subsampling for the construction of confidence intervals only, in contrast to the previous one, where the estimation of the distribution function in a uniform sense was also considered. We will see that achieving asymptotically valid or conservative confidence intervals is possible if the distribution functions satisfy some uniformity or monotonicity condition in the limit.

All results of this section, apart from the Dvoretzky-Kiefer-Wolfowitz inequality, are due to Romano and Shaikh (2010), who stated their results in a uniform sense for a family of probability measures; we follow their exposition.



5.2.1 Statement of the general result

Theorem 5.2.1.1. (Uniform asymptotic validity of subsampling) Let $X^{(n)} = (X_1, \dots, X_n)$ be an i.i.d. sequence of random variables with distribution $P \in \mathcal P$. Denote by $J_n(x, P)$ the distribution of a real-valued root $R_n = R_n(X^{(n)}, P)$ under $P$. Let $b = b_n < n$ be a sequence of positive integers tending to infinity but satisfying $b/n \to 0$. Let $N_n = \binom{n}{b}$ and
$$L_n(x, P) = \frac{1}{N_n}\sum_{1\le i\le N_n} 1\left\{R_b(X_{n,(b),i}, P) \le x\right\}, \qquad (5.2.1.1)$$
where $X_{n,(b),i}$ denotes the $i$-th subset of data of size $b$. Then the following statements are true for every $\alpha \in (0, 1)$:

(i) If $\limsup_{n\to\infty}\sup_{P\in\mathcal P}\sup_{x\in\mathbb R}\{J_b(x, P) - J_n(x, P)\} \le 0$, then
$$\liminf_{n\to\infty}\inf_{P\in\mathcal P} P\left(R_n \le L_n^{-1}(1-\alpha, P)\right) \ge 1-\alpha. \qquad (5.2.1.2)$$

(ii) If $\limsup_{n\to\infty}\sup_{P\in\mathcal P}\sup_{x\in\mathbb R}\{J_n(x, P) - J_b(x, P)\} \le 0$, then
$$\liminf_{n\to\infty}\inf_{P\in\mathcal P} P\left(R_n \ge L_n^{-1}(\alpha, P)\right) \ge 1-\alpha. \qquad (5.2.1.3)$$

(iii) If $\limsup_{n\to\infty}\sup_{P\in\mathcal P}\sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| = 0$, then 5.2.1.2 and 5.2.1.3 hold with the $\liminf_{n\to\infty}$ and $\ge$ replaced by $\lim_{n\to\infty}$ and $=$, respectively. Moreover,
$$\lim_{n\to\infty}\inf_{P\in\mathcal P} P\left(L_n^{-1}(\alpha, P) \le R_n \le L_n^{-1}(1-\alpha, P)\right) = 1-2\alpha. \qquad (5.2.1.4)$$

Remark. Consider again the root $R_n$ of the previous section:
$$R_n(X^{(n)}, P) = \tau_n(\hat\theta_n - \theta(P)), \qquad (5.2.1.5)$$
where $\hat\theta_n = \hat\theta_n(X^{(n)})$ is an estimate of a real-valued parameter $\theta(P)$ and $\tau_n > 0$ is a normalizing sequence tending to infinity. For the feasible estimate $\hat L_n(x)$ of $J_n(x, P)$ defined as
$$\hat L_n(x) = \frac{1}{N_n}\sum_{1\le i\le N_n} 1\left\{\tau_b\big(\hat\theta_b(X_{n,(b),i}) - \hat\theta_n\big) \le x\right\},$$
we have
$$\begin{aligned}
L_n^{-1}(1-\alpha, P) &= \inf_{x\in\mathbb R}\left\{\frac{1}{N_n}\sum_{1\le i\le N_n} 1\left\{\tau_b\big(\hat\theta_b(X_{n,(b),i}) - \theta(P)\big) \le x\right\} \ge 1-\alpha\right\}\\
&= \inf_{x\in\mathbb R}\left\{\frac{1}{N_n}\sum_{1\le i\le N_n} 1\left\{\tau_b\big(\hat\theta_b(X_{n,(b),i}) - \hat\theta_n\big) \le x - \tau_b(\hat\theta_n - \theta(P))\right\} \ge 1-\alpha\right\}\\
&= \hat L_n^{-1}(1-\alpha) + \tau_b(\hat\theta_n - \theta(P)).
\end{aligned}$$

Hence, under the conditions of Theorem 5.2.1.1, we obtain for this particular root:

(i) $\liminf_{n\to\infty}\inf_{P\in\mathcal P} P\left((\tau_n - \tau_b)(\hat\theta_n - \theta(P)) \le \hat L_n^{-1}(1-\alpha)\right) \ge 1-\alpha$

(ii) $\liminf_{n\to\infty}\inf_{P\in\mathcal P} P\left((\tau_n - \tau_b)(\hat\theta_n - \theta(P)) \ge \hat L_n^{-1}(\alpha)\right) \ge 1-\alpha$

(iii) $\lim_{n\to\infty}\inf_{P\in\mathcal P} P\left(\hat L_n^{-1}(\alpha) \le (\tau_n - \tau_b)(\hat\theta_n - \theta(P)) \le \hat L_n^{-1}(1-\alpha)\right) = 1-2\alpha$.

Remark. Following the previous remark, under the additional assumption $\tau_b/\tau_n \to 0$, conclusions (i), (ii) and (iii) are also valid for a scaling factor $\tau_n$ in lieu of $\tau_n - \tau_b$.
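A sketch of the two-sided interval from conclusion (iii) with the $(\tau_n - \tau_b)$ scaling, here $\sqrt n - \sqrt b$ (our own code, same conventions as in the earlier sketches):

```python
import numpy as np

def subsampling_ci_two_sided(data, estimator, b, alpha=0.05, B=1000, rng=0):
    """Two-sided interval using the feasible roots tau_b*(theta_b - theta_n)
    and the conservative scaling tau_n - tau_b = sqrt(n) - sqrt(b)."""
    rng = np.random.default_rng(rng)
    n = len(data)
    theta_n = estimator(data)
    roots = np.empty(B)
    for i in range(B):
        idx = rng.choice(n, size=b, replace=False)
        roots[i] = np.sqrt(b) * (estimator(data[idx]) - theta_n)
    lo, hi = np.quantile(roots, [alpha, 1 - alpha])
    scale = np.sqrt(n) - np.sqrt(b)
    return theta_n - hi / scale, theta_n - lo / scale
```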

5.2.2 Proof

Lemma 5.2.2.1. (Massart (1990); Dvoretzky-Kiefer-Wolfowitz inequality) For $n \in \mathbb N$, let $X_1, \dots, X_n$ be i.i.d. random variables with probability distribution function $F$. Let $\hat F_n$ be the empirical distribution function, that is,
$$\hat F_n(x) = \frac{1}{n}\sum_{i=1}^n 1\{X_i \le x\}.$$
Then for every $n \in \mathbb N$ and $\varepsilon > 0$,
$$P\left(\sup_{x\in\mathbb R}\big|\hat F_n(x) - F(x)\big| > \varepsilon\right) \le 2\exp\left(-2n\varepsilon^2\right). \qquad (5.2.2.1)$$
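The bound 5.2.2.1 can be checked numerically; a small Monte Carlo sketch (our own) for the uniform distribution, where $F$ is the identity on $[0,1]$:

```python
import numpy as np

def dkw_exceedance(n=200, eps=0.1, reps=2000, rng=0):
    """Empirical P(sup_x |F_hat_n - F| > eps) versus the DKW bound."""
    rng = np.random.default_rng(rng)
    exceed = 0
    for _ in range(reps):
        x = np.sort(rng.uniform(size=n))       # F(x) = x on [0, 1]
        ecdf_hi = np.arange(1, n + 1) / n      # F_hat at the sample points
        ecdf_lo = np.arange(0, n) / n          # F_hat just below them
        sup_dev = max(np.max(ecdf_hi - x), np.max(x - ecdf_lo))
        exceed += sup_dev > eps
    return exceed / reps, 2 * np.exp(-2 * n * eps ** 2)

# dkw_exceedance() returns (empirical frequency, DKW bound); the first
# should not exceed the second up to Monte Carlo error.
```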

The following inequalities will be useful. Their verification is elementary, hence we omit the proof.

Lemma 5.2.2.2. (Romano and Shaikh, 2010, Lemma 4.1) Let $G$ and $F$ be (nonrandom) distribution functions on $\mathbb R$. Then the following are true:

(i) If $\sup_{x\in\mathbb R}\{G(x) - F(x)\} \le \varepsilon$, then
$$G^{-1}(1-\alpha) \ge F^{-1}(1-(\alpha+\varepsilon)). \qquad (5.2.2.2)$$

(ii) If $\sup_{x\in\mathbb R}\{F(x) - G(x)\} \le \varepsilon$, then
$$G^{-1}(\alpha) \le F^{-1}(\alpha+\varepsilon). \qquad (5.2.2.3)$$

Furthermore, for a random variable $X$ with probability distribution function $F$, it holds:

(i) If $\sup_{x\in\mathbb R}\{G(x) - F(x)\} \le \varepsilon$, then
$$P\left(X \le G^{-1}(1-\alpha)\right) \ge 1-(\alpha+\varepsilon). \qquad (5.2.2.4)$$

(ii) If $\sup_{x\in\mathbb R}\{F(x) - G(x)\} \le \varepsilon$, then
$$P\left(X \ge G^{-1}(\alpha)\right) \ge 1-(\alpha+\varepsilon). \qquad (5.2.2.5)$$

(iii) If $\sup_{x\in\mathbb R}|G(x) - F(x)| \le \varepsilon$, then
$$P\left(G^{-1}(\alpha) \le X \le G^{-1}(1-\alpha)\right) \ge 1-2(\alpha+\varepsilon). \qquad (5.2.2.6)$$



If $\hat G$ is a random distribution function on $\mathbb R$ and $\delta > 0$, then we further have:

(i) If $P\left(\sup_{x\in\mathbb R}\{\hat G(x) - F(x)\} \le \varepsilon\right) \ge 1-\delta$, then
$$P\left(X \le \hat G^{-1}(1-\alpha)\right) \ge 1-(\alpha+\varepsilon+\delta). \qquad (5.2.2.7)$$

(ii) If $P\left(\sup_{x\in\mathbb R}\{F(x) - \hat G(x)\} \le \varepsilon\right) \ge 1-\delta$, then
$$P\left(X \ge \hat G^{-1}(\alpha)\right) \ge 1-(\alpha+\varepsilon+\delta). \qquad (5.2.2.8)$$

(iii) If $P\left(\sup_{x\in\mathbb R}|\hat G(x) - F(x)| \le \varepsilon\right) \ge 1-\delta$, then
$$P\left(\hat G^{-1}(\alpha) \le X \le \hat G^{-1}(1-\alpha)\right) \ge 1-2(\alpha+\varepsilon)-\delta. \qquad (5.2.2.9)$$

Lemma 5.2.2.3. Let $X^{(n)} = (X_1, \dots, X_n)$ be an i.i.d. sequence of random variables with distribution $P$. Denote by $J_n(x, P)$ the probability distribution function of a real-valued root $R_n = R_n(X^{(n)}, P)$ under $P$. Let $N_n = \binom{n}{b}$, $k_n = \lfloor n/b\rfloor$ and define $L_n(x, P)$ according to 5.2.1.1. Then, for every $\varepsilon > 0$ and every $0 < \delta < 1$, we have:

(i)
$$P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_b(x, P)| > \varepsilon\right) \le \frac{1}{\varepsilon}\sqrt{\frac{2\pi}{k_n}} \qquad (5.2.2.10)$$

(ii)
$$P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_b(x, P)| > \varepsilon\right) \le \frac{\delta}{\varepsilon} + \frac{2}{\varepsilon}\exp(-2k_n\delta^2) \qquad (5.2.2.11)$$

Proof. Define
$$S_n(x, P; X_1, \dots, X_n) = \frac{1}{k_n}\sum_{1\le i\le k_n} 1\left\{R_b\big((X_{b(i-1)+1}, \dots, X_{bi}), P\big) \le x\right\} - J_b(x, P).$$
Denote by $\mathcal S_n$ the symmetric group on a set of cardinality $n$. Note that
$$\{X_{n,(b),i}\}_{1\le i\le N_n} = \bigcup_{\pi\in\mathcal S_n}\left\{\big(X_{\pi(b(i-1)+1)}, \dots, X_{\pi(bi)}\big)\right\}_{1\le i\le k_n}.$$
This allows us to express
$$\frac{1}{N_n}\sum_{1\le i\le N_n} 1\left\{R_b(X_{n,(b),i}, P) \le x\right\} - J_b(x, P)$$
as
$$Z_n(x, P; X_1, \dots, X_n) = \frac{1}{n!}\sum_{\pi\in\mathcal S_n} S_n\big(x, P; X_{\pi(1)}, \dots, X_{\pi(n)}\big).$$
Then we have
$$\sup_{x\in\mathbb R}|Z_n(x, P; X_1, \dots, X_n)| \le \frac{1}{n!}\sum_{\pi\in\mathcal S_n}\sup_{x\in\mathbb R}\big|S_n\big(x, P; X_{\pi(1)}, \dots, X_{\pi(n)}\big)\big|,$$
where the right-hand side is the average of $n!$ identically distributed random variables; indeed, given some $\pi \in \mathcal S_n$, for every $1 \le i \le k_n$, $\big(X_{\pi(b(i-1)+1)}, \dots, X_{\pi(bi)}\big)$ is a sample of size $b$ from the distribution $P$. For an arbitrary $\varepsilon > 0$ we have
$$\begin{aligned}
P\left(\sup_{x\in\mathbb R}|Z_n(x, P; X_1, \dots, X_n)| > \varepsilon\right)
&\le P\left(\frac{1}{n!}\sum_{\pi\in\mathcal S_n}\sup_{x\in\mathbb R}\big|S_n\big(x, P; X_{\pi(1)}, \dots, X_{\pi(n)}\big)\big| > \varepsilon\right)\\
&\le \frac{1}{\varepsilon}\, E\left(\sup_{x\in\mathbb R}|S_n(x, P; X_1, \dots, X_n)|\right)\\
&\le \frac{1}{\varepsilon}\int_0^1 P\left(\sup_{x\in\mathbb R}|S_n(x, P; X_1, \dots, X_n)| > u\right) du,
\end{aligned}$$
where Markov's inequality has been used in the second inequality and Fubini's theorem in the third. Applying the Dvoretzky-Kiefer-Wolfowitz inequality 5.2.2.1 to the roots $R_b(X_{n,(b),i}, P)$, we obtain
$$P\left(\sup_{x\in\mathbb R}|Z_n(x, P; X_1, \dots, X_n)| > \varepsilon\right) \le \frac{1}{\varepsilon}\int_0^1 2\exp(-2k_n u^2)\, du = \frac{1}{\varepsilon}\sqrt{\frac{2\pi}{k_n}}\left(\Phi(2\sqrt{k_n}) - \frac12\right) \le \frac{1}{\varepsilon}\sqrt{\frac{2\pi}{k_n}},$$
where $\Phi(\cdot)$ is the standard normal probability distribution function; this proves the first inequality of the claim. Further, for arbitrary $0 < \delta < 1$, following the previous arguments and after partitioning the unit interval according to $\delta$, we obtain
$$P\left(\sup_{x\in\mathbb R}|Z_n(x, P; X_1, \dots, X_n)| > \varepsilon\right) \le \frac{1}{\varepsilon}\, E\left(\sup_{x\in\mathbb R}|S_n(x, P; X_1, \dots, X_n)|\right) \le \frac{\delta}{\varepsilon} + \frac{1}{\varepsilon}\, P\left(\sup_{x\in\mathbb R}|S_n(x, P; X_1, \dots, X_n)| > \delta\right).$$
Applying the Dvoretzky-Kiefer-Wolfowitz inequality 5.2.2.1 to bound the second term on the right-hand side yields the second part of the claim.

Lemma 5.2.2.4. Let $X^{(n)} = (X_1, \dots, X_n)$ be an i.i.d. sequence of random variables with distribution $P$. Denote by $J_n(x, P)$ the distribution of a real-valued root $R_n = R_n(X^{(n)}, P)$ under $P$. Let $k_n = \lfloor n/b\rfloor$ and define $L_n(x, P)$ according to 5.2.1.1. Then, for every $\varepsilon > 0$ and every $0 < \gamma < 1$, the following hold:

(i) $P\left(\sup_{x\in\mathbb R}\{L_n(x, P) - J_n(x, P)\} > \varepsilon\right) \le \dfrac{1}{\gamma\varepsilon}\sqrt{\dfrac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}\{J_b(x, P) - J_n(x, P)\} > (1-\gamma)\varepsilon\right\}$

(ii) $P\left(\sup_{x\in\mathbb R}\{J_n(x, P) - L_n(x, P)\} > \varepsilon\right) \le \dfrac{1}{\gamma\varepsilon}\sqrt{\dfrac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}\{J_n(x, P) - J_b(x, P)\} > (1-\gamma)\varepsilon\right\}$

(iii) $P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_n(x, P)| > \varepsilon\right) \le \dfrac{1}{\gamma\varepsilon}\sqrt{\dfrac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > (1-\gamma)\varepsilon\right\}$

Proof. We prove (iii); (i) and (ii) can be proved by similar arguments. For arbitrary $\varepsilon > 0$ and $0 < \gamma < 1$, denote by $A_n(\varepsilon, \gamma)$ the event $\{\sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| \le (1-\gamma)\varepsilon\}$. We then have:
$$\begin{aligned}
P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_n(x, P)| > \varepsilon\right)
&\le P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_b(x, P)| + \sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > \varepsilon\right)\\
&\le P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_b(x, P)| + \sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > \varepsilon;\ A_n(\varepsilon, \gamma)\right)\\
&\quad + P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_b(x, P)| + \sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > \varepsilon;\ A_n(\varepsilon, \gamma)^c\right)\\
&\le P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_b(x, P)| > \gamma\varepsilon\right) + 1\left\{\sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > (1-\gamma)\varepsilon\right\}.
\end{aligned}$$
Applying Lemma 5.2.2.3 to the first term on the right-hand side of the last inequality completes the argument.

Lemma 5.2.2.5. Let $X^{(n)} = (X_1, \dots, X_n)$ be an i.i.d. sequence of random variables with distribution $P \in \mathcal P$. Denote by $J_n(x, P)$ the distribution of a real-valued root $R_n = R_n(X^{(n)}, P)$ under $P$. Let $k_n = \lfloor n/b\rfloor$ and define $L_n(x, P)$ according to 5.2.1.1. For arbitrary $\varepsilon > 0$ and $0 < \gamma < 1$, define
$$\delta_{1,n}(\varepsilon, \gamma, P) = \frac{1}{\gamma\varepsilon}\sqrt{\frac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}\{J_b(x, P) - J_n(x, P)\} > (1-\gamma)\varepsilon\right\},$$
$$\delta_{2,n}(\varepsilon, \gamma, P) = \frac{1}{\gamma\varepsilon}\sqrt{\frac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}\{J_n(x, P) - J_b(x, P)\} > (1-\gamma)\varepsilon\right\},$$
$$\delta_{3,n}(\varepsilon, \gamma, P) = \frac{1}{\gamma\varepsilon}\sqrt{\frac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > (1-\gamma)\varepsilon\right\}.$$
Then we have:

(i) $P\left(R_n \le L_n^{-1}(1-\alpha)\right) \ge 1 - (\alpha + \varepsilon + \delta_{1,n}(\varepsilon, \gamma, P))$

(ii) $P\left(R_n \ge L_n^{-1}(\alpha)\right) \ge 1 - (\alpha + \varepsilon + \delta_{2,n}(\varepsilon, \gamma, P))$

(iii) $P\left(L_n^{-1}(\alpha/2) \le R_n \le L_n^{-1}(1-\alpha/2)\right) \ge 1 - (\alpha + \varepsilon + \delta_{3,n}(\varepsilon, \gamma, P))$
≥ 1 − (α + ε + δ 3,n (ε, γ, P ))


Proof. We prove (iii); (i) and (ii) are established by similar arguments. By part (iii) of Lemma 5.2.2.4, for arbitrary $\varepsilon > 0$ and $0 < \gamma < 1$, we have
$$P\left(\sup_{x\in\mathbb R}|L_n(x, P) - J_n(x, P)| \le \varepsilon\right) \ge 1 - \delta_{3,n}(\varepsilon, \gamma, P).$$
The claim directly follows from the last claim of Lemma 5.2.2.2 with $\hat G(x) = L_n(x, P)$ and $X = R_n(X^{(n)}, P)$ with probability distribution function $F(x) = J_n(x, P)$.

We are now able to prove Theorem 5.2.1.1.

Proof. We prove (iii); the arguments for (i) and (ii) go along the same lines. For arbitrary $\varepsilon > 0$ and $0 < \gamma < 1$, define
$$\delta_{3,n}(\varepsilon, \gamma, P) = \frac{1}{\gamma\varepsilon}\sqrt{\frac{2\pi}{k_n}} + 1\left\{\sup_{x\in\mathbb R}|J_b(x, P) - J_n(x, P)| > (1-\gamma)\varepsilon\right\}.$$
By Lemma 5.2.2.5, we have
$$\inf_{P\in\mathcal P} P\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha)\right) \ge 1 - \left(2\alpha + \varepsilon + \sup_{P\in\mathcal P}\{\delta_{3,n}(\varepsilon, \gamma, P)\}\right).$$
Under the assumption $\lim_{n\to\infty}\sup_{P\in\mathcal P}\sup_{x\in\mathbb R}|J_n(x, P) - J_b(x, P)| = 0$, for fixed $\varepsilon$ we have $\sup_{P\in\mathcal P}\{\delta_{3,n}(\varepsilon, \gamma, P)\} \to 0$. By a diagonal argument, one argues that there exists a sequence $\varepsilon_n \searrow 0$ such that $\sup_{P\in\mathcal P}\{\delta_{3,n}(\varepsilon_n, \gamma, P)\} \to 0$. Applying Lemma 5.2.2.5 to the sequence $\{\varepsilon_n\}_{n\in\mathbb N}$ yields
$$\liminf_{n\to\infty}\inf_{P\in\mathcal P} P\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha)\right) \ge 1 - 2\alpha.$$
For the other inequality, we show that $\limsup_{n\to\infty}\inf_{P\in\mathcal P} P\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha)\right) \le 1 - 2\alpha$. Consider an arbitrary $P_0 \in \mathcal P$. Then for every $\varepsilon > 0$ and $0 < \gamma < 1$,
$$\begin{aligned}
P_0\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha)\right)
&= P_0\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha);\ \sup_{x\in\mathbb R}|J_n(x) - L_n(x)| \le \varepsilon\right)\\
&\quad + P_0\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha);\ \sup_{x\in\mathbb R}|J_n(x) - L_n(x)| > \varepsilon\right)\\
&\le P_0\left(L_n^{-1}(\alpha) \le R_n \le L_n^{-1}(1-\alpha);\ J_n^{-1}(\alpha-\varepsilon) \le L_n^{-1}(\alpha);\ L_n^{-1}(1-\alpha) \le J_n^{-1}(1-\alpha+\varepsilon)\right)\\
&\quad + P_0\left(\sup_{x\in\mathbb R}|J_n(x) - L_n(x)| > \varepsilon\right)\\
&\le P_0\left(J_n^{-1}(\alpha-\varepsilon) \le R_n \le J_n^{-1}(1-\alpha+\varepsilon)\right) + \delta_{3,n}(\varepsilon, \gamma, P_0)\\
&\le 1 - 2\alpha + 2\varepsilon + \delta_{3,n}(\varepsilon, \gamma, P_0),
\end{aligned}$$
where Lemma 5.2.2.2 has been used for the quantile inequalities and Lemma 5.2.2.4 for the bound on the second term. Further, under the assumption $\lim_{n\to\infty}\sup_{P\in\mathcal P}\sup_{x\in\mathbb R}|J_n(x, P) - J_b(x, P)| = 0$, we have
$$1 - 2\alpha + 2\varepsilon + \delta_{3,n}(\varepsilon, \gamma, P_0) \to 1 - 2\alpha + 2\varepsilon;$$
letting $\varepsilon$ tend to zero completes the proof.


Figure 5.1: In red, the subsampling distribution estimates for the roots $\sqrt n(\hat\beta_j - \beta_j)$, $j = 1, \dots, 9$; $n = 10000$, $b = n^{0.65} \approx 400$, $B = 2000$, with penalization parameter $\lambda_n = 2\sqrt n$.


Chapter 6

Numerical results

Motivated by the conclusions of Proposition 3.2.1.1 and Theorem 5.2.1.1, we conduct numerical simulations to evaluate the finite-sample performance of subsampling-based confidence intervals. Control of the type I error is addressed for zero coefficients using the duality between confidence intervals and hypothesis tests. Also, subsampling-based p-values are constructed for the purpose of multiple testing adjustment. Finally, in the last section, the method is applied to the adaptive Lasso in a high dimensional setting.

6.1 Low dimensional setting

For the simulation study in the low dimensional setting, we consider data sets generated from six linear models with random predictors. The six models differ in the correlation structure between the covariates and in the noise level. More precisely, data sets are generated from a linear model
$$Y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \dots, n,$$
with regression parameter
$$\beta = (1.5, -1.5, 0.75, -1.5, 1.5, -3, 0, \dots, 0)' \in \mathbb R^{20}.$$
The errors $\{\varepsilon_i\}_i$ are i.i.d. normal with mean zero and standard deviation $\sigma = 1$ for models A, B and C, and $\sigma = 1.5$ for models A', B' and C'. Rows of $X$ are normally distributed vectors with mean zero and Toeplitz covariance matrix
$$C_{ij} = \rho^{|i-j|}.$$
The models differ in the value of $\rho$:

• Model A and A': $\rho = 0$ (orthogonal design)
• Model B and B': $\rho = 0.6$
• Model C and C': $\rho = 0.9$



6.1.1 Confidence intervals

We recall the definition of a confidence interval.

Definition 6.1.1.1. Let $\alpha \in (0, 1)$. An interval-valued function $I^{(j)}(Z^{(n)})$ is called a confidence interval for the parameter $\beta_j$ if it satisfies
$$P\left(\beta_j \in I^{(j)}(Z^{(n)})\right) \ge 1 - \alpha.$$

Estimation procedure

For a sample $(Y_i, x_i)_{i=1}^n$ of simulated observations, we proceed as follows to construct confidence intervals for the coefficients:

1. Compute the Lasso solution path for the scaled and centered observations, that is, for
$$\tilde Y_i = Y_i - n^{-1}\sum_{l=1}^n Y_l, \qquad i = 1, \dots, n,$$
$$\tilde x_{ij} = \frac{x_{ij} - n^{-1}\sum_{l=1}^n x_{lj}}{\left(n^{-1}\sum_{l=1}^n\big(x_{lj} - n^{-1}\sum_{k=1}^n x_{kj}\big)^2\right)^{1/2}}, \qquad j = 1, \dots, p,\ i = 1, \dots, n.$$

2. Choose the penalization parameter by K-fold cross-validation (we choose K = 10) on the whole data set $(\tilde Y_i, \tilde x_i)_{i=1}^n$. Denote it by $\lambda_{n,CV}$.

3. Set $\hat\beta_n$ as the Lasso solution for the data set $(\tilde Y_i, \tilde x_i)_{i=1}^n$ corresponding to the parameter $\lambda_{n,CV}$.

4. Repeat the following steps for $m = 1, \dots, B$:

(a) Generate a random subsample $I_m \subset \{1, \dots, n\}$ of size $b$ by drawing without replacement.

(b) Compute the Lasso solution path for the scaled and centered data set with indices $i \in I_m$, that is, for
$$\tilde Y_i^{(m)} = Y_i - b^{-1}\sum_{l\in I_m} Y_l, \qquad i \in I_m,$$
$$\tilde x_{ij}^{(m)} = \frac{x_{ij} - b^{-1}\sum_{l\in I_m} x_{lj}}{\left(b^{-1}\sum_{l\in I_m}\big(x_{lj} - b^{-1}\sum_{k\in I_m} x_{kj}\big)^2\right)^{1/2}}, \qquad j = 1, \dots, p,\ i \in I_m.$$

(c) Set $\hat\beta_b^{(m)}$ as the Lasso solution for the data set $(\tilde Y^{(m)}, \tilde X^{(m)})$ corresponding to the rescaled penalization parameter $\lambda_{b,CV} = \lambda_{n,CV}\sqrt{b/n}$. Set $L_{n,b,m}$ and $L_{n,r,m}$ to
$$L_{n,b,m} = \sqrt b\,\big(\hat\beta_b^{(m)} - \hat\beta_n\big) \qquad\text{and}\qquad L_{n,r,m} = \sqrt r\,\big(\hat\beta_b^{(m)} - \hat\beta_n\big),$$
respectively. Here, $r = b/(1 - b/n)$ is the finite-sample-corrected subsample size, cf. (Politis et al., 1999, Section 10.3.1).

5. Determine separately for each $j \in \{1, \dots, p\}$ the following empirical quantiles of $L_{n,b,\cdot}^{(j)}$ and $L_{n,r,\cdot}^{(j)}$; that is, with $L_{n,b,(\cdot)}^{(j)}$ and $L_{n,r,(\cdot)}^{(j)}$ being the order statistics, set
$$c_{n,b}^{(j)}(1-\alpha) = L_{n,b,(\lfloor(1-\alpha)\cdot B\rfloor)}^{(j)}, \qquad c_{n,b}^{(j)}(\alpha/2) = L_{n,b,(\lfloor(\alpha/2)\cdot B\rfloor)}^{(j)}, \qquad c_{n,b}^{(j)}(1-\alpha/2) = L_{n,b,(\lfloor(1-\alpha/2)\cdot B\rfloor)}^{(j)},$$
and the analogous quantities for $L_{n,r,(\cdot)}^{(j)}$.

6. For each $j \in \{1, \dots, p\}$, define the confidence intervals
$$I_1^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n}\, c_{n,r}^{(j)}(1-\alpha),\ \infty\right)$$
$$I_2^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n}\, c_{n,r}^{(j)}(1-\alpha/2),\ \hat\beta_n^{(j)} - \tfrac{1}{\sqrt n}\, c_{n,r}^{(j)}(\alpha/2)\right]$$
$$I_3^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n - \sqrt b}\, c_{n,b}^{(j)}(1-\alpha),\ \infty\right)$$
$$I_4^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n - \sqrt b}\, c_{n,b}^{(j)}(1-\alpha/2),\ \hat\beta_n^{(j)} - \tfrac{1}{\sqrt n - \sqrt b}\, c_{n,b}^{(j)}(\alpha/2)\right]$$

Estimated confidence intervals are illustrated in Figures 6.1, 6.2 and 6.3.
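A condensed sketch of steps 1 to 6 in code (Python with scikit-learn; `LassoCV`/`Lasso` stand in for the Lasso-path computation and use scikit-learn's per-sample penalty convention, so this is an illustration of the procedure, not the exact implementation used in the thesis):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def subsampling_lasso_cis(X, Y, b, B=500, alpha=0.05, rng=0):
    """Steps 1-6: CV-tuned Lasso on the full sample, Lasso on B subsamples
    with rescaled penalty, and the two-sided interval I_2 per coefficient."""
    rng = np.random.default_rng(rng)
    n, p = X.shape

    def standardize(Xs, Ys):
        Xc = Xs - Xs.mean(axis=0)
        return Xc / Xc.std(axis=0), Ys - Ys.mean()

    Xt, Yt = standardize(X, Y)
    lam_n = LassoCV(cv=10).fit(Xt, Yt).alpha_        # lambda_{n,CV}, step 2
    beta_n = Lasso(alpha=lam_n).fit(Xt, Yt).coef_    # step 3

    r = b / (1.0 - b / n)                            # corrected subsample size
    L = np.empty((B, p))
    for m in range(B):
        I = rng.choice(n, size=b, replace=False)     # step 4(a)
        Xb, Yb = standardize(X[I], Y[I])             # step 4(b)
        lam_b = lam_n * np.sqrt(b / n)               # rescaled penalty, 4(c)
        beta_b = Lasso(alpha=lam_b).fit(Xb, Yb).coef_
        L[m] = np.sqrt(r) * (beta_b - beta_n)        # root L_{n,r,m}
    lo = np.quantile(L, alpha / 2, axis=0)           # step 5
    hi = np.quantile(L, 1 - alpha / 2, axis=0)
    return beta_n - hi / np.sqrt(n), beta_n - lo / np.sqrt(n)   # I_2, step 6
```

Note that the rescaling $\lambda_{b,CV} = \lambda_{n,CV}\sqrt{b/n}$ is applied as in step 4(c); under a per-sample penalty parametrization the appropriate rescaling may differ, so this detail is an assumption of the sketch.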

6.1.2 Hypothesis testing

For $j \in \{1, \dots, p\}$, consider the problem of testing the null hypothesis
$$H_{0,j}: \beta_j = 0$$
against the alternative
$$H_{A,j}: \beta_j \ne 0.$$

Definition 6.1.2.1. Let $\alpha \in (0, 1)$. A function $\varphi^{(j)}: \mathbb R^n \to \{0, 1\}$, said to reject $H_{0,j}$ when it takes the value 1 and to accept $H_{0,j}$ when it takes the value 0, is called a test for the hypothesis $H_{0,j}$ at level $\alpha$ if it satisfies
$$E_{\beta_j = 0}\left(\varphi^{(j)}(Z^{(n)})\right) \le \alpha.$$

According to the duality lemma, if $I^{(j)}$ is a confidence interval at level $\alpha$ for $\beta_j$, then a test is given by
$$\varphi^{(j)}(Z^{(n)}) = \begin{cases} 0 & \text{if } 0 \in I^{(j)}(Z^{(n)}), \\ 1 & \text{else.} \end{cases}$$
1 else


Figure 6.1: Subsampling confidence intervals for single scenarios of the models A and A' (n = 250). Red triangles stand for the true parameters.

Hence, for each $j \in \{1, \dots, p\}$ we define the tests $\varphi_k^{(j)}$ corresponding to the intervals $I_k^{(j)}$ ($k = 1, \dots, 4$) defined in the previous section.

Rates of coverage and of false rejection based on 500 replications of each scenario are given in Tables 6.1, 6.2 and 6.3 for the two-sided interval $I_2$ (with corresponding test $\varphi_2$) and a subsample size $b = n^{0.85}$. Results for the other intervals are omitted, but it was noted that the one-sided confidence interval $I_1$ has coverage/false-rejection rates similar to $I_2$, and that $I_3$ and $I_4$ are extremely conservative.

Fluctuations from the nominal level for the interval $I_2$ can be attributed to the stochastic approximation of the true subsample quantile and to the fact that the penalization parameter is chosen by cross-validation for each scenario. Thus the coverage/false-positive rates can be considered correct.

6.1.3 FWER

Figure 6.2: Subsampling confidence intervals for single scenarios of the models B and B' (n = 250). Red triangles stand for the true parameters.

Instead of testing individual hypotheses, one can also be interested in testing a whole family of null hypotheses. The goal is then to control the probability of making at least one false rejection. This is formalized by the family-wise error rate (FWER), defined as follows.

Definition 6.1.3.1. Let $H_1, \dots, H_s$ be a family of hypotheses. The corresponding FWER is defined as the probability of making at least one false rejection among the hypotheses $H_i$, i.e.
$$FWER = P_{H_1, \dots, H_s}\left(\bigcup_{i=1}^s \{H_i \text{ is rejected}\}\right). \qquad (6.1.3.1)$$

Our goal is to test the hypotheses $H_1, \dots, H_s$ while controlling the FWER at a given level $\alpha$; that is,
$$FWER \le \alpha.$$

We focus on the Bonferroni-Holm procedure, a so-called step-down procedure based on p-values for individual hypotheses. We recall the definition of a p-value.


Figure 6.3: Subsampling confidence intervals for single scenarios of the models C and C' (n = 250). Red triangles stand for the true parameters.

Definition 6.1.3.2. Suppose that for a random variable $X$ and a hypothesis $H$, one has a family $\{S_\alpha(X);\ \alpha \in (0, 1)\}$ of rejection regions satisfying
$$S_\alpha \subset S_{\alpha'}$$
whenever $\alpha < \alpha'$. Then a p-value is defined as
$$\hat p = \hat p(X) = \inf\{\alpha : X \in S_\alpha\},$$
that is, the smallest significance level at which one would reject the hypothesis $H$.

Given p-values $\hat p_1, \dots, \hat p_s$ associated with the hypothesis tests for $H_1, \dots, H_s$, the Holm procedure consists in the following steps:

1. Consider the ordered realized p-values $\hat p_{(1)} \le \dots \le \hat p_{(s)}$ and the corresponding hypotheses $H_{(1)}, \dots, H_{(s)}$.

2. If $\hat p_{(1)} \ge \alpha/s$, accept $H_{(1)}, \dots, H_{(s)}$ and stop. If $\hat p_{(1)} < \alpha/s$, reject $H_{(1)}$ and test the remaining $s-1$ hypotheses at level $\alpha/(s-1)$.

3. If $\hat p_{(1)} < \alpha/s$ but $\hat p_{(2)} \ge \alpha/(s-1)$, accept $H_{(2)}, \dots, H_{(s)}$ and stop. If $\hat p_{(1)} < \alpha/s$ and $\hat p_{(2)} < \alpha/(s-1)$, reject $H_{(2)}$ in addition to $H_{(1)}$ and test the remaining $s-2$ hypotheses at level $\alpha/(s-2)$.

One can show that the Bonferroni-Holm procedure controls the FWER (Lehmann and Romano, 2005, Theorem 9.1.2); a minimal sketch of the procedure in code is given below.
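The step-down logic translates directly (our own implementation):

```python
import numpy as np

def holm_rejections(pvals, alpha=0.05):
    """Bonferroni-Holm: step down through the ordered p-values, testing
    p_(k+1) against alpha / (s - k); stop at the first acceptance."""
    pvals = np.asarray(pvals)
    s = len(pvals)
    order = np.argsort(pvals)          # indices of p_(1) <= ... <= p_(s)
    reject = np.zeros(s, dtype=bool)
    for k, idx in enumerate(order):
        if pvals[idx] < alpha / (s - k):
            reject[idx] = True
        else:
            break                      # accept this and all remaining
    return reject
```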

In our case, for an individual hypothesis $H_{0,j}: \beta_j = 0$, we propose the following p-value based on subsampling: for a realized statistic $\sqrt n\,\hat\beta_n^j$ computed on the whole sample, take as p-value the proportion among the $B$ subsamples of values $\sqrt r(\hat\beta_{n,b,i}^j - \hat\beta_n^j)$ lying above or below it, depending on whether it is positive or negative; more precisely:
$$\hat p_j = \begin{cases} B^{-1}\sum_{i=1}^B 1\left\{\sqrt r\big(\hat\beta_{n,b,i}^j - \hat\beta_n^j\big) \ge \sqrt n\,\hat\beta_n^j\right\} & \text{if } \hat\beta_n^j \ge 0, \\ B^{-1}\sum_{i=1}^B 1\left\{\sqrt r\big(\hat\beta_{n,b,i}^j - \hat\beta_n^j\big) \le \sqrt n\,\hat\beta_n^j\right\} & \text{if } \hat\beta_n^j < 0, \end{cases} \qquad (6.1.3.2)$$
where $r = b/(1 - b/n)$ is the finite-sample-corrected subsample size. Similar p-values are considered in Berg, McMurry, and Politis (2010). Here, the distinction between the positive and the negative case is made to take power considerations into account.
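In code, 6.1.3.2 reads as follows (a sketch; `roots` is assumed to be the $B \times p$ array of subsample roots $\sqrt r(\hat\beta_{n,b,i}^j - \hat\beta_n^j)$ from the estimation procedure above):

```python
import numpy as np

def subsampling_pvalues(beta_n, roots, n):
    """p-values per 6.1.3.2: the share of subsample roots beyond
    sqrt(n)*beta_n, on the side determined by the sign of beta_n."""
    stat = np.sqrt(n) * beta_n                       # realized statistics, shape (p,)
    upper = (roots >= stat).mean(axis=0)             # used when beta_n >= 0
    lower = (roots <= stat).mean(axis=0)             # used when beta_n < 0
    return np.where(beta_n >= 0, upper, lower)
```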

Powers and family-wise error rates based on 500 replications are reported in Tables 6.4, 6.5 and 6.6. The procedure controls the FWER while leaving the power unaffected, indicating that the proposed p-values are correct.

6.2 High dimensional setting

To assess the performance of subsampling in a high dimensional setting, we apply it to the adaptive Lasso presented in Chapter 3. We generate four data sets D, D', E and E', which differ in the correlation structure between covariates and in the dimension of the parameter. They result from the linear model
$$Y_i = x_i'\beta + \varepsilon_i, \qquad i = 1, \dots, n,$$
where
$$\beta = (1, -1.25, -1.25, 1, -1.5, 0, \dots, 0)'$$
is a $400 \times 1$ vector in models D and D', and an $800 \times 1$ vector in models E and E'. The errors $\varepsilon_i$ are taken i.i.d. standard normal. $X$ is multivariate normal with mean zero and covariance matrix $C$ given by

• $C_{i,j} = 0.8^{|i-j|}\,1\{i, j \in \{1, \dots, 5\} \text{ or } i, j \in \{6, \dots, 400\}\}$ in model D,
• $C_{i,j} = 0.8^{|i-j|}$, $i, j = 1, \dots, 400$ in model D',
• $C_{i,j} = 0.8^{|i-j|}\,1\{i, j \in \{1, \dots, 5\} \text{ or } i, j \in \{6, \dots, 800\}\}$ in model E,
• $C_{i,j} = 0.8^{|i-j|}$, $i, j = 1, \dots, 800$ in model E'.

Models D and E respect the partial orthogonality condition between relevant and irrelevant parameters, while D' and E' violate it.

Estimation procedure

For a sample $(Y_i, x_i)_{i=1}^n$ of simulated observations, we proceed as follows to construct confidence intervals for the coefficients:

(i) Center and scale the observations, that is, consider
$$\tilde Y_i = Y_i - n^{-1}\sum_{l=1}^n Y_l, \qquad i = 1, \dots, n,$$
$$\tilde x_{ij} = \frac{x_{ij} - n^{-1}\sum_{l=1}^n x_{lj}}{\left(n^{-1}\sum_{l=1}^n\big(x_{lj} - n^{-1}\sum_{k=1}^n x_{kj}\big)^2\right)^{1/2}}, \qquad j = 1, \dots, p_n,\ i = 1, \dots, n.$$

(ii) Set the weights $w_j$ as
$$w_j = n^{-1}\sum_{i=1}^n \tilde Y_i\,\tilde x_{ij}, \qquad j = 1, \dots, p_n.$$

(iii) Compute the Lasso path for the data set with scaled covariates, that is, for $(\tilde Y_i, \mathrm{diag}(w_1, \dots, w_{p_n})\tilde x_i)_{i=1}^n$. Then choose the penalization parameter by K-fold cross-validation (we choose K = 10) based on $(\tilde Y_i, \mathrm{diag}(w_1, \dots, w_{p_n})\tilde x_i)_{i=1}^n$; denote it by $\lambda_{n,CV}$. Let $\tilde\beta_n$ be the Lasso solution corresponding to $(\tilde Y_i, \mathrm{diag}(w_1, \dots, w_{p_n})\tilde x_i)_{i=1}^n$ and to the penalization parameter $\lambda_{n,CV}$. Finally, set the adaptive Lasso solution to
$$\hat\beta_n = \mathrm{diag}(w_1^{-1}, \dots, w_{p_n}^{-1})\,\tilde\beta_n.$$

(iv) Repeat the following steps for $m = 1, \dots, B$:

(a) Generate a random subsample $I_m \subset \{1, \dots, n\}$ of size $b$ by drawing without replacement.

(b) Center and scale the observations with index $i \in I_m$ to obtain a data set $(\tilde Y_i^{(m)}, \tilde x_i^{(m)})_{i\in I_m}$.

(c) Compute the Lasso solution path for the data set $(\tilde Y_i^{(m)}, \mathrm{diag}(w_1, \dots, w_{p_n})\tilde x_i^{(m)})_{i\in I_m}$, with covariates scaled by the weights obtained in step (ii). Let $\tilde\beta_b^{(m)}$ be the Lasso solution corresponding to $(\tilde Y_i^{(m)}, \mathrm{diag}(w_1, \dots, w_{p_n})\tilde x_i^{(m)})_{i\in I_m}$ and to the rescaled penalization parameter $\lambda_{b,CV} = \lambda_{n,CV}(b/n)^{0.4}$.

(d) Set the adaptive Lasso solution to
$$\hat\beta_b^{(m)} = \mathrm{diag}(w_1^{-1}, \dots, w_{p_n}^{-1})\,\tilde\beta_b^{(m)}$$
and then set $L_{n,b,m}$ and $L_{n,r,m}$ to
$$L_{n,b,m} = \sqrt b\,\big(\hat\beta_b^{(m)} - \hat\beta_n\big) \qquad\text{and}\qquad L_{n,r,m} = \sqrt r\,\big(\hat\beta_b^{(m)} - \hat\beta_n\big),$$
respectively. Here, $r = b/(1 - b/n)$ is the finite-sample-corrected subsample size, cf. (Politis et al., 1999, Section 10.3.1).

(v) Determine separately for each $j \in \{1, \dots, p_n\}$ the following empirical quantiles of $L_{n,b,\cdot}^{(j)}$ and $L_{n,r,\cdot}^{(j)}$; that is, with $L_{n,b,(\cdot)}^{(j)}$ and $L_{n,r,(\cdot)}^{(j)}$ being the order statistics, set
$$c_{n,b}^{(j)}(1-\alpha) = L_{n,b,(\lfloor(1-\alpha)\cdot B\rfloor)}^{(j)}, \qquad c_{n,b}^{(j)}(\alpha/2) = L_{n,b,(\lfloor(\alpha/2)\cdot B\rfloor)}^{(j)}, \qquad c_{n,b}^{(j)}(1-\alpha/2) = L_{n,b,(\lfloor(1-\alpha/2)\cdot B\rfloor)}^{(j)},$$
and the analogous quantities for $L_{n,r,(\cdot)}^{(j)}$.

(vi) For each $j \in \{1, \dots, p_n\}$, define the confidence intervals
$$I_1^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n}\, c_{n,r}^{(j)}(1-\alpha),\ \infty\right)$$
$$I_2^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n}\, c_{n,r}^{(j)}(1-\alpha/2),\ \hat\beta_n^{(j)} - \tfrac{1}{\sqrt n}\, c_{n,r}^{(j)}(\alpha/2)\right]$$
$$I_3^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n - \sqrt b}\, c_{n,b}^{(j)}(1-\alpha),\ \infty\right)$$
$$I_4^{(j)} = \left[\hat\beta_n^{(j)} - \tfrac{1}{\sqrt n - \sqrt b}\, c_{n,b}^{(j)}(1-\alpha/2),\ \hat\beta_n^{(j)} - \tfrac{1}{\sqrt n - \sqrt b}\, c_{n,b}^{(j)}(\alpha/2)\right]$$

Remark. Note that the procedure above does not follow the subsampling scheme in the strict sense, since the adaptive Lasso weights are not recomputed on the subsamples; however, we observed that using the weights obtained on the whole sample to compute the subsample estimates actually yields better results.
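A sketch of steps (i) to (iv), under the same tooling assumptions as the low-dimensional sketch. The adaptive step here uses the standard change of variables (multiply column $j$ by $w_j$, fit the Lasso, map the solution back), which puts a penalty proportional to $1/|w_j|$ on $\beta_j$; the thesis likewise recovers its solution by unscaling the Lasso coefficients with the weights.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

def adaptive_lasso_subsampling(X, Y, b, B=500, rng=0):
    """Adaptive Lasso with marginal-regression weights computed once on the
    full sample and reused on every subsample (as in the remark above)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape

    def standardize(Xs, Ys):
        Xc = Xs - Xs.mean(axis=0)
        return Xc / Xc.std(axis=0), Ys - Ys.mean()

    Xt, Yt = standardize(X, Y)
    w = Xt.T @ Yt / n                                # marginal weights, step (ii)

    def adaptive_fit(Xs, Ys, lam):
        gamma = Lasso(alpha=lam).fit(Xs * w, Ys).coef_
        return w * gamma                             # map back to the beta scale

    lam_n = LassoCV(cv=10).fit(Xt * w, Yt).alpha_    # lambda_{n,CV}, step (iii)
    beta_n = adaptive_fit(Xt, Yt, lam_n)

    roots = np.empty((B, p))
    r = b / (1.0 - b / n)
    for m in range(B):
        I = rng.choice(n, size=b, replace=False)
        Xb, Yb = standardize(X[I], Y[I])
        lam_b = lam_n * (b / n) ** 0.4               # rescaling from step (c)
        roots[m] = np.sqrt(r) * (adaptive_fit(Xb, Yb, lam_b) - beta_n)
    return beta_n, roots                             # roots feed steps (v)-(vi)
```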

Coverage rates of the two-sided confidence intervals $I_2^{(\cdot)}$ are illustrated in Figure 6.4. First, we note that the results are quite robust to violation of the partial orthogonality condition. Then we see that the coverage rates for relevant coefficients are slightly below the nominal level and that the false positive rates for zero coefficients are conservative. A possible reason for the former is the bias introduced by the Lasso. The conservative false rejection rates can be explained by the variable selection property, or may indicate that the rate of convergence to the limit is actually slower for zero coefficients, of the order $\sqrt n/\log(p)$ instead of the rate $\sqrt n$ used. However, due to time constraints, these suggestions could not be investigated.

Finally, p-values similar to the ones proposed in 6.1.3.2 were also computed, but they do not allow control of the FWER through a Bonferroni-Holm procedure. We suspect that the variable selection property is again at the source of this problem: some adaptive Lasso estimates take the value zero over almost all subsamples. A possible solution would be to set the p-value to one if a very large proportion of the subsample estimates take the value zero. This would involve the choice of a threshold parameter. Due to time constraints, this direction could not be investigated further.
direction could not be investigated fur<strong>the</strong>r.


            Model A (σ = 1)                   Model A' (σ = 1.5)
n      100   250   500   1000  2000     100   250   500   1000  2000

Rates of coverage
β1     0.95  0.96  0.94  0.93  0.95     0.96  0.95  0.94  0.93  0.95
β2     0.94  0.94  0.92  0.95  0.95     0.95  0.96  0.92  0.95  0.95
β3     0.95  0.95  0.95  0.95  0.93     0.91  0.96  0.96  0.94  0.93
β4     0.96  0.95  0.96  0.95  0.95     0.96  0.95  0.96  0.96  0.95
β5     0.96  0.94  0.96  0.95  0.94     0.95  0.95  0.96  0.95  0.93
β6     0.95  0.94  0.94  0.95  0.95     0.94  0.95  0.97  0.95  0.95

False rejection rates
β7     0.02  0.02  0.01  0.02  0.02     0.02  0.02  0.01  0.02  0.02
β8     0.02  0.02  0.02  0.02  0.02     0.02  0.02  0.02  0.02  0.02
β9     0.01  0.02  0.02  0.03  0.02     0.01  0.02  0.02  0.03  0.02
β10    0.02  0.02  0.02  0.02  0.03     0.02  0.02  0.02  0.02  0.03
β11    0.03  0.02  0.03  0.03  0.03     0.03  0.02  0.03  0.03  0.03
β12    0.04  0.03  0.03  0.02  0.02     0.04  0.03  0.03  0.02  0.02
β13    0.01  0.03  0.02  0.03  0.03     0.01  0.03  0.02  0.03  0.03
β14    0.03  0.04  0.02  0.03  0.02     0.03  0.04  0.02  0.03  0.02
β15    0.03  0.02  0.02  0.03  0.02     0.03  0.02  0.02  0.03  0.02
β16    0.02  0.02  0.03  0.01  0.02     0.02  0.02  0.03  0.01  0.02
β17    0.02  0.02  0.02  0.02  0.02     0.02  0.02  0.02  0.02  0.02
β18    0.02  0.01  0.02  0.02  0.03     0.02  0.01  0.02  0.02  0.03
β19    0.03  0.01  0.03  0.03  0.02     0.03  0.01  0.03  0.03  0.02
β20    0.02  0.02  0.03  0.03  0.02     0.02  0.02  0.03  0.03  0.02

Table 6.1: Models A and A'. Empirical coverage/false positive rates for the two-sided interval $I_2$.


            Model B (σ = 1)                   Model B' (σ = 1.5)
n      100   250   500   1000  2000     100   250   500   1000  2000

Rates of coverage
β1     0.97  0.96  0.94  0.94  0.94     0.97  0.96  0.95  0.94  0.93
β2     0.95  0.93  0.92  0.92  0.94     0.93  0.94  0.91  0.93  0.95
β3     0.90  0.95  0.94  0.93  0.91     0.73  0.92  0.94  0.93  0.91
β4     0.94  0.93  0.95  0.92  0.94     0.92  0.93  0.94  0.94  0.94
β5     0.94  0.94  0.95  0.92  0.91     0.94  0.94  0.94  0.92  0.91
β6     0.94  0.93  0.94  0.94  0.94     0.95  0.94  0.95  0.94  0.93

False rejection rates
β7     0.02  0.02  0.02  0.03  0.02     0.03  0.02  0.02  0.03  0.02
β8     0.02  0.02  0.04  0.03  0.04     0.02  0.02  0.03  0.03  0.04
β9     0.01  0.02  0.03  0.04  0.03     0.02  0.02  0.03  0.04  0.03
β10    0.02  0.03  0.02  0.03  0.03     0.02  0.03  0.02  0.02  0.03
β11    0.01  0.02  0.03  0.05  0.02     0.02  0.02  0.04  0.05  0.02
β12    0.03  0.03  0.04  0.03  0.02     0.03  0.03  0.03  0.03  0.02
β13    0.02  0.03  0.05  0.04  0.03     0.01  0.03  0.04  0.04  0.03
β14    0.03  0.03  0.03  0.04  0.03     0.02  0.04  0.03  0.04  0.03
β15    0.03  0.03  0.04  0.03  0.03     0.02  0.03  0.04  0.03  0.03
β16    0.03  0.02  0.05  0.03  0.03     0.02  0.02  0.03  0.03  0.03
β17    0.02  0.03  0.03  0.03  0.03     0.02  0.03  0.02  0.03  0.03
β18    0.03  0.03  0.03  0.02  0.02     0.03  0.03  0.03  0.02  0.02
β19    0.03  0.03  0.03  0.04  0.03     0.03  0.03  0.03  0.04  0.03
β20    0.02  0.02  0.04  0.04  0.03     0.03  0.02  0.03  0.04  0.03

Table 6.2: Models B and B'. Empirical coverage/false positive rates for the two-sided interval $I_2$.


            Model C (σ = 1)                   Model C' (σ = 1.5)
n      100   250   500   1000  2000     100   250   500   1000  2000

Rates of coverage
β1     0.92  0.94  0.93  0.94  0.96     0.78  0.93  0.94  0.94  0.95
β2     0.84  0.91  0.92  0.92  0.94     0.60  0.86  0.88  0.92  0.94
β3     0.56  0.80  0.90  0.94  0.92     0.45  0.63  0.75  0.88  0.93
β4     0.81  0.91  0.94  0.92  0.92     0.60  0.85  0.91  0.92  0.92
β5     0.80  0.94  0.93  0.91  0.90     0.57  0.86  0.94  0.92  0.90
β6     0.94  0.94  0.95  0.95  0.92     0.85  0.94  0.95  0.93  0.92

False rejection rates
β7     0.02  0.03  0.03  0.03  0.02     0.02  0.03  0.03  0.03  0.02
β8     0.03  0.03  0.03  0.03  0.04     0.03  0.02  0.03  0.03  0.04
β9     0.02  0.02  0.04  0.03  0.03     0.03  0.02  0.03  0.03  0.03
β10    0.03  0.03  0.02  0.03  0.04     0.02  0.03  0.02  0.03  0.04
β11    0.01  0.02  0.03  0.05  0.02     0.01  0.02  0.03  0.05  0.03
β12    0.01  0.02  0.03  0.04  0.02     0.01  0.02  0.03  0.04  0.03
β13    0.02  0.03  0.04  0.04  0.04     0.02  0.03  0.04  0.04  0.03
β14    0.03  0.03  0.03  0.03  0.03     0.03  0.03  0.03  0.03  0.03
β15    0.03  0.02  0.04  0.03  0.03     0.02  0.02  0.03  0.03  0.03
β16    0.02  0.02  0.03  0.04  0.03     0.03  0.02  0.03  0.04  0.03
β17    0.02  0.02  0.02  0.04  0.03     0.02  0.02  0.02  0.04  0.04
β18    0.02  0.03  0.03  0.03  0.02     0.02  0.03  0.03  0.03  0.04
β19    0.03  0.03  0.03  0.03  0.03     0.03  0.03  0.03  0.03  0.02
β20    0.03  0.03  0.03  0.04  0.02     0.02  0.03  0.04  0.04  0.03

Table 6.3: Models C and C'. Empirical coverage/false positive rates for the two-sided interval $I_2$.

Empirical power
            Model A (σ = 1)                   Model A' (σ = 1.5)
n      100   250   500   1000  2000     100   250   500   1000  2000
β1     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β2     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β3     1.00  1.00  1.00  1.00  1.00     0.83  1.00  1.00  1.00  1.00
β4     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β5     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β6     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
FWER   0.036 0.026 0.044 0.056 0.016    0.038 0.026 0.044 0.056 0.016

Table 6.4: Models A and A'. FWER and empirical power for nonzero coefficients.


Empirical power
            Model B (σ = 1)                   Model B' (σ = 1.5)
n      100   250   500   1000  2000     100   250   500   1000  2000
β1     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β2     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β3     0.67  1.00  1.00  1.00  1.00     0.33  1.00  1.00  1.00  1.00
β4     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β5     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
β6     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
FWER   0.036 0.034 0.046 0.044 0.036    0.04  0.034 0.046 0.044 0.036

Table 6.5: Models B and B'. FWER and empirical power for nonzero coefficients.

Empirical power
            Model C (σ = 1)                   Model C' (σ = 1.5)
n      100   250   500   1000  2000     100   250   500   1000  2000
β1     1.00  1.00  1.00  1.00  1.00     0.67  1.00  1.00  1.00  1.00
β2     0.83  1.00  1.00  1.00  1.00     0.00  1.00  1.00  1.00  1.00
β3     0.00  0.67  1.00  1.00  1.00     0.00  0.33  0.83  0.98  1.00
β4     0.83  1.00  1.00  1.00  1.00     0.17  1.00  1.00  1.00  1.00
β5     0.83  1.00  1.00  1.00  1.00     0.33  0.83  1.00  1.00  1.00
β6     1.00  1.00  1.00  1.00  1.00     1.00  1.00  1.00  1.00  1.00
FWER   0.042 0.026 0.046 0.046 0.04     0.04  0.028 0.042 0.046 0.042

Table 6.6: Models C and C'. FWER and empirical power for nonzero coefficients.


Figure 6.4: Coverage rates of the two-sided confidence intervals $I_2$ for the adaptive Lasso in high dimension. Green triangles correspond to relevant variables; black dots correspond to noise variables.




Chapter 7<br />

Concluding remarks<br />

We reviewed <strong>the</strong> general <strong>the</strong>ory <strong>of</strong> weak convergence with a focus on minizers <strong>of</strong> convex<br />

processes and applied results toge<strong>the</strong>r with tools from convex analysis to give a fairly<br />

precise characterization <strong>of</strong> <strong>the</strong> limiting <strong>distribution</strong> <strong>of</strong> <strong>Lasso</strong> components in a low dimensional<br />

setting, following <strong>the</strong> steps <strong>of</strong> Knight and Fu (2000). It was outlined that despite a<br />

discontinuity at <strong>the</strong> point zero in <strong>the</strong> limiting <strong>distribution</strong>s, <strong>the</strong> use <strong>of</strong> subsampling is still<br />

justified to construct confidence intervals if <strong>the</strong> finite population <strong>distribution</strong>s converge to<br />

<strong>the</strong> limit uniformly, as pointed out recently in Romano and Shaikh (2010). We verified<br />

this uniform convergence for an orthogonal design setting only but <strong>the</strong>re are indications<br />

that this property holds in greater generality; this remain an open problem.<br />

In a high dimensional setting, where the use of the Lasso is most justified, the large sample behavior in distribution is more difficult to study. A major hurdle is that a potential limit of the objective function would not be uniquely minimized, which prohibits a direct application of standard techniques from weak convergence theory. This explains why, in this setting, the asymptotics of the Lasso have not yet been addressed in the literature in a satisfactory way. Nevertheless, we investigated the adaptive Lasso under sparsity assumptions in the vein of Huang et al. (2008), since this variant guarantees asymptotic normality with the optimal rate of convergence for estimates corresponding to nonzero coefficients and achieves consistent variable selection, which predicts asymptotically valid subsampling confidence intervals at least for the nonzero coefficients.
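To recall the two-step structure behind these oracle properties, the sketch below implements the generic adaptive Lasso: an initial Lasso fit supplies the weights w_j = 1/|β̂_init,j|, which rescale the penalty in a second fit. The glmnet calls, the penalty.factor argument and the two penalty levels are illustrative assumptions, not the implementation used in this thesis.

library(glmnet)

# Generic two-step adaptive Lasso; lambda1 and lambda2 are placeholders.
adaptive_lasso <- function(x, y, lambda1, lambda2) {
  init <- glmnet(x, y, lambda = lambda1, intercept = FALSE)
  beta_init <- as.vector(coef(init))[-1]       # drop the intercept row
  w <- 1/pmax(abs(beta_init), 1e-8)            # weights 1/|beta_init|, guarded against zeros
  fit <- glmnet(x, y, lambda = lambda2, intercept = FALSE,
                penalty.factor = w)            # coefficient-specific penalty weights
  as.vector(coef(fit))[-1]
}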

Two pictures emerge from the simulation study we conducted. In the low dimensional setting, the validity of subsampling was confirmed: confidence intervals offer satisfying coverage rates, and the proposed subsampled p-values allow the FWER to be controlled through a Bonferroni-Holm procedure. In the high dimensional setting, however, coverage rates for nonzero coefficients lie slightly below the nominal level, which may be due to the bias introduced by the Lasso. The conclusions for zero coefficients are more interesting: while convergence to the nominal level was not achieved, probably owing to variable selection consistency or to a slower rate of convergence of order √n/log(p_n), the rate of false positives is clearly conservative, i.e. it remains below the nominal level. Subsampled p-values, however, do not seem to be valid in the high dimensional setting; this issue remains unsolved.
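The Holm step-down referred to above is straightforward to reproduce: the l-th smallest p-value is compared against alpha/(p - l + 1), and testing stops at the first non-rejection. A minimal illustration in base R, on dummy p-values rather than on any output of this study:

p_raw  <- c(0.001, 0.012, 0.030, 0.200, 0.650)  # dummy p-values, one per coefficient
p_holm <- p.adjust(p_raw, method = "holm")      # Holm-adjusted p-values
reject <- p_holm <= 0.05                        # rejections with the FWER controlled at 5%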





Bibliography

Andersen, P. and R. Gill (1982). Cox's regression model for counting processes: a large sample study. The Annals of Statistics 10(4), 1100–1120.

Berg, A., T. McMurry, and D. Politis (2010). Subsampling p-values. Statistics & Probability Letters 80(17–18), 1358–1364.

Bunea, F., A. Tsybakov, and M. Wegkamp (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics 1, 169–194.

Chatterjee, A. and S. Lahiri (2010). Asymptotic properties of the residual bootstrap for lasso estimators. Proceedings of the American Mathematical Society 138(12), 4497–4509.

Chatterjee, A. and S. Lahiri (2011). Bootstrapping lasso estimators. Journal of the American Statistical Association 106(494), 1–18.

Chow, Y. and H. Teicher (1978). Probability Theory: Independence, Interchangeability, Martingales. Springer.

Dudley, R. (1985). An extended Wichura theorem, definitions of Donsker class, and weighted empirical distributions. Probability in Banach Spaces V, Lecture Notes in Mathematics 1153, 141–178.

Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–499.

Hjort, N. and D. Pollard (1993). Asymptotics for minimisers of convex processes. Technical report, University of Oslo and Yale University.

Huang, J., S. Ma, and C. Zhang (2008). Adaptive lasso for sparse high-dimensional regression models. Statistica Sinica 18(4), 1603–1618.

Kallenberg, O. (2002). Foundations of Modern Probability. Springer.

Knight, K. and W. Fu (2000). Asymptotics for lasso-type estimators. Annals of Statistics 28(5), 1356–1378.

Lehmann, E. and J. Romano (2005). Testing Statistical Hypotheses. Springer.

Massart, P. (1990). The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability 18(3), 1269–1283.

Meinshausen, N. and P. Bühlmann (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34(3), 1436–1462.

Politis, D., J. Romano, and M. Wolf (1999). Subsampling. Springer.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer.

Pollard, D. (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics.

Pollard, D. (1991). Asymptotics for least absolute deviation regression estimators. Econometric Theory 7(2), 186–199.

Rockafellar, R. (1970). Convex Analysis, Volume 28 of Princeton Mathematical Series. Princeton University Press.

Romano, J. and A. Shaikh (2010). On the uniform asymptotic validity of subsampling and the bootstrap. Technical report, University of Chicago.

Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) 58(1), 267–288.

Van de Geer, S. and P. Bühlmann (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics 3, 1360–1392.

Van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.

Van der Vaart, A. and J. Wellner (1996). Weak Convergence and Empirical Processes. Springer.

Zhao, P. and B. Yu (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7, 2541–2563.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429.


Appendix A

R Codes

A.1 Simulation code for the Lasso in a low dimensional setting

library(mnormt);
library(lars);
library(Matrix);

rm(list=ls(all=TRUE));
set.seed(1);

################################
###### Setting parameters ######
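The parameter assignments themselves did not survive in this copy; the block below is a hypothetical reconstruction in which the number of coefficients and the grid of sample sizes follow Tables 6.5 and 6.6, while all remaining values are illustrative assumptions.

###### Hypothetical reconstruction; every value is an assumption ######
p     <- 6;                              # number of coefficients (beta_1 to beta_6)
beta  <- c(3, 1.5, 0.5, 2, 1, 4);        # placeholder true coefficient vector
sigma <- 1;                              # noise standard deviation
nvals <- c(100, 250, 500, 1000, 2000);   # sample sizes as in Tables 6.5 and 6.6
K     <- 500;                            # number of replications (assumed)
B     <- 500;                            # number of subsamples per replication (assumed)
alpha <- 0.05;                           # nominal level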



bias



beta_lasso[k,],+Inf);
intervals[k,2,,]
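Only the tail of the interval construction survives above. A plausible shape for the lost statement, keeping the surviving names (intervals, beta_lasso) and assuming an array L whose slice L[k, , j] holds the subsampled roots √b(β̂_b,j − β̂_n,j) of replication k, is:

# Hypothetical reconstruction; the name L and the exact indexing are assumptions.
q_up <- apply(L[k, , ], 2, quantile, probs = 1 - alpha);          # upper quantile per coefficient
intervals[k, 1, , ] <- cbind(beta_lasso[k, ] - q_up/sqrt(nobs),   # one-sided lower bounds
                             beta_lasso[k, ], +Inf);              # matches the surviving tail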



-(sqrt(nobs)+sqrt(r))*beta_lasso[k,j]) >= 0)}
}

###### Holm procedure ######
index_sorted[k,]



p_value,false_rejection_multiple,
test_multiple,FWER,file="simulation_low.RData");

A.2 Simulation code for the adaptive Lasso in a high dimensional setting

library(mnormt);
library(lars);
library(Matrix);

rm(list=ls(all=TRUE));
set.seed(10);

#################################
###### Setting parameters ######
p



#######################################################
###### Outer loop - Replications of the scenario ######
for(k in 1:K) {
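The first statements inside the loop were truncated; a hypothetical reconstruction of the data-generating step, using rmnorm from the mnormt package loaded above (the design covariance Sigma and the response construction are assumptions), is:

  # Hypothetical reconstruction of the truncated data-generating step;
  # Sigma, beta, sigma and nobs are assumed to come from the lost parameter block.
  x <- rmnorm(nobs, mean = rep(0, p), varcov = Sigma);   # Gaussian design matrix
  y <- x %*% beta + sigma*rnorm(nobs);                   # linear model response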



^(-1)*apply(L[k,,],2,quantile,probs=c(1-alpha)),beta_adaptivelasso[k,],+Inf);
intervals[k,4,,]



while(!(stop_loop == 1) & (l <= p)) {
  if(p_value_sorted[k,l] >= alpha/(p-(l-1))) {  # Holm threshold; the identifier p_value_sorted is a reconstruction
    stop_loop <- 1;
