
UNIVERSITÉ PARIS DESCARTES

Synthesis document presented by Antoine CHAMBAZ in view of obtaining the Habilitation à Diriger des Recherches

Specialty: Applied Mathematics

Estimation and testing of the order of distributions, of variable importance and of causal parameters; biomedical applications

Defended on December 13, 2011 before the jury composed of:

Stéphane Boucheron, Professor (Université Paris Diderot), referee
Fabienne Comte, Professor (Université Paris Descartes)
Randal Douc, Professor (Telecom SudParis), referee
Elisabeth Gassiat, Professor (Université Paris Sud)
Marc Lavielle, Research Director (INRIA)
Stéphane Robin, Research Director (INRA), president
Jean-Christophe Thalabard, Professor (Université Paris Descartes)
Aad van der Vaart, Professor (Vrije Universiteit Amsterdam), referee


Felix qui potuit rerum cognoscere causas!
Virgil, Georgica

In relating what follows I must confess to a certain chronological vagueness. The events themselves I can see in sharp focus, and I want to think they happened that same evening, and there are good reasons to suppose they did. In a narrative sense they present a nice neat package, effect dutifully tripping along at the heels of cause. Perhaps it is the attraction of such simplicity that makes me suspicious, that along with the conviction that real life seldom works this way.
R. Russo, The Risk Pool


Acknowledgments

First of all, I wish to thank Elisabeth Gassiat and Marc Lavielle, who initiated me into the craft of research. You opened unsuspected horizons for me and remain, quite simply, examples to follow.

I am very grateful to Stéphane Boucheron, Randal Douc and Aad van der Vaart for agreeing to evaluate my synthesis document in view of the habilitation à diriger les recherches. It is an honor to obtain the habilitation on the strength of reports by researchers as talented as you!

I am honored that Fabienne Comte, Jean-Christophe Thalabard and Stéphane Robin agreed to sit on my defense jury. Stéphane, you are for me a model of statistical versatility. Jean-Christophe, I delight in and learn from each of our many conversations on statistics, medicine and causality. Fabienne, brilliant and far too modest, I want to thank you here for your unfailing human and scientific generosity. Your support throughout these years, and particularly during the preparation of this habilitation, has been decisive.

Besides Fabienne, I warmly thank my kind colleagues Valentine Genon-Catalot and Catherine Huber (as well as the late Minh-Thu Hoang) for their scientific expertise, called upon many times, and for the trust they showed me by entrusting me with the vacant epidemiology courses, even though I was then a complete novice.

I owe to these courses, in a way, the awakening of a keen curiosity for the applications of statistics to the biomedical sciences and to causality. And hence also, one thing leading to another, the scientific relationship I forged with Mark van der Laan, which has been and remains extremely enriching. I admire, Mark, the unity, depth and power of your statistical vision. Working alongside you, exchanging ideas and building with you is a tremendous privilege.

I would also like to say what a pleasure it is to collaborate with my close colleague Adeline Samson and with Christophe Denis on the preparation of Christophe's doctorate, which promises a bright future.

I am happy to be able to thank most of my close collaborators not yet cited, whether our projects are completed or not, for everything they have brought me: Isabelle Bonan, Jean Bouyer, Cristina Butucea, Michel Chavance, Dominique Choudat, Eric Denion, Isabelle Drouet, Daniel Eugène, Aurélien Garivier, Susan Gruber, Annamaria Guolo, Erwin Idoux, Christophe Magnani, Laurence Meyer, Lee Moore, Christian Néri, Pierre Neuvial, Grégory Nuel, Jean-Claude Pairon, Sherri Rose, Michael Rosenblum, Judith Rousseau, Aurore Schmitt, Wilson Toussile, Pascale Tubert-Bitter, Cristiano Sammy Varin, Pierre-Paul Vidal. Thanks also to Gilles Blanchard, Ivan Gentil and Catherine Matias, friends from the very beginning and accomplished researchers, with whom I have already had the chance to collaborate, or not... yet!

The MAP5 laboratory and the UFR de Mathématiques et Informatique are a very fulfilling work environment. I warmly thank: Sylvain Durand, for having surely steered my habilitation application, and Hermine Biermé, for the advice given along the way; Bernard Ycart, Christine Graffigne and Annie Raoult, enterprising successive directors of the laboratory; the precious Marie-Hélène Gbaguidi, administrator of the laboratory; the ingenious Vincent Delos, Azedine Mani, Maïk Mercuri, Laurent Moineau and Thierry Raedersdorff; all the members of MAP5 and of the UFR, with an emphatic wink to my fellow statisticians of yesterday and today, Avner, Chantal, Elodie, Flora, Hector, Jean-Claude, Jérôme, Marie-Luce, Olivier, Rachid, Yves, and a special fond mention for Servane Gey and Pierre Calka (still MAP5, at heart at least).

A flying thank-you to Norv and Kate Brasch, Re-Cheng and Jonathan Jaffe, Elissa and Alan Kittner, our American friends!

I dedicate this work to my four loves: Julie, Lou, Fausto and Claire.


Table of Contents

Introduction
1 Estimation and testing of the order of a distribution
1.1 Consistency (independent case)
1.1.1 Penalized maximum likelihood estimation
1.1.2 Code length, maximum likelihood and penalized code-length estimation
1.2 Rates of convergence (independent case)
1.2.1 Estimation errors and testing errors for the order of a distribution
1.2.2 Penalized maximum likelihood estimation
1.2.3 Bayesian estimation
1.3 Consistency (dependent case)
1.3.1 Estimation of the order of a change-point field by penalized minimum contrast
1.3.2 Estimation of the order of a hidden Markov chain by penalized maximum likelihood and penalized code length
1.3.3 Estimation of the order of a Markov chain with Markov regime by penalized maximum likelihood and penalized code length
1.4 Application to the study of postural control (1/2)
1.4.1 Introduction
1.4.2 Brief description of the data
1.4.3 A model of postural control
2 Estimation of variable importance (observational setting)
2.1 Asbestos and lung cancer in France
2.1.1 Introduction
2.1.2 Brief description of the data
2.1.3 An ad hoc thresholded regression model
2.1.4 Weighted maximum likelihood estimation
2.1.5 Mean number of years of life lost
2.2 On targeted minimum loss estimation
2.2.1 Introduction
2.2.2 The TMLE principle
2.2.3 TMLE of the excess risk
2.3 Probability of success of an IVF program in France
2.3.1 Introduction
2.3.2 Brief description of the data
2.3.3 TMLE of the probability of success of an IVF program in France
2.4 Non-parametric measure of variable importance
2.4.1 A novel measure
2.4.2 TMLE of the non-parametric variable importance measure
2.4.3 A simulation study inspired by the TCGA data
2.5 Application to the study of postural control (2/2)
2.5.1 Introduction
2.5.2 Classification according to postural control
3 Estimation and testing of variable importance (experimental setting)
3.1 Targeting response-adaptive clinical trials
3.1.1 Targeting the optimal design
3.1.2 Asymptotic study of the maximum likelihood estimator
3.1.3 Asymptotic study of the group-sequential testing procedure
3.2 Targeting covariate-adjusted response-adaptive clinical trials
3.2.1 Statistical formalism and identification of the optimal design
3.2.2 Working model, adaptation strategy and initialization of the estimation
3.2.3 Construction of the TMLE
3.2.4 Asymptotic study of the TMLE
3.2.5 Asymptotic study of the group-sequential testing procedure based on the TMLE
Perspectives
List of publications
Bibliography
Curriculum Vitæ
Articles: references [A1] through [A14]


Chapter 1

Estimation and testing of the order of a distribution

A first part of my work concerns the estimation and testing of the order of a distribution, whose statistical framework is the following.

Consider the model $\mathcal M = \bigcup_{K\in\mathcal K}\mathcal M_K \subset \mathcal M^{NP}$ (the non-parametric model), where the collection $\mathcal K$ of indices is endowed with an order relation $\preceq$ for which the family of models $\{\mathcal M_K : K\in\mathcal K\}$ is possibly nested (i.e., possibly such that $K\preceq K'$ implies $\mathcal M_K\subset\mathcal M_{K'}$). For every $P\in\mathcal M^{NP}$, the order $\Psi(P)$ of $P$ (relative to $\mathcal M$) is the index
\[
\Psi(P) = \min\{K\in\mathcal K : P\in\mathcal M_K\}
\]
(with the convention $\Psi(P)=\infty$ if $P\notin\mathcal M$). Given realizations drawn under $P$, it is a parameter of $P$ that one may wish to estimate, or to test: given $K_0\in\mathcal K$, does "$\Psi(P)\preceq K_0$" hold, or rather "$\Psi(P)\succ K_0$"?

The emblematic and historical examples are the estimation of the order of a time series, of a Markov chain (over a finite alphabet) and of a location mixture. I have not worked on the estimation of the order of a time series, but I owe much to the pioneering results obtained in [3, 40, 42], the first ones I read, which influenced other articles that directly inspired me and that I will cite later. Nor have I worked on the estimation of the order of a Markov chain; this problem is, however, intimately linked to that of estimating the order of a hidden Markov model (HMM), to which I have devoted myself.
The case of location mixtures, finally, is one of the key examples considered in my articles.

Since my doctoral work [A1-A2], I have studied the asymptotic behavior of various estimators of the order of a distribution, in terms of consistency and of rates of convergence [A1-A6], from independent [A2-A3,A5] or dependent [A1,A4-A6] data, following frequentist [A1-A2,A4-A6], Bayesian [A3] and information-theoretic [A4-A6] approaches. I have illustrated my results through various simulation studies [A4-A5], and I have taken advantage of methods for estimating the order of a distribution in a biomedical application to the study of postural control in humans [A5-A6]. Details follow.

1.1 Consistency of estimators of the order of a distribution from independent data [A2,A5]

I gather in this section a set of results obtained either alone [A2] or in collaboration with Aurélien Garivier (Laboratoire Traitement et Communication de l'Information, CNRS and Télécom ParisTech) and Elisabeth Gassiat (Laboratoire de Mathématiques, Université Paris Sud 11) [A5]. The driving principle of [A2] is to express the characteristic under-estimation and over-estimation events as events concerning the empirical measure, so as to take advantage of empirical process theory. The article [A5] relies instead on an approach inspired by information theory.

1.1.1 Penalized maximum likelihood estimation

The results concern collections $\{\mathcal M_K : K\in\mathcal K\}$ of parametric nature: each $\mathcal M_K$ takes the form $\mathcal M_K = \{P_\theta = p_\theta\,d\mu : \theta\in\Theta_K\}$, the collection of metric spaces $\{(\Theta_K,d_K) : K\in\mathcal K\}$ being nested ($K\le K'$ implies $\Theta_K\subset\Theta_{K'}$). Here, one can restrict attention without loss of generality to the case $\mathcal K\subset\mathbb N^*$, the possible a priori knowledge of an upper bound $K_{\max}$ on $\mathcal K$ being liable to simplify the study of consistency considerably. The two following examples fit into this framework.

Example. Location mixture. Let $C\subset\mathbb R$ be a compact set, $S_K=\{\pi\in\mathbb R_+^K : \sum_{k=1}^K\pi_k=1\}$ the $K$-simplex and $\{\phi_m : m\in C\}$ a set of densities such that each $\phi_m\,d\mu$ has mean $m$ (hence the term "location"). For every $K\ge1$, $\Theta_K=C^K\times S_K$ and $\mathcal M_K=\{p_\theta\,d\mu : \theta\in\Theta_K\}$ where $p_\theta=\sum_{k=1}^K\pi_k\phi_{m_k}$ (hence the term "mixture", $p_\theta\,d\mu$ being the marginal law of $X$ relative to the joint law of $(Z,X)$ such that $Z=m_k$ with probability $\pi_k$ and, given $Z$, $X$ has conditional density $\phi_Z$). We will be particularly interested in the cases where $\phi_m(\cdot)=\varphi((\cdot-m)/\sigma)$ is the density (with respect to Lebesgue measure) of the $N(m,\sigma^2)$ law ($\sigma^2$ known) and where $\phi_m$ is the density (with respect to the counting measure) of the Poisson$(m)$ law.

Example. Change-points (independent case). Let $C\subset\mathbb R$ be a compact set and $(\mathcal X,\mathcal B,p\,d\lambda)$ a probability space with $\mathcal X\subset\mathbb R^D$ ($D\ge2$) an open set and $\lambda$ the Lebesgue measure. By [56], there exists a metric $d$ on the set $\mathrm{CP}_b$ of countable Caccioppoli partitions of $\mathcal X$ whose "perimeters" are bounded by $b>0$ such that $(\mathrm{CP}_b,d)$ is a compact metric space. One can easily mark a partition $\tau=\{\tau_j\}_{j\ge1}\in\mathrm{CP}_b$, i.e. associate $m_j\in C$ with each $\tau_j$, and extend the metric $d$ in such a way that the set of marked partitions of $\mathrm{CP}_b$ equipped with this extended metric is a compact space. For every $K\ge1$, we denote by $(\Theta_K,d)$ the compact metric space of marked partitions of $\mathrm{CP}_b$ such that $\int_{\tau_j}p\,d\lambda=0$ except for at most $K$ of the indices $j\ge1$, and $\mathcal M_K=\{p_\theta\,d\mu\,d\lambda : \theta\in\Theta_K\}$ where $p_\theta(x,y)\,d\lambda(x)\,d\mu(y)=\varphi\big((y-\sum_{j\ge1}m_j\,\mathbf 1\{x\in\tau_j\})/\sigma\big)\,p(x)\,dx\,dy$ ($\varphi$ is the density of the $N(0,1)$ law).

The first example corresponds to the situation where one observes a certain trait in a heterogeneous population. It is delicate because the loss of identifiability occurring when one over-estimates the order of the mixture induces, in particular, the singularity of the Fisher information matrix, thus making the recourse to Taylor expansions subtle, to say the least (we will come back to this). The second example (original in the literature devoted to identifying the order of a distribution) is inspired by the theme of variational image segmentation (except that instead of observing the whole image, one only reads it at randomly drawn points). Note that in the second example, the sets $\Theta_K$ are not finite-dimensional.

So let $O_1,\dots,O_n$ be $n$ independent copies of $O\sim P_0=p_0\,d\mu\in\mathcal M^{NP}$, together with a collection $\{\mathcal M_K : K\in\mathcal K\}$ as described above. We denote by $P_n$ the associated empirical measure. We rely on the log-likelihood function $l_n(\theta)=\sum_{i=1}^n\log p_\theta(O_i)$ (any $\theta\in\bigcup_{K\in\mathcal K}\Theta_K$) to construct our first estimator of the order of $P_0$ (called global in [A2], where a local version is also studied):
\[
K_n^{ML} = \min \arg\max_{K\in\mathcal K}\Big\{\sup_{\theta\in\Theta_K} l_n(\theta) - \mathrm{pen}(n,K)\Big\}, \qquad (1.1)
\]
the indispensable penalization term $\mathrm{pen}$ having to satisfy at a minimum the conditions: $\mathrm{pen}(n,\cdot)$ nondecreasing for every $n\ge1$ and, for every $K\in\mathcal K$, $\lim_n\mathrm{pen}(n,K)=\infty$ and $\lim_n n^{-1}\mathrm{pen}(n,K)=0$.

Case of a misspecified model.

Let $\mathrm{KL}$ denote the Kullback-Leibler divergence (defined by $\mathrm{KL}(P|Q)=P\log\frac{dP}{dQ}$ if $P\ll Q$, $\mathrm{KL}(P|Q)=\infty$ otherwise; for $\mathcal M$ a set of laws, we write $\mathrm{KL}(P|\mathcal M)=\inf_{Q\in\mathcal M}\mathrm{KL}(P|Q)$ and $\mathrm{KL}(\mathcal M|Q)=\inf_{P\in\mathcal M}\mathrm{KL}(P|Q)$).

One easily sees that when the model is misspecified (i.e. when $P_0\notin\mathcal M$), consistency (which takes a degenerate form since $\Psi(P_0)=\infty$) follows from the law of large numbers for $P_n$.

Specifically, under very general conditions essentially meant to guarantee that the classes $\mathcal G_K=\{\log p_\theta-\log p_0 : \theta\in\Theta_K\}$ of log-likelihood ratios are $P_0$-Glivenko-Cantelli, if for every $K$


great "regularity" of the log-likelihoods. Concretely, by substituting via "peeling" the bound (valid for every $K'>K\ge\Psi(P_0)$)
\[
\Big(\sup_{\theta\in\Theta_{K'}}\Big|(P_n-P_0)\,\frac{\log p_\theta-\log p_0}{\mathrm{KL}(P_0|P_\theta)^{1/2}}\Big|\Big)^2 \;\ge\; \sup_{\theta\in\Theta_{K'}}P_n\log p_\theta-\sup_{\theta\in\Theta_K}P_n\log p_\theta \;\equiv\; \Delta_n(K,K')
\]
for the more obvious bound $\sup_{\theta\in\Theta_{K'}}|(P_n-P_0)(\log p_\theta-\log p_0)|\ge\Delta_n(K,K')$ (cf. Proposition A.1 in [A2]), one can replace (1.2) by
\[
\limsup_n \frac{\log\log n}{\mathrm{pen}(n,K)} = 0, \qquad (1.3)
\]
provided the classes $\big\{\frac{\log p_\theta-\log p_0}{\mathrm{KL}(P_0|P_\theta)^{1/2}} : \theta\in\Theta_K,\ \mathrm{KL}(P_0|P_\theta)>0\big\}$ of renormalized log-likelihood ratios (rather than the $\mathcal G_K$) are $P_0$-Donsker (cf. Theorem 4 in [A2]). This result applies to the location mixture example (Gaussian case).

1.1.2 Code length, maximum likelihood and penalized code-length estimation

The question of estimating the order of a model is also very relevant in information theory, where it is associated with questions of optimality of coding procedures.

Let $A$ be a finite alphabet and $A^n$ the set of words of $n$ letters from $A$. It turns out that one can generically associate a code on $A^n$ with any probability law on $A^n$ in such a way that the code length of any word is proportional to the logarithm of the probability of that word (one thus speaks of a "coding probability"; this is a consequence of the Kraft-McMillan theorem, cf. [18]).
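The coding-probability correspondence just invoked can be checked numerically: any probability law $Q$ on $A^n$ yields integer code lengths $\lceil-\log_2 Q(w)\rceil$ satisfying Kraft's inequality, so a uniquely decodable code with those lengths exists. A minimal sketch (the alphabet, word length and i.i.d. law `p` are arbitrary illustrative choices):

```python
import itertools
import math

# Any probability law Q on the words A^n induces integer code lengths
# ceil(-log2 Q(w)) that satisfy Kraft's inequality sum_w 2^(-len(w)) <= 1,
# hence (Kraft-McMillan) a uniquely decodable code with these lengths exists.
A = "ab"                   # toy finite alphabet (arbitrary choice)
n = 3                      # word length
p = {"a": 0.7, "b": 0.3}   # i.i.d. coding distribution (arbitrary choice)

def Q(word):
    """Coding probability of a word under the i.i.d. law p."""
    prob = 1.0
    for letter in word:
        prob *= p[letter]
    return prob

words = ["".join(w) for w in itertools.product(A, repeat=n)]
lengths = {w: math.ceil(-math.log2(Q(w))) for w in words}

kraft_sum = sum(2.0 ** (-lengths[w]) for w in words)
assert kraft_sum <= 1.0  # Kraft's inequality holds

# The expected code length is within one bit of the entropy of Q on A^n.
entropy = -sum(Q(w) * math.log2(Q(w)) for w in words)
mean_len = sum(Q(w) * lengths[w] for w in words)
assert entropy <= mean_len < entropy + 1.0
```

This is the sense in which code lengths and log-probabilities are interchangeable in what follows.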
By exploiting inequalities comparing the lengths of two codes based on two coding probabilities (one taking the form of the properly renormalized maximum likelihood, the other the form of a "Krichevsky-Trofimov-type mixture"), Finesso [26] and Liu and Narayan [58] proved the consistency of two estimators of the order of a distribution (including a penalized maximum likelihood estimator) under the assumption that an a priori bound on the order is known, when the distribution is a Markov chain (its order then being its memory parameter) or an HMM (its order then being the cardinality of the state space of the hidden Markov chain) with emissions in a finite alphabet. By exploiting such inequalities more finely, Gassiat and Boucheron [32] managed to dispense with the need for an a priori bound in the HMM case. In their wake, we considered in [A5] the case of an HMM with Gaussian or Poisson emissions (hence an infinite alphabet) and its degenerate form (memory of the Markov chain equal to zero), which coincides with our location mixture example.

Let us thus place ourselves in the framework of our location mixture example. Let $K\in\mathcal K=\mathbb N^*$. Denote by $\delta_K$ the density of the Dirichlet law with parameter $(\frac12,\dots,\frac12)$ on $S_K$ and by $\gamma_\tau$ that of the Gamma$(\tau,\frac12)$ law. Let $\nu_K$ be the law on $\Theta_K=C^K\times S_K$ such that $d\nu_K((m_1,\dots,m_K),\pi)=\prod_{k=1}^K\varphi(m_k/\tau)\,dm_k\times\delta_K(\pi)\,d\pi$ (Gaussian case) or $d\nu_K((m_1,\dots,m_K),\pi)=\prod_{k=1}^K\gamma_\tau(m_k)\,dm_k\times\delta_K(\pi)\,d\pi$ (Poisson case).
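To fix ideas, the mixture-type integral $\int_{\Theta_K}\prod_{i=1}^n p_\theta(O_i)\,d\nu_K(\theta)$ underlying the forthcoming code-length statistic $\mathrm{lc}_n(K)$ can at least be approximated by naive Monte Carlo in the Gaussian case, drawing the means from a centered Gaussian prior and the weights from the Dirichlet$(\frac12,\dots,\frac12)$ prior. This is only a sketch (exact computability is precisely what motivates the choice of $\nu_K$); the values of `tau`, `sigma` and the simulated data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma = 3.0, 1.0  # illustrative prior scale and emission sd (assumptions)

def log_lik(obs, means, weights):
    """log prod_i p_theta(O_i) for the Gaussian location mixture."""
    z = (obs[:, None] - means[None, :]) / sigma
    comp = np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))
    return np.log(comp @ weights).sum()

def lc_n(obs, K, draws=2000):
    """Naive Monte Carlo approximation of
    lc_n(K) = -log integral of prod_i p_theta(O_i) over nu_K,
    with means drawn from N(0, tau^2) and weights from Dirichlet(1/2,...,1/2)."""
    lls = np.array([
        log_lik(obs, rng.normal(0.0, tau, size=K),
                rng.dirichlet(np.full(K, 0.5)))
        for _ in range(draws)
    ])
    m = lls.max()  # log-mean-exp, for numerical stability
    return -(m + np.log(np.exp(lls - m).mean()))

# Data from a well-separated two-component mixture: the code length of the
# one-component model should be markedly larger than that of the true order.
obs = np.concatenate([rng.normal(-2.0, sigma, 50), rng.normal(2.0, sigma, 50)])
print([round(lc_n(obs, K)) for K in (1, 2, 3)])
```

The Monte Carlo estimate is noisy, but with well-separated components the gap between the one-component and two-component code lengths dwarfs that noise.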
One associates with $\nu_K$ the code-length statistic
\[
\mathrm{lc}_n(K) = -\log\int_{\Theta_K}\prod_{i=1}^n p_\theta(O_i)\,d\nu_K(\theta), \qquad (1.4)
\]
so called because the argument of the logarithm can be seen as a certain coding probability (of "Krichevsky-Trofimov mixture" type), evaluated at $(O_1,\dots,O_n)$. The choice of $\nu_K$ is pragmatic: it makes it possible to compute the exact value of $\mathrm{lc}_n(K)$! The emission alphabet being infinite, it is not possible to devise a coding probability based on the maximum likelihood (because it cannot be properly renormalized), and the statistic $-\sup_{\theta\in\Theta_K}l_n(\theta)$ cannot be interpreted as a code-length statistic (up to an additive constant) evaluated at $(O_1,\dots,O_n)$. One can nevertheless always compare $\mathrm{lc}_n(K)$ and $-\sup_{\theta\in\Theta_K}l_n(\theta)$, through a sandwich bound of the form (cf. Theorem 2 in [A5])
\[
0 \le \mathrm{lc}_n(K) - \Big(-\sup_{\theta\in\Theta_K}l_n(\theta)\Big) \le \tfrac12\dim(\Theta_K)\log n + K R_n + r_{Kn}, \qquad (1.5)
\]
where $R_n$ is a random term ($R_n=\max_{i\le n}O_i^2/2\tau^2$ in the Gaussian case and $R_n=\tau\max_{i\le n}O_i$ in the Poisson case) and $r_{Kn}$ is a deterministic term (known explicitly) that is negligible relative to the two preceding terms.

Two features distinguish (1.5) from its counterparts in [26, 58, 32], the price to pay for the infiniteness of the emission alphabet: first, the absence of a strictly positive lower bound (due in particular to the fact that the maximum likelihood has not been renormalized into a coding law); second, the appearance of the random term $R_n$ in the upper bound. On the other hand, one does recover the familiar first, BIC-type term, with the particularity that it is the dominant term in the Poisson case ($R_n=O_P(\log n/(\log\log n)^{1/2})$, cf. Lemma 4 in [A5]) but not in the Gaussian case ($R_n=O_P(\log n)$, cf. Lemma 3 in [A5]).

Finally, consider the second estimator of the order of $P_0$, based on the code-length statistic $\mathrm{lc}_n(K)$:
\[
K_n^{lc} = \min \arg\min_{K\in\mathcal K}\{\mathrm{lc}_n(K) + \mathrm{pen}(n,K)\} \qquad (1.6)
\]
for a penalization term $\mathrm{pen}$ to be calibrated. The estimator $K_n^{lc}$ is of MDL type (for "minimum description length", the model selection principle introduced by Rissanen [66]).

One manages to prove that, even if no a priori bound on $\Psi(P_0)$ is known, asymptotically $K_n^{ML}=K_n^{lc}=\Psi(P_0)$ $P_0$-almost surely as soon as
\[
\mathrm{pen}(n,K) = \frac12\sum_{k=1}^K(\alpha+\dim(\Theta_k))\log n + S_{Kn} + s_{Kn}
\]
for any $\alpha>2$, the terms $S_{Kn}$ and $s_{Kn}$ being dedicated to the respective controls of $R_n$ in (1.5) ($S_{Kn}$ is thus $O(\log n)$ in the Gaussian case and $O(\log n/(\log\log n)^{1/2})$ in the Poisson case) and of $r_{Kn}$ (cf. Theorems 5 and 6 in [A5]).

The proof rests essentially on the law of large numbers for $P_n$ (which makes it possible to show that one does not under-estimate $\Psi(P_0)$ asymptotically) and on (1.5) combined with a change of probability and the Borel-Cantelli lemma (to show that one does not over-estimate $\Psi(P_0)$ asymptotically). The penalty, heavy, is reminiscent of that of the BIC criterion, but in a cumulative form (with, in addition, the control term $S_{Kn}$). It is legitimate to wonder whether recourse to a penalization is really necessary in the definition of $K_n^{lc}$: after all, $K_n^{lc}$ also falls within the Bayesian paradigm, and should therefore benefit from the self-penalization effect that Bayesian estimators enjoy [46]. For instance, we mentioned earlier how the BIC criterion follows from a Laplace expansion of the integrated likelihood: this would read here $\mathrm{lc}_n(K)=-\sup_{\theta\in\Theta_K}l_n(\theta)+\frac12\dim(\Theta_K)\log n+O_P(1)$ if the model were "regular" (in the sense of [77], which it is not!). In the same vein, I proved with Judith Rousseau (cf. [A3] and Section 1.2.3) that one can efficiently estimate $\Psi(P_0)$ in the location mixture framework (Gaussian case)


8 Estimation et test <strong>de</strong> l’ordre d’une loien se fondant sur <strong>de</strong>s comparaisons <strong>de</strong> vraisemblances marginales sans les pénaliser. Il existetoutefois <strong>de</strong>s exemples où une version non pénalisée <strong>de</strong>Kn lc est inconsistante alors queKn ML estconsistant [19].Remarque.L’obtention par Aurélien Garivier, Elisabeth Gassiat et moi-même du résultat <strong>de</strong>consistance pourKn ML dans le cadre du modèle <strong>de</strong> mélange en l’absence d’une borne a priorisur Ψ(P 0 ) était une jolie réussite, malgré la forme certainement sous-optimale <strong>de</strong> la pénalitéassociée. On sait <strong>de</strong>puis le tour <strong>de</strong> force réalisé récemment par Gassiat and Van Han<strong>de</strong>l [33]qu’une pénalité <strong>de</strong> la forme pen(n,K) =Kω(n) avec log logn =o(ω(n)) (on retrouvenotre condition (1.3)) suffit à garantir la consistance <strong>de</strong>K MLn (dans le cas gaussien) enl’absence d’une borne a priori sur Ψ(P 0 ). Par ailleurs, cette forme est optimale au sens où lechoix pen(n,K) =δK log logn pourδ> 0 assez petit conduit àK MLn ≠ Ψ(P 0 ) infinimentsouventP 0 -presque sûrement (cf [33, Proposition 4.4] ; <strong>de</strong> surcroît, leur résultat s’étend aucas où l’ensembleC n’est pas compact, cf [33, Proposition 4.5]).1.2 Vitesses <strong>de</strong> convergence d’estimateurs <strong>de</strong> l’ordre d’une loi àpartir <strong>de</strong> données indépendantes [A2,A3]Je rassemble dans cette section un ensemble <strong>de</strong> résultats obtenus soit seul [A2] soit en collaborationavec Judith Rousseau (Laboratoire CEREMADE, Université Paris Dauphine) [A3]. 
Recall that the driving principle of [A2] is to express the characteristic events of under-estimation and over-estimation in terms of events involving the empirical measure, so as to take advantage of empirical process theory. The article [A3] relies instead on a Bayesian approach and nonparametric Bayesian techniques.

1.2.1 Estimation errors and testing errors for the order of a distribution

Let K_max be an a priori bound such that Ψ(P_0) ∈ K = {1, ..., K_max}, and let K_0 ∈ K \ {K_max}. We wish to test "Ψ(P_0) ≤ K_0" (null hypothesis) against "Ψ(P_0) > K_0", or equivalently "P_0 ∈ M_{K_0}" against "P_0 ∉ M_{K_0}". If K_n is an estimator of Ψ(P_0) built from P_n, then it is natural to decide to reject the null hypothesis when K_n > K_0. Writing α_n and β_n for the type I and type II errors of such a test, one easily sees that α_n ≤ P_0(K_n > K_0) and β_n ≤ P_0(K_n ≤ K_0). Three questions arise naturally:

– Do there exist e_u > 0 and e_o > 0 such that lim sup_n n^{−1} log P_0(K_n < Ψ(P_0)) ≤ −e_u and lim sup_n n^{−1} log P_0(K_n > Ψ(P_0)) ≤ −e_o?
– If so, can e_u and e_o be made arbitrarily large?
– If not, what happens at a sub-exponential rate?

Sections 1.2.2 and 1.2.3 present a collection of answers to these questions, drawn from [A2] and [A3].

The study of testing procedures for the order of a distribution in terms of convergence rates has attracted fewer works than the study of consistency properties. Without claiming exhaustiveness, let us cite (chronologically) [41, exponential models], [25, Markov models with finite state space], [20, 21, mixture models and ARMA processes], [36, models characterized by the existence of a sufficient statistic], [49, regular models], [32, HMMs with emissions in a finite alphabet] and [12, autoregressive processes].

Note that [A2] and [32, 12] adopt a similar approach to address the problem in different frameworks. In particular, one finds there three partial answers to the first two questions above, all three obtained as corollaries of Stein's lemma [4, Theorem 2.1]. Curiously, these answers were new in the i.i.d. framework of [A2] (cf. Lemma 3 and Theorem 6 in [A2]): for any estimator K_n of Ψ(P_0), if lim sup_n P_0(K_n < Ψ(P_0)) < 1 and lim sup_n P_0(K_n > Ψ(P_0)) < 1, then

lim_n (1/n) log P_0(K_n > Ψ(P_0)) = 0,    (1.7)
lim inf_n (1/n) log P_0(K_n < Ψ(P_0)) ≥ −KL(M_{Ψ(P_0)−1} | P_0).    (1.8)

Thus, if K_n has under- and over-estimation probabilities bounded away from 0 and 1, then (1.7) teaches us that the over-estimation rate cannot be exponential in n, while (1.8) teaches us that the under-estimation rate can be exponential in n, moreover singling out an optimal exponent (the right-hand side of (1.8)).

1.2.2 Penalized maximum likelihood estimation

Under-estimation rate.

The most striking result of [A2] concerns the under-estimation rate of K_n^ML (a comparable result for the local version of K_n^ML is obtained as well).
It follows from exhibiting the fact that the under-estimation event pertains to the large deviations of P_n.

More specifically, under a set of mild conditions including notably the assumption that the log-likelihoods admit one (rather than every) exponential moment relative to P_0, as well as an assumption of "existence of finite sieves" type (cf. Theorem 7, condition (ii), in [A2]), it turns out that there exists a constant c > 0 such that

lim sup_n (1/n) log P_0(K_n^ML < Ψ(P_0)) ≤ −c,    (1.9)

thereby showing that the under-estimation rate does decay exponentially in n (cf. Theorem 7 in [A2]). This result applies to the location mixture example (Gaussian case) as well as to the change-point example.

This theorem benefited from a nice coincidence: just as I was undertaking the study of the large deviations of P_n, a result due to Léonard and Najim [55] appeared which turned out to be literally tailor-made to conclude (cf. the justification provided in Remark 2 of [A2])!

Unsurprisingly, the constant c is expressed as the infimum of the rate function of the large deviation principle over a certain set. Under a new set of assumptions, the most restrictive of which requires the models M_K to be exponential (our location mixture and change-point examples are thus excluded), it proved possible to show that c = KL(M_{Ψ(P_0)−1} | P_0), that is, that the under-estimation rate of K_n^ML is exponential in n and optimal (in view of (1.8); cf. Theorem 8 in [A2]).

The proof of this optimality result is geometric in nature, in the sense of the KL divergence, with one peculiarity: the KL projections are said to be reversed because they are taken with respect to the second argument of KL rather than the first (the most classical case). One can therefore not a priori invoke the kind of "Pythagorean inequality" enjoyed by classical KL projections. Fortunately, reversed KL projections also satisfy a Pythagorean inequality for exponential models, as proved in the spirit of [16] (cf. Lemma 5 in [A2]).

Over-estimation rate.

Whereas the under-estimation event pertains to the large deviations of P_n, the over-estimation event is intrinsically linked to the moderate deviations of P_n.

Specifically, provided that the class G_{K_max} is P_0-Donsker and has an envelope function admitting an exponential moment relative to P_0, if for instance pen(n, K) = D(K) ω(n) with D increasing and ω(n) = n^{1−δ} (for δ ∈ ]0, 1/2[), then

lim sup_n (1/n^{1−2δ}) log P_0(K_n^ML > Ψ(P_0)) < 0,    (1.10)

so that the over-estimation probability decays at a sub-exponential rate in n. A comparable result holds provided that the class of renormalized log-likelihoods {(log p_θ − log p_0)/KL(P_0 | P_θ)^{1/2} : θ ∈ Θ_{K_max}, KL(P_0 | P_θ) > 0} (rather than G_{K_max}) is P_0-Donsker and has an envelope function admitting an exponential moment relative to P_0.
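To fix ideas, the estimator K_n^ML whose error rates are studied here can be mimicked numerically. The sketch below is a crude stand-in, not the procedure of [A2]: a basic EM fit of Gaussian location mixtures with unit variance, quantile-based initialization, and the illustrative penalty pen(n, K) = K log n (any admissible ω(n) with log log n = o(ω(n)) would serve the same illustrative purpose).

```python
import math, random

random.seed(0)
# Sample from a 3-component Gaussian location mixture (unit variance).
data = ([random.gauss(-5, 1) for _ in range(200)]
        + [random.gauss(0, 1) for _ in range(200)]
        + [random.gauss(5, 1) for _ in range(200)])

def em_loglik(x, K, iters=100):
    # Crude EM for a K-component location mixture with unit variance,
    # initialized at empirical quantiles; returns the fitted log-likelihood.
    xs, n = sorted(x), len(x)
    mu = [xs[(2 * k + 1) * n // (2 * K)] for k in range(K)]
    w = [1.0 / K] * K
    for _ in range(iters):
        resp = []
        for xi in x:
            d = [w[k] * math.exp(-0.5 * (xi - mu[k]) ** 2) for k in range(K)]
            s = sum(d)
            resp.append([dk / s for dk in d])
        nk = [sum(r[k] for r in resp) for k in range(K)]
        w = [nk[k] / n for k in range(K)]
        mu = [sum(r[k] * xi for r, xi in zip(resp, x)) / nk[k] for k in range(K)]
    return sum(math.log(sum(w[k] * math.exp(-0.5 * (xi - mu[k]) ** 2)
                            / math.sqrt(2 * math.pi) for k in range(K)))
               for xi in x)

n = len(data)
scores = {K: em_loglik(data, K) - K * math.log(n) for K in range(1, 6)}
K_ML = max(scores, key=scores.get)
print(K_ML)
```

On such well-separated components the penalized criterion peaks at the true order; under-estimation costs hundreds of log-likelihood units, while over-fitting gains far less than the extra log n penalty.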
This is rather restrictive, and this second result does not apply to the location mixture example.

1.2.3 Bayesian estimation

The Bayesian literature devoted to estimation in mixture models, and thus in particular to the selection of the number of components (i.e., of the order of the distribution), is vast. On the other hand, when Judith Rousseau and I undertook [A3], there was, to our knowledge, no work specifically dedicated to obtaining frequentist convergence rate properties for Bayesian estimators of the order of a distribution in a non-regular framework. Admittedly, Ishwaran et al. [45] study a Bayesian estimator of the mixing distribution (the distribution Σ_{k=1}^K π_k Dirac_{m_k} in the notation of our location mixture example) when the number of components is unknown (and bounded a priori), but estimating the order of the distribution from the estimator of the mixing distribution would without any doubt be suboptimal insofar as the latter converges at the modest rate n^{−1/4}.

Let therefore Π be a prior distribution on ∪_{k=1}^{K_max} Θ_k (the Θ_k being parametric) satisfying dΠ(θ) = π(k) π_k(θ) dθ (for all θ ∈ Θ_k and k ≤ K_max). We write Π(k | P_n) for the resulting posterior distribution on K, from which we build our Bayesian estimator of Ψ(P_0):

K_n^B = min{k ≤ K_max : Π(k | P_n) ≥ Π(k + 1 | P_n)}

(this estimator is said to be local; we also study a global version in [A3]). As discussed earlier, K_n^B enjoys a form of self-penalization, and it is therefore not necessary to introduce a counterpart of the penalties used in the definitions of K_n^ML and K_n^lc.

Under-estimation rate.

Adopting a nonparametric Bayesian technique in the spirit of the seminal article [34], one obtains under classical assumptions that there exist two (explicitly known) constants c, c′ > 0 such that, for all n ≥ 1,

(1/n) log P_0(K_n^B < Ψ(P_0)) ≤ −c + (log c′)/n,    (1.11)

thereby showing that the under-estimation rate of K_n^B does decay exponentially in n (cf. Theorem 1 in [A3]). This result applies notably to the location mixture example (Gaussian case, even allowing a different variance for each state of the mixture, cf. Corollary 1 in [A3]), as well as to a temporal version of the change-point example (where X would equal R and the partitions would be determined by a finite number of change points).
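The local estimator K_n^B amounts to scanning the posterior over orders for its first local mode. A minimal sketch follows; the log-marginal likelihood values are hypothetical, chosen only to exercise the definition (with conjugate priors they would come from exact integration).

```python
import math

def bayes_order(log_marginals, prior):
    # Local Bayesian order estimator: smallest k (1-based) whose posterior
    # mass is at least that of k + 1; returns the estimate and the posterior.
    logpost = [math.log(p) + lm for p, lm in zip(prior, log_marginals)]
    m = max(logpost)
    unnorm = [math.exp(lp - m) for lp in logpost]  # stabilized exponentiation
    Z = sum(unnorm)
    post = [u / Z for u in unnorm]
    for k in range(len(post) - 1):
        if post[k] >= post[k + 1]:
            return k + 1, post
    return len(post), post

# Hypothetical values of log ∫ exp(l_n(θ)) dν_k(θ) for k = 1, ..., 4:
# the order-2 model dominates; no explicit penalty term is needed.
lm = [-1450.0, -1210.5, -1213.6, -1216.9]
K_B, post = bayes_order(lm, prior=[0.25] * 4)
print(K_B)  # → 2
```

The self-penalization alluded to above is visible here: the larger models lose posterior mass through their marginal likelihoods alone.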
Note that the constant c in (1.11) is smaller than the optimal constant deduced from Stein's lemma, which appears in the right-hand side of (1.8).

The set of assumptions underlying (1.11) notably requires that the functions (log p_θ − log p_0) admit one (rather than every) exponential moment relative to P_0, at least for θ in a δ-neighborhood S_δ^k of the θ_0^k such that p_{θ_0^k} is the density of the (reversed) KL projection of P_0 onto M_k, and that these neighborhoods S_δ^k be charged by the prior distribution π_k. The proof relies on the construction of tests of neighborhoods (cf. Proposition B.1 in [A3]) and on a chaining argument, both inspired by [34].

Over-estimation rate.

Still in the nonparametric Bayesian spirit, one obtains under another set of assumptions that there exist three constants c_1, c_2, c_3 > 0 (whose nature we discuss below) such that, for all n ≥ 3,

(1/log n) log P_0(K_n^B > Ψ(P_0)) ≤ −c_1 + c_2 (log log n)/(log n) + c_3/(log n)    (1.12)

(cf. Theorem 3 in [A3]). This result should be compared with (1.10).

Among the assumptions required to obtain (1.12) is a condition on how the prior mass of δ_n-neighborhoods S_{δ_n}^{Ψ(P_0)+1} decreases as δ_n → 0 in Θ_{Ψ(P_0)+1}. This condition involves a notion of effective dimension D_1(Ψ(P_0)+1) of Θ_{Ψ(P_0)+1} relative to Θ_{Ψ(P_0)}. It may differ from the vector-space dimension, with for instance D_1(Ψ(P_0)+1) = dim(Θ_{Ψ(P_0)+1}) − 1 = dim(Θ_{Ψ(P_0)}) + 1 for the location mixture model (Gaussian case) and D_1(Ψ(P_0)+1) = dim(Θ_{Ψ(P_0)+1}) + Ψ(P_0) − 2 = dim(Θ_{Ψ(P_0)}) + Ψ(P_0) for the change-point model. A second condition substitutes for the requirement that a Laplace expansion of the integrated likelihood exist. It introduces another notion of effective dimension, D_2(Ψ(P_0)), of Θ_{Ψ(P_0)}, which may coincide with the vector-space dimension (as is the case for the location mixture model, Gaussian case) or not (as is the case for the change-point model). The constants c_2 and c_3 depend explicitly (and almost exclusively) on the dimensions D_1(Ψ(P_0)+1) and D_2(Ψ(P_0)). Finally, the last assumption truly worth mentioning is expressed in terms of entropy control of rings of the form S_{2(j+1)δ_n}^{Ψ(P_0)+1} \ S_{2jδ_n}^{Ψ(P_0)+1}.

The result (1.12) applies notably to the location mixture example (Gaussian case, again allowing a different variance for each state of the mixture) as well as to the temporal version of the change-point example mentioned earlier. In the location mixture example, one thus obtains P_0(K_n^B > Ψ(P_0)) = O((log n)^{3Ψ(P_0)}/n^{1/2}) (cf. Theorem 4 in [A3]). Checking the assumptions is delicate. It involves in particular the study of the geometric structure of the mixture, which we tackled with the help of the concept of local conic parametrization originally introduced and exploited in [21].

1.3 Consistency of estimators of the order of a distribution from dependent data [A1,A4-A6]

This section gathers a collection of results obtained either alone [A1], or in collaboration with Catherine Matias (Laboratoire Statistique et Génome, CNRS et Université d'Evry Val d'Essonne) [A4], or with Aurélien Garivier (Laboratoire Traitement et Communication de l'Information, CNRS et Télécom ParisTech) and Elisabeth Gassiat (Laboratoire de Mathématiques, Université Paris Sud 11) [A5].
The articles [A5,A4] rely on an approach inspired by information theory. The article [A1] implements an ad hoc approach pertaining to M-estimation.

1.3.1 Estimation of the order of a change-point random field by penalized minimum contrast

Change-point detection in a random field.

The change-point detection problem covers a broad spectrum of topics [7, 13, 15] unified by a common framework: the observation of a random process whose distribution is globally heterogeneous but locally homogeneous. The most studied case is undoubtedly that where the random process is a time series (or several time series observed simultaneously), with very diverse biomedical applications including, for instance, the search for regions of gain or loss in DNA copy number (cf. Section 2.4) or the study of postural maintenance (cf. Section 1.4). In [A1] (my first published work), the random process of interest is a random field indexed by R^D with values in R^q, which is only partially observed at randomly drawn points. The article [A1] nevertheless relies on adapting the techniques developed by Lavielle [51] and Lavielle and Moulines [52] to detect change points in a continuously monitored time series (D = q = 1).

The statistical framework of [A1] can be summarized succinctly as follows.

Let (Y_x)_{x∈X} be a field of (possibly dependent) random variables. We observe this field pointwise at (X_i)_{i≥1}, a sequence of i.i.d. random variables drawn independently of the field, whence (Y_i ≡ Y_{X_i})_{i≥1}. For every K ∈ K = {1, ..., K_max}, let M′_K = {P_θ : θ ∈ Θ_K} be a set of candidate distributions for the sequence of observations (O_i ≡ (X_i, Y_i))_{i≥1} such that:
– the common marginal distribution of the X_i under P_θ depends neither on θ nor on K;
– each θ = {(τ_k, ϑ_k)}_{k≤K} ∈ Θ_K consists of a partition τ = {τ_k}_{k≤K} (of cardinality K) of X, marked by a family {ϑ_k}_{k≤K} of pairwise distinct finite-dimensional parameters;
– under P_θ, the conditional distribution of Y_i given X_i depends on ϑ_k if and only if (iff) X_i ∈ τ_k.


Let M_K = ∪_{k≤K} M′_k (the family of models {M_K : K ≤ K_max} is thus nested). Assuming that the true distribution P_0 of (O_i)_{i≥1} satisfies P_0 ∈ M = ∪_{K≤K_max} M_K (in other words, that the model M is well specified), the statistical goal is to estimate the order of the distribution P_0 and the marked partition θ_0 = {(τ_k^0, ϑ_k^0)}_{k≤Ψ(P_0)} ∈ Θ_{Ψ(P_0)} such that P_0 = P_{θ_0}.

To this end, we assume we have at our disposal an ad hoc contrast function (ϑ, ϑ′) ↦ w(ϑ, ϑ′) such that:
– w(ϑ, ϑ′) ≥ 0, with equality iff ϑ = ϑ′;
– there exist ψ_1, ψ_2 (continuously differentiable) and ξ (guaranteeing that ξ(Y_{X_1}) and each ξ(Y_x) admit a first-order moment) such that the decomposition w(ϑ_k^0, ϑ) = ψ_1(ϑ) + ψ_2(ϑ)^⊤ E(ξ(Y_x) | X = x) holds for every k ≤ Ψ(P_0), x ∈ τ_k^0 and parameter ϑ.

Obviously, the contrast function depends on the nature of the changes of interest: for instance, changes in mean (cf. the example below), or in mean and variance (a case also considered in [A1]).

Example. Change points (dependent case). Let C ⊂ R be a compact set and (X, B, P_X) a probability space with X ⊂ R^D (D ≥ 2). Let F_0 ⊂ B be a fixed set and F the set of all finite unions (and of all intersections) of elements (of pairs of elements) of F_0. We write P_{F_0} for the set of finite partitions τ = {τ_j}_{j∈J} of X such that each τ_j is a finite union of pairwise disjoint elements τ_{jl} ∈ F_0 with min_l P_X(τ_{jl}) ≥ δ > 0. In particular, the partitions consist of at most K_max elements. We assume that δ (or a lower bound on δ) is known. A partition τ = {τ_j}_{j∈J} ∈ P_{F_0} is easily marked, i.e., one associates some m_j ∈ C with each τ_j. For every K ≥ 1, we write (Θ_K, d) for the space, equipped with a pseudo-distance d, of marked partitions from P_{F_0} such that (a) card(J) = K and (b) the marks m_j are pairwise distinct. Finally, there exists a strictly stationary field (Y′_x)_{x∈X} of centered real-valued random variables with common variance such that, for every K ≥ 1, M′_K = {P_θ : θ ∈ Θ_K} is the set of distributions of O_1 = (X_1, Y_1), ..., O_n = (X_n, Y_n), ... such that, under P_θ, X_1, ..., X_n, ... are i.i.d. with common distribution P_X and independent of (Y′_x)_{x∈X} and, for every i ≥ 1,

Y_i = Σ_{j=1}^K m_j 1{X_i ∈ τ_j} + Y′_{X_i}.

The pseudo-distance d is characterized in two steps: for Θ_K ∋ θ = {(τ_j, m_j) : j ≤ K} and Θ_{K′} ∋ θ′ = {(τ′_{j′}, m′_{j′}) : j′ ≤ K′}, d(θ, θ′) = d_1(θ, θ′) + d_2(θ, θ′), the first (resp. second) term corresponding to the comparison of the partitions (resp. marks):

d_1(θ, θ′) = max_{j′≤K′} min_{κ⊂{1,...,K}} P_X((∪_{j∈κ} τ_j) △ τ′_{j′}),
d_2(θ, θ′) = max_{j′≤K′} max_{j∈κ_{j′}} ‖m_j − m′_{j′}‖_2,

where A △ B is the symmetric difference between A and B, and κ_{j′} denotes the smallest subset of {1, ..., K} achieving the minimum for each j′ ≤ K′ in the definition of d_1(θ, θ′). This directly generalizes the natural pseudo-distance in the case D = 1.

The ad hoc contrast function is characterized here by the functions ψ_1, ψ_2 and ξ such that ψ_1(m) = m², ψ_2(m) = −2m, ξ(y) = y.


Estimation of the order of the random field by penalized minimum contrast.

The statistical estimation procedure implemented in [A1] pertains to M-estimation: for a penalty of the form pen(n, K) = K ω(n) to be calibrated, the estimator of (Ψ(P_0), θ_0 = {(τ_j^0, ϑ_j^0)}_{j≤Ψ(P_0)}) is defined by

(K_n^w, θ_n^w) = arg min_{K≤K_max, θ∈Θ_K} { Σ_{i=1}^n Σ_{k=1}^K (ψ_1(m_k) + ψ_2(m_k)^⊤ ξ(Y_i)) 1{X_i ∈ τ_k} + pen(n, K) }

(implicitly above, the index θ reads θ = {(τ_k, m_k) : k ≤ K}).

The study of the consistency and of the convergence rate of θ_n^w when Ψ(P_0) is assumed known (cf. Theorems 5.2 and 5.4 in [A1]) makes it possible to determine how to calibrate ω(n) so as to extend the results to the case where Ψ(P_0) is estimated as well (cf. Theorem 6.1 in [A1]). One thus obtains the (weak) consistency of (K_n^w, θ_n^w) under a set of assumptions satisfied in the change-point example. More specifically, it turns out in the change-point example that

lim_n P_0(K_n^w = Ψ(P_0) and d(θ_n^w, θ_0) ≤ ε) = 1 for every ε > 0

as soon as n^{h/2} = o(ω(n)) for a constant h ∈ ]1, 2[ introduced in the assumptions, which, interestingly, does not depend on the dependence structure of the underlying field. In view of the consistency results already presented in Section 1.1, the penalization (polynomial in n) is very strong.
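In the classical continuously observed case D = q = 1 (the setting of Lavielle and Moulines recalled above), the penalized minimum-contrast estimator specializes to penalized least-squares segmentation, computable exactly by dynamic programming. The following sketch is illustrative only: it is not the random-field procedure of [A1], and the choice ω(n) = n^{h/2} with h = 3/2 is just one admissible calibration.

```python
import math, random

random.seed(1)
# Noisy piecewise-constant signal on a line: two true segments, means 0 and 3.
y = ([random.gauss(0, 1) for _ in range(100)]
     + [random.gauss(3, 1) for _ in range(100)])
n = len(y)

# Prefix sums make the one-mean squared-error cost of y[i:j] an O(1) query.
S1, S2 = [0.0], [0.0]
for v in y:
    S1.append(S1[-1] + v)
    S2.append(S2[-1] + v * v)

def sse(i, j):
    s = S1[j] - S1[i]
    return S2[j] - S2[i] - s * s / (j - i)

# Dynamic programming: C[K][j] = best contrast for y[0:j] cut into K segments.
INF = float("inf")
K_max = 5
C = [[INF] * (n + 1) for _ in range(K_max + 1)]
C[0][0] = 0.0
for K in range(1, K_max + 1):
    for j in range(K, n + 1):
        C[K][j] = min(C[K - 1][i] + sse(i, j) for i in range(K - 1, j))

h = 1.5  # a value in ]1, 2[, as in the assumptions
scores = {K: C[K][n] + K * n ** (h / 2) for K in range(1, K_max + 1)}
K_w = min(scores, key=scores.get)
print(K_w)
```

The polynomial penalty (about 53 per segment here) dwarfs the O(log n) penalties of Section 1.1, which makes the strength of the penalization concrete.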
We will see in Sections 1.3.2 and 1.3.3 that it is still strong in view of the consistency results that I obtained in a Markovian dependence framework. Perhaps this is the price to pay for the weakness of the conditions imposed on the form of dependence at play in the underlying field.

The proofs developed in [A1] essentially rely, on the one hand, on a quantification of the concentration of P_{n,X} (the empirical measure of the X_i) around P_{0,X} (its true counterpart) and, on the other hand, on the control of maximal fluctuations of the form sup_{G∈G} ‖Σ_n(G)‖_∞ for Σ_n(B) = Σ_{i=1}^n (ξ(Y_i) − E(ξ(Y_i) | X_i)) 1{X_i ∈ B} (any B ∈ B). The quantification of the concentration of P_{n,X} follows easily from classical arguments (symmetrization, Hoeffding's inequality) when one assumes that the partitions are composed of elements belonging to a Vapnik–Červonenkis (VC) class of finite VC dimension (cf. Proposition 3.3 in [A1]). As for the control of the maximal fluctuations, the point is to guarantee that there exist C_1 > 0 and h ∈ ]1, 2[ such that, for every ε > 0 and B ∈ B,

P(sup_{F∈F} ‖Σ_n(F ∩ B)‖_∞ ≥ ε | X_1, ..., X_n) ≤ (C_1/ε²) (Σ_{i=1}^n 1{X_i ∈ B})^h    (1.13)

P_{0,X}-almost surely. Is this condition reasonable? When the function ξ is real-valued (which holds in the change-point example; the norms ‖·‖_∞ are then replaced by absolute values), (1.13) is satisfied as soon as there exist C_2 > 0 and h ∈ [1, 2) such that, for every p > 2 and B ∈ B,

E(|Σ_n(B)|^p | X_1, ..., X_n) ≤ C_2^p p^{p/2} (Σ_{i=1}^n 1{X_i ∈ B})^{hp/2}    (1.14)

P_{0,X}-almost surely. Inequality (1.14), of relaxed "Marcinkiewicz–Zygmund" type (because of the power h in the upper bound), and the proof that it implies (1.13) (cf. Proposition 7.3 in [A1]) are largely inspired by the work carried out by Dedecker in [24]. Thanks to (1.14), one can exhibit concrete examples where (1.13) is satisfied. Take for instance X = Z^D a regular lattice and (Z_x = ξ(Y_x) − E(ξ(Y_x)))_{x∈X} centered, bounded and strictly stationary: if (Z_x)_{x∈X} is m-dependent (morally, if for every x ∈ X, Z_x is independent of any family consisting of finitely many Z_{x′} at distance at least m from x in X), then (1.13) is satisfied (cf. Proposition 7.5 in [A1], as well as Propositions 7.4 and 7.6 for other examples).

1.3.2 Estimation of the order of a hidden Markov chain by penalized maximum likelihood and penalized code length

In Section 1.1.2, I presented only part of the results obtained in [A5] in collaboration with Aurélien Garivier and Elisabeth Gassiat, focusing on those related to the estimation of the order of a location mixture from independent data. We also carried out (in parallel, so to speak) the study of the problem of estimating the order of a distribution when the latter is an HMM with Gaussian or Poisson emissions (cf. [14] for the reference monograph on hidden Markov models):

Example. Location mixture by HMM. Take again the objects C, S_K, φ_m from the location mixture example (independent case) as presented in Section 1.1.1. Let π_0 ∈ S_K be arbitrarily fixed.
For every K ≥ 1, set Θ_K = C^K × ∏_{k=1}^K S_K and M_K = {P_θ : θ ∈ Θ_K}: here P_θ, for θ = (m_1, ..., m_K, π_1, ..., π_K) ∈ Θ_K, is the distribution of the sequence (O_i)_{i≥1} with values in R (Gaussian case) or N (Poisson case), obtained by projection of the joint distribution P_θ of the sequences ((O_i)_{i≥1}, (Z_i)_{i≥0}) characterized by the following decomposition of the likelihood under P_θ at time n:

P_θ(Z_0, Z_1, O_1, ..., Z_n, O_n) = π_{0,Z_0} × ∏_{i=0}^{n−1} π_{Z_i,Z_{i+1}} × ∏_{i=1}^n φ_{m_{Z_i}}(O_i)

(thus: (Z_i)_{i≥0} is a Markov chain with initial distribution and transitions parametrized by π_0, π_1, ..., π_K; (O_1, ..., O_n) are conditionally independent given (Z_0, ..., Z_n); given (Z_0, ..., Z_n), O_i has conditional density φ_{m_{Z_i}}).

The example presented above falls squarely within the category of mixture models, insofar as the marginal distribution of each observation O_i is a mixture as described in the location mixture example of Section 1.1.1. It differs from the independent mixture model in that the mixture state at a given time i (namely Z_i) depends on the sequence of previous states (namely Z_0, ..., Z_{i−1}, in the sense of a Markov chain).

Let us therefore place ourselves within the framework of the location-mixture-by-HMM example and take up the notation of Section 1.1.2: K = N*, δ_K is the density of the Dirichlet distribution with parameter (1/2, ..., 1/2), and γ_τ is the density of the Gamma distribution with parameter (τ, 1/2).

We are interested in the consistency of the maximum likelihood estimator K_n^ML of the order Ψ(P_0) of the distribution P_0, built from the observation of (O_1, ..., O_n) ∼ P_0 ∈ M = ∪_{K∈K} M_K. The estimator K_n^ML still reads as in (1.1), with a penalty term to be calibrated and a log-likelihood l_n(θ) under θ = (m_1, ..., m_K, π_1, ..., π_K) ∈ Θ_K that satisfies

l_n(θ) = log Σ_{z_0,...,z_n∈{1,...,K}} π_{0,z_0} × ∏_{i=0}^{n−1} π_{z_i,z_{i+1}} × ∏_{i=1}^n φ_{m_{z_i}}(O_i).
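The sum over the K^{n+1} hidden paths in l_n(θ) need not be enumerated: the standard forward recursion computes it in O(n K²) operations. A minimal sketch in the Gaussian-emission case (unit variance; the parameter values below are arbitrary illustrations):

```python
import math

def hmm_loglik(obs, pi0, trans, means):
    # Forward recursion: computes l_n(theta) in O(n K^2) instead of summing
    # over the K**(n+1) hidden paths (Gaussian emissions, unit variance).
    K = len(pi0)
    def emis(k, o):
        return math.exp(-0.5 * (o - means[k]) ** 2) / math.sqrt(2 * math.pi)
    alpha = pi0[:]  # filtering weights, initialized at the law of Z_0
    ll = 0.0
    for o in obs:
        # Propagate through the transition matrix, then weight by emissions.
        alpha = [sum(alpha[j] * trans[j][k] for j in range(K)) for k in range(K)]
        alpha = [alpha[k] * emis(k, o) for k in range(K)]
        c = sum(alpha)
        ll += math.log(c)               # accumulate log-normalizers
        alpha = [a / c for a in alpha]  # rescale for numerical stability
    return ll

# Arbitrary illustrative parameters with K = 2 hidden states.
pi0 = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
means = [0.0, 4.0]
print(hmm_loglik([0.1, 0.2, 4.1, 3.8], pi0, trans, means))
```

On short sequences the result agrees with the brute-force sum over all hidden paths, which is what the recursion factorizes.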


The principle underlying [A5] is, as we have already seen and discussed, the introduction, for every K ≥ 1, of a code-length statistic associated with a prior distribution ν_K on Θ_K:

lc_n(K) = −log ∫_{Θ_K} e^{l_n(θ)} dν_K(θ)    (1.15)

(the definition is very similar to (1.4)), and its comparison with the maximum log-likelihood statistic. Specifically, let, for every K ≥ 1, ν_K be the distribution on Θ_K characterized by dν_K(m_1, ..., m_K, π_1, ..., π_K) = ∏_{k=1}^K ϕ(m_k/τ) dm_k × ∏_{k=1}^K δ_K(π_k) dπ_k (in the Gaussian case) or dν_K(m_1, ..., m_K, π_1, ..., π_K) = ∏_{k=1}^K γ_τ(m_k) dm_k × ∏_{k=1}^K δ_K(π_k) dπ_k (in the Poisson case). This (pragmatic) choice, which allows the exact computation of the value of lc_n(K), yields upper and lower bounds on the difference between the code-length and maximum log-likelihood statistics that take exactly the form (1.5) (cf. Theorem 1 in [A5]; the deterministic term r_{Kn} does not have the same form as in the independent case of the location mixture model).
The code-length statistic can also be used for the purpose of estimating the order Ψ(P_0) of the distribution P_0: one thus introduces the estimator K_n^lc as in (1.6), for a penalty term to be calibrated.

Finally, we prove that even if no a priori bound on Ψ(P_0) is known, one has asymptotically K_n^ML = K_n^lc = Ψ(P_0), P_0-almost surely, as soon as

pen(n, K) = (1/2) Σ_{k=1}^K (α + dim(Θ_k)) log n + S_{Kn} + s_{Kn}

for any α > 2, the terms S_{Kn} and s_{Kn} being dedicated to the respective controls of R_n in (1.5) (S_{Kn} is thus O(log n) in the Gaussian case and O(log n/(log log n)^{1/2}) in the Poisson case) and of r_{Kn} (cf. Theorems 5 and 6 in [A5]).

A simulation study (which we do not summarize here; cf. Section 4 of [A5]) illustrates the full set of theoretical results obtained in [A5].

Remark. On the over-estimation rate (HMM example). In the course of the proof, one obtains the following control of the over-estimation rate associated with K_n = K_n^ML or K_n = K_n^lc: there exist two constants c_1, c_2 > 0 such that, for n large enough,

(1/log n) log P_0(K_n > Ψ(P_0)) ≤ −min(α/2, c_1) + c_2/log n,    (1.16)

with c_1 = 3/2 in the Gaussian case and c_1 = 2 in the Poisson case (c_2 also varies with the case), where α is the parameter appearing in the definition of the penalty. These inequalities are comparable to (1.10) and (1.12).

1.3.3 Estimation of the order of a Markov chain with Markov regime by penalized maximum likelihood and penalized code length

I also devoted myself, with Catherine Matias, to the question of estimating the order of a Markov chain with Markov regime [A4]. The study of this example was motivated by applicative considerations.


Biological motivations.

Biological sequences (e.g. DNA or protein sequences) are nowadays produced routinely (thanks to the progress and democratization of the various sequencing techniques) to be exploited in a variety of biological analyses. Hidden Markov models have played a prominent role in modeling and analyzing the composition and evolution of these sequences, viewed as sequences of random variables with values in a finite alphabet (the alphabet of the four nucleotides adenine, thymine, cytosine and guanine in the case of DNA sequences; the alphabet of the twenty-two proteinogenic amino acids in the case of protein sequences). Thus, hmm modeling of heterogeneous sequences (cf. the location mixture example of Section 1.3.2, with emissions following a law on the alphabet of interest rather than on $\mathbb{R}$ or $\mathbb{N}$) has been rather widely used. In this framework the hidden states are interpreted as regimes (e.g., in the DNA sequencing example, coding versus non-coding region). Since the sojourn time in a given regime is geometrically distributed, such a modeling is all the more appropriate when the sequence is known to present a succession of homogeneous zones.
One must nonetheless keep in mind that the variables of the observed sequence are then i.i.d. conditionally on the regimes: this constraint imposed by hmm modeling is judged too restrictive in some cases (e.g., in the DNA sequencing example, the coding regions of DNA are known to be organized in codons, i.e. successions of three nucleotides; in that case, a simple solution could consist in substituting for the alphabet of the four nucleotides that of all nucleotide triplets).

For this reason in particular, the community has turned to models of Markov chains with Markov regime (cmrm). Also called autoregressive models with Markov regime, they were originally introduced in econometrics [37]. In this class of models (which we define below), the hidden states still form a Markov chain, but conditionally on them the observations form a second Markov chain (whose transitions depend on the regime). Thus, Nicolas et al.
[62] entreprennent l’étu<strong>de</strong> <strong>de</strong> l’hétérogénéité<strong>de</strong> l’unique chromosome <strong>de</strong> Bacillus subtilis à l’ai<strong>de</strong> d’une cmrm choisie suivant <strong>de</strong>s critèresessentiellement biologiques afin <strong>de</strong> détecter <strong>de</strong>s segments atypiques <strong>de</strong> longueur approximative25kb (1kb correspond à 1, 000 nucléoti<strong>de</strong>s) le long du chromosome <strong>de</strong> longueur totale 4, 200kb.C’est <strong>de</strong> la volonté <strong>de</strong> mettre au point un argument statistique <strong>de</strong> sélection <strong>de</strong> modèle pour appuyerces critères biologiques <strong>de</strong> choix <strong>de</strong> modèle qu’est né [A4].Notion d’ordre.Commençons par définir rigoureusement une cmrm (on utilise la notationx j i pour représentertoute suitex i ,...,x j ;S K représente toujours leK-simplexe) :Exemple. Chaîne <strong>de</strong> Markov à régime markovien (cmrm). SoitAun alphabet <strong>de</strong> cardinalfini (et connu) notér. Quel que soit (K,M)∈N ∗ ×N, posons Θ K,M = ∏ Kk=1 S K × ∏ Kr Mk=1 S retM K,M ={P θ :θ∈Θ K,M } : ici,P θ pourθ=(π 1 ,...,π K , (π zo M )1 z≤K,o M1 ∈AM ) est laloi <strong>de</strong> la suite (O i ) i≥1 à valeurs dansAdéduite par projection <strong>de</strong> la loi jointe P θ <strong>de</strong>s suites((O i ) i≥1 , (Z i ) i≥1 ) caractérisée par la décomposition suivante <strong>de</strong> la vraisemblance sous P θau tempsn(les conventions <strong>de</strong> notation d’usage s’appliquent sin≤M ouM = 0) :n−1P θ (Z 1 ,O 1 ,...,Z n ,O n ) =µ θ (Z 1 ,O1 M ∏)× π Zi ,Z i+1×i=1n∏i=M+1π Zi O i−1i−M ,O i


where $\mu_\theta$ is the initial law on $\{1, \ldots, K\} \times A^M$ that makes $\mathbf{P}_\theta$ a stationary law (thus: $(Z_i)_{i \ge 1}$ is a Markov chain (of memory one) on $\{1, \ldots, K\}$ whose transitions are parameterized by $\pi_1, \ldots, \pi_K$; conditionally on $(Z_i)_{i \ge 1}$, $(O_i)_{i \ge 1}$ is a Markov chain on $A$ of memory $M$, the conditional law of $O_j$ given $((Z_i)_{i \ge 1}, O_1^{j-1})$ being characterized for every $j > M$ by $\pi_{Z_j O_{j-M}^{j-1}}$; the two chains are initialized so that $\mathbf{P}_\theta$ is stationary).

Note that the log-likelihood $l_n(\theta)$ under $\theta = (\pi_1, \ldots, \pi_K, (\pi_{z o_1^M})_{1 \le z \le K,\, o_1^M \in A^M}) \in \Theta_{K,M}$ writes
$$l_n(\theta) = \log \sum_{z_1, \ldots, z_n \in \{1, \ldots, K\}} \mu_\theta(z_1, O_1^M) \times \prod_{i=1}^{n-1} \pi_{z_i, z_{i+1}} \times \prod_{i=M+1}^{n} \pi_{z_i O_{i-M}^{i-1},\, O_i}.$$

Unlike the other examples introduced so far, the example of Markov chains with Markov regime does not come with a naturally nested family of models (which justifies a posteriori the phrase "possibly nested" used in the introduction of this chapter): certainly $\mathcal{M}_{K,M} \subset \mathcal{M}_{K,M+1}$ and $\mathcal{M}_{K,M} \subset \mathcal{M}_{K+1,M}$ for every $(K,M) \in \mathbb{N}^* \times \mathbb{N}$, but no order relation on $\mathbb{N}^* \times \mathbb{N}$ can make the family $\{\mathcal{M}_{K,M} : (K,M) \in \mathbb{N}^* \times \mathbb{N}\}$ nested. Moreover, given $P_0 \in \mathcal{M} = \cup_{(K,M) \in \mathbb{N}^* \times \mathbb{N}}\, \mathcal{M}_{K,M}$, there is in general no unique pair $(K_0, M_0)$ such that $P_0 \in \mathcal{M}_{K_0,M_0}$ and $P_0$ belongs to none of the $\mathcal{M}_{K,M} \subsetneq \mathcal{M}_{K_0,M_0}$ (we say that $P_0$ "admits several minimal representations"). The very definition of a notion of order of the law $P_0$ is thereby threatened...
The problem of identifying the order of an ARMA($p,q$) process may at first sight seem to suffer from the same flaw, its structural parameter $(p,q)$ being notably also bivariate. Yet every ARMA($p,q$) process admits a unique minimal representation ARMA($p_0,q_0$), the pair $(p_0,q_0)$ thus naturally playing the role of the order of that law (consistency properties for a whole range of procedures estimating this order have already been obtained [39, 65, 21]).

We propose to circumvent the problem in the spirit of the mdl principle (already mentioned in Section 1.1.2): if $P_0 \in \mathcal{M}$ admits several minimal representations, then the order of the law $P_0$ is defined as the index of the minimal representation associated with the model of smallest dimension. Specifically, for every $P_0 \in \mathcal{M}$ we define
$$\Psi(P_0) = \min\{(K,M) \in (\mathbb{N}^* \times \mathbb{N}, \preceq) : P_0 \in \mathcal{M}_{K,M}\},$$
the order relation being such that $(K_1,M_1) \prec (K_2,M_2)$ if and only if $\dim(\Theta_{K_1,M_1}) < \dim(\Theta_{K_2,M_2})$.
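The definitions above can be illustrated with a small sketch. Counting free parameters of the simplices in $\Theta_{K,M}$ gives $\dim(\Theta_{K,M}) = K(K-1) + K r^M (r-1)$, the quantity that induces the order relation $\preceq$; the simulator below draws a cmrm with emission memory $M = 1$. All parameter values are illustrative, and the initial regime and symbol are drawn uniformly for simplicity (rather than from the stationary initialization used in the text):

```python
# Hedged sketch: dimension of Theta_{K,M} and simulation of a small cmrm.
import random

def dim_theta(K, M, r):
    # K rows of the regime transition matrix (each in S_K: K-1 free
    # parameters) plus K * r^M emission rows in S_r (r-1 free each).
    return K * (K - 1) + K * (r ** M) * (r - 1)

def simulate_cmrm(n, pi_regime, pi_emit, alphabet, seed=0):
    """pi_regime[z][z']: regime transitions; pi_emit[z][o][o']: emission
    transitions given regime z and previous symbol o (memory M = 1)."""
    rng = random.Random(seed)
    K = len(pi_regime)
    z = rng.randrange(K)                  # uniform initialization (simplification)
    o = rng.randrange(len(alphabet))
    obs = [alphabet[o]]
    for _ in range(n - 1):
        z = rng.choices(range(K), weights=pi_regime[z])[0]
        o = rng.choices(range(len(alphabet)), weights=pi_emit[z][o])[0]
        obs.append(alphabet[o])
    return obs

# Two sticky regimes over the nucleotide alphabet (illustrative values).
A = ["a", "c", "g", "t"]
pi_regime = [[0.95, 0.05], [0.05, 0.95]]
pi_emit = [[[0.7, 0.1, 0.1, 0.1]] * 4, [[0.1, 0.1, 0.1, 0.7]] * 4]
seq = simulate_cmrm(200, pi_regime, pi_emit, A)
```

For instance $\dim(\Theta_{3,0}) = 15 < \dim(\Theta_{2,1}) = 26 < \dim(\Theta_{3,2}) = 150$ when $r = 4$, so $(3,0) \prec (2,1) \prec (3,2)$ under the relation above.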


The estimator $K^{\mathrm{lc}}_n$ is of penalized code length type, for the code length
$$\mathrm{lc}_n(K,M) = -\log \int_{\Theta_{K,M}} e^{l_n(\theta)}\, d\nu_{K,M}(\theta)$$
associated with the prior laws $\nu_{K,M}$ on $\Theta_{K,M}$ (every $(K,M) \in \mathbb{N}^* \times \mathbb{N}$) such that
$$d\nu_{K,M}(\pi_1, \ldots, \pi_K, (\pi_{z o_1^M})_{1 \le z \le K,\, o_1^M \in A^M}) = \prod_{k=1}^{K} \delta_K(\pi_k)\, d\pi_k \times \prod_{z \le K,\, o_1^M \in A^M} \delta_r(\pi_{z o_1^M})\, d\pi_{z o_1^M}$$
(recall that $\delta_K$ is the density of the Dirichlet law with parameter $(\frac{1}{2}, \ldots, \frac{1}{2})$ on the $K$-simplex).

This choice is interesting because it again makes it possible to compare the maximum log-likelihood and code-length statistics, through a set of inequalities of the form
$$0 \le \mathrm{lc}_n(K,M) - \Bigl(-\sup_{\theta \in \Theta_{K,M}} l_n(\theta)\Bigr) \le \frac{1}{2} \dim(\Theta_{K,M}) \log n + r_{KMn}$$
for some residual term $r_{KMn}$ (cf. Lemma 3.4 in [A4] and (1.5)).

Finally, we show that even without an a priori bound on $\Psi(P_0)$, asymptotically $K^{\mathrm{ML}}_n = K^{\mathrm{lc}}_n = \Psi(P_0)$ $P_0$-almost surely as soon as
$$\mathrm{pen}(n,(K,M)) = \frac{1}{2} \sum_{(k,m) \preceq (K,M)} (\alpha f(k,m) + \dim(\Theta_{k,m})) \log n + s_{KMn}$$
for any $\alpha > 2$, the function $f : (\mathbb{N}^* \times \mathbb{N}, \preceq) \to \mathbb{N}$ having as its only constraint to be strictly increasing, and the term $s_{KMn}$ being devoted to controlling the residual terms $r_{KMn}$ appearing above (cf. Theorem 3.1 in [A4]).

Remark. On the over-estimation rate (cmrm example).
The proof yields the following control of the over-estimation rate associated with $K_n = K^{\mathrm{ML}}_n$ or $K_n = K^{\mathrm{lc}}_n$: there exists a constant $c > 0$ such that, for $n$ large enough,
$$\frac{1}{\log n} \log P_0(K_n > \Psi(P_0)) \le -\frac{\alpha}{2} + \frac{c}{\log n}.$$
This inequality is comparable to (1.10), (1.12) and (1.16).

Simulation study.

A simulation study (whose design is inspired by [62]) illustrates the results we have just presented; we give only a brief summary here (cf. Section 4 of [A4] for details). The various log-likelihood maxima are approximated via the EM algorithm. The study suggests that the asymptotic regime of $K^{\mathrm{ML}}_n$ is reached for sequence lengths greater than or equal to 50,000. Substituting a genuine bic-type penalty (rather than the cumulated bic type chosen above) seems to guarantee that the asymptotic regime is reached for chain lengths greater than or equal to 25,000 (the results are, on the other hand, very poor for a chain length of 15,000).

To conclude, Nicolas et al. [62] choose (on the basis of essentially biological considerations) a cmrm of order $(K,M) = (3,2)$; we obtain $K^{\mathrm{ML}}_n = (3,0)$ (a cmrm of order $(3,0)$ is in fact an hmm); substituting a genuine bic-type penalty leads to an estimated order of $(2,1)$: hard to form an opinion!


1.4 Application to the study of postural control (1/2) [A6]

The study of postural control is one of my biomedical application topics. I owe my interest in it to Pierre-Paul Vidal. In this section I present a first set of results, obtained in collaboration with Isabelle Bonan (Service de Médecine Physique et de réadaptation, CHU Rennes) and Pierre-Paul Vidal (Laboratoire Contrôle de la Sensorimotricité : du Neurone au Muscle, Université Paris Descartes) [A6]. The theoretical approach, very similar to the one deployed in [A5, A4], draws on information theory. The second set of results is presented in Section 2.5.

1.4.1 Introduction

Biomedical background.

An individual's postural control relies on three types of information encoded by the visual, vestibular (located in the inner ear) and proprioceptive systems (the latter composed of the sensory receptors located near the bones, joints and muscles, sensitive to the stimulations produced by body movements). The way the central nervous system processes this information varies with each individual's sensorimotor experience.
It is shaped in particular by age, by the sports and professions practiced, and no doubt also by genetic factors. One of the fundamental elements of postural control lies in the central nervous system's capacity to seize the most relevant sensory information at a given instant in order to react to a given situation. In this context, one can understand that each individual tends, as a result of their sensorimotor experience, to develop a preference for one particular type of sensory information, which is thus solicited predominantly. Visual preference is probably the most frequent, and in any case the best described. It can be observed in healthy subjects, but it is especially common among the elderly, in people suffering from Parkinson's disease, or following a stroke or a vestibular pathology. Now, while such a systematic selection of one perceptual mode allows an individual to move efficiently in their usual environment, it is clearly ill-suited to responding to new or unexpected situations, not to say dangerous. Such a mode of functioning is therefore more likely to lead to a fall, whose consequences are often dramatic beyond the age of sixty, since falls and their sequelae cause the death of nearly ten thousand people per year. Faced with this problem, the rehabilitation specialist tries to identify, on clinical and empirical grounds, a "sensory preference" of the patient referred to them for postural control disorders.
Il s’efforce ensuite <strong>de</strong> les corriger en amenant le sujet àprendre en compte la totalité <strong>de</strong> ses afférences sensorielles pour réguler sa posture. La “préférencesensorielle”, un concept encore flou, n’est donc pas encore validée par <strong>de</strong>s mesures quantitativeset donc son éventuelle modification par la rééducation non plus.Notre objectif est <strong>de</strong> contribuer à l’élaboration d’une notion <strong>de</strong> style postural, et <strong>de</strong> mettre aupoint <strong>de</strong>s techniques d’i<strong>de</strong>ntification du style postural d’un individu.Un aperçu <strong>de</strong> l’état <strong>de</strong> l’art.L’étu<strong>de</strong> du contrôle postural est l’objet d’un grand nombre <strong>de</strong> publications. On peut trèsschématiquement les répartir en <strong>de</strong>ux catégories, selon qu’il y est question <strong>de</strong> la modélisation duprocessus d’intégration <strong>de</strong>s informations (voir le récent [50] et ses références) ou bien <strong>de</strong> l’analyse


Application à l’étu<strong>de</strong> du maintien postural (1/2) 21statistique <strong>de</strong>s déplacements d’un sujet (voir [63] pour une étu<strong>de</strong> statistique par régression, pionnièredu genre, <strong>de</strong> la démarche – et non <strong>de</strong> la posture). Notre approche s’inscrit dans ce secondgroupe.Toujours schématiquement, les données ici analysées statistiquement sont obtenues sous laforme d’enregistrements <strong>de</strong>s petits déplacements d’un individu se tenant <strong>de</strong>bout campé sur ses<strong>de</strong>ux pieds et soumis à diverses perturbations. Chaque type <strong>de</strong> stimulation vise à explorer un <strong>de</strong>strois systèmes d’acquisition <strong>de</strong> l’information :– la fermeture <strong>de</strong>s yeux permet <strong>de</strong> s’intéresser au rôle du système visuel (voir par exemple [59]et ses références) ;– les stimulations galvanique et opto-cinétique permettent d’étudier le système vestibulaire(voir par exemple [29, 57] et leurs références) ;– la stimulation vibratoire permet <strong>de</strong> se pencher sur le rôle du système proprioceptif (voir parexemple [29, 30] et ses références).La quantification du contrôle postural est obtenue via une plateforme <strong>de</strong> force. Cet appareil,semblable à un pèse-personne, évalue les positions successives du point <strong>de</strong> pression maximal exercépar le sujet.On peut classer grossièrement les articles consacrés à l’étu<strong>de</strong> du maintien postural dans <strong>de</strong>uxcatégories, selon l’argumentation statistique développée.Dans la première catégorie, cet argument est fondé sur <strong>de</strong>s comparaisons statistiques classiques<strong>de</strong> quantités moyennes ou extrémales (déplacement moyen, vitesse moyenne, étendue min-maxetc.). On pourrait argumenter que ces quantités ont le défaut <strong>de</strong> ne pas exploiter suffisammentla dynamique temporelle du phénomène. 
This dynamics, precisely, is central in the articles of the second category, which exploit classical time series models (as notably in [28, 71]). This is all the more the case in the works where the trajectories of the positions of the point of maximal pressure are modeled by Brownian motion (see [17] and its references). Extending these, fractional Brownian motion modelings have also been exploited; thus, the authors of [9, 6] study how dependence propagates through time within a trajectory (for protocols without stimulation). Finally, another line of work involves diffusion models, typically of Ornstein-Uhlenbeck type (see in particular [61, 27]).

The article [A6], whose results I summarize here, belongs to the second category. We will see in Section 2.5 that the preprint [A11] belongs, on the contrary, to the first category.
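To fix ideas on the diffusion models just cited, here is a minimal Euler-Maruyama sketch of an Ornstein-Uhlenbeck trajectory of the kind used to model centre-of-pressure dynamics; all parameter values are illustrative, not estimates from [61, 27]:

```python
# dX = theta (mu - X) dt + sigma dW, discretized with step dt (Euler-Maruyama).
import math
import random

def simulate_ou(n, dt, theta, mu, sigma, x0=0.0, seed=0):
    """Return n points of a discretized Ornstein-Uhlenbeck path."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n - 1):
        x = xs[-1]
        x += theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        xs.append(x)
    return xs

# 2800 points with step 0.025 s, matching the sampling scheme described below.
traj = simulate_ou(n=2800, dt=0.025, theta=1.5, mu=0.0, sigma=0.3)
```

The mean-reverting drift $\theta(\mu - X)$ is what distinguishes this class from plain Brownian motion: the trajectory keeps returning to the reference position $\mu$.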
This single protocol belongs to the category of vibratory stimulation protocols.

During this protocol, the subjects stand upright on a force platform, a device we have already described as similar to a bathroom scale. This device records at regular intervals the successive positions of the points where the subject's left foot and right foot separately exert maximal pressure. The standard time between two measurements is $\delta = 25$ milliseconds.

The protocol is divided into three experimental phases: during the first 15 seconds, the


postural control is not artificially perturbed; then come 35 seconds during which it is perturbed by vibratory stimulation of the subject's triceps, the subject moreover having to keep their eyes closed; during the last 20 seconds, control is finally no longer artificially perturbed and the subject reopens their eyes.

Figure 1.1 reports two examples of recordings obtained during this protocol, one for a subject qualified as healthy, the other for a subject qualified as hemiplegic.

Fig. 1.1 – Trajectories $t \mapsto L_t$ (left) and $t \mapsto R_t$ (right) of the positions of maximal pressure exerted by each foot on the force platform; top, a subject qualified as healthy; bottom, a subject qualified as hemiplegic.

1.4.3 A model of postural control

Reduction of the raw data.

We observe 32 subjects qualified as healthy, 23 qualified as vestibular and 16 qualified as hemiplegic, for a total of 71 subjects. The raw observations for one of them write $(X_t)_{t \in T} = (L_t, R_t)_{t \in T}$, $L_t$ (respectively $R_t$) being the position where the left (respectively right) foot exerts maximal pressure at time $t$, for every $t \in T = \{j\delta : 1 \le j \le 2800\}$ (the time step $\delta$ equals 0.025 seconds). These positions are given by their Cartesian coordinates. We perform a first summary of these raw data by considering only the sequence of segment midpoints $(B_t)_{t \in T} = (\frac{1}{2}(L_t + R_t))_{t \in T}$.
Heuristically, this sequence describes the trajectory of the projection of the subject's center of gravity onto the force platform.

Moreover, a preliminary study confirmed the relevance of summarizing these data further by considering only the sequence $(C_t)_{t \in T}$ of distances separating $B_t$ from a reference point. One possibility among several, the latter is defined as the median value of the $B_t$ over the first period (i.e. before the stimulations). This same preliminary study finally allowed us to bring to light


Application à l’étu<strong>de</strong> du maintien postural (1/2) 23que la volatilité <strong>de</strong> la suite (C t ) t∈T est particulièrement intéressante. Partant <strong>de</strong> ce constat, nouseffectuons une <strong>de</strong>rnière transformation en introduisant pour toutt∈T\{max(T)} la quantité{O t = log (C t+1 −C t ) 2} .Un argument reposant sur l’hypothèse que le processus (C t ) t∈T satisfait une certaine équationdifférentielle stochastique justifie l’introduction <strong>de</strong> cette quantité (cf la Section 7.1 dans [A6]).Un modèle <strong>de</strong> Markov caché.L’idée principale est <strong>de</strong> modéliser l’ensemble <strong>de</strong>s 71 trajectoires selon une hmm à émissionsgaussiennes multidimensionnelles.Exemple. Mélange en localisation par hmm (cas multidimensionnel). Reprenons les objetsC,S K ,φ m =ϕ((·−m)/σ) <strong>de</strong> l’exemple du mélange en localisation (cas indépendant) telque présenté en Section 1.1.1. Soitπ 0 ∈S K arbitrairement fixé. 
For every $K \ge 1$, set $\Theta_K = C^{3K} \times \prod_{k=1}^{K} S_K$ and $\mathcal{M}_K = \{P_\theta : \theta \in \Theta_K\}$: here $P_\theta$, for any $\theta = ((m_l(k))_{k \le K,\, l \le 3}, \pi_1, \ldots, \pi_K) \in \Theta_K$, is the law of the sequence $(O_t^i)_{t \in \delta\mathbb{N}^*, i \le 71}$ with values in $\mathbb{R}$ obtained by projection from the joint law $\mathbf{P}_\theta$ of the sequences $((O_t^i)_{t \in \delta\mathbb{N}^*, i \le 71}, (Z_t^i)_{t \in \delta\mathbb{N}, i \le 71})$ characterized by the following decomposition of the likelihood under $\mathbf{P}_\theta$ at time $N\delta$:
$$\mathbf{P}_\theta\bigl((Z_0^i, Z_\delta^i, O_\delta^i, \ldots, Z_{N\delta}^i, O_{N\delta}^i)_{i \le 71}\bigr) = \prod_{i=1}^{71} \Biggl\{ \pi_{0,Z_0^i} \times \prod_{j=1}^{N-1} \pi_{Z_{j\delta}^i, Z_{(j+1)\delta}^i} \times \prod_{j=1}^{N} \phi_{\mu^i(Z_{j\delta}^i)}(O_{j\delta}^i) \Biggr\},$$
with $\mu^i(z) = m_l(k)$ if and only if $z = k$ and the $i$-th subject (with whom $(O_t^i)_{t \in \delta\mathbb{N}^*}$ is associated) is qualified as of type $l$ ($l = 1$ for healthy, $l = 2$ for vestibular, $l = 3$ for hemiplegic); thus: all the $(Z_t^i)_{t \in \delta\mathbb{N}}$ are i.i.d. Markov chains with the same initial law and transitions parameterized by $\pi_0, \pi_1, \ldots, \pi_K$; all the $(O_t^i)_{t \in \delta\mathbb{N}^*}$ are independent; for each $i \le 71$ and every $N\delta \in \delta\mathbb{N}^*$, $(O_\delta^i, \ldots, O_{N\delta}^i)$ are conditionally independent given $(Z_0^i, \ldots, Z_{N\delta}^i)$; given $(Z_0^i, \ldots, Z_{N\delta}^i)$, $O_{j\delta}^i$ has conditional density $\phi_{\mu^i(Z_{j\delta}^i)}$.
Even if it does not appear obviously in the description proposed above, $P_\theta$ is indeed the law of an hmm with three-dimensional Gaussian emissions, with a very sparse transition matrix (whose dimension is a multiple of 71, the number of trajectories considered at once). We opt for this description because it brings to light the fact that each subject is associated with their own hidden Markov chain. At a given instant, the mean of the Gaussian emission depends on the hidden state and on the subject's qualification (healthy, vestibular, hemiplegic).

Suppose then that we observe $(O_t^i)_{t \in \delta\mathbb{N}^*, i \le 71} \sim P_0 \in \mathcal{M} = \cup_{K \ge 1} \mathcal{M}_K$. Unsurprisingly at this stage of the exposition, it is important to estimate the order $\Psi(P_0)$ of the law $P_0$ from the observations. In the spirit of [A5, A4], we introduce a penalized maximum likelihood estimator $K^{\mathrm{ML}}_n$ and show (cf. Proposition 1 in [A6]) that even without an a priori bound on $\Psi(P_0)$, asymptotically $K^{\mathrm{ML}}_n = \Psi(P_0)$ $P_0$-almost surely provided the penalty is adequately calibrated (i.e. for a penalty of cumulated bic type). The proof again rests on a control (similar to (1.5)) of the difference between the maximum log-likelihood statistic and a code-length statistic (similar to (1.15); cf. Proposition 3 in [A6]).
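The raw-data reduction described above, from left/right maximal-pressure positions $(L_t, R_t)$ to midpoints $B_t$, radial distances $C_t$ to a reference point, and finally log-squared increments $O_t$, can be sketched as follows. The reference point is taken here as the coordinatewise median of $B_t$ over the first period (one possibility among several, as noted above), and the input data are tiny synthetic values:

```python
# Hedged sketch of the data reduction: B_t = (L_t + R_t)/2, C_t = |B_t - ref|,
# O_t = log((C_{t+delta} - C_t)^2), with ref the coordinatewise median of B_t
# over the first (unperturbed) period.
import math

def reduce_trajectories(L, R, n_first_period):
    B = [((lx + rx) / 2.0, (ly + ry) / 2.0) for (lx, ly), (rx, ry) in zip(L, R)]
    first = B[:n_first_period]
    ref = (sorted(x for x, _ in first)[len(first) // 2],
           sorted(y for _, y in first)[len(first) // 2])  # coordinatewise median
    C = [math.hypot(x - ref[0], y - ref[1]) for x, y in B]
    O = [math.log((C[t + 1] - C[t]) ** 2) for t in range(len(C) - 1)]
    return B, C, O

# Synthetic input: 6 time points, constant left foot, drifting right foot.
L_pos = [(0.0, 0.0)] * 6
R_pos = [(2.0, 0.0), (2.2, 0.1), (2.1, 0.3), (2.4, 0.2), (2.0, 0.5), (2.6, 0.1)]
B, C, O = reduce_trajectories(L_pos, R_pos, n_first_period=3)
```

Note that $O_t$ is undefined when two consecutive distances coincide exactly (a zero increment), which does not occur with continuous-valued recordings.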


Contribution to the elaboration of a notion of postural style (1/2).

The standard deviation parameter $\sigma$ is estimated from an independent data set, corresponding to a protocol during which the subjects stand upright on a force platform without any stimulation. The data set is randomly split into three independent parts of comparable sizes, consisting respectively of the following numbers of subjects qualified as healthy, vestibular or hemiplegic: (10, 7, 6), (10, 8, 6) and (12, 8, 4). The first subsample is devoted to order estimation, the second to fitting the model corresponding to the elected order, the third being reserved for validation purposes. (The robustness of the results with respect to this split was checked.)

We implement an algorithm maximizing the exact log-likelihood rather than an EM algorithm, with a notable gain in numerical performance (cf. Section 7.2 in [A6]).

We obtain $K^{\mathrm{ML}}_n = 5$. If a bic-type penalization is used (rather than cumulated bic), the estimate equals 6. We argue that a model with 5 hidden states is more appropriate, and fit it. We refer the reader to Section 4 of [A6] for the details of the argument and the results of fitting this model.

Finally, we devise a procedure for classifying subjects from their observed trajectories, a procedure based on the fitted model.
The performances are very disappointing, with only 66% of the subjects correctly classified. We will return to the theme of classifying subjects in terms of postural control in Section 2.5.

Remark. In [A6] we develop a model competing with the one presented here, in order to incorporate the idea that there do exist different postural styles, but that these are probably not characterized by the subjects' qualification (healthy, vestibular, hemiplegic). After an analysis rather similar to the one summarized in this section, we obtain a description with three postural styles (order of the law equal to three), the first corresponding essentially to subjects qualified as healthy, the second mostly to subjects qualified as hemiplegic rather than vestibular (75% versus 25%), the third mostly to subjects qualified as vestibular rather than hemiplegic (67% versus 33%). I believe this model is more promising than the previous one, but the development of a classification procedure, noticeably more difficult in this case, remains to be carried out.
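The random three-way subject split described above (order estimation / model fitting / validation, stratified by diagnostic group) can be sketched as follows. The subject labels are synthetic, and the per-stratum target sizes follow the counts quoted in the text:

```python
# Hedged sketch of a stratified random three-way split of the 71 subjects.
import random

def stratified_split(subjects_by_stratum, sizes_by_stratum, seed=0):
    """sizes_by_stratum[stratum] = (n_order, n_fit, n_valid): each stratum is
    shuffled and dealt into the three subsamples."""
    rng = random.Random(seed)
    parts = {"order": [], "fit": [], "valid": []}
    for stratum, subjects in subjects_by_stratum.items():
        pool = list(subjects)
        rng.shuffle(pool)
        n1, n2, n3 = sizes_by_stratum[stratum]
        assert n1 + n2 + n3 == len(pool)
        parts["order"] += pool[:n1]
        parts["fit"] += pool[n1:n1 + n2]
        parts["valid"] += pool[n1 + n2:]
    return parts

cohort = {"healthy": [f"H{i}" for i in range(32)],
          "vestibular": [f"V{i}" for i in range(23)],
          "hemiplegic": [f"P{i}" for i in range(16)]}
# Per-stratum (order, fit, valid) counts, matching (10,7,6), (10,8,6), (12,8,4).
sizes = {"healthy": (10, 10, 12), "vestibular": (7, 8, 8), "hemiplegic": (6, 6, 4)}
parts = stratified_split(cohort, sizes)
```

Rerunning with different seeds is one simple way to probe the robustness of downstream results with respect to the split, as mentioned in the text.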


Chapter 2

Estimation of variable importance and of causal parameters from observational data

Observational studies, which produce data likewise qualified as observational, are characterized by the fact that their investigators do not intervene in the stochastic process generating the observations (as is, on the contrary, the case in randomized clinical trials; cf. Chapter 3 for my contributions to the construction and study of such trials). It is therefore understood that they do not really make it possible to exhibit and evaluate causal effects (at least not under consensual hypotheses). The fact remains that the variable importance parameters (in the phrase introduced in [78]) that they typically allow one to estimate, and which are meant to come as close as possible (given the nature of the observations) to causal parameters, are worthy of interest and often contribute decisively to the understanding of a phenomenon of interest.

A second part of my work concerns precisely the estimation of variable importance and of causal parameters from observational data.
J’ai consacré ces trois <strong>de</strong>rnièresannées beaucoup <strong>de</strong> mon temps et <strong>de</strong> mon énergie au développement et à l’utilisation du principe<strong>de</strong> minimisation <strong>de</strong> perte ciblée (traduction <strong>de</strong> targeted minimum loss estimation, d’où l’acronymeTMLE), typiquement pour l’estimation (et le test) <strong>de</strong> tels paramètres. Je présente dans ce chapitreles gran<strong>de</strong>s lignes du principe TMLE (cf la Section 2.2, qui peut être lue indépendamment), ainsiqu’un exemple d’application <strong>de</strong> celui-ci à l’estimation <strong>de</strong> la probabilité <strong>de</strong> succès d’un programme<strong>de</strong> fécondation in vitro en France, sujet <strong>de</strong> [A9] (cf la Section 2.3). Le principe TMLE joueaussi un rôle central dans l’étu<strong>de</strong> du maintien postural menée dans [A11] (cf la Section 2.5),qui complète les résultats <strong>de</strong> [A6] (tels que résumés dans la Section 1.4). La définition d’unemesure inédite <strong>de</strong> l’importance d’une variable (qualifiée <strong>de</strong> non-paramétrique), le développementd’une procédure TMLE ad hoc, ses étu<strong>de</strong>s théorique et par simulations sont au cœur <strong>de</strong> [A14](cf la Section 2.4). Finalement, je présente les résultats <strong>de</strong> l’étu<strong>de</strong> menée dans [A12] <strong>de</strong>s effets<strong>de</strong> l’exposition professionnelle à l’amiante sur le développement d’un cancer du poumon (cf laSection 2.1). Même si celle-ci ne relève pas stricto sensu <strong>de</strong> l’estimation <strong>de</strong> l’importance d’unevariable ni d’un paramètre causal (la nature <strong>de</strong>s données ne s’y prête pas), elle trouve tout-à-faitsa place dans ce chapitre.


2.1 Asbestos and lung cancer in France [A12]

In this section I present the results obtained in [A12] in collaboration with Dominique Choudat (Service de pathologie professionnelle, Assistance Publique–Hôpitaux de Paris and Université Paris Descartes), Catherine Huber (Laboratoire MAP5, Université Paris Descartes), Jean-Claude Pairon (Service de pneumologie et pathologie professionnelle, INSERM and Université Paris-Est Créteil) and Mark van der Laan (Division of Biostatistics, UC Berkeley). This study is devoted to the analysis of an original case-control data set describing occupational exposure to asbestos in relation to the development of lung cancer in France. In it we implement a sophisticated statistical methodology combining, in particular, parametric modeling by threshold regression, bias correction by weighting, and model selection by likelihood-based cross-validation.

2.1.1 Introduction

It has been known for several decades that asbestos is a dangerous carcinogen [44]. The main objective of the study carried out in [A12] is to quantify the relationship between occupational exposure to asbestos and the increase in the risk of lung cancer. In addition, we consider the very delicate question of evaluating, for subjects suffering from lung cancer, the contribution of the asbestos exposure to the occurrence of the disease (we will speak of the contribution of an exposure to the occurrence of a disease).

Any reasoned answer to this second question could have important implications in terms of public health policy (for instance, in the United States, legal schemes of financial compensation rely on such evaluations). We phrase ours in terms of the expected number of years of life lost, in the sense defined by Robins and Greenland [68] (see also [60]). Although the answer we provide does not strike us as definitive, it nonetheless constitutes an interesting contribution to the general study of evaluating the contribution of an exposure to the occurrence of a disease, and a significant advance of the state of the art on French data for asbestos exposure and lung cancer.

Although the two central questions just mentioned are causal in nature, we carefully present them using expressions drawn from the semantic field of association. One of the reasons is that smoking is well known to be another risk factor playing a predominant role in the onset of lung cancer [10]. To reach a causal interpretation, we would need to disentangle the combined effects of tobacco and asbestos, a task that is impossible with the data at our disposal.

The statistical analysis implemented in [A12] is developed specifically for a French data set obtained under the paradigm of a case-control study [64]. Resorting to this study design is justified for rare diseases such as lung cancer (rare in the sense that the probability that a person drawn at random from the population develops such a cancer is approximately 0.047% [8]). For each of the approximately 2,000 participants in the study, we have a set of covariates including a summary of smoking behavior and a longitudinal description of the occupational exposure to asbestos (duration of each job held and classification into one of 28 categories). In order to exploit this original description, we devise an ad hoc parametrization scheme. Besides lending itself to a rather direct interpretation, this scheme naturally accommodates reductions of the complexity of the original description (which, with its a priori 28 different classes, appears from the outset to be over-dimensioned).

We build a large family of parametric models of the first-passage-time (or threshold regression) type [53, 54]. Thanks to a clever weighting of the likelihood function, we manage to select (by cross-validation) a best model in the collection, to fit it, and to derive from it information that is not conditional on suffering from lung cancer or not, even though the data are of case-control type (in other words, the weighting corrects the bias induced by the study design).

2.1.2 Brief description of the data

Faithful to the strategy introduced by van der Laan [79]:

1. We start by deducing from the description of the case-control study undertaken what the prospective study would have been that the investigators would have carried out had the disease not been so rare. This essentially amounts to characterizing a generic observation $O^\star$ for this prospective study, whose law $P_0^\star$ features certain traits of interest $\Psi(P_0^\star)$.

2. Next, we describe the generic observation $O$ associated with the case-control study and its law $P_0$ (in terms of $O^\star$ and $P_0^\star$). We then describe how to estimate the traits $\Psi(P_0^\star)$ from data sampled under $P_0$.

Thus, the generic observation $O^\star$ writes

$$O^\star = (W, X, \bar A(X), Y, Z) \sim P_0^\star$$

with

– $W$ a vector of baseline covariates (including, in particular, gender, the occurrence of lung cancer in the close family, and smoking in four levels: never smoker, between 1 and 25 cumulated pack-years, between 26 and 45 cumulated pack-years, at least 46 cumulated pack-years);

– $X$ the age at sampling;

– $\bar A(X)$ the longitudinal description of the occupational exposure to asbestos up to age $X$;

– $Y = 1$ iff the age $T$ at which lung cancer develops or will develop ($T = \infty$ if it is never to develop) satisfies $T = Z \le X$ (one is then in the presence of a case), and $Y = 0$ iff $T > Z = X$ otherwise (one is then in the presence of a control).

It is because the probability $P_0^\star(Y = 1)$ of observing a case of lung cancer in the population is very small that sampling under $P_0^\star$ would have been unfeasible. Case-control sampling (here, matched) is meant to remedy this problem. Let $V \subset W$ be the matching variable (including, in particular, gender and the age at sampling discretized with a five-year step; we will use the redundant notations $(V, W)$, $(V, O)$, etc). Matched case-control sampling consists in

– first sampling $(V^1, O^{1\star}) = (V^1, W^1, X^1, \bar A^1(X^1), Y^1 = 1, Z^1)$ from the conditional law of $(V, O^\star)$ given $Y = 1$,

– then sampling $(V^0, O^{0\star}) = (V^0, W^0, X^0, \bar A^0(X^0), Y^0 = 0, Z^0)$ from the conditional law of $(V, O^\star)$ given $Y = 0$ and $V = V^1$.

One thus builds a generic observation

$$O = ((V^1, O^{1\star}), (V^0, O^{0\star})) \sim P_0$$

whose law $P_0$ can be deduced from $P_0^\star$ and the sampling procedure described above. We observe $n = 860$ independent copies of it.
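As a toy illustration of this sampling scheme, the following sketch (Python) draws matched pairs by rejection sampling; the generating law $P_0^\star$ and the binary matching variable standing in for $V$ are entirely hypothetical.

```python
# Toy sketch of matched case-control sampling; the law P*_0 is hypothetical.
import random

random.seed(0)

def draw_full():
    """One draw of (V, O*) = (V, W, Y) under the hypothetical P*_0."""
    v = random.randint(0, 1)                 # matching variable (e.g. gender)
    w = (v, random.gauss(0.0, 1.0))          # baseline covariates, V among them
    p = 0.001 * (1 + v)                      # rare disease: P*(Y=1 | V=v)
    y = 1 if random.random() < p else 0
    return v, w, y

def draw_pair():
    """One matched pair O = ((V1, O1*), (V0, O0*)) under P_0."""
    while True:                              # case: (V, O*) given Y = 1
        v1, w1, y1 = draw_full()
        if y1 == 1:
            break
    while True:                              # control: (V, O*) given Y = 0, V = V1
        v0, w0, y0 = draw_full()
        if y0 == 0 and v0 == v1:
            break
    return (v1, w1, y1), (v0, w0, y0)

pairs = [draw_pair() for _ in range(50)]
print(len(pairs))  # 50 matched pairs, every control matched on V
```

Rejection sampling is wasteful here (the disease is rare, so most draws are rejected when looking for a case), which is precisely why real case-control studies sample cases directly from disease registries.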


2.1.3 An ad hoc threshold regression model

Our collection of parametric models revolves around two ideas:

– on the one hand, an ad hoc modeling of the acceleration of aging due to asbestos exposure, from which follows a notion of biological age as opposed to calendar age;

– on the other hand, a threshold regression model, justified by the interpretation of health as a stochastic process.

Acceleration of aging due to asbestos exposure; biological and calendar ages.

We said that the description $\bar A(X)$ of the occupational exposure to asbestos up to the age at sampling is longitudinal in nature. Specifically, we have at our disposal, for each job held, the starting and ending dates of the job and an original description of the content of the exposure in the position held. The latter takes the form of a triplet $\varepsilon = (\varepsilon_1, \varepsilon_2, \varepsilon_3) \equiv \varepsilon_1\varepsilon_2\varepsilon_3$ (the so-called "probability/frequency/intensity" triplet), each coordinate $\varepsilon_j$ of which takes its values in $\{1, 2, 3\}$, according to whether the probability ($j = 1$), frequency ($j = 2$) or intensity ($j = 3$) of the asbestos exposure is judged low ($\varepsilon_j = 1$), moderate ($\varepsilon_j = 2$) or high ($\varepsilon_j = 3$) by expert assessment. The set $\mathcal E$ of the different levels of occupational exposure to asbestos thus counts 28 categories (we add a category $0 = (0, 0, 0) \equiv 000$ in the complete absence of any such exposure).

It is the product of the "probability/frequency/intensity" factors that is interpreted as a rate of occupational exposure to asbestos. This motivates the introduction of the parameter set

$$\mathcal M = \left\{ (M_0, (M_{k,l})_{k,l \le 3}) \in \mathbb R_+ \times \mathcal M_{3,3}(\mathbb R_+) : 0 \le M_{k,1} \le M_{k,2} \le M_{k,3} = 1, \; k = 1, 2, 3 \right\}$$

built in such a way that the rate associated with $\varepsilon \in \mathcal E \setminus \{0\}$ by $M \in \mathcal M$ is

$$M(\varepsilon) = 1 + M_0 \times M_{1,\varepsilon_1} \times M_{2,\varepsilon_2} \times M_{3,\varepsilon_3},$$

with moreover $M(0) = 1$. Thus, $M_0$ is interpreted as a maximal acceleration factor, associated with the strongest exposure (coded as $\varepsilon = 333$). The rates range from 1 (no additional acceleration) to $M(3, 3, 3) = 1 + M_0$, with

$$\frac{M(\varepsilon) - 1}{M_0} = M_{1,\varepsilon_1} \times M_{2,\varepsilon_2} \times M_{3,\varepsilon_3},$$

an exposure characterized by $\varepsilon = \varepsilon_1\varepsilon_2\varepsilon_3$ reaching a fraction $M_{1,\varepsilon_1} \times M_{2,\varepsilon_2} \times M_{3,\varepsilon_3}$ of the maximal acceleration.

This parametrization is quite parsimonious, seven parameters sufficing to describe the 28 possible accelerations. It is, moreover, identifiable: if $M, M' \in \mathcal M$ satisfy $M(\varepsilon) = M'(\varepsilon)$ for all $\varepsilon \in \mathcal E$, then $M = M'$.

Finally, associate with every longitudinal description $\bar\varepsilon$ (such that $\bar\varepsilon(t) = \varepsilon$ iff the job held at age $t$ has the triplet $\varepsilon$ as its description), every parameter $M \in \mathcal M$ and every age $t$ the quantity

$$R(t; M, \bar\varepsilon) = \int_0^t M(\bar\varepsilon(s)) \, ds.$$

We interpret $R(t; M, \bar\varepsilon)$ as a biological age, as opposed to the calendar age $t$, from which it diverges through the effect of the occupational exposure to asbestos summarized by $\bar\varepsilon$. In general, $R(t; M, \bar\varepsilon) \ge t$ for all $t$, and equality holds for all $t$ iff $M(\bar\varepsilon(t)) = 1$ for all $t$, i.e., iff no occupational exposure to asbestos has been judged significant.
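To make the parametrization concrete, here is a minimal sketch (Python) of the rate $M(\varepsilon)$ and of the biological age $R(t; M, \bar\varepsilon)$ for a piecewise-constant job history; the parameter values and the exposure history are hypothetical, for illustration only.

```python
# Sketch of the exposure-rate parametrization; all values are hypothetical.

def rate(eps, M0, M):
    """M(eps) = 1 + M0 * M[0][e1-1] * M[1][e2-1] * M[2][e3-1]; M(0) = 1."""
    if eps == (0, 0, 0):
        return 1.0
    e1, e2, e3 = eps
    return 1.0 + M0 * M[0][e1 - 1] * M[1][e2 - 1] * M[2][e3 - 1]

def biological_age(t, history, M0, M):
    """R(t; M, eps_bar) = integral_0^t M(eps_bar(s)) ds for a piecewise-
    constant exposure history given as [(start, end, eps), ...]."""
    r = 0.0
    covered = 0.0                            # time spent in recorded jobs up to t
    for (start, end, eps) in history:
        lo, hi = max(0.0, start), min(t, end)
        if hi > lo:
            r += (hi - lo) * rate(eps, M0, M)
            covered += hi - lo
    return r + (t - covered)                 # unexposed periods age at rate 1

M0 = 2.0                                     # maximal additional acceleration
M = [[0.2, 0.6, 1.0]] * 3                    # rows: probability, frequency, intensity
history = [(20.0, 30.0, (3, 3, 3))]          # ten years at the strongest exposure
print(rate((3, 3, 3), M0, M))                # 1 + M0 = 3.0
print(biological_age(40.0, history, M0, M))  # 30 unexposed years + 10 * 3 = 60.0
```

Note how the constraints $M_{k,3} = 1$ make $M_0$ alone carry the scale: the strongest exposure $333$ ages at rate $1 + M_0$, and every other triplet reaches a fraction of that acceleration.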


Threshold regression model.

We adopt the threshold regression modeling approach (see [53, 54] and references therein): we model the time elapsed until the event of interest (development of lung cancer) as the time it takes a stochastic process to cross a boundary for the first time. This process here represents the amount of lung-cancer-related health. As long as it remains above zero (the boundary), lung cancer does not declare itself. When it crosses the boundary for the first time, lung cancer declares itself.

Let $B$ be a Brownian motion, $(h, \mu) \in \mathbb R_+^* \times \mathbb R_-$ and $R$ a continuous increasing function on $\mathbb R_+$ such that $R(t) \ge t$ for all $t \ge 0$. We associate with them

$$T\{h, \mu, R\} = \inf\{ t \ge 0 : h + \mu R(t) + B_{R(t)} \le 0 \},$$

the first time at which the drifted Brownian motion $(h + \mu t + B_t)_{t \ge 0}$ crosses the zero boundary along the modified time scale deduced from $R$. Here, $h$ is interpreted as an initial amount of lung-cancer-related health and $\mu$ as a rate of loss of lung-cancer-related health. We write $T\{h, \mu, \mathrm{id}\} \equiv T\{h, \mu\}$, whose law is inverse Gaussian, parametrized by $(h, \mu)$. Since $\mu \le 0$, we know that $T\{h, \mu\} < \infty$ almost surely; moreover, since $R$ is continuous and increasing, the change of time scale yields

$$T\{h, \mu\} = R(T\{h, \mu, R\}). \qquad (2.1)$$
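The first-passage representation lends itself to direct simulation. The following sketch (Python) approximates $T\{h, \mu\}$ by an Euler scheme, with the identity time scale $R = \mathrm{id}$ and hypothetical values of $h$ and $\mu$; with $\mu < 0$ the hitting time is inverse Gaussian with mean $h / |\mu|$.

```python
# Euler-scheme sketch of the first-passage time T{h, mu} with R = id;
# the values of h and mu are hypothetical.
import random

def first_passage(h, mu, dt=0.01, t_max=20.0, rng=random):
    """First time the path h + mu*t + B_t drops to or below zero."""
    x, t = h, 0.0
    while t < t_max:
        x += mu * dt + rng.gauss(0.0, dt ** 0.5)
        t += dt
        if x <= 0.0:
            return t
    return float("inf")                      # no crossing before t_max

random.seed(1)
h, mu = 1.0, -1.0                            # initial health, health-loss rate
times = [first_passage(h, mu) for _ in range(500)]
finite = [t for t in times if t < float("inf")]
# Empirical mean close to h/|mu| = 1, up to Monte Carlo and discretization error.
print(sum(finite) / len(finite))
```

The Euler scheme only checks the boundary at grid points, which biases the hitting time slightly upward; a finer step `dt` reduces this discretization error at the cost of longer runs.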


The likelihood weighting scheme: definition and asymptotic properties.

Recall that the model $\mathcal M$ is assumed to be well specified (i.e., $P_0^\star = P_{\theta_0}^\star \in \mathcal M$; otherwise, one must substitute for $P_0^\star$ its Kullback-Leibler projection $P_{\theta_0}^\star$ onto $\mathcal M$ in the statements that follow). By virtue of the maximum likelihood principle, one has (under mild assumptions)

$$P_0^\star = \arg\max_{\theta \in \Theta} E_{P_0^\star} \log P_\theta^\star(O^\star).$$

However, we observe i.i.d. copies of $O = (O^{1\star}, O^{0\star}) \sim P_0$ rather than of $O^\star \sim P_0^\star$. To remedy this inconvenience, let us introduce the weighted log-likelihoods characterized by

$$l(P_\theta^\star)(O) = q_0 \log P_\theta^\star(O^{1\star}) + \bar q_0(W) \log P_\theta^\star(O^{0\star}),$$

where the weights are defined by $q_0 = P_0^\star(Y = 1)$ and $\bar q_0(W) = q_0 \, P_0^\star(Y = 0 \mid W) / P_0^\star(Y = 1 \mid W)$. One can easily show that (under mild assumptions)

$$P_0^\star = \arg\max_{\theta \in \Theta} E_{P_0} \, l(P_\theta^\star)(O),$$

that is, the weighting corrects the bias induced by the case-control nature of the study (see Proposition 1 in [A12]). The classical proof techniques for obtaining consistency results and asymptotically linear expansions (from which central limit theorems are deduced) for M-estimators adapt easily. We thus show that (under a standard set of assumptions) the weighted maximum likelihood estimator $\theta_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^n l(P_\theta^\star)(O_i)$ is consistent and that $\sqrt n (\theta_n - \theta_0)$ converges in law to $N(0, \Sigma)$ for a certain variance-covariance matrix $\Sigma$ (see Propositions 2 and 3 in [A12]). We also show that the $n$ independent copies of $O$ under $P_0$ are equivalent to $2 \times n$ independent copies of $O^\star$ under a law $P_1^\star$ defined as a perturbation of $P_0^\star$ (see Proposition 3 in [A12]).

Model selection by cross-validation based on the weighted likelihood.

The model $\mathcal M$ is assuredly over-dimensioned. We actually use it as a maximal model from which we extract sub-models obtained by imposing natural constraints on the parameter $\theta$. There are in all 11,008 such sub-models, so considering them all is out of the question! We instead devise a two-stage model selection procedure, which consists first in drawing up a list of models worthy of interest, and then in exploring all the sub-models they contain. The comparisons of all these sub-models (56 in total when the procedure is applied to the real data) are carried out according to the principle of cross-validation based on the weighted likelihood described above (the details of the procedure are given in Section 6.2 of [A12]). Even though we have not rigorously checked the assumptions required to invoke the key results of [86], the latter support the idea that the procedure selects a best sub-model close to the oracle of the collection.

Remark. The implementation of this stage is very difficult, for it requires that every sub-model be fittable a priori by weighted maximum likelihood. I overcame the difficulty (in R) by devising an algorithm whose function is to create automatically, from the verbal description of a sub-model, all the functions required to carry out its fit.
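The bias-correcting property of the weights (Proposition 1 in [A12]) can be checked numerically in a toy example. The sketch below (Python) uses a single binary covariate $W$, doubling as the matching variable, and a hypothetical rare-disease logistic model; the expected weighted log-likelihood of a matched pair, computable in closed form here, is maximized at the true parameter.

```python
# Toy numerical check that the case-control weighting corrects the sampling
# bias; the logistic parameters (a0, b0) and the law of W are hypothetical.
import math

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

a0, b0 = -5.0, 1.0                           # true rare-disease parameters
pW = {0: 0.5, 1: 0.5}                        # marginal law of W
p0 = {w: expit(a0 + b0 * w) for w in pW}     # P*(Y=1 | W=w)
q0 = sum(pW[w] * p0[w] for w in pW)          # P*(Y=1)

def expected_weighted_loglik(a, b):
    """E_{P0}[ q0 log P_theta(case) + q0bar(W) log P_theta(control) ]."""
    total = 0.0
    for w in pW:
        p_th = expit(a + b * w)
        pair_w = pW[w] * p0[w] / q0          # P*(W=w | Y=1); control matched on W
        q0bar = q0 * (1.0 - p0[w]) / p0[w]
        total += pair_w * (q0 * math.log(p_th) + q0bar * math.log(1.0 - p_th))
    return total

grid = [(a0 + da, b0 + db)
        for da in (-1.0, -0.5, 0.0, 0.5, 1.0)
        for db in (-1.0, -0.5, 0.0, 0.5, 1.0)]
best = max(grid, key=lambda ab: expected_weighted_loglik(*ab))
print(best)  # (-5.0, 1.0): the weighted criterion peaks at the true parameter
```

Indeed, a short computation shows that the expected weighted log-likelihood of a matched pair equals the population conditional log-likelihood $\sum_w P(W = w)\{p_0(w) \log p_\theta(w) + (1 - p_0(w)) \log(1 - p_\theta(w))\}$, which is maximized at $\theta_0$.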


Results of fitting the best sub-model.

The best model (in the sense of cross-validation based on the weighted likelihood) is finally fitted on the whole data set. It satisfies the following constraints:

– $h$ depends only on gender;

– $\mu$ depends only on gender and smoking;

– occupational exposure to asbestos is harmful ($M_0 > 0$); there is no difference between "low probability" and no exposure at all ($M_{1,1} = 0$); there is no difference between "low frequency" and "moderate frequency" ($M_{2,1} = M_{2,2}$).

Confidence intervals are obtained by the bootstrap, with a correction ensuring a simultaneous coverage of 95%. The complete results are presented in Section 6.3 of [A12].

2.1.5 Expected number of years of life lost

By virtue of (2.1), the following equality is satisfied:

$$T\{h, \mu\} = R(T\{h, \mu, R\}).$$

In other words, all other things (gender, occurrence or not of cancer in the close family, smoking) being equal for a case, the (counterfactual) age $T\{h, \mu\}$ at which lung cancer would have declared itself in the absence of any occupational exposure to asbestos is a deterministic function of the (observed) age at which lung cancer declared itself and of the history of occupational exposure to asbestos. Moreover (with the convention $\infty - \infty = 0$), the quantity

$$R(T\{h, \mu, R\}) - T\{h, \mu, R\}$$

can be interpreted as an expected number of years of life lost in the sense defined in [68]. Out of the 860 cases recorded in our data set, only 30% are assigned a strictly positive expected number of years of life lost. We refer the reader to [A12] for a detailed presentation of the results on the expected number of years of life lost.

2.2 On targeted minimum loss estimation

The principle of targeted minimum loss estimation (TMLE) plays an important role in all the works that remain for me to summarize. Since I do not take for granted that the reader is familiar with this principle, I devote the present section to it by way of introduction (deliberately very general, and readable independently of the rest of this dissertation).

2.2.1 Introduction

Originally introduced by van der Laan and Rubin [83] in 2006, the TMLE principle has since been applied to a wide variety of statistical problems, as witnessed by the first monograph on the subject [82]. Set within the framework of semiparametric statistical theory, it rests on the fundamentals of statistical inference:

– the data: they are seen as realizations of random variables, with (true, unknown) law $P_0$ and empirical measure $P_n$; in this section I restrict myself to the case of independent and identically distributed data $O_1, \ldots, O_n$, copies of $O \sim P_0$;


– the model: a collection $\mathcal M$ of candidate laws, whose characterization compiles all the knowledge available about $P_0$;

– the target parameter: we are particularly interested in a trait $\psi_0 = \Psi(P_0)$ of $P_0$, the functional $\Psi : \mathcal M \to \mathbb R^d$ being defined on the whole model $\mathcal M$, even if the latter is not parametric;

– an estimator: an algorithm completely specified a priori, which maps the data to an estimated value of $\psi_0$, and whose performance is measured according to a certain dissimilarity measure relative to $\psi_0$ (e.g., the quadratic error).

In this section we illustrate the TMLE principle with the following fundamental example.

Example. Excess risk. Let $O = (W, A, Y) \sim P_0$, with $W$ a vector of baseline covariates, $A \in \{0, 1\}$ a treatment variable (e.g., drug protocol #1 or #2) and $Y \in \mathbb R_+$ a nonnegative primary outcome (e.g., viral load after one week of treatment). Suppose that the state of knowledge is limited: one then chooses as model $\mathcal M = \mathcal M_{NP}$, the so-called purely nonparametric model (i.e., the set of all laws compatible with the definition of $O$).

The parameter of interest is the excess risk: one seeks to estimate $\psi_0 = \Psi(P_0)$, the functional $\Psi \equiv \mathrm{ER} : \mathcal M \to \mathbb R$ being such that, for every $P \in \mathcal M$,

$$\mathrm{ER}(P) = E_P\{ E_P(Y \mid A = 1, W) - E_P(Y \mid A = 0, W) \}. \qquad (2.2)$$

The excess risk is one of the most popular measures of the importance of a variable (here, of $A$ on $Y$, controlling for $W$). To justify the definition of the excess risk, one may rely on a counterfactual argument. Suppose there exists a (so-called complete) datum $X = (W, Y(0), Y(1))$, comprising the covariates $W$ and the two counterfactual outcomes $Y(0)$ and $Y(1)$ that the treatments $A = 0$ and $A = 1$, respectively, would yield. Under the consistency assumption, the primary outcome satisfies the equality $Y = Y(A)$. Of course, it is impossible to ever observe both $Y(0)$ and $Y(1)$. If, in addition, the conditional independence (so-called randomization) assumption $A \perp X \mid W$ is also satisfied, then one easily sees that $\mathrm{ER}(P) = E\{Y(1) - Y(0)\}$ is the difference of the mean values of the counterfactual outcomes; it thus admits a causal interpretation. This second assumption cannot be tested on the data.

Remark. On the existence of the complete datum. One can show that it is always possible, at the cost of an enlargement of the probability space allowing new uniform random variables to be drawn, and by resorting to adequate quantile-quantile transformations, to construct a complete datum $X$ as above (see Theorem 2.1 in [89] and [35]). This mathematical construction induces a notion of causality which, however, does not necessarily make sense in reality.

The inference of $\psi_0 = \Psi(P_0)$ can be carried out, for instance, by substitution: once an estimator $P_n^0$ of $P_0$ is built, the estimator $\psi_n^0 = \Psi(P_n^0)$ may seem an interesting candidate, especially if the functional $\Psi$ enjoys regularity properties (we will come back to this). This approach, however, is certainly not optimal. Heuristically, there is no reason why the bias-variance trade-off achieved by $P_n^0$ with the aim of estimating $P_0$ should translate into satisfactory performance with the aim of estimating the single trait $\psi_0 = \Psi(P_0)$. Let us return to the excess risk example for illustration. By its very definition, in order to estimate $\mathrm{ER}(P_0)$ it suffices, on the one hand, to estimate the marginal law of $W$ and, on the other hand, to regress $Y$ on $(A, W)$. Let therefore $P_n^0 \in \mathcal M$ be such that $P_{n,W}^0 = P_{n,W}$ (the marginal laws of $W$ under $P_n^0$ and $P_n$ coincide) and such that $\log E_{P_n^0}(Y \mid A, W)$ is obtained by fitting a certain parametric model (e.g., a least-squares fit of a model linear in $A$ and the components of $W$). One sees clearly that, if the parametrization of the conditional expectation is misspecified (which it always will be!), then the very consistency of $\psi_n^0 = \mathrm{ER}(P_n^0)$ breaks down.

The first driving principle of TMLE is to exploit an initial substitution estimator $\psi_n^0 = \Psi(P_n^0)$ as a foothold from which to build, iteratively, a second one by targeting the functional $\Psi$, i.e., by bending the initial estimator $P_n^0$ of $P_0$ "in the direction of $\Psi$" in such a way that breaking the bias-variance trade-off achieved by $P_n^0$ for the purpose of estimating $P_0$ translates into improved performance of the substitution estimator. Returning to the excess risk example: typically, the bending step consists in introducing a new covariate (a function of $A$ and $W$) into the parametric model for the conditional expectation of $Y$ given $(A, W)$, and in fitting by least squares the resulting fluctuation of $\log E_{P_n^0}(Y \mid A, W)$. Denote by $P_n^1 \in \mathcal M$ the updated estimator of $P_0$, such that $P_{n,W}^1 = P_{n,W}^0$ and $\log E_{P_n^1}(Y \mid A, W)$ equals the fitted fluctuation of $\log E_{P_n^0}(Y \mid A, W)$. If the auxiliary statistical procedure that governed the construction of the new covariate is consistent, then the updated estimator $\psi_n^1 = \Psi(P_n^1)$ of the trait $\psi_0 = \Psi(P_0)$ is consistent too.

The second driving principle of TMLE is the will to limit "human interventions" (to quote Mark van der Laan) and the resort to unrealistic parametric models in the construction of the initial estimator $P_n^0$ of $P_0$. To this end, it is recommended to resort to super learning, a method for aggregating estimators by cross-validation (see, for instance, the original article [84] and Chapter 3 of [82]). A remarkable R package called SuperLearner, due to Eric Polley, makes it easy to implement this methodology (we refer the reader to the dedicated site, at the URL https://github.com/ecpolley/SuperLearner).

The TMLE estimator is typically $\sqrt n$-consistent (read: robust) and asymptotically Gaussian under reasonable assumptions, and even efficient provided these are strengthened. Notably, the updating step contributes both to reducing the bias and to gaining efficiency.

2.2.2 The TMLE principle

It is time to expose the TMLE principle rigorously. To simplify the exposition, assume that the model $\mathcal M = \mathcal M_{NP}$ is the purely nonparametric model and that $\Psi : \mathcal M \to \mathbb R$ takes real values (both conditions are satisfied in the excess risk example).

The efficient influence function.

Let us begin with some fundamental elements of semiparametric statistical theory (see Section 25.3 of [85]). The TMLE principle applies to the estimation of regular functionals $\Psi$.
A functional $\Psi : \mathcal M \to \mathbb R$ is said to be regular (or pathwise differentiable) at $P \in \mathcal M$ if there exists $D^\star(P) \in L_0^2(P) \equiv \{ f \in L^2(P) : Pf = 0 \}$ such that, for every score (or direction) $s \in L_0^2(P) \cap L^\infty(P)$ and every fluctuation (or path) $\{P_s(\varepsilon) : |\varepsilon| < \eta\} \subset \mathcal M$ through $P$ with score $s$ at $\varepsilon = 0$, the function $\varepsilon \mapsto \Psi(P_s(\varepsilon))$ is differentiable at $\varepsilon = 0$, with derivative

$$\left. \frac{d}{d\varepsilon} \Psi(P_s(\varepsilon)) \right|_{\varepsilon = 0} = E_P\{ D^\star(P)(O) \, s(O) \};$$

$D^\star(P)$ is called the efficient influence function of $\Psi$ at $P$. A fluctuation with score $s$ is nothing but a parametric model containing $P$, parametrized by the real number $\varepsilon$, with score $s$. Such a fluctuation is, for instance, characterized by the likelihood ratio

$$\frac{dP_s(\varepsilon)}{dP}(O) = 1 + \varepsilon s(O)$$

for all $|\varepsilon| < \eta = \|s\|_\infty^{-1}$. This form of fluctuation plays a very important role in the implementation of TMLE. Although very general, other, more convenient forms are sometimes preferred on a case-by-case basis (as in the forthcoming excess risk example).

Consider then the excess risk example. The functional $\Psi \equiv \mathrm{ER}$ is pathwise differentiable at every $P \in \mathcal M$. Its efficient influence function satisfies, for every $P \in \mathcal M$,

$$D^\star(P) = D_W^\star(P) + D_Y^\star(P) \quad \text{with} \qquad (2.3)$$
$$D_W^\star(P)(O) = \bar Q(P)(1, W) - \bar Q(P)(0, W) - \mathrm{ER}(P),$$
$$D_Y^\star(P)(O) = \frac{2A - 1}{g(P)(A \mid W)} \left( Y - \bar Q(P)(A, W) \right),$$

where we write $\bar Q(P)(A, W) \equiv E_P(Y \mid A, W)$ and $g(P)(A \mid W) \equiv P(A \mid W)$. The components $D_W^\star(P)$ and $D_Y^\star(P)$ correspond, respectively, to the factors $P(W)$ and $P(Y \mid A, W)$ of the factorization $P(O) = P(W) \times P(A \mid W) \times P(Y \mid A, W)$ of the likelihood. One thus sees, by omission, that the component corresponding to the factor $P(A \mid W)$ is zero. Moreover, $D_W^\star(P)$ and $D_Y^\star(P)$ are orthogonal in $L_0^2(P)$. What does the efficient influence function $D^\star(P)$ teach us about $\Psi$? Notably, it brings out the importance of the treatment mechanism $g(P)$, even though the latter is absent from the definition of $\Psi(P) = \Psi(\bar Q(P), P_W)$. We will see how TMLE takes advantage of this information.

Let us now return to our general framework. Let $\Psi : \mathcal M \to \mathbb R$ be a functional that is pathwise differentiable at every $P \in \mathcal M$, with efficient influence function $D^\star(P)$ at $P \in \mathcal M$. Denote by $Q(P)$ the set of traits of $P$ required for the evaluation of $\Psi(P)$ (e.g., in the excess risk example, $Q(P) = (\bar Q(P), P_W)$) and by $g(P)$ the set of traits of $P$ that do not appear in $Q(P)$ but are present in the expression of $D^\star(P)$ (e.g., in the excess risk example, $g(P)$ is the treatment mechanism $P(A \mid W)$). One piece of information carried by the efficient influence function concerns the asymptotic variance of regular estimators of $\psi_0 = \Psi(P_0)$: by virtue of the convolution theorem (see Theorem 25.20 in [85]), the asymptotic variance of any regular estimator of $\psi_0 = \Psi(P_0)$ is bounded below by the variance $P_0 D^\star(P_0)^2 \equiv \mathrm{Var}_{P_0}\{D^\star(P_0)(O)\}$ of the efficient influence function. Moreover, it is along a fluctuation $\{P_s(\varepsilon) : |\varepsilon| < \eta\}$ with score $s = D^\star(P_0)$ that the estimation of $\psi_0$ is hardest; such a fluctuation is said to be least favorable.
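In a toy model where $W$, $A$ and $Y$ are all binary, the properties of (2.3) can be verified exactly by summation. The sketch below (Python, with entirely hypothetical numerical values) checks that the efficient influence function has mean zero under $P$ and computes its variance, the efficiency bound.

```python
# Exact check of the efficient influence function (2.3) in a toy model
# with binary W, A, Y; all numerical values are hypothetical.
pW = {0: 0.6, 1: 0.4}                        # marginal law of W
g  = {0: 0.3, 1: 0.7}                        # g(1 | w) = P(A=1 | W=w)
Q  = {(0, 0): 0.1, (1, 0): 0.4,              # Q(a, w) = P(Y=1 | A=a, W=w)
      (0, 1): 0.2, (1, 1): 0.9}

ER = sum(pW[w] * (Q[(1, w)] - Q[(0, w)]) for w in pW)

def D_star(w, a, y):
    DW = Q[(1, w)] - Q[(0, w)] - ER          # component for the factor P(W)
    ga = g[w] if a == 1 else 1.0 - g[w]
    DY = (2 * a - 1) / ga * (y - Q[(a, w)])  # component for P(Y | A, W)
    return DW + DY

mean = var = 0.0
for w in pW:
    for a in (0, 1):
        for y in (0, 1):
            ga = g[w] if a == 1 else 1.0 - g[w]
            py = Q[(a, w)] if y == 1 else 1.0 - Q[(a, w)]
            prob = pW[w] * ga * py           # P(O = (w, a, y))
            mean += prob * D_star(w, a, y)
            var  += prob * D_star(w, a, y) ** 2
print(mean, var)  # mean is zero up to float roundoff; var is the bound
```

That the mean vanishes is immediate: $D_W^\star(P)$ is centered by construction, and $E_P\{Y - \bar Q(P)(A, W) \mid A, W\} = 0$ centers $D_Y^\star(P)$.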


Fig. 2.1 – The TMLE procedure, illustrated. [The figure depicts the model $\mathcal{M}$, the true law $P_0$, the initial estimator $P^0_n$, the fluctuation $\{P^k_n(\varepsilon) : |\varepsilon| < \eta^k_n\}$ with update $P^k_n(\varepsilon^k_n) = P^{k+1}_n$ in the direction $D^\star(P^k_n)$, and the images $\Psi(P_0)$, $\Psi(P^0_n)$, $\Psi(P^{k+1}_n)$ under $\Psi$.]

The general TMLE procedure.

The TMLE procedure is an iterative estimation procedure, which we illustrate in Figure 2.1. Its first step consists in building initial estimators $Q^0_n$ and $g^0_n$ of $Q(P_0)$ and $g(P_0)$, relying on super-learning being recommended. Denote by $P^0_n \in \mathcal{M}$ a candidate law such that $Q(P^0_n) = Q^0_n$ and $g(P^0_n) = g^0_n$. To describe its second, iterative step, suppose that we already have $P^0_n, \ldots, P^k_n$ at hand and let us see how to build the next update $P^{k+1}_n$. To this end, introduce a fluctuation $\{P^k_n(\varepsilon) : |\varepsilon| < \eta^k_n\}$ of $P^k_n$ in the direction of $D^\star(P^k_n)$; its optimal parameter $\varepsilon^k_n$, obtained by minimizing the empirical loss, yields the update $P^{k+1}_n = P^k_n(\varepsilon^k_n)$.


Moreover, the update of $P^k_n$ into $P^{k+1}_n$ is often decomposed into a series of updates corresponding to the orthogonal components of the efficient influence function, as we shall also see in the excess risk example (and in Sections 2.3 and 2.4). Note finally that the illustration of Figure 2.1 does suggest that $P^0_n$ estimates $P_0$ better than $P^{k+1}_n$ does ($P^0_n$ is closer to $P_0$ than $P^{k+1}_n$ is, as is its vocation), whereas $\Psi(P^{k+1}_n)$ estimates $\psi_0 = \Psi(P_0)$ better than the initial estimator $\psi^0_n = \Psi(P^0_n)$ does (as is, likewise, its vocation).

TMLE versus estimating equations.

If one had to summarize as briefly as possible the general principle of inference based on estimating equations (studied in great generality, e.g., in [81]), let us say that it consists in first estimating the factors $Q(P_0)$ and $g(P_0)$ by $Q^0_n$ and $g^0_n$, then in seeking a solution $\psi_n$ of the equation (in $\psi$) $P_n D^\star(Q^0_n, g^0_n, \psi) = 0$, this solution $\psi_n$ being taken as an estimator of $\psi_0 = \Psi(P_0)$. Implicitly, such a procedure is applicable when the efficient influence function $D^\star(P)$ at $P \in \mathcal{M}$ takes the form of an estimating function $D^\star(Q(P), g(P), \psi(P))$ (as is, e.g., the case in the excess risk example).
Thus, the estimator $\psi_n$ does ultimately enjoy the desirable property $P_n D^\star(Q^0_n, g^0_n, \psi_n) \approx 0$, from which the desirable statistical properties already mentioned may follow (under assumptions more stringent than those required for TMLE).

The most striking contrast between the TMLE and estimating-equation principles is that the TMLE principle produces substitution estimators. It is by updating the initial estimates $Q^0_n$ and $g^0_n$ that one finally solves the equation $P_n D^\star(Q(P^{K_n}_n), g(P^{K_n}_n), \psi^*_n = \Psi(Q^{K_n}_n)) \approx 0$. Consequently, the TMLE $\psi^*_n = \Psi(P^{K_n}_n)$ necessarily satisfies all the constraints imposed by the definition of the functional $\Psi$. From a computational viewpoint, moreover, the TMLE principle rests on loss minimizations, which are numerically easier to carry out than solving equations (notably in terms of handling multiple solutions). Computing a TMLE benefits from this.

2.2.3 The TMLE procedure applied to the estimation of the excess risk

We now spell out the TMLE procedure for the estimation of the excess risk. Its first step consists in building initial estimators $\bar{Q}^0_n$ and $g^0_n$ of the regression function $\bar{Q}(P_0)$ and of the treatment mechanism $g(P_0)$, relying on super-learning being recommended. The marginal law of $W$ is estimated by its empirical version $P_{n,W}$. Denote by $P^0_n \in \mathcal{M}$ an element of the model such that $Q(P^0_n) \equiv (\bar{Q}(P^0_n), P^0_{n,W}) = (\bar{Q}^0_n, P_{n,W})$.
For its second step, introduce the "fluctuation" $\{\bar{Q}^0_n(\varepsilon) : \varepsilon \in \mathbb{R}\}$ characterized by
$$\bar{Q}^0_n(\varepsilon)(A,W) = \bar{Q}^0_n(A,W) + \varepsilon\,\frac{2A-1}{g^0_n(A|W)} \tag{2.5}$$
and associated with the loss $L(O, \bar{Q}) = (Y - \bar{Q}(A,W))^2$. Here we fluctuate the regression function directly rather than the law $P^0_n$. The choice of the covariate $\frac{2A-1}{g^0_n(A|W)}$ in (2.5) targets the component $D^\star_Y(P^0_n)$ of the efficient influence function $D^\star(P^0_n)$, in the sense that $\frac{\partial}{\partial\varepsilon} L(O, \bar{Q}^0_n(\varepsilon))\big|_{\varepsilon=0} = -2\,D^\star_Y(P^0_n)(O)$ is proportional to it. The fluctuation parameter $\varepsilon^0_n$ is the minimizer of the empirical loss, i.e., $\varepsilon^0_n = \arg\min_{\varepsilon \in \mathbb{R}} P_n L(\cdot, \bar{Q}^0_n(\varepsilon))$, which distorts $P^0_n$ into $P^1_n = P^0_n(\varepsilon^0_n)$. Following the terms of the general presentation of the TMLE principle, one might now expect to have to update $P^1_n$ in order to target the component $D^\star_W(P^0_n)$ of the efficient


influence function $D^\star(P^0_n)$. This is however unnecessary, insofar as such an update (introducing a fluctuation in the direction $D^\star_W(P^0_n)$ for the negative log-likelihood loss, then evaluating the distortion parameter by maximum likelihood) would remain, as is easily checked, without effect (i.e., the distortion parameter is zero). Since we do not update the estimator $g^0_n$, the iterative procedure here converges in one step. Thus $K_n = 1$, $P^1_n = P^*_n$, $P_n D^\star(P^*_n) = 0$ (exactly), and the TMLE of $\psi_0 = \Psi(P_0)$ satisfies
$$\psi^*_n \equiv \Psi(P^*_n) = \frac{1}{n}\sum_{i=1}^n \left( \bar{Q}^0_n(1,W_i) - \bar{Q}^0_n(0,W_i) + \frac{\varepsilon^0_n}{g^0_n(1|W_i)} + \frac{\varepsilon^0_n}{g^0_n(0|W_i)} \right).$$
Remark. We could just as well have developed the TMLE procedure for the negative log-likelihood loss. Not all losses are equivalent, however, and the freedom to choose one offers the opportunity to incorporate into the procedure constraints reflecting, for example, available knowledge about the true law $P_0$. Suppose for instance that we know that the main outcome $Y$ takes its values in the interval $[a,b]$, say $[a,b] = [0,1]$, implying in particular that $\psi_0 = \Psi(P_0) \in [-1,1]$. To take this into account, we may substitute for the fluctuation (2.5) the second fluctuation characterized by
$$\operatorname{logit} \bar{Q}^0_n(\varepsilon)(O) = \operatorname{logit} \bar{Q}^0_n(O) + \varepsilon\,\frac{2A-1}{g^0_n(A|W)}$$
and choose the logistic loss $L(O, \bar{Q}) = -Y \log(\bar{Q}(O)) - (1-Y)\log(1 - \bar{Q}(O))$.
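Returning to the linear fluctuation (2.5) and its closed-form one-step TMLE displayed above, here is a minimal numerical sketch; the simulation scheme, the misspecified initial regression `Qbar_bad`, and the (here, well-specified) treatment mechanism `g` are all illustrative assumptions.

```python
import numpy as np

def tmle_excess_risk(W, A, Y, Qbar, g):
    """One-step TMLE of the excess risk under the linear fluctuation (2.5),
    for hypothetical initial estimators Qbar of E(Y|A,W) and g of P(A|W)."""
    H = (2 * A - 1) / g(A, W)                  # "clever covariate" targeting D*_Y
    resid = Y - Qbar(A, W)
    eps = np.sum(H * resid) / np.sum(H ** 2)   # least-squares minimizer of P_n L
    # Targeted substitution estimator; the procedure converges in one step.
    return np.mean(Qbar(1, W) - Qbar(0, W)
                   + eps * (1.0 / g(1, W) + 1.0 / g(0, W)))

rng = np.random.default_rng(1)
n = 2000
W = rng.uniform(size=n)
g = lambda a, w: np.where(a == 1, 0.5 + 0.2 * w, 0.5 - 0.2 * w)
A = (rng.uniform(size=n) < g(1, W)).astype(int)
Y = 0.3 * A + 0.2 * W + rng.normal(scale=0.1, size=n)  # true excess risk: 0.3
Qbar_bad = lambda a, w: 0.2 * w + 0.1 * a              # misspecified initial fit
psi = tmle_excess_risk(W, A, Y, Qbar_bad, g)
```

Because the treatment mechanism is well specified, the targeting step corrects the bias of the misspecified initial regression, in line with the double robustness discussed above.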
It should be emphasized that this modification can be used to force the regression functions to take their values, for example, in $[\min_{i\le n} Y_i; \max_{i\le n} Y_i]$.

2.3 Probability of success of an in vitro fertilization program in France: a contribution to the DAIFI study [A9]

In this section I summarize my work [A9], which became a chapter of the monograph [82]. There, I apply the targeted minimum loss estimation methodology (presented in its general form in Section 2.2) to the estimation of the probability of success of an in vitro fertilization program of at most four cycles, based on the results of the French DAIFI survey.

I take this opportunity to thank my collaborators Jean Bouyer (Centre de Recherche en Epidémiologie et Santé des Populations, INED, INSERM and Université Paris Sud 11), Elise de la Rochebrochard (Centre de Recherche en Epidémiologie et Santé des Populations, INED, INSERM and Université Paris Sud 11), Susan Gruber (Department of Epidemiology, Harvard School of Public Health), Sherri Rose (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health) and Mark van der Laan (Division of Biostatistics, UC Berkeley) for enriching discussions and for the series of articles, soon to be finalized, that extend [A9].

2.3.1 Introduction

Between 9% and 15% of couples have difficulty conceiving a child, i.e., do not conceive within twelve months of trying [11].
In response to infertility, assisted reproduction techniques have been developed over the last thirty years, notably with the advent of in vitro fertilization (IVF) techniques. The first "test-tube baby" was thus


born in 1978, thanks to an IVF procedure led by Patrick Steptoe and Robert G. Edwards, winner of the 2010 Nobel Prize in Medicine for the development of IVF. Today, about 40,000 IVF cycles are performed each year in France (63,000 in the United States) [1]. Assisted reproduction techniques are known to be physically and psychologically demanding. Providing couples with the most adequate and precise quantification of their performance is therefore an important question, which we address in [A9].

How to quantify the performance of assisted reproduction techniques is, however, still debated. One could for instance express it in terms of the number of pregnancies or deliveries per IVF cycle. Since IVF programs often consist of several IVF cycles, we opt for an evaluation of the whole program.
Since in France the first four IVF cycles are fully covered by the national health insurance, we agree, following [74, 22, 23], that the most adequate quantification of the performance of an IVF procedure for French couples is the probability of giving birth (following an embryo transfer) during a program of at most four successive IVF cycles, in other words the probability of success of a program of at most four IVF cycles (or, more briefly still, the probability of success).

Estimating this quantity is a difficult task because some couples abandon the IVF program along the way. One may suspect that these couples have a lower probability of conceiving a child during the IVF program than the couples who do not abandon, so it would be ill-advised to discard them from the dataset; in statistical terms, it would be ill-advised to ignore this right-censoring phenomenon. Moreover, it seems plausible that baseline covariates (such as age or an assessment of fertility at the start of the IVF program) are predictive of the time of abandonment; in statistical terms, it is plausible that the right-censoring is informative.

2.3.2 Brief description of the data

Our contribution builds on the results of the French survey Devenir Après Interruption de la FIV (DAIFI) [74, 22, 23].
We thus have at our disposal approximately $n = 3{,}000$ independent copies of the generic observation
$$O = (L_0, A_0, L_1, A_1, L_2, A_2, Y = L_3)$$
where
– the vector of baseline covariates $L_0$ informs us about the IVF unit attended by the woman with whom $O$ is associated, her age at the start of the first IVF cycle, the number of embryos frozen or transferred during the first IVF cycle, and whether or not the first IVF cycle was successful;
– for every $1 \le j \le 3$, $A_{j-1} \in \{0,1\}$ informs us about the woman's possible abandonment following the $j$th IVF cycle (with the convention $A_{j-1} = 0$ indicating abandonment);
– for every $1 \le j \le 3$, $L_j \in \{0,1\}$ informs us about the possible success of the $(j+1)$th IVF cycle (with the convention $L_j = 1$ indicating a delivery).
We implicitly assume that the couple's socio-economic status plays no significant role (in other words, that it is not an unobserved confounder), that the woman's age at the start of the first IVF cycle alone is a faithful summary of her ages at the starts of the successive IVF cycles, and that the number of embryos frozen or transferred during the first IVF cycle is a sufficient indicator of the couple's fertility (in other words, that the numbers of embryos frozen or transferred during the successive IVF cycles are not time-dependent confounders, cf. the final remark on this point).


2.3.3 The TMLE procedure applied to the estimation of the probability of success of an IVF program in France

State of the art.

Three statistical approaches are undertaken in [74]. The first, acknowledged as naive, consists in estimating the probability of success by the number of deliveries divided by the number of women participating in the study. This approach, which neglects a great deal of information present in the data, yields a point estimate of 37% and a 95% confidence interval of [0.36; 0.38]. The second approach is a classical nonparametric survival analysis, which yields a point estimate of 52% and a 95% confidence interval of [0.49; 0.55]. Although much more solid than the first, this analysis neglects the baseline covariates and hence assumes that a woman's decision to abandon the IVF program does not rely on factors (such as, e.g., the baseline covariates) predictive of a possible future success. The third method involves the so-called multiple imputation methodology.
Resting on a parametric model (certainly misspecified, and therefore responsible for introducing bias), this method yields a point estimate of 46% and a 95% confidence interval of [0.44; 0.48].

Our ambition is to estimate the probability of success according to the TMLE principles, our Ariadne's thread for solving this delicate problem.

The TMLE procedure.

Denote by $P_0$ the true law of $O$, which we view as an element of the model $\mathcal{M} \equiv \mathcal{M}^{NP}$, chosen purely nonparametric in the absence of specific knowledge about $P_0$. The parameter of interest takes the form $\psi_0 = \Psi(P_0)$ for the functional $\Psi : \mathcal{M} \to \mathbb{R}$ characterized by
$$\Psi(P) = E_P\Bigg(\sum_{l_{1:2} \in \{0,1\}^2} P(Y = 1 \mid A_{0:2} = 1_{0:2}, L_{1:2} = l_{1:2}, L_0) \times P(L_2 = l_2 \mid A_{0:1} = 1_{0:1}, L_1 = l_1, L_0) \times P(L_1 = l_1 \mid A_0 = 1, L_0)\Bigg)$$
for every $P \in \mathcal{M}$ (with the convention $P(L_j = l_j \mid B) = 0$ when the conditioning event $B$ has probability zero). Two arguments offer a causal interpretation of $\Psi(P)$ (cf. Section 25.2 of [A9]), which appears, under certain assumptions, as the probability that a woman gives birth to a child during an IVF program consisting of at most four IVF cycles when she is made to follow the program in its entirety (i.e., cannot abandon the program along the way): in other words, $\Psi(P)$ is indeed the sought probability of success.
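For a known toy law $P$, the g-computation formula defining $\Psi(P)$ can be evaluated exactly; every probability below is a made-up stand-in for the corresponding conditional factor, not a DAIFI quantity.

```python
import itertools

# Hypothetical toy law P, with L0 binary for simplicity.
pL0 = {0: 0.6, 1: 0.4}                               # marginal law of L0
pL1 = lambda l0: 0.25 + 0.10 * l0                    # P(L1=1 | A0=1, L0=l0)
pL2 = lambda l1, l0: 0.20 + 0.10 * l1 + 0.05 * l0    # P(L2=1 | A_{0:1}=1, L1, L0)
pY = lambda l2, l1, l0: 0.15 + 0.10 * (l1 + l2) + 0.05 * l0  # P(Y=1 | ...)

def success_probability():
    """Evaluate Psi(P): sum over l_{1:2} in {0,1}^2 of the product of the
    three conditional factors, averaged over the law of L0."""
    psi = 0.0
    for l0, w0 in pL0.items():
        for l1, l2 in itertools.product((0, 1), repeat=2):
            p1 = pL1(l0) if l1 == 1 else 1.0 - pL1(l0)
            p2 = pL2(l1, l0) if l2 == 1 else 1.0 - pL2(l1, l0)
            psi += w0 * pY(l2, l1, l0) * p2 * p1
    return psi

psi = success_probability()
```

Since, for each value of $L_0$, the weights $p_1 p_2$ sum to one over $l_{1:2}$, the result is a weighted average of the conditional success probabilities, hence lies between their extreme values.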
It should be emphasized that even if one does not endorse the assumptions required to access this causal interpretation, $\Psi(P)$ remains a sound statistical parameter describing the effect of an intervention on the law $P$, and it comes as close as possible (given the structure of the generic observation $O$) to a measure of a causal effect.

We show that the functional $\Psi$ is pathwise differentiable and compute its efficient influence function $D^\star(P)$ at every $P \in \mathcal{M}$ (cf. Section 2.2 of this memoir and Proposition 25.1 in [A9]). The TMLE estimation procedure can therefore be carried out. Specifically, we estimate by super-learning the factors $Q(P_0)$ (consisting of the conditional laws of $L_1, L_2, L_3$


given their pasts, and of the marginal law of $L_0$, the latter estimated by its empirical version) and $g(P_0)$ (consisting of the conditional laws of $A_0, A_1, A_2$ given their pasts). We choose the negative log-likelihood loss and fluctuate separately (through simple logistic regression models) each component of the estimator $Q^0_n$ of $Q(P_0)$ in the direction of the corresponding component of the efficient influence function, proceeding backwards (i.e., fluctuating the conditional law of $L_3$, then that of $L_2$, then that of $L_1$, then that of $L_0$). By contrast, we do not fluctuate $g^0_n$, so that the TMLE procedure can be shown to converge in a single iteration (cf. Proposition 2.5 in [A9]).
In the end, updating $(Q^0_n, g^0_n)$ results in $(Q^*_n, g^*_n = g^0_n)$, and the TMLE of $\psi_0 = \Psi(P_0)$ is $\psi^*_n = \Psi(Q^*_n)$.

Theoretical study of the asymptotic properties of the TMLE.

Since the efficient influence function of $\Psi$ is doubly robust and $P_n D^\star(Q^*_n, g^*_n) = 0$, the TMLE $\psi^*_n$ enjoys remarkable asymptotic properties under certain typical assumptions. More specifically (cf. Proposition 25.3 in [A9]), if $Q^*_n$ and $g^*_n$ both converge to some $Q_1$ and $g_1$ in such a way that $Q_1$ or $g_1$ equals its true counterpart $Q(P_0)$ or $g(P_0)$; if the sequence $(D^\star(Q^*_n, g^*_n))$ belongs to a $P_0$-Donsker class with probability tending to 1; and if a second-order term involving the product of the deviations between $Q^*_n$ and $Q(P_0)$ on the one hand, and between $g^*_n$ and $g(P_0)$ on the other, is $O_P(n^{-1/2})$; then $\psi^*_n$ is consistent and asymptotically linear, hence asymptotically Gaussian. The limiting variance can be estimated by bootstrap. If one assumes that $(Q_1, g_1) = (Q(P_0), g(P_0))$, then the TMLE is efficient and its asymptotic variance is estimated by $P_n D^\star(Q^*_n, g^*_n)^2$. If one only assumes that $g_1 = g(P_0)$, and if $g^*_n$ is moreover obtained by maximum likelihood over a well-specified model, then it turns out that $P_n D^\star(Q^*_n, g^*_n)^2$ is a conservative estimator of the asymptotic variance of the TMLE.

Application.

In [A9] we check the properties of our estimator by simulations (cf. Section 25.4 in [A9]). The application to the real data of the DAIFI survey yields $\psi^*_n = 50.5\%$.
Using $P_n D^\star(Q^*_n, g^*_n)^2$ as an estimator of the asymptotic variance (or of an upper bound on it) leads to a 95% confidence interval of [0.48; 0.53]. Since we are not certain that the assumptions justifying this construction hold, we also build a 95% confidence interval by bootstrap, obtaining the interval [0.47; 0.54], wider than the previous one.

Thus, the TMLE is not significantly different from the estimator obtained by the nonparametric survival analysis carried out in [74]. Our contribution is nonetheless important, for it takes the baseline covariates into account in order to gain efficiency and to account for the informativeness of the censoring.

In conclusion, our analysis supports the practice of informing future participants in an IVF program of at most four cycles in France that about half of them should manage to conceive a child.

Remark. Extension. In work nearing completion, we extend the TMLE analysis presented above by including the numbers of embryos transferred or frozen at each IVF cycle. The inclusion of these variables, which may be time-dependent confounders, is a further step toward the estimation of an unimpeachable probability of success. Anecdotally, the resulting point estimate does not depart appreciably from the estimated 50% probability of success.
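A Wald-type interval of the kind used above can be sketched as follows; the value `var_ic` is a made-up number chosen only to yield an interval of the same order as the one reported, not the actual $P_n D^\star(Q^*_n, g^*_n)^2$ computed in [A9].

```python
import math

def wald_ci(psi_n, var_ic, n, z=1.96):
    """95% Wald-type interval psi_n +/- z * sqrt(var_ic / n), where var_ic
    plays the role of the influence-curve variance estimate P_n D*(Q*,g*)^2."""
    half = z * math.sqrt(var_ic / n)
    return psi_n - half, psi_n + half

# Hypothetical inputs of the same order as the DAIFI analysis (n ~ 3000).
lo, hi = wald_ci(psi_n=0.505, var_ic=0.49, n=3000)
```

With these assumed inputs the half-width is about 0.025, giving an interval of roughly [0.48, 0.53].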


2.4 A nonparametric measure of variable importance [A14]

In this section I present joint work with Pierre Neuvial (Laboratoire Statistique et Génome, Université d'Evry and CNRS) and Mark van der Laan (Division of Biostatistics, UC Berkeley) [A14]. There, we introduce a novel notion of nonparametric importance measure for a continuous variable, and we develop a targeted minimum loss estimation procedure for its inference. We illustrate the interest of this new importance measure with a relevant example: assessing, for each gene of a cancer cell, the influence of copy number on gene expression while taking methylation into account.

2.4.1 A novel, nonparametric measure of the importance of a continuous variable

Consider the following statistical problem: we observe a data point $O = (W, X, Y) \sim P_0$ associated with an experimental unit of interest, where $W$ is a vector of baseline covariates and $X, Y \in \mathbb{R}$ quantify an exposure and a response; the exposure $X$ has a reference level $x_0$ of strictly positive mass (i.e., $P_0(X = x_0) > 0$) and a continuum of other levels; the objective is to study the influence of $X$ on $Y$ while taking $W$ into account.

The difficulty stems from the fact that the exposure variable $X$ does not take a finite number of values (if it did, we could use classical variable importance measures such as the excess risk, cf. the example of Section 2.2).
Moreover, it is well known that there are situations where discretizing the exposure, far from being innocuous, has important consequences for the interpretability of the resulting parameters (cf., e.g., [75] for a discussion of the implicit assumptions allowing the interpretation of importance measures of a continuous exposure discretized beforehand). As for the possibility of taking $W$ into account, it is desirable when one knows (or cannot rule out the possibility) that it contains confounders, i.e., factors that simultaneously influence $X$ and $Y$. It should be emphasized that, even though we have used vocabulary from the semantic field of causality ("influence", "confounding"), the problem retains its interest in its version expressed in the semantic field of associations. We illustrate it with an example in which the experimental unit is a specific gene of a cancer cell, $W \in [0,1]$ quantifies its degree of methylation, $X \in \mathbb{R}$ is its DNA copy number (with $x_0 = 2$ as reference level, the expected number of copies in a healthy cell) and $Y \in \mathbb{R}$ is its expression level. The fact that we have no clear indication for interpreting $X$ as an exposure and $Y$ as a response poses no problem (cf. Section 2.4.3 for a fuller presentation of the problem).

We view $P_0$ as an element of the model $\mathcal{M}^{NP}$ of all candidate laws $P$ for $O$ such that $P(X = x_0) > 0$ and $P(X \neq x_0 \mid W) > 0$ almost surely.
Our nonparametric measure of the importance of $X$ on $Y$, taking $W$ into account, is defined via the functional $\Psi : \mathcal{M}^{NP} \to \mathbb{R}$ characterized by
$$\Psi(P) = \arg\min_{\beta \in \mathbb{R}} E_P\Big\{ \big( E_P(Y \mid X, W) - E_P(Y \mid X = x_0, W) - \beta (X - x_0) \big)^2 \Big\} \tag{2.6}$$
for every $P \in \mathcal{M}^{NP}$. Its name indicates that it belongs to the class of variable importance measures introduced in [78], although its case is not covered in that article. Naturally, one may substitute for the term $\beta(X - x_0)$ any expression $\beta f(X, x_0)$ such that $E_P\{f(X, x_0)^2\} > 0$ for every $P \in \mathcal{M}^{NP}$. Let us finally insist on the fact that we do not postulate a semiparametric model $Y = \beta(X - x_0) + \eta(W)$ for an unspecified nuisance parameter $\eta$.
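The minimizer in (2.6) has a closed form: it is the least-squares projection coefficient $E_P\{(E_P(Y|X,W) - E_P(Y|X=x_0,W))(X - x_0)\}/E_P\{(X - x_0)^2\}$. Here is a Monte Carlo sketch of this closed form, under a hypothetical regression `m` and an exposure mixing a point mass at $x_0$ with a continuum (all names and values below are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)
n, x0 = 10000, 0.0                      # hypothetical reference level x0
W = rng.uniform(size=n)
at_ref = rng.uniform(size=n) < 0.3      # point mass at x0, continuum elsewhere
X = np.where(at_ref, x0, rng.normal(loc=W, scale=1.0))

def m(x, w):
    """Hypothetical regression E_P(Y | X=x, W=w); the importance here is 2."""
    return 2.0 * (x - x0) + np.sin(3.0 * w)

def npvi(X, W, m, x0):
    """Closed-form minimizer of (2.6): least-squares projection of the
    contrast m(X, W) - m(x0, W) onto the direction X - x0."""
    contrast = m(X, W) - m(x0, W)
    d = X - x0
    return np.sum(contrast * d) / np.sum(d ** 2)

beta = npvi(X, W, m, x0)
```

In this construction the contrast is exactly linear in $X - x_0$, so the projection recovers the coefficient 2 regardless of the law of $(W, X)$.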


Remark. Comparison with the excess risk. The functionals $\Psi$ and $\mathrm{ER}$ (the excess risk defined in Section 2.2) are intimately related since, if $P(X \in \{x_0, x_1\}) = 1$ for some $x_1 \neq x_0$, then $\Psi(P) = E_P\{(E_P(Y \mid X = x_1, W) - E_P(Y \mid X = x_0, W))\, h(P)(W)\}$ with $h(P)(W) = P(X = x_1 \mid W)/E_P(X^2)$, i.e., the importance measure $\Psi(P)$ appears as a weighted version of the excess risk.

2.4.2 TMLE of the nonparametric variable importance measure

Sketch of the TMLE procedure.

We prove that the functional $\Psi$ of (2.6) is pathwise differentiable at every $P \in \mathcal{M}^{NP}$, with efficient influence function denoted $D^\star(P)$ (cf. Proposition 1 in [A14]). The study of $D^\star(P)$ teaches us that the relevant features of $P$ are $\Psi(P)$ (obviously), $E_P\{X^2\}$, $E_P(X \mid W)$, $P(X = x_0 \mid W)$ and $E_P(Y \mid X, W)$. Moreover, the efficient influence function is doubly robust: $P D^\star(P') = 0$ implies $\Psi(P') = \Psi(P)$ if $E_{P'}(Y \mid X = x_0, W) = E_P(Y \mid X = x_0, W)$, or if both $E_{P'}(X \mid W) = E_P(X \mid W)$ and $P'(X = x_0 \mid W) = P(X = x_0 \mid W)$.

A TMLE estimation procedure can thus be set up. We show how to fluctuate an initial estimator $P^0_n$ of $P_0$ (obtained by super-learning) in the direction of $D^\star(P^0_n)$, for the negative log-likelihood loss or for the logistic loss (cf. Section 2.2.3 for an example of use of the latter). The TMLE procedure does not converge in a single iteration here, and the estimator resulting from the previous fluctuation must be updated iteratively.

Remark.
Because we also have the implementation of the procedure in view, we devise fluctuations that may be called parsimonious, in the sense that the laws of the fluctuation put mass only on the independent copies of $O$ on which the inference is based (cf. Lemma 1 in [A14]). The computational gain is immense, for these parsimonious fluctuations make it possible to go from a complexity that is quadratic in computing time and in memory to one that is linear in computing time and constant in memory.

Theoretical study of the convergence of the iterative procedure.

We first address the convergence properties of the iterative procedure at the heart of the construction of the TMLE. Let $\{P^k_n\}$ then be the sequence of successive estimators of the true law $P_0$ obtained through successive fluctuations $\{P^k_n(\varepsilon) : |\varepsilon| < \eta^k_n\}$. We exhibit a strict-positivity condition bearing on the observed sample (cf. Lemma 5 in [A14]), more explicit than the general requirements, which guarantees the convergence of the iterative procedure.

By definition, the TMLE writes $\psi^*_n = \Psi(P^{K_n}_n)$ for a number of iterations $K_n$ such that $P_n D^\star(P^{K_n}_n) = O_P(1/\sqrt{n})$ (cf. Section 2.2.2). It is not rare that $K_n = 1$ suffices, as in the example of the TMLE for the excess risk (cf. Section 2.2.3; the TMLE procedures developed in Sections 2.3 and 3.2 also converge in a single iteration). It is nonetheless natural to wonder whether the sequences $\{P^k_n\}$ and $\{\Psi(P^k_n)\}$ converge in general.


In this respect, we prove that $\{P^k_n\}$ converges in total variation (and hence in law) provided $\{\varepsilon^k_n\}$ converges fast enough (it is required that $\sum_k |\varepsilon^k_n| < \infty$).


"real" data-generating mechanism, i.e., to the true law $P_0$ of the phenomenon of interest – our vision of nature is stochastic!) and targeted toward a small-dimensional quantity, for which we develop an inferential procedure with well-identified statistical properties.

To conclude, the generic observation associated with a fixed gene writes $O = (W, X, Y)$, where the methylation $W \in [0,1]$ is the proportion of methylated signal in the promoter region of the gene, the copy number $X \in \mathbb{R}$ is computed from a normalized (with respect to a reference sample) and smoothed (along the genome) version of the DNA copy number (statistically processed so that the expected copy number $x_0 = 2$ has positive mass), and the expression $Y$ is a unified expression level obtained by combining the expressions measured on three distinct experimental platforms (cf. Section 5 in [A14] for details).

Simulations.

We mimic the "real" data-generating mechanism so that our simulation scheme is as realistic as possible (in the example reported here, the trick consists in randomly perturbing two genuine observations, cf. Section 5 of [A14]). The reader is invited to judge the resemblance by inspecting Figure 2.2.

We independently simulate $B = 1000$ datasets of $n = 200$ observations each, and compute the TMLE $\psi^*_n$. We devise a stopping criterion for the iterative procedure, which most often leads to three updates of the initial estimator.
In Figure 2.3 we report a graph illustrating a few fundamental features of the TMLE procedure:
– the TMLE ψ*_n is robust: one sees how, from the first iteration on, targeting the parameter of interest corrects the biased initial estimate;
– the iterative TMLE procedure converges quickly: successive iterations do not noticeably affect (at least in distribution) the iterates ψᵏ_n = Ψ(Pᵏ_n) for k ≥ 1; besides, one typically observes sequences {εᵏ_n : 0 ≤ k ≤ 2} of successive optimal steps losing one order of magnitude at each iteration;
– the TMLE ψ*_n is asymptotically Gaussian: the shape of the (kernel-)estimated densities is visually Gaussian.

Moreover, estimating the asymptotic variance by P_n D*(P_n^{K_n})² (an estimate one may call optimistic, insofar as it is theoretically validated only under rather stringent conditions) yields confidence intervals (and testing procedures) which are empirically verified to achieve the desired level.

Remark. We developed the R package NPVI (still being polished) which implements the TMLE of the non-parametric measure of variable importance (2.6). The implementation includes the initial super-learning of the law of the observations, in line with the spirit of the TMLE procedure. Because it relies on the parsimonious procedure mentioned earlier, it is moreover computationally very fast.
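The variance estimate P_n D*(P_n^{K_n})² is simply the empirical mean of the squared influence-function values, from which a Wald-type confidence interval follows. A minimal sketch (the influence-function values below are made-up placeholders, not output of the NPVI code):

```python
from math import sqrt
from statistics import NormalDist

def wald_ci(psi_n, influence_values, alpha=0.05):
    """Wald-type (1 - alpha) confidence interval for the target parameter,
    using the 'optimistic' variance estimate: the empirical mean of the
    squared influence-function values evaluated at the fitted law."""
    n = len(influence_values)
    s2 = sum(d * d for d in influence_values) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile xi_{1 - alpha/2}
    half_width = z * sqrt(s2 / n)
    return psi_n - half_width, psi_n + half_width

# toy usage with made-up influence-function values
lo, hi = wald_ci(0.8, [0.5, -0.3, 0.7, -0.6, 0.1])
```

The interval is centered at the estimate, with half-width ξ_{1−α/2} s_n/√n, exactly as used in the empirical level checks reported above.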
Finally, it can handle high-dimensional covariates W (not only one-dimensional ones as in the study above), which will allow us, in a forthcoming study, to take into account the data of genes neighboring the gene of interest when evaluating the importance of its copy number on its expression.


Fig. 2.2 – Comparison of the TCGA dataset [76] for gene EGFR (left) and of the simulated dataset mimicking it (right). The diagonal panels show kernel density estimates, the lower panels show the pairs of data points, and the upper panels report the values of the Pearson correlation coefficients.

Fig. 2.3 – Illustration of the convergence and asymptotic properties of the TMLE procedure. We denote by ψᵏ_n = Ψ(Pᵏ_n) the k-th iterate of the TMLE procedure. The vertical line indicates the true value of the parameter ψ0 = Ψ(P0), the non-parametric measure of the importance of X on Y, accounting for W (2.6).


2.5 Application to the study of postural maintenance (2/2) [A11]

In this section I present a second set of results on the study of postural maintenance, obtained in collaboration with Christophe Denis (Laboratoire MAP5, Université Paris Descartes) [A11]. The first set of results, preceded by an introduction to the topic of postural maintenance, is presented in Section 1.4. The main objective here is the classification of subjects into the group of subjects qualified as "normal" or the group of subjects qualified as hemiplegic. Classification relies on a plug-in procedure, the regression function being estimated by aggregation of several estimators. Beforehand, we rank the experimental protocols from most to least informative, based on a criterion that we evaluate by targeted minimum loss estimation (see Section 2.2 for a general presentation of this estimation methodology).
This allows us to evaluate the performances of classification procedures which, by exploiting only the information produced by one, two, three, or four of the most informative protocols, do not necessarily require that a subject to be classified undergo all the experimental protocols (a possibly unpleasant ordeal for sensitive subjects).

2.5.1 Introduction

In a way, this second approach takes the opposite view of the first one (summarized in Section 1.4): instead of exploiting the trajectories in full (recall that there are 2,800 observation times in total), we build low-dimensional summaries of them, made of a set of directly interpretable quantities. Moreover, the objective is not quite the same: in this second approach we classify subjects as healthy or hemiplegic, rather than building and fitting a descriptive model and then exploiting it pragmatically for classification purposes. We unfortunately had to discard the vestibular subjects because we do not have their covariates (age, gender, laterality (i.e., left- or right-handed), height, and weight), whereas we believe it is important to incorporate these in this version of the study. Let us stress that, although the distinction between a healthy and a hemiplegic subject is relatively easy to make by visual examination, it is a priori not obvious on the basis of the covariates and trajectories alone.
This reduced example is also very instructive, since one is easily convinced that a notion of postural style, to make sense, should assign different styles to a healthy and to a hemiplegic subject.

2.5.2 Classification based on postural maintenance

Reduction of the raw data.

Here, in addition to the covariates, we thus only use the trajectories associated with the four protocols described in Table 2.1, for 32 subjects qualified as healthy and 22 subjects qualified as hemiplegic. We argue that it is in the (temporal) neighborhoods of the times at which stimulations start or end that one is most likely to observe discriminating features of the trajectories (B_t)_{t∈T}. As an illustration, it would be easy to guess between which times the stimulation phase extends from the example trajectory shown on the right of Figure 2.4 (hemiplegic subject, visual and muscular stimulation protocol). The task would be much harder from the example trajectory shown on the left of the same figure (same hemiplegic subject, visual stimulation)!


protocol | phase 1 (0→15 s)  | phase 2 (15→50 s)                  | phase 3 (50→70 s)
1        | no perturbation   | eyes closed                        | no perturbation
2        | no perturbation   | muscular stimulation               | no perturbation
3        | no perturbation   | eyes closed + muscular stimulation | no perturbation
4        | no perturbation   | optokinetic stimulation            | no perturbation

Tab. 2.1 – Description of the four protocols used in the second application to the study of postural maintenance (presented in Section 2.5). Each protocol is divided into three phases: a first phase without perturbation of postural maintenance is followed by a second one with perturbation, then by a third one again without perturbation. The protocols may be qualified as visual (eyes closed), proprioceptive (muscular stimulation) or vestibular (optokinetic stimulation), depending on which sensory acquisition system is perturbed.

Fig. 2.4 – Representations of the trajectories t ↦ C_t on T corresponding to two different protocols followed by the same hemiplegic subject (protocol 1 on the left, protocol 3 on the right).

Fig. 2.5 – Visual representation of the definition of (Δ11, Δ12, Δ22), a summary measure of (X_t)_{t∈T}. The four horizontal segments (solid lines) represent, from left to right, the means C̄1−, C̄1+, C̄2−, C̄2+ of (C_t)_{t∈T} over the intervals [10,15[, ]15,20], [45,50[, ]50,55]. The three vertical segments (solid lines ending with arrows) represent, from top to bottom, Δ11, Δ12, Δ22. Two additional vertical lines indicate the beginning and the end of the second phase of the protocol at hand.


We therefore decide to summarize a trajectory (X_t)_{t∈T} (through (B_t)_{t∈T}) notably via the statistics Δ11, Δ12, Δ22 defined by

(Δ11, Δ12, Δ22) = (C̄1+ − C̄1−, C̄2− − C̄1+, C̄2+ − C̄2−),

where

C̄1− = (δ/5) Σ_{t∈T∩[10,15[} C_t,  C̄1+ = (δ/5) Σ_{t∈T∩]15,20]} C_t,
C̄2− = (δ/5) Σ_{t∈T∩[45,50[} C_t,  C̄2+ = (δ/5) Σ_{t∈T∩]50,55]} C_t

are the means of C_t computed over the intervals [10,15[, ]15,20], [45,50[ and ]50,55] (i.e., over the last/first five seconds before/after the beginning/end of the second phase of the protocol at hand). Note that it is useless to introduce the differences C̄2− − C̄1− = Δ11 + Δ12, C̄2+ − C̄1+ = Δ12 + Δ22, C̄2+ − C̄1− = Δ11 + Δ12 + Δ22, since they are linear combinations of the previous ones. The reader may refer to Figure 2.5 for a visual representation of the above explanation.

The summary measures Δ11, Δ12, Δ22 pertain to the distance to a reference point. We enrich them with summary measures pertaining to orientation.
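The jump statistics just defined can be sketched as follows (a minimal sketch assuming the trajectory is given as a list `C` of sampled values with matching observation times `times`; since δ is the sampling step, the sums (δ/5)Σ C_t reduce to plain window means, and left-closed windows are used throughout for simplicity):

```python
def window_mean(C, times, lo, hi):
    """Mean of the trajectory C over {t in T : lo <= t < hi}."""
    vals = [c for c, t in zip(C, times) if lo <= t < hi]
    return sum(vals) / len(vals)

def summary_deltas(C, times):
    """(Delta_11, Delta_12, Delta_22): jumps of the windowed means of C_t
    around the start (t = 15 s) and end (t = 50 s) of the stimulation phase."""
    c1m = window_mean(C, times, 10, 15)   # last 5 s before stimulation starts
    c1p = window_mean(C, times, 15, 20)   # first 5 s after it starts
    c2m = window_mean(C, times, 45, 50)   # last 5 s before it ends
    c2p = window_mean(C, times, 50, 55)   # first 5 s after it ends
    return c1p - c1m, c2m - c1p, c2p - c2m

# idealized step trajectory sampled every 0.5 s over [0, 70] s
times = [i * 0.5 for i in range(141)]
C = [1.0 if 15 <= t < 50 else 0.0 for t in times]
deltas = summary_deltas(C, times)
```

On this idealized step trajectory, the three deltas recover the jump up at the start of the stimulation phase, the flat middle, and the jump down at its end.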
To this end we fit simple linear models y(B_t) = v·x(B_t) + u (where x(B_t) and y(B_t) are the abscissa and ordinate of B_t) to the data {B_t : t∈T∩[10,15[}, {B_t : t∈T∩[15,20[}, {B_t : t∈T∩[20,45[}, {B_t : t∈T∩[45,50[} and {B_t : t∈T∩[50,55[}, then exploit the estimated slopes as summary measures of a mean orientation over each interval.

Ranking the protocols from most to least informative.

In the end, we are brought back to a situation where the generic observation O writes

O = (W, A, Y¹, Y², Y³, Y⁴),

– W ∈ R × {0,1}² × R² being the vector of covariates;
– A ∈ {0,1} indicating the qualification of the subject (with the convention A = 1 for hemiplegic, A = 0 for healthy);
– Yʲ ∈ R⁸ being, for every j ∈ {1,2,3,4}, the summary measure associated with the j-th protocol.

The true law of O is denoted P0, seen as an element of the model M_NP in the absence of any indication on the nature of P0.

Our ultimate goal is the construction of a procedure for classifying subjects according to their healthy/hemiplegic qualification, based on the collected covariates and on the trajectories recorded under the four protocols described in Table 2.1. We add an extra difficulty to the exercise: since the experimental protocols can be rather unpleasant for some people (some subjects had to remain lying down for several minutes after their sessions), we wish to limit as much as possible the number of protocols involved in the classification procedure.

To this end, we start by ranking the protocols from most to least informative.
For this we rely on the functional Ψ : M_NP → R³², a collection of measures of the importance of A on the components of Y, controlling for W (see the excess risk example (2.2) in Section 2.2.1), such that for every P ∈ M_NP, Ψ(P) = (Ψʲ(P))_{1≤j≤4}, with

Ψʲ(P) = ( E_P{ E_P[Yʲᵢ | A = 1, W] − E_P[Yʲᵢ | A = 0, W] } )_{1≤i≤8}.
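For intuition, here is a naive plug-in counterpart of one component Ψʲᵢ under a working linear model (ordinary least squares standing in for the TMLE-based procedure actually used in [A11]; the data are made up): if E[Y | A, W] is linear in (A, W), then the fitted coefficient of A equals E{E[Y | A=1, W] − E[Y | A=0, W]}.

```python
import numpy as np

def naive_importance(W, A, Y):
    """Plug-in estimate of E{E[Y|A=1,W] - E[Y|A=0,W]} under a working linear
    model Y ~ 1 + A + W: the fitted coefficient of A is the estimate."""
    X = np.column_stack([np.ones(len(A)), A, W])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta[1]

# made-up data whose true importance parameter equals 2
rng = np.random.default_rng(0)
W = rng.normal(size=500)
A = rng.integers(0, 2, size=500).astype(float)
Y = 2.0 * A + 0.5 * W + rng.normal(scale=0.1, size=500)
est = naive_importance(W, A, Y)
```

Unlike this sketch, the TMLE is not tied to the correctness of such a working model.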


Heuristically, the farther from zero the parameter Ψʲᵢ(P0), the more information on A one may hope to obtain from observing W and the corresponding summary measure Yʲᵢ.

We build tests of the null hypotheses "Ψʲᵢ(P0) = 0" against "Ψʲᵢ(P0) ≠ 0" by relying on a TMLE-type procedure (for details, see Section 3.2 of [A11]). The test statistics are combined into scores attached to each protocol, from which the protocols are ranked from most to least informative. For the record, the statistical procedure elects the third protocol as the most informative, followed in order by the second, the first and the fourth.

Contribution to the elaboration of a notion of postural style (2/2).

Four classification procedures are finally devised, relying on the most informative protocol, the two most informative, the three most informative, and all protocols, respectively. The procedures are of plug-in type and involve super-learning of the regression function of A on its predictors (for details, see Section 3.3 of [A11]).

The various performances are evaluated according to the leave-one-out principle, on the real data and through a simulation study (see Sections 4 and 5 of [A11]).

In particular, by exploiting only the most informative of the protocols, we correctly classify 87% of the subjects (47 out of 54).
Interestingly, the performances of the other classifiers are worse.
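The leave-one-out principle used for these evaluations can be sketched generically as follows (with a toy 1-nearest-neighbor rule on made-up one-dimensional summaries standing in for the super-learning-based plug-in classifier of [A11]):

```python
def loo_accuracy(features, labels, classify):
    """Leave-one-out: train on all subjects but one, classify the held-out
    subject, and report the proportion of correct classifications."""
    correct = 0
    for i in range(len(labels)):
        train = [(x, y) for j, (x, y) in enumerate(zip(features, labels)) if j != i]
        correct += classify(train, features[i]) == labels[i]
    return correct / len(labels)

def nearest_neighbor(train, x):
    """Toy 1-nearest-neighbor rule standing in for the plug-in classifier."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# made-up summaries: healthy (0) clustered near 0, hemiplegic (1) near 5
features = [0.1, 0.3, -0.2, 0.0, 5.1, 4.8, 5.3, 4.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
acc = loo_accuracy(features, labels, nearest_neighbor)
```

Each subject is classified by a rule trained without that subject, which is what makes the reported 87% an honest estimate of out-of-sample performance.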


Chapter 3

Estimation and testing of variable importance and of causal parameters on experimental data

Randomized clinical trials, which produce so-called experimental data, are recognized as offering the most rigorous scientific framework for the demonstration and evaluation of causal effects (typically of a treatment on a disease). They differ from observational studies (such as those considered in Chapter 2) in that their investigators intervene in the stochastic process that creates the observations.

A third part of my work concerns the construction and asymptotic study of adaptive group-sequential designs for randomized clinical trials. By adaptive design, I mean here a randomized clinical trial design that allows the investigators to dynamically adapt its randomization based on the data accrued so far, without, of course, harming the statistical integrity of the trial. By group-sequential design, I allude on the one hand to the fact that adaptation may take place by blocks, and on the other hand to the fact that group-sequential testing procedures can perfectly well be exploited within such a framework.

Furthermore, I assume that the investigators specify, from the very start of the planning of the clinical trial, a certain criterion of interest, and that this criterion induces a notion of optimal randomization which one consequently wishes to target by adapting the initial randomization.
Thus, for instance, the choice of criterion could express the need to minimize the number of patients assigned to an arm of the trial which is not the most favorable arm for them, under statistical level and power constraints. Or the choice of criterion could express the need to obtain a result as early as possible, under statistical level and power constraints. In this second case the optimal randomization is known as the Neyman allocation [43, page 13]. It minimizes the asymptotic variance of the estimator of the parameter of interest. This is the randomization I use as an example in this chapter, the methodology applying more generally to a large class of valid criteria.

Adaptive randomization designs already have a long history (going back to the 1930s), and I refer the reader interested in a historical perspective to [43, Section 1.2], [47, Section 17.4] and [70]. Adaptive randomization designs are traditionally grouped according to whether the adaptation depends on:
– only the past assignments, in which case one speaks of restricted randomization;
– only the past assignments and responses, in which case one speaks of randomization


of the response-adaptive kind;
– only the past assignments and covariates, in which case one speaks of covariate-adjusted randomization;
– the whole past (covariates, assignments, responses), in which case one speaks of covariate-adjusted response-adaptive randomization, or CARA procedures.

In this chapter I present a set of results on the adaptation of randomized clinical trials of response-adaptive type [A7-A8] (see Section 3.1) or CARA type [A10, A13] (see Section 3.2). The articles [A7-A8, A10, A13] draw on, illustrate and extend the pioneering results obtained by van der Laan [80].

3.1 Targeting response-adaptive randomized clinical trials [A7-A8]

In this section I present a set of results obtained in collaboration with Mark van der Laan (Division of Biostatistics, UC Berkeley). They led to two publications [A7-A8]. We describe how to build response-adaptive randomized clinical trials and how to analyze their results, according to the maximum likelihood principle, in terms of estimation and of (group-sequential) testing. An exhaustive simulation study illustrates the theoretical asymptotic results.

3.1.1 Targeting the optimal design

In this section we are interested in the construction and statistical study of clinical trials whose (dynamic/adaptive) randomization mechanism does not take baseline covariates into account.
Let us first specify the statistical formalism in which we cast our study, before describing the adaptation and estimation strategies. These will rely on the maximum likelihood principle. We refer the reader to Section 1 of [A8] for a series of references on the competing method based on urn models.

Statistical formalism.

The generic observation writes O = (W, A, Y), W being a vector of baseline covariates, the variable A specifying the arm randomly assigned to the patient, and the variable Y the primary outcome of interest resulting from this assignment. For simplicity, say that two treatments are compared (so A ∈ {0,1}, e.g., for placebo/drug protocol) and that the primary outcome Y is binary (so Y ∈ {0,1}, e.g., for death/recovery; the study of a real-valued Y would be almost similar). Denote by Q = (Q_W, Q̄) the pair formed by the marginal law Q_W of W and the conditional expectation Q̄(A, W) of Y given (A, W). The typical parameters of interest, measuring the importance of the variable A on Y controlling for W, compare the quantities E_{Q_W}{Q̄(1,W)} and E_{Q_W}{Q̄(0,W)}, for instance on the additive scale (see the excess risk example of Section 2.2) or on the multiplicative scale, e.g., with the log relative risk parameter

Ψ(Q) = log [ E_{Q_W}{Q̄(1,W)} / E_{Q_W}{Q̄(0,W)} ],   (3.1)

our choice for this section.
In any case, the fact that the randomization mechanism (i.e., the conditional law of A given W) does not depend on W (A and W are independent) implies a notable simplification of the statistical problem, since under these conditions


E_{Q_W}{Q̄(a,W)} = E_{Q_W}(Q̄(A,W) | A = a) for every a ∈ {0,1} (this notation is rigorous, since by independence the conditional law of W given A coincides with the marginal law of W!). Consequently, the functional Ψ defined in (3.1) also satisfies here

Ψ(Q) = log [ E_Q(Y | A = 1) / E_Q(Y | A = 0) ].

We learn from this equality (reading it as an identifiability result) that in order to estimate Ψ(Q) it suffices to estimate the two mean values of the primary outcome Y in each arm, without needing to account for the baseline covariates. Since we are going to follow this approach, we redefine the generic observation as simply O = (A, Y) ∈ {0,1}²; we completely neglect the baseline covariates.

Remark. In Section 3.2, the targeted treatment mechanism will depend on the baseline covariates. It will therefore be necessary to introduce a working model for the conditional law of Y given (A, W). The (noticeably more delicate, as we shall see) game is worth the candle, since the gains in efficiency are potentially substantial (all the more so when W is a good predictor of Y).

Let us now postulate the existence of a so-called complete data X = (Y(0), Y(1)) ∈ {0,1}² containing the two counterfactual (or potential) outcomes Y(0) and Y(1) of the two treatments. Under these conditions, the generic observation O satisfies O = (A, Y(A)) and appears as a missing data structure on X, with censoring variable A.
The law P_X of the complete data X has as marginals two Bernoulli laws with parameters θ = (θ0, θ1) ∈ Θ ⊂ ]0,1[², where θ0 = E_{P_X}{Y(0)} and θ1 = E_{P_X}{Y(1)}. If A takes the value 1 with probability g(1) ∈ ]0,1[ and the value 0 with probability g(0) = 1 − g(1), then the likelihood of O writes θ_A^Y (1 − θ_A)^{1−Y} × g(A). For this reason, we will say in the sequel that, under these conditions, O is obtained under (θ, g). Moreover, one sees that the log relative risk parameter under θ satisfies

Ψ(θ) ≡ log [ E_θ(Y | A = 1) / E_θ(Y | A = 0) ] = log(θ1/θ0),

thereby highlighting its causal interpretation.

Remark. On the existence of the complete data. We assumed the existence of the complete data X to simplify the presentation and to emphasize the underlying causal interpretation, but could just as well have done without it. Moreover, one can show that it is always possible, at the cost of an enlargement of the probability space allowing one to draw new uniform random variables, and by resorting to adequate quantile-quantile transformations, to construct a complete data X as above (see Theorem 2.1 in [89] and [35]).

An elementary study teaches us that the functional Ψ is pathwise differentiable at every (θ, g), with efficient influence curve denoted D*(θ, g) (see Section 2.2 for the definition of pathwise differentiability).
Moreover, minimizing the function g ↦ Var_{θ,g} D*(θ,g)(O) teaches us that the optimal treatment mechanism G*(θ) that we wish to target (a mechanism known in the literature as the Neyman allocation [43, page 13]) satisfies

G*(θ)(1) = √(θ0(1−θ1)) / ( √(θ0(1−θ1)) + √(θ1(1−θ0)) ).

Finally, we assume that Θ is such that {G*(θ)(1) : θ ∈ Θ} ⊂ [δ, 1−δ] for some known δ > 0.
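The Neyman allocation above, together with the truncation to [δ, 1−δ] just assumed, can be transcribed directly (the θ values used below are toy values):

```python
import math

def neyman_allocation(theta0, theta1, delta=0.01):
    """Targeted optimal probability G*(theta)(1) of assigning arm 1,
    truncated to [delta, 1 - delta]."""
    num = math.sqrt(theta0 * (1.0 - theta1))
    den = num + math.sqrt(theta1 * (1.0 - theta0))
    return min(1.0 - delta, max(num / den, delta))

g_star = neyman_allocation(0.2, 0.8)  # toy theta = (0.2, 0.8)
```

Note that when θ0 = θ1 the allocation is balanced, and that it moves mass toward the arm whose mean outcome is harder to estimate.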


Adaptation and estimation strategies.

Let g_b be the balanced treatment mechanism, such that g_b(1) = g_b(0) = 1/2. The data-generating mechanism is initialized by drawing n0 i.i.d. copies O1, …, O_{n0} of O under (θ, g_b). Set g_{n0} = (g1, …, g_{n0}) with g_i = g_b for every i ≤ n0, and say that (O1, …, O_{n0}) was obtained under (θ, g_{n0}). We complete the characterization of the data-generating mechanism by recursion. To simplify the presentation, say that the update of the treatment mechanism is purely sequential (i.e., carried out each time a new observation is obtained; the generalization to the case where the update takes place each time c observations are collected is almost immediate). Assume that g_n = (g1, …, g_n) has been defined for n ≥ n0 and that we have at hand (O1, …, O_n) obtained under (θ, g_n). By virtue of the independence of the n associated complete data (X1, …, X_n), one sees (cf Section 3.2 of [A8]) that the likelihood of (O1, …, O_n) writes

∏_{i=1}^n θ_{A_i}^{Y_i} (1 − θ_{A_i})^{1−Y_i} × ∏_{i=1}^n g_i(A_i).

This remarkable factorization implies in particular that the maximum likelihood estimator of θ = (θ0, θ1) is simply the pair of within-arm empirical means,

θ_n = ( Σ_{i=1}^n Y_i(1−A_i) / Σ_{i=1}^n (1−A_i) , Σ_{i=1}^n Y_i A_i / Σ_{i=1}^n A_i )

(with the convention 0/0 = 1/2). We then characterize g_{n+1} by g_{n+1}(1) = 1 − g_{n+1}(0) = min(1−δ, max(G*(θ_n)(1), δ)), draw the complete data X_{n+1} = (Y_{n+1}(0), Y_{n+1}(1)) independently of (X1, …, X_n), draw A_{n+1} under g_{n+1} (conditionally on (O1, …, O_n)), then form the observation O_{n+1} = (A_{n+1}, Y_{n+1} = Y_{n+1}(A_{n+1})).
The collection (O1, …, O_{n+1}) is obtained under (θ, g_{n+1}), with g_{n+1} = (g1, …, g_n, g_{n+1}).

3.1.2 Asymptotic study of the maximum likelihood estimator

Theoretical study of the asymptotic properties.

We show that θ_n built under (θ, g_n) is a strongly consistent estimator of θ. We deduce by continuity arguments that the estimator ψ_n = Ψ(θ_n) of Ψ(θ) built under (θ, g_n) is itself strongly consistent, and that the sequence g_n of adaptive allocations converges almost surely to the targeted optimal treatment mechanism G*(θ) (see Theorem 1 in [A8]).

Moreover, we obtain an asymptotic linear expansion of θ_n under (θ, g_n) which induces (by the delta method) an asymptotic linear expansion of ψ_n = Ψ(θ_n), with influence function the efficient influence curve D*(θ, G*(θ)) (see Theorem 2 in [A8]; note that one also obtains in this way an asymptotic linear expansion of g_{n+1} = G*(θ_n)). Consequently, ψ_n = Ψ(θ_n) built under (θ, g_n) satisfies a central limit theorem, and it is an efficient estimator of Ψ(θ): its asymptotic variance equals the smallest variance attainable by a regular estimator, namely Var_{θ,G*(θ)} D*(θ, G*(θ))(O). Furthermore, this variance can be consistently estimated by s_n² = P_n D*(θ_n, G*(θ_n))² (as if the data were i.i.d.).

The proof of the consistency results relies in an essential way on a maximal inequality for discrete-time martingales (see Proposition A2 in [87], on which our Theorem 6 in [A8] depends).
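The purely sequential scheme above can be sketched end to end (an illustrative simulation with made-up θ = (0.3, 0.6), a burn-in of n0 balanced draws, and the truncated Neyman update; this mimics, but does not reproduce, the design of [A8]):

```python
import math
import random

def simulate_adaptive_trial(theta=(0.3, 0.6), n=2000, n0=50, delta=0.01, seed=1):
    """Fully sequential response-adaptive design: start balanced, then target
    the truncated Neyman allocation with the current MLE theta_n plugged in."""
    random.seed(seed)
    theta0, theta1 = theta
    s = [0, 0]   # sums of outcomes per arm
    c = [0, 0]   # counts per arm
    g1 = 0.5     # P(A = 1), balanced during the burn-in
    for i in range(n):
        a = 1 if random.random() < g1 else 0
        y = 1 if random.random() < (theta1 if a == 1 else theta0) else 0
        s[a] += y
        c[a] += 1
        if i + 1 >= n0:
            # MLE of (theta0, theta1), with the 0/0 = 1/2 convention
            t0 = s[0] / c[0] if c[0] else 0.5
            t1 = s[1] / c[1] if c[1] else 0.5
            num = math.sqrt(t0 * (1.0 - t1))
            den = num + math.sqrt(t1 * (1.0 - t0))
            g1 = min(1.0 - delta, max(num / den if den else 0.5, delta))
    t0 = s[0] / c[0] if c[0] else 0.5
    t1 = s[1] / c[1] if c[1] else 0.5
    psi_n = math.log(t1 / t0)  # log relative risk Psi(theta_n)
    return t0, t1, psi_n, g1

t0, t1, psi_n, g1 = simulate_adaptive_trial()
```

In this toy run, the final allocation g1 should hover around G*(θ)(1) ≈ 0.35 and ψ_n around log 2, consistent with the almost-sure convergence results stated above.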
The convergences in distribution follow from a central limit theorem for


Ciblage <strong>de</strong>s analyses cliniques réponses-adaptatives 55discrètes multivariées que nous déduisons <strong>de</strong> résultats classiques en invoquant le tour <strong>de</strong> Cramér-Wold (cf Theorem 8 dans [A8]).Ces résultats, déjà connus dans la littérature [43], justifient l’utilisation <strong>de</strong> [ψ n ±s n ξ 1−α/2 / √ n]comme intervalle <strong>de</strong> confiance <strong>de</strong> niveau asymptotique (1−α) pour Ψ(θ) sous (θ, g n ).Etu<strong>de</strong> <strong>de</strong> simulations <strong>de</strong>s propriétés asymptotique.Les résultats théoriques <strong>de</strong> convergence que nous venons d’évoquer sont illustrés dans unétu<strong>de</strong> <strong>de</strong> simulations que nous présentons dans [A7]. Cette étu<strong>de</strong> <strong>de</strong> simulations est exhaustivedans le sens où son schéma <strong>de</strong> simulation (décrit en détails dans la Section 3.1 <strong>de</strong> [A7]) considèreunε-réseau Θ 0 ⊂ Θ <strong>de</strong> 45 paramètres pour les lois marginales <strong>de</strong> la donnée complèteX = (Y (0),Y (1)) (on prendε= 110 ). 
For each value of $\theta \in \Theta_0$, three (independent) randomized clinical trials are simulated $B = 1000$ times (independently from one run to the next): two under the independent sampling schemes $(\theta, g^b)$ (balanced allocation) and $(\theta, G^\star(\theta))$ (Neyman optimal allocation, as targeted by our adaptive scheme), and one under the adaptive sampling scheme $(\theta, \mathbf{g}_n)$ (a fourth scheme, dubbed "aggressive", is also considered). The performance of each scheme is evaluated for sample sizes in $\{100, 250, 500, 1000, 2500, 5000\}$.

More specifically, we thus evaluate
– the validity of the central limit theorem: we conclude that the Gaussian limit is reached (uniformly over $\Theta_0$) for $n \ge 500$ under the adaptive sampling scheme $(\theta, \mathbf{g}_n)$ (for the record, the Gaussian limit is reached for $n \ge 750$ under $(\theta, g^b)$ and for $n \ge 500$ under $(\theta, G^\star(\theta))$ – see Section 3.2 in [A7]);
– the asymptotic validity of the confidence intervals: we conclude that the confidence intervals $[\psi_n \pm s_n \xi_{1-\alpha/2}/\sqrt{n}]$ do achieve level $(1-\alpha)$ (uniformly over $\Theta_0$) for $n \ge 500$ under the adaptive sampling scheme $(\theta, \mathbf{g}_n)$ (for the record, these intervals achieve level $(1-\alpha)$ for $n \ge 100$ under $(\theta, g^b)$ and under $(\theta, G^\star(\theta))$ – see Section 3.3 in [A7]);
– the asymptotic performance of the adaptive sampling scheme relative to the targeted optimal sampling scheme: we conclude that the confidence intervals obtained under the adaptive sampling scheme $(\theta, \mathbf{g}_n)$ are not stochastically wider than those obtained under the targeted optimal sampling scheme $(\theta, G^\star(\theta))$ (uniformly over $\Theta_0$) for every $n \ge 100$ (moreover, they are significantly narrower (uniformly over $\Theta_0$) than those obtained under the sampling scheme $(\theta, g^b)$ – see Section 3.4 in [A7]).

In summary, the simulations confirm the theory and teach us that the asymptotic regime may be considered reached under the adaptive sampling scheme (uniformly over $\Theta_0$, as far as the convergence properties of $\psi_n$ are concerned) for sample sizes $n \ge 500$.

Remark. To draw these conclusions, we had to develop a first ad hoc battery of multiple tests, described in Section A.3 of the preprint combining the results of [A7-A8].
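To fix ideas, the confidence interval $[\psi_n \pm s_n \xi_{1-\alpha/2}/\sqrt{n}]$ can be assembled from the point estimate and the influence-function values in a few lines. The following sketch is illustrative only (it is not the simulation code of [A7], and its inputs are hypothetical stand-ins for $\psi_n$ and the values $D^\star(\theta_n, G^\star(\theta_n))(O_i)$):

```python
import math
from statistics import NormalDist

def wald_ci(psi_n, ic_values, alpha=0.05):
    """Interval [psi_n +/- s_n * xi_{1-alpha/2} / sqrt(n)], where s_n^2 is
    the empirical mean of the squared influence-function values,
    computed as if the observations were i.i.d."""
    n = len(ic_values)
    s_n = math.sqrt(sum(d * d for d in ic_values) / n)
    xi = NormalDist().inv_cdf(1 - alpha / 2)  # standard normal quantile
    half_width = s_n * xi / math.sqrt(n)
    return psi_n - half_width, psi_n + half_width

# toy check: unit-variance influence values, n = 1000,
# so the half-width is xi_{0.975} / sqrt(1000)
lo, hi = wald_ci(0.0, [1.0, -1.0] * 500)
```

The same routine applies verbatim to the TMLE-based intervals of Section 3.2, only the influence-function values change.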


3.1.3 Asymptotic study of the group-sequential testing procedure

We are also interested in the design and theoretical validation (in an asymptotic framework of level and local power analysis) of a group-sequential testing procedure based on $\psi_n = \Psi(\theta_n)$.

The group-sequential testing procedure.

Fix a null hypothesis "$\Psi(\theta) = \psi_0$" to be tested against its alternative "$\Psi(\theta) > \psi_0$", an alternative parameter $\psi_1 > \psi_0$ used to tune the power, and target type I and type II errors $\alpha$ and $\beta$. The investigators also specify in the protocol of the clinical trial an integer $K \ge 2$, as well as $K$ proportions $0 < p_1 < \cdots < p_K = 1$.
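A group-sequential rule of this kind can be sketched in a few lines. The sketch below is a hedged illustration only: the interim statistics follow the form $T_k = \sqrt{n_k}\,(\psi_{n_k} - \psi_0)/s_{n_k}$ used in this section, but the rejection boundaries (a constant, Pocock-like value) are a hypothetical stand-in for whatever limits the protocol actually prescribes:

```python
import math

def group_sequential_decision(interim_data, psi0, boundaries):
    """K-look group-sequential rule: at look k, the statistic
    T_k = sqrt(n_k) * (psi_{n_k} - psi0) / s_{n_k} is compared to an upper
    boundary c_k; the trial stops and rejects the null at the first
    crossing. `interim_data` holds (n_k, psi_nk, s_nk) triples;
    `boundaries` is a hypothetical stand-in for the protocol's limits."""
    for k, ((n_k, psi_nk, s_nk), c_k) in enumerate(
            zip(interim_data, boundaries), start=1):
        t_k = math.sqrt(n_k) * (psi_nk - psi0) / s_nk
        if t_k >= c_k:
            return ("reject", k, t_k)
    return ("accept", len(interim_data), t_k)

# toy run: K = 2 looks, constant boundary 2.18; the second look crosses it
decision, look, stat = group_sequential_decision(
    [(100, 0.10, 1.0), (200, 0.20, 1.0)], psi0=0.0, boundaries=[2.18, 2.18])
```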


We study its asymptotic properties via a simplified procedure, relying on contiguity arguments and Le Cam's third lemma to derive the limits in law of the vector $(T_1,\dots,T_K) = (\sqrt{n_k}\,(\Psi(\theta_{n_k}) - \Psi(\theta))/s_{n_k},\ k \le K)$ under $(\theta, \mathbf{g}_n)$ and $(\theta_{h/\sqrt{n}}, \mathbf{g}_n)$.

Specifically, we obtain (see Theorem 3 in [A8]) that $(T_1,\dots,T_K)$ converges in both cases to a Gaussian law with covariance $C$, the limit law being centered when sampling under $(\theta, \mathbf{g}_n)$ and non-centered with mean $(h_1 - h_0)(\sqrt{p_1},\dots,\sqrt{p_K}) / \sqrt{\operatorname{Var}_{\theta, G^\star(\theta)} D^\star(\theta, G^\star(\theta))(O)} > 0$ when sampling under $(\theta_{h/\sqrt{n}}, \mathbf{g}_n)$. We argue that such a result validates (at least partially, because of the substitution of $n_k$ for $N_k$) the group-sequential testing procedure described earlier (see the discussion at the end of Section 5.2 in [A8]). Similar results were presented in [92] (with different proof methods).

Simulation study of the asymptotic level and local power.

The theoretical results concerning the group-sequential testing procedure just described are illustrated in the continuation of the simulation study conducted in [A7] (whose results pertaining to the convergence properties of the estimator $\psi_n$ were already summarized in Section 3.1.2).

For every $\vartheta \in \Theta_0$, we conduct six (independent) randomized clinical trials to test, $B = 1000$ times (independently from one run to the next), "$\Psi(\theta) = \Psi(\vartheta)$" against "$\Psi(\theta) > \Psi(\vartheta)$" at asymptotic level $\alpha = 5\%$, with a desired asymptotic power $(1-\beta) = 90\%$ at the alternative parameter $\Psi(\vartheta + (0, 0.05)) > \Psi(\vartheta)$. For each $\vartheta \in \Theta_0$,
– three randomized clinical trials are devoted to the study of the asymptotic level: two under the independent sampling schemes $(\vartheta, g^b)$ (balanced allocation) and $(\vartheta, G^\star(\vartheta))$ (Neyman optimal allocation, as targeted by our adaptive scheme), and one under the adaptive sampling scheme $(\vartheta, \mathbf{g}_n)$;
– three randomized clinical trials are devoted to the study of the asymptotic power: two under the independent sampling schemes $(\vartheta + (0,0.05), g^b)$ (balanced allocation) and $(\vartheta + (0,0.05), G^\star(\vartheta + (0,0.05)))$ (Neyman optimal allocation, as targeted by our adaptive scheme), and one under the adaptive sampling scheme $(\vartheta + (0,0.05), \mathbf{g}_n)$.
The details of the procedure are given in Section 4.1 of [A7].

More specifically, we thus evaluate
– the asymptotic level: we conclude that the guaranteed level is indeed $\alpha = 5\%$ (uniformly over $\Theta_0$) under the sampling scheme $(\vartheta, \mathbf{g}_n)$ (for the record, this also holds under $(\vartheta, g^b)$ and $(\vartheta, G^\star(\vartheta))$ – see Section 4.2 of [A7]);
– the asymptotic power: we conclude to a slight under-calibration in terms of power, with a guaranteed power equal to 89% (uniformly over $\Theta_0$) rather than $(1-\beta) = 90\%$ under the sampling scheme $(\vartheta + (0,0.05), \mathbf{g}_n)$ (for the record, the sampling scheme $(\vartheta + (0,0.05), g^b)$ suffers from the same defect, whereas the sampling scheme $(\vartheta + (0,0.05), G^\star(\vartheta + (0,0.05)))$ does guarantee the desired power – see Section 4.2 of [A7]);
– the gains in terms of the number of observations required to reach a decision: we conclude that the numbers of observations required to reach a decision under the adaptive sampling schemes $(\vartheta, \mathbf{g}_n)$ and $(\vartheta + (0,0.05), \mathbf{g}_n)$ are not stochastically larger than those required under the targeted optimal sampling schemes $(\vartheta, G^\star(\vartheta))$ and $(\vartheta + (0,0.05), G^\star(\vartheta + (0,0.05)))$ (uniformly over $\Theta_0$); moreover, they are significantly smaller (uniformly over $\Theta_0$) than those required under the sampling schemes $(\vartheta, g^b)$ and $(\vartheta + (0,0.05), g^b)$ – see Section 4.3 in [A7].

In summary, the simulations confirm the theory: they validate the group-sequential testing procedure (uniformly over $\Theta_0$) in terms of asymptotic level and power, while exhibiting a slight under-calibration of the latter (for our adaptive sampling scheme and for the balanced independent sampling scheme). Moreover, they illustrate the merit of targeting the optimal sampling scheme in terms of the number of observations required to reach a decision: the adaptive sampling scheme behaves like the targeted optimal independent scheme (uniformly over $\Theta_0$), with gains (relative to the balanced independent scheme) reaching up to 33% (under the null hypothesis) and 53% (under the alternative hypothesis).

Remark. To draw these conclusions, we had to develop a second ad hoc battery of multiple tests, described in Section A.4 of the preprint combining the results of [A7-A8].

3.2 Targeting clinical trials with response-adaptive, covariate-adjusted randomization [A10,A13]

In this section I present a body of results obtained in collaboration with Mark van der Laan (Division of Biostatistics, UC Berkeley). These results were published in [A10] and in the preprint [A13].
There we describe how to build CARA clinical trials (i.e., with covariate-adjusted, response-adaptive randomization) and how to analyze their results, following the TMLE principle of targeted loss minimization (see Section 2.2 for a presentation of this principle), in terms of estimation and of (group-sequential) testing. A simulation study illustrates the asymptotic theoretical results.

The literature devoted to CARA clinical trials is less abundant than that devoted to response-adaptive randomized clinical trials (the topic is younger), but it is growing quickly. The fundamental feature of the results we obtain in [A10,A13], relative to those obtained previously, e.g., in [69, 5, 2, 91, 90, 73, 92] (this is only a selection of particularly relevant articles; they typically study the almost sure convergence and convergence in law of the allocation probabilities and of the estimated parameters in a parametric model assumed to be correctly specified – though [73] is concerned with the study of a testing procedure), is that our results are robust to misspecification.

3.2.1 Statistical formalism and identification of the optimal scheme

The generic observation writes $O = (W, A, Y)$, where $W$ is the vector of baseline covariates, the variable $A$ specifies the arm randomly assigned to the patient, and the variable $Y$ quantifies the primary outcome of interest resulting from this assignment. As in Section 3.1, we consider for simplicity the situation where two treatments are compared ($A \in \{0,1\}$, e.g., placebo versus drug protocol). The primary outcome $Y$ is assumed here to be continuous (e.g., the measurement of a viral load at the end of treatment – the study of the case of a binary $Y$ is easily deduced from the one we are about to present). Let $Q = (Q_W, Q_{Y|A,W})$ denote the pair formed by the marginal law of $W$ and the conditional law $Q_{Y|A,W}$ of $Y$ given $(A,W)$, with $\bar{Q}$ the conditional expectation of $Y$ given $(A,W)$ under $Q_{Y|A,W}$. In this section we are interested in the excess risk as a measure of the importance of the variable $A$ on $Y$, controlling for $W$:
$$\Psi(Q) \equiv ER(Q) = E_{Q_W}\{\bar{Q}(1,W) - \bar{Q}(0,W)\}$$
(as defined in (2.2)). This parameter can be interpreted causally as soon as one is willing to assume the existence of a complete data structure $X = (W, Y(0), Y(1))$ (see the remark on the existence of the complete data in Section 2.2).

Assuming no specific knowledge about $Q_0$ (the true factor $Q$), we view $Q$ as a generic element of the nonparametric set $\mathcal{Q}$ of all possible values of $Q$. Denoting by $P$ a candidate law for $O$, we similarly view $P$ as a generic element of the nonparametric set $\mathcal{M} = \mathcal{M}^{NP}$ of all possible laws for $O$. The functional $\Psi$ is pathwise differentiable at every $P \in \mathcal{M}$. We know that the efficient influence function at $P \in \mathcal{M}$ depends only on $Q(P)$ and $g(P)$, and we therefore denote it $D^\star(Q,g)$ for every $P \equiv P_{Q,g} \in \mathcal{M}$ such that $Q(P) = Q$ and $g(P) = g$ (see Section 2.2 for the definition of pathwise differentiability and (2.3) for the value of $D^\star(Q,g)$).

Now, minimizing the function $g \mapsto \operatorname{Var}_{P_{Q,g}} D^\star(Q,g)(O)$ over the set of covariate-adjusted allocations teaches us that, for fixed $Q$, the treatment mechanism $g$ minimizing the smallest variance of any regular estimator of $\Psi(Q)$ (i.e., the Neyman allocation proper to our framework) is characterized by
$$g(1|W) = \frac{\sigma(Q)(1,W)}{\sigma(Q)(1,W) + \sigma(Q)(0,W)},$$
where $\sigma^2(Q)(A,W)$ is the conditional variance of $Y$ given $(A,W)$ under $Q$.

Targeting this treatment mechanism might well be too ambitious – mathematically so if $W$ is complex, and logistically in any event. We therefore advocate considering only those covariate-adjusted treatment mechanisms that depend on $W$ only through a discrete summary $V \in \{1,\dots,\nu\} = \mathcal{V}$ of $W$. For instance, $V$ could typically characterize membership in a subgroup of interest consisting of patients sharing the same gender and initial disease severity (possibly grouped by referral centers for a multi-center analysis). The same minimization as before (over the set $\mathcal{G}$ of $V$-adjusted allocations) identifies the optimal treatment mechanism $G^\star(Q)$ that we will target, characterized by
$$G^\star(Q)(1|V) = \frac{\bar{\sigma}(Q)(1,V)}{\bar{\sigma}(Q)(1,V) + \bar{\sigma}(Q)(0,V)}$$
with $\bar{\sigma}^2(Q)(a|v) = E_{Q_W}(\sigma^2(Q)(a,W) \mid V = v)$ for every $(a,v) \in \{0,1\} \times \mathcal{V}$.

3.2.2 Working model, adaptation strategy, and initialization of the estimation

The data-generating mechanism.

Let $g^b \in \mathcal{G}$ be the balanced ($V$-adjusted) treatment mechanism, such that $g^b(1|v) = g^b(0|v) = \tfrac12$ for every $v \in \mathcal{V}$. The data-generating mechanism is initialized by drawing $n_0$ i.i.d. copies $O_1,\dots,O_{n_0}$ of $O$ under $(Q_0, g^b)$. We assume moreover that $\sum_{i=1}^{n_0} \mathbf{1}\{(V_i, A_i) = (v,a)\} > 0$ for every $(a,v) \in \{0,1\} \times \mathcal{V}$.
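The $V$-adjusted optimal mechanism $G^\star(Q)$ identified above admits a direct computational transcription. The sketch below is a minimal illustration (the per-stratum variances are hypothetical placeholders for $\bar{\sigma}^2(Q)(a|v)$, which in practice would have to be estimated):

```python
import math

def neyman_allocation(sigma2_bar):
    """V-adjusted Neyman allocation: for each stratum v,
    G*(1|v) = sbar(1, v) / (sbar(1, v) + sbar(0, v)),
    where sbar(a, v) = sqrt(sigma2_bar[v][a]) stands in for the conditional
    standard deviation of Y in arm a within stratum v."""
    g_star = {}
    for v, (s2_arm0, s2_arm1) in sigma2_bar.items():
        s0, s1 = math.sqrt(s2_arm0), math.sqrt(s2_arm1)
        g_star[v] = s1 / (s0 + s1)  # probability of assigning arm 1 in stratum v
    return g_star

# hypothetical strata V = 1, 2, 3 with conditional variances (arm 0, arm 1):
g_star = neyman_allocation({1: (1.0, 4.0), 2: (1.0, 1.0), 3: (9.0, 1.0)})
```

As expected, the noisier arm within a stratum receives the larger allocation probability, which is what makes this allocation variance-minimizing.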
Set $\mathbf{g}_{n_0} = (g_1,\dots,g_{n_0})$ with $g_i = g^b$ for every $i \le n_0$, and say that $(O_1,\dots,O_{n_0})$ was drawn under $(Q_0, \mathbf{g}_{n_0})$. We complete the characterization of the data-generating mechanism by recursion. To simplify the exposition, say that the update of the treatment mechanism is purely sequential (i.e., performed each time a new observation is obtained; the generalization to the case where the update takes place each time $c$ observations are collected is almost immediate). Assume that $\mathbf{g}_n = (g_1,\dots,g_n)$ has been defined for $n \ge n_0$ and that $(O_1,\dots,O_n)$, drawn under $(Q_0, \mathbf{g}_n)$, is available. Exploiting the independence of the $n$ complete data structures associated with the $n$ observations, one sees (see Section 29.2 in [A10]) that the likelihood of $(O_1,\dots,O_n)$ under $(Q, \mathbf{g}_n)$ writes
$$\prod_{i=1}^n \{Q_W(W_i) \times Q_{Y|A,W}(O_i)\} \times \prod_{i=1}^n g_i(A_i|V_i), \qquad (3.2)$$
whose second factor is explicitly known to us.

Working model and initialization of the estimation.

Our adaptation strategy and estimation procedure rely on the choice of a working model $\mathcal{Q}^w_n \subset \mathcal{Q}$. The phrase "working model" echoes the fact that at no point do we assume that $\mathcal{Q}^w_n$ is correctly specified for $Q_{0,W}$ or $Q_{0,Y|A,W}$.

Thus let $m(A,W;\beta)$ be an (identifiable) linear combination of variables extracted from $(A,W)$, and $\theta = (\theta(1)^\top,\dots,\theta(\nu)^\top)^\top \in \Theta = \prod_{v=1}^{\nu} \Theta_v$ with $\theta(v) = (\beta_v, \sigma_v^2(0), \sigma_v^2(1))^\top \in \Theta_v \subset \mathbb{R}^b \times \mathbb{R}^*_+ \times \mathbb{R}^*_+$ for every $v \in \mathcal{V}$. We specifically choose the working model $\mathcal{Q}^w_n$ as
$$\mathcal{Q}^w_n = \{(Q_W, Q_{Y|A,W}(\theta)) \in \mathcal{Q} : Q_W = P_{n,W},\ \theta \in \Theta\},$$
where the law $Q_{Y|A,W}(\theta)$ is such that the conditional likelihood of $O$ under $Q_{Y|A,W}(\theta)$ writes
$$Q_{Y|A,W}(\theta)(O) = \frac{1}{\sqrt{2\pi\sigma_V^2(A)}} \exp\left\{-\frac{(Y - m(A,W;\beta_V))^2}{2\sigma_V^2(A)}\right\}.$$
In particular, the conditional expectation $\bar{Q}(\theta)$ of $Y$ given $(A,W)$ under $Q_{Y|A,W}(\theta)$ satisfies $\bar{Q}(\theta)(A,W) = m(A,W;\beta_V)$. The remarkable factorization (3.2) implies in particular that one may invoke the maximum likelihood principle.
Let $g^r \in \mathcal{G}$ be a reference ($V$-adjusted) treatment mechanism, for instance $g^r = g^b$: to it we associate the weighted maximum likelihood estimator
$$\theta_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^n \log Q_{Y|A,W}(\theta)(O_i)\, \frac{g^r(A_i|V_i)}{g_i(A_i|V_i)},$$
whose $v$th component $\theta_n(v)$ satisfies
$$\theta_n(v) = \arg\min_{\theta(v) \in \Theta_v} \sum_{i=1}^n \left(\log \sigma_v^2(A_i) + \frac{(Y_i - m(A_i,W_i;\beta_v))^2}{\sigma_v^2(A_i)}\right) \frac{g^r(A_i|V_i)}{g_i(A_i|V_i)}\, \mathbf{1}\{V_i = v\}$$
for every $v \in \mathcal{V}$. The initial estimator $Q^0_n = (P_{n,W}, Q_{Y|A,W}(\theta_n))$ of $Q_0$ induces an initial substitution estimator which writes
$$\psi^0_n \equiv \Psi(Q^0_n) = \frac{1}{n} \sum_{i=1}^n \bar{Q}(\theta_n)(1,W_i) - \bar{Q}(\theta_n)(0,W_i).$$
In the spirit of the TMLE principle, this initial estimator will serve as the anchor point for the construction of a targeted estimator of $\psi_0 = \Psi(Q_0)$.
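The weighted criterion above can be made concrete on a toy case. The sketch below is a hedged illustration, not the estimator of [A10,A13]: it assumes a hypothetical working mean $m(A,W;\beta) = \beta_0 + \beta_1 A$ within a single stratum, and takes the weights $g^r(A_i|V_i)/g_i(A_i|V_i)$ as given. Because this toy mean model is saturated in $A$, one closed-form pass is the exact joint minimizer of the weighted criterion:

```python
def weighted_mle_stratum(data, weights):
    """Weighted MLE within one stratum, for the toy working mean
    m(A, W; beta) = beta0 + beta1 * A: beta solves the weighted normal
    equations on the design (1, A), then sigma^2(a) is the weighted mean
    of the squared residuals in arm a."""
    sw = sa = saa = sy = say = 0.0
    for (a, y), w in zip(data, weights):
        sw += w
        sa += w * a
        saa += w * a * a
        sy += w * y
        say += w * a * y
    det = sw * saa - sa * sa
    beta0 = (saa * sy - sa * say) / det
    beta1 = (sw * say - sa * sy) / det
    sigma2 = {}
    for arm in (0, 1):
        num = den = 0.0
        for (a, y), w in zip(data, weights):
            if a == arm:
                r = y - (beta0 + beta1 * a)
                num += w * r * r
                den += w
        sigma2[arm] = num / den
    return (beta0, beta1), sigma2

# toy run: arm 0 outcomes {1, 3}, arm 1 outcomes {4, 8}, unit weights
(beta0, beta1), sigma2 = weighted_mle_stratum(
    [(0, 1.0), (0, 3.0), (1, 4.0), (1, 8.0)], [1.0] * 4)
```

With unit weights the fit reduces to per-arm means and residual variances; the inverse-probability weights only matter under adaptive sampling, where they undo the bias induced by the data-dependent allocation.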


Remark. Weighting the likelihood criterion is a technical twist which, by correcting the bias introduced by the dependence induced by the adaptation of the sampling scheme, makes the study of the asymptotic properties of $\theta_n$ easier. Moreover, the limit in probability of $\theta_n$ (which exists under mild assumptions) enjoys a natural interpretation as a Kullback-Leibler projection of $Q_0$ onto the working model. For further details, see the discussions in Section 29.4.1 of [A10].

Adaptation strategy.

It finally remains to explain how the $(n+1)$st observation is generated, thereby concluding the description of our adaptation strategy. Since the optimal treatment mechanism is $G^\star(Q_0)$, we target it (relative to the working model) by characterizing $g_{n+1}$ through
$$g_{n+1}(1|v) = \min(1-\delta, \max(G^\star(P_{n,W}, Q(\theta_n))(1|v), \delta))$$
for every $v \in \mathcal{V}$ (and a small threshold $\delta > 0$, say $\delta = 0.01$). The generation of $O_{n+1}$ then proceeds as follows: we draw the complete data structure $X_{n+1} = (W_{n+1}, Y_{n+1}(0), Y_{n+1}(1))$ independently of the $n$ previous complete data structures, we draw $A_{n+1}$ under $g_{n+1}(\cdot|V_{n+1})$ (conditionally on $(O_1,\dots,O_n)$ and $W_{n+1}$), then form the observation $O_{n+1} = (W_{n+1}, A_{n+1}, Y_{n+1} = Y_{n+1}(A_{n+1}))$. The collection $(O_1,\dots,O_{n+1})$ is drawn under $(Q_0, \mathbf{g}_{n+1})$, with $\mathbf{g}_{n+1} = (g_1,\dots,g_n,g_{n+1})$.

3.2.3 Construction of the TMLE

To build the TMLE $\psi^*_n$ of $\psi_0 = \Psi(Q_0)$, we choose here the negative log-likelihood loss and fluctuate only the conditional law of $Y$ given $(A,W)$ (as in the excess risk example on i.i.d. data, fluctuating the marginal law of $W$ would have no effect). To this end we introduce the fluctuation $\{Q^0_n(\varepsilon) : \varepsilon \in \mathbb{R}\} \subset \mathcal{Q}$ of $Q^0_n$ by setting, for every $\varepsilon \in \mathbb{R}$, $Q^0_n(\varepsilon) = (P_{n,W}, Q_{Y|A,W}(\theta_n, \varepsilon))$ with
$$Q_{Y|A,W}(\theta,\varepsilon)(O) = \frac{1}{\sqrt{2\pi\sigma_V^2(A)}} \exp\left\{-\frac{\left(Y - \bar{Q}(\theta)(A,W) - \varepsilon\, \frac{2A-1}{G^\star(\theta)(A|V)}\, \sigma_V^2(A)\right)^2}{2\sigma_V^2(A)}\right\}.$$
In particular, the conditional expectation $\bar{Q}(\theta,\varepsilon)(A,W)$ of $Y$ given $(A,W)$ clearly echoes (2.5). Next, we determine the optimal fluctuation parameter (by loss minimization, i.e., maximum likelihood, once again weighted to counter the effects of the dependence induced by the adaptation of the treatment mechanism):
$$\varepsilon_n = \arg\max_{\varepsilon \in E} \sum_{i=1}^n \log Q_{Y|A,W}(\theta_n,\varepsilon)(O_i)\, \frac{g_n(A_i|V_i)}{g_i(A_i|V_i)}.$$
One can argue (see Section 29.4.3 in [A10]) that the TMLE procedure converges in a single iteration. The TMLE of $\psi_0 = \Psi(Q_0)$ therefore finally writes
$$\psi^*_n \equiv \Psi(Q^0_n(\varepsilon_n)) = \frac{1}{n} \sum_{i=1}^n \bar{Q}(\theta_n,\varepsilon_n)(1,W_i) - \bar{Q}(\theta_n,\varepsilon_n)(0,W_i).$$
We summarize its asymptotic properties below.
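For this Gaussian fluctuation, maximizing the weighted log-likelihood in $\varepsilon$ has a closed form: it is a weighted least-squares regression of the residuals $Y_i - \bar{Q}(\theta_n)(A_i,W_i)$ on the "clever covariate" $(2A_i-1)\sigma_{V_i}^2(A_i)/G^\star(\theta_n)(A_i|V_i)$. The sketch below illustrates this targeting step on hypothetical toy inputs (a single stratum, a flat initial fit, balanced $G^\star$); it is an illustration of the mechanics, not the code of [A10,A13]:

```python
def targeting_step(obs, qbar, sigma2, g_star, weights):
    """Closed-form fluctuation parameter eps_n for the Gaussian working
    model: the clever covariate is
    H(A, V) = (2A - 1) * sigma2[(A, V)] / g_star[(A, V)],
    and maximizing the weighted log-likelihood in eps amounts to a
    weighted least-squares regression of Y - Qbar(A, W) on H."""
    num = den = 0.0
    for (w, v, a, y), wt in zip(obs, weights):
        h = (2 * a - 1) * sigma2[(a, v)] / g_star[(a, v)]
        r = y - qbar(a, w)
        num += wt * h * r / sigma2[(a, v)]
        den += wt * h * h / sigma2[(a, v)]
    eps = num / den

    def qbar_star(a, w, v):
        """Targeted conditional mean Qbar(theta, eps)(A, W)."""
        return qbar(a, w) + eps * (2 * a - 1) * sigma2[(a, v)] / g_star[(a, v)]
    return eps, qbar_star

# toy inputs (hypothetical): entries of obs are (W, V, A, Y)
obs = [(0.0, 1, 1, 2.0), (0.0, 1, 0, 0.0)]
sigma2 = {(0, 1): 1.0, (1, 1): 1.0}
g_star = {(0, 1): 0.5, (1, 1): 0.5}
eps_n, qb_star = targeting_step(obs, lambda a, w: 0.0, sigma2, g_star, [1.0, 1.0])
# substitution estimator: mean of qb_star(1, W_i, V_i) - qb_star(0, W_i, V_i)
psi_star = sum(qb_star(1, w, v) - qb_star(0, w, v) for w, v, _, _ in obs) / len(obs)
```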


3.2.4 Asymptotic study of the TMLE

Theoretical study of the asymptotic properties.

We show that the pair $(\theta_n, \varepsilon_n)$, built under $(Q_0, \mathbf{g}_n)$, converges in probability (under mild assumptions) to $(\theta_0, \varepsilon_0)$. Crucially, this convergence in probability induces the convergence in probability of the sequence of adapted treatment mechanisms $g_n$ to a limit mechanism $G^\star(Q_{0,W}, Q_{Y|A,W}(\theta_0))$ (whose proximity to the targeted optimal mechanism $G^\star(Q_0)$ depends on the proximity of $Q_0$ to the working model $\mathcal{Q}^w_n$; see Propositions 3 and 4 in [A13]). Exploiting the double robustness of the efficient influence function at the limit, $D^\star((Q_{0,W}, Q_{Y|A,W}(\theta_0,\varepsilon_0)), G^\star(Q_{0,W}, Q_{Y|A,W}(\theta_0)))$, we deduce that the TMLE $\psi^*_n$ is a consistent estimator of $\psi_0 = \Psi(Q_0)$ (even if the working model is misspecified; see Proposition 6 in [A13]).

We obtain moreover an asymptotic linear expansion of $(\theta_n, \varepsilon_n)$ (under mild assumptions, see Proposition 5 in [A13]), from which we deduce that the TMLE $\psi^*_n$ is asymptotically linear too, and hence that it satisfies a central limit theorem (see Proposition 7 in [A13]). This justifies using $[\psi^*_n \pm s_n \xi_{1-\alpha/2}/\sqrt{n}]$ as a confidence interval of asymptotic level $(1-\alpha)$ for $\psi_0 = \Psi(Q_0)$ under $(Q_0, \mathbf{g}_n)$.

The proofs of these results rely essentially on the maximal inequality for discrete martingales and the central limit theorem for multivariate discrete martingales already at the heart of the proofs of [A7-A8] (see Theorems 6 and 8 in [A8]), as well as on classical arguments from empirical process theory.

Simulation study of the asymptotic properties.

The theoretical convergence results just described are illustrated in a simulation study presented in [A10].

To this end we construct a $Q_0 \in \mathcal{Q}$ for a generic observation $O = (W,A,Y)$ whose vector of baseline covariates is $W = (U,V)$, with $U$ uniformly distributed on $[0,1]$ and $V \in \mathcal{V} = \{1,2,3\}$ such that $Q_{0,W}(V=1) = \tfrac12$, $Q_{0,W}(V=2) = \tfrac13$ and $Q_{0,W}(V=3) = \tfrac16$ (see Section 29.6 of [A10] for details). In particular, $\Psi(Q_0) = 1.264$.

Three independent randomized clinical trials are simulated $B = 1000$ times (independently from one run to the next): two under the independent sampling schemes $(Q_0, g^b)$ (balanced allocation) and $(Q_0, G^\star(Q_0))$ (Neyman optimal allocation, targeted by our adaptive scheme through its "projection" onto the working model), and one under the adaptive sampling scheme $(Q_0, \mathbf{g}_n)$. The performance of each scheme is evaluated for sample sizes in $\{100, 250, 500, 750, 1000, 2500, 5000\}$.

The working model is as specified in Section 3.2.3, with $m(A,W;\beta) = \beta_{V,1} + \beta_{V,2} U + \beta_{V,3} A$. Moreover, the working model is massively misspecified: $Q_{0,Y|A,W}$ is a gamma law rather than a Gaussian one, and the parametric forms of the conditional expectation and variance are misspecified as well.

More specifically, we thus evaluate
– the asymptotic validity of the confidence intervals: we conclude that the confidence intervals $[\psi^*_n \pm s_n \xi_{1-\alpha/2}/\sqrt{n}]$ do achieve level $(1-\alpha)$ for $n \ge 100$ under the adaptive sampling scheme $(Q_0, \mathbf{g}_n)$ (for the record, the corresponding intervals achieve level $(1-\alpha)$ for $n \ge 500$ under $(Q_0, g^b)$ and for $n \ge 250$ under $(Q_0, G^\star(Q_0))$ – see Section 29.6 of [A10]);


– the asymptotic performance of the adaptive sampling scheme relative to the targeted optimal sampling scheme: we conclude that the confidence intervals obtained under the adaptive sampling scheme $(Q_0, \mathbf{g}_n)$ are not stochastically wider than those obtained under the targeted optimal sampling scheme $(Q_0, G^\star(Q_0))$ for every $n \ge 100$ (moreover, they are significantly narrower than those obtained under the sampling scheme $(Q_0, g^b)$, with an average gain of 12% – see Section 29.6 of [A10]).

In summary, the simulations confirm the theory and teach us that the asymptotic regime may be considered reached under the adaptive sampling scheme (as far as the convergence properties of $\psi^*_n$ are concerned) for sample sizes $n \ge 100$.

3.2.5 Asymptotic study of the group-sequential testing procedure based on the TMLE

We are also interested in the design and theoretical validation (in an asymptotic framework of level and local power analysis) of a group-sequential testing procedure based on the TMLE $\psi^*_n$.

Theoretical study of the asymptotic level and local power.

The description of this group-sequential testing procedure is rigorously identical to that of Section 3.1.3 (see Section 6.1 of [A13]). We study its asymptotic properties via the same simplified procedure as the one introduced in Section 3.1.3. Resorting once again to contiguity arguments and to Le Cam's third lemma, we obtain a theoretical validation in a local power framework (see Section 6.2 of [A13]).

Simulation study of the asymptotic level and local power.

The theoretical results concerning the group-sequential testing procedure just described are illustrated in the continuation of the simulation study conducted in [A13] (whose results pertaining to the convergence properties of the estimator $\psi^*_n$ were already summarized in Section 3.2.4).

We thus conduct six (independent) randomized clinical trials to test, $B = 1000$ times (independently from one run to the next), "$\Psi(Q) = \Psi(Q_0)$" against "$\Psi(Q) > \Psi(Q_0)$" at asymptotic level $\alpha = 5\%$, with a desired asymptotic power $(1-\beta) = 90\%$ at the alternative parameter $\Psi(Q_0) + 0.4$ (recall that $\Psi(Q_0) = 1.264$). Of these six randomized clinical trials:
– three are devoted to the study of the asymptotic level: two under the independent sampling schemes $(Q_0, g^b)$ (balanced allocation) and $(Q_0, G^\star(Q_0))$ (Neyman optimal allocation, as targeted by our adaptive scheme), and one under the adaptive sampling scheme $(Q_0, \mathbf{g}_n)$;
– three are devoted to the study of the asymptotic power: two under the independent sampling schemes $(Q_1, g^b)$ (balanced allocation) and $(Q_1, G^\star(Q_1))$ (Neyman optimal allocation, as targeted by our adaptive scheme), and one under the adaptive sampling scheme $(Q_1, \mathbf{g}_n)$ – $Q_1$ is such that $\Psi(Q_1) = \Psi(Q_0) + 0.4$.
The details of the procedure are given in Section 8.2 of [A13].

More specifically, we thus evaluate


– the asymptotic level:
we conclude that the guaranteed level is indeed α = 5% under the sampling scheme (Q_0, g_n) (for the record, the same holds under (Q_0, g_b) and (Q_0, G*(Q_0)); see Section 8.2 of [A13]);
– the asymptotic power:
we conclude that the power is slightly under-calibrated, the guaranteed power being 88% rather than (1−β) = 90% under the sampling scheme (Q_1, g_n) (for the record, the sampling schemes (Q_1, g_b) and (Q_1, G*(Q_1)) suffer from the same defect; see Section 8.2 of [A13]);
– the gains in terms of the number of observations required to reach a decision:
we conclude that the numbers of observations required to reach a decision under the adaptive sampling schemes (Q_0, g_n) and (Q_1, g_n) are significantly stochastically smaller than those required under the sampling schemes (Q_0, g_b) and (Q_1, g_b); they are, on the other hand, stochastically larger than those required under the targeted optimal sampling schemes (Q_0, G*(Q_0)) and (Q_1, G*(Q_1)); see Section 8.3 of [A13].
In summary, the simulations confirm the theory: they validate the group-sequential testing procedure in terms of asymptotic level and power, while revealing a slight under-calibration of the latter (for our adaptive sampling scheme).
Moreover, they illustrate the merit of targeting the optimal sampling scheme in terms of the number of observations required to reach a decision: the adaptive sampling scheme performs better than the balanced independent sampling scheme (with an average gain of about 16%), and only slightly worse than the targeted optimal independent sampling scheme (with an average loss of about 6%).
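To give a concrete, if very simplified, sense of why targeting the Neyman allocation shortens trials, here is a minimal sketch in Python. It is emphatically not the procedure of [A13] (which targets a covariate-adjusted parameter and embeds the test in a group-sequential design): it merely assumes two arms with Gaussian outcomes whose means and standard deviations are hypothetical, approximates the Neyman allocation g*(1) = σ_1/(σ_0 + σ_1) by plugging in empirical standard deviations computed on past observations, and compares the resulting standard error of the difference-in-means estimator with the one obtained under balanced allocation.

```python
# Illustrative sketch only, not the procedure of [A13]. All numerical values
# (arm means, standard deviations, sample size, number of repetitions) are
# hypothetical.
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0 = 0.0, 1.0   # control arm (hypothetical)
mu1, sigma1 = 0.4, 3.0   # treatment arm (hypothetical)

def run_trial(n, allocation):
    """Accrue n patients, return the standard error of the effect estimate."""
    y0, y1 = [], []
    for _ in range(n):
        if allocation == "balanced":
            g1 = 0.5
        else:  # "adaptive": plug-in Neyman allocation g*(1) = s1 / (s0 + s1)
            s0 = np.std(y0) if len(y0) > 1 else 1.0
            s1 = np.std(y1) if len(y1) > 1 else 1.0
            g1 = s1 / (s0 + s1)
        if rng.random() < g1:
            y1.append(rng.normal(mu1, sigma1))
        else:
            y0.append(rng.normal(mu0, sigma0))
    # Standard error of the difference-in-means estimator of the effect.
    return np.sqrt(np.var(y1, ddof=1) / len(y1) + np.var(y0, ddof=1) / len(y0))

# Average standard error over repeated trials: a smaller standard error means
# fewer observations are needed to reach a decision at a given level and power.
ses = {a: np.mean([run_trial(400, a) for _ in range(50)])
       for a in ("balanced", "adaptive")}
print(ses)  # the adaptive scheme yields the smaller average standard error
```

Under these hypothetical parameters the Neyman allocation sends about three quarters of the patients to the more variable arm; the adaptive scheme, which learns the standard deviations on the fly, ends up close to that optimum, which is the qualitative behavior observed in the simulations of [A13].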


Perspectives

In this final section I give a brief overview of my research perspectives. They are naturally organized along the structure of this memoir.

Estimation and testing of the order of a distribution.
Two projects of a theoretical nature belong here:
– I have started studying a likelihood-based cross-validation procedure for estimating the order of a distribution. Although this principle is, in a sense, self-penalized, it turns out that a penalty is nevertheless required for the resulting estimator to enjoy good asymptotic properties. The nature of the penalty depends on the geometry of the models. The remarkable recent progress on this question [87, 33] suggests that a minimal penalty could be identified precisely.
– The second project perhaps does not quite belong in this section: Cristina Butucea (Laboratoire d'Analyse et de Mathématiques Appliquées, Université Paris-Est Marne-la-Vallée) and I wish to address the estimation of the intrinsic dimension of a distribution (estimating the order of a distribution as considered in Chapter 1 indeed amounts to estimating the dimension of the smallest parametric model containing that distribution). Now, the intrinsic dimension may well be expressible (at least approximately) as a functional of the law of the k-nearest neighbors of an n-sample.
If that were indeed the case, and if this functional turned out to be pathwise differentiable, then we could develop a version of the TMLE principle for its estimation.

Estimation of variable importance and of causal parameters (observational setting).
I wish to pursue my effort to disseminate and refine innovative statistical methods applied to epidemiology, by cultivating the fruitful relationships I have built with epidemiologist and biostatistician colleagues. A number of projects are already under way (setting aside the continuation of my contribution to the DAIFI study, which is almost finalized and was already discussed in Section 2.3). In particular:
– In collaboration with Isabelle Morlais (Institut de recherche pour le développement) and Wilson Toussile (Centre de Recherche en Epidémiologie et Santé des Populations, Université Paris Sud 11), we are developing and applying a TMLE-type procedure for the identification of alleles (via the estimation and testing of a collection of suitable variable importance measures) whose presence or absence could explain a greater susceptibility to Plasmodium falciparum, the parasite that causes malaria in humans. This study relies on a dataset collected in Cameroon under Isabelle Morlais's supervision.


– In collaboration with Jean Bouyer (Centre de Recherche en Epidémiologie et Santé des Populations, INED, INSERM and Université Paris Sud 11), Alice Gueguen (Centre Epidémiologie et santé des populations, INSERM) and Gaëlle Santin (Institut de veille sanitaire), we wish to develop a procedure for the epidemiological surveillance of occupational risks by crossing (i) longitudinal data collected at the scale of the French working population (within the Coset program, an acronym for "Cohortes pour la surveillance épidémiologique en lien avec le travail") with (ii) the information provided by medico-administrative databases. Two of the challenges are, on the one hand, to correct the censoring bias (induced by the non-response of some of the people surveyed) and the selection bias and, on the other hand, to reasonably control the multiplicity of estimations and tests. I advocate defining ad hoc causal parameters and estimating them by a TMLE procedure (weighted, so as to correct the selection bias), any gain in efficiency being a guarantee of greater sensitivity.
Furthermore, I have started to consider, or wish to tackle, a few more theoretical questions in order to better understand or exploit the performance of the TMLE procedure.
In particular:
– I wish to explore the links that may exist between stochastic approximation methods and the TMLE procedure.
– In collaboration with Wenjing Zheng (Division of Biostatistics, UC Berkeley) and Mark van der Laan (Division of Biostatistics, UC Berkeley), we have started a theoretical study of the impact of the initialization step of TMLE on the performance of the final estimator; two of the objectives are to explain (i) why practice suggests that involving random forests in the initial aggregation procedure harms the performance of TMLE, and (ii) to what extent the cross-validated version of the TMLE procedure is immune to this nuisance.
– With Cristina Butucea (Laboratoire d'Analyse et de Mathématiques Appliquées, Université Paris-Est Marne-la-Vallée), we would like to see to what extent the TMLE principle could correct a plug-in estimator of the entropy of a density.
A preliminary step would be to consider the case of the 2-norm of a density: in that case, plug-in estimators based on a kernel estimator of the density are known to perform poorly, but it is also known that they can be corrected so as to achieve the parametric rate and efficiency under minimal regularity assumptions on the true density.
Finally, as a follow-up to the INSERM workshop on recent methods for the statistical analysis of causal questions that I organized with Michel Chavance (Centre de Recherche en Epidémiologie et Santé des Populations, INSERM and Université Paris Sud 11),
– Isabelle Drouet (Institut d'histoire et de philosophie des sciences et des techniques), my colleague Jean-Christophe Thalabard (Laboratoire MAP5, Université Paris Descartes) and I are finalizing an epistemological article on causality. In this jointly written article, the three of us cross our views, as philosopher, physician and statisticians, on causality, a notion at once very concrete and elusive.

Estimation and testing of variable importance and of causal parameters (experimental setting).
Because I wish adaptive designs of randomized clinical trials to become popular, a number of practical questions neglected in my theoretical work must be considered:
– Following a meeting with Sylvie Chevret (Université Paris Diderot – Paris 7 and INSERM) and Raphaël Porcher (Laboratoire Biostatistique et Epidémiologie clinique, Université Paris Diderot – Paris 7 and INSERM), we have identified logistical constraints that an adaptive design necessarily faces; we would like to study (most likely by simulations) how theoretical adaptive procedures (as described in Chapter 3) are affected by these constraints.
I also wish to pursue the theoretical exploration of the properties of adaptive designs. Notably:
– With Michael Rosenblum (Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health), we have started studying the theoretical properties of an adaptive design of randomized clinical trials combining (i) the adaptation of the randomization scheme (as described in Section 3.2) and (ii) the adaptive selection of the tested hypotheses.
– With Aurélien Garivier (Laboratoire Traitement et Communication de l'Information, CNRS and Télécom ParisTech), I plan to draw a theoretical parallel between bandit methods and the prediction of individual sequences by a group of experts on the one hand, and adaptive designs (as presented in Chapter 3) on the other. Much can be expected from this.


List of publications

(A1) A. Chambaz. Detecting abrupt changes in random fields, ESAIM:P&S, 6:189–209 (2002)
(A2) A. Chambaz. Testing the order of a model, Ann. Statist., 34(3):1166–1203 (2006)
(A3) A. Chambaz and J. Rousseau. Bounds for Bayesian order identification with application to mixtures, Ann. Statist., 36(2):938–962 (2008)
(A4) A. Chambaz and C. Matias. Number of hidden states and memory: a joint order estimation problem for Markov chains with Markov regime, ESAIM:P&S, 13:38–50 (2009)
(A5) A. Chambaz, A. Garivier and E. Gassiat. A minimum description length approach to hidden Markov models with Poisson and Gaussian emissions. Application to order identification, J. Statist. Plann. Inference, 139(3):962–977 (2009)
(A6) A. Chambaz, I. Bonan and P-P. Vidal. Deux modèles de Markov caché pour processus multiples et leur contribution à l'élaboration d'une notion de style postural, Journal de la SFdS, 150(1), 28 pages (2009)
(A7) A. Chambaz and M. J. van der Laan. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study, Int. J. Biostat., 7(1), Article 11 (2011)
(A8) A. Chambaz and M. J. van der Laan. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study, Int. J. Biostat., 7(1), Article 10 (2011)
(A9) A. Chambaz. Probability of success of an in vitro fertilization programme, chapter 25 in Targeted Learning: Causal Inference for Observational and Experimental Data, S. Rose and M. J. van der Laan. Springer (2011)
(A10) A. Chambaz and M. J. van der Laan. TMLE in adaptive group sequential covariate-adjusted RCTs, chapter 29 in Targeted Learning: Causal Inference for Observational and Experimental Data, S. Rose and M. J. van der Laan. Springer (2011)
(A11) A. Chambaz and C. Denis. Classification in postural style, Preprint MAP5_2011_09, HAL reference HAL-00576070 (submitted) (2011)
(A12) A. Chambaz, D. Choudat, C. Huber, J-C. Pairon and M. J. van der Laan. Threshold regression models adapted to case-control studies, and the risk of lung cancer due to occupational exposure to asbestos in France, Preprint MAP5_2011_10, HAL reference HAL-00577883 (submitted) (2011)
(A13) A. Chambaz and M. J. van der Laan. Estimation and testing in targeted group sequential covariate-adjusted randomized clinical trials, Preprint MAP5_2011_11, HAL reference HAL-00582753 (submitted) (2011)
(A14) A. Chambaz, P. Neuvial, and M. J. van der Laan. Estimation of a non-parametric variable importance measure, Preprint MAP5_2011_25 (2011)


Bibliography

[1] G. D. Adamson, J. de Mouzon, P. Lancaster, K-G. Nygren, E. Sullivan, and F. Zegers-Hochscild. World collaborative report on in vitro fertilization, 2000. Fertility and Sterility, 85:1586–1622, 2006.
[2] A. C. Atkinson and A. Biswas. Adaptive biased-coin designs for skewing the allocation proportion in clinical trials with normal responses. Stat. Med., 24(16):2477–2492, 2005.
[3] R. Azencott and D. Dacunha-Castelle. Series of irregular observations. Springer-Verlag, 1986.
[4] R. R. Bahadur, S. L. Zabell, and J. C. Gupta. Large deviations, tests, and estimates. In Asymptotic theory of statistical tests and estimation (Proc. Adv. Internat. Sympos., Univ. North Carolina, Chapel Hill, N.C., 1979), pages 33–64. Academic Press, New York, 1980.
[5] U. Bandyopadhyay and A. Biswas. Adaptive designs for normal responses with prognostic factors. Biometrika, 88(2):409–419, 2001.
[6] J-M. Bardet and P. Bertrand. Identification of the multiscale fractional Brownian motion with biomechanical applications. J. Time Ser. Anal., 28(1):1–52, 2007.
[7] M. Basseville and I. V. Nikiforov. Detection of abrupt changes: theory and application. Prentice Hall Information and System Sciences Series. Prentice Hall Inc., Englewood Cliffs, NJ, 1993.
[8] A. Belot, P. Grosclaude, N. Bossard, E. Jougla, E. Benhamou, P. Delafosse, A. V. Guizard, F. Molinié, A. Danzon, S. Bara, A. M. Bouvier, B. Trétarre, F. Binder-Foucard, M. Colonna, L. Daubisse, G. Hédelin, G. Launoy, N. Le Stang, M. Maynadié, A. Monnereau, X. Troussard, J. Faivre, A. Collignon, I. Janoray, P. Arveux, A. Buemi, N. Raverdy, C. Schvartz, M. Bovet, L. Chérié-Challine, J. Estève, L. Remontet, and M. Velten. Cancer incidence and mortality in France over the period 1980–2005. Rev. Epidemiol. Santé Publique, 56(3), 2008.
[9] P. Bertrand, J-M. Bardet, M. Dabonneville, A. Mouzat, and P. Vaslin. Automatic determination of the different control mechanisms in upright position by a wavelet method. IEEE Engineering in Medicine and Biology Society, 2:1163–1166, 2001.
[10] H. K. Biesalski, B. B. de Mesquita, A. Chesson, F. Chytil, R. Grimble, R. J. Hermus, J. Kohrle, R. Lotan, K. Norpoth, U. Pastorino, and D. Thurnham. European consensus statement on lung cancer: risk factors and prevention. Lung cancer panel. CA Cancer J. Clin., 48(3):167–176, 1998.


[11] J. Boivin, L. Bunting, J. A. Collins, and K-G. Nygren. International estimates of infertility prevalence and treatment-seeking: potential need and demand for infertility medical care. Human Reproduction, 22:1506–1512, 2007.
[12] S. Boucheron and E. Gassiat. Error exponents for AR order testing. IEEE Trans. Inform. Theory, 52(2):472–488, 2006.
[13] B. E. Brodsky and B. S. Darkhovsky. Nonparametric methods in change-point problems, volume 243 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1993.
[14] O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Series in Statistics. Springer, New York, 2005.
[15] E. Carlstein, H-G. Müller, and D. Siegmund, editors. Change-point problems. Institute of Mathematical Statistics, Hayward, CA, 1994. Papers from the AMS-IMS-SIAM Summer Research Conference held at Mt. Holyoke College, South Hadley, MA, July 11–16, 1992.
[16] N. N. Čencov. Statistical decision rules and optimal inference, volume 53 of Translations of Mathematical Monographs. American Mathematical Society, Providence, R.I., 1982. Translation from the Russian edited by Lev J. Leifman.
[17] L. Chiari, A. Cappello, D. Lenzi, and U. Della Croce. An improved technique for the extraction of stochastic parameters from stabilograms. Gait Posture, 12(3):225–234, 2000.
[18] T. M. Cover and J. A. Thomas. Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition, 2006.
[19] I. Csiszár and P. C. Shields. The consistency of the BIC Markov order estimator. Ann. Statist., 28(6):1601–1619, 2000.
[20] D. Dacunha-Castelle and E. Gassiat. The estimation of the order of a mixture model. Bernoulli, 3(3):279–299, 1997.
[21] D. Dacunha-Castelle and E. Gassiat. Testing the order of a model using locally conic parametrization: population mixtures and stationary ARMA processes. Ann. Statist., 27(4):1178–1209, 1999.
[22] E. de la Rochebrochard, N. Soullier, R. Peikrishvili, J. Guibert, and J. Bouyer. High in vitro fertilization discontinuation rate in France. International Journal of Gynaecology and Obstetrics, 103:74–75, 2008.
[23] E. de la Rochebrochard, C. Quelen, R. Peikrishvili, J. Guibert, and J. Bouyer. Long-term outcome of parenthood project during in vitro fertilization and after discontinuation of unsuccessful in vitro fertilization. Fertility and Sterility, 92:149–156, 2009.
[24] J. Dedecker. Exponential inequalities and functional central limit theorem for random fields. ESAIM:P&S, 5:77–104, 2001.
[25] L. Finesso, C-C. Liu, and P. Narayan. The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory, 42(5):1488–1497, 1996.


[26] L. Finesso. Consistent estimation of the order for Markov and hidden Markov chains. PhD thesis, University of Maryland, 1991.
[27] T. D. Frank, A. Daffertshofer, and P. J. Beek. Multivariate Ornstein-Uhlenbeck processes with mean-field dependent coefficients: application to postural sway. Phys. Rev. E, 63(1):011905, Dec 2000.
[28] P. A. Fransson, R. Johansson, A. Hafström, and M. Magnusson. Methods for evaluation of postural control adaptation. Gait Posture, 12(1):14–24, 2000.
[29] P. A. Fransson, A. Hafström, M. Karlberg, M. Magnusson, A. Tjäder, and R. Johansson. Postural control adaptation during galvanic vestibular and vibratory proprioceptive stimulation. IEEE Trans. Biomed. Eng., 50(12):1310–1319, 2003.
[30] P. A. Fransson, E. K. Kristinsdottir, A. Hafström, M. Magnusson, and R. Johansson. Balance control and adaptation during vibratory perturbations in middle-aged and elderly humans. Eur. J. Appl. Physiol., 91(5-6):595–603, 2004.
[31] E. Gassiat. Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist., 38(6):897–906, 2002.
[32] E. Gassiat and S. Boucheron. Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory, 49(4):964–980, 2003.
[33] E. Gassiat and R. van Handel. Pathwise fluctuations of likelihood ratios and consistent order estimation. Submitted, 2010. URL hal.archives-ouvertes.fr/hal-00453469/en/.
[34] S. Ghosal, J. K. Ghosh, and A. W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.
[35] R. Gill and J. M. Robins. Causal inference in complex longitudinal studies: continuous case. Ann. Stat., 29(6), 2001.
[36] X. Guyon and J. Yao. On the underfitting and overfitting sets of models chosen by order selection criteria. J. Multivariate Anal., 70(2):221–249, 1999.
[37] J. D. Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2):357–384, 1989.
[38] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, 2000.
[39] E. J. Hannan. The estimation of the order of an ARMA process. Ann. Statist., 8(5):1071–1081, 1980.
[40] E. J. Hannan, A. J. McDougall, and D. S. Poskitt. Recursive estimation of autoregressions. J. R. Statist. Soc. B, 51(2):217–233, 1989.
[41] D. Haughton. Size of the error in the choice of a model to fit data from an exponential family. Sankhyā Ser. A, 51(1):45–58, 1989.
[42] E. M. Hemerly and M. H. A. Davis. Recursive order estimation of autoregressions without bounding the model set. J. R. Statist. Soc. B, 53(1):201–210, 1991.


[43] F. Hu and W. F. Rosenberger. The theory of response-adaptive randomization in clinical trials. Wiley, New York, 2006.
[44] IARC. IARC monographs on the evaluation of the carcinogenic risk of chemicals to man: asbestos, volume 14. IARC, 1977.
[45] H. Ishwaran, L. F. James, and J. Sun. Bayesian model selection in finite mixtures by marginal density decompositions. J. Amer. Statist. Assoc., 96(456):1316–1332, 2001.
[46] W. Jefferys and J. Berger. Ockham's razor and Bayesian analysis. American Scientist, 80:64–72, 1992.
[47] C. Jennison and B. W. Turnbull. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL, 2000.
[48] C. Keribin. Consistent estimation of the order of mixture models. Sankhyā Ser. A, 62(1):49–66, 2000.
[49] C. Keribin and D. Haughton. Asymptotic probabilities of overestimating and underestimating the order of a model in general regular families. Comm. Statist. Theory Methods, 32(7):1373–1404, 2003.
[50] J. Laurens and J. Droulez. Bayesian processing of vestibular information. Biol. Cybern., 96(4):389–404, 2007.
[51] M. Lavielle. Detection of multiple changes in a sequence of dependent variables. Stochastic Process. Appl., 83(1):79–102, 1999.
[52] M. Lavielle and E. Moulines. Least-squares estimation of an unknown number of shifts in a time series. J. Time Ser. Anal., 21(1):33–59, 2000.
[53] M-L. T. Lee and G. A. Whitmore. Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Statist. Sci., 21(4):501–513, 2006.
[54] M-L. T. Lee and G. A. Whitmore. Proportional hazards and threshold regression: their theoretical and practical connections. Lifetime Data Anal., 16(2):196–214, 2010.
[55] C. Léonard and J. Najim. An extension of Sanov's theorem. Application to the Gibbs conditioning principle. Bernoulli, 8(6):721–743, 2002.
[56] G. P. Leonardi and I. Tamanini. Metric spaces of partitions, and Caccioppoli partitions. Adv. Math. Sci. Appl., 12(2):725–753, 2002.
[57] J. C. Lepecq, C. De Waele, S. Mertz-Josse, C. Teyssèdre, P. T. Huy, P. M. Baudonnière, and P-P. Vidal. Galvanic vestibular stimulation modifies vection paths in healthy subjects. J. Neurophysiol., 95(5):3199–3207, 2006.
[58] C. C. Liu and P. Narayan. Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. IEEE Trans. Inf. Theory, 40(4):1167–1180, 1994.


[59] T. Mergner, G. Schweigart, C. Maurer, and A. Blümle. Human postural responses to motion of real and virtual visual environments under different support base conditions. Exp. Brain Res., 167(4):535–556, 2005.
[60] P. Morfeld. Years of life lost due to exposure: causal concepts and empirical shortcomings. Epidemiol. Perspect. Innov., 1(1), 2004.
[61] K. M. Newell, S. M. Slobounov, E. S. Slobounova, and P. C. Molenaar. Stochastic processes in postural center-of-pressure profiles. Exp. Brain Res., 113(1):158–164, 1997.
[62] P. Nicolas, L. Bize, F. Muri, M. Hoebeke, F. Rodolphe, S. D. Ehrlich, B. Prum, and P. Bessières. Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res., 30(6):1418–1426, 2002.
[63] R. A. Olshen, E. N. Biden, M. P. Wyatt, and D. H. Sutherland. Gait analysis and the bootstrap. Ann. Statist., 17(4):1419–1440, 1989.
[64] J-C. Pairon, B. Legal-Régis, J. Ameille, J-M. Brechot, B. Lebeau, D. Valeyre, I. Monnet, M. Matrat, B. Chamming's, and S. Housset. Occupational lung cancer: a multicentric case-control study in Paris area. European Respiratory Society, 19th Annual Congress, Vienna, 2009.
[65] B. M. Pötscher. Estimation of autoregressive moving-average order given an infinite number of models and approximation of spectral densities. J. Time Ser. Anal., 11(2):165–179, 1990.
[66] J. Rissanen. Modelling by shortest data description. Automatica, 14:465–471, 1978.
[67] J. Robins and S. Greenland. The probability of causation under a stochastic model for individual risk. Biometrics, 45(4):1125–1138, 1989.
[68] J. Robins and S. Greenland. Estimability and estimation of expected years of life lost due to a hazardous exposure. Stat. Med., 10(1):79–93, 1991.
[69] W. F. Rosenberger, A. N. Vidyashankar, and D. K. Agarwal. Covariate-adjusted response-adaptive designs for binary response. J. Biopharm. Statist., 11:227–236, 2001.
[70] W. F. Rosenberger. New directions in adaptive designs. Statistical Science, 11:137–149, 1996.
[71] A. M. Sabatini. A statistical mechanical analysis of postural sway using non-Gaussian FARIMA stochastic models. IEEE Trans. Biomed. Eng., 47(9):1219–1227, 2000.
[72] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6(2):461–464, 1978.
[73] J. Shao, X. Yu, and B. Zhong. A theory for testing hypotheses under covariate-adaptive randomization. Biometrika, 2010.
[74] N. Soullier, J. Bouyer, J-L. Pouly, J. Guibert, and E. de la Rochebrochard. Estimating the success of an in vitro fertilization programme using multiple imputation. Human Reproduction, 23:187–192, 2008.


[75] O. M. Stitelman, A. E. Hubbard, and N. P. Jewell. The impact of coarsening the explanatory variable of interest in making causal inferences: implicit assumptions behind dichotomizing variables. Technical report, U.C. Berkeley Division of Biostatistics Working Paper Series, 2010.
[76] The Cancer Genome Atlas (TCGA) Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455:1061–1068, 2008.
[77] L. Tierney, R. E. Kass, and J. B. Kadane. Fully exponential Laplace approximations to expectations and variances of nonpositive functions. J. Amer. Statist. Assoc., 84(407):710–716, 1989.
[78] M. J. van der Laan. Statistical inference for variable importance. Int. J. Biostat., 2:Art. 2, 33 pp. (electronic), 2006.
[79] M. J. van der Laan. Estimation based on case-control designs with known incidence probability. U.C. Berkeley Division of Biostatistics Working Paper Series, 2008. Paper 234.
[80] M. J. van der Laan. The construction and analysis of adaptive group sequential designs. Technical report 232, Division of Biostatistics, University of California, Berkeley, March 2008.
[81] M. J. van der Laan and J. M. Robins. Unified methods for censored longitudinal data and causality. Springer Series in Statistics. Springer-Verlag, New York, 2003.
[82] M. J. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, 2011.
[83] M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. Int. J. Biostat., 2:Art. 11, 40 pp. (electronic), 2006.
[84] M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Stat. Appl. Genet. Mol. Biol., 6:Art. 25, 23 pp. (electronic), 2007.
[85] A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
[86] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan. Oracle inequalities for multi-fold cross validation. Statist. Decisions, 24(3):351–371, 2006.
[87] R. van Handel. On the minimal penalty for Markov order estimation. Probab. Th. Rel. Fields, 150:709–738, 2011.
[88] L. M. Wu. Large deviations, moderate deviations and LIL for empirical processes. Ann. Probab., 22(1):17–27, 1994.
[89] Z. Yu and M. J. van der Laan. Construction of counterfactuals and the G-computation formula. Technical report, Division of Biostatistics, University of California, Berkeley, 2002.
[90] L-X. Zhang and F-F. Hu. A new family of covariate-adjusted response adaptive designs and their properties. Appl. Math. J. Chinese Univ. Ser. B, 24(1):1–13, 2009.
[91] L-X. Zhang, F. Hu, S. H. Cheung, and W. S. Chan. Asymptotic properties of covariate-adjusted response-adaptive designs. Ann. Statist., 35(3):1166–1182, 2007.


[92] H. Zhu and F. Hu. Sequential monitoring of response-adaptive randomized clinical trials.Ann. Statist., 38(4) :2218–2241, 2010.


Curriculum Vitæ


Curriculum Vitæ of Antoine Chambaz

Antoine Chambaz
MAP5 (CNRS, UMR 8145)
Université Paris Descartes
45 rue des Saints-Pères
75270 Paris cedex 06, France

Born September 30, 1973, in Paris
French citizen; married, father of three children
antoine.chambaz@parisdescartes.fr
www.math-info.univ-paris5.fr/~chambaz

Professional experience

Since 2003: Maître de conférences, Université Paris Descartes
– Member of the MAP5 laboratory (Mathématiques Appliquées à Paris Descartes)
2008–2009: Fulbright research grant
– Visiting professor, Division of Biostatistics (UC Berkeley)
2002–2003: ATER, Université Paris Descartes
– Member of MAP5
1999–2002: Moniteur, Université Paris Descartes
– CIFRE doctoral funding, France Télécom R&D
– Université Paris Sud 11, Orsay (CNRS, UMR 8628)
– Member of the Probabilité, Statistiques et Modélisation laboratory

Education

1999–2003: PhD in mathematics, specialty statistics, Univ. Paris Sud 11, Orsay
– Segmentation spatiale et sélection de modèles : théorie et applications statistiques
– defended January 6, 2003
– advisors: E. Gassiat and M. Lavielle
– referees: A. Antoniadis, E. Moulines
– jury: R. Cerf, E. Gassiat, L. Girard, M. Lavielle, C. Léonard, E. Moulines
1998–1999: DEA in probability and statistics, Université Paris Sud 11, Orsay
1997–1998: Military service
1996–1997: Agrégation de Mathématiques (rank: 64th)
1995–1996: Maîtrise in pure mathematics, Université Paris Sud 11, Orsay

Funding

2010–2011: France-Berkeley Fund (US$10,000)
– Development and application of inference methods for causal analysis in the French medical and epidemiological communities
– PI: A. Chambaz and M. J. van der Laan (UC Berkeley)
2008–2009: Fulbright research grant (US$10,000)
– Development and application of causal analysis in the French medical and epidemiological communities
2008: Bonus Qualité Recherche, Université Paris Descartes (€30,000)
– Statistical evaluation of sensory preferences for the maintenance of posture
– PI: A. Chambaz and P-P. Vidal (Université Paris Descartes)
2008: CNRS interdisciplinary program Longévité et vieillissement (€15,000)
– Statistical analysis of sensory preferences for postural maintenance
– PI: A. Chambaz and P-P. Vidal (Université Paris Descartes)
2007–2011: PEDR


Research interests

Empirical processes; semiparametric and nonparametric statistics; targeted minimum loss estimation (hence the acronym TMLE); Bayesian nonparametric statistics; links between information theory and statistics.

Causality, estimation of causal parameters, longitudinal data: attributability of lung cancer to occupational exposure, with application to a large French case-control study; estimation of the probability of success of IVF programs in France; study of the relation between copy number variation and expression in cancer cells.

Construction and statistical analysis of adaptive group-sequential designs for randomized clinical trials.

Estimation of the order of mixtures or of Markov models.

Applications to biomedical research: study of the maintenance of posture in humans; study of the role of stochastic processes in the functioning of the neuron; age-at-death estimation from skeletal remains.

Implementation of statistical methods in R.

Publications

Papers in international peer-reviewed journals

1. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study, A. Chambaz and M. J. van der Laan, Int. J. Biostat., 7(1), Article 10 (2011).
2. Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study, A. Chambaz and M. J. van der Laan, Int. J. Biostat., 7(1), Article 11 (2011).
3. Deux modèles de Markov caché pour processus multiples et leur contribution à l'élaboration d'une notion de style postural, A. Chambaz, I. Bonan and P-P. Vidal, Journal de la SFdS, 150(1) (2009).
4. A minimum description length approach to hidden Markov models with Poisson and Gaussian emissions. Application to order identification, A. Chambaz, A. Garivier and E. Gassiat, J. Statist. Plann. Inference, 139(3):962-977 (2009).
5. Control of neuronal persistent activity by voltage-dependent dendritic properties, E. Idoux, D. Eugene, A. Chambaz, C. Magnani, J. A. White and L. E. Moore, J. Neurophysiol., 100:1278-1286 (2008).
6. Number of hidden states and memory: a joint order estimation problem for Markov chains with Markov regime, A. Chambaz and C. Matias, ESAIM:P&S, 13:38-50 (2009).
7. Bounds for Bayesian order identification with application to mixtures, A. Chambaz and J. Rousseau, Ann. Statist., 36(2):938-962 (2007).
8. Plica semilunaris temporal ectopia: an evidence of primary nasal pterygia traction, E. Denion, A. Chambaz, P-H. Dalens, J. Petitbon and M. Gérard, Cornea, 26(7):1-9 (2007).
9. Testing the order of a model, A. Chambaz, Ann. Statist., 34(3):1166-1203 (2006).
10. Detecting abrupt changes in random fields, A. Chambaz, ESAIM:P&S, 6:189-209 (2002).

Book chapters

1. Targeted maximum likelihood estimation in adaptive group-sequential covariate-adjusted randomized clinical trials, A. Chambaz and M. J. van der Laan, in Targeted Learning from Data (Estimation for Causal Effects), by S. Rose and M. J. van der Laan, Springer (2011).
2. Probability of success of an in vitro fertilization program, in Targeted Learning from Data (Estimation for Causal Effects), S. Rose and M. J. van der Laan, Springer (2011).

Preprints (submitted for publication)

1. Estimation of a non-parametric variable importance measure, A. Chambaz, P. Neuvial, and M. J. van der Laan, Preprint MAP5 2011 25 (2011).
2. Targeted maximum likelihood estimation and testing in adaptive group-sequential covariate-adjusted randomized clinical trials, A. Chambaz and M. J. van der Laan, Preprint MAP5 2011 11, HAL reference HAL-00582753 (2011).


3. Threshold regression models adapted to case-control studies, and the risk of lung cancer due to occupational exposure to asbestos in France, A. Chambaz, D. Choudat, C. Huber, J-C. Pairon and M. J. van der Laan, Preprint MAP5 2011 10, HAL reference HAL-00577883 (2011).
4. Classification in postural style, A. Chambaz and C. Denis, Preprint MAP5 2011 09, HAL reference HAL-00576070 (2011).

Manuscripts in preparation

1. A dialog on causality and statistics in medicine and epidemiology, A. Chambaz, I. Drouet and J-C. Thalabard.
2. On the probability of success of an in vitro fertilization program and the DAIFI study, controlling for time-dependent confounders, A. Chambaz, J. Bouyer, S. Gruber, S. Rose and M. J. van der Laan.
3. Stochastic modeling and statistical study of neuronal activity: shifting from the deterministic Hodgkin-Huxley model to a hidden Markov model, A. Chambaz, D. Eugène, L. E. Moore and A. Samson.

Editorial and reviewing activities

⋆ Associate editor of the International Journal of Biostatistics
Reviewer for the following journals
– International journals: Bernoulli, Bioinformatics, Biometrika, Communications in Statistics, Electronic Journal of Statistics, International Journal of Biostatistics, Scandinavian Journal of Statistics, Stochastic Processes and their Applications
– National journals: Journal de la Société Française de Statistique, Revue d'Epidémiologie et de Santé Publique
Reviewer for the following agencies and institutions
Agence Nationale de la Recherche, Commission Franco-Américaine Fulbright, Sous-direction de la Recherche et de l'Innovation–Région Ile-de-France

Mobility and talks

Stays and seminars abroad

2011 Division of Biostatistics, UC Berkeley, USA (one week)
2010 Division of Biostatistics, UC Berkeley, USA (two weeks) — three seminars
→ “Construction and statistical analysis of adaptive group-sequential designs for randomized clinical trials” (1 hour)
→ “Estimating the probability of success of an in vitro fertilization procedure, accounting for time-dependent confounding” (1 hour)
→ “Targeting the optimal design in group-sequential RCTs: the case with binary outcome and no covariates” (1 hour)
2008–2009 ⋆ Division of Biostatistics, UC Berkeley, USA (academic year) — one seminar, several lectures in M. J. van der Laan's Causal Inference course
→ “Studying high-throughput short read sequences with hidden Markov models for multiple processes: first results” (1 hour)
2008 Dipart. di Statistica, Univ. Cá Foscari Venezia, Italy (one week) — one seminar
→ “Statistical analysis of postural style” (1 hour)
2007 Depart. of Math. Sciences, Copenhagen Univ., Denmark (two days) — one seminar
→ “A MDL approach to HMM with Poisson and Gaussian emissions. Application to order identification” (1 hour)
2007 Division of Biostatistics, UC Berkeley, USA (one week) — one seminar
→ “Order estimation: some theory and examples” (1 hour)
2006 Institute of Biomedical Engineering of Padova, Italy (one week) — one seminar
→ “A MDL approach to HMM with Poisson and Gaussian emissions. Application to order identification” (1 hour)


Conferences and workshops

2011 Invited, International Statistical Institute satellite meeting, Copenhagen (Denmark)
→ “On targeted minimum loss estimation” (45 minutes)
2011 Invited, Journées de Statistiques à Marne-la-Vallée, Marne-la-Vallée (France)
→ “A new non-parametric measure of association between DNA copy number and gene expression, and its robust estimation” (45 minutes)
2011 GDR Statistique et Santé, Paris Descartes (France)
→ “Construction and statistical analysis of adaptive group-sequential designs for randomized clinical trials” (25 minutes)
2011 Invited, Atelier INSERM #209, Avancées statistiques récentes pour l'analyse causale, Bordeaux (France)
→ “Statistical analysis of causal parameters in epidemiology: the DAIFI study example” (45 minutes)
→ “Construction and statistical analysis of adaptive group-sequential designs for randomized clinical trials” (45 minutes)
2010 Invited, Mathématiques en mouvement, ENS Paris (France)
→ “Statistique pour l'analyse causale en épidémiologie” (25 minutes)
2010 Invited, Hommage de la SFdS à Daniel Schwartz, 42èmes Journées de Statistique, Luminy (France)
→ “Analyse statistique de paramètres causaux en épidémiologie : l'exemple de l'étude DAIFI” (15 minutes)
2009 GDR Statistique et Santé, Paris Descartes (France)
→ “Modèles graphiques en épidémiologie” (30 minutes)
2008 Modélisation Statistique des Images, Luminy (France)
→ “Two or three things I know about intrinsic dimension estimation: a talk with three images” (40 minutes)
2008 Troisième Journée de Biologie Systémique, Paris Descartes (France)
→ “Analyse comparative de l'expression chronologique des gènes (puces à ADN) : Evo-Devo et pathologies humaines” (40 minutes)
2006 Statistique mathématique et applications, Luminy (France)
2005 Workshop Analyse statistique des données post-génomiques, INA-PG, Paris (France)
→ “Bounds for Bayesian order identification with application to mixtures” (30 minutes)
2005 EMS Summer School on Statistics in Genetics and Molecular Biology, Warwick (England)
→ “Bounds for Bayesian order identification with application to mixtures” (25 minutes)
2004 Statistique mathématique et applications, Luminy (France)
→ “Estimation jointe du nombre d'états cachés et de la mémoire d'un processus auto-régressif à régime markovien” (30 minutes)
2003 Colloque Jeunes Probabilistes et Statisticiens, Aussois (France)
2002 34èmes Journées de Statistique, Brussels (Belgium)
2001 33èmes Journées de Statistique, Nantes (France)
2001 New directions in time series analysis, Luminy (France)

Seminars in France

2011 Séminaire européen de statistique (Paris)
2010 AgroParisTech (Paris), Hôpital Bichat, Centre de recherche en épidémiologie et santé des populations (Villejuif), Institut Curie (Unité de biostatistique), Institut Curie (Cancer et génome : bioinformatique, biostatistiques et épidémiologie d'un système complexe), Institut Gustave Roussy, Séminaire parisien de statistique
2009 Orsay, Paris Descartes
2007 Nanterre
2006 Maths pour le génome (Evry), Rennes 2, Séminaire parisien de statistique
2005 Lyon 1, Versailles
2003 Joseph Fourier (Grenoble), Orsay, Paris Descartes
2002 Toulouse 3


Supervision and scientific organization

Member of the Société Française de Statistique

Elected and representative duties

2006–2009 Member of the MAP5 laboratory council
2011 Member of hiring committees (section 26), Paris 13 Nord, Paris Dauphine, Paris Sud Faculté de Pharmacie
2010 Member of hiring committees (section 26), Paris Dauphine, Paris Descartes, Paris 1 Panthéon Sorbonne
2008 Member of a hiring committee (section 26), Paris Descartes
2007 Member of hiring committees (section 26), Paris Descartes, Paris Ouest Nanterre La Défense

Organization of workshops and seminars

2011 ⋆ Organizer of Atelier INSERM #209, Avancées statistiques récentes pour l'analyse causale, June 6–10, 2011, Bordeaux and Paris (France) — with M. Chavance
– designed the program
– selected and invited the 11 speakers
– selected the 80 participants from their applications
– ran the hands-on phase of the workshop
– managed the online posting of the workshop material
– editor of the special issue devoted to the workshop by The International Journal of Biostatistics
2009–present ⋆ Organizer of the MAP5 statistics working group
– biweekly working group
– selection and invitation of the speakers (about fifty talks)
– in charge of the working group's website
2011 Organizer of the ED420 scientific seminar Maximum de vraisemblance ciblée, principes et applications en épidémiologie (January 6, 2011) — with J. Bouyer
2006–2008 Organizer of the First, Second and Third Journées de Biologie Systémique, Paris Descartes (France) — with C. Néri

PhD supervision

2009–2012 C. Denis, Des neurones à la posture : une approche statistique, with F. Comte and A. Samson
– one submitted preprint
– two articles in preparation
– two talks, at the 43èmes Journées de Statistique in Tunis (Tunisia) and at the Statistique mathématique et applications conference in Fréjus (France)

Supervision for the Validation des Acquis de l'Expérience (VAE)

2011 N. Travier (Catalan Institute of Oncology, Barcelona, Spain)
– academic correspondent for the VAE within the Master de Santé Publique, spécialité Recherche en Santé Publique (Paris Descartes)
– supervised the writing of the experience report
– member of the validation jury

Teaching mentoring (monitorat)

2009–2012 F. Leroy (PhD under the supervision of P. Tubert-Bitter)


Student project supervision

2004 De l'utilisation des inégalités de concentration en statistique, third-year working group, ENSAE
2005 Analyse statistique de données d'âge-au-décès, third year, Licence Mathématiques, Informatique et Applications
2006 Couplage en probabilité, third year, Licence Mathématiques, Informatique et Applications
2010 Estimation de l'effet causal de l'activité physique sur la forme physique, à partir du jeu de données SPPARCS, Master Mathématiques et Informatique
2012 Analyse statistique avancée de données longitudinales concernant la leishmaniose cutanée zoonotique, Biostatistics track, spécialité Recherche en Santé Publique, Master Santé Publique

Teaching

From 1999 to 2002, 64 teaching hours per academic year.
From 2002 to 2011, 192 teaching hours per academic year (except during the year of délégation at CNRS).

Program responsibilities

⋆ Co-head of the Biostatistics track of the spécialité Recherche en Santé Publique, Master Santé Publique (Université Paris Descartes and Université Paris Sud 11) — with P. Tubert-Bitter
– designed the new curriculum
– publicity, in charge of the website
– coordinated the courses
– reviewed the applications and selected the candidates
– internship defenses

Teaching as maître de conférences and as ATER at Université Paris Descartes

Licence Mathématiques, Informatique et Applications
L1. “Mathématiques générales”, tutorials
L2. “Algèbre linéaire”, lectures and tutorials
L2. “Introduction à la statistique”, lectures and tutorials

Master Mathématiques et Informatique, spécialité Mathématiques Appliquées et Ingénierie Mathématique pour les Sciences du Vivant
M1. “Processus du second ordre”, lectures with tutorials
M1. “Tests non paramétriques”, lectures with tutorials
M1. “Epidémiologie, initiation”, lectures with tutorials
M2. “Epidémiologie, approfondissement”, lectures with tutorials
– created both epidemiology courses from scratch for 2004–2005
– have taught them every year since (except during my stay in the US)
– twofold angle: (i) what theoretical statistics brings to epidemiology, and (ii) an introduction to the principles of causal analysis

Master de Santé Publique, spécialité Recherche en Santé Publique, Biostatistics track
M2. “Analyse causale”, lectures with tutorials
M2. “Essais cliniques à schémas adaptatifs”, lectures with tutorials
– created the course from scratch for 2010–2011
– taken by students of both the Biostatistics and Clinical Research tracks
– half theory, half implementation and simulation studies (in R)

Tutoring

Teaching as moniteur at Université Paris Descartes

DEUG de Biologie
L1. “Analyse”, tutorials
L2. “Statistiques”, lectures and tutorials

Maîtrise de Biologie
M1. “Introduction à Matlab”, lectures with tutorials


Articles


ESAIM: Probability and Statistics, November 2002, Vol. 6, 189–209
URL: http://www.emath.fr/ps/
DOI: 10.1051/ps:2002011

DETECTING ABRUPT CHANGES IN RANDOM FIELDS

Antoine Chambaz 1, 2

Abstract. This paper is devoted to the study of some asymptotic properties of an M-estimator in a framework of detection of abrupt changes in a random field's distribution. This class of problems includes e.g. recovery of sets. It involves various techniques, including the M-estimation method, concentration inequalities, maximal inequalities for dependent random variables and φ-mixing. Penalization of the criterion function when the size of the true model is unknown is performed. All the results apply under mild, discussed assumptions. Simple examples are provided.

Mathematics Subject Classification. 60E15, 62C99, 62F12, 62G20, 62M40.

1. Introduction

The problem of detecting abrupt changes includes a wide range of subjects unified by a common basic framework: observation of a random process whose distribution is long-scale heterogeneous but short-scale homogeneous on some regions. Comprehensive presentations can be found in the three books [5–7]. The mathematical methods include M-estimation, as in the present paper or [14,15], and also nonparametric or Bayesian techniques, see e.g. [2,13].

Handling estimation or tests in a multiple-changes case with an unknown number of changes is crucial and intricate.
Akaike's and Schwarz's papers [1,23] are most of the time invoked as milestones, as is Yao's [29], who proved consistency of the estimator based on Schwarz's criterion in the case of independent Gaussian observations. Penalization methods are widely used, for instance in the context of the estimation of the order of a process (see [1]), of the order of a mixture (see [8]), or more generally in statistical learning theory (see for instance the lecture notes [16]). Barron et al. obtained in [4] some precise bounds in a framework of regression and density estimation. Penalization in view of estimating a number of change-points is also widely used, for instance among the previous citations in [14,15].

Examples

One of the simplest change-point settings can be summarized by the following model: one observes responses Y(X_i) at X_i = i, with Y(X_i) = ϑ⋆(X_i) + ε(X_i), for centered, possibly dependent ε(X_i) and some piecewise constant function ϑ⋆. Here, X_1, ..., X_n should be understood as regular times of observation. A first natural extension consists of observing at random points X_i on a d-dimensional lattice; then of allowing observation throughout some general d-dimensional space X; and finally of observing some process Y indexed by x ∈ X at randomly chosen points X_i of X.

Keywords and phrases: Detection of change-points, M-estimation, penalized M-estimation, concentration inequalities, maximal inequalities, mixing.
1 UMR C 8628 du CNRS, Équipe de Probabilités, Statistique et Modélisation, Université Paris-Sud, France; e-mail: Antoine.Chambaz@math.u-psud.fr
2 FTR&D, 38 rue du Général Leclerc, 92130 Issy-les-Moulineaux, France.
© EDP Sciences, SMAI 2002

Recovery of sets obviously enters in this framework, too: one observes an image X composed of an object τ⋆_0 and a background through noisy observations (X_i, Y_i) (i = 1, ..., n), with independent X_i ∈ X and responses Y_i = f(X_i) 1l{X_i ∈ τ⋆_0} + ξ_i, for some function f bounded away from 0 and random centered noise ξ_i. Here (X_1, ..., X_n) are supposed independent of the mutually independent n-tuple (ξ_1, ..., ξ_n). The aim is to estimate τ⋆_0, or equivalently the partition τ⋆ = (τ⋆_0, X − τ⋆_0).

Aim of this paper

We address in this paper the estimation of a partition τ⋆ of X from possibly dependent random observations Y_i at independent and identically P-distributed points X_i in X. The proofs are based on Lavielle's paper [12]. The model actually consists of a couple (τ⋆, θ⋆): τ⋆ = (τ⋆_j)_{1≤j≤K⋆} is a partition with K⋆ subsets, where K⋆ (the cardinality of τ⋆) is possibly unknown, and θ⋆ is a collection of K⋆ finite-dimensional parameters θ⋆_j. We define for convenience ϑ⋆ = ∑_{j=1}^{K⋆} θ⋆_j 1l{τ⋆_j}. We consider that changes affect the marginal distribution of the Y_i's: conditionally on X_i, Y_i has a distribution which depends on ϑ⋆(X_i).
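In the simplest one-dimensional instance above, the model and its minimum contrast estimator (with a priori known cardinality K⋆ = 2) can be sketched numerically. The following is an illustrative simulation only; the signal, noise level, sample size and search grid are arbitrary choices, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Piecewise constant theta* on [0, 1] with a single change at 0.4
# (all numerical values here are illustrative, not from the paper).
def theta_star(x):
    return np.where(x < 0.4, 0.0, 1.5)

n = 500
X = rng.uniform(0.0, 1.0, size=n)            # i.i.d. P-distributed design points
Y = theta_star(X) + rng.normal(0.0, 0.5, n)  # Y(X_i) = theta*(X_i) + eps(X_i)

# Minimum contrast estimation with known cardinality K* = 2: for a candidate
# split t, the contrast is the residual sum of squares when Y is fitted by
# its empirical mean on {X < t} and on {X >= t}.
def contrast(t):
    rss = 0.0
    for seg in (Y[X < t], Y[X >= t]):
        if seg.size > 0:
            rss += np.sum((seg - seg.mean()) ** 2)
    return rss

grid = np.linspace(0.05, 0.95, 181)
t_hat = grid[np.argmin([contrast(t) for t in grid])]
print(f"estimated change point: {t_hat:.3f}")  # should be close to 0.4
```

With a jump this large relative to the noise, the minimizer of the contrast lands near the true change point; the results of the paper quantify such rates of convergence, including for dependent noise.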
We assume that there exists an ad hoc contrast J_n associated to the problem. Indeed, we estimate ϑ⋆ by minimum contrast estimation and related techniques. Suppose first that we choose a priori the cardinality K of the estimator. By definition, the value of the contrast computed at the estimator ˆϑ_n = (ˆτ_n, ˆθ_n) is a lower bound on the contrast J_n(τ, θ) computed at any model (τ, θ) of cardinality K.

Involved techniques

J_n is naturally decomposed into the sum of a first term that depends only on X_1, ..., X_n and a term of random centered fluctuations. The fluctuations take the form Σ_n(G) = ∑_{i=1}^n Z_i 1l{X_i ∈ G} for Z_i = Y_i − E(Y_i | X_i) and any G in a set G. Section 4 is devoted to the control of those fluctuations via maximal inequalities. A maximal inequality is an upper bound on the probability that sup{‖Σ_n(G)‖_∞ : G ∈ G} exceeds some δ > 0. In the simple case where partitions are constructed with elementary rectangles, one can easily derive such maximal inequalities from a mild control of the second-order moment of the fluctuations (see [19,20]). The problem is more difficult in a general framework where partitions are constructed with elements of a larger class of sets (see [9]). In comparison with the previous simple case, control of moments of any order p > 2 is needed here.

Denote by P_n the empirical measure of (X_1, ..., X_n). Another theoretical complication arises from the need to derive lower bounds for (P(G) − P_n(G))/P(G) from bounds on P(G) for a large class of sets G. Actually, this is possible with large probability for sets G satisfying P(G) ≥ r_n for some carefully chosen sequence {r_n} ↓ 0. We cope with this difficulty thanks to concentration inequalities (refer to [18,25]), see Section 3.1.

Results for a priori known cardinality K⋆.
Penalization

We finally obtain, under mild assumptions and for a priori known K⋆, that estimation is asymptotically consistent, and we bound below the rates of convergence. Quite surprisingly, but in accordance with Lavielle's former results, the rate of convergence of the estimate ˆτ_n of τ⋆ does not seem to depend on the dependence structure of the Y_i's. It is strongly related to the rate {r_n} mentioned in the previous section.

Those results for known K⋆ can be generalized. We can indeed construct an estimator ˆϑ_{n,K} of ϑ⋆ for any a priori choice of the cardinality K of the estimator. The point is then to select the best estimator among them. This is, roughly speaking, the aim of the penalization method: replace the contrast J_n(τ, θ) by its penalized version J_n(τ, θ) + β_n K, with β_n > 0. The added term β_n K penalizes the models with large cardinality, whereas those models are favoured when minimizing J_n(τ, θ) alone.

We prove that, for sequences {β_n} ↓ 0 slowly enough, penalized estimation yields a consistent estimated triplet (ˆK_n, ˆτ_n, ˆθ_n). Naturally, the dependence structure of the Y_i's affects the maximum rate of convergence for {β_n}.

Comparison with previous works

We noticed earlier that the field of recovery of sets is part of the general problem of detecting abrupt changes. Thus, we may wish to compare our results to classical ones in that field. Choose Mammen and Tsybakov's [17]


paper, where the authors derive some optimal convergence rates. Recall the previous crude description of the recovery-of-sets problem. Here, the partition to estimate has cardinality 2, so the penalization procedure is not needed. The point is to estimate τ⋆_0. Roughly speaking, the authors prove that the risk of the maximum likelihood estimator (which is also an M-estimator) achieves the best possible rate of convergence in the minimax approach. Nevertheless, those results rely on the independence of the responses Y_i. On the contrary, our results apply in a framework of M-estimation of abrupt changes from dependent observations and are satisfying in this context; see again the former citations.

Asymptotics

This paper is concerned with asymptotic results. In the whole text, the expression “as n, δ ↑ ∞” will correspond to the iterated limits lim_{δ→∞} lim_{n→∞}, and idem for “as n, η ↑ ∞”.

The practical interest of detecting abrupt changes in the general setting described above is certain, though our asymptotic results are mainly of theoretical value. They ensure confidence in a reasonable idealistic framework and encourage the search for practical recipes to apply. Indeed, rigorous minimum contrast estimation is here computationally intractable, and the penalization coefficient β_n would have to take a fixed value for real observed data. The choice of such a value would be justified by practical considerations as presented e.g. in [3,11].
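For illustration, in the simplest one-dimensional setting, a penalized criterion of the form J_n(τ, θ) + β_n K with a fixed penalty level can be minimized exactly by dynamic programming over segmentations. This is a toy sketch only; the residual-sum-of-squares contrast, the data, and the value β_n = 2 log n are arbitrary practical choices, not ones recommended by the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: piecewise constant signal with two change points,
# i.e. true cardinality K* = 3 (values chosen arbitrarily for the demo).
x = np.sort(rng.uniform(0, 1, 300))
means = np.piecewise(x, [x < 0.3, (x >= 0.3) & (x < 0.7), x >= 0.7],
                     [0.0, 2.0, -1.0])
y = means + rng.normal(0, 0.4, x.size)

n = y.size
cs = np.concatenate([[0.0], np.cumsum(y)])
cs2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

def seg_cost(i, j):
    """Residual sum of squares of y[i:j] around its mean (j exclusive)."""
    s, s2, m = cs[j] - cs[i], cs2[j] - cs2[i], j - i
    return s2 - s * s / m

K_max = 6
# cost[k, j]: best contrast J_n for y[:j] split into k segments.
cost = np.full((K_max + 1, n + 1), np.inf)
cost[0, 0] = 0.0
for k in range(1, K_max + 1):
    for j in range(k, n + 1):
        cost[k, j] = min(cost[k - 1, i] + seg_cost(i, j) for i in range(k - 1, j))

beta_n = 2.0 * np.log(n)  # fixed penalty level, an arbitrary practical choice
K_hat = min(range(1, K_max + 1), key=lambda k: cost[k, n] + beta_n * k)
print("selected cardinality K:", K_hat)
```

Minimizing the contrast alone would always favour K = K_max; the added term β_n K is what makes the selected cardinality settle at the true number of segments.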
An automatic choice would require non-asymptotic theory, see for example [4], but this is beyond the scope of this paper.

Notation

In the whole paper, different positive constants might be denoted by the same letter C.

The organization of the paper is as follows: we introduce in Section 2 the partitions and the associated parameters to be studied, and we define a pseudo-distance between them with useful properties. Section 3 is dedicated to the description of both the observations and the contrast to be minimized. Further assumptions are presented in Section 4; they deal with some crucial maximal inequalities. We consider estimation for known cardinality K⋆ of τ⋆ in Section 5 and use those results to address the unknown-cardinality case in Section 6. The Appendix (Section 7) consists of three parts: the first one is devoted to the postponed proof of a proposition; the second one to an exploration of the assumptions presented in Section 4; the third one to a sketch of the proof of a technical lemma.

2. The partitions and the associated parameters

2.1. Introducing partitions and associated parameters

Set a probability space (Ω, A, P) upon which random variables will be defined.

Consider some probability space (X, G, P), where P has support X, i.e. {x ∈ X : O ∋ x ⇒ P(O) > 0} = X (O denotes an open set). X is typically included in R^d. We will define partitions of X in the next paragraphs. First, choose some set F_0 ⊂ G of measurable sets. Roughly speaking, a partition τ of X will be constructed as a collection (τ_k) satisfying ∪_k τ_k = X, where any τ_k is a finite union of elements of F_0. Then, define F, which contains all finite unions of elements of F_0 and pairwise intersections of such sets.
Moreover, we suppose for the sake of simplicity (that is, to overcome measurability difficulties) that all the mathematical expressions in this paper involving suprema over subsets of F are measurable (it suffices that, for each of them, the suprema are P-almost surely equal to suprema over some countable subsets).

Examples of F_0 when X ⊂ R^d include the set of all rectangles of the form Π_{i=1}^d (a_i, b_i] (simply called rectangles in the whole paper); the set of all polygons whose edges have lengths bounded below by some positive constant (polygons for short); or, more generally (including rectangles and polygons), some Vapnik–Červonenkis class whose Vapnik–Červonenkis dimension is finite (for references, see e.g. [26–28]). In the sequel, VC will stand for Vapnik–Červonenkis.

Other assumptions will concern F_0 and F: we will state them in Section 3.

A. CHAMBAZ

Definition 2.1. We will consider F-partitions (or shortly partitions) of X. The set of all partitions is denoted T. Any τ ∈ T, τ = (τ_k)_{1≤k≤K}, is a collection of subsets of X. K is called the cardinality of τ, also denoted card(τ). Any τ_k can be written as a union ∪_l τ_k(l) of non-intersecting elements τ_k(l) of F_0 whose P-probabilities must be bounded below by some fixed ∆⋆ > 0. T_K denotes the set of partitions with cardinality K.

Remark 2.2. The condition of minimal P-probability for the pieces τ_k(l) of τ_k = ∪_l τ_k(l) stands for technical reasons. We will actually suppose that we know some lower bound on ∆⋆. Besides, this condition implies that there exists a finite maximal partition cardinality K̄ and that any τ_k is a finite union of the τ_k(l).

Parameters are associated to a partition in the following way: a partition τ ∈ T with cardinality K may go with a collection θ of K Θ-valued vectors.
Here, Θ is an open and precompact subset of R^p. Thus, for τ = (τ_k)_{1≤k≤K} and θ = (θ_k)_{1≤k≤K}, the parameter θ_k goes with τ_k. We will denote Θ_K = Θ^K.

2.2. Pseudo-distances for partitions and parameters

To start with, let us recall some notation. For two sets A and B, A▽B denotes their asymmetrical difference and A△B their symmetrical difference, that is

A▽B = A \ (A ∩ B) and A△B = (A▽B) ∪ (B▽A).

We wish to define a pseudo-distance between two F-partitions of the set X that generalizes the natural definition in the usual one-dimensional case, see [12]: for t and t⋆ two increasing vectors (respectively of length K and K⋆), the pseudo-distance is taken to be max_{1≤j≤K⋆} min_{1≤k≤K} |t_k − t⋆_j|. Thus, that pseudo-distance is the largest distance between points of t⋆ and their respective closest point in t. Observe that it is zero if and only if each point of t⋆ appears in t. These considerations lead to the following:

Definition 2.3. Let τ and τ⋆ be two F-partitions of the set X. Denote K and K⋆ their respective cardinalities. The gap g(τ, τ⋆) between them is defined as

g(τ, τ⋆) = max_{1≤j≤K⋆} min_K P((∪_{k∈K} τ_k) △ τ⋆_j).

The index K in the minimum ranges over all subsets of {1, ..., K}. For j = 1, ..., K⋆, we denote by K_j a smallest subset of {1, ..., K} achieving the minimum in the definition for fixed j. Consequently, we have

g(τ, τ⋆) = max_{1≤j≤K⋆} P((∪_{k∈K_j} τ_k) △ τ⋆_j).

Let us present a few interesting properties of the gap g.

Proposition 2.4. Consider two F-partitions τ⋆ = (τ⋆_j)_{1≤j≤K⋆} and τ = (τ_k)_{1≤k≤K}.
(i) Let j be in {1, ..., K⋆} and k in {1, ..., K}. Observe that if τ_k ⊂ τ⋆_j, then k ∈ K_j, whereas τ_k ∩ τ⋆_j = ∅ implies k ∉ K_j. Not surprisingly, if k ∉ K_j, then P(τ_k ▽ τ⋆_j) ≥ P(τ_k ∩ τ⋆_j).
On the contrary, if k ∈ K_j and card(K_j) > 1, then P(τ_k ▽ τ⋆_j) ≤ P(τ_k ∩ τ⋆_j). When K_j = {k}, the former inequality holds as soon as g(τ, τ⋆) ≤ ∆⋆/2.
(ii) Set j_0 ≠ j_1 and k_0 ∈ K_{j_0}. Then P(τ_{k_0} ∩ τ⋆_{j_1}) ≤ g(τ, τ⋆).


(iii) If g(τ, τ⋆) = 0, then for all j there exists K_j such that τ⋆_j = ∪_{k∈K_j} τ_k (equalities hold up to P-null sets, as do the following conclusions). We derive from this that the K_j's are mutually disjoint and that K ≥ K⋆: τ is a sub-partition of τ⋆. Thus, when g(τ, τ⋆) = 0 with K = K⋆, we do have τ = τ⋆.
Suppose now that g(τ, τ⋆) < ∆⋆/2. We still have mutually disjoint K_j's, and therefore again K ≥ K⋆. In particular, when K = K⋆, one can assume that K_j = {j} for each j. Observe that g(τ, τ⋆) ≥ ∆⋆/2 as soon as K < K⋆.

In the sequel, T_{K,δ} denotes the set of all partitions τ of cardinality K such that g(τ, τ⋆) > δ, and Θ_{K,δ} the set of all parameters θ of length K such that d_2(θ, θ⋆) > δ.

Proof. (ii) We have indeed

P(τ_{k_0} ∩ τ⋆_{j_1}) ≤ P(τ_{k_0} ▽ τ⋆_{j_0}) ≤ P((∪_{k∈K_{j_0}} τ_k) ▽ τ⋆_{j_0}) ≤ P((∪_{k∈K_{j_0}} τ_k) △ τ⋆_{j_0}) ≤ g(τ, τ⋆).

(iii) Let g(τ, τ⋆) < ∆⋆/2. Suppose k ∈ K_{j_0} and k ∈ K_{j_1} with j_0 ≠ j_1 too. Then

P(τ_k) = P((τ_k ▽ τ⋆_{j_0}) ∪ (τ_k ▽ τ⋆_{j_1})) ≤ P(τ_k ▽ τ⋆_{j_0}) + P(τ_k ▽ τ⋆_{j_1}) < ∆⋆,

which is excluded.

(iv) To see that, suppose we can take k_0 ∉ ∪_{1≤j≤K⋆} K_j. Then

P(τ_{k_0}) = Σ_{j=1}^{K⋆} P(τ_{k_0} ∩ τ⋆_j) ≤ Σ_{j=1}^{K⋆} P(τ⋆_j ▽ (∪_{k∈K_j} τ_k)) ≤ K⋆ g(τ, τ⋆),

which is excluded for g(τ, τ⋆) small enough. □

Definition 2.5. Let τ and τ⋆ be two F-partitions of the set X, K and K⋆ their respective cardinalities. Let θ and θ⋆ be two parameters, respectively taken in Θ_K and Θ_{K⋆}, θ⋆ having no equal coordinates. We define the two following pseudo-distances between them:

d_2(θ, θ⋆) = max_{1≤j≤K⋆} min_{1≤k≤K} ‖θ_k − θ⋆_j‖_2

and, for any nonnegative function v on {θ⋆_1, ..., θ⋆_{K⋆}} × Θ vanishing only on the diagonal,

d_v(θ, θ⋆) = max_{1≤j≤K⋆} min_{1≤k≤K} v(θ⋆_j, θ_k).

3. The observations and the contrast

We suppose the existence of a random field, indexed by x ∈ X, of possibly dependent random variables (rv): for any x ∈ X, a rv Y_x taking its values in R^q is generated according to a law which depends on ϑ(x). Our aim is to estimate K⋆, τ⋆ and θ⋆ from random observations under conditions as mild as possible.

Two classical examples

• Detection in the mean
Here, Y_x = ϑ⋆(x) + Y′_x for some strictly stationary field of centered rv (Y′_x)_{x∈X}. Consequently, the vector of true parameters θ⋆ is understood as the vector of the true possible means. Thus, Y_x has mean θ⋆_j if and only if x ∈ τ⋆_j.

• Detection in both mean and variance
Denoting θ⋆ = (µ⋆, s²⋆), ϑ⋆_1 = Σ_j µ⋆_j 1l{τ⋆_j} and ϑ⋆_2 = Σ_j s²⋆_j 1l{τ⋆_j}, we define Y_x = ϑ⋆_1 + (ϑ⋆_2)^{1/2} Y′_x. Here, (Y′_x)_{x∈X} is a strictly stationary field of centered rv with variance 1. In this example, Y_x has mean µ⋆_j and variance s²⋆_j if and only if x ∈ τ⋆_j.

3.1. Observations and first assumptions


A2: There exists a sequence {r_n} ↓ 0 such that lim inf_n n r_n > 0 and

lim_{η→∞} lim_{n→∞} P( sup{ (P(F) − P_n(F)) / P(F) : F ∈ F, P(F) ≥ η r_n } ≥ 1/2 ) = 0.

Remark 3.2 (on Assumptions A1 and A2). Assumption A1 is fulfilled whenever F is a VC class of finite VC dimension. In the sequel, the case of F a VC class of finite VC dimension will be the most general example for F. For a wide family of examples, see for instance [26]. On the other hand, Proposition 3.3 below (whose proof, postponed to Appendix 7.1, requires independence of the X_i's) casts some light on Assumption A2.

Proposition 3.3. Assumption A2 holds whenever F is a VC class of finite VC dimension and the sequence {log r_n / (n r_n)} is bounded.

Remark 3.4 (on Prop. 3.3). Choices of r_n = (log_α n)^β / n with integer α ≥ 1 and positive β are obviously included (with the notation log_{α+1} = log ∘ log_α, log_1 = log).

The last assumption of this section concerns the control of the moment of order h of P_n(G) for G ∈ G:

A3: For any h ∈ (1, 2) and G ∈ G, for some constant A > 0 depending on h only,

E(P_n(G)^h) ≤ A (E P_n(G))^h = A P(G)^h.

Remark 3.5 (on Assumption A3). Note that Jensen's inequality straightforwardly yields the reversed lower bound P(G)^h ≤ E(P_n(G)^h). Assumption A3 is always satisfied for independent, not necessarily identically distributed, rv: it is a simple consequence of Rosenthal's inequality, see e.g. [21]. We will use this inequality to derive useful maximal inequalities in Section 4.

3.2. Further assumptions: on the contrast

The following assumption ensures the existence of a contrast J_n adapted to our model.
J_n(τ, θ) is obtained as a sum of local contrasts W_n(τ_k, θ_k) computed at (τ_k, θ_k).

A4: Let ϕ : Θ → R and ψ : Θ → R^r be two continuously differentiable functions whose derivatives have continuous extensions to the closure Θ̄. Let ξ : R^q → R^r be such that ξ(Y_x) ∈ L^1(P) for any x ∈ X and ξ(Y_X) ∈ L^1(P) for X P_{X_i}-distributed. Define the local contrasts for (τ_k, θ_k) (k = 1, ..., K) by

W_n(τ_k, θ_k) = n^{-1} Σ_{i=1}^n {ϕ(θ_k) + ψ(θ_k)^T ξ(Y_i)} 1l{X_i ∈ τ_k}

and introduce the corresponding limit contrast w : {θ⋆_1, ..., θ⋆_{K⋆}} × Θ → R, which is supposed to satisfy:
• P-as, for all i such that X_i ∈ τ⋆_j and any θ ∈ Θ,
w(θ⋆_j, θ) = ϕ(θ) + ⟨ψ(θ), E(ξ(Y_i) | X_i)⟩; (1)
• w(θ⋆_j, θ) ≥ w(θ⋆_j, θ⋆_j) for any (θ⋆_j, θ) ∈ {θ⋆_1, ..., θ⋆_{K⋆}} × Θ, with equality if and only if θ = θ⋆_j.

Denote by v the centered limit contrast, that is, v(θ⋆_j, θ) = w(θ⋆_j, θ) − w(θ⋆_j, θ⋆_j) for any (θ⋆_j, θ). Then v is nonnegative, continuous on {θ⋆_1, ..., θ⋆_{K⋆}} × Θ̄, and continuously differentiable on {θ⋆_1, ..., θ⋆_{K⋆}} × Θ with respect to its second variable. Its derivative has a continuous extension to {θ⋆_1, ..., θ⋆_{K⋆}} × Θ̄. Finally, v is zero only on the diagonal. Thus, following Definition 2.5 in Section 2.2, we can define a pseudo-distance d_v from v. Furthermore, since the {v(θ⋆_j, ·), j = 1, ..., K⋆} are continuous, there exist ρ⋆, v⋆ > 0 such that, for any j_0 ≠ j_1,

inf{v(θ⋆_{j_0}, θ) : ‖θ − θ⋆_{j_1}‖_2 ≤ ρ⋆} − sup{v(θ⋆_{j_0}, θ) : ‖θ − θ⋆_{j_0}‖_2 ≤ ρ⋆} ≥ v⋆.

Remark 3.6 (on Assumption A4).
• Condition (1) in the former assumption controls the way the rv Y_i depends on X_i through τ⋆. In particular, P-as for i, i′ such that X_i, X_{i′} ∈ τ⋆_j,
E(ξ(Y_i) | X_i) − E(ξ(Y_{i′}) | X_{i′}) ∈ Vect(ψ(Θ))^⊥,
and they are equal as soon as Vect(ψ(Θ)) = R^r, which is clearly the case for r = 1 and ξ ≠ 0.
• Note that E(W_n(τ⋆_j, θ) | X_1^n) = P_n(τ⋆_j) w(θ⋆_j, θ), where P_n(τ⋆_j) tends to P(τ⋆_j) P-as.
Thus, w(θ⋆_j, ·) can be understood as a rescaled limit conditional expectation of the local contrast computed at τ⋆_j.

The next assumption concerns v:

A5: There exist B > 0, σ > 0 such that (up to a change of ρ⋆),

if ‖θ − θ⋆_j‖_2 ≤ ρ⋆, then v(θ⋆_j, θ) ≥ B ‖θ − θ⋆_j‖_2^σ (j = 1, ..., K⋆).

Back to the classical examples

• Detection in the mean
We choose the following local criterion function

W_n(τ_k, θ_k) = n^{-1} Σ_{i=1}^n (Y_i − θ_k)² 1l{X_i ∈ τ_k} − n^{-1} Σ_{i=1}^n Y_i² 1l{X_i ∈ τ_k}.

Here, ϕ(θ) = θ², ψ(θ) = −2θ and ξ(y) = y. For this particular criterion, v(θ⋆_j, θ) = (θ − θ⋆_j)² and Assumption A5 above is satisfied.

• Detection in both mean and variance
This time, we choose

W_n(τ_k, θ_k) = n^{-1} Σ_{i=1}^n {(Y_i − µ_k)²/s_k² + log s_k²} 1l{X_i ∈ τ_k}.

Here, ϕ(µ, s²) = µ²/s² + log s², ψ(µ, s²) = (−2µ, 1)/s² and ξ(y) = (y, y²). Moreover, we have

v(θ⋆_j, θ) = (µ⋆_j − µ)²/s² + log(s²/s⋆_j²) + s⋆_j²/s² − 1.

Thus, v(θ⋆_j, θ) is twice the Kullback–Leibler divergence H(N_{θ⋆_j} | N_θ) for Gaussian rv N_{θ⋆_j} (resp. N_θ) of mean and variance given by θ⋆_j (resp. θ). Besides, Assumption A5 above does hold for Θ = ]a, b[ × ]c, d[ with d > c > 0.

Note that in both cases, the minimization of θ ↦ W_n(τ⋆_j, θ) leads to the natural least squares estimators of the parameter θ⋆_j.

4. Controlling random fluctuations via maximal inequalities

Let us define the centered random field of fluctuations (Z_x)_{x∈X} and the rv Z_i (i = 1, ..., n) by

Z_x = ξ(Y_x) − E(ξ(Y_x)) and Z_i = ξ(Y_i) − E(ξ(Y_i) | X_i)


and, for any x_1^n ∈ X^n, define the corresponding sums over any set G ∈ G:

Σ_{x_1^n}(G) = Σ_{i=1}^n Z_{x_i} 1l{x_i ∈ G} and Σ_{X_1^n}(G) = Σ_{i=1}^n Z_i 1l{X_i ∈ G}.

Denote finally S_n(G; θ) = ψ(θ)^T Σ_{X_1^n}(G) and, for any F-partition τ, n_{kj} = n P_n(τ_k ∩ τ⋆_j). Then

W_n(τ_k, θ_k) = n^{-1} Σ_{j=1}^{K⋆} {n_{kj} w(θ⋆_j, θ_k) + S_n(τ_k ∩ τ⋆_j; θ_k)}.

We require

A6: There exist C_1 > 0 and h ∈ (1, 2) such that, for any δ > 0 and G ∈ G,

P( sup{‖Σ_{X_1^n}(F)‖_∞ : F ∈ F(G)} ≥ δ | X_1^n ) ≤ (C_1/δ²) (Σ_{i=1}^n 1l{X_i ∈ G})^h  P-as.

Here, F(G) denotes the set {F ∩ G : F ∈ F}.

Observe that Assumptions A3 and A6 yield (uncondition with respect to X_1^n, then bound from above) the following maximal inequality:

Lemma 4.1. Under Assumptions A3 and A6, there exists C_2 > 0 such that, for any δ > 0 and G ∈ G,

P( sup{‖Σ_{X_1^n}(F)‖_∞ : F ∈ F(G)} ≥ δ ) ≤ C_2 n^h P(G)^h / δ².

Remark 4.2 (on Assumption A6 and Lem. 4.1). The aim of Assumption A6 is to ensure the result of Lemma 4.1. The linearity of S_n in Σ_{X_1^n} is needed to derive uniform control of S_n(G; θ) in (G, θ) from uniform control of Σ_{X_1^n}(G) in G. Besides, we can propose some mild alternative conditions to Assumption A6, see Section 7.2.

Back to the classical examples
Applying Lemma 4.1 and Assumption A1, we get that (n P_n(τ⋆_j))^{-1} ‖Σ_{X_1^n}(τ⋆_j)‖_∞ = o_P(1). This yields that the least squares estimators obtained by minimization of the respective contrasts at τ⋆_j are consistent.

Finally, let us state the last assumption; it concerns F_0.

A7: For some constant γ > 0 depending on F_0 only, for any G ∈ G and r > 0, there exists G̃ ∈ G, a subset of G, with P(G̃) ≤ γr such that

{F ∈ F̃_0(G) : P(F) ≤ r} ⊂ F(G̃). (2)

Here, F̃_0(G) denotes the set {F ∩ G : F ∈ F_0, F▽G ≠ ∅}.

Remark 4.3 (on Assumption A7).
The simplest examples are again when F_0 is composed of rectangles or polygons. Besides, this result still holds when F̃_0(G) is replaced in (2) by the set F̃(G) = {F ∩ G : F▽G ≠ ∅}, where F is a union of at most K̄ elements of F_0.

We can now state a key result that completes Lemma 4.1:

Lemma 4.4. Under Assumptions A3, A6 and A7, there exists C_3 > 0 such that, for any G ∈ G and any v > 0,

P( sup{ ‖Σ_{X_1^n}(F)‖_∞ / (n P(F)) : F ∈ F̃(G), P(F) ≥ v } ≥ δ ) ≤ C_3 (nv)^{h−2} / δ².

Proof. The event whose P-probability we want to bound from above is included in the union over j ≥ 0 of the events

sup{ ‖Σ_{X_1^n}(F)‖_∞ : F ∈ F̃(G), P(F) ≤ 2^{j+1} v } ≥ 2^j n v · δ.

Assumption A7 yields that the former event indexed by j is itself included in the following one:

sup{ ‖Σ_{X_1^n}(F)‖_∞ : F ∈ F(G̃_j) } ≥ 2^j n v · δ

for some subset G̃_j of G satisfying P(G̃_j) ≤ γ 2^{j+1} v. Lemma 4.1 allows us to conclude. □

5. The case of known cardinality of the true partition

5.1. Definition of the estimator

We address in this section the consistency of our estimator when the cardinality K of the estimator of τ⋆ is fixed a priori.

The estimator (τ̂_n, θ̂_n) of (τ⋆, θ⋆) is constructed by minimization over T_K × Θ_K of the contrast J_n, or equivalently of the centered contrast U_n, with

J_n(τ, θ) = Σ_{k=1}^K W_n(τ_k, θ_k),
U_n(τ, θ) = J_n(τ, θ) − J_n(τ⋆, θ⋆) = u_n(τ, θ) + e_n(τ, θ)

where

u_n(τ, θ) = n^{-1} Σ_{j=1}^{K⋆} Σ_{k=1}^K n_{kj} v(θ⋆_j, θ_k) and
e_n(τ, θ) = n^{-1} Σ_{j=1}^{K⋆} Σ_{k=1}^K {S_n(τ_k ∩ τ⋆_j; θ_k) − S_n(τ_k ∩ τ⋆_j; θ⋆_j)}.

In the sequel, we will denote θ̂_n(τ_k) = arg min{W_n(τ_k, θ) : θ ∈ Θ} for any τ ∈ T_K and 1 ≤ k ≤ K. Observe then that (θ̂_n(τ_k))_k = arg min{J_n(τ, θ) : θ ∈ Θ_K}. Moreover, we will denote θ̂⋆_n = θ̂_n(τ⋆).
We will write θ̂⋆_{nj} for the j-th coordinate of θ̂⋆_n, and θ̂_{nj} for the j-th coordinate of θ̂_n = θ̂_n(τ̂_n).

The next proposition casts some light on the behaviour of the total fluctuation term e_n. It is a direct consequence of Lemma 4.1, since S_n(τ_k ∩ τ⋆_j; θ) = ψ(θ)^T Σ_{X_1^n}(τ_k ∩ τ⋆_j) and ψ is bounded.

Proposition 5.1. Under the assumptions of Lemma 4.1, e_n is uniformly o_P(1) over T_K × Θ_K.
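To make the construction of Section 5.1 concrete, here is a minimal numerical sketch (an illustration, not the paper's procedure) of minimum contrast estimation with K known a priori, in the one-dimensional detection-in-the-mean setting of Section 3: each local contrast W_n(τ_k, ·) is minimized by the local empirical mean, and the interval partition is then chosen by brute force over a grid of breakpoints. The helper names `local_contrast` and `fit_known_K`, the grid and all numerical values are assumptions of this sketch.

```python
import itertools
import numpy as np

def local_contrast(y, mask):
    """Minimized local contrast W_n(tau_k, .) for detection in the mean:
    the minimizer in theta_k is the local empirical mean."""
    if not mask.any():
        return 0.0, 0.0
    theta = y[mask].mean()
    n = len(y)
    w = ((y[mask] - theta) ** 2).sum() / n - (y[mask] ** 2).sum() / n
    return w, theta

def fit_known_K(x, y, K, grid):
    """Brute-force minimization of J_n = sum_k W_n over interval partitions
    of [0,1] with K pieces and breakpoints on `grid` (illustrative only)."""
    best = (np.inf, None, None)
    for cuts in itertools.combinations(grid, K - 1):
        bounds = (0.0, *cuts, 1.0)
        J, thetas = 0.0, []
        for a, b in zip(bounds[:-1], bounds[1:]):
            w, th = local_contrast(y, (x > a) & (x <= b))
            J += w
            thetas.append(th)
        if J < best[0]:
            best = (J, bounds, thetas)
    return best

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(0, 1, n)
theta_star = np.where(x <= 0.35, 0.0, 2.0)   # true change point at 0.35
y = theta_star + rng.normal(0, 0.5, n)
J, bounds, thetas = fit_known_K(x, y, K=2, grid=np.linspace(0.05, 0.95, 19))
```

For larger K, dynamic programming would replace the brute-force enumeration; the contrast values themselves are exactly the W_n of the first classical example, so J is always nonpositive and decreases as the fitted partition improves.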


5.2. Consistency

Consistency is of course hopeless for K < K⋆.

Theorem 5.2. Set K ≥ K⋆ and let (τ̂_n, θ̂_n) be the estimator defined in Section 5.1. Under Assumptions A1 to A7, g(τ̂_n, τ⋆) = o_P(1) and d_2(θ̂_n, θ⋆) = o_P(1).

The proof relies on the following lemma.

Lemma 5.3. Under Assumptions A1 to A7, there exists C⋆ > 0 such that, for any K and all δ > 0,

lim_{n→∞} P( ∩_{l≥n} [∀(τ, θ) ∈ T_K × Θ_K, u_l(τ, θ) ≥ C⋆ d_v(θ, θ⋆)] ) = 1,
lim_{n→∞} P( ∩_{l≥n} [∀(τ, θ) ∈ T_{K,δ} × Θ_K, u_l(τ, θ) ≥ C⋆ g(τ, τ⋆)] ) = 1.

A sketch of the proof of Lemma 5.3 can be found in Appendix 7.3. We are now able to prove Theorem 5.2.

Proof (of Th. 5.2). Let us prove that g(τ̂_n, τ⋆) = o_P(1); one would show that d_2(θ̂_n, θ⋆) = o_P(1) along the same lines. If g(τ̂_n, τ⋆) > δ, then inf{U_n(τ, θ) : (τ, θ) ∈ T_{K,δ} × Θ_K} is non-positive and consequently, thanks to Lemma 5.3, for some constant c > 0, for any ε > 0 and for n large enough,

sup{|e_n(τ, θ)| : (τ, θ) ∈ T_K × Θ_K} > cδ

with probability at least 1 − ε. Thus, the proof is complete, since e_n is uniformly o_P(1). □

5.3. Rate of convergence

The following theorem gives lower bounds for the rates of convergence of τ̂_n and θ̂_n for K = K⋆. Observe that the case K > K⋆ would be dealt with by the same proof, up to minor changes.

Theorem 5.4. Set K = K⋆ and let (τ̂_n, θ̂_n) be the estimator defined in Section 5.1. Under Assumptions A1 to A7, assuming moreover that the coefficient σ appearing in Assumption A5 satisfies σ ≥ 2/h, the sequences

{r_n^{-1} g(τ̂_n, τ⋆)} and {n^{(2−h)/(2(σ−1))} d_2(θ̂_n, θ⋆)}

are uniformly bounded in P-probability.

Remark 5.5. This lower bound on the rate of convergence of τ̂_n does not depend on the dependence structure of the sequence (Y_i), since the coefficient h of Assumption A6 does not appear in the bound and {r_n} depends solely on X_1^n: g(τ̂_n, τ⋆) = O_P(r_n). On the contrary, the rate of convergence of θ̂_n does depend on the dependence structure of (Y_i). Moreover, this rate is the same as when the true partition τ⋆ is known.
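Theorem 5.4 controls r_n^{-1} g(τ̂_n, τ⋆); monitoring this quantity in a simulation requires evaluating the gap of Definition 2.3. The sketch below (an illustration, not part of the paper) does so by brute force over the index subsets K for interval partitions of [0,1], taking P to be Lebesgue measure discretized on a fine mesh; the interval representation and the `mesh` parameter are assumptions of this sketch.

```python
import numpy as np
from itertools import chain, combinations

def gap(tau, tau_star, mesh=10_000):
    """Gap g(tau, tau_star) of Definition 2.3, with P = Lebesgue measure on
    [0,1] and partitions given as lists of (a, b] intervals; the minimum over
    index subsets K is taken by exhaustive enumeration."""
    x = (np.arange(mesh) + 0.5) / mesh              # midpoints of a fine mesh
    ind = [(x > a) & (x <= b) for a, b in tau]      # indicators of the tau_k
    g = 0.0
    for a, b in tau_star:
        t_star = (x > a) & (x <= b)
        subsets = chain.from_iterable(
            combinations(range(len(tau)), r) for r in range(len(tau) + 1))
        # P((union over K of tau_k) symmetric-difference tau_star_j), minimized in K
        best = min(
            np.logical_xor(
                np.any([ind[k] for k in K], axis=0) if K
                else np.zeros(mesh, dtype=bool),
                t_star).mean()
            for K in subsets)
        g = max(g, best)
    return g

tau_star = [(0.0, 0.5), (0.5, 1.0)]
tau      = [(0.0, 0.48), (0.48, 1.0)]
# each tau_k matches one tau_star_j up to a set of measure 0.02
print(gap(tau, tau_star))   # -> 0.02 (up to mesh resolution)
```

The enumeration over subsets is exponential in card(τ), which is harmless for the small K of such illustrations but is one reason why the rigorous minimization discussed in the introduction is computationally intractable.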
Proof. Denote in the sequel ζ = 1/(2(σ − 1)). We shall prove that both probabilities

P(δ_n ≥ g(τ̂_n, τ⋆) ≥ δ r_n, d_2(θ̂_n, θ⋆) [...]

[...] for any c > 0 and j_0 ≠ j_1 (where F denotes any set of the form τ ∩ τ⋆_{j_1} with τ ∈ F such that P(τ △ τ⋆_{j_0}) ≤ δ_n):

sup{ ‖Σ_{X_1^n}(F)‖_∞ : P(F) ≤ δ r_n } ≥ δ n r_n,
sup{ ‖Σ_{X_1^n}(F)‖_∞ / (n P(F)) : P(F) ≥ δ r_n } ≥ c.


This is a direct consequence of Lemmas 4.1 and 4.4.

Let us now show that the expression in (4) goes to 0 as n, δ ↑ ∞ too. Observe that, for n large enough and on the events whose probabilities are given by (4), we have (with a view to applying Lem. 5.3) the lower bound d_v(θ̂_n, θ⋆) ≥ B δ^σ n^{σζ(h−2)}. Moreover, Assumption A1 together with g(τ, τ⋆) < min_j P(τ⋆_j)/2 imply that, for any ε > 0 and for n large enough, P_n(τ_j ∩ τ⋆_j) ≥ P(τ⋆_j)/4 with probability at least 1 − ε. Thus, for any (τ, θ) of T_{K⋆} × Θ_{K⋆} satisfying the same conditions as (τ̂_n, θ̂_n) on the events whose probabilities are written in (4), we have with probability at least 1 − ε, for n large enough and any 1 ≤ j ≤ K⋆,

u_n(τ, θ) ≥ C⋆ δ^σ n^{σζ(h−2)} ∨ a_1 ‖θ_j − θ⋆_j‖_2^σ (6)

for some a_1 > 0 independent of (τ, θ). Note that the first term in the maximum comes from Lemma 5.3. To conclude this first step, remark that the preceding inequality, together with the following one,

x/y^σ ∨ y^{σ/(σ−1)} z ≥ x^{1/σ} z^{(σ−1)/σ} (x, y, z > 0),

yields (for some constant a_2 > 0 independent of (τ, θ))

u_n(τ, θ) ≥ a_2 ‖θ_j − θ⋆_j‖_2 δ^{σ−1} n^{h/2−1}. (7)

The second step consists of the same kind of arguments as in the first part of the proof: we bound from below (τ, θ) ↦ U_n(τ, θ) = Σ_{k=1}^{K⋆} W_n(τ_k; θ_k) − Σ_{j=1}^{K⋆} W_n(τ⋆_j; θ⋆_j), taking care to separate the cases k = j and k ≠ j and to distribute the weight we know we can count on. Precisely, consider (τ, θ) as above: on events of probability at least 1 − ε, for n large enough,

U_n(τ, θ) ≥ n^{-1} Σ_{j=1}^{K⋆} {a_3 n u_n(τ, θ) + S_n(τ_j ∩ τ⋆_j; θ_j) − S_n(τ_j ∩ τ⋆_j; θ⋆_j)}
+ n^{-1} Σ_{j=1}^{K⋆} Σ_{k≠j} {(n_{kj} ∨ n u_n(τ, θ)) a_4 + S_n(τ_k ∩ τ⋆_j; θ_k) − S_n(τ_k ∩ τ⋆_j; θ⋆_j)},

for some constants a_3, a_4 > 0.
Hence, thanks to Assumption A2 (as before, see (5)), the Taylor–Lagrange inequality and (6, 7), with probability 1 − 2ε for n, δ large enough,

U_n(τ, θ) ≥ C n^{-1} Σ_{j=1}^{K⋆} ‖θ_j − θ⋆_j‖_2 { δ^{σ−1} n^{h/2} − a ‖Σ_{X_1^n}(τ_j ∩ τ⋆_j)‖_∞ }
+ C n^{-1} Σ_{j=1}^{K⋆} Σ_{k≠j} { (n P(τ_k ∩ τ⋆_j) ∨ δ^σ n^{σζ(h−2)+1}) b − c ‖Σ_{X_1^n}(τ_k ∩ τ⋆_j)‖_∞ },

for constants a, b, c > 0.

Consequently, the proof will be complete if we show that the probabilities of the following events converge to 0 as n, δ ↑ ∞ for any c > 0 and j_0 ≠ j_1 (where F denotes any set of the form τ ∩ τ⋆_{j_1} with τ ∈ F such that P(τ △ τ⋆_{j_0}) ≤ δ_n):

sup{ ‖Σ_{X_1^n}(F)‖_∞ : F ∈ F } ≥ δ n^{h/2},
sup{ ‖Σ_{X_1^n}(F)‖_∞ : P(F) ≤ δ n^{σζ(h−2)} } ≥ δ n^{σζ(h−2)+1},
sup{ ‖Σ_{X_1^n}(F)‖_∞ / (n P(F)) : P(F) ≥ δ n^{σζ(h−2)} } ≥ c.

This is a direct consequence of Lemmas 4.1 and 4.4, since σ ≥ 2/h. □

5.4. Number of misclassified observations

The standard scheme of proof applied to show that the number of misclassified observations is O_P(1) requires bounding from below (up to a multiplicative constant) a generic term of the form P_n(F) by P(F), for P(F) possibly less than δ r_n. Thus, Assumption A2 is useless and we cannot conclude. This difficulty is overcome, independently of the dimension d, when proving that the number of misclassified observations is O_P(n r_n). We can obtain boundedness in probability in the one-dimensional case; the proof then relies on the natural ordering over R.

Proposition 5.6. Let N_n(τ̂_n) = Σ_{j=1}^{K⋆} Σ_{k≠j} n_{kj} denote the number of misclassified observations for τ̂_n with respect to τ⋆. Under the assumptions of Theorem 5.4, N_n(τ̂_n) = O_P(1) for d = 1 and N_n(τ̂_n) = O_P(n r_n) in higher dimensions.

Sketch of the proof. Let η ∈ {0, 1}. If N_n(τ̂_n) ≥ δ(n r_n)^η and g(τ̂_n, τ⋆) ∨ d_2(θ̂_n, θ⋆) ≤ δ r_n (the study of that configuration suffices thanks to Th.
5.4), then τ̂_n minimizes U′_n, which is bounded from below (up to the usual substitutions and to some multiplicative constant), for n, δ large enough, by

n^{-1} Σ_{j=1}^{K⋆} Σ_{k≠j} { (n_{kj} ∨ δ(n r_n)^η) − c ‖Σ_{X_1^n}(τ_k ∩ τ⋆_j)‖_∞ }.

Set d = 1 and η = 0. We can follow the strategy of proof of [12]. An application of the triangle inequality shifts the problem to the control of the P-probabilities of the following events (and their left-symmetric counterparts):

sup{ ‖Σ_{l=0}^t Z_{(s⋆+l)}‖_∞ : t⋆, 0 ≤ t ≤ δ } ≥ δ,
sup{ ‖Σ_{l=0}^t Z_{(s⋆+l)}‖_∞ / t : t⋆, t ≥ δ } ≥ c.

Here, Z_{(s)} = Z_i for X_{(s)} = X_i ({X_{(s)}}_s denotes the increasingly ordered vector X_1^n). The index t⋆ in the suprema ranges over all right extremities of the intervals constituting subsets of τ⋆. For some t⋆, X_{(s⋆)} corresponds to the nearest X_i greater than t⋆. Such probabilities do go to zero, as provided by the simplest of Móricz's one-dimensional inequalities, which we apply here in place of Lemmas 4.1 and 4.4, and the proof is complete in this case. Heuristically, the conclusion holds because the union of all intervals containing at most δ observation points X_i itself contains O(δ) such points.

For d ≥ 2, we cannot proceed as above. Actually, the union of all subsets τ_k ∩ τ⋆_j that contain at most δ observation points X_i may itself contain many more than O(δ) points, generally O(n) of them. Thus, we must conclude as in the proof of Theorem 5.4, i.e. choose η = 1 and conclude that N_n(τ̂_n) = O_P(n r_n). □

Remark 5.7. Note that in a very specific multidimensional case, usually called the pixel case, we get N_n(τ̂_n) = O_P(1). Indeed, suppose that F_0 is composed of rectangles and that [0, 1]^d is decomposed into the union of n^d mutually disjoint rectangular boxes, the pixels. Suppose that X_1, ..., X_{n^d} are chosen uniformly, one in each box. Then N_n(τ̂_n) = O_P(1).
The scheme of proof applied in the one-dimensional case above applies here, because the union of all subsets τ_k ∩ τ⋆_j that contain at most δ observation points X_i contains O(δ (log δ)^{d−1}) points.

6. The case of unknown cardinality of the true partition

6.1. Definition of the estimator

We address in this section the case of an unknown cardinality K⋆ of τ⋆. According to the former section, we can construct an estimator (τ̂_{n,K}, θ̂_{n,K}) of any cardinality K, i.e. for any a priori fixed cardinality of the


estimator. The point is to select the best estimator among the family (τ̂_{n,K}, θ̂_{n,K})_K. Naturally, models with large cardinality K are favoured; hence the idea of penalizing the contrast J_n by adding a penalization term β_n K, whose role is to balance out this effect.

Thus, estimation of the triplet (K⋆, τ⋆, θ⋆) is performed by minimizing a penalized contrast constructed with J_n as defined in Section 5. The estimator of (K⋆, τ⋆, θ⋆) is taken to achieve the minimization of the penalized contrast J̃_n given by

J̃_n(K, τ, θ) = J_n(τ, θ) + β_n K

for K ∈ {1, ..., K̄} and (τ, θ) ranging over T_K × Θ_K (recall that K⋆ is bounded above by K̄, as a consequence of the definition of an F-partition).

The sequence {β_n} is positive and tends to zero. The difficulty lies in the choice of its rate of convergence: since a large β_n (precisely, a slow rate of convergence) favours simple models (that is, models with small cardinality K) and vice versa, the point is to calibrate that rate. Indeed, β_n appears as a trade-off between fitting the observations and avoiding the selection of models that are too large. Actually, the calibration will follow from the rate of convergence of the estimator (τ̂_n, θ̂_n) studied in the previous section for a priori known K⋆.

6.2. Consistency

Theorem 6.1. Let {β_n} be a sequence of positive numbers satisfying both

β_n → 0 and n^{(2−h)/(2(σ−1))} β_n → ∞.

Under the assumptions of Theorem 5.4, K̂_n = K⋆ with P-probability tending to one. Hence, the consistency of (τ̂_n, θ̂_n) as stated in Theorem 5.2 still holds.

Proof. Proof that P(K̂_n < K⋆) tends to zero. [...] Proof that P(K̂_n > K⋆) tends to zero.
It is bounded above by the sum over K from K⋆ + 1 up to K̄ of the probabilities that

inf{U_n(τ, θ) : (τ, θ) ∈ T_K × Θ_K} + β_n ≤ 0.

For (τ, θ) ranging over T_{K,β_n} × Θ_K, we can replace the previous events by

inf{U′_n(τ) : τ ∈ T_{K,β_n}} ≤ 0

and proceed as in the first part of the proof of Theorem 5.4; when it ranges over T_K × Θ_{K,β_n}, the proof is similar to its second part (that is why we impose n^{(2−h)/(2(σ−1))} β_n → ∞). Thus, we have to focus on the P-probabilities of those events for (τ, θ) in T_K × Θ_K satisfying g(τ, τ⋆) ∨ d_2(θ, θ⋆) ≤ β_n.

For (τ, θ) as described above, we get, applying the Taylor–Lagrange inequality (b, c are some positive constants):

U_n(τ, θ) + β_n ≥ e_n(τ, θ) + β_n ≥ π n^{-1} Σ_{j=1}^{K⋆} Σ_{k∈K_j} {n β_n − b β_n ‖Σ_{X_1^n}(τ_k ∩ τ⋆_j)‖_∞}
+ π n^{-1} Σ_{j=1}^{K⋆} Σ_{k∉K_j} {n β_n − c ‖Σ_{X_1^n}(τ_k ∩ τ⋆_j)‖_∞}.

Finally, the proof will be complete if we show that the probabilities of the following events converge to 0 as n ↑ ∞ for any c > 0 and j_0 ≠ j_1 (where F denotes any set of the form τ ∩ τ⋆_{j_1} with τ ∈ F such that P(τ △ τ⋆_{j_0}) ≤ δ_n, for some {δ_n} ↓ 0 driven by consistency):

sup{ ‖Σ_{X_1^n}(F)‖_∞ : F ∈ F } ≥ cn,
sup{ ‖Σ_{X_1^n}(F)‖_∞ : P(F) ≤ β_n } ≥ c n β_n.

Once again, this is a direct consequence of Lemmas 4.1 and 4.4. □

7. Appendix

7.1. Proof of Proposition 3.3

Proof. Obviously, it suffices to prove that

lim_{η→∞} lim_{n→∞} P( sup{ |P(F) − P_n(F)| / P(F) : F ∈ F, P(F) ≥ η r_n } ≥ 1/2 ) = 0.

Thanks to Talagrand's concentration inequalities for the supremum of empirical processes (see [18], Th. 2.4, p. 266), basic analysis yields an upper bound exp{−f(n, η)}, where f(n, η) > 0 tends to infinity as n, η ↑ ∞, as soon as the expectation of the supremum in the former equation goes to zero.

Let us first study the following expectation for fixed integers n, p and some η > 0:

E( sup{ |P(F) − P_n(F)| / P(F) : F ∈ F_n^p } ),

for F_n^p = {F : F ∈ F, 2^p η r_n ≤ P(F) < 2^{p+1} η r_n}.
It is bounded above by

2^{−p} (η r_n)^{-1} E( sup{ |P(F) − P_n(F)| : F ∈ F_n^p } ).

Symmetrization arguments (refer to [18]) yield that the previous expression is bounded above by

2^{−p+1} (η n r_n)^{-1} E( sup{ |Σ_{i=1}^n ε_i 1l{X_i ∈ F}| : F ∈ F_n^p } ) (8)

for independent identically distributed Rademacher rv ε_i (i.e. P(ε_i = 1) = P(ε_i = −1) = 1/2), independent of X_1^n. Furthermore, for any C ⊂ F, the next result holds (as a consequence of Hoeffding's inequality, see Problem 2.14.8 of [27]): for a = sup_{F∈C} P(F), V the VC dimension of F and some constant A(F) depending on F only,

E( sup{ |Σ_{i=1}^n ε_i 1l{X_i ∈ F}| : F ∈ C } ) ≤ C n^{1/2} [ (a + (V/n) log(V/a)) log(A(F)/a) ]^{1/2}.

We apply this result together with the inequality (x + y)^{1/2} ≤ x^{1/2} + y^{1/2} (x, y > 0) to (8) and get that, for A′ = A(F) ∨ V,

E( sup{ |P(F) − P_n(F)| / P(F) : F ∈ F_n^p } ) ≤ C √(2^{−p}) ( log(A′ (η r_n)^{-1}) / (η n r_n) )^{1/2} + C 2^{−p} (2A′)^{1/2} log(A′ (η r_n)^{-1}) / (η n r_n).


Finally, the expectation of the supremum of interest over F_n is bounded above by the next sum, which can be controlled as shown:

Σ_{p≥0} E( sup{ |P(F) − P_n(F)| / P(F) : F ∈ F_n^p } ) ≤ C ( log(A′ (η r_n)^{-1}) / (η n r_n) )^{1/2} + C A′^{1/2} log(A′ (η r_n)^{-1}) / (η n r_n).

The result follows immediately. □

7.2. Exploring Assumption A6

Two alternative assumptions

We propose in this section to study Assumption A6. In order to make things clearer, we will suppose throughout this section that r = 1, that is, that ξ is real-valued; thus ‖·‖_∞ is systematically replaced by absolute values. The results are easy to adapt to the general case. Let us start with a very special case where a slight control of the conditional second-order moments suffices to ensure Assumption A6. State

A6a: There exist C_0 > 0 and h ∈ (1, 2) such that, for any G ∈ G,

E( Σ_{X_1^n}(G)² | X_1^n ) ≤ C_0 (Σ_{i=1}^n 1l{X_i ∈ G})^h  P-as.

Note that A6a would obviously be satisfied with h = 1 if the rv Z_i were independent. Moreover, Assumption A6a implies Assumption A6 in the basic but fundamental case of rectangles:

Proposition 7.1. Assumption A6 is satisfied as soon as A6a holds when F_0 is composed of rectangles.

The proof of Proposition 7.1 relies on an adaptation of the method proposed in [19,20] to show such a result on the real line, and on its extension to the multidimensional case. It can be done by induction and then thoroughly uses basic properties of the decomposition of rectangles into unions of rectangles.

The sequel draws its inspiration from the theory of dependent variables and random fields, see [10,22] and especially [9]. Actually, a natural loosened conditional Marcinkiewicz–Zygmund inequality yields Assumption A6 for a VC class F.
Indeed, let Assumption A6b consist of the following:

A6b: There exist C_0 > 0 and h ∈ [1, 2) such that, for any p > 2 and G ∈ G,

E( |Σ_{X_1^n}(G)|^p | X_1^n ) ≤ C_0^p p^{p/2} (Σ_{i=1}^n 1l{X_i ∈ G})^{hp/2}  P-as.

Remark 7.2 (on Assumption A6b). The previous inequality is said to be "loosened" because of the power h on the right-hand side, where h = 1 is usually expected. It is a sharp inequality thanks to the particular form of the factor C_0^p p^{p/2}: it allows an efficient optimization in p, producing the expected final result via Pisier's method for some rich class F, namely a VC class of finite VC dimension. The precise statement is given by Proposition 7.3. It underlines how exceptional Proposition 7.1 seems, where only control of second-order moments is needed, compared with Assumption A6b and its control of moments of any order p > 2. It is known that such a simple condition on second-order moments cannot be sufficient in the simple case of polygons.

Proposition 7.3. Assumption A6 is satisfied as soon as Assumption A6b holds when F is a VC class of finite VC dimension.

Proof. All the inequalities to come hold P-as. We write Σ_n(G) for Σ_{X_1^n}(G) and E_n for E(·|X_1^n). ‖·‖_p denotes the L^p norm with respect to the conditional probability P(·|X_1^n).

Pisier's method (see [9]) consists of writing

E_n( sup_{F∈F(G)} |Σ_n(F)|² ) ≤ ‖ sup_{F∈F(G)} |Σ_n(F)| ‖_p² ≤ ( Σ_{F∈Γ_n} E_n(|Σ_n(F)|^p) )^{2/p}
≤ N_n^{2/p} max_{F∈Γ_n} ‖Σ_n(F)‖_p² ≤ N_n^{2/p} p C_0² (Σ_{i=1}^n 1l{X_i ∈ G})^h,

where Γ_n denotes a family of sets of minimal cardinality N_n such that any separation of {X_1, ..., X_n} ∩ G by elements of F(G) can be achieved with an element of Γ_n. Since F has finite VC dimension, so does F(G), and N_n is finite.
Set $H_n=\log N_n$: VC theory (see for instance [28]) ensures that $H_n$ is bounded above by $V\big(1+\log\sum_{i=1}^n\mathbb{1}\{X_i\in G\}\big)$, where V denotes the VC dimension of $\mathcal{F}$.

Optimization in p yields
$$\mathbb{E}_n\Big(\sup_{F\in\mathcal{F}(G)}|\Sigma_n(F)|^2\Big)\le 2C_0^2 H_n\Big(\sum_{i=1}^n\mathbb{1}\{X_i\in G\}\Big)^h\le 2C_0^2 V\Big(\sum_{i=1}^n\mathbb{1}\{X_i\in G\}\Big)^h\Big(1+\log \sum_{i=1}^n\mathbb{1}\{X_i\in G\}\Big).$$

Set $\varepsilon>0$ such that $h+\varepsilon$


We can now state the final result of this section and infer from it a sufficient rate of convergence to 0 for $\{\varphi(t)\}$ to ensure that Assumptions A6a and A6b hold true:

Proposition 7.4. Suppose that $\mathcal{X}=\mathbb{Z}^d$ and that $(Z_x)_{x\in\mathcal{X}}$ is bounded and strictly stationary. Assumptions A6a and A6b are satisfied as soon as, for some $1\le h<2$ and some $C_0>0$ depending on h and for any $n\ge1$,
$$\sum_{t=1}^n t^{d-1}\varphi(t)\le C_0\, n^{h-1}. \qquad (9)$$
The previous inequality is satisfied e.g. when $\varphi(t)=O(t^{-(d+1-h)})$.

Proposition 7.4 is a corollary of Proposition 7.6 below. The strategy of the proof relies on a Burkholder-like inequality shown by Dedecker in [9], namely in Proposition 1(a) of that paper. Following his method, we get that, for any $G\in\mathcal{G}$ and $\mathcal{Y}_n\subset\mathcal{X}$ of cardinality n,
$$\mathbb{E}\big(|\Sigma_{\mathcal{Y}_n}(G)|^p\big)\le\Big(2p\sum_{x\in\mathcal{Y}_n}\Big\{\|Z_x^2\|_{p/2}+\sum_{x'\in\mathcal{Y}_n}\|Z_{x'}\,\mathbb{E}_{d(x,x')}(Z_x)\|_{p/2}\,\mathbb{1}\{x'\in G\}\Big\}\mathbb{1}\{x\in G\}\Big)^{p/2}$$
$$\le(2p)^{p/2}\big(\|Z_0\|_\infty+\|Z_0\|_\infty^2\big)^{p/2}\Big(\sum_{x\in\mathcal{Y}_n}\mathbb{1}\{x\in G\}+\sum_{x,x'\in\mathcal{Y}_n}\|\mathbb{E}_{d(x,x')}(Z_x)\|_{p/2}\,\mathbb{1}\{x,x'\in G\}\Big)^{p/2}. \qquad (10)$$

In the previous display, $\mathbb{E}_{d(x,x')}(Z_x)$ denotes the conditional expectation of $Z_x$ with respect to the σ-field $\sigma(Z_y : y\in\mathcal{Y}_n,\ d(x,y)\ge d(x,x'))$.

We can derive from (10) a first ultimate condition on the field $(Z_x)_{x\in\mathcal{X}}$:

Proposition 7.5. Suppose that $\mathcal{X}=\mathbb{Z}^d$ and that $(Z_x)_{x\in\mathcal{X}}$ is centered, bounded and strictly stationary. Then Assumptions A6a and A6b hold as soon as, for some $1\le h<2$ and some $C_0>0$ depending on h and for any $p\ge2$, $G\in\mathcal{G}$, $n\ge1$ and $\mathcal{Y}_n\subset\mathcal{X}$ of cardinality n, for any $x\in\mathcal{Y}_n$,
$$\sum_{x'\in\mathcal{Y}_n}\|\mathbb{E}_{d(x,x')}(Z_x)\|_{p/2}\le C_0\, n^{h-1}.$$

The condition above holds with h = 1 for m-conditionally centered fields (i.e. fields such that $\mathbb{E}_{d(x,x')}(Z_x)=0$ for $d(x,x')\ge m$).
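As an aside, condition (9) can be sanity-checked numerically for the polynomial mixing rate mentioned in Proposition 7.4: with $\varphi(t)=t^{-(d+1-h)}$, the summand $t^{d-1}\varphi(t)$ is $t^{h-2}$, whose partial sums grow like $n^{h-1}$. The dimension and exponent below are illustrative choices of ours:

```python
def lhs_of_condition_9(n, d, h):
    # partial sum of t^(d-1) * phi(t) with phi(t) = t^(-(d+1-h)),
    # which simplifies to a partial sum of t^(h-2)
    return sum(t ** (h - 2) for t in range(1, n + 1))

d, h = 2, 1.5  # illustrative choices with 1 <= h < 2
ratios = [lhs_of_condition_9(n, d, h) / n ** (h - 1)
          for n in (10, 100, 1000, 10000)]
# the ratios stay bounded (here they approach 1/(h - 1) = 2), so (9)
# holds with any constant C_0 above that limit
```

The boundedness of the ratios is exactly the content of (9) for this choice of φ.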
This inclu<strong>de</strong>s m-<strong>de</strong>pen<strong>de</strong>nce and consequently, in<strong>de</strong>pen<strong>de</strong>nce.Furthermore, combining the next upper bound for the right hand term of (10)( )⎛(2p) p/2 2‖Z 0 ‖ ∞ + ‖Z 0 ‖ ∞⎝ ∑1l{x ∈ G} + ∑with Serfling’s inequality (see [24])yields:x∈Ynx,x ′ ∈Yn‖E d(x,x ′ ) (Z x ) ‖ ∞ ≤ 2φ(d(x, x ′ ))‖Z x ‖ ∞ ,⎞‖E d(x,x′ )(Z x )‖ ∞ 1l {x, x ′ ∈ G} ⎠p/2208 A. <strong>CHAMBAZ</strong>Proposition 7.6. Suppose that X = Z d and that (Z x ) x∈X is centered, boun<strong>de</strong>d and strictly stationary. ThenAssumptions A6a and A6b hold as soon as, for some 1 ≤ h 0 <strong>de</strong>pending on h and for any G ∈G,any n ≥ 1 and any Y n ⊂X of cardinality n, for any x ∈Y n ,∑‖E d(x,x ′ )(Z x )‖ ∞ ≤ C 0 n h−1 .x ′ ∈YnCondition (9)inProposition7.4 is sufficient.7.3. Sketch of proof of Lemma 5.3Sketch of the proof. First, <strong>de</strong>fine f ij (θ ⋆ ,α) = inf{αv(θ ⋆ j ,θ)+(1− α)v(θ⋆ i ,θ):θ ∈ Θ} (1 ≤ i ≠ j ≤ K⋆ , 0 0andf ij (θ ⋆ ,α) ≥ A min(α, 1 − α).Set ε>0 and <strong>de</strong>fine the events Ω n (δ)=∩ l≥n [sup{|P l (F) − P(F)| : F ∈F}≤δ] (anyδ>0, n≥ 1). SinceP(Ω n (δ)) ↑ 1asn ↑∞, there exists n 1 ≥ 1 such that P(Ω n1 (δ 1 )) ≥ 1 − ε with δ 1 =∆ ⋆ /16K 2 . Let us restrictourselves to the event Ω n1 (δ 1 ) and consi<strong>de</strong>r (τ,θ) inT K,∆⋆ /4K × Θ K, n ≥ n 1 . We can prove that u n (τ,θ) isboun<strong>de</strong>d from below by a positive constant in<strong>de</strong>pen<strong>de</strong>nt of n and (τ,θ). In<strong>de</strong>ed, we have (applying Prop. 
2.4(i)) the existence of $k,j_0,j_1$ such that $n_{kj_0}/n$ and $n_{kj_1}/n$ are both bounded from below by such a constant c, and then, for $\alpha=n_{kj_0}/(n_{kj_0}+n_{kj_1})$,
$$u_n(\tau,\theta)\ge\frac{n_{kj_0}+n_{kj_1}}{n}\big(\alpha v(\theta_{j_0}^\star,\theta_k)+(1-\alpha)v(\theta_{j_1}^\star,\theta_k)\big)\ge cA.$$
Then, since $g(\tau,\tau^\star)\le1$ and v is bounded from above by its supremum over the compact set $\Theta\times\Theta$, the study of the case $g(\tau,\tau^\star)>\Delta^\star/4K$ is complete.

Let us now deal with the lower bound in $d_v(\theta,\theta^\star)$ for the models $(\tau,\theta)$ ranging over $(T_K-T_{K,\Delta^\star/4K})\times\Theta^K$. Since
$$u_n(\tau,\theta)\ge\max_{1\le j\le K^\star}\max_{k\in\mathcal{K}_j}\frac{n_{kj}}{n}v(\theta_j^\star,\theta_k),$$
it suffices to bound from below all the $P(\tau_k\cap\tau_j^\star)$'s by some positive constant independent of n and $(\tau,\theta)$ and greater than $\delta_1$. We do so thanks to Proposition 2.4(i) again.

Set $\delta>0$. We still have to show that $u_n(\tau,\theta)\ge C^\star g(\tau,\tau^\star)$ for τ verifying δ


References

[1] H. Akaike, A new look at the statistical model identification. IEEE Trans. Automat. Control AC-19 (1974) 716-723. System identification and time-series analysis.
[2] A. Antoniadis, I. Gijbels and B. MacGibbon, Non-parametric estimation for the location of a change-point in an otherwise smooth hazard function under random censoring. Scand. J. Statist. 27 (2000) 501-519.
[3] Z.D. Bai, C.R. Rao and Y. Wu, Model selection with data-oriented penalty. J. Statist. Plann. Inference 77 (1999) 103-117.
[4] A. Barron, L. Birgé and P. Massart, Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 (1999) 301-413.
[5] M. Basseville and I.V. Nikiforov, Detection of abrupt changes: Theory and application. Prentice Hall Inc. (1993).
[6] B.E. Brodsky and B.S. Darkhovsky, Nonparametric methods in change-point problems. Kluwer Academic Publishers Group (1993).
[7] E. Carlstein, H.-G. Müller and D. Siegmund, Change-point problems. Institute of Mathematical Statistics, Hayward, CA (1994). Papers from the AMS-IMS-SIAM Summer Research Conference held at Mt. Holyoke College, South Hadley, MA, July 11-16, 1992.
[8] D. Dacunha-Castelle and E. Gassiat, The estimation of the order of a mixture model. Bernoulli 3 (1997) 279-299.
[9] J. Dedecker, Exponential inequalities and functional central limit theorems for random fields. ESAIM P&S 5 (2001) 77.
[10] P. Doukhan, Mixing: Properties and examples. Springer-Verlag, New York (1994).
[11] M. Lavielle, On the use of penalized contrasts for solving inverse problems. Application to the DDC (Detection of Divers Changes) problem (submitted).
[12] M. Lavielle, Detection of multiple changes in a sequence of dependent variables. Stochastic Process. Appl. 83 (1999) 79-102.
[13] M.
Lavielle and E. Lebarbier, An application of MCMC methods for the multiple change-points problem. Signal Process. 81 (2001) 39-53.
[14] M. Lavielle and C. Ludeña, The multiple change-points problem for the spectral distribution. Bernoulli 6 (2000) 845-869.
[15] M. Lavielle and E. Moulines, Least-squares estimation of an unknown number of shifts in a time series. J. Time Ser. Anal. 21 (2000) 33-59.
[16] G. Lugosi, Lectures on statistical learning theory. Presented at the Garchy Seminar on Mathematical Statistics and Applications, available at http://www.econ.upf.es/~lugosi (2000).
[17] E. Mammen and A.B. Tsybakov, Asymptotical minimax recovery of sets with smooth boundaries. Ann. Statist. 23 (1995) 502-524.
[18] P. Massart, Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. (6) 9 (2000) 245-303.
[19] F. Móricz, A general moment inequality for the maximum of the rectangular partial sums of multiple series. Acta Math. Hungar. 41 (1983) 337-346.
[20] F.A. Móricz, R.J. Serfling and W.F. Stout, Moment and probability bounds with quasisuperadditive structure for the maximum partial sum. Ann. Probab. 10 (1982) 1032-1040.
[21] V.V. Petrov, Limit theorems of probability theory: Sequences of independent random variables. The Clarendon Press, Oxford University Press, New York (1995). Oxford Science Publications.
[22] E. Rio, Théorie asymptotique des processus aléatoires faiblement dépendants. Springer (2000).
[23] G. Schwarz, Estimating the dimension of a model. Ann. Statist. 6 (1978) 461-464.
[24] R.J. Serfling, Contributions to central limit theory for dependent variables. Ann. Math. Statist. 39 (1968) 1158-1175.
[25] M. Talagrand, New concentration inequalities in product spaces. Invent. Math. 126 (1996) 505-563.
[26] A.W. van der Vaart, Asymptotic statistics. Cambridge University Press (1998).
[27] A.W.
van <strong>de</strong>r Vaart and J.A. Wellner, Weak convergence and empirical processes. Springer-Verlag, New York (1996). Withapplications to statistics.[28] V.N. Vapnik, Statistical learning theory. John Wiley & Sons Inc., New York (1998).[29] Y.-C. Yao, Estimating the number of change-points via Schwarz’s criterion. Statist. Probab. Lett. 6 (1988) 181-189.


The Annals of Statistics
2006, Vol. 34, No. 3, 1166–1203
DOI: 10.1214/009053606000000344
© Institute of Mathematical Statistics, 2006

TESTING THE ORDER OF A MODEL

BY ANTOINE CHAMBAZ

Université René Descartes

This paper deals with order identification for nested models in the i.i.d. framework. We study the asymptotic efficiency of two generalized likelihood ratio tests of the order. They are based on two estimators which are proved to be strongly consistent. A version of Stein's lemma yields an optimal underestimation error exponent. The lemma also implies that the overestimation error exponent is necessarily trivial. Our tests admit nontrivial underestimation error exponents. The optimal underestimation error exponent is achieved in some situations. The overestimation error can decay exponentially with respect to a positive power of the number of observations.

These results are proved under mild assumptions by relating the underestimation (resp. overestimation) error to large (resp. moderate) deviations of the log-likelihood process. In particular, it is not necessary that the classical Cramér condition be satisfied; namely, the log-densities are not required to admit every exponential moment. Three benchmark examples with specific difficulties (location mixture of normal distributions, abrupt changes and various regressions) are detailed so as to illustrate the generality of our results.

1. Introduction. This paper is devoted to order identification problems in the independent and identically distributed (i.i.d.) framework.
It fits in the general setting of model selection initiated by the seminal papers of Mallows [35], Akaike [1], Rissanen [38] and Schwarz [41]. Order identification deals with the estimation and test of a structural parameter which indexes the complexity of the common distribution of the observations. The purpose is to derive some new consistency and efficiency results. Order identification applies, for instance, to mixture models [42], where the order is (loosely speaking) the number of populations. Another example of application is abrupt changes models, where the order is (roughly) the number of changes. It will be argued below that this example conveniently models a medical problem in which the order is the number of distinct levels of expression of a disease.

1.1. Description of the problem. We observe n i.i.d. random variables $Z_1,\dots,Z_n$ with values in a measurable sample space $(\mathcal{Z},\mathcal{F})$ ($\mathcal{Z}$ is Polish). These observations are defined on a common measurable space upon which all the random variables will be defined.

Received May 2003; revised August 2005.
AMS 2000 subject classifications. 60F10, 60G57, 62C99, 62F03, 62F05, 62F12.
Key words and phrases. Abrupt changes, empirical processes, error exponents, hypothesis testing, large deviations, mixtures, model selection, moderate deviations, order estimation.

The distribution $P^\star$ of $Z_1$ may belong to one model in the increasing family $\{\mathcal{M}_K\}_{K\ge1}$ of nested models.
Here, each $\mathcal{M}_K$ is a parametric collection of probability distributions which are absolutely continuous with respect to the same measure µ,
$$\mathcal{M}_K=\{P_\theta:\theta\in\Theta_K\}\subset\mathcal{M}_{K+1},$$
where $\{(\Theta_K,d_K)\}_{K\ge1}$ is an increasing family of nested metric parameter sets. In this paper $d_K$ will be abbreviated to d.

The integer K is called the order of the model $\mathcal{M}_K$. It is also the order of any $P_\theta\in\mathcal{M}_K\setminus\mathcal{M}_{K-1}$ (with the convention $\mathcal{M}_0=\varnothing$). The order of $P^\star$ is denoted by $K^\star$. It is infinite whenever $P^\star$ does not belong to $\mathcal{M}_\infty=\bigcup_{K\ge1}\mathcal{M}_K$.

The central problem of this paper is an issue of composite hypothesis testing: we want to decide between the null hypothesis "$K^\star\le K_0$" and its alternative "$K^\star>K_0$" (for some integer $K_0$), that is, to test
"$P^\star\in\mathcal{M}_{K_0}$" against "$P^\star\notin\mathcal{M}_{K_0}$."
This question is obviously crucial when the order is the quantity of interest. Furthermore, order identification may also be a prerequisite to consistent parameter estimation, when overestimation of the order causes loss of identifiability.

1.2. Consistency and efficiency issues. Let $\alpha_n$ and $\beta_n$ denote the type I and type II errors of a procedure that tests the hypotheses above. This procedure is consistent if $\alpha_n$ and $\beta_n$ converge to zero as n tends to infinity. Its efficiency is measured in terms of rates of convergence of $\alpha_n$ and $\beta_n$ to zero.

In classical statistical theory, a standard Neyman–Pearson procedure tests two simple hypotheses by comparing the log-likelihoods at each of them to a constant threshold.
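To illustrate the exponential decay of error probabilities for such a threshold test, consider the toy problem of testing the mean of i.i.d. N(µ,1) observations with the sample-mean rule (all numerical choices below are ours, for illustration only): the type I error is a Gaussian tail probability, and $n^{-1}\log\alpha_n$ approaches the large-deviation exponent $-(c-\mu_0)^2/2$.

```python
import math

def log_type_I_error(n, mu0, c):
    # log P_{mu0}(sample mean of n iid N(mu0, 1) exceeds c):
    # the sample mean is N(mu0, 1/n), so this is a Gaussian tail,
    # computed via the complementary error function
    z = math.sqrt(n) * (c - mu0)
    return math.log(0.5 * math.erfc(z / math.sqrt(2.0)))

mu0, c = 0.0, 0.5  # null mean and test threshold (illustrative)
rates = [log_type_I_error(n, mu0, c) / n for n in (100, 400, 1600)]
# the rates increase toward the error exponent -(c - mu0)**2 / 2 = -0.125
```

The same computation with the roles of the hypotheses exchanged gives the type II error, so both errors decay exponentially, which is the phenomenon quantified next in the text.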
Now, it is known [10] that this procedure satisfies
$$\limsup_{n\to\infty} n^{-1}\log\alpha_n<0\quad\text{and}\quad\limsup_{n\to\infty} n^{-1}\log\beta_n<0.$$
It is consequently natural, when investigating the efficiency of an order testing procedure, to study whether the rates of convergence are exponential with respect to n or not.

Two generalized likelihood ratio test procedures based on two different estimators of $K^\star$ will be studied here. Obviously, if $\tilde K_n$ estimates $K^\star$, then the natural rule is to reject the null hypothesis if $\tilde K_n>K_0$. Then
$$\alpha_n\le P^\star\{\tilde K_n>K^\star\}\quad\text{and}\quad\beta_n\le P^\star\{\tilde K_n<K^\star\}.$$


1168 A. <strong>CHAMBAZ</strong>2. Can e u > 0ore o > 0 (the un<strong>de</strong>restimation and overestimation error exponents,resp.) be found such thatorlim supn→∞ n−1 logP ⋆ { ˜K n K ⋆ }≤−e o ?If so, can the error exponents e u or e o be arbitrarily large? If not, what happensat a subexponential rate, that is, when replacing the factor n −1 by a factor= o(1), with v n = o(n)?v −1nThe consistency issue 1 has been studied for two <strong>de</strong>ca<strong>de</strong>s. The interest in theefficiency issue 2 is more recent. By formulating the efficiency issue this way,we adopt the error exponent perspective of the information theory literature [13].This notion of efficiency is asymptotic, as are all our results. It is connected toother notions of asymptotic efficiency, among which is Bahadur efficiency [3].The latter is usually <strong>de</strong>rived from large <strong>de</strong>viations results. In the following, theun<strong>de</strong>restimation (resp. overestimation) error will similarly be related to large (resp.mo<strong>de</strong>rate) <strong>de</strong>viations of the log-likelihood process.1.3. Results in perspective. Pioneering results about or<strong>de</strong>r i<strong>de</strong>ntification oftime series can be found in [2]. Strong consistency of the same or<strong>de</strong>r estimator inautoregressive mo<strong>de</strong>ls is shown in [24]and[26]. The test of the or<strong>de</strong>r of an ARMAprocess is addressed in [16]. Error exponents for autoregressive or<strong>de</strong>r testing areinvestigated in [8].Consistent estimation of the or<strong>de</strong>r of a mixture mo<strong>de</strong>l is at stake in [15, 21, 27,29, 30, 34]. Efficiency issues are addressed in [15]. Also, [16] is concerned withthe test of the or<strong>de</strong>r of a mixture.Or<strong>de</strong>r estimation in exponential mo<strong>de</strong>ls is studied in [25]. 
The rates of un<strong>de</strong>restimationand overestimation of two estimators of the or<strong>de</strong>r are investigated in [25](for exponential mo<strong>de</strong>ls), [31] (for regular mo<strong>de</strong>ls) and in [23] (for mo<strong>de</strong>ls characterizedby the existence of an exhaustive finite-dimensional statistic).The problem of or<strong>de</strong>r i<strong>de</strong>ntification in Markov mo<strong>de</strong>ls on a finite alphabet mustbe mentioned too. Some important papers are [14, 12] (they give insight into theconsistency issue for some classical or<strong>de</strong>r estimators) and also [20, 22] (whereoptimal un<strong>de</strong>restimation error exponents are obtained for the same classical or<strong>de</strong>restimators). A more comprehensive presentation of or<strong>de</strong>r i<strong>de</strong>ntification in Markovmo<strong>de</strong>ls can be found in [7].A new method for new results. In most previous work the choice of the frameworkis contingent on the need for tractable explicit calculus. In this paper weshall resort to general properties of empirical processes. Our approach yields severalnew results that hold un<strong>de</strong>r mild assumptions.TESTING THE ORDER OF A MODEL 1169In particular, our test procedures admit nontrivial un<strong>de</strong>restimation error exponents.Besi<strong>de</strong>s, one of them has an optimal un<strong>de</strong>restimation error exponent in somesituations. Any test procedure based on a consistent estimator is proved to admit anecessarily trivial overestimation error exponent. The overestimation probabilitiesof our procedures can <strong>de</strong>cay exponentially fast with respect to a positive powerof n.More <strong>de</strong>tails follow.Benchmark examples. Let us introduce very briefly our three benchmark examples.Their presentation is merely sketched here, including the results obtainedby applying our main general results. 
A whole section will be devoted to the detailed study of the examples.

Let σ denote a known positive number.

• Location mixture example (LM): this is a notoriously difficult problem in the order identification literature (see the references cited above). In this model, one observes
$$Z_i=X_i+\sigma e_i\qquad(i=1,\dots,n),$$
where $X_1,\dots,X_n$ are i.i.d. hidden (i.e., not observed) random variables with a common distribution of finite support $\{m_1,\dots,m_{K^\star}\}$, and $e_1,\dots,e_n$ are i.i.d. and independent from $X_1,\dots,X_n$, with centered Gaussian distribution of variance 1. The goal is to estimate $K^\star$.

Applying the main general results of this paper will imply the following:
1. Our two estimators of $K^\star$ are consistent.
2. Their underestimation error exponents are nontrivial and bounded by a number which depends on squared distances between $P^\star$ and $\mathcal{M}_K$, $K=1,\dots,K^\star-1$. Their overestimation error exponents are trivial but their overestimation probabilities decay exponentially fast with respect to a positive power of n.

These results are new for maximum likelihood procedures.

• Abrupt changes example (AC): this example is original in the order identification literature. In this model one observes
$$Y_i=f^\star(X_i)+\sigma e_i\qquad(i=1,\dots,n),$$
where $X_1,\dots,X_n$ are i.i.d. on a subset of $\mathbb{R}^q$ ($q\ge2$); $e_1,\dots,e_n$ are i.i.d. and independent of $X_1,\dots,X_n$, with centered Gaussian distribution of variance 1, and the function $f^\star$ is piecewise constant. Loosely speaking, the goal is to estimate a minimal number of domains on which $f^\star$ is constant.

In virtue of the general results of this paper, the following new results hold ("almost surely" abbreviates to "a.s."):
1.
P ⋆ -a.s., our estimators are greater than or equal to K ⋆ eventually.


1170 A. <strong>CHAMBAZ</strong>2. Our tests admit nontrivial un<strong>de</strong>restimation error exponents. Their overestimationerror exponents are necessarily trivial.• Various regression examples (VR): let {t k } k≥1 be an orthonormal system inL 2 ([0, 1]). In this mo<strong>de</strong>l one observesY i = f ⋆ (X i ) + σe i(i = 1,...,n),where X 1 ,...,X n are i.i.d., uniformly distributed on [0, 1], e 1 ,...,e n are i.i.d.and in<strong>de</strong>pen<strong>de</strong>nt of X 1 ,...,X n , with centered Gaussian distribution of variance1, and f ⋆ = ∑ K ⋆k=1 θ kt k with θ K ⋆ ≠ 0. The goal is to estimate K ⋆ .As a consequence of the main general results of this paper, the followingresults are obtained:1. Our two estimators of K ⋆ are consistent.2. Their un<strong>de</strong>restimation error exponents are nontrivial, and one of themachieves optimality. Their overestimation error exponents are necessarilytrivial, but their overestimation probabilities <strong>de</strong>cay exponentially fast withrespect to a positive power of n.In particular, the optimality of one of the un<strong>de</strong>restimation error exponentsis a new result.1.4. Organization of the paper. In Section 2 some notation prece<strong>de</strong>s the <strong>de</strong>finitionof the or<strong>de</strong>r estimators studied here. The basic assumptions are stated.Moreover, two limit theorems for the log-likelihood process which will play acentral role are recalled. The consistency results are stated and commented on inSection 3. The most conclusive part is Section 4. It is <strong>de</strong>voted to the statement ofthe efficiency results and comments. The application of our general results to thebenchmark examples is addressed in <strong>de</strong>tail in Section 5. The proofs are postponedto the Appendix.2. Notation and preliminaries. The integral ∫ fdλof a function f with respectto a measure λ will be written as λf . 
Besi<strong>de</strong>s, all the expressions involvingextrema and empirical processes will be assumed measurable.2.1. Two maximum penalized likelihood estimators. Let p θ <strong>de</strong>note the <strong>de</strong>nsityof P θ with respect to µ and l θ = logp θ (for all θ ∈ ∞ = ⋃ K≥1 K ). P ⋆ issupposed to be absolutely continuous with respect to µ without loss of generality.Its <strong>de</strong>nsity is <strong>de</strong>noted by p ⋆ and we set l ⋆ = logp ⋆ .IfP ⋆ ∈ K ⋆ \ K ⋆ −1, thenP ⋆ = P θ ⋆ for θ ⋆ ∈ K ⋆ \ K ⋆ −1.The log-likelihood l n of the observations isn∑l n (θ) = l θ (Z i ) (every θ ∈ ∞ ).i=1TESTING THE ORDER OF A MODEL 1171The penalized maximum likelihood criterion for the mo<strong>de</strong>l K is written ascrit(n,K) = supθ∈ Kl n (θ) − pen(n,K),where pen is a positive penalty function. It yields the two estimators of the or<strong>de</strong>rstudied in this paper,̂K L n̂K G n= inf{K ≥ 1:crit(n,K) ≥ crit(n,K + 1)},= inf arg sup{crit(n,K)}≥ ̂K n L .K≥1̂K n G is a global (hence, the G in its name) maximizer of the criterion. ̂K n G alwaysbounds from above ̂K n L , the first local (hence, the L) maximizer of the same criterion.Note that the computation of these estimators is a less <strong>de</strong>manding algorithmictask for ̂K n L than for ̂K n G.Comment. A prior bound K max for K ⋆ will be assumed known when studyingthe overestimation properties of ̂K n G . In<strong>de</strong>ed, we cannot control its overestimationprobability when infinitely many mo<strong>de</strong>ls are involved. This assumption is commonin the or<strong>de</strong>r i<strong>de</strong>ntification literature [2, 7, 20, 21, 23–25, 30, 31].On the one hand, there are situations where assuming the existence of K max ismandatory. 
It is, for instance, proven in [14] that some classical (minimum description length) order estimators are not consistent when no upper bound on the true order is known a priori: they fail to recover the true order 0 of a uniformly distributed i.i.d. sequence on a finite alphabet A, when $\mathcal{M}_K$ is the set of all Markov chains of order at most K. On the other hand, it is also shown in the same paper that the so-called Bayesian information criterion (BIC) order estimator is consistent when no upper bound is known a priori. It is thus particularly interesting that the study of the properties of $\hat K_n^L$ does not require a prior bound for $K^\star$.

Now, it must be emphasized that our asymptotic study of the problem does not allow us to obtain conditions on the dependence of pen(n,K) on K. In contrast, the former BIC order estimator studied by Csiszár and Shields [14] corresponds to $\mathrm{pen}(n,K)=\frac12|A|^K(|A|-1)\log n$. It is believed that this is a minimal penalty. In [22] the dependence on K of the penalty function is also made precise (but the penalty is certainly not minimal, according to the authors).

The dependence of pen(n,K) on K could be investigated through risk bounds for maximum log-likelihood [6, 36] (in the testing framework of this paper, the chosen loss function is $K\mapsto 1\{K\ne K^\star\}$). However, this would at present require some restrictive assumptions. For instance, exact asymptotic risk bounds are yet out of reach for a mixture of Gaussian distributions. Furthermore, exact asymptotic bounds are not enough in overestimation, when we have to deal with infinitely many models [12].
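A minimal sketch of the two estimators of Section 2.1, assuming the values crit(n,K) have already been computed for $K=1,\dots,K_{\max}$ (the function name and inputs are our own illustrative choices):

```python
def order_estimators(crit):
    """crit[K-1] holds crit(n, K) for K = 1, ..., Kmax."""
    Kmax = len(crit)
    # K^L_n: smallest K with crit(n, K) >= crit(n, K+1), i.e. the
    # first local maximizer of the criterion
    K_L = next((K for K in range(1, Kmax)
                if crit[K - 1] >= crit[K]), Kmax)
    # K^G_n: smallest global maximizer of the criterion
    best = max(crit)
    K_G = next(K for K in range(1, Kmax + 1) if crit[K - 1] == best)
    return K_L, K_G
```

As stated in the text, $\hat K_n^G\ge\hat K_n^L$ always holds, and $\hat K_n^L$ needs only the first downward increment of the criterion, hence its lighter computational cost.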


1172 A. <strong>CHAMBAZ</strong>2.2. Basic assumptions. Let us <strong>de</strong>note by H(P|Q) = P log dP/dQ if P ≪ Q,H(P|Q) =∞otherwise, the relative entropy of P with respect to Q. Asurveyof the relative entropy properties can be found, for instance, in [19]. If is asubset of M 1 (Z) [the set of all the probability measures on (Z,F )], the infimumof H(P|Q) for P (resp. Q) ranging through will be <strong>de</strong>noted by H(|Q)[resp. H(P|)].The following assumptions will be nee<strong>de</strong>d throughout this paper:A1. Compactness assumption. For all K ≥ 1, the parameter sets ( K ,d)are compactmetric sets and the mo<strong>de</strong>ls K are compact for the weak topology on thespace M 1 (Z).A2. Parameterization assumption. The parameterization θ ↦→ l θ (z) from K to Ris continuous for all z ∈ Z and K ≥ 1.A3. Bracket assumption. There exist l,u∈ R Z such that (u − l)∈ L 1 (P ⋆ ) andl ≤ l ⋆ ≤ u and l ≤ l θ ≤ u (all θ ∈ ∞ ).A4. Penalty assumption.pen(n,·) is an increasing function for all n ≥ 1.pen(n,K) →∞as n →∞and pen(n,K) = o(n) for all K ≥ 1.The continuous parameterization assumption A2 is standard in statistics (see,e.g., [43]). Assumption A3 is called “bracket assumption” after the <strong>de</strong>finition ofthe bracket [l,u] (which is the set of all functions f with l ≤ f ≤ u). It is alsostandard in the literature to invoke A3 when empirical processes are involved [43].Another standard assumption in this setting is the boun<strong>de</strong>dness of the parameterset. Assumption A1 is slightly stronger (at least when the parameter set is finitedimensional,by virtue of the Heine–Borel theorem, A2 and Lévy’s continuity theorem).Assumption A4 is the minimum requirement for a penalty function. Finally,it is worth noting that A3 implies that H(P ⋆ |P θ ) is finite for all θ ∈ ∞ .2.3. Large and mo<strong>de</strong>rate <strong>de</strong>viation of the log-likelihood process. 
It is shown in Section 4, which is devoted to efficiency issues, that underestimation can be related to large deviations of the log-likelihood process, while overestimation can be related to moderate deviations of the latter. Large and moderate deviations of the log-likelihood process both describe the limiting behavior of the empirical measure $P_n=n^{-1}\sum_{i=1}^n\delta_{Z_i}$ ($\delta_z$ denotes the Dirac measure at z) on rare events as n goes to infinity. Let us state the principles we shall need (their lower bounds are omitted).

Extended Sanov theorem [32]. Let τ be given by $\tau(s)=\exp(|s|)-|s|-1$ (all $s\in\mathbb{R}$). Consider the classes
$$L_\tau(P^\star)=\{f\in\mathbb{R}^{\mathcal{Z}}:\exists a>0,\ P^\star\tau(f/a)<\infty\},\qquad(1)$$
$$M_\tau(P^\star)=\{f\in\mathbb{R}^{\mathcal{Z}}:\forall a>0,\ P^\star\tau(f/a)<\infty\}.\qquad(2)$$
Equipped with the norm $\|f\|=\inf\{a>0:P^\star\tau(f/a)\le1\}$ (all $f\in L_\tau$), $L_\tau(P^\star)$ is a Banach space. Its topological dual is denoted by $L'_\tau(P^\star)$. In this paper we shall be particularly interested in the set
$$\mathcal{Q}=\{Q\in L'_\tau(P^\star):Q\ge0,\ Q1=1\}\cup\mathcal{P},$$
where $\mathcal{P}=\{p^{-1}\sum_{i=1}^p\delta_{z_i}:p\ge1,\ z_1,\dots,z_p\in\mathcal{Z}\}$. It is equipped with the coarsest topology that makes the linear forms $Q\mapsto Qf$ continuous for every $f\in L_\tau(P^\star)$, and with the coarsest σ-field that makes them measurable. It is worth noting that $P_n\in\mathcal{P}$; hence the need for $\mathcal{P}$.

By definition, $Q\in\mathcal{Q}$ is $P^\star$-singular if there exists a sequence $\{A_p\}$ of measurable sets such that $Q1\{A_p^c\}=0$ for all $p\ge1$, while $\lim_{p\to\infty}P^\star(A_p)=0$. It is known (Theorem 2.3 and Proposition 2.4 in [32]) that:

LEMMA 1. Any $Q\in\mathcal{Q}\cap L'_\tau(P^\star)$ is uniquely decomposed into the sum $Q=Q^a+Q^s$, where $Q^a\in L'_\tau(P^\star)$ is a probability measure, $Q^a\ll P^\star$, while $Q^s\in L'_\tau(P^\star)$ is $P^\star$-singular and $Q^s\ge0$. Besides, for every $f\in M_\tau(P^\star)$, $Qf=Q^af$.

REMARK 1.
$\mathcal{Q}\cap L'_\tau(P^\star)$ is not a subset of $M_1(\mathcal{Z})$. If $Q\in\mathcal{Q}\cap L'_\tau(P^\star)$, then $P(A)=Q1\{A\}$ (for any measurable set A) does define a probability measure P, which is in fact $Q^a$. Besides, P and Q coincide on $M_\tau(P^\star)$, but may differ on $L_\tau(P^\star)\setminus M_\tau(P^\star)$ ($Q=P=Q^a$ if and only if $Q^s=0$).

Let us finally introduce the nonnegative function I (the extended relative entropy) defined for any $Q=Q^a+Q^s\in\mathcal{Q}\cap L'_\tau(P^\star)$ by
$$I(Q)=H(Q^a|P^\star)+\sup\{Q^sf:f\in L_\tau(P^\star),\ P^\star\exp(f)<\infty\},$$
and $I(Q)=\infty$ for the remaining $Q\in\mathcal{Q}$ (which lie in $\mathcal{P}$). It particularly satisfies the following:

LEMMA 2. For every $Q\in\mathcal{Q}$, $I(Q)\ge0$, with equality if and only if $Q=P^\star$.

Theorem 3.2 in [32] encompasses the following result.


1174 A. <strong>CHAMBAZ</strong>THEOREM 1[32]. The function I is a convex, lower semicontinuous mappingfrom Q to [0,∞]. Its level sets {Q ∈ Q :I(Q)≤ α} are compact for all α>0.Moreover, for any measurable S ⊂ Q [with closure cl(S)],REMARK 2.its use, though:lim supn→∞ n−1 logP ⋆ {P n ∈ S}≤− infQ∈cl(S) I(Q).Theorem 1 requires an involved setting. Three reasons motivate• A classical Sanov theorem on M 1 (Z) would be insufficient here. In<strong>de</strong>ed, when<strong>de</strong>aling with the un<strong>de</strong>restimation rate, our proofs require that the linear formsQ ↦→ Ql θ be continuous on Q (any θ ∈ ∞ ), while possibly l θ ∈ L τ (P ⋆ ) \M τ (P ⋆ ).Now,Schied[40] has shown that the extension of a Sanov theoremon M 1 (Z) to a topology on M 1 (Z) that makes the linear form Q ↦→ Qf continuouson Q for some f ∈ L τ (P ⋆ ) is possible if and only if f ∈ M τ (P ⋆ ) (this isthe classical Cramér condition).• Provi<strong>de</strong>d the need that Q ↦→ Qf be continuous on Q for various f ∈ L τ (P ⋆ ) \M τ (P ⋆ ), the topology on Q introduced above is the natural one.• The simpler relative entropy rate function I ′ (Q) = H(Q a |P ⋆ ) for Q = Q a +Q s ∈ Q∩L ′ τ (P ⋆ ), I ′ (Q) =∞otherwise, does not have compact level sets (thisis also a consequence of [40]). This would be a major drawback in our schemeof proof.Mo<strong>de</strong>rate <strong>de</strong>viations of P n [44]. Let G <strong>de</strong>note a subclass of L 2 (P ⋆ ) with envelopeG ∈ R Z [i.e., |g(z)|≤G(z) for all g ∈ G and z ∈ Z].Let l ∞ (G) be the collection of all boun<strong>de</strong>d functions b ∈ R G . The uniformnorm ‖·‖ G <strong>de</strong>fined by ‖b‖ G = sup g∈G |b(g)| induces a topology and a σ -fieldon l ∞ (G).Let us <strong>de</strong>note by M 0 (Z) the space of all signed measures Q on (Z,F ) thatsatisfy Q1 = 0, sup g∈G |Qg| < ∞ and Q ≪ P ⋆ (the <strong>de</strong>rivative dQ/dP ⋆ is <strong>de</strong>notedby q). 
One observes that, for any $Q \in M_0(\mathcal{Z})$, $Q^\infty g = Qg$ (all $g \in \mathcal{G}$) defines an element of $l^\infty(\mathcal{G})$. In particular, $(P_n - P^\star)^\infty$ is a random variable on $l^\infty(\mathcal{G})$ under $P^\star$.

Let us finally introduce the nonnegative function $J$ defined for any $b \in l^\infty(\mathcal{G})$ by
$$J(b) = \inf\left\{\frac{P^\star q^2}{2} : Q \in M_0(\mathcal{Z}),\ Q^\infty = b\right\}$$
(with the convention $\inf \varnothing = +\infty$).

THEOREM 2 [44]. Let $\{v_n\}$ be an increasing sequence of positive numbers such that $v_n = o(n)$, $n \log n = o(v_n^2)$. Let us assume that there exist $A \ge 1$, $\delta \in (0,1)$ such that, for every $k, n \ge 1$,
$$v_{nk} \le A k^{1-\delta} v_n.$$

TESTING THE ORDER OF A MODEL 1175

If $\mathcal{G}$ is $P^\star$-Donsker and $G \in L_\tau(P^\star)$, then for any $S \subset (l^\infty(\mathcal{G}), \|\cdot\|_{\mathcal{G}})$,
$$\limsup_{n\to\infty} (v_n^2/n)^{-1} \log P^\star\{n v_n^{-1}(P_n - P^\star)^\infty \in S\} \le -\inf_{b \in \mathrm{cl}(S)} J(b).$$

This theorem is a straightforward corollary of Theorem 5 in [44] (for a recent account of the $P^\star$-Donsker property, see [43]).

3. Consistency issue. The statements of our three results of consistency are gathered here. These results are rather routine. However, the resort to empirical process arguments allows us to achieve great generality. We refer to Section 5 for examples of application and comparison with previous consistency results in each benchmark framework.

From now on, $\mathrm{Log}$ denotes the truncated log, that is, $\mathrm{Log}(x) = \log(x \vee e)$ (all $x \in \mathbb{R}$). The function $\varphi$ is defined by $\varphi(x) = x^2/\mathrm{Log}\,\mathrm{Log}(x)$ (all $x \in \mathbb{R}$). Besides, let us introduce the classes of functions
$$\mathcal{G}^a_K = \{g_\theta = (l_\theta - l^\star) : \theta \in \Theta_K\} \quad (\text{every } K \ge 1). \tag{4}$$

THEOREM 3. Let $P^\star$ belong to $\mathcal{M}_{K^\star} \setminus \mathcal{M}_{K^\star-1}$. Suppose that $\varphi(u - l) \in L_1(P^\star)$ and that the penalty function satisfies
$$\liminf_{n\to\infty} \frac{\mathrm{pen}(n, K+1)}{\mathrm{pen}(n, K)} > 1 \quad\text{and}\quad \limsup_{n\to\infty} \frac{(n \log\log n)^{1/2}}{\mathrm{pen}(n, K)} = 0 \quad (\text{any } K \ge 1).$$
• If $P^\star \notin \mathcal{M}_K$ implies $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$ (any $K \ge 1$), then $\widehat{K}^L_n$ is strongly consistent: $\widehat{K}^L_n = K^\star$ eventually, $P^\star$-a.s.
• If $K^\star \le K_{\max}$, then $\widehat{K}^G_n$ is strongly consistent.
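The penalized maximum-likelihood criterion behind Theorem 3 can be sketched numerically in the location-mixture setting studied in Section 5: maximize $\sup_{\theta \in \Theta_K} l_n(\theta) - \mathrm{pen}(n, K)$ over candidate orders $K$. The sketch below is illustrative only: the EM routine, the candidate range and the penalty $\mathrm{pen}(n, K) = \dim(\Theta_K)\, n^{2/3}$ (one admissible choice, since it dominates $(n \log\log n)^{1/2}$ and its successive ratios exceed 1) are our own assumptions, not taken from the paper.

```python
import math
import random

def em_location_mixture(z, K, sigma=1.0, n_iter=100):
    """Plain EM for a K-component Gaussian location mixture with known sigma.
    Returns the (approximately) maximized log-likelihood sup_theta l_n(theta)."""
    n = len(z)
    zs = sorted(z)
    m = [zs[(2 * k + 1) * n // (2 * K)] for k in range(K)]  # quantile initialization
    w = [1.0 / K] * K
    for _ in range(n_iter):
        resp = []  # E-step: posterior component probabilities
        for x in z:
            d = [w[k] * math.exp(-0.5 * ((x - m[k]) / sigma) ** 2) for k in range(K)]
            s = sum(d)
            resp.append([dk / s for dk in d])
        for k in range(K):  # M-step: reweight and recenter each component
            nk = max(sum(r[k] for r in resp), 1e-9)
            w[k] = nk / n
            m[k] = sum(r[k] * x for r, x in zip(resp, z)) / nk
    c = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(c + math.log(sum(w[k] * math.exp(-0.5 * ((x - m[k]) / sigma) ** 2)
                                for k in range(K))) for x in z)

def select_order(z, K_max=4, sigma=1.0):
    """K_hat = argmax_K { sup_theta l_n(theta) - pen(n, K) } with the assumed
    penalty pen(n, K) = dim(Theta_K) * n^(2/3), dim(Theta_K) = 2K - 1."""
    n = len(z)
    crit = {K: em_location_mixture(z, K, sigma) - (2 * K - 1) * n ** (2.0 / 3.0)
            for K in range(1, K_max + 1)}
    return max(crit, key=crit.get)

random.seed(0)
# Simulate Z_i = X_i + sigma*e_i with X_i ~ 0.5*delta_0 + 0.5*delta_10: true order 2.
z = [(0.0 if random.random() < 0.5 else 10.0) + random.gauss(0.0, 1.0)
     for _ in range(500)]
K_hat = select_order(z)
```

With this degree of separation, the likelihood gain from one to two components (of order $n$ times the relative entropy) dwarfs the penalty increment, while the gain from extra components does not, so the criterion settles on the true order.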


We emphasize that the condition on the penalty function in Theorem 3 excludes BIC-like expressions $\mathrm{pen}(n, K) = \frac{1}{2}\dim(\Theta_K)\log n$. This can be overcome, as shown in Theorem 4, by resorting to an example of a "peeling device" (see Appendix A). To this end, substitutes for the $\mathcal{G}^a_K$ classes are introduced, namely,
$$\mathcal{G}^b_K = \left\{g_\theta = \frac{l_\theta - l^\star}{H(\theta)^{1/2}} : \theta \in \Theta_K,\ H(\theta) > 0\right\} \quad (\text{every } K \ge 1), \tag{5}$$
where $H(\theta) = H(P^\star \mid P_\theta)$ for all $\theta \in \Theta_\infty$.

THEOREM 4. Let $P^\star$ belong to $\mathcal{M}_{K^\star} \setminus \mathcal{M}_{K^\star-1}$. Suppose that $\varphi(u - l) \in L_1(P^\star)$ and that the penalty function satisfies
$$\liminf_{n\to\infty} \frac{\mathrm{pen}(n, K+1)}{\mathrm{pen}(n, K)} > 1 \quad\text{and}\quad \limsup_{n\to\infty} \frac{\log\log n}{\mathrm{pen}(n, K)} = 0 \quad (\text{any } K \ge 1).$$
• If $P^\star \notin \mathcal{M}_K$ implies $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$ (any $K \ge 1$), then $\widehat{K}^L_n$ is strongly consistent.
• If $K^\star \le K_{\max}$, then $\widehat{K}^G_n$ is strongly consistent.


• The underestimation probability can decay exponentially fast with respect to $n$, and a best possible underestimation error exponent, namely $H(\mathcal{M}_{K^\star-1} \mid P^\star)$, is exhibited.
• The overestimation probability cannot decay exponentially fast with respect to $n$: the overestimation error exponent is necessarily trivial.

Consequently, the main issue is now to prove that $\widehat{K}^L_n$ and $\widehat{K}^G_n$ admit nontrivial underestimation error exponents and to compare those exponents to $H(\mathcal{M}_{K^\star-1} \mid P^\star)$. This will involve large deviations of the log-likelihood process; see Section 4.2. The second issue is to investigate the behavior of the overestimation probabilities. These probabilities will be related to moderate (instead of large) deviations of the latter process; see Section 4.3.

4.2. Underestimation error exponent. Let us introduce for any $\alpha \ge 0$ and $K \ge 1$ the following subsets of $\mathcal{Q}$ ($\Lambda$ stands for Local and $\Gamma$ for Global):
$$\Lambda_{\alpha,K} = \Big\{Q \in \mathcal{Q} : \sup_{\theta \in \Theta_K} Q l_\theta - \sup_{\theta \in \Theta_{K+1}} Q l_\theta \ge -\alpha\Big\}, \tag{8}$$
$$\Gamma_{\alpha,K} = \Big\{Q \in \mathcal{Q} : \sup_{\theta \in \Theta_K} Q l_\theta - \sup_{\theta \in \Theta_{K^\star}} Q l_\theta \ge -\alpha\Big\}. \tag{9}$$
$\{\Gamma_{\alpha,K}\}$ is nondecreasing in $\alpha$ and $K$, and $\{\Lambda_{\alpha,K}\}$ is nondecreasing in $\alpha$. Besides, for every $\alpha \ge 0$ and $K < K^\star$, $\Gamma_{\alpha,K} \subset \Lambda_{\alpha,K}$. The finite sieve condition which serves as assumption (ii) of Theorem 7 reads: for every $Q \in \mathcal{Q}$ and every $\varepsilon > 0$ small enough, there exists a finite subset $T \subset \Theta_K$ such that
$$\forall \theta \in \Theta_K,\ \exists t \in T : |Q l_\theta - Q l_t| \le \varepsilon.$$
• If $P^\star \notin \mathcal{M}_K$ implies $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$ (any $K \ge 1$), then the corresponding underestimation error exponent is nontrivial.


(iii) $(u - l) \in M_\tau(P^\star)$.
(iv) For every $K \le K^\star$ and $z \in \mathcal{Z}$, the functions $\theta \mapsto l_\theta(z)$ are differentiable on the interior of $\Theta_K$, with derivative $\dot{l}_\theta(z)$. Moreover, the coordinates of $\dot{l}_\theta$ are elements of $M_\tau(P^\star)$ and there exists $F \in L_\tau(P^\star)$ such that
$$\big|l_{\theta+h} - l_\theta - \dot{l}_\theta^T h\big| \le F \cdot o(h). \tag{12}$$
• If $P^\star \notin \mathcal{M}_K$ implies $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$ (any $K \ge 1$), then the underestimation error exponent bounds of the theorem hold.


4.3. Overestimation rate. The following theorem provides a first link between the penalization function and the rate of overestimation (which is necessarily slower than exponential in $n$; see Theorem 6) that it yields for $\widehat{K}^L_n$ and $\widehat{K}^G_n$.

THEOREM 10. Let the penalty function be of the form $\mathrm{pen}(n, K) = v_n D(K)$, where $D \in \mathbb{R}^{\mathbb{N}}$ and $\{v_n\}$ increase, $v_n = o(n)$, and for some $A \ge 1$, $\delta \in (0, 1)$, for every $k, n \ge 1$,
$$v_{nk} \le A k^{1-\delta} v_n.$$
Let us also suppose that:
(i) $(u - l) \in L_\tau(P^\star)$, so that the classes $\mathcal{G}^a_K$ [defined in (4)] admit an envelope function in $L_\tau(P^\star)$.
(ii) $n = o(v_n^2)$.
• If $\mathcal{G}^a_{K^\star+1}$ is $P^\star$-Donsker, then
$$\limsup_{n\to\infty} n v_n^{-2} \log P^\star\{\widehat{K}^L_n > K^\star\} < 0. \tag{16}$$
• If $K^\star \le K_{\max}$, and if, moreover, $\mathcal{G}^a_{K_{\max}}$ is $P^\star$-Donsker, then
$$\limsup_{n\to\infty} n v_n^{-2} \log P^\star\{\widehat{K}^G_n > K^\star\} < 0. \tag{17}$$
For instance, $v_n = n^{1-\delta}$, $\delta \in (0, 1/2)$, is an admissible sequence, and Theorem 10 applies to the LM and VR examples.

The resort to the same "peeling device" that allowed the transition from Theorem 3 to Theorem 4 (both devoted to the consistency issue) in Section 3 yields again a relaxed condition on $\{v_n\}$.

THEOREM 11. Let pen be of the form detailed in Theorem 10. Let us also suppose that:
(i) The classes $\mathcal{G}^b_K$ [defined in (5)] admit an envelope function in $L_\tau(P^\star)$.
(ii) $\log n = o(v_n)$.
• If $\mathcal{G}^b_{K^\star+1}$ is $P^\star$-Donsker, then
$$\limsup_{n\to\infty} v_n^{-1} \log P^\star\{\widehat{K}^L_n > K^\star\} < 0. \tag{18}$$
• If $K^\star \le K_{\max}$, and if, moreover, $\mathcal{G}^b_{K_{\max}}$ is $P^\star$-Donsker, then
$$\limsup_{n\to\infty} v_n^{-1} \log P^\star\{\widehat{K}^G_n > K^\star\} < 0. \tag{19}$$
For instance, $v_n = (\log n)^{1+\epsilon}$ ($\epsilon > 0$) is an admissible sequence.

Comment on Theorems 10 and 11. Theorem 10 is the main result on the efficiency issue of overestimation in this paper. It notably relates the phenomenon of overestimation to the moderate deviations of the log-likelihood process. The assumptions of the theorem are rather mild.
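The regularity condition on $\{v_n\}$ in Theorem 10 is easy to check for the admissible choice $v_n = n^{1-\delta}$. The snippet below (purely illustrative, our own sanity check) verifies numerically that this sequence satisfies $v_{nk} \le A k^{1-\delta} v_n$ with $A = 1$, and that $n = o(v_n^2)$ amounts to $\delta < 1/2$:

```python
delta = 0.25  # any delta in (0, 1/2)

def v(n):
    return n ** (1 - delta)

# v_{nk} = (nk)^{1-delta} = k^{1-delta} * v_n, so the condition holds with A = 1
# (a small multiplicative slack absorbs floating-point rounding).
ok = all(v(n * k) <= 1.000001 * (k ** (1 - delta)) * v(n)
         for n in range(1, 100) for k in range(1, 100))

# n = o(v_n^2) since v_n^2 = n^{2(1-delta)} and 2(1-delta) > 1 iff delta < 1/2:
ratio = (10 ** 6) / v(10 ** 6) ** 2  # n / v_n^2 at n = 10^6, already small
```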
This opinion is justified by the fact that the theorem applies to the LM and VR examples. It is worth pointing out that the conditions related to the log-densities $l_\theta$ are expressed in terms of the envelope function and the $P^\star$-Donsker property (and not in terms of exponential moments for $l_\theta$). As explained in Section 5, the AC example is excluded because we do not verify the $P^\star$-Donsker property.

On the contrary, Theorem 11 relies on strong assumptions, particularly assumption (i), which exclude the LM and VR examples. Although the condition on $\{v_n\}$ is relaxed, Theorem 11 does not apply to the BIC-like penalty function $\mathrm{pen}(n, K) = \frac{1}{2}\dim(\Theta_K)\log n$ ($v_n = \log n$). Besides, it is important to note that the choice of $v_n = (\log n)^{1+\varepsilon}$ yields control of the overestimation probability that decays like a negative power of $n$.

We refer to Section 5 for comparison with previous results on the rate of overestimation in the LM and VR benchmark examples (none exists for the AC example). The last paragraph of the comment of Theorem 7 is also relevant here, as a paradigm of the methods based on tractable calculus in finite dimensions.

5. Benchmark examples. This section is devoted to a detailed investigation of our benchmark examples in order to illustrate the collection of results that have been stated in the two previous sections.

5.1. Location mixture example. Let $\sigma$ be a priori known and $\gamma(\cdot; m)$ denote the density of the Gaussian distribution with mean $m$ and variance $\sigma^2$ with respect to the Lebesgue measure $\mu$ on $\mathbb{R}$. Let $M$ be a compact subset of $\mathbb{R}$. Here, $\mathcal{M}_1$ is the set of all Gaussian probability measures with mean $m \in M$ and variance $\sigma^2$, and $\Theta_1 = M$.
For every $\theta \in \Theta_1$, let us define $p_\theta = \gamma(\cdot; \theta)$. Now, for any $K \ge 2$, let us introduce the compact sets
$$\Theta_K = \left\{\theta = (\pi, m) : \pi = (\pi_1, \ldots, \pi_{K-1}) \in \mathbb{R}^{K-1}_+,\ \sum_{k=1}^{K-1} \pi_k \le 1,\ m \in M^K\right\}.$$
Every $\theta \in \Theta_K$ ($K \ge 2$) is associated with a mixing distribution $F_\theta = \sum_{k=1}^{K-1} \pi_k \delta_{m_k} + (1 - \sum_{k=1}^{K-1} \pi_k)\delta_{m_K}$ on $M$ and a probability measure $P_\theta$ with density $p_\theta = \int_M \gamma(\cdot; m)\,dF_\theta(m)$ with respect to $\mu$. For $K \ge 2$, $\mathcal{M}_K = \{P_\theta : \theta \in \Theta_K\}$.

In this setting, one observes
$$Z_i = X_i + \sigma e_i \quad (i = 1, \ldots, n),$$
where $X_1, \ldots, X_n$ are i.i.d. hidden random variables, $e_1, \ldots, e_n$ are i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution of variance 1, and there exists $\theta^\star \in \Theta_{K^\star} \setminus \Theta_{K^\star-1}$ such that $X_1, \ldots, X_n$ have distribution $F_{\theta^\star}$. In this case, $Z_1, \ldots, Z_n$ are i.i.d. and $P^\star$-distributed.
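To fix ideas, the observation scheme $Z_i = X_i + \sigma e_i$ can be simulated directly. The mixing distribution below (two atoms with weights 0.3 and 0.7) and the value of $\sigma$ are hypothetical choices of ours, not taken from the paper; since the noise is centered, the first moment of $Z_i$ equals $\int m\,dF_{\theta^\star}(m)$:

```python
import random

random.seed(1)
sigma = 0.5
n = 100000

# Hypothetical mixing distribution F: P(X = -2) = 0.3, P(X = 3) = 0.7.
z = []
for _ in range(n):
    x = -2.0 if random.random() < 0.3 else 3.0     # hidden X_i ~ F_{theta*}
    z.append(x + sigma * random.gauss(0.0, 1.0))   # Z_i = X_i + sigma * e_i

mean = sum(z) / n                             # targets E[X] = 0.3*(-2) + 0.7*3 = 1.5
var = sum((zi - mean) ** 2 for zi in z) / n   # targets Var(X) + sigma^2 = 5.25 + 0.25
```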


Exploring the assumptions. The compactness assumption A1 is easily verified (by virtue of Lévy's continuity theorem). The continuous parameterization assumption A2 is satisfied. Defining $l = \inf l_\theta$ and $u = \sup l_\theta$ (the suprema range over $\theta \in \Theta_\infty$) ensures $l \le l_\theta \le u$ (all $\theta \in \Theta_\infty$) and $(u - l)^{1+c} \in L_\tau(P^\star)$ for some $c > 0$. Hence, the bracket assumption A3 holds. Now, a slight adaptation of the proof of Lemma 3 in [34] yields the following:

PROPOSITION 1. Let $F$ be a mixing distribution on $M$ (possibly with infinite support) and $P^\star$ have density $p^\star = \int_M \gamma(\cdot; m)\,dF(m)$. In the LM example, if $P^\star \notin \mathcal{M}_K$, then $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$.

In particular, for some $c_1, c_2 > 0$ and $n$ large enough, $P^\star\{\widetilde{K}_n \ne K^\star\} \le c_1 \exp(-c_2 v_n^2/n)$. The corresponding rate is the one of Theorem 10.

As far as we know, our results on efficiency stated in Theorems 6, 7, 8 and 10 are new for our maximum likelihood procedures.

5.2. Abrupt changes example. Let $(\mathcal{X}, \mathcal{B}, P)$ be an open subset of $\mathbb{R}^q$ ($q \ge 2$) equipped with the trace $\mathcal{B}$ of the Borel $\sigma$-field and a probability measure $P \ll \mu$, the Lebesgue measure on $\mathcal{X}$ (with density $dP/d\mu$ denoted by $p$).

Let CP be the set of all countable Caccioppoli partitions of $\mathcal{X}$. It is known that there exists a metric $d$ on CP such that the subset CP$_b$ of all partitions whose "perimeters" are bounded by a fixed constant $b > 0$ is a compact metric space when equipped with $d$. (The definitions and main properties of Caccioppoli partitions can be found in [33].)

A partition is a family $\tau = \{\tau_j\}_{j \ge 1}$ of measurable subsets of $\mathcal{X}$ such that $P(\mathcal{X} \setminus \bigcup_j \tau_j) = 0$, $P(\tau_j \cap \tau_{j'}) = 0$ for every $j \ne j'$ and possibly $P(\tau_j) = 0$. The cardinality of $\tau$ is the number of $j \ge 1$ such that $P(\tau_j) > 0$.
Given a compact set $M$ of $\mathbb{R}$ and $\tau \in$ CP$_b$, it is easy to verify that one can associate $m_j \in M$ with every $\tau_j$, yielding a marked partition $\{(\tau_j, m_j)\}_{j \ge 1}$, then modify the definition of $d$ so that the set of all marked partitions of CP$_b$ is also a compact set when equipped with $d$. It is worth noting that, if $d[(\tau^0, m^0), (\tau^1, m^1)] \le \delta$, then there exists a bijective map $\varphi$ from $I^0 = \{j : P(\tau^0_j) > 0\}$ to $\{j : P(\tau^1_j) > 0\}$ such that $P(\tau^0_j \,\triangle\, \tau^1_{\varphi(j)}) \le \delta$ and $|m^0_j - m^1_{\varphi(j)}| \le \delta$ for every $j \in I^0$ ($\triangle$ denotes the symmetric difference between sets).


In this example, for every $K \ge 1$, $\Theta_K$ is the set of all marked partitions of CP$_b$ with cardinality at most $K$. $(\Theta_K, d)$ is a compact metric space, hence the first half of the compactness assumption A1. For $\sigma$ a priori known, let us denote by $\gamma(\cdot; m)$ the density of the Gaussian distribution with mean $m$ and variance $\sigma^2$, $f_\theta(x) = \sum_{k \ge 1} m_k \mathbf{1}\{x \in \tau_k\}$ and finally $p_\theta(z) = \gamma(y; f_\theta(x))\,p(x)$ [for all $z = (x, y) \in \mathcal{Z} = \mathcal{X} \times \mathbb{R}$, $K \ge 1$ and $\theta \in \Theta_K$]. Let $P_\theta$ have $p_\theta$ for density with respect to $\mu$, then set $\mathcal{M}_K = \{P_\theta : \theta \in \Theta_K\}$ for every $K \ge 1$.

In this setting, one observes $Z_i = (X_i, Y_i)$ with
$$Y_i = f^\star(X_i) + \sigma e_i \quad (i = 1, \ldots, n),$$
where $X_1, \ldots, X_n$ are i.i.d. and $P$-distributed, $e_1, \ldots, e_n$ are i.i.d. and independent of $X_1, \ldots, X_n$, with centered Gaussian distribution of variance 1, and there exists $\theta^\star \in \Theta_{K^\star} \setminus \Theta_{K^\star-1}$ such that $f^\star = f_{\theta^\star}$. In this case, $Z_1, \ldots, Z_n$ are i.i.d. and $P_{\theta^\star}$-distributed.

Exploring the assumptions. Lévy's continuity theorem implies that the second half of A1 is satisfied. Besides, the continuous parameterization assumption A2 is obviously verified. It is easily seen that the bracket assumption A3 holds. Indeed, if one introduces $\underline{f} = \inf f_\theta$ and $\overline{f} = \sup f_\theta$ (the suprema range over $\Theta_\infty$), functions $l, u \in \mathbb{R}^{\mathcal{Z}}$ can be defined such that $(u - l)$ is continuous, $l \le l_\theta \le u$ (all $\theta \in \Theta_\infty$) and $2\sigma^2(u - l)(z) = (\overline{f}^2 + \underline{f}^2)(x) + 2|y|(\overline{f} - \underline{f})(x)$, hence $(u - l)^{1+c} \in L_\tau(P^\star)$ for some $c > 0$.

Furthermore, if the $L_2(P)$-norm is denoted by $\|\cdot\|_2$, then it is worth stressing that, for every $\theta, t \in \Theta_\infty$,
$$H(P_\theta \mid P_t) = \frac{\|f_\theta - f_t\|_2^2}{2\sigma^2}. \tag{20}$$
Using (20) yields (the proof is postponed to Section E.2) the following:

LEMMA 7. In the AC example, if $P^\star \in \mathcal{M}_\infty \setminus \mathcal{M}_K$, then $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$.
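Identity (20) says that the relative entropy between two regression models sharing the same noise level reduces to a scaled squared $L_2(P)$ distance between the step functions. A quick Monte Carlo sanity check (with step functions and a uniform design of our own choosing, purely illustrative) compares the empirical mean of $\log(p_\theta/p_t)(Z)$ under $P_\theta$ with the closed form:

```python
import random

random.seed(2)
sigma = 1.0

# Hypothetical step functions on X = (0, 1) with design P = Uniform(0, 1):
f_theta = lambda x: 1.0 if x < 0.5 else -1.0
f_t = lambda x: 0.5 if x < 0.25 else 2.0

# Closed form (20): H(P_theta | P_t) = ||f_theta - f_t||_2^2 / (2 sigma^2),
# integrating the squared gap over the three pieces of the common refinement.
closed = (0.25 * (1.0 - 0.5) ** 2
          + 0.25 * (1.0 - 2.0) ** 2
          + 0.50 * (-1.0 - 2.0) ** 2) / (2 * sigma ** 2)

# Monte Carlo estimate of E[log p_theta(Z) - log p_t(Z)] under P_theta:
n, acc = 200000, 0.0
for _ in range(n):
    x = random.random()
    y = f_theta(x) + sigma * random.gauss(0.0, 1.0)
    acc += ((y - f_t(x)) ** 2 - (y - f_theta(x)) ** 2) / (2 * sigma ** 2)
mc = acc / n
```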


Exploring the assumptions. The compactness assumption A1 is clearly satisfied (by virtue of Lévy's continuity theorem for $\mathcal{M}_K$). Besides, the continuous parameterization assumption A2 is readily verified. The bracket assumption A3 holds: with $\underline{f} = \inf f_\theta$ and $\overline{f} = \sup f_\theta$ (the suprema range over $\theta \in \Theta_\infty$), $l, u \in \mathbb{R}^{\mathcal{Z}}$ can be defined such that $(u - l)$ is continuous, $l \le l_\theta \le u$ (any $\theta \in \Theta_\infty$) and $2\sigma^2(u - l)(z) = (\overline{f}^2 + \underline{f}^2)(x) + 2|y|(\overline{f} - \underline{f})(x)$, hence $(u - l)^{1+c} \in L_\tau(P^\star)$ for some $c > 0$. We emphasize that equality (20) also holds in this example when $\|\cdot\|_2$ denotes the $L_2([0,1])$ norm. A straightforward consequence follows:

LEMMA 9. In the VR example, if $P^\star \in \mathcal{M}_\infty \setminus \mathcal{M}_K$, then $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$.

Hence the conclusions of Theorem 3 are valid.

As for the efficiency issue of the underestimation rate, the assumptions of Theorems 7, 8 and 9 are satisfied in this example. If $P^\star \in \mathcal{M}_{K^\star} \setminus \mathcal{M}_{K^\star-1}$, it is clear that, for every $\theta \in \Theta_\infty$, $H(P_\theta \mid P^\star)$ is finite, $p_\theta l_\theta \in L_1(\mu)$ and $l_\theta \in L_\tau(P^\star)$ [this is assumption (i) of Theorem 7]. Moreover, following the proof of Lemma 6 yields the following:

LEMMA 10. In the VR example, the finite sieve assumption (ii) of Theorem 7 is satisfied.

Now, it has been already argued that $(u - l)^{1+c} \in L_\tau(P^\star)$, hence $(u - l) \in M_\tau(P^\star)$ and assumption (iii) of Theorem 8 is valid. Furthermore, a crude yet careful application of Taylor's integral remainder theorem yields assumption (iv) of Theorem 8. In conclusion, the models are exponential in the VR example, so Theorem 9 applies.

Concerning the efficiency issue of the overestimation rate, the assumptions of Theorem 10 have been verified in the lines above. In summary, Theorems 3, 6, 7, 8, 9 and 10 apply in the VR example.

Comment. We present this example because it fits in the general framework of order identification in nested exponential models. This important framework has been investigated in [25] and [31] (who actually address the more general case of regular models). In the latter, the authors study the properties of $\widehat{K}^G_n$ (with a prior bound on the true order). They prove its weak consistency. Rates of underestimation and overestimation similar to the ones of Theorems 7 and 10 are obtained. However, the underestimation error exponent is not shown to be at most $H(\mathcal{M}_{K^\star-1} \mid P^\star)$ and, of course, is not compared to it.

Thus, to the best of our knowledge, the results of Theorems 3, 6, 7, 8 and 10 are new in this exponential model framework for $\widehat{K}^L_n$ (which does not require any prior bound on the true order), while the results of Theorems 3, 6 and 9 are new for $\widehat{K}^G_n$.

APPENDIX A: AN EXAMPLE OF THE PEELING DEVICE

The so-called "peeling device" classically allows one to analyze the rate of convergence of M-estimators in nonclassical frameworks. The original idea is due to Huber [28]. Examples may be found, for instance, in [37] for simple proofs of uniform central limit theorems or in [6] (see Proposition 7 therein and the attached comments).

PROPOSITION A.1. Let $K_2 > K_1 \ge K^\star$, the order of $P^\star$. Then both inequalities below hold, the second one providing an example of the peeling technique:
$$\sup_{\theta \in \Theta_{K_2}} |(P_n - P^\star)(l_\theta - l^\star)| \ge \sup_{\theta \in \Theta_{K_2}} P_n l_\theta - \sup_{\theta \in \Theta_{K_1}} P_n l_\theta \tag{A.1}$$
and
$$\left(\sup_{\theta \in \Theta_{K_2}} \left|(P_n - P^\star)\frac{l_\theta - l^\star}{H(\theta)^{1/2}}\right|\right)^2 \ge \sup_{\theta \in \Theta_{K_2}} P_n l_\theta - \sup_{\theta \in \Theta_{K_1}} P_n l_\theta. \tag{A.2}$$

PROOF. Inequality (A.1) is readily proved, since
$$\sup_{\theta \in \Theta_{K_2}} P_n l_\theta - \sup_{\theta \in \Theta_{K_1}} P_n l_\theta \le \sup_{\theta \in \Theta_{K_2}} P_n(l_\theta - l^\star) = \sup_{\theta \in \Theta_{K_2}} \{(P_n - P^\star)(l_\theta - l^\star) + P^\star(l_\theta - l^\star)\} \le \sup_{\theta \in \Theta_{K_2}} (P_n - P^\star)(l_\theta - l^\star).$$
For (A.2), let us define for all $\theta \in \Theta_{K_2}$ such that $H(\theta) > 0$ (i.e., $P^\star \ne P_\theta$) the scaled log-densities ratio
$$g_\theta = \frac{l_\theta - l^\star}{H(\theta)^{1/2}}$$
and $g_\theta = 0$ otherwise. Now, for any $\theta \in \Theta_{K_2}$, $H(\theta)$ nonnegative yields
$$P_n(l_\theta - l^\star) + H(\theta) = (P_n - P^\star)(l_\theta - l^\star) \le H(\theta)^{1/2} \sup_{\theta \in \Theta_{K_2}} (P_n - P^\star) g_\theta. \tag{A.3}$$


Let us set some $\theta_0 \in \Theta_{K_2}$ such that both $\sup_{\theta \in \Theta_{K_2}} P_n(l_\theta - l^\star) \le P_n(l_{\theta_0} - l^\star) + \varepsilon$ and $P_n(l_{\theta_0} - l^\star) \ge 0$. Then (A.3) implies, for $\theta = \theta_0$,
$$\sup_{\theta \in \Theta_{K_2}} P_n(l_\theta - l^\star) \le H(\theta_0)^{1/2} \sup_{\theta \in \Theta_{K_2}} (P_n - P^\star) g_\theta + \varepsilon.$$
Furthermore, $P_n(l_{\theta_0} - l^\star) \ge 0$ combined with (A.3) implies in turn
$$H(\theta_0) \le H(\theta_0)^{1/2} \sup_{\theta \in \Theta_{K_2}} (P_n - P^\star) g_\theta,$$
hence,
$$\sup_{\theta \in \Theta_{K_2}} P_n l_\theta - \sup_{\theta \in \Theta_{K_1}} P_n l_\theta \le \sup_{\theta \in \Theta_{K_2}} P_n(l_\theta - l^\star) \le \left(\sup_{\theta \in \Theta_{K_2}} (P_n - P^\star) g_\theta\right)^2 + \varepsilon,$$
which completes the proof, since $\varepsilon > 0$ is arbitrary. □

APPENDIX B: PROOFS OF CONSISTENCY

B.1. No underestimation eventually. A strong law of large numbers for the supremum of the likelihood ratios is stated. Its routine proof relies on the achievement of $H(P^\star \mid \mathcal{M}_K)$, the standard strong law of large numbers and the Borel–Lebesgue property.

LEMMA B.1. $P^\star$-a.s., for any $K \ge 1$,
$$\sup_{\theta \in \Theta_K} n^{-1}\big(l_n(\theta) - l_n(\theta^\star)\big) \underset{n\to\infty}{\longrightarrow} -H(P^\star \mid \mathcal{M}_K).$$

Now the result of no underestimation can be stated and proved. It is seen in Section 5 that Proposition B.1 fully applies to the LM and AC examples. It is also shown that the VR example satisfies the assumption in the case of $\widehat{K}^G_n$.

PROPOSITION B.1. Let us assume that $P^\star \in \mathcal{M}_{K^\star} \setminus \mathcal{M}_{K^\star-1}$.
• If $P^\star \notin \mathcal{M}_K$ implies $H(P^\star \mid \mathcal{M}_{K+1}) < H(P^\star \mid \mathcal{M}_K)$ (any $K \ge 1$), then $P^\star$-a.s., $\widehat{K}^L_n \ge K^\star$ eventually.
• If $K^\star \le K_{\max}$, then $P^\star$-a.s., $\widehat{K}^G_n \ge K^\star$ eventually.

LEMMA B.2. Let us assume that $\varphi(u - l) \in L_1(P^\star)$ and that, for every $K > K^\star$, $\mathcal{G} = \mathcal{G}^a_K$ (resp. $\mathcal{G} = \mathcal{G}^b_K$) is $P^\star$-Donsker. Then there exists a positive constant $C_K$ such that, $P^\star$-a.s.,
$$\limsup_{n\to\infty} \frac{n^{1/2} \sup_{g \in \mathcal{G}} |(P_n - P^\star)g|}{(\log\log n)^{1/2}} \le C_K.$$
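Lemma B.1 can be checked numerically in a deliberately misspecified toy model (our own construction, not from the paper): take $P^\star = N(2, 1)$ and the one-dimensional model $\{N(m, 1) : m \in [-1, 1]\}$, for which $H(P^\star \mid \mathcal{M}) = \min_m (2 - m)^2/2 = 1/2$. The normalized supremum of log-likelihood ratios should then approach $-1/2$:

```python
import random

random.seed(3)
n = 200000
z = [2.0 + random.gauss(0.0, 1.0) for _ in range(n)]   # Z_i ~ P* = N(2, 1)
zbar = sum(z) / n

def avg_llr(m):
    """n^{-1} (l_n(m) - l_n(theta*)) for N(m,1) densities, via the sufficient
    statistic: (1/2n) sum[(z-2)^2 - (z-m)^2] = (m - 2)(2 zbar - m - 2)/2."""
    return (m - 2.0) * (2.0 * zbar - m - 2.0) / 2.0

grid = [-1.0 + 0.01 * i for i in range(201)]   # grid over the compact set [-1, 1]
stat = max(avg_llr(m) for m in grid)           # maximized at the boundary m = 1
# Lemma B.1 predicts stat -> -H(P*|M) = -0.5, P*-a.s.
```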


PROOF OF PROPOSITION B.2. Set $K = K^\star + 1$.
$$P^\star\{\widehat{K}^L_n > K^\star \text{ i.o.}\} \le P^\star\Big\{\sup_{\theta \in \Theta_K} P_n l_\theta - \sup_{\theta \in \Theta_{K^\star}} P_n l_\theta \ge n^{-1}\{\mathrm{pen}(n, K) - \mathrm{pen}(n, K^\star)\} \text{ i.o.}\Big\}$$
$$\le P^\star\left\{\left[\frac{(n \log\log n)^{1/2}}{\mathrm{pen}(n, K^\star)}\right]\left[\frac{n^{1/2} \sup_{g \in \mathcal{G}^a_K} |(P_n - P^\star)g|}{(\log\log n)^{1/2}}\right] \ge \frac{\mathrm{pen}(n, K)}{\mathrm{pen}(n, K^\star)} - 1 \text{ i.o.}\right\},$$
where the last inequality is straightforward [it is (A.1)]. Consequently, whenever $\varphi(u - l) \in L_1(P^\star)$ and $\mathcal{G}^a_K$ is $P^\star$-Donsker, Lemma B.2 applies and implies that, if pen satisfies the condition of Theorem 3, then $P^\star\{\widehat{K}^L_n > K^\star \text{ i.o.}\} = 0$.

Now, renormalization yields an alternative bound for the second probability in the display above [by using (A.2) of the peeling technique Proposition A.1], namely
$$P^\star\{\widehat{K}^L_n > K^\star \text{ i.o.}\} \le P^\star\left\{\left[\frac{\log\log n}{\mathrm{pen}(n, K^\star)}\right]\left[\frac{n^{1/2} \sup_{g \in \mathcal{G}^b_K} |(P_n - P^\star)g|}{(\log\log n)^{1/2}}\right]^2 \ge \frac{\mathrm{pen}(n, K)}{\mathrm{pen}(n, K^\star)} - 1 \text{ i.o.}\right\}.$$
Therefore, if $\varphi(u - l) \in L_1(P^\star)$ and $\mathcal{G}^b_K$ is $P^\star$-Donsker, Lemma B.2 applies and implies that, as soon as pen satisfies the condition of Theorem 4, $P^\star\{\widehat{K}^L_n > K^\star \text{ i.o.}\} = 0$. This concludes the study of $\widehat{K}^L_n$.

Furthermore, if $K^\star \le K_{\max}$, then the union bound guarantees that it suffices to prove that $P^\star\{\widehat{K}^G_n = K \text{ i.o.}\} = 0$ for $K = K^\star+1, \ldots, K_{\max}$ in order to conclude the study of $\widehat{K}^G_n$. Minor changes in the previous lines yield the result. □

APPENDIX C: PROOFS OF EFFICIENCY: UNDERESTIMATION

C.1. Proof of Theorem 7. Theorem 7 is first proven under assumption (ii). The modification of the proof under assumption (ii)′ is sketched at the end of this subsection. Let us begin with some useful lemmas.

LEMMA C.1. Under the assumptions of Theorem 7, the sets $\Lambda_{\alpha,K}$ and $\Gamma_{\alpha,K}$ are measurable and closed in $\mathcal{Q}$ for every $\alpha > 0$ and $K < K^\star$.

PROOF. Let us set $\alpha > 0$ and $K < K^\star$, and let $Q_0$ belong to the complement of $\Lambda_{\alpha,K}$: there exist $\alpha' > 0$ and $\varepsilon > 0$ such that $\alpha' - 6\varepsilon > \alpha$ and $\sup_{\theta \in \Theta_K} Q_0 l_\theta - \sup_{\theta \in \Theta_{K+1}} Q_0 l_\theta < -\alpha'$. Let us denote by $T_K$ (resp. $T_{K+1}$) the finite sieve subset of $\Theta_K$ (resp. $\Theta_{K+1}$) for $Q = Q_0$, $\varepsilon$ and $K$ (resp. $K + 1$) in assumption (ii).
Let us then define the open neighborhood $V$ of $Q_0$ by
$$V = \bigcap_{t \in T_K \cup T_{K+1}} \{Q \in \mathcal{Q} : |Q l_t - Q_0 l_t| < \varepsilon\}.$$


Because $P^\star\{\widehat{K}^L_n < K^\star\}$


REMARK C.1. A simple modification of the proof below implies that, under the assumptions of Theorem 9 and for $K = K^\star - 1$,
$$H\big(\Lambda_{0,K} \cap M_1(\mathcal{Z}) \mid P^\star\big) = H(\mathcal{M}_K \mid P^\star).$$
The proof cannot be adapted anymore when $K < K^\star - 1$.


by virtue of (A.1). Also, the peeling device inequality (A.2) of the same proposition implies that the expression given by (D.1) can be bounded by
$$P^\star\left\{\left(\sup_{g \in \mathcal{G}^b_K} |(P_n - P^\star)g|\right)^2 \ge n^{-1} v_n [D(K^\star+1) - D(K^\star)]\right\}. \tag{D.3}$$
In the rest of this paper, we shall focus on (16) in Theorem 10 [on the basis of the overestimation probability upper bound (D.2)]. The proof of (18) in Theorem 11 [on the basis of the overestimation probability upper bound (D.3)] is similar and is omitted.

Let us define $B_\infty = \{b \in l^\infty(\mathcal{G}^a_K) : \|b\|_{\mathcal{G}^a_K} \ge D(K^\star+1) - D(K^\star)\}$. It is closed for the uniform topology on $l^\infty(\mathcal{G}^a_K)$. Since the assumptions of Theorem 2 are satisfied,
$$\limsup_{n\to\infty} n v_n^{-2} \log P^\star\{\widehat{K}^L_n > K^\star\} \le \limsup_{n\to\infty} n v_n^{-2} \log P^\star\{(n v_n^{-1})(P_n - P^\star)^\infty \in B_\infty\} \le -\inf\{J(b) : b \in B_\infty\}.$$
Let us prove that the right-hand side term above is negative.

Suppose indeed, on the contrary, that the infimum is zero: this implies $0 \in B_\infty$, which is obviously not true. If the infimum were zero, then there would exist a sequence $\{b_p\}$ of elements of $l^\infty(\mathcal{G}^a_K)$ such that $b_p \in B_\infty$ and $J(b_p) \le 1/p$. Consequently, there would exist a sequence $\{Q_p\}$ of elements of $M_0(\mathcal{Z})$ such that, for every $p \ge 1$, $Q_p \ll P^\star$ (with derivative $dQ_p/dP^\star$ denoted by $q_p$) and both $P^\star q_p^2/2 \le J(b_p) + 1/p \le 2/p$ and $Q_p^\infty = b_p$. Thus, for any $g \in \mathcal{G}^a_K$,
$$(b_p g)^2 = (P^\star q_p g)^2 \le (P^\star q_p^2)(P^\star g^2) \le (4/p) \sup_{g \in \mathcal{G}^a_K} P^\star g^2$$
by virtue of the Cauchy–Schwarz inequality. Now, $\mathcal{G}^a_K$ is $P^\star$-Donsker, hence it is totally bounded in $L_2(P^\star)$, and the above display implies that $\|b_p\|_{\mathcal{G}^a_K} = o(1)$. Consequently, $0 \in B_\infty$ as a limit of a sequence of elements of the closed set $B_\infty$. This completes the proof of (16) of Theorem 10.

The proof of (17) in Theorem 10 [which parallels the proof of (19) in Theorem 11] is very similar.
Once again, the union bound and Lemma 1.2.15 of [17] imply that
$$\limsup_{n\to\infty} n v_n^{-2} \log P^\star\{\widehat{K}^G_n > K^\star\} = \sup_K \limsup_{n\to\infty} n v_n^{-2} \log P^\star\{\widehat{K}^G_n = K\} \le \sup_K \limsup_{n\to\infty} n v_n^{-2} \log P^\star\left\{\sup_{\theta \in \Theta_K} P_n l_\theta - \sup_{\theta \in \Theta_{K^\star}} P_n l_\theta \ge n^{-1} v_n [D(K) - D(K^\star)]\right\}$$
($\sup_K$ stands for $\sup_{K^\star < K \le K_{\max}}$).

LEMMA E.1. Then, for every $\varepsilon > 0$ and $Q \in \mathcal{Q}$, there exists a compact subset $C$ of $\mathcal{Z}$ such that $Q(u - l)\mathbf{1}\{C^c\} < \varepsilon$.

PROOF. For $M > 0$,
$$Q(u - l)\mathbf{1}\{(u - l) > M\} \le \frac{M}{\psi(M)}\,Q\psi(u - l) \longrightarrow 0 \quad\text{as } M \to \infty.\ \square$$

Of course, the assumptions of Lemma E.1 are satisfied in the LM example [here, $\psi(x) = x^{1+c}$ (any $x \ge 0$)]. Thus, let us set $K \ge 1$, $Q \in \mathcal{Q}$ and $\varepsilon > 0$. There exists a compact set $C$ of $\mathcal{Z}$ such that, for every $\theta, t \in \Theta_K$,
$$|Q(l_\theta - l_t)| \le Q|l_\theta - l_t|\mathbf{1}\{C\} + Q(u - l)\mathbf{1}\{C^c\} \le Q|l_\theta - l_t|\mathbf{1}\{C\} + \varepsilon. \tag{E.1}$$
Now, Ascoli's theorem ensures that $\{l_\theta \mathbf{1}\{C\} : \theta \in \Theta_K\}$ is precompact in the set of the continuous functions on $C$ equipped with the uniform norm. Consequently, there exists a finite subset $T$ of $\Theta_K$ such that, for every $\theta \in \Theta_K$, there exists $t \in T$ such that $\sup_{z \in C} |l_\theta(z) - l_t(z)| \le \varepsilon$. Straightforwardly, for any $\theta \in \Theta_K$, there exists $t \in T$ such that the left-hand side term of (E.1) is bounded by $2\varepsilon$. This completes the proof. □

E.2. Proof of Lemma 7. Let us suppose, on the contrary, that
$$H(P^\star \mid \mathcal{M}_K) \le H(P^\star \mid \mathcal{M}_{K+1}), \tag{E.2}$$
that is, that equality holds. Lower semicontinuity of $H(P^\star \mid \cdot)$ and compactness of $\Theta_K$ ensure the existence of $P_0 = P_{\theta_0} \in \mathcal{M}_K$ such that $H(P^\star \mid P_0) = H(P^\star \mid \mathcal{M}_K)$. Let us denote $f_0(x) = f_{\theta_0}(x) = \sum_{k=1}^K m_k \mathbf{1}\{x \in \tau_k\}$ (all $x \in \mathcal{X}$). Now, equality (20) and $\|f^\star - f_0\|_2^2 = \sum_{k=1}^K P(f^\star - m_k)^2\mathbf{1}\{\tau_k\}$ imply that $m_k = Pf^\star\mathbf{1}\{\tau_k\}/P(\tau_k)$ for $k = 1, \ldots, K$. Let us prove that $f^\star = f_0$, hence $P^\star \in \mathcal{M}_K$.

Indeed, (E.2) ensures that, for any $1 \le k_0 \le K$, for any subset $S$ of $\tau_{k_0}$ with positive $P$-measure,
$$Pf^{\star 2} - \sum_{k=1}^K \frac{(Pf^\star\mathbf{1}\{\tau_k\})^2}{P(\tau_k)} \le Pf^{\star 2} - \sum_{1 \le k \ne k_0 \le K} \frac{(Pf^\star\mathbf{1}\{\tau_k\})^2}{P(\tau_k)} - \left(\frac{(Pf^\star\mathbf{1}\{S\})^2}{P(S)} + \frac{(Pf^\star\mathbf{1}\{\tau_{k_0} \setminus S\})^2}{P(\tau_{k_0} \setminus S)}\right)$$


or, equivalently,
$$\frac{(Pf^\star\mathbf{1}\{S\})^2}{P(S)} + \frac{(Pf^\star\mathbf{1}\{\tau_{k_0} \setminus S\})^2}{P(\tau_{k_0} \setminus S)} \le \frac{(Pf^\star\mathbf{1}\{\tau_{k_0}\})^2}{P(\tau_{k_0})}.$$
Thus, first expansion of the right-hand side term and then factorization yield
$$\frac{P(\tau_{k_0} \setminus S)\,(Pf^\star\mathbf{1}\{S\})^2}{P(S)} + \frac{P(S)\,(Pf^\star\mathbf{1}\{\tau_{k_0} \setminus S\})^2}{P(\tau_{k_0} \setminus S)} \le 2\,(Pf^\star\mathbf{1}\{S\})\big(Pf^\star\mathbf{1}\{\tau_{k_0} \setminus S\}\big). \tag{E.3}$$
Now, the basic inequality $2ab \le (au)^2 + (bu^{-1})^2$ (all $a, b \in \mathbb{R}$ and positive $u$) together with (E.3) ensure [take $u^2 = P(\tau_{k_0} \setminus S)/P(S)$] that equality holds in (E.3). Consequently, for any subset $S$ of $\tau_{k_0}$ with positive $P$-measure,
$$\frac{Pf^\star\mathbf{1}\{S\}}{P(S)} = \frac{Pf^\star\mathbf{1}\{\tau_{k_0} \setminus S\}}{P(\tau_{k_0} \setminus S)} = \frac{Pf^\star\mathbf{1}\{\tau_{k_0}\} - Pf^\star\mathbf{1}\{S\}}{P(\tau_{k_0} \setminus S)},$$
hence, for any subset $S$ of $\tau_{k_0}$,
$$Pf^\star\mathbf{1}\{S\} = \frac{P(S)}{P(\tau_{k_0})}\,Pf^\star\mathbf{1}\{\tau_{k_0}\}.$$
The choice $S = S_+ = \{x \in \tau_{k_0} : f^\star(x) > Pf^\star\mathbf{1}\{\tau_{k_0}\}/P(\tau_{k_0})\}$ yields $P(S_+) = 0$. The choice $S = S_- = \{x \in \tau_{k_0} : f^\star(x) < Pf^\star\mathbf{1}\{\tau_{k_0}\}/P(\tau_{k_0})\}$ yields, in turn, $P(S_-) = 0$, hence finally $P(S_0) = P(\tau_{k_0})$, where $S_0 = \{x \in \tau_{k_0} : f^\star(x) = Pf^\star\mathbf{1}\{\tau_{k_0}\}/P(\tau_{k_0})\}$ (i.e., $f^\star$ is $P$-a.s. constant on $\tau_{k_0}$). This concludes the proof because $k_0$ is arbitrary. □

E.3. Proof of Lemma 8. Let us set $K \ge 1$, $Q \in \mathcal{Q} \cap L'_\tau(P^\star)$ (with decomposition $Q = Q^a + Q^s$ according to Lemma 1) and $\varepsilon > 0$. Because $Q^a \ll P^\star$, there exists $\delta > 0$ such that, for any measurable $F$, $P^\star(F) \le \delta$ yields $Q^a(F) \le \varepsilon$. Now, it was emphasized in Section 5.2 that $(u - l)^{1+c} \in L_\tau(P^\star)$ for some $c > 0$, hence Lemma E.1 applies with $\psi(x) = x^{1+c}$ (all $x \ge 0$).
So, there exists a compact set $C$ of $\mathcal{Z}$ such that, for every $\theta, t \in \Theta_K$,
$$|Q(l_\theta - l_t)| \le Q|l_\theta - l_t|\mathbf{1}\{C\} + Q(u - l)\mathbf{1}\{C^c\} \le Q|l_\theta - l_t|\mathbf{1}\{C\} + \varepsilon = Q^a|l_\theta - l_t|\mathbf{1}\{C\} + \varepsilon \le M\,Q^a|f_\theta - f_t| + \varepsilon, \tag{E.4}$$
where the equality holds because $(l_\theta - l_t)\mathbf{1}\{C\}$ is bounded and $M$ is a constant which depends only on $l$, $u$ (via $C$) and the compact set $M$.

Furthermore, the Borel–Lebesgue property of compact sets guarantees that there exists a finite subset $T$ of $\Theta_K$ such that the union over $t \in T$ of the balls of center $t$ and radius $\delta$ covers $\Theta_K$. Let us set $t \in T$ [$t = (\tau^0, m^0)$] and $\theta \in \Theta_K$ [$\theta = (\tau^1, m^1)$] with $d(t, \theta) \le \delta$. It can be assumed without loss of generality that $P(\tau^0_j \,\triangle\, \tau^1_j) \le \delta$ and $|m^0_j - m^1_j| \le \delta$ for all $j = 1, \ldots, K$. Consequently, with notation $M' = \sup\{|m| : m \in M\}$, for any $x \in \mathcal{X}$,
$$|f_\theta - f_t|(x) \le \sum_{j=1}^K |m^0_j - m^1_j|\mathbf{1}\{x \in \tau^0_j \cap \tau^1_j\} + M'(K - 1)\sum_{j=1}^K \mathbf{1}\{x \in \tau^0_j \,\triangle\, \tau^1_j\} \le K\delta + M'(K - 1)\sum_{j=1}^K \mathbf{1}\{x \in \tau^0_j \,\triangle\, \tau^1_j\},$$
hence,
$$Q^a|f_\theta - f_t| \le K\delta + M'(K - 1)\sum_{j=1}^K Q^a(\tau^0_j \,\triangle\, \tau^1_j \times \mathbb{R}).$$
Besides, $P^\star(\tau^0_j \,\triangle\, \tau^1_j \times \mathbb{R}) = P(\tau^0_j \,\triangle\, \tau^1_j) \le \delta$ finally yields $Q^a(\tau^0_j \,\triangle\, \tau^1_j \times \mathbb{R}) \le \varepsilon$. By invoking (E.4), $|Q(l_\theta - l_t)| \le M''\varepsilon$, for a constant $M''$ depending only on $K$, $l$, $u$ (via $C$) and $M$. This completes the proof. □

Acknowledgments. This work was done while I was affiliated with the University Paris-Sud and France Télécom Recherche & Développement. I wish to express my gratitude to my Ph.D. advisor Elisabeth Gassiat. I would also like to thank Stéphane Boucheron, Raphaël Cerf, Christian Léonard, Pascal Massart and Jamal Najim for helpful discussions. I am especially grateful to one of the referees for his suggestions and careful reading.

REFERENCES

[1] AKAIKE, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control 19 716–723. MR0423716
[2] AZENCOTT, R. and DACUNHA-CASTELLE, D. (1986). Series of Irregular Observations. Springer, New York. MR0848355
[3] BAHADUR, R. R. (1967).
An optimal property of the likelihood ratio statistic. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1 13–26. Univ. California Press, Berkeley. MR0216637
[4] BAHADUR, R. R. (1971). Some Limit Theorems in Statistics. SIAM, Philadelphia. MR0315820
[5] BAHADUR, R. R., ZABELL, S. L. and GUPTA, J. C. (1980). Large deviations, tests, and estimates. In Asymptotic Theory of Statistical Tests and Estimation (I. M. Chakravarti, ed.) 33–64. Academic Press, New York. MR0571334
[6] BARRON, A., BIRGÉ, L. and MASSART, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113 301–413. MR1679028
[7] BOUCHERON, S. and GASSIAT, E. (2005). An information-theoretic perspective on order estimation. In Inference in Hidden Markov Models (O. Cappé, E. Moulines and T. Rydén, eds.) 565–601. Springer, New York. MR2159833


[8] BOUCHERON, S. and GASSIAT, E. (2006). Error exponents for AR order testing. IEEE Trans. Inform. Theory 52 472–488.
[9] ČENCOV, N. N. (1982). Statistical Decision Rules and Optimal Inference. Amer. Math. Soc., Providence, RI. MR0645898
[10] CHERNOFF, H. (1956). Large sample theory: Parametric case. Ann. Math. Statist. 27 1–22. MR0076245
[11] CSISZÁR, I. (1975). I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 3 146–158. MR0365798
[12] CSISZÁR, I. (2002). Large-scale typicality of Markov sample paths and consistency of MDL order estimators. IEEE Trans. Inform. Theory 48 1616–1628. MR1909476
[13] CSISZÁR, I. and KÖRNER, J. (1981). Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, New York. MR0666545
[14] CSISZÁR, I. and SHIELDS, P. C. (2000). The consistency of the BIC Markov order estimator. Ann. Statist. 28 1601–1619. MR1835033
[15] DACUNHA-CASTELLE, D. and GASSIAT, E. (1997). The estimation of the order of a mixture model. Bernoulli 3 279–299. MR1468306
[16] DACUNHA-CASTELLE, D. and GASSIAT, E. (1999). Testing the order of a model using locally conic parametrization: Population mixtures and stationary ARMA processes. Ann. Statist. 27 1178–1209. MR1740115
[17] DEMBO, A. and ZEITOUNI, O. (1998). Large Deviations Techniques and Applications, 2nd ed. Springer, New York. MR1619036
[18] DUDLEY, R. M. and PHILIPP, W. (1983). Invariance principles for sums of Banach space valued random elements and empirical processes. Z. Wahrsch. Verw. Gebiete 62 509–552. MR0690575
[19] DUPUIS, P. and ELLIS, R. S. (1997). A Weak Convergence Approach to the Theory of Large Deviations. Wiley, New York. MR1431744
[20] FINESSO, L., LIU, C.-C. and NARAYAN, P. (1996). The optimal error exponent for Markov order estimation. IEEE Trans. Inform. Theory 42 1488–1497.
MR1426225
[21] GASSIAT, E. (2002). Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist. 38 897–906. MR1955343
[22] GASSIAT, E. and BOUCHERON, S. (2003). Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory 49 964–980. MR1984482
[23] GUYON, X. and YAO, J. (1999). On the underfitting and overfitting sets of models chosen by order selection criteria. J. Multivariate Anal. 70 221–249. MR1711522
[24] HANNAN, E. J., MCDOUGALL, A. J. and POSKITT, D. S. (1989). Recursive estimation of autoregressions. J. Roy. Statist. Soc. Ser. B 51 217–233. MR1007454
[25] HAUGHTON, D. (1989). Size of the error in the choice of a model to fit data from an exponential family. Sankhyā Ser. A 51 45–58. MR1065558
[26] HEMERLY, E. M. and DAVIS, M. H. A. (1991). Recursive order estimation of autoregressions without bounding the model set. J. Roy. Statist. Soc. Ser. B 53 201–210. MR1094280
[27] HENNA, J. (1985). On estimating of the number of constituents of a finite mixture of continuous distributions. Ann. Inst. Statist. Math. 37 235–240. MR0799237
[28] HUBER, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1 221–233. Univ. California Press, Berkeley. MR0216620
[29] JAMES, L. F., PRIEBE, C. E. and MARCHETTE, D. J. (2001). Consistent estimation of mixture complexity. Ann. Statist. 29 1281–1296. MR1873331
[30] KERIBIN, C. (2000). Consistent estimation of the order of mixture models. Sankhyā Ser. A 62 49–66. MR1769735
[31] KERIBIN, C. and HAUGHTON, D. (2003). Asymptotic probabilities of overestimating and underestimating the order of a model in general regular families. Comm.
Statist. TheoryMethods 32 1373–1404. MR1985856[32] LÉONARD, C.andNAJIM, J. (2002). An extension of Sanov’s theorem. Application to theGibbs conditioning principle. Bernoulli 8 721–743. MR1963659[33] LEONARDI, G.P.andTAMANINI, I. (2002). Metric spaces of partitions, and Caccioppolipartitions. Adv. Math. Sci. Appl. 12 725–753. MR1943988[34] LEROUX, B. G. (1992). Consistent estimation of a mixing distribution. Ann. Statist. 201350–1360. MR1186253[35] MALLOWS, C. L. (1973). Some comments on C P . Technometrics 15 661–675.[36] MASSART, P. (2000). Some applications of concentration inequalities to statistics. Probabilitytheory. Ann. Fac. Sci. Toulouse Math. (6) 9 245–303. MR1813803[37] POLLARD, D. (1985). New ways to prove central limit theorems. Econometric Theory 1295–314.[38] RISSANEN, J. (1978). Mo<strong>de</strong>lling by shortest data <strong>de</strong>scription. Automatica 14 465–471.[39] ROCKAFELLAR, R. T. (1970). Convex Analysis. Princeton Univ. Press. MR0274683[40] SCHIED, A. (1998). Cramer’s condition and Sanov’s theorem. Statist. Probab. Lett. 39 55–60.MR1649347[41] SCHWARZ, G. (1978). Estimating the dimension of a mo<strong>de</strong>l. Ann. Statist. 6 461–464.MR0468014[42] TITTERINGTON, D.M.,SMITH, A.F.M.andMAKOV, U. E. (1985). Statistical Analysis ofFinite Mixture Distributions. Wiley, Chichester. MR0838090[43] VAN DER VAART, A. W. (1998). Asymptotic Statistics. Cambridge Univ. Press. MR1652247[44] WU, L. (1994). Large <strong>de</strong>viations, mo<strong>de</strong>rate <strong>de</strong>viations and LIL for empirical processes.Ann. Probab. 22 17–27. MR1258864MAP5 CNRS UMR 8145UNIVERSITÉ RENÉ DESCARTES45 RUE DES SAINTS-PÈRES75270 PARIS CEDEX 06FRANCEE-MAIL: chambaz@univ-paris5.fr


The Annals of Statistics
2008, Vol. 36, No. 2, 938–962
DOI: 10.1214/009053607000000857
© Institute of Mathematical Statistics, 2008

BOUNDS FOR BAYESIAN ORDER IDENTIFICATION WITH APPLICATION TO MIXTURES

BY ANTOINE CHAMBAZ AND JUDITH ROUSSEAU

Université Paris Descartes and Université Dauphine

The efficiency of two Bayesian order estimators is studied. By using nonparametric techniques, we prove new underestimation and overestimation bounds. The results apply to various models, including mixture models. In this case, the errors are shown to be O(e^{−an}) and O((log n)^b/√n) (a, b > 0), respectively.

1. Introduction. Order identification deals with the estimation and test of a structural parameter which indexes the complexity of a model. In other words, the most economical representation of a random phenomenon is sought. This problem is encountered in many situations, including: mixture models [13, 19] with an unknown number of components; cluster analysis [9], when the number of clusters is unknown; autoregressive models [1], when the process memory is not known.

This paper is devoted to the study of two Bayesian estimators of the order of a model. Frequentist properties of efficiency are particularly investigated. We obtain new efficiency bounds under mild assumptions, providing a theoretical answer to the questions raised, for instance, in [7] (see their Section 4).

1.1. Description of the problem. We observe n i.i.d. random variables (r.v.) (Z_1, ..., Z_n) = Z^n with values in a measured sample space (Z, F, μ). Let (Θ_k)_{k≥1} be an increasing family of nested parametric sets and d the Euclidean distance on each. The dimension of Θ_k is denoted by D(k).
Received November 2005; revised May 2007.
AMS 2000 subject classifications. 62F05, 62F12, 62G05, 62G10.
Key words and phrases. Mixture, model selection, nonparametric Bayesian inference, order estimation, rate of convergence.

Let Θ_∞ = ⋃_{k≥1} Θ_k and, for every θ ∈ Θ_∞, let f_θ be the density of the probability measure P_θ with respect to the measure μ.

The order of any distribution P_{θ0} is the unique integer k such that P_{θ0} ∈ {P_θ : θ ∈ Θ_k \ Θ_{k−1}} (with convention Θ_0 = ∅). It is assumed that the distribution P* of Z_1 belongs to {P_θ : θ ∈ Θ_∞}. The density of P* is denoted by f* = f_{θ*} (θ* ∈ Θ_{k*} \ Θ_{k*−1}). The order of P* is denoted by k*, and is the quantity of interest here.

We are interested in frequentist properties of two Bayesian estimates of k*. In that perspective, the problem can be restated as an issue of composite hypotheses testing (see [4]), where the quantities of interest are P*{k̃_n < k*} and P*{k̃_n > k*}, the under- and over-estimation errors, respectively. In this paper we determine upper bounds on both errors for estimators k̃_n defined as follows.

Let Π be a prior on Θ_∞ that writes as dΠ(θ) = π(k)π_k(θ)dθ, for all θ ∈ Θ_k and k ≥ 1. We denote by Π(k|Z^n) the posterior probability of each k ≥ 1. In a Bayesian decision theoretic perspective, the Bayes estimator associated with the 0–1 loss function is the mode of the posterior distribution of the order k:

k̂_n^G = arg max_{k≥1} {Π(k|Z^n)}.

It is a global estimator. Following a more local and sequential approach, we propose another estimator:

k̂_n^L = inf{k ≥ 1 : Π(k|Z^n) ≥ Π(k+1|Z^n)} ≤ k̂_n^G.

If the posterior distribution on k is unimodal, then obviously both estimators are equal.
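As a small illustration of the two definitions above, here is a minimal sketch; the posterior probabilities Π(k|Z^n) are taken as given, and the truncation of the model index at a finite K is an assumption made for the illustration only (the paper works with k ≥ 1 unbounded):

```python
# Sketch of the two Bayesian order estimators, assuming the posterior
# probabilities pi(k | Z^n), k = 1, ..., K, are available as a list
# `posterior` (0-based index i corresponds to order k = i + 1).

def k_global(posterior):
    """Global estimator: the mode of the posterior distribution of k."""
    return max(range(len(posterior)), key=lambda i: posterior[i]) + 1

def k_local(posterior):
    """Local estimator: the first k with pi(k | Z^n) >= pi(k+1 | Z^n)."""
    for i in range(len(posterior) - 1):
        if posterior[i] >= posterior[i + 1]:
            return i + 1
    return len(posterior)  # no drop observed within the truncation

# A unimodal posterior: both estimators coincide.
unimodal = [0.05, 0.20, 0.50, 0.15, 0.10]
assert k_global(unimodal) == k_local(unimodal) == 3

# A bimodal posterior: the local estimator stops at the first local
# mode, illustrating k_local <= k_global.
bimodal = [0.10, 0.30, 0.05, 0.45, 0.10]
assert k_local(bimodal) == 2 and k_global(bimodal) == 4
```

The bimodal case shows why the local estimator is cheaper: it scans the posterior only until the first decrease, without computing (or normalizing over) the whole distribution of k.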
The advantage of k̂_n^L over k̂_n^G is that k̂_n^L does not require the computation of the whole posterior distribution on k. It can also be slightly modified into the smallest integer k such that the Bayes factor comparing k+1 to k is less than one. When considering a model comparison point of view, Bayes factors are often used to compare two models; see [11]. In the following, we shall focus on k̂_n^G and k̂_n^L, since the sequential Bayes factor estimator shares the same properties as k̂_n^L.

1.2. Results in perspective. In this paper we prove that the underestimation errors are O(e^{−an}) (some a > 0); see Theorem 1. We also show that the overestimation errors are O((log n)^b/n^c) (some b ≥ 0, c > 0); see Theorems 2 and 3. All constants can be expressed explicitly, even though they are quite complicated. We apply these results in a regression model and in a change points problem. Finally, we show that our results apply to the important class of mixture models. Mixture models have interesting nonregularity properties and, in particular, even though the mixing distribution is identifiable, testing on the order of the model has proved to be difficult; see, for instance, [6]. There, we obtain an underestimation error of order O(e^{−an}) and an overestimation error of order O((log n)^b/√n) (b > 0); see Theorem 4.

Efficiency issues in the order estimation problem have been studied mainly in the frequentist literature; see [4] for a review on these results. There is an extensive literature on Bayesian estimation of mixture models and, in particular, on the order selection in mixture models.
However, this literature is essentially devoted to determining coherent noninformative priors (see, e.g., [15]) and to implementation (see, e.g., [14]). To the best of our knowledge, there is hardly any work on frequentist properties of Bayesian estimators such as k̂_n^G and k̂_n^L outside the regular case. In the case of mixture models, Ishwaran, James and Sun [10] suggest a Bayesian estimator of the mixing distribution when the number of components is unknown and bounded, and study the asymptotic properties of the mixing distribution. It is to be noted that deriving rates of convergence for the order of the model from those of the mixing distribution would be suboptimal, since the mixing distribution converges at a rate at most equal to n^{−1/4}, to be compared to our O((log n)^b/√n) (b > 0) in Theorem 4.

1.3. Organization of the paper. In Section 2 we state our main results. General bounds are presented in Sections 2.1 (underestimation) and 2.2 (overestimation). The regression and change points examples are treated in Section 2.3. We deal with mixture models in Section 2.4. The main proofs are gathered in Section 3 (underestimation), Section 4 (overestimation) and Section 5 (examples). Section C in the Appendix is devoted to an aspect of mixture models which might be of interest in its own right.

2. Efficiency bounds. Hereafter, the integral ∫ f dλ of a function f with respect to a measure λ is written as λf.

Let L^1_+(μ) be the subset of all nonnegative functions in L^1(μ). For every f ∈ L^1_+(μ) \ {0}, the measure P_f is defined by its derivative f with respect to μ. For every f, f′ ∈ L^1_+(μ), we set V(f, f′) = P_f(log f − log f′)² [with convention V(f, f′) = ∞ whenever necessary].

Let l* = log f*. For all θ, θ′ ∈ Θ_∞, we set l_θ = log f_θ and define H(θ, θ′) = P_θ(l_θ − l_{θ′}) when P_θ ≪ P_{θ′} (∞ otherwise), the Kullback–Leibler divergence between P_θ and P_{θ′}. We also set H(θ) = H(θ*, θ) (each θ ∈ Θ_∞).

Let us define, for every k ≥ 1, α, δ > 0 and t ∈ Θ_k, θ ∈ Θ_∞,
l_{t,δ} = inf{f_{θ′} : θ′ ∈ Θ_k, d(t, θ′) < δ} and u_{t,δ} = sup{f_{θ′} : θ′ ∈ Θ_k, d(t, θ′) < δ}.

A1. There exist α, δ_0 > 0 and M ≥ 1 such that, for all δ ∈ (0, δ_0], sup{q(θ, α) : θ ∈ S_k(δ)} ≤ M.

A2.
For every k ≥ 1 and θ ∈ Θ_k, there exists η_θ > 0 such that
V(u_{θ,η_θ}, f*) + V(f*, l_{θ,η_θ}) + V(f*, u_{θ,η_θ}) + V(u_{θ,η_θ}, f_θ) < ∞.

THEOREM 1. Assume that A1 and A2 hold and that π_k{S_k(δ)} > 0 for all δ > 0 and k = 1, ..., k*.
(i) There exist c′_1, c′_2 > 0 such that, for every n ≥ 1,
(1) P*^n{k̂_n^G < k*} ≤ c′_1 e^{−nc′_2}.
(ii) If, furthermore, H*_k > H*_{k+1} for k = 1, ..., k* − 1, then there exist c_1, c_2 > 0 such that, for every n ≥ 1,
(2) P*^n{k̂_n^L < k*} ≤ c_1 e^{−nc_2}.

For every δ > 0 and two functions l ≤ u, the bracket [l, u] is the set of all functions f with l ≤ f ≤ u. We say that [l, u] is a δ-bracket if l, u ∈ L^1_+(μ) and
μ(u − l) ≤ δ, P*(log u − log l)² ≤ δ²,
P_{u−l}(log u − log f*)² ≤ δ log²δ and P_l(log u − log l)² ≤ δ log²δ.
For C a class of functions, the δ-entropy with bracketing of C is the logarithm E(C, δ) of the minimum number of δ-brackets needed to cover C. A set of cardinality exp(E(C, δ)) of δ-brackets which covers C is written as H(C, δ).


For all θ ∈ Θ_∞, we introduce the following quantities: l_n(θ) = Σ_{i=1}^n l_θ(Z_i), l*_n = Σ_{i=1}^n l*(Z_i) and, for every k ≥ 1, B_n(k) = π(k) ∫_{Θ_k} e^{l_n(θ)−l*_n} dπ_k(θ). Let K > k* be an integer. We consider the following three assumptions:

O1(K). There exist C_2, D_1(k) > 0 (k = k*+1, ..., K) such that, for every sequence {δ_n} decreasing to 0, for all n ≥ 1, and all k ∈ {k*+1, ..., K},
π_k{θ ∈ Θ_k : H(θ) ≤ δ_n} ≤ C_2 δ_n^{D_1(k)/2}.

O2(K). There exists C_3 > 0 such that, for each k ∈ {k*+1, ..., K}, there exists a sequence {F_n^k}, F_n^k ⊂ Θ_k, such that, for all n ≥ 1,
π_k{(F_n^k)^c} ≤ C_3 n^{−D_1(k)/2}.

O3. There exist β_1, L, D_2(k*) > 0, and β_2 ≥ 0 such that, for all n ≥ 1,
(4) P*^n{B_n(k*) < (β_1 (log n)^{β_2} n^{D_2(k*)/2})^{−1}} ≤ L (log n)^{3D_1(k*+1)/2+β_2} / n^{[D_1(k*+1)−D_2(k*)]/2}.

When O3 holds, let n_0 be the smallest integer n such that
δ_0 = 4 max_{m≥n} {m^{−1} log[β_1 (log m)^{β_2} m^{D_2(k*)/2}]} ≤ e^{−2}/2.

When O1(K) and O3 hold with D_2(k*) < D_1(k*+1), Theorems 2 and 3 yield the overestimation bounds (7) and (8). Dimensions D_1(k) (k > k*) are introduced, which might be different from the usual dimensions D(k). They should be understood as effective dimensions of Θ_k relative to Θ_{k*}. In models of mixtures of g_γ densities (γ ∈ Γ ⊂ R^d), for instance, D_1(k*+1) = D(k*) + 1, while D(k*+1) = D(k*) + (d+1).
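For concreteness, the dimension bookkeeping in the mixture example can be checked mechanically. The formula D(k) = (k − 1) + kd for a k-component mixture with d-dimensional component parameter is an assumption consistent with the jump D(k*+1) = D(k*) + (d + 1) stated above (it counts the free weights plus the component parameters), and the numerical values are illustrative only:

```python
# Dimension bookkeeping for k-component mixtures of g_gamma densities
# with gamma in a subset of R^d. The nominal dimension counts (k - 1)
# free weights plus k * d component parameters.

def D(k, d):
    return (k - 1) + k * d

d, k_star = 2, 3  # illustrative values only

# Adding one component increases the nominal dimension by d + 1 ...
assert D(k_star + 1, d) - D(k_star, d) == d + 1

# ... whereas the effective dimension of the overfitted model is only
# D1(k*+1) = D(k*) + 1, as stated in the text: the gap between the two
# equals d, the dimension of the redundant component parameter.
D1 = D(k_star, d) + 1
assert D(k_star + 1, d) - D1 == d
```

This gap is exactly why overestimation rates driven by D_1 can be faster than what the nominal dimension D would suggest.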
It is to be noted that this assumption is crucial. In particular, in the different context of [16], it is proved that if such a condition is not satisfied, then some inconsistency occurs for the Bayes factor.

Finally, O3 is milder than the existence of a Laplace expansion of the marginal likelihood (which holds in "regular models" as described in [18]), since in such cases (see [18]), for c as large as need be, denoting by J_n the Jacobian matrix, there exist δ, C > 0 such that
B_n(k*) ≥ ∫_{|θ−θ̂|_1≤δ} e^{l_n(θ)−l_n(θ̂)} dπ_{k*}(θ) ≥ (2π/n)^{D(k*)/2} |J_n|^{−1/2} (1 + O_P(1/n)),
and P*^n{|J_n| + |O_P(1/n)| > C} ≤ n^{−c}, implying O3 with β_1 > 0, β_2 = 0 and D_2(k*) = D(k*). In some cases, however, the dimensional index D_2(k*) may differ from D(k*); see, for instance, Lemma 1.

According to (7) and (8), both overestimation errors decay as a negative power of the sample size n (up to a power of a log n factor). Note that the overestimation rate is necessarily slower than exponential, as stated in another variant of the Stein Lemma (see Lemma 3 in [4]).

We want to emphasize that the overestimation rates obtained in Theorems 2 and 3 depend on intrinsic quantities [such as dimensions D_1(k) and D_2(k*), power β_2]. On the contrary, the rates obtained in Theorems 10 and 11 of [4] depend directly on the choice of a penalty term.
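The asymmetry between the two error regimes can be seen numerically by comparing an exponential bound exp(−an) with a polynomial one (log n)^b/n^c. This is a minimal illustration; the constants a, b, c below are arbitrary choices, not quantities derived in the paper:

```python
import math

# Numerical contrast between the two error regimes: an exponential
# underestimation bound exp(-a*n) versus a polynomial overestimation
# bound (log n)^b / n^c, with arbitrary illustrative constants.

def under_bound(n, a=0.1):
    return math.exp(-a * n)

def over_bound(n, b=1.0, c=0.5):
    return math.log(n) ** b / n ** c

# Already for moderate n the exponential term is negligible next to the
# polynomial one, so overestimation dominates the total error.
for n in (10**2, 10**4, 10**6):
    assert under_bound(n) < over_bound(n)
```

This is consistent with the remark above that overestimation cannot decay exponentially fast, whereas underestimation does.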


2.3. Regression and change points models. Theorems 1, 2 and 3 (resp. 1 and 3) apply to the following regression (resp. change points) model. In the rest of this section, σ > 0 is given, g_γ is the density of the Gaussian distribution with mean γ and variance σ²; X_1, ..., X_n are i.i.d. and uniformly distributed on [0, 1], e_1, ..., e_n are i.i.d. with density g_0 and independent from X_1, ..., X_n. Moreover, one observes Z_i = (X_i, Y_i) with Y_i = φ_{θ*}(X_i) + e_i (i = 1, ..., n), where the definition of φ_{θ*} depends on the example.

Regression (see also Section 5.3 of [4]). Let {t_k}_{k≥1} be a uniformly bounded system of continuous functions on [0, 1] forming an orthonormal system in L²([0, 1]) (for the Lebesgue measure). Let Γ be a compact subset of R that contains 0 and Θ_k = Γ^k (each k ≥ 1). For every θ ∈ Θ_k, set φ_θ = Σ_{j=1}^k θ_j t_j and f_θ(z) = g_{φ_θ(x)}(y) [all z = (x, y) ∈ [0, 1] × R].

Change points. For each k ≥ 1, let T_k be the set of (k+1)-tuples (t_j)_{0≤j≤k}, with t_0 = 0 and t_j ≤ t_{j+1} (all j < k).

M2. For every γ_1, γ_2 ∈ Γ and every η_1 > 0 small enough, there exists M > 0 such that, writing l_{γ,η} = inf{g_{γ′} : |γ′ − γ|_1 < η} and u_{γ,η} = sup{g_{γ′} : |γ′ − γ|_1 < η},
P_{u_{γ1,η1} − l_{γ1,η1}}(1 + log² g_{γ2}) ≤ M η_1, P_{g_{γ2}} log²(u_{γ1,η1}/l_{γ1,η1}) ≤ M η_1²,
P_{u_{γ1,η1} − l_{γ1,η1}} log² l_{γ1,η1} ≤ M η_1 log² η_1 and
P_{u_{γ1,η1}}(log² u_{γ1,η1} + log² g_{γ2}) + P_{g_{γ2}}(log² u_{γ1,η1} + log² l_{γ1,η1}) ≤ M.

M3. For every γ_1, γ_2 ∈ Γ, there exists α > 0 such that
sup{P_{γ1}(g_{γ2}/g_γ)^α : γ ∈ Γ} < ∞.

M4. The parameterization γ ↦ g_γ(z) is C² for μ-almost every z ∈ Z. Moreover, μ[sup_{γ∈Γ}(|∇g_γ|_1 + |D²g_γ|)] is finite. The parameterization γ ↦ log g_γ(z) is C³ for μ-almost every z ∈ Z and, for every γ ∈ Γ, the Fisher information matrix I(γ) is positive definite. Besides, for all γ_1, γ_2 ∈ Γ, there exists η > 0 for which
P_{γ1}|D² log g_{γ2}|² + P_{γ1} sup{|D³ log g_γ|² : |γ − γ_2|_1 ≤ η} < ∞.


M5. Let I = {(r, s) : 1 ≤ r ≤ s ≤ d}. There exist a nonempty subset A of I and two constants η_0, a > 0 such that, for every k ≥ 2, for every k-tuple (γ_1, ..., γ_k) of pairwise distinct elements of Γ:
(a) the functions g_{γj}, (∇g_{γj})_l (j ≤ k, l ≤ d) are linearly independent;
(b) for every j ≤ k, the functions g_{γj}, (∇g_{γj})_l, (D²g_{γj})_{rs} (all l ≤ d, (r, s) ∈ A) are linearly independent;
(c) for each j ≤ k, (r, s) ∈ I \ A, there exist λ_{0jrs}, ..., λ_{djrs} ∈ R such that (D²g_{γj})_{rs} = λ_{0jrs} g_{γj} + Σ_{l=1}^d λ_{ljrs}(∇g_{γj})_l;
(d) for all η ≤ η_0 and all u, v ∈ R^d, for each j ≤ k, if
Σ_{(r,s)∈A}(|u_r u_s| + |v_r v_s|) + |Σ_{(r,s)∉A} λ_{0jrs}(u_r u_s + v_r v_s)| ≤ η,
then |u|²_2 + |v|²_2 ≤ aη.

These assumptions suffice to guarantee the bounds below.

THEOREM 4. If M1–M5 are satisfied, then there exist n_1 ≥ 1 and c_4 > 0 such that, for all n ≥ n_1,
(10) P*^n{k̂_n^L < k*} ≤ c_1 e^{−nc_2},
(11) P*^n{k̂_n^L > k*} ≤ c_4 (log n)^{3(d+1)k*/2}/√n.
The positive constants c_1, c_2 are defined in Theorem 1.

Note that all assumptions involve the mixed densities g_γ (γ ∈ Γ) rather than the resulting mixture densities f_θ (θ ∈ Θ_∞). Assumption M2 implies A2 and M3 implies A1. Assumption M4 is a usual regularity condition. Assumption M5 is a weaker version of the strong identifiability condition defined by [5], which is assumed in most papers dealing with asymptotic properties of mixtures. In particular, strong identifiability does not hold in location-scale mixtures of Gaussian r.v., but M5 does (with A = I \ {(1, 1)}). In fact, Theorem 4 applies, and we have the following:

COROLLARY 1. Set A, B > 0 and Γ = {(μ, σ²) ∈ [−A, A] × [B^{−1}, B]}. For every γ = (μ, σ²) ∈ Γ, let us denote by g_γ the Gaussian density with mean μ and variance σ².
Then (10) and (11) hold with d = 2 for all n ≥ n_0.

Other examples include, for instance, mixtures of Gamma(a, b) in a or in b [but not in (a, b)], of Beta(a, b) in (a, b), of GIG(a, b, c) in (b, c) (another example where strong identifiability does not hold, but M5 does).

3. Underestimation proofs. Let us start with new notation. For f, f′ ∈ L^1_+(μ) \ {0}, we set H(f, f′) = P_f(log f − log f′) when it is defined (∞ otherwise), H(f) = H(f*, f), and V(f) = V(f*, f) ∨ V(f, f*). For every θ ∈ Θ_∞, the following shortcuts will be used (W stands for H or V): W(f, f_θ) = W(f, θ), W(f_θ, f) = W(θ, f), W(f_θ) = W(θ). For every probability density f ∈ L^1(μ), P_f^{⊗n} is denoted by P_f^n and the expectation with respect to P_f (resp. P_f^n) by E_f (resp. E_f^n).

Theorem 1 relies on the following lower bound on B_n(k).

LEMMA 2. Let k ≤ k* and δ ∈ (0, αM ∧ δ_0]. Under the assumptions of Theorem 1, with probability at least 1 − 2 exp{−nδ²/8M},
B_n(k) ≥ (π(k) π_k{S_k(δ)}/2) e^{−n[H*_k + δ]}.

PROOF. Let 1 ≤ k ≤ k* and 0 < δ ≤ αM ∧ δ_0. The Chernoff method (any s > 0) implies that, for all θ ∈ S_k(δ),
P*^n{B^c} ≤ exp{−ns[H*_k + δ] + n log φ_θ(s)} ≤ exp{−ns[H*_k + δ − H(θ)] + ns²M/2}.
We choose s = [H*_k + δ − H(θ)]/M ∈ [δ/2, α] so that the above probability is bounded by exp{−nδ²/8M}, and Lemma 2 is proved. □

To prove Theorem 1, we construct nets of upper bounds for the f_θ's (θ ∈ Θ_k, k = 1, ..., k* − 1). Similar nets were first introduced in a context of nonparametric Bayesian estimation in [3]. We focus on k̂_n^L; the proof for k̂_n^G is a straightforward adaptation.

PROOF OF THEOREM 1. Since P*^n{k̂_n^L < k*}


Let δ_0 be such that d(θ, θ′)


applies here: there exist c_4, c_5 > 0 which do not depend on n and guarantee that
(17) p_{n,1} ≤ c_4 e^{−nc_5}.

Thus, according to (30) of Proposition B.1,
E_θ^n(1 − φ_{i,j}) ≤ exp{−n[H(u_i) − (ρ + ρ′)]([H(u_i) − (ρ + ρ′)]/(2V(θ)) ∧ 1)}.
Since H(θ) ≤ (j + 1)δ_n ≤ 2δ_0 ≤ e^{−2}, then log²δ_n ≥ log²(jδ_n) and (3) yields V(θ) ≤ C_1 H(θ) log² H(θ) ≤ C_1(j + 1)δ_n log²(jδ_n). Consequently, j/(j + 1) ≥ 1/2 and 8C_1 log²(jδ_n) ≥ 1 imply
(18) E_θ^n(1 − φ_{i,j}) ≤ exp{−njδ_n/(64C_1 log²(jδ_n))}.

Step 2. Proposition B.1 and (29) ensure that
E*^n φ_{i,j} ≤ exp{−(njδ_n/4)(jδ_n/(2V(u_i)) ∧ 1)}.
The point is now to bound V(u_i). Let again θ ∈ S_{n,j} be such that f_θ ∈ [l_i, u_i]. Using repeatedly (a + b)² ≤ 2(a² + b²) (a, b ∈ R), the definition of a δ-bracket and (3) yield
(19) V(θ*, u_i) = P*(l* − log u_i + log μu_i)² ≤ 2P*(l* − log u_i)² + 2 log²μu_i ≤ 4P*(l* − l_θ)² + 4P*(l_θ − log u_i)² + 2(μ(u_i − l_i))² ≤ 4V(θ) + 4P*(log u_i − log l_i)² + 2(μ(u_i − l_i))² ≤ 2(2C_1 + 3)(j + 1)δ_n log²(jδ_n),
and similarly,
(20) V(u_i, θ*) ≤ 4(C_1 + 2)(j + 1)δ_n log²(jδ_n).

A bound on V(u_i) is derived from (19) and (20), which yields in turn
(21) E*^n φ_{i,j} ≤ exp{−njδ_n/(64(C_1 + 2) log²(jδ_n))}.

Step 3. Now, consider the global test
(22) φ_n = max{φ_{i,j} : i ≤ exp{E(S_{n,j}, jδ_n/4)}, j ≤ n}.
Equation (18) implies that, for every j ≤ n and θ ∈ S_{n,j},
E_θ^n(1 − φ_n) ≤ exp{−njδ_n/(64C_1 log²(jδ_n))}.
Furthermore, if we set ρ_n = nδ_n/[64(1 + s)(C_1 + 2) log²δ_n], then bounding φ_n by the sum of all φ_{i,j}, invoking (21) and (6), yield
E*^n φ_n ≤ Σ_{j=1}^n exp{E(S_{n,j}, jδ_n/4) − njδ_n/(64(C_1 + 2) log²(jδ_n))} ≤ Σ_{j=1}^n exp{−jρ_n} ≤ exp{−ρ_n}/(1 − exp{−ρ_n}).
Since δ_1 ≥ 128(1 + s)(C_1 + 2)[D_1(k*+1) − D_2(k*)] ∨ log^{−3}(n_0), one has log²δ_n ≤ 4 log²n, and ρ_n ≥ (1/2)[D_1(k*+1) − D_2(k*)] log n.
Thus, the final bound is
(23) E*^n φ_n ≤ 1/(n^{[D_1(k*+1)−D_2(k*)]/2} − 1).

Step 4. We now bound p_{n,2}:
p_{n,2} = E*^n (φ_n + (1 − φ_n)) 1{∫_{S_n} e^{l_n(θ)−l*_n} dπ_{k*+1}(θ) ≥ 1/w_n}
≤ E*^n φ_n + P*^n{∫_{S_n ∩ F_n^c} e^{l_n(θ)−l*_n} dπ_{k*+1}(θ) ≥ 1/2w_n} + E*^n (1 − φ_n) 1{∫_{S_n ∩ F_n} e^{l_n(θ)−l*_n} dπ_{k*+1}(θ) ≥ 1/2w_n}.
The first term of the right-hand side is bounded according to (23). Moreover, applying the Markov inequality and the Fubini theorem to the second term above, p_{n,2,2}, ensures that
(24) p_{n,2,2} ≤ 6β_1 C_3 (log n)^{β_2}/n^{[D_1(k*+1)−D_2(k*)]/2}.


As for the third term, p_{n,2,3}, invoking again the Markov inequality and the Fubini theorem, then (22), yields
p_{n,2,3} ≤ 2w_n Σ_{j=1}^n ∫_{S_{n,j}} E_θ^n(1 − φ_n) dπ_{k*+1}(θ)
≤ 2w_n Σ_{j=1}^n exp{−njδ_n/(64C_1 log²(jδ_n))} π_{k*+1}{S_{n,j}}
≤ 2w_n exp{−nδ_n/(64C_1 log²δ_n)} ≤ 2w_n exp{−δ_1 log n/(256C_1)}
(25) ≤ 6β_1 π(k*+1) (log n)^{β_2}/n^{[D_1(k*+1)−D_2(k*)]/2}.
Combining inequalities (23), (24) and (25) yields
p_{n,2} ≤ 1/(n^{[D_1(k*+1)−D_2(k*)]/2} − 1) + 6β_1(π(k*+1) + C_3)(log n)^{β_2}/n^{[D_1(k*+1)−D_2(k*)]/2}.
Inequalities (16), (17) and the one above conclude the proof. □

5. Mixtures proofs. In the sequel we use the notation θ* = (p*, γ*), p* = (p*_1, ..., p*_{k*−1}) and p*_{k*} = 1 − Σ_{j=1}^{k*−1} p*_j. Also, if θ = (p, γ) ∈ Θ_k, then 1 − Σ_{j=1}^{k−1} p_j is denoted by p_k.

The standard conditions hold. Assumption A1 is verified by proving (with usual regularity and convexity arguments) the existence of α > 0 such that the function θ ↦ P* e^{α(l*−l_θ)} is bounded on Θ_{k*}. Assumption A2 follows from M2. Lemma 3 in [12] guarantees that H*_k > H*_{k+1} (every k < k*).

PROPOSITION 1. There exists C_2 > 0 such that, in the setting of mixture models, for every sequence {δ_n} that decreases to 0, for all n ≥ 1,
π_{k*+1}{θ ∈ Θ_{k*+1} : H(θ) ≤ δ_n} ≤ C_2 δ_n^{[D(k*)+1]/2}.

PROPOSITION 2. If F_n^{k*+1} = {(p, γ) ∈ Θ_{k*+1} : min_{j≤k*+1} p_j ≥ e^{−n}} approximates the set Θ_{k*+1}, then O2(k*+1) is fulfilled. Furthermore, the entropy condition (6) holds as soon as δ_1 is chosen large enough.

The technical proofs of Propositions 1 and 2 are postponed to Appendices C and D, respectively.
Assumption O3 is obtained (with β_2 = 0) from the Laplace expansion under P*, which is regular (see also the comment after Theorem 3). Finally, Theorem 3 applies and Theorem 4 is proven.

APPENDIX A: PROOF OF LEMMA 1

Let θ* = (α*, t*) and θ ∈ Θ_{k*+1} satisfy H(θ) ≤ δ_n. For every j ≤ k* (resp. j ≤ k), we denote by τ*_j (resp. τ_j) the interval [t*_{j−1}, t*_j[ (resp. [t_{j−1}, t_j[) [hence, H(θ) = Σ_{j≤k} Σ_{j′≤k*+1}(α*_j − α_{j′})² μ(τ*_j ∩ τ_{j′})], and set s(j) such that μ(τ*_j ∩ τ_{s(j)}) = max_{l≤k} μ(τ*_j ∩ τ_l). So, μ(τ*_j ∩ τ_{s(j)}) ≥ μ(τ*_j)/k, and (α*_j − α_{s(j)})² ≤ cδ_n for all j ≤ k*. If s(j) = s(j′) for j′ > j, then necessarily j′ ≥ (j + 2) and s(j+1) = s(j), while α*_j ≠ α*_{j+1}, so we do get k* conditions on θ. Suppose now without loss of generality that s(j) = j for all j ≤ k*. Then (α_k − α*_{k*})²(1 − t_{k*}) ≤ δ_n, another condition on θ. Moreover, for all j < k*, there exists C > 0 (independent of t) such that, on that event,
(25) ∫_{Γ^{k*}} e^{l_n(θ)−l*_n} dπ_{k*}(α|t) ≥ C n^{−k*/2} e^{l_n(α̂_t,t)−l*_n} ≥ C n^{−k*/2} e^{l_n(α*,t)−l*_n},
where α̂_t is the maximum likelihood estimator for fixed t. Denote n_j(t) = Σ_{i=1}^n 1{X_i ∈ [t*_j, t*_j + u_j/n[} and v²(t) = σ^{−2} Σ_{j=1}^{k*}(α*_j − α*_{j−1})² n_j(t) for any t ∈ S_n. Then ξ(t) = l_n(α*, t) − l*_n + (1/2)v²(t) is, conditionally on X_1, ..., X_n, a centered Gaussian r.v. with variance v²(t). Because each n_j(t) is Binomial(n, u_j/n) distributed, the Chernoff method implies, for any t ∈ S_n,
(26) P*^n{v²(t) ≥ τ log n} = O(1/√n).
Moreover, since ξ(t) is conditionally Gaussian, it is easily seen by using (26) that, for any t ∈ S_n, setting B = {Z^n : l_n(α*, t) − l*_n ≥ −(1/2)(v²(t) + τ log n)},
(27) P*^n{B^c} = O(1/√n),
too. Now, the same technique as in the proof of Lemma 2 yields
(28) P*^n{∫_{S_n} e^{l_n(α*,t)−l*_n} dπ_{k*}(t) ≤ n^{−k*+1−τ}} ≤ ∫_{S_n} 2P*^n{B^c}/π_{k*}(S_n) dπ_{k*}(t),
whenever π_{k*}{S_n} = c(log log n/n)^{k*−1} ≥ 2n^{−k*+1}. By combining (25), (27) and (28), we obtain that O3 holds with D_2(k*) = 3k* + 2(τ − 1).


APPENDIX B: CONSTRUCTION OF TESTS

PROPOSITION B.1. Let (ρ, c) belong to R*_+ × (0, 1] and f ∈ L^1_+(μ) \ {0}. Assume that V(f) is positive and finite. Let l_{n,f} = Σ_{i=1}^n log f(Z_i) and
φ_{n,f,ρ,c} = 1{l_{n,f} − l*_n + nH(f) ≥ nρ + log c}.
The following bound holds:
(29) E*^n φ_{n,f,ρ,c} ≤ (1/c) exp{−(nρ/2)(ρ/V(f) ∧ 1)}.
Let ρ′ ∈ R_+ and g ∈ L^1_+(μ) be such that μg = 1, g ≤ e^{ρ′} f and V(g) is finite. If, in addition, (ρ + ρ′) < H(f), then
(30) E_g^n(1 − φ_{n,f,ρ,c}) ≤ exp{−n[H(f) − (ρ + ρ′)]([H(f) − (ρ + ρ′)]/(2V(g)) ∧ 1)}.

PROOF. For all s ∈ (0, 1], the Chernoff method (any s > 0) yields
c E*^n φ_{n,f,ρ,c} ≤ exp[−nsρ + ns²V(f)/2].
The choice s = 1 ∧ ρ/V(f) yields (29). Similarly, for all s ∈ (0, 1],
E_g^n(1 − φ_{n,f,ρ,c}) ≤ P_g^n{l*_n − l_{n,f} > n[H(f) − ρ]} ≤ P_g^n{l*_n − l_{n,g} > n[H(f) − (ρ + ρ′)]} ≤ e^{−ns[H(f)−(ρ+ρ′)]}(P_g e^{s(l*−l_g)})^n.
The same arguments as before lead to P_g e^{s(l*−l_g)} ≤ 1 + s²V(g)/2 and
E_g^n(1 − φ_{n,f,ρ,c}) ≤ exp{−ns[H(f) − (ρ + ρ′)] + ns²V(g)/2}.
The choice s = 1 ∧ [H(f) − (ρ + ρ′)]/V(g) yields (30). □

APPENDIX C: PROOF OF PROPOSITION 1

Let {δ_n} be a decreasing sequence of positive numbers which tends to 0. Let us denote by ‖·‖ the L^1(μ) norm. Because √H(θ) ≥ ‖f* − f_θ‖/2, M1 ensures that Proposition 1 holds if
(31) π_{k*+1}{θ ∈ Θ_{k*+1} : ‖f* − f_θ‖ ≤ √δ_n} ≤ C_2 √δ_n^{D(k*)+1}
for some C_2 > 0 which does not depend on {δ_n}. We use a new parameterization for translating ‖f* − f_θ‖ ≤ √δ_n in terms of the parameters p and γ. It is a variant of the locally conic parameterization [6], using the L^1 norm instead of the L^2 norm. In the sequel, c, C will be generic positive constants.

L^1 locally conic parameterization. For each θ = (p, γ) ∈ int(Θ_{k*+1}), we define iteratively the permutation σ_θ on {1, ..., k*+1} as follows:
• (j_1, σ_θ(j_1)) = min_{(j,j′)} arg min{|γ*_j − γ_{j′}|_1 : j ≤ k*, j′ ≤ k*+1}, where the first minimum is for the lexicographic ranking;
• if (j_1, σ_θ(j_1)), ..., (j_{l−1}, σ_θ(j_{l−1})) with l


Note that Σ_{j≤k*} ρ_j = −1. Now, define
N(γ_θ, R_θ) = ‖g_{γ_θ} + Σ_{j=1}^{k*} p*_j r_j^T ∇g_{γ*_j} + Σ_{j=1}^{k*} ρ_j g_{γ*_j}‖;
then t_θ = p_θ N(γ_θ, R_θ).

LEMMA C.1. For all θ ∈ Θ*, let Φ(θ) = (t_θ, γ_θ, R_θ). The function Φ is a bijection between Θ* and Φ(Θ*). Furthermore, T = sup_{θ∈Θ*} t_θ is finite, so that the projection of Φ(Θ*) along its first coordinate is included in [0, T]. Finally, for all ε > 0, there exists η > 0 such that, for every θ ∈ Θ*, ‖f* − f_θ‖ ≤ η yields t_θ ≤ ε.

PROOF. It is readily seen that Φ is a bijection. We point out that N(γ, R) is necessarily positive for all (t, γ, R) ∈ Φ(Θ*), by virtue of M5. As for the finiteness of T, note that, for any θ ∈ Θ*,
(33) t_θ = ‖p_θ g_{γ_θ} + Σ_{j=1}^{k*} p*_j(γ_j − γ*_j)^T ∇g_{γ*_j} + Σ_{j=1}^{k*}(p_j − p*_j) g_{γ*_j}‖ ≤ 2 + Σ_{j=1}^{k*} p*_j ‖(γ_j − γ*_j)^T ∇g_{γ*_j}‖.
The right-hand side term above is finite because Γ is bounded and the ‖(∇g_{γ*_j})_l‖ (j ≤ k*, l ≤ d) are finite thanks to M4. Hence, T is finite. The last part of the lemma is a straightforward consequence of the compactness of Γ and continuity of θ ↦ f_θ(z). □

Because Γ is compact, continuity arguments on the norm in finite-dimensional spaces yield the following useful property: under M5, if g_1, ..., g_k ∈ L^1(μ) are k functions such that, for every γ ∈ Γ, g_γ, g_1, ..., g_k are linearly independent, then there exists C > 0 such that, for all a = (a_0, ..., a_k) ∈ R^{k+1} and γ ∈ Γ,
(36) ‖a_0 g_γ + Σ_{j=1}^k a_j g_j‖ ≥ C Σ_{j=0}^k |a_j|.

Proof of (31). For any τ > 0, define the sets
B_1^τ = {θ ∈ Θ* : min_{j≤k*}|γ_θ − γ*_j|_1 > τ, ‖f* − f_θ‖ ≤ √δ_n}
and
B_2^τ = {θ ∈ Θ* : min_{j≤k*}|γ_θ − γ*_j|_1 ≤ τ, ‖f* − f_θ‖ ≤ √δ_n}.
Inequality (31) is a consequence of the following:

LEMMA C.2. Given τ > 0, there exists C > 0 such that, for all n ≥ 1,
(34) π_{k*+1}{B_1^τ} ≤ C √δ_n^{k*(d+1)}.

LEMMA C.3. There exist τ, C > 0 such that, for all n ≥ 1,
(35) π_{k*+1}{B_2^τ} ≤ C √δ_n^{k*(d+1)}.

PROOF OF LEMMA C.2. Let τ > 0, let (t, γ, R) ∈ Φ(Θ*) and θ = (p, γ) = Φ^{−1}(t, γ, R) satisfy |γ_θ − γ*_j|_1 > τ for all j ≤ k* and ‖f* − f_θ‖ ≤ √δ_n. Given any z ∈ Z, a Taylor–Lagrange expansion (in t) of [f*(z) − f_θ(z)] yields the existence of t^o ∈ (0, t) (depending on z) such that
|f*(z) − f_θ(z)| ≥ (t/N)|g_γ(z) + Σ_{j=1}^{k*} p*_j r_j^T ∇g_{γ*_j}(z) + Σ_{j=1}^{k*} ρ_j g_{γ*_j}(z)| − (t²/N²)|Σ_{j=1}^{k*} ρ_j r_j^T ∇g_{γ^o_j}(z) + (1/2) Σ_{j=1}^{k*} p^o_j r_j^T D²g_{γ^o_j}(z) r_j|,
where γ^o_j = γ*_j + t^o r_j/N and p^o_j = p*_j + t^o ρ_j/N (all j ≤ k*). Therefore, by virtue of M4, there exists C > 0 such that
(37) ‖f* − f_θ‖ ≥ t(1 − C (t/N²)[Σ_{j=1}^{k*}(|ρ_j||r_j|_1 + |r_j|²_2)]).
Furthermore, M5 and (36) imply that, for some C > 0 (depending on τ),
(38) N ≥ C(1 + Σ_{j=1}^{k*}(|ρ_j| + p*_j |r_j|_1)),
so the following lower bound on ‖f* − f_θ‖ is deduced from (37):
(39) ‖f* − f_θ‖ ≥ t(1 − C (Σ_{j=1}^{k*}(|p_j − p*_j||γ_j − γ*_j|_1 + |γ_j − γ*_j|²_2))/(Σ_{j=1}^{k*}(|p_j − p*_j| + p*_j |γ_j − γ*_j|_1))).
By mimicking the last part of the proof of Lemma C.1, we obtain that the right-hand term in (39) is larger than t/2 for n large enough (independently of θ). Because t = p_θ N and (38) holds, there exists c > 0 such that
π_{k*+1}{B_1^τ} ≤ π_{k*+1}{θ ∈ Θ* : Σ_{j=1}^{k*}(|p_j − p*_j| + p*_j |γ_j − γ*_j|_1) ≤ c √δ_n},
leading to (34) by virtue of M1. □


PROOF OF LEMMA C.3. Let τ > 0 and θ = (p, γ) ∈ Θ* satisfy ‖f* − f_θ‖ ≤ √δ_n. Assume that |γ_θ − γ*_j|_1 ≤ τ for some j ≤ k*, say, j = 1. By construction of Θ*, |γ_1 − γ*_1|_1 ≤ |γ_θ − γ*_1|_1 ≤ τ, and τ can be chosen small enough so that γ_θ must be different from γ*_j for every j = 2, ..., k*. We consider without loss of generality that γ_θ ∉ {γ*_j : j ≤ k*}.

Lemma C.1 implies that |γ_j − γ*_j|_1 and |p_j − p*_j| go to 0 as n ↑ ∞ for every j = 2, ..., k*. This yields that |p_1 + p_θ − p*_1| goes to 0 as n ↑ ∞. Therefore, by virtue of M5 and (36), there exist c, C > 0 such that, for n large enough,
(40) ‖f* − f_θ‖ ≥ C(Σ_{j=2}^{k*}|p_j − p*_j| + Σ_{j=2}^{k*} p*_j|γ_j − γ*_j|_1
+ |(p_1 + p_θ − p*_1) + Σ_{(r,s)∉A} λ⁰_{rs}[p_θ(γ_θ − γ*_1)_r(γ_θ − γ*_1)_s + p_1(γ_1 − γ*_1)_r(γ_1 − γ*_1)_s]|
+ Σ_{(r,s)∈A}[p_θ|(γ_θ − γ*_1)_r(γ_θ − γ*_1)_s| + p_1|(γ_1 − γ*_1)_r(γ_1 − γ*_1)_s|]
+ Σ_{l=1}^d |p_1(γ_1 − γ*_1)_l + p_θ(γ_θ − γ*_1)_l + Σ_{(r,s)∉A} λ^l_{rs}[p_θ(γ_θ − γ*_1)_r(γ_θ − γ*_1)_s + p_1(γ_1 − γ*_1)_r(γ_1 − γ*_1)_s]|)
− c(p_θ|γ_θ − γ*_1|³_1 + p_1|γ_1 − γ*_1|³_1 + Σ_{j=2}^{k*}|γ_j − γ*_j|²_2)
= C A_1 − c A_2.
Since |γ_j − γ*_j|_1 goes to 0 for j = 2, ..., k*, Σ_{j=2}^{k*}|γ_j − γ*_j|²_2 can be neglected compared to Σ_{j=2}^{k*} p*_j|γ_j − γ*_j|_1 when n is large enough. If C A_1 ≤ 2c A_2, then Σ_{j=2}^{k*}|p_j − p*_j| ≤ 2c A_2, so that |p_1 + p_θ − p*_1| ≤ 2c A_2, which yields in turn
|Σ_{(r,s)∉A} λ⁰_{rs}[p_θ(γ_θ − γ*_1)_r(γ_θ − γ*_1)_s + p_1(γ_1 − γ*_1)_r(γ_1 − γ*_1)_s]| + Σ_{(r,s)∈A}[p_θ|(γ_θ − γ*_1)_r(γ_θ − γ*_1)_s| + p_1|(γ_1 − γ*_1)_r(γ_1 − γ*_1)_s|] ≤ 4c A_2.
Consequently, M5 guarantees the existence of C′ > 0 such that
p_θ|γ_θ − γ*_1|²_2 + p_1|γ_1 − γ*_1|²_2 ≤ C′(p_θ|γ_θ − γ*_1|³_1 + p_1|γ_1 − γ*_1|³_1),
which is impossible when τ is chosen small enough. Therefore, C A_1 > 2c A_2 and (40) together with M5 give
‖f* − f_θ‖ ≥ C(Σ_{j=2}^{k*}|p_j − p*_j| + Σ_{j=2}^{k*} p*_j|γ_j − γ*_j|_1 + |p_1 + p_θ − p*_1| + p_θ|γ_θ − γ*_1|²_2 + p_1|γ_1 − γ*_1|²_2 + Σ_{l=1}^d |p_1(γ_1 − γ*_1)_l + p_θ(γ_θ − γ*_1)_l + Σ_{(r,s)∉A} λ^l_{rs}[p_θ(γ_θ − γ*_1)_r(γ_θ − γ*_1)_s + p_1(γ_1 − γ*_1)_r(γ_1 − γ*_1)_s]|)
for some C > 0.
Finally,(41)|p j − pj ⋆ k ⋆|+ ∑pj ⋆ |γ j − γj ⋆ | 1 +|p 1 + p θ − p1 ⋆ |j=2+ p θ |γ θ − γ1 ⋆ |2 2 + p 1|γ 1 − γ1 ⋆ |2 2d∑+∣ p 1(γ 1 − γ1 ⋆ ) l + p θ (γ θ − γ1 ⋆ ) ll=1+ ∑(r,s)/∈Aλ l rs [p θ(γ θ − γ ⋆ 1 ) r(γ θ − γ ⋆ 1 ) s)+ p 1 (γ 1 − γ1 ⋆ ) r(γ 1 − γ1 ⋆ ) s],∣|p 1 + p θ − p1 ⋆ k ⋆|+ ∑|p j − pj ⋆ |+p 1|γ 1 − γ1 ⋆ |2 2 + p θ|γ θ − γ1 ⋆ |2 2j=2+|p 1 (γ 1 − γ1 ⋆ ) + p θ(γ θ − γ1 ⋆ )| ∑1 + pj ⋆ |γ j − γj ⋆ | 1 ≤ C √ δ n .Therefore, for τ small enough and n large enough,k ⋆j=2π k ⋆ +1{B τ 2 }≤π k ⋆ +1{θ ∈ ⋆ :(41) holds}.The conditions on p j and γ j (j = 2,...,k ⋆ ) and a symmetry argument implythat the right-hand si<strong>de</strong> term above is boun<strong>de</strong>d by a constant times √ δ n[(d+1)(k ⋆ −1)]


times $w_n$, where
$$w_n=\int \mathbb{1}\{p_\theta\ge p_1\}\,\mathbb{1}\Big\{|p_1+p_\theta-p_1^\star|+p_1|\gamma_1-\gamma_1^\star|_2^2+p_\theta|\gamma_\theta-\gamma_1^\star|_2^2\\
+|p_1(\gamma_1-\gamma_1^\star)+p_\theta(\gamma_\theta-\gamma_1^\star)|_1\le C\sqrt{\delta_n}\Big\}\,d\pi^\gamma_{k^\star+1}(\gamma)\,d\pi^p_{k^\star+1}(p).$$
Note that the conditions in the integrand imply that $|\gamma_\theta-\gamma_1|_2^2\le 4C\sqrt{\delta_n}/p_1$ and $p_\theta\ge p_1^\star/4$ as soon as $C\sqrt{\delta_n}\le p_1^\star/2$. Simple calculus (based on M1) yields the result. □

APPENDIX D: PROOF OF PROPOSITION 2

It is readily seen that O2($k^\star+1$) holds for the chosen approximating set. Let us focus now on the entropy condition (6).

Constructing $\delta$-brackets. Let $\delta_1$ satisfy (5). A convenient value will be chosen later on. Set $j'\le\lfloor\delta_0/\delta_n\rfloor$, $\varepsilon=j'\delta_n/4$ and $\tau\ge1$.

Let $\theta=(p,\gamma)\in\Theta_{k^\star+1}$ be arbitrarily chosen. Let $\eta\in(0,\eta_1)$ be small enough so that, for every $j\le k^\star+1$, $u_j=\overline g_{\gamma_j,\eta}$ and $v_j=\underline g_{\gamma_j,\eta}$ (as defined in M2) satisfy, for all $\gamma\in\Gamma$, $P(u_j-v_j)(1+\log^2 g_\gamma)\le\varepsilon/\tau$, $P g_\gamma(\log u_j-\log v_j)^2\le(\varepsilon/\tau)^2$ and $P(u_j-v_j)\log^2 u_j\le(\varepsilon/\tau)\log^2(\varepsilon/\tau)$. If we define $v_\theta=(1-\varepsilon/\tau)\big(\sum_{j=1}^{k^\star+1}p_jv_j\big)$ and $u_\theta=(1+\varepsilon/\tau)\big(\sum_{j=1}^{k^\star+1}p_ju_j\big)$, then there exists $\tau\ge1$ (which depends only on $k^\star$ and the constant $M$ of M2) such that the bracket $[v_\theta,u_\theta]$ is an $\varepsilon$-bracket. The repeated use of $\big(\sum_jp_ju_j/\sum_jp_jv_j\big)\le\max_ju_j/v_j$ is the core of the proof, which we omit.

Control of the entropy. The rule $x_1(1-\varepsilon/\tau)=e^{-n}$ and $x_{j+1}(1-\varepsilon/\tau)=x_j(1+\varepsilon/\tau)$ is used for defining a net for the interval $(e^{-n},1)$. Such a net has at most $1+n/\log[(1+\varepsilon/\tau)/(1-\varepsilon/\tau)]\le 1+2n\tau/\varepsilon$ support points. Using repeatedly this construction on each dimension of the $(k^\star+1)$-dimensional simplex yields a net for $\{p\in\mathbb{R}_+^{k^\star}:\min_{j\le k^\star}p_j\ge e^{-n},\ 1-\sum_{j\le k^\star}p_j\ge e^{-n}\}$ with at most $O((n/\varepsilon)^{k^\star+1})$ support points. We can choose a net for $\Gamma^{k^\star+1}$ with at most $O(\varepsilon^{-d(k^\star+1)})$ support points such that each $\gamma\in\Gamma^{k^\star+1}$ is within $|\cdot|_1$-distance $\varepsilon$ of some element of the net.

Consequently, the minimum number of $\varepsilon$-brackets needed to cover $F_n^{k^\star+1}$ is $O(n^{k^\star+1}/\varepsilon^{(d+1)(k^\star+1)})$, so there exist constants $a,b,c>0$ for which
$$E\big(F_n^{k^\star+1},j'\delta_n\big)\le a\log n-b\log(j'\delta_n)+c.\tag{42}$$
Now, let us note that
$$\frac{n\delta_n}{\log^2(j'\delta_n)}\ \ge\ \frac{n\delta_n}{(\log\delta_n)\log(j'\delta_n)}\ \ge\ \frac{n\delta_n}{\log^2\delta_n}$$
and consider each term of the right-hand side of (42) in turn. It is readily seen that $a\log n\le n\delta_n/\log^2\delta_n$ is equivalent to
$$\delta_1\ge\big[(\log^3 n)\,n^{(\delta_1/a)^{1/2}-1}\big]^{-1}.\tag{43}$$
Now, $-b\log(j'\delta_n)\le n\delta_n/[(\log\delta_n)\log(j'\delta_n)]$ if and only if $-b\log\delta_n\le\delta_1\log^3 n$. Since $\log^2\delta_n\le4\log^2 n$, both are valid as soon as
$$\delta_1\ge 2b/\log^2 n.\tag{44}$$
Finally, using again $\log^2\delta_n\le4\log^2 n$ yields that $c\le n\delta_n/\log^2\delta_n$ when
$$\delta_1\ge 4c/\log n.\tag{45}$$
When $\delta_1\ge a$, the largest values of the right-hand sides of (43), (44) and (45) are achieved at $n_0$. So, $\delta_1$ can be chosen large enough (independently of $j'$ and $n$) so that (5), (43), (44) and (45) hold for all $n\ge n_0$ and $j'\le\lfloor\delta_0/\delta_n\rfloor$. This completes the proof of Proposition 2, because $E(F_n^{k^\star+1},j'\delta_n/4)$ is larger than the left-hand side of (6) (with $j'$ substituted for $j$). □

Acknowledgments. We thank the referees and Associate Editor for their helpful suggestions.

REFERENCES

[1] AZENCOTT, R. and DACUNHA-CASTELLE, D. (1986). Series of Irregular Observations. Springer, New York. MR0848355
[2] BAHADUR, R. R., ZABELL, S. L. and GUPTA, J. C. (1980). Large deviations, tests, and estimates. In Asymptotic Theory of Statistical Tests and Estimation 33–64. Academic Press, New York. MR0571334
[3] BARRON, A., SCHERVISH, M. J. and WASSERMAN, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27 536–561. MR1714718
[4] CHAMBAZ, A. (2006). Testing the order of a model. Ann. Statist. 34 1166–1203. MR2278355
[5] CHEN, J. H. (1995). Optimal rate of convergence for finite mixture models. Ann. Statist. 23 221–233. MR1331665
[6] DACUNHA-CASTELLE, D. and GASSIAT, E. (1999). Testing the order of a model using locally conic parametrization: Population mixtures and stationary ARMA processes. Ann. Statist. 27 1178–1209. MR1740115
[7] FRALEY, C. and RAFTERY, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. J. Amer. Statist. Assoc. 97 611–631. MR1951635
[8] GHOSAL, S., GHOSH, J. K. and VAN DER VAART, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531. MR1790007
[9] HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. (2001). The Elements of Statistical Learning. Springer, New York. MR1851606
[10] ISHWARAN, H., JAMES, L. F. and SUN, J. (2001). Bayesian model selection in finite mixtures by marginal density decompositions. J. Amer. Statist. Assoc. 96 1316–1332. MR1946579
[11] KASS, R. E. and RAFTERY, A. E. (1995). Bayes factors. J. Amer. Statist. Assoc. 90 773–795.
[12] LEROUX, B. G. (1992). Consistent estimation of a mixing distribution. Ann. Statist. 20 1350–1360. MR1186253
[13] MCLACHLAN, G. and PEEL, D. (2000). Finite Mixture Models. Wiley, New York. MR1789474


[14] MENGERSEN, K. L. and ROBERT, C. P. (1996). Testing for mixtures: A Bayesian entropic approach. In Bayesian Statistics 5 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 255–276. Oxford Univ. Press. MR1425410
[15] MORENO, E. and LISEO, B. (2003). A default Bayesian test for the number of components in a mixture. J. Statist. Plann. Inference 111 129–142. MR1955877
[16] ROUSSEAU, J. (2007). Approximating interval hypothesis: p-values and Bayes factors. In Bayesian Statistics 8 417–452. Oxford Univ. Press.
[17] SHEN, X. and WASSERMAN, L. (2001). Rates of convergence of posterior distributions. Ann. Statist. 29 687–714. MR1865337
[18] TIERNEY, L., KASS, R. E. and KADANE, J. B. (1989). Fully exponential Laplace approximations to expectations and variances of nonpositive functions. J. Amer. Statist. Assoc. 84 710–716. MR1132586
[19] TITTERINGTON, D. M., SMITH, A. F. M. and MAKOV, U. E. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley, Chichester. MR0838090
[20] WONG, W. H. and SHEN, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieve MLEs. Ann. Statist. 23 339–362. MR1332570

MAP5, CNRS UMR 8145
UNIVERSITÉ PARIS DESCARTES
45 RUE DES SAINTS-PÈRES
75270 PARIS CEDEX 06
FRANCE
E-MAIL: chambaz@univ-paris5.fr

CÉRÉMADE, UMR CNRS 7534
UNIVERSITÉ PARIS DAUPHINE AND CREST
PLACE DE LATTRE DE TASSIGNY
75775 PARIS CEDEX 16
FRANCE
E-MAIL: rousseau@ceremade.dauphine.fr


ESAIM: PS January 2009, Vol. 13, p. 38–50
DOI: 10.1051/ps:2007048
ESAIM: Probability and Statistics, www.esaim-ps.org
Article published by EDP Sciences © EDP Sciences, SMAI 2009

NUMBER OF HIDDEN STATES AND MEMORY: A JOINT ORDER ESTIMATION PROBLEM FOR MARKOV CHAINS WITH MARKOV REGIME

Antoine Chambaz¹ and Catherine Matias²

Abstract. This paper deals with order identification for Markov chains with Markov regime (MCMR) in the context of finite alphabets. We define the joint order of a MCMR process in terms of the number $k$ of states of the hidden Markov chain and the memory $m$ of the conditional Markov chain. We study the properties of penalized maximum likelihood estimators for the unknown order $(k,m)$ of an observed MCMR process, relying on information theoretic arguments. The novelty of our work lies in the joint estimation of two structural parameters. Furthermore, the different models in competition are not nested. In an asymptotic framework, we prove that a penalized maximum likelihood estimator is strongly consistent without prior bounds on $k$ and $m$. We complement our theoretical work with a simulation study of its behaviour. We also study numerically the behaviour of the BIC criterion. A theoretical proof of its consistency seems to us presently out of reach for MCMR, as such a result does not yet exist in the simpler case where $m=0$ (that is, for hidden Markov models).

Mathematics Subject Classification. 62B10, 62B15, 62M07.
Received October 17, 2006. Revised July 6, 2007.

Keywords and phrases. Markov regime, order estimation, hidden states, conditional memory, hidden Markov model.
¹ Laboratoire MAP5, UMR CNRS 8145, Université René Descartes, 45 rue des Saints-Pères, 75270 Paris Cedex 06, France; Antoine.Chambaz@univ-paris5.fr
² Laboratoire Statistique et Génome, UMR CNRS 8071, Tour Évry 2, 523 pl. des Terrasses de l'Agora, 91000 Évry, France; matias@genopole.cnrs.fr

1. Introduction

Markov chains with Markov regime

Let $\mathcal{X}=\{1,\ldots,k\}$ and $\mathcal{Y}=\{1,\ldots,r\}$ be two finite sets and $m$ be some integer. Here, $\mathbb{N}^\star$ denotes the set of positive integers and, for any $i\le j$, we use $x_i^j$ to denote the sequence $x_i,x_{i+1},\ldots,x_j$. We consider a process $\{X_j,Y_j\}_{j\ge1}$ on $(\mathcal{X}\times\mathcal{Y})^{\mathbb{N}^\star}$ with distribution as follows. Process $\{X_j\}_{j\ge1}$ is a Markov chain with memory one on $\mathcal{X}$ with transition matrix $A=(a(i,j))_{1\le i,j\le k}$. Besides, conditionally on $\{X_j\}_{j\ge1}$, process $\{Y_j\}_{j\ge1}$ is a Markov chain with memory $m$ [abbreviated to MC($m$)], and the conditional distribution of $Y_s$ given $(\{X_j\}_{j\ge1},\{Y_j\}_{j<s})$ depends only on $X_s$ and $Y_{s-m}^{s-1}$ whenever $s>m$. The process has some initial distribution $\mu$ on $\mathcal{X}\times\mathcal{Y}^m$.

The set $\Pi_{k,m}$ denotes the set of all such probability measures $P$ on $(\mathcal{X}\times\mathcal{Y})^{\mathbb{N}^\star}$, formally described by, for all $n\in\mathbb{N}^\star$ and $(x_1^n,y_1^n)\in(\mathcal{X}\times\mathcal{Y})^n$,
$$P(x_1^n,y_1^n)=\mu(x_1,y_1^m)\Big\{\prod_{i=1}^{n-1}a(x_i,x_{i+1})\Big\}\Big\{\prod_{i=m+1}^{n}b(y_i|y_{i-m}^{i-1};x_i)\Big\}.\tag{1}$$

Let us denote by $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y}^m)$ the set of probability measures on $\mathcal{X}\times\mathcal{Y}^m$. The set $\Pi_{k,m}$ is naturally parametrized by $\mathcal{M}_1(\mathcal{X}\times\mathcal{Y}^m)\times\Theta_{k,m}$, where
$$\Theta_{k,m}=\Big\{\theta=(A,B):A=(a(i,j))_{1\le i,j\le k},\ a(i,j)\ge0,\ \sum_{j=1}^{k}a(i,j)=1,\ \text{and}\\
B=(b(y|y_1^m;x))_{y\in\mathcal{Y},y_1^m\in\mathcal{Y}^m,x\in\mathcal{X}},\ b(y|y_1^m;x)\ge0,\ \sum_{y=1}^{r}b(y|y_1^m;x)=1\Big\}.\tag{2}$$
Thus, $\Pi_{k,m}=\{P=P_{\mu,\theta}:(\mu,\theta)\in\mathcal{M}_1(\mathcal{X}\times\mathcal{Y}^m)\times\Theta_{k,m}\}$. Moreover, for stationary processes with stationary measure $\pi_\theta$ on $\mathcal{X}\times\mathcal{Y}^m$, we use the notation $P_\theta=P_{\pi_\theta,\theta}$ to remind that the initial probability is fixed.

The observations consist of $\{Y_j\}_{1\le j\le n}$, which is called a Markov chain with Markov regime (abbreviated to MCMR). Note that $\{Y_j\}_{j\ge1}$ is not a Markov process.
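Concretely, the sampling mechanism in (1) can be sketched as follows. This is an illustrative toy simulation, not the SHOW software used later: the matrices $A$ and $B$, the sizes $k=r=2$, the memory $m=1$ and the uniform initial law are arbitrary choices of ours.

```python
import random

random.seed(1)

k, r, m = 2, 2, 1  # hidden states, observed alphabet size, conditional memory

# Transition matrix A of the hidden chain {X_j} (rows sum to 1).
A = [[0.9, 0.1],
     [0.2, 0.8]]

# Conditional transitions b(y | y_prev; x): one r x r matrix per hidden
# state, indexed by the previous observed symbol (memory m = 1).
B = [[[0.7, 0.3], [0.4, 0.6]],   # regime x = 0
     [[0.1, 0.9], [0.5, 0.5]]]   # regime x = 1

def draw(probs):
    """Sample an index from a finite probability vector."""
    u, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

def simulate_mcmr(n):
    """Sample (x_1^n, y_1^n) from (1), with a uniform initial law mu."""
    x = random.randrange(k)
    y = random.randrange(r)          # the initial block y_1^m, here m = 1
    xs, ys = [x], [y]
    for _ in range(n - 1):
        x = draw(A[x])               # hidden Markov step
        y = draw(B[x][ys[-1]])       # observed step: depends on X_j and Y_{j-1}
        xs.append(x)
        ys.append(y)
    return xs, ys

xs, ys = simulate_mcmr(10_000)
```

Only `ys` would be observed in practice; the hidden path `xs` is what the order $(k,m)$ refers to.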
We assume that its distribution is the marginal onto $\mathcal{Y}^n$ of some $P_{\theta_0}$ ($\theta_0$ is the true and unknown parameter value), which is stationary, ergodic and belongs to $\Pi_{k_0,m_0}$ for some unknown $(k_0,m_0)\in\mathbb{N}^\star\times\mathbb{N}$. In other words, it is assumed that there exists a hidden stationary process $\{X_j\}_{j\ge1}$ such that the complete process $\{(X_j,Y_j)\}_{j\ge1}$ has distribution $P_{\theta_0}\in\Pi_{k_0,m_0}$. When there is no ambiguity, $P_{\theta_0}$ will abbreviate to $P_0$. In this setup, the cardinality $r$ of the observed alphabet is known.

While HMMs can model the heterogeneity of a sequence by distinguishing different segments with different i.i.d. distributions (i.e. $m=0$), MCMRs enable furthermore a Markovian modelling of each segment ($m\ge1$). HMMs and MCMRs are widely used in practical applications, among which genomics, econometrics and speech recognition. We refer to [4,8] for recent and comprehensive overviews on this topic. Note that more flexibility could be added to these models by authorising different memory lengths for the different regimes, but the choice of these lengths is a problem which is as delicate as the one we address here.

When the couple $(k_0,m_0)$ associated with the distribution $P_0$ of a MCMR is a priori known, inference on the parameters has been investigated to a great extent (most recent results can be found in [10]). However, in many applications where MCMRs are used as a modeling device, there is no clear indication about a good choice for $(k_0,m_0)$. So, inference about $(k_0,m_0)$ is a crucial issue, for even consistency may fail to hold in a wrong model. In this paper, we propose a sound definition of the order of a MCMR, which we substitute for $(k_0,m_0)$ as the main quantity of interest. We explain why below.

Defining the order of a MCMR

Model selection for MCMRs already appears in [3]. The authors propose a reversible jump MCMC procedure to select the memory $m$ as well as the number of regimes $k$. However, no simulations were given to establish the correctness of the procedure (the method was rather directly applied to real biological data) and it is still an open question to know whether such a procedure is consistent or not.

Model selection for HMMs is a more widely studied subject (see for instance [9,11,16,17,21,22]). The order of a HMM simply is the minimal number of hidden states (here $m=0$). Our approach to model selection for MCMRs draws its inspiration from [11].


One of the interesting problems raised by HMM modeling is the question of identifiability: when do two different Markov chains generate the same stochastic process? This question, first raised by [1], can be solved for HMMs using linear algebra (see [9,13]). To our knowledge, such a complete solution does not exist in the context of MCMR models. As an immediate consequence, the definition of the order of a MCMR has to be clarified.

In the convenient case where each model $M_\alpha$ is characterized by $\alpha\in\mathbb{N}$, the order of the distribution of the observations is the smallest $\alpha$ such that this distribution belongs to $M_\alpha$. This definition is motivated by the will to guarantee that the statistician is looking for the most economical representation of the process (the number of parameters required for its description is minimized). In contrast, the definition of the order may be more involved when the above notion of minimality does not have a natural meaning anymore. Two examples follow.

First, order identification for autoregressive moving average ARMA($p,q$) models is a well-known example where the structural parameter is bivariate (see for example [12,20]). Nevertheless, this problem is very different from the one studied here because there exists a minimal representation $(p_0,q_0)$, thus defined as the true one. Indeed, the spectral density of an ARMA process admits a unique representation of the form $\lambda\mapsto|Q(e^{-i\lambda})/P(e^{-i\lambda})|^2/2\pi$, where $P$ and $Q$ are polynomial functions with no common factors, $P(z)\ne0$ for all $|z|\le1$ and $Q(z)\ne0$ for all $|z|<1$. Then the true order of the ARMA process is defined as the couple $(p_0,q_0)$ of degrees of the polynomials $P$ and $Q$, respectively.

Second, when dealing with model selection for context trees, the order to be selected is a tree. However, there exists a natural ordering (given by inclusion) which is not a total ordering. Csiszár and Talata [5] establish the consistency of both penalized (with Bayesian Information Criterion, alias BIC, penalization) maximum likelihood and minimum description length procedures.

A particularity of MCMR modeling is that the sets $\Pi_{k,m}$ are not globally nested, even though the families $\{\Pi_{k,m}\}_{k\ge1}$ (for fixed $m$) and $\{\Pi_{k,m}\}_{m\ge0}$ (for fixed $k$) are nested. In general, for a given probability $P\in\cup_{(k,m)\in\mathbb{N}^\star\times\mathbb{N}}\Pi_{k,m}$, there is no unique $(k_0,m_0)\in\mathbb{N}^\star\times\mathbb{N}$ such that $P\in\Pi_{(k_0,m_0)}$ and $P$ does not belong to any of its subsets (that is, for any $(k,m)\in\mathbb{N}^\star\times\mathbb{N}$ such that $(k,m)\ne(k_0,m_0)$, $k\le k_0$ and $m\le m_0$, one has $P\notin\Pi_{k,m}$).
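The product order underlying this discussion can be made explicit. The sketch below uses our own notation, not the paper's code: it defines the componentwise order on $\mathbb{N}^\star\times\mathbb{N}$ and shows that it is not total, so a set of candidate orders may admit several minimal elements, which is why a single smallest order need not exist for a MCMR.

```python
def leq(a, b):
    """Componentwise (product) partial order on N* x N."""
    return a[0] <= b[0] and a[1] <= b[1]

def comparable(a, b):
    return leq(a, b) or leq(b, a)

def minimal_elements(orders):
    """Candidate orders not strictly dominated by another candidate."""
    return [a for a in orders
            if not any(leq(b, a) and b != a for b in orders)]

# (3, 0) and (2, 1) are incomparable: neither dominates the other.
incomp = comparable((3, 0), (2, 1))  # False

# A candidate set may therefore have several minimal elements.
mins = minimal_elements([(3, 0), (2, 1), (3, 1)])  # two minima
```

Here `mins` contains both (3, 0) and (2, 1): under the product order there is no unique most economical candidate.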


Let us denote by $\varphi$ an increasing function which maps $(\mathbb{N}^\star\times\mathbb{N},\prec)$ to $\mathbb{N}$. Let us choose $\alpha>1$ and introduce, for all $n\in\mathbb{N}^\star$, $k\ge1$ and $m\ge0$,
$$\tau(n,k,m)=\max\Big(0,\ \log k+m\log r-k\log\frac{\Gamma(k/2)}{\Gamma(1/2)}-kr^m\log\frac{\Gamma(r/2)}{\Gamma(1/2)}\\
+\frac{k^2(k-1)}{4n}+\frac{kr^{m+1}(r-1)}{4n}+\frac{5k}{24n}(1+r^m)\Big).\tag{6}$$
Let $(\widehat{k,m})_n$ be defined by (4), with $Q_{k,m}=\mathrm{ML}_{k,m}$ and
$$\mathrm{pen}(n,k,m)=\frac12\sum_{(k',m')\preceq(k,m)}\big(N(k',m')\log n+\tau(n,k',m')\big)+\alpha\varphi(k,m)\log n.\tag{7}$$
Then, $P_0$-almost surely, $(\widehat{k,m})_n=(k_0,m_0)$ eventually.

Put in other words, $(\widehat{k,m})_n$ does not overestimate, nor underestimate, the true order $(k_0,m_0)$ eventually, $P_0$-almost surely. The proof is naturally divided accordingly: overestimation is considered in Section 3.1 and underestimation in Section 3.2. Note that a simple way to choose $\varphi$ is to set $\varphi(k,m)=\mathrm{card}\{(k',m')\in\mathbb{N}^\star\times\mathbb{N}:(k',m')\preceq(k,m)\}$.

Remark 3.2. The theorem is valid more generally for $Q_{k,m}=\mathrm{KT}_{k,m}$ or $\mathrm{NML}_{k,m}$ with the penalty
$$\mathrm{pen}(n,k,m)=\frac12\sum_{(k',m')\preceq(k,m)}N(k',m')\log n+\alpha\varphi(k,m)\log n.$$
Note also that the precise form of the penalty is used in the non-overestimation step (see the proof of Prop. 3.3). Any reader familiar with the BIC criterion will immediately interpret our penalty in terms of a cumulated sum of BIC penalty terms (i.e. of the form $\frac12N(k,m)\log n$). We do not prove here the consistency of the BIC procedure. We think this would be a very difficult task in our setup, and such a result does not even exist in the simpler HMM case. One explanation of this lack is that no explicit expression exists for the maximum likelihood estimate, making explicit computations infeasible. Thus our penalty is heavier than the BIC one, but it is inspired by the penalty studied in [11] for order estimation in the HMM framework. However, if we cannot propose a theoretical study of the BIC estimator, we provide an original numerical study of the consistency of both our estimator and the BIC one in Section 4.

3.1. No overestimation

In this section, we prove that, $P_0$-almost surely, $(\widehat{k,m})_n$ does not overestimate the true order $(k_0,m_0)$ eventually. Besides, a rate of decrease to zero of the overestimation probability is also obtained.

Proposition 3.3. Under the assumptions and notations of Theorem 3.1, $P_0$-almost surely, $(\widehat{k,m})_n\nsucc(k_0,m_0)$ eventually. Moreover,
$$P_0\{(\widehat{k,m})_n\succ(k_0,m_0)\}=O(n^{-\alpha}),$$
where $\alpha>1$ is chosen in Theorem 3.1.

The proof of Proposition 3.3 heavily relies on the following

Lemma 3.4. Let us fix $(k,m)\in\mathbb{N}^\star\times\mathbb{N}$ and denote by $Q_{k,m}$ the coding probability $\mathrm{KT}_{k,m}$ or $\mathrm{NML}_{k,m}$. Let us recall that $\tau$ is defined by (6). Then the following bounds hold:
$$0\le\max_{y_1^n\in\mathcal{Y}^n}\log\frac{\mathrm{ML}_{k,m}(y_1^n)}{Q_{k,m}(y_1^n)}\le\frac12N(k,m)\log n+\tau(n,k,m).$$

Lemma 3.4 is a combination of results which essentially go back to [23] and [6]. The proof is similar to the proof of [16], Lemma 3.4, and is thus omitted.

Applying Lemma 3.4 allows us to control the distribution of $(\widehat{k,m})_n$ under $P_0$ with respect to the dimensions of the involved models. More precisely, we have

Proposition 3.5. Under the assumptions of Theorem 3.1, for fixed $(k,m)\in\mathbb{N}^\star\times\mathbb{N}$,
$$P_0\{(\widehat{k,m})_n=(k,m)\}\le\exp\{-\mathrm{pen}(n,k,m)+\mathrm{pen}(n,k_0,m_0)\}\\
\times\Big(\exp\Big\{\frac12N(k_0,m_0)\log n+\tau(n,k_0,m_0)\Big\}\mathbb{1}\{Q_{k,m}=\mathrm{NML}_{k,m}\text{ or }\mathrm{KT}_{k,m}\}\\
+\exp\Big\{\frac12N(k,m)\log n+\tau(n,k,m)\Big\}\mathbb{1}\{Q_{k,m}=\mathrm{ML}_{k,m}\}\Big).$$

Proof of Proposition 3.5. Let $Q_{k,m}$ be the probability measure $\mathrm{NML}_{k,m}$ or $\mathrm{KT}_{k,m}$. Using Definition (4) of $(\widehat{k,m})_n$ and Lemma 3.4 implies that
$$P_0\{(\widehat{k,m})_n=(k,m)\}\le P_0\Big\{\log\frac{Q_{k,m}}{Q_{k_0,m_0}}(Y_1^n)\ge\mathrm{pen}(n,k,m)-\mathrm{pen}(n,k_0,m_0)\Big\}\\
\le P_0\Big\{\log\frac{Q_{k,m}}{\mathrm{ML}_{k_0,m_0}}(Y_1^n)\ge\mathrm{pen}(n,k,m)-\mathrm{pen}(n,k_0,m_0)-\frac12N(k_0,m_0)\log n-\tau(n,k_0,m_0)\Big\}.$$
Because $P_0\in\Pi_{k_0,m_0}$, we may use that $-\log\mathrm{ML}_{k_0,m_0}(Y_1^n)\le-\log P_0(Y_1^n)$, $P_0$-almost surely; hence we have
$$P_0\{(\widehat{k,m})_n=(k,m)\}\le P_0\Big\{\log\frac{Q_{k,m}}{P_0}(Y_1^n)\ge\mathrm{pen}(n,k,m)-\mathrm{pen}(n,k_0,m_0)-\frac12N(k_0,m_0)\log n-\tau(n,k_0,m_0)\Big\}\\
=\sum_{y_1^n\in\mathcal{Y}^n}P_0(y_1^n)\,\mathbb{1}\Big\{\log\frac{Q_{k,m}(y_1^n)}{P_0(y_1^n)}\ge\mathrm{pen}(n,k,m)-\mathrm{pen}(n,k_0,m_0)-\frac12N(k_0,m_0)\log n-\tau(n,k_0,m_0)\Big\}\\
\le\exp\Big\{\frac12N(k_0,m_0)\log n+\tau(n,k_0,m_0)-\mathrm{pen}(n,k,m)+\mathrm{pen}(n,k_0,m_0)\Big\}\times\sum_{y_1^n\in\mathcal{Y}^n}Q_{k,m}(y_1^n).$$
This is the expected result, since $Q_{k,m}$ is a probability measure.

Let us assume now that $Q_{k,m}=\mathrm{ML}_{k,m}$. Similarly,
$$P_0\{(\widehat{k,m})_n=(k,m)\}\le P_0\Big\{\log\frac{\mathrm{ML}_{k,m}(Y_1^n)}{\mathrm{ML}_{k_0,m_0}(Y_1^n)}\ge\mathrm{pen}(n,k,m)-\mathrm{pen}(n,k_0,m_0)\Big\}\\
\le P_0\Big\{\log\frac{\mathrm{ML}_{k,m}(Y_1^n)}{P_0(Y_1^n)}\ge\mathrm{pen}(n,k,m)-\mathrm{pen}(n,k_0,m_0)\Big\}\\
\le\sum_{y_1^n\in\mathcal{Y}^n}\mathrm{ML}_{k,m}(y_1^n)\exp\{-\mathrm{pen}(n,k,m)+\mathrm{pen}(n,k_0,m_0)\}.$$
Using the bound $\mathrm{ML}_{k,m}(y_1^n)\le\mathrm{KT}_{k,m}(y_1^n)\exp\{N(k,m)/2\cdot\log n+\tau(n,k,m)\}$ given by Lemma 3.4 yields the expected result. Thus, the proof is complete. □

The proof of Proposition 3.3 is now at hand.
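For concreteness, the quantities $N(k,m)$, $\tau(n,k,m)$ of (6) and the penalty of (7) can be computed directly. The sketch below is ours: it uses $r=4$ (the DNA alphabet of Section 4), an arbitrary $\alpha=1.1$, and takes $\varphi(k,m)$ as the cardinality suggested after the theorem. The dimension formula $N(k,m)=k(k-1)+kr^m(r-1)$ is an assumption consistent with the values of Table 1.

```python
import math

def N(k, m, r=4):
    # Free parameters: k(k-1) for the transition matrix A,
    # k * r^m * (r-1) for the conditional matrix B (assumed formula,
    # consistent with Table 1: N(2,0)=8, N(3,0)=15, N(4,0)=24, N(2,1)=26).
    return k * (k - 1) + k * r**m * (r - 1)

def tau(n, k, m, r=4):
    # The correction term of (6), via log-Gamma for numerical stability.
    lg = math.lgamma
    val = (math.log(k) + m * math.log(r)
           - k * (lg(k / 2) - lg(0.5))
           - k * r**m * (lg(r / 2) - lg(0.5))
           + k**2 * (k - 1) / (4 * n)
           + k * r**(m + 1) * (r - 1) / (4 * n)
           + 5 * k * (1 + r**m) / (24 * n))
    return max(0.0, val)

def pen(n, k, m, alpha=1.1, r=4):
    # Cumulated-BIC-style penalty of (7); phi(k,m) = #{(k',m') <= (k,m)}.
    smaller = [(k2, m2) for k2 in range(1, k + 1) for m2 in range(0, m + 1)]
    phi = len(smaller)
    return (0.5 * sum(N(k2, m2, r) * math.log(n) + tau(n, k2, m2, r)
                      for (k2, m2) in smaller)
            + alpha * phi * math.log(n))
```

Since every summand is positive, the penalty is automatically increasing along the partial order, which is the property exploited in the non-overestimation step.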


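The flavour of Lemma 3.4, namely that the pointwise redundancy $\log(\mathrm{ML}/\mathrm{KT})$ is of order $\frac12N\log n+O(1)$, can be checked numerically in the simplest case of i.i.d. binary sequences (one regime, no memory, $N=1$), where the Krichevsky-Trofimov coding probability has a sequential add-1/2 form. This is our own illustration; the constant 1.0 below is a safe numerical margin, not the paper's $\tau(n,k,m)$.

```python
import math

def log_ml(bits):
    """Log of the maximized i.i.d. Bernoulli likelihood of a binary string."""
    n, n1 = len(bits), sum(bits)
    n0 = n - n1
    out = 0.0
    if n0: out += n0 * math.log(n0 / n)
    if n1: out += n1 * math.log(n1 / n)
    return out

def log_kt(bits):
    """Log KT coding probability: sequentially predict with add-1/2 counts,
    P(next = b) = (count_b + 1/2) / (t + 1)."""
    c = [0, 0]
    out = 0.0
    for t, b in enumerate(bits):
        out += math.log((c[b] + 0.5) / (t + 1))
        c[b] += 1
    return out

n = 1024
bound = 0.5 * math.log(n) + 1.0   # (1/2) N log n + margin, with N = 1
sequences = [[0] * n,                       # most favourable for ML
             [0, 1] * (n // 2),             # balanced counts
             [0] * (3 * n // 4) + [1] * (n // 4)]
redundancies = [log_ml(s) - log_kt(s) for s in sequences]
```

The redundancy is always nonnegative because KT is a mixture of the Bernoulli laws over which ML maximizes.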
Proof of Proposition 3.3. Let us denote by $A_n$ the event $\{(\widehat{k,m})_n\succ(k_0,m_0)\}$. By virtue of the Borel-Cantelli lemma, it is sufficient to prove that $\sum_{n\ge1}P_0(A_n)$ is finite in order to conclude that overestimation eventually does not occur, $P_0$-almost surely.

Let us assume that $Q_{k,m}=\mathrm{NML}_{k,m}$ or $\mathrm{KT}_{k,m}$ (the very similar proof in the case $Q_{k,m}=\mathrm{ML}_{k,m}$ is omitted). If $C_0$ bounds the sequence $\{\exp\tau(n,k_0,m_0)\}_n$, then
$$P_0\{A_n\}=\sum_{(k,m)\succ(k_0,m_0)}P_0\{(\widehat{k,m})_n=(k,m)\}\\
\overset{(a)}{\le}\sum_{(k,m)\succ(k_0,m_0)}\exp\Big\{\frac12N(k,m)\log n+\tau(n,k_0,m_0)-\mathrm{pen}(n,k,m)+\mathrm{pen}(n,k_0,m_0)\Big\}\\
\overset{(b)}{\le}\sum_{(k,m)\succ(k_0,m_0)}\exp\Big\{-\Big[\sum_{(k,m)\succ(k',m')\succ(k_0,m_0)}\frac12N(k',m')\log n\Big]+\tau(n,k_0,m_0)-\alpha[\varphi(k,m)-\varphi(k_0,m_0)]\log n\Big\}\\
\le C_0\sum_{(k,m)\succ(k_0,m_0)}\exp\{-\alpha[\varphi(k,m)-\varphi(k_0,m_0)]\log n\}.$$
Here, Proposition 3.5 and $N(k,m)\ge N(k_0,m_0)$ (for all $(k,m)\succ(k_0,m_0)$) yield (a), and (b) follows from the definition of the penalty term (note that the second sum may be empty). Now $\varphi:\mathbb{N}^\star\times\mathbb{N}\to\mathbb{N}$ increases, hence
$$P_0\{A_n\}\le C_0\sum_{j\ge1}\exp\{-\alpha j\log n\}\le C_0n^{-\alpha}(1-n^{-\alpha})^{-1}=O(n^{-\alpha}).$$
Since $\alpha>1$, the sum $\sum_nP_0\{A_n\}$ is finite, and the proof is complete. □

3.2. No underestimation

In this section, we prove that, $P_0$-almost surely, $(\widehat{k,m})_n$ does not underestimate the true order $(k_0,m_0)$ eventually.

Proposition 3.6. Under the assumptions of Theorem 3.1, $P_0$-almost surely, $(\widehat{k,m})_n\nprec(k_0,m_0)$ eventually.

The first step while proving Proposition 3.6 is to relate the distribution of $(\widehat{k,m})_n$ to the behaviour of the logarithm of the maximum likelihood ratio $[\log\mathrm{ML}_{k,m}(Y_1^n)-\log P_0(Y_1^n)]$. This is the purpose of Lemma 3.7, whose proof is given in the appendix. From now on, "infinitely often" abbreviates to "i.o.".

Lemma 3.7.
Under the assumptions of Theorem 3.1, for every $k\ge1$ and $m\ge0$, there exists a sequence $\{\varepsilon_n\}$ of random variables that converges to zero $P_0$-almost surely such that, for all $n\ge1$,
$$P_0\{(\widehat{k,m})_n=(k,m)\ \text{i.o.}\}\le P_0\Big\{\frac1n[\log\mathrm{ML}_{k,m}(Y_1^n)-\log P_0(Y_1^n)]\ge\varepsilon_n\ \text{i.o.}\Big\}.$$

Now, Proposition 3.6 essentially relies on two properties: a) the existence of a convenient Strong Law of Large Numbers for logarithms of likelihood ratios, in the spirit of the Shannon-Breiman-McMillan theorem (see Lemma 3.8); b) the existence of a finite sieve for the set of all ergodic distributions in $\Pi_{k,m}$ (see Lemma 3.9).

Let us recall that, for any probability measures $P_1$ and $P_2$ on the same measurable space $(\Omega,\mathcal{A})$, the relative entropy $D(P_1|P_2)$ is defined by
$$D(P_1|P_2)=\int\log\frac{dP_1}{dP_2}\,dP_1,$$
if $P_1$ is absolutely continuous with respect to $P_2$, and $+\infty$ otherwise. Now, consider any probability measures $P_1$ and $P_2$ on the same sequence space $(\Omega^{\mathbb{N}},\mathcal{A}^{\mathbb{N}})$, with marginals onto $(\Omega^n,\mathcal{A}^n)$ denoted by $P_1^n$ and $P_2^n$, respectively. The asymptotic relative entropy $D_\infty(P_1|P_2)$ (or divergence rate) is defined, when it exists, by
$$D_\infty(P_1|P_2)=\lim_{n\to\infty}\frac1nD(P_1^n|P_2^n).$$

Lemma 3.8 (Shannon-Breiman-McMillan). Let $\{Y_j\}_{j\ge1}$ be an ergodic stationary process whose distribution $P_0$ belongs to $\cup_{k\ge1,m\ge0}\Pi_{k,m}$. For all $k\ge1$, $m\ge0$ and any stationary ergodic $P_\theta\in\Pi_{k,m}$, the divergence rate $D_\infty(P_0|P_\theta)$ exists and is finite. Moreover, $P_0$-almost surely,
$$\lim_{n\to\infty}\frac1n[\log P_\theta(Y_1^n)-\log P_0(Y_1^n)]=-D_\infty(P_0|P_\theta).\tag{8}$$

We omit the proof of Lemma 3.8, which is a generalization of a similar classical theorem that holds for hidden Markov models [2,9,11,15].
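The convergence (8) can be checked numerically in a transparent special case of our own choosing: two fully observed Markov chains (a single regime), where the divergence rate has the closed form $D_\infty(P_1|P_2)=\sum_x\pi_1(x)\sum_yp_1(y|x)\log[p_1(y|x)/p_2(y|x)]$. The transition matrices below are arbitrary toy choices, and the initial-law contribution is ignored since it vanishes in the limit.

```python
import math
import random

random.seed(2)

# Two irreducible Markov transition matrices on {0, 1} (toy choices).
P1 = [[0.9, 0.1], [0.2, 0.8]]
P2 = [[0.7, 0.3], [0.4, 0.6]]

# Stationary law of P1: pi = (2/3, 1/3), solving pi P1 = pi.
pi1 = [2 / 3, 1 / 3]

# Closed-form divergence rate D_inf(P1 | P2).
d_exact = sum(pi1[x] * sum(P1[x][y] * math.log(P1[x][y] / P2[x][y])
                           for y in range(2))
              for x in range(2))

def sample_path(P, n, x0=0):
    """Simulate a binary Markov chain with transition matrix P."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(0 if random.random() < P[xs[-1]][0] else 1)
    return xs

# Monte Carlo version of (8) with P_0 = P1 generating the data:
# n^{-1} [log P1(Y_1^n) - log P2(Y_1^n)] should approach D_inf(P1 | P2).
n = 200_000
ys = sample_path(P1, n)
loglik_ratio = sum(math.log(P1[a][b] / P2[a][b]) for a, b in zip(ys, ys[1:]))
d_mc = loglik_ratio / n
```

The sign convention matches Lemma 3.8: the normalized log-likelihood ratio of the wrong model against the truth tends to minus the divergence rate.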
Lemma 3.8 notably ensures the existence of $D_\infty(P_1|P_2)$ for stationary ergodic distributions $P_1$ and $P_2$ belonging to $\cup_{k\ge1,m\ge0}\Pi_{k,m}$.

Stating the existence of a finite sieve involves two new subsets. For any $\delta>0$, let us denote by $\Pi^{k,m}_\delta$ the subset of stationary probabilities $P_\theta$ in $\Pi_{k,m}$ such that $\theta$ has all its coordinates lower bounded by $\delta$. Moreover, let $\Pi^{k,m}_e$ stand for the subset of stationary ergodic probabilities in $\Pi_{k,m}$.

Lemma 3.9. Let us set $k\ge1$ and $m\ge0$. For every $\varepsilon>0$, there exist $\delta>0$, a finite set of indexes $I_\varepsilon^{k,m}$ and a finite set of stationary probabilities $\{P_i\}_{i\in I_\varepsilon^{k,m}}$ included in $\Pi^{k,m}_\delta$ such that, for all stationary ergodic $P_\theta\in\Pi_{k,m}$, there exists some $P_i$ ($i\in I_\varepsilon^{k,m}$) which guarantees that:
$$\sup_{n\in\mathbb{N}^\star}\max_{y_1^n\in\mathcal{Y}^n}\frac1n[\log P_\theta(y_1^n)-\log P_i(y_1^n)]\le\varepsilon.$$

Lemma 3.9 is a key for replacing the term $\log P_\theta$ in the left-hand side of (8) by $\log\mathrm{ML}_{k,m}$, and the right-hand side term of the same equation by $-\inf_PD_\infty(P_0|P)$ (for $P$ ranging over $\Pi^{k,m}_e$). Its proof is given in the appendix.

Proof of Proposition 3.6. Let us set $\varepsilon>0$ such that
$$\min_{(k,m)\prec(k_0,m_0)}\ \inf_{P\in\Pi^{k,m}_e}D_\infty(P_0|P)>\varepsilon.$$
Such an $\varepsilon$ exists according to a result (whose generalization is easy and omitted in our framework) first obtained by [14], Propositions 1 and 2.

Let us choose arbitrarily $(k,m)\prec(k_0,m_0)$ and prove that $P_0\{(\widehat{k,m})_n=(k,m)\ \text{i.o.}\}=0$. According to Lemma 3.7, there exists a sequence $\{\varepsilon_n\}$ of random variables that converges to zero $P_0$-almost surely such that
$$P_0\{(\widehat{k,m})_n=(k,m)\ \text{i.o.}\}\le P_0\Big\{\frac1n[\log\mathrm{ML}_{k,m}(Y_1^n)-\log P_0(Y_1^n)]\ge\varepsilon_n\ \text{i.o.}\Big\}.$$
Now, Lemma 3.9 guarantees the existence of a finite set $\{P_i\}_{i\in I_\varepsilon^{k,m}}$ of stationary probability measures which belong to $\Pi^{k,m}_\delta\subset\Pi_{k,m}$ such that
$$P_0\{(\widehat{k,m})_n=(k,m)\ \text{i.o.}\}\le P_0\Big\{\frac1n\Big[\max_{i\in I_\varepsilon^{k,m}}\log P_i(Y_1^n)-\log P_0(Y_1^n)\Big]\ge(-\varepsilon+\varepsilon_n)\ \text{i.o.}\Big\}\\
\le\sum_{i\in I_\varepsilon^{k,m}}P_0\Big\{\frac1n[\log P_i(Y_1^n)-\log P_0(Y_1^n)]\ge(-\varepsilon+\varepsilon_n)\ \text{i.o.}\Big\}.$$


Table 1. The four smallest dimensions $N(k,m)$ of MCMR of order $(k,m)$ when $r=4$.

          k = 2   k = 3   k = 4
  m = 0       8      15      24
  m = 1      26

Finally, Lemma 3.8 yields the convergence of $n^{-1}[\log P_i(Y_1^n)-\log P_0(Y_1^n)]$ to $-D_\infty(P_0|P_i)$, $P_0$-almost surely, for all $i\in I_\varepsilon^{k,m}$. The choice of $\varepsilon$ then ensures that
$$P_0\{(\widehat{k,m})_n=(k,m)\ \text{i.o.}\}=0.$$
Since $(k,m)\prec(k_0,m_0)$ was chosen arbitrarily, the previous equation implies that
$$P_0\{(\widehat{k,m})_n\prec(k_0,m_0)\ \text{i.o.}\}=0$$
or, put in other words, that $P_0$-almost surely, $(\widehat{k,m})_n\nprec(k_0,m_0)$ eventually. Thus, the proof is complete. □

4. Simulation study

In this section, we choose to discard the case $k=1$. Indeed, this case corresponds to Markov models and thus to a data dependency structure which is very different from that of the case where there are at least two regimes. This distinction does not appear in the theoretical part of this article. However, all results (and their proofs) can be easily adapted to that slightly different framework. Finally, note that in practice, MCMR modelling with at least two different regimes ($k\ge2$) is used for data with no finite memory. MCMR with one regime (Markov models) fit such data poorly.

Our theoretical study is motivated by application to biology and, more precisely, to genome analysis. Choosing a good model within a prescribed family is a very sensitive task. In [18], MCMR order selection (not identification) is performed for mining Bacillus subtilis chromosome heterogeneity. After fitting all models with $k\in\{2,\ldots,8\}$ and $m\in\{0,1,2,3\}$, the authors select (by eyeball and using biological considerations) a MCMR of order $(k,m)=(3,2)$ for detecting atypical segments of length approximately 25 kb (1 kb equals 1,000 nucleotides) upon the 4,200 kb long chromosome. In this framework, $\mathcal{Y}$ stands for the nucleotides set $\{A,C,G,T\}$ ($r=4$). In particular, the four smallest dimensions of MCMR are given in Table 1.

In order to illustrate our work, we undertake a simple simulation study in the framework described above. Evaluation of $\mathrm{ML}_{k,m}(y_1^n)$ is processed by the Expectation-Maximization (EM) algorithm [4,7]. We run EM with multiple random initializations, and select the final result presenting the highest value. We use the package SHOW [19], where SHOW stands for Structured HOmogeneities Watcher. It is a set of executable programs that implements different uses of MCMR models for DNA sequences. The source code of SHOW is freely available. The software is protected by the GNU Public Licence.

We arbitrarily decide to consider only MCMR of dimension at most 26. The corresponding orders $(k,m)$ appear in Table 1. Set $\mathcal{M}=\{\Pi_{k,m}:(k,m)\in\mathbb{N}^\star\times\mathbb{N},\ N(k,m)\le26\}$. For each model $M_0\in\mathcal{M}$ (line 1 in Fig. 1), we repeat 10 times (line 2) the following: we choose $P_0\in M_0$ (line 3), then simulate a chain $y_1^n$ ($n=100\,000$) with distribution $P_0$ (line 4); next, for each model $M\in\mathcal{M}$ (line 5) and for each $\tilde n\in\{25\,000;50\,000;100\,000\}$ (line 6), we evaluate $\sup_{P\in M}P(y_1^{\tilde n})$ (line 7). Afterwards, identifying the order boils down to applying (4) for a particular choice of penalty term.
Before discussing this final step, let us go into <strong>de</strong>tails about the way we choose P 0 ∈ M 0(line 3). Because this simulation study is motivated by [18], we choose the final distribution obtained by fittingthe same segment [3450001; 3475000] of length 25 kb of the Bacillus Subtilis chromosome than used in [18],Figure 1. For each repetition, a possibly slightly different distribution P 0 is thus selected (EM is run withmultiple random initializations).□ORDER ESTIMATION FOR MARKOV CHAINS WITH MARKOV REGIME 471foreach(M 0 ∈M) {2 repeat (10 times) {3 choice of a distribution P 0 in mo<strong>de</strong>l M 04 simulation of a chain y n 1 with distribution P 05 foreach (M ∈M) {6 foreach (ñ ∈{25 000; 50 000; 100 000}) {7 EM-evaluation of sup P P(yñ1 )forP ranging over M8 }9 }10 }11}Figure 1. Evaluation of ML k,m (yñ1 ) for various mo<strong>de</strong>ls in<strong>de</strong>x (k,m) and simulated observations(ñ ∈{25 000;50 000;100000},n= 100 000).y n 1This simulation study validates Theorem 3.1: whenñ = 50 000 and ñ = 100 000, ( ̂k,m)ñ =(k 0 ,m 0 )tentimesout of ten for each true un<strong>de</strong>rlying mo<strong>de</strong>l of or<strong>de</strong>r (k 0 ,m 0 ). Interestingly, this numerical evi<strong>de</strong>nce of consistencyfor very large values of ñ does not inclu<strong>de</strong> the case ñ = 25 000. In<strong>de</strong>ed consistency then fails: ( ̂k,m)ñ =(k 0 ,m 0 )ten times out of ten when (k 0 ,m 0 )=(2, 0), ( ̂k,m)ñ =(k 0 ,m 0 ) eight times out of ten when (k 0 ,m 0 )=(3, 0)[(2, 0) otherwise], ( ̂k,m)ñ =(k 0 ,m 0 ) two times out of ten when (k 0 ,m 0 )=(4, 0) [(3, 0) otherwise], and finally( ̂k,m)ñ =(3, 0) ≠(k 0 ,m 0 ) ten times out of ten when (k 0 ,m 0 )=(2, 1). Each time ( ̂k,m)ñ differs from (k 0 ,m 0 ),one has N(( ̂k,m)ñ) ≤ N(k 0 ,m 0 ). 
In other words, our penalty is too heavy for that sample size: the asymptotic regime is arguably not yet reached when ñ = 25,000, whereas it is when ñ ≥ 50,000.

We emphasized earlier that our penalty is heavier than the BIC penalty (i.e. (1/2) N(k, m) log n). How does the BIC estimator behave? For every sample size ñ ∈ {25,000; 50,000; 100,000} and every true underlying model, the BIC estimator coincides ten times out of ten with the true order. For this estimator, the asymptotic regime is already reached when ñ = 25,000. Note that a slight modification of our penalty function yields another estimator which performs as well as the BIC one: if we replace pen(ñ, k, m) as defined in (7) by (1/2) pen(ñ, k, m), then the new estimator equals the true order ten times out of ten for every sample size ñ and every true underlying model. One may finally wonder for which sample size the BIC criterion reaches its asymptotic regime. While the BIC estimator's behaviour is still perfect when ñ = 25,000, it actually fails when ñ = 15,000. Denote by (k̃, m̃)_n the BIC estimator: (k̃, m̃)_n = (k_0, m_0) ten times out of ten when (k_0, m_0) = (2, 0); eight times out of ten when (k_0, m_0) = (3, 0) [(2, 0) otherwise]; ten times out of ten when (k_0, m_0) = (4, 0); and finally nine times out of ten when (k_0, m_0) = (2, 1) [(3, 0) otherwise]. Again, each time (k̃, m̃)_n differs from (k_0, m_0), one has N((k̃, m̃)_n) ≤ N(k_0, m_0).
Matters are even worse when ñ = 10,000, where we obtain (k̃, m̃)_n = (k_0, m_0) ten times out of ten when (k_0, m_0) = (2, 0); eight times out of ten when (k_0, m_0) = (3, 0) [(2, 0) otherwise]; eight times out of ten when (k_0, m_0) = (4, 0) [(3, 0) otherwise]; and finally nine times out of ten when (k_0, m_0) = (2, 1) [(3, 0) otherwise].

In conclusion, we apply the BIC criterion to the original sequence of Bacillus subtilis: the resulting order estimator equals (2, 1) (results are reported in Tab. 2). Our estimator equals (3, 0).

A. Appendix A. Proof of Lemma 3.7

Let us set k ≥ 1 and m ≥ 0. The proof is straightforward when Q_{k,m} = ML_{k,m}. Indeed,

P_0{ (k̂, m̂)_n = (k, m) i.o. } ≤ P_0{ (1/n)[log ML_{k,m}(Y_1^n) − log P_0(Y_1^n)] ≥ −pen(n, k_0, m_0)/n i.o. }

and pen(n, k_0, m_0) = o(n).
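The two selections compared in this conclusion can be reproduced from the values printed in Table 2; the snippet below simply recomputes the two penalized criteria (the numbers are copied from the table).

```python
# Values reported in Table 2 (original sequence, length n = 25 000).
orders = [(2, 0), (3, 0), (4, 0), (2, 1)]
log_ml = {(2, 0): -34372.5, (3, 0): -34197.2, (4, 0): -34075.9, (2, 1): -33984.3}
bic_pen = {(2, 0): 40.5, (3, 0): 75.9, (4, 0): 121.5, (2, 1): 131.6}
our_pen = {(2, 0): 43.5, (3, 0): 124.3, (4, 0): 251.8, (2, 1): 391.3}

def select(pen):
    # Penalized maximum likelihood: minimize -log ML_{k,m} + pen(n, k, m).
    return min(orders, key=lambda km: -log_ml[km] + pen[km])

print(select(bic_pen))  # (2, 1): the BIC order estimator
print(select(our_pen))  # (3, 0): the estimator built on pen(n, k, m)
```

One can also check that the reported BIC penalties equal (1/2) N(k, m) log 25,000 for N(k, m) ∈ {8, 15, 24, 26}, the four smallest MCMR dimensions of Table 1.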


48 A. CHAMBAZ AND C. MATIAS

Table 2. EM-evaluated maximum likelihood of the original sequence (length n = 25,000) for MCMR models of order (k, m), and BIC penalty (1/2) N(k, m) log n. The resulting BIC order estimator equals (2, 1).

(k, m)                (2, 0)      (3, 0)      (4, 0)      (2, 1)
log ML_{k,m}(y_1^n)   −34372.5    −34197.2    −34075.9    −33984.3
BIC penalty           40.5        75.9        121.5       131.6
pen(n, k, m)          43.5        124.3       251.8       391.3

Let us assume that Q_{k,m} = NML_{k,m} or KT_{k,m}. Since pen(n, k, m) is non-negative, the definition of (k̂, m̂)_n readily yields that

P_0{ (k̂, m̂)_n = (k, m) i.o. } ≤ P_0{ log ML_{k,m}(Y_1^n) − log P_0(Y_1^n) ≥ log [ML_{k,m}(Y_1^n)/Q_{k,m}(Y_1^n)] − log [P_0(Y_1^n)/Q_{k_0,m_0}(Y_1^n)] − pen(n, k_0, m_0) i.o. }.

Then, by virtue of Lemma 3.4, it holds that

(1/n) max_{y_1^n ∈ Y^n} | log [ML_{k,m}(y_1^n)/Q_{k,m}(y_1^n)] | → 0 as n → ∞,    (9)

(1/n) max_{y_1^n ∈ Y^n} log [P_0(y_1^n)/Q_{k_0,m_0}(y_1^n)] ≤ (1/n) max_{y_1^n ∈ Y^n} log [ML_{k_0,m_0}(y_1^n)/Q_{k_0,m_0}(y_1^n)] → 0 as n → ∞.    (10)

The final step is a variant of the so-called Barron's lemma ([9], Theorem 4.4.1): a smart application of the Borel–Cantelli lemma yields that, P_0-almost surely,

lim inf_{n→∞} (1/n) log [P_0(Y_1^n)/Q_{k_0,m_0}(Y_1^n)] ≥ lim inf_{n→∞} (−2 log n)/n = 0.    (11)

Now, combining (9), (10) and (11) with pen(n, k, m) = o(n) ensures the existence of a sequence {ε_n} of random variables that converges to zero P_0-almost surely and such that

P_0{ (k̂, m̂)_n = (k, m) i.o. } ≤ P_0{ (1/n)[log ML_{k,m}(Y_1^n) − log P_0(Y_1^n)] ≥ ε_n i.o. }.

This concludes the proof of Lemma 3.7.

B. Appendix B. Proof of Lemma 3.9 for the existence of finite sieves

Lemma B.1 is a simple generalization of a result of Liu and Narayan [16] (Lem. 2.6), so we omit its proof. The proof of Lemma B.2 is also adapted from [16] (see their Ex. 2). The details are postponed until after the proof of Lemma 3.9.

Proof of Lemma 3.9. Let us set ε > 0.
According to Lemma B.1, for each θ_δ ∈ Θ_{k,m,δ} there exists an open ball B(θ_δ) ⊂ Θ_{k,m,δ} such that, for every θ ∈ B(θ_δ),

sup_{n∈N*} max_{y_1^n ∈ Y^n} (1/n) | log P_θ(y_1^n) − log P_{θ_δ}(y_1^n) | ≤ ε/2.

Since Θ_{k,m,δ} is a compact set, the Borel–Lebesgue property ensures the existence of a finite subset {θ_δ^i : i ∈ I_ε} of Θ_{k,m,δ} such that ∪_{i∈I_ε} B(θ_δ^i) = Θ_{k,m,δ}. Let us denote by P_i the probability measure P_{θ_δ^i} (for each i ∈ I_ε). In summary, for all θ_δ ∈ Θ_{k,m,δ}, there exists i ∈ I_ε such that

sup_{n∈N*} max_{y_1^n ∈ Y^n} (1/n) | log P_{θ_δ}(y_1^n) − log P_i(y_1^n) | ≤ ε/2.    (12)

Let us set δ ≤ ε/[4(k² + r²)]. By virtue of Lemma B.2, for every θ ∈ Θ_{k,m,e}, there exists θ_δ ∈ Θ_{k,m,δ} such that

sup_{n∈N*} max_{y_1^n ∈ Y^n} (1/n) [ log P_θ(y_1^n) − log P_{θ_δ}(y_1^n) ] ≤ 2(k² + r²) δ ≤ ε/2.    (13)

Combining (12) and (13) concludes the proof. □

Let us set k ≥ 1 and m ≥ 0, and recall that the cardinality of Y is denoted by r. The proof of Lemma 3.9 is a straightforward consequence of the two lemmas below.

Lemma B.1. For all δ > 0, the set of functions θ ↦ P_θ(y_1^n) indexed by n ∈ N* and y_1^n ∈ Y^n is equicontinuous over Θ_{k,m,δ}.

Lemma B.2. For every θ ∈ Θ_{k,m,e} and δ > 0 small enough, there exists θ_δ ∈ Θ_{k,m,δ} such that, for all n ∈ N* and y_1^n ∈ Y^n, the following bound holds:

P_θ(y_1^n) ≤ P_{θ_δ}(y_1^n) (1 − k²δ)^{−n} (1 − r²δ)^{−n},

hence

(1/n) [ log P_θ(y_1^n) − log P_{θ_δ}(y_1^n) ] ≤ −log(1 − k²δ) − log(1 − r²δ).

Proof of Lemma B.2. Set θ = (A, B) ∈ Θ_{k,m,e} (see Def. 2 for the decomposition of the parameter θ) and δ > 0. The parameter θ_δ is constructed in the following way. For each row i ∈ {1, ..., k} of matrix A, replace the maximal coefficient a(i, j_max) by a(i, j_max) − (k − 1)δ, then add δ to the other coefficients of this row. This yields the new parameter A_δ. Moreover, for each fixed "row" (t^m; x) ∈ Y^m × X, replace the maximal coefficient of matrix B, namely b(j_max | t^m; x), by b(j_max | t^m; x) − (r − 1)δ, then add δ to the other coefficients.

It is easily checked that the constructed parameter θ_δ = (A_δ, B_δ) belongs to Θ_{k,m,δ} for δ ≤ 1/max(k², r²). Besides, it is also readily seen that, for all i, j ∈ {1, ..., k} and (t^m; x) ∈ Y^m × X,

a(i; j) ≤ a_δ(i; j)/(1 − k²δ)    and    b(j | t^m; x) ≤ b_δ(j | t^m; x)/(1 − r²δ).

Therefore, for all n ∈ N* and y_1^n ∈ Y^n,

(1/n) [ log P_θ(y_1^n) − log P_{θ_δ}(y_1^n) ] ≤ −log(1 − k²δ) − log(1 − r²δ) ≤ 2(k² + r²)δ.

This concludes the proof, because −log(1 − u) ≤ 2u for any u small enough. □

Acknowledgements. We want to thank the associate editor and a referee, whose comments led to important improvements of this paper.


References

[1] D. Blackwell and L. Koopmans, On the identifiability problem for functions of finite Markov chains. Ann. Math. Stat. 28 (1957) 1011–1015.
[2] S. Boucheron and E. Gassiat, Order estimation and model selection, in Inference in hidden Markov models, Olivier Cappé, Eric Moulines and Tobias Rydén (Eds.), Springer Series in Statistics. New York, NY: Springer (2005).
[3] R.J. Boys and D.A. Henderson, A Bayesian approach to DNA sequence segmentation. Biometrics 60 (2004) 573–588.
[4] O. Cappé, E. Moulines and T. Rydén (Eds.), Inference in hidden Markov models. Springer Series in Statistics (2005).
[5] I. Csiszár and Z. Talata, Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inf. Theory 52 (2006) 1007–1016.
[6] L.D. Davisson, R.J. McEliece, M.B. Pursley and M.S. Wallace, Efficient universal noiseless source codes. IEEE Trans. Inf. Theory 27 (1981) 269–279.
[7] A.P. Dempster, N.M. Laird and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B 39 (1977) 1–38. With discussion.
[8] Y. Ephraim and N. Merhav, Hidden Markov processes. IEEE Trans. Inf. Theory (special issue in memory of Aaron D. Wyner) 48 (2002) 1518–1569.
[9] L. Finesso, Consistent estimation of the order for Markov and hidden Markov chains. Ph.D. Thesis, University of Maryland, ISR, USA (1991).
[10] C.-D. Fuh, Efficient likelihood estimation in state space models. Ann. Stat. 34 (2006) 2026–2068.
[11] E. Gassiat and S. Boucheron, Optimal error exponents in hidden Markov model order estimation. IEEE Trans. Inf. Theory 49 (2003) 964–980.
[12] E.J. Hannan, The estimation of the order of an ARMA process. Ann. Stat. 8 (1980) 1071–1081.
[13] H. Ito, S.I. Amari and K. Kobayashi, Identifiability of hidden Markov information sources and their minimum degrees of freedom. IEEE Trans. Inf. Theory 38 (1992) 324–333.
[14] J.C. Kieffer, Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Trans. Inf. Theory 39 (1993) 893–902.
[15] B.G. Leroux, Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl. 40 (1992) 127–143.
[16] C.C. Liu and P. Narayan, Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. IEEE Trans. Inf. Theory 40 (1994) 1167–1180.
[17] R.J. MacKay, Estimating the order of a hidden Markov model. Canadian J. Stat. 30 (2002) 573–589.
[18] P. Nicolas, L. Bize, F. Muri, M. Hoebeke, F. Rodolphe, S.D. Ehrlich, B. Prum and P. Bessières, Mining Bacillus subtilis chromosome heterogeneities using hidden Markov models. Nucleic Acids Res. 30 (2002) 1418–1426.
[19] P. Nicolas, A.S. Tocquet and F. Muri-Majoube, SHOW User Manual (2004). URL: www-mig.jouy.inra.fr/ssb/SHOW/show doc.pdf. Software available at URL: http://www-mig.jouy.inra.fr/ssb/SHOW/.
[20] B.M. Pötscher, Estimation of autoregressive moving-average order given an infinite number of models and approximation of spectral densities. J. Time Ser. Anal. 11 (1990) 165–179.
[21] C.P. Robert, T. Rydén and D.M. Titterington, Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. J. R. Stat. Soc., Ser. B, Stat. Methodol. 62 (2000) 57–75.
[22] T. Rydén, Estimating the order of hidden Markov models. Statistics 26 (1995) 345–354.
[23] Y.M. Shtar'kov, Universal sequential coding of single messages. Probl. Inf. Trans. 23 (1988) 175–186.


Journal of Statistical Planning and Inference 139 (2009) 962–977

A minimum description length approach to hidden Markov models with Poisson and Gaussian emissions. Application to order identification

A. Chambaz (a,*), A. Garivier (b), E. Gassiat (c)

a MAP5, Université Paris Descartes, France
b CNRS & TELECOM ParisTech, France
c Laboratoire de Mathématiques, Université Paris-Sud, France

Article history: received 23 November 2005; received in revised form 28 June 2007; accepted 6 June 2008; available online 26 June 2008.

Keywords: BIC; infinite alphabet; model selection; order estimation

Abstract. We address the issue of order identification for hidden Markov models with Poisson and Gaussian emissions. We prove information-theoretic BIC-like mixture inequalities in the spirit of Finesso [1991. Consistent estimation of the order for Markov and hidden Markov chains. Ph.D. Thesis, University of Maryland]; Liu and Narayan [1994. Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. IEEE Trans. Inform. Theory 40(4), 1167–1180]; Gassiat and Boucheron [2003. Optimal error exponents in hidden Markov model order estimation. IEEE Trans. Inform. Theory 49(4), 964–980]. These inequalities lead to consistent penalized estimators that need no prior bound on the order. A simulation study and an application to postural analysis in humans are provided. © 2008 Elsevier B.V. All rights reserved.

1. Introduction

Hidden Markov models (HMM) were formally introduced by Baum and Petrie (1966). Since then, they have proved useful in various applications, from speech recognition (Levinson et al., 1983) to blind deconvolution of unknown communication channels (Kaleh and Vallet, 1994), biostatistics (Koski, 2001) and meteorology (Hughes and Guttorp, 1994). For a mathematical survey of HMM, see Ephraim and Merhav (2002) and Cappé et al. (2005). Mixture models with independent observations are a particular case of HMMs.

In most practical cases, the order of the model (i.e. the true number of hidden states) is unknown and has to be estimated. There is an extensive literature dedicated to the issue of order estimation. The particular case of order estimation for mixtures of continuous densities with independent identically distributed (abbreviated to i.i.d) observations is notoriously challenging (see Chambaz, 2006 for a comprehensive bibliography).
It has been addressed through various methods: ad hoc or minimum distance (Henna, 1985; Chen and Kalbfleisch, 1996; Dacunha-Castelle and Gassiat, 1997; James et al., 2001), maximum likelihood (Leroux, 1992b; Keribin, 2000; Gassiat, 2002; Chambaz, 2006) or Bayesian (Ishwaran et al., 2001; Chambaz and Rousseau, 2008). Actually, the Bayesian literature on order selection in mixture models is essentially devoted to determining coherent non-informative priors (see for instance Moreno and Liseo, 2003) and to implementing procedures (see for instance Mengersen and Robert, 1996).

Order estimation in HMMs is much more difficult. It has been proved that, even if the null hypothesis is true, the maximum likelihood test statistic is unbounded (Gassiat and Kéribin, 2000); in the case of independent mixtures this happens only if parameters are unbounded, see Azais et al. (2006) and references therein. This is why the choice of a penalty yielding penalized maximum likelihood estimators that do not over-estimate the order is a difficult problem. Earlier results on penalized maximum likelihood estimators (as in Finesso, 1991) and Bayesian procedures (as in Liu and Narayan, 1994) assume a prior upper bound on the order.

* Corresponding author. E-mail addresses: antoine.chambaz@univ-paris5.fr (A. Chambaz), garivier@telecom-paristech.fr (A. Garivier), elisabeth.gassiat@math.u-psud.fr (E. Gassiat).

0378-3758/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jspi.2008.06.010

In MacKay (2002), the minimum distance estimator introduced by Chen and Kalbfleisch (1996) for mixtures is extended to HMMs. Regarding finite emission alphabets, Kieffer (1993) proves the consistency of the penalized maximum likelihood estimator, with penalties increasing exponentially fast with the order and no prior upper bound.
In the same context, Gassiat and Boucheron (2003) prove almost sure (abbreviated to "a.s.") consistency with penalties increasing as a power of the order. The question of the minimal penalty which is sufficient to obtain almost sure consistency with no prior upper bound remains open.

In this paper, we address the issue of order identification for HMM with Poisson and Gaussian emissions. Rissanen (1978) introduced the minimum description length (MDL) principle, which connected model selection to coding theory via the following principle: "Choose the model that gives the shortest description of data." We prove here MDL-inspired mixture inequalities which lead to consistent penalized estimators requiring no prior bound on the order.

Let us recall the basic ideas that sustain the MDL principle. Given any k-dimensional model (i.e. a parametric family of densities indexed by Θ of dimension k ≥ 1), let E_θ be the expectation with respect to a random variable X_1^n with distribution P_θ, whose density is g_θ (with respect to Lebesgue measure). For any density q such that q(x_1^n) = 0 implies g_θ(x_1^n) = 0, the Kullback–Leibler divergence between g_θ and q is

K_n(g_θ, q) = E_θ log [g_θ(X_1^n)/q(X_1^n)] = E_θ[ −log q(X_1^n) − (−log g_θ(X_1^n)) ].

In Information Theory, −log q(X_1^n) is interpreted as the code length for X_1^n when using coding distribution q, so E_θ[−log g_θ(X_1^n)] is the ideal code length for X_1^n.
In this perspective, K_n(g_θ, q) is the average additional cost (or redundancy) caused by using the same q for compressing all g_θ (θ ∈ Θ).

If one assumes that the maximum likelihood estimator θ̂(X_1^n) achieves a √n-rate and that there exists a summable sequence {δ_n} of positive numbers such that, for every θ ∈ Θ,

P_θ{ √n ‖θ̂(X_1^n) − θ‖ ≥ log n } ≤ δ_n,

then Theorem 1 in Rissanen (1986) guarantees that

lim inf_{n→∞} K_n(g_θ, q) / ((k/2) log n) ≥ 1    (1)

for all θ ∈ Θ except on a set with Lebesgue measure 0 (that depends on q and k, the dimension of Θ). This result has a minimax counterpart for i.i.d sequences (Clarke and Barron, 1990): under mild assumptions,

K*_n = min_q sup_{θ∈Θ} K_n(g_θ, q) = (k/2) log (n/(2πe)) + O(1).    (2)

Both (1) and (2) put forward a leading term (k/2) log n that has taken on great importance in Information Theory and Statistics. The coding density q is called optimal if it achieves equality in (1). The following optimal coding distributions are often encountered in Information Theory (we refer to Barron et al., 1998; Hansen and Yu, 2001 for surveys):

• two-stage coding, which yields description length −log q(x_1^n) = −log g_{θ̂(x_1^n)}(x_1^n) + (k/2) log n;
• mixture coding, where q is a mixture of all densities g_θ (θ ∈ Θ).

We want to highlight that the quantity −log g_{θ̂(x_1^n)}(x_1^n) + (k/2) log n, also called the Bayesian Information Criterion (BIC), has been considerably studied since its first introduction by Schwarz (1978) with the aim of estimating model dimension.

Now, let us consider the following problem: given a family of models (M_i)_{i∈I}, which best represents some given data x_1^n? The MDL methodology suggests choosing the model M̂ = M_î that yields the shortest description length of x_1^n. Let k_i be the dimension of model M_i for every i ∈ I.
Each of the two optimal coding distributions presented above selects a model:

• two-stage coding chooses

M̂_BIC = arg min_{M_i (i∈I)} { −log g_{θ̂_i(x_1^n)}(x_1^n) + (k_i/2) log n },

where θ̂_i is the maximum likelihood estimator over model M_i;


• mixture coding chooses

M̂_MIX = arg min_{M_i (i∈I)} { −log q_i(x_1^n) },

where q_i is a particular mixture to be specified later; we will actually introduce a penalized version of this estimation procedure.

The challenging task is to prove that such estimators are consistent: if x_1^n is emitted by a source of density g_{θ_0} such that g_{θ_0} ∈ M_{i_0} and g_{θ_0} ∈ M_i implies M_{i_0} ⊂ M_i, then M̂ = M_{i_0} eventually a.s. This has been successfully accomplished for Markov chains by Csiszár and Shields (2000), and for context tree models (or variable length Markov chains) by Csiszár and Talata (2006) and Garivier (2006).

1.1. Organization of the paper

In Section 2 we prove inequalities that compare maximum likelihood and a particular mixture coding distribution (see Theorems 1 and 2) for HMM mixture models and i.i.d models, with Poisson or Gaussian emissions. In Section 3, these inequalities are used to calibrate a penalty yielding a.s. consistent estimators based on penalized likelihood or penalized mixture coding distributions. These require no prior bound on the orders (see Theorems 5 and 6). The penalties are heavier than BIC penalties; the question whether BIC penalties lead to consistent estimation of the order remains open. In Section 4, we investigate this question through a simulation study. An application to postural analysis in humans is also presented. Proofs of two lemmas as well as a useful result demonstrated by Leroux (1992a) are contained in Appendices A and B.

2. Mixture inequalities

2.1. Mixture inequalities for HMM mixture models

Let σ² be a positive number. The Gaussian density with mean m and variance σ² (with respect to the Lebesgue measure on the real line) is denoted by φ_{m,σ²}. The Poisson density with mean m (with respect to the counting measure on the set of non-negative integers) is denoted by π_m.

Let {X_n}_{n≥1} be a sequence of random variables with values in the measured space (X, A, μ). Let us denote by {Z_n}_{n≥0} a sequence of hidden random variables such that, conditionally on Z_1^n = (Z_1, ..., Z_n), the variables X_1, ..., X_n are independent and the distribution of each X_i only depends on Z_i (all i ≤ n). We denote by R the set of real numbers and by R_+ that of non-negative real numbers.

For every k ≥ 1, let (p_j^o : j ≤ k) ∈ R_+^k be an initial distribution, and let S_k be the set of possible transition probabilities p = (p_{jj′} : j, j′ ≤ k) ∈ R_+^{k²} (with ∑_{j′=1}^k p_{jj′} = 1 for all j ≤ k). Let C ⊂ R be a bounded set. Then the parameter set is

Θ_k = { θ = (p, m) : p ∈ S_k, m = (m_1, ..., m_k) ∈ C^k }.

Under parameter θ = (p, m) ∈ Θ_k (some k ≥ 1), {Z_n}_{n≥0} is a Markov chain with values in {1, ..., k}, initial distribution P_θ{Z_0 = j′} = p^o_{j′} and transition probabilities P_θ{Z_{i+1} = j′ | Z_i = j} = p_{jj′} (all j, j′ ≤ k). Therefore, {X_n}_{n≥1} is a HMM under parameter θ. We shall consider two examples of emission distributions:

Gaussian emission (GE): for every n ≥ 1, X_n has density φ_{m_{Z_n},σ²} conditionally on Z_n.
Poisson emission (PE): for every n ≥ 1, X_n has density π_{m_{Z_n}} conditionally on Z_n.

For all parameters θ ∈ Θ_k (any k ≥ 1), let g_θ be the density of X_1^n = (X_1, ..., X_n) under θ.
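For concreteness, the HMM with Gaussian emissions just described can be simulated in a few lines. The parameter values below (k = 2 hidden states, the transition matrix, the means and the variance) are illustrative assumptions, not values taken from the paper.

```python
import random

random.seed(1)

# Illustrative parameter theta = (p, means): k = 2 hidden states,
# transition matrix p, emission means, common variance sigma^2.
k = 2
p = [[0.9, 0.1], [0.2, 0.8]]   # p[j][j'] = P(Z_{i+1} = j' | Z_i = j)
means = [0.0, 3.0]             # m_1, ..., m_k
sigma = 1.0

def simulate_hmm_ge(n):
    # Initial state drawn uniformly over {1, ..., k}.
    z = random.randrange(k)
    xs = []
    for _ in range(n):
        # Move the hidden chain, then emit X ~ N(m_Z, sigma^2).
        u, acc = random.random(), 0.0
        for j in range(k):
            acc += p[z][j]
            if u <= acc:
                z = j
                break
        xs.append(random.gauss(means[z], sigma))
    return xs

xs = simulate_hmm_ge(10_000)
print(len(xs))
```

The stationary distribution of this chain is (2/3, 1/3), so the long-run sample mean should sit near (2/3)·0.0 + (1/3)·3.0 = 1.0.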
For every k ≥ 1, let ν_k be a prior probability on Θ_k such that, for some chosen τ > 0, under ν_k:

• p and m are independent,
• p^o_{j′} = 1/k for all j′ ≤ k are deterministic,
• the vectors (p_{jj′} : j′ ≤ k) (j ≤ k) are independently Dirichlet(1/2, ..., 1/2) distributed,
• m_1, ..., m_k are independent, identically distributed with density φ_{0,τ²} in example GE and with density Gamma(τ, 1/2) in example PE.

The related mixture statistic is defined by

q_k(X_1^n) = ∫_{Θ_k} g_θ(X_1^n) dν_k(θ).    (3)

It is worth noting that q_k is a positive function of x_1^n ∈ X^n in examples GE and PE. The main results of this section are comparisons between the maximum log-likelihood and the mixture statistics in examples GE and PE.

Denote the positive part of a real number t by (t)_+. Let X_(n) and |X|_(n) be the maxima of X_1, ..., X_n and |X_1|, ..., |X_n|, respectively. Let us also introduce, for all k, n ≥ 1,

c_kn = ( log k − k log [Γ(k/2)/Γ(1/2)] + k²(k − 1)/(4n) + k/(12n) )_+,
d_kn = (k/2) ( log(τ²/(kσ²) + 1/n) )_+,
e_kn = (k/2) ( 1 + τ − log(kτ) )_+.

Theorem 1 (HMM mixture models). Under the assumptions described above, for every integer k ≥ 1 and for every integer n ≥ 1,

GE: 0 ≤ sup_{θ∈Θ_k} log g_θ(X_1^n) − log q_k(X_1^n) ≤ (k²/2) log n + (k/(2τ²)) |X|²_(n) + c_kn + d_kn;    (4)
PE: 0 ≤ sup_{θ∈Θ_k} log g_θ(X_1^n) − log q_k(X_1^n) ≤ (k²/2) log n + kτ X_(n) + c_kn + e_kn.    (5)

2.2. Particular case of i.i.d mixture models

The i.i.d mixture model is a particular case of the HMM model. Here, {Z_n}_{n≥0} is a sequence of i.i.d random variables. For every k ≥ 1, let us introduce the set S′_k of possible discrete distributions p = (p_j^o : j ≤ k) ∈ R_+^k (with ∑_{j=1}^k p_j^o = 1); then the parameter set is

Θ′_k = { θ = (p, m) : p ∈ S′_k, m = (m_1, ..., m_k) ∈ C^k }.

Again, g_θ is the density of X_1^n under parameter θ ∈ Θ′_k. For every k ≥ 1, a new mixing probability ν′_k on Θ′_k is chosen such that, under ν′_k:

• p and m are independent,
• p is Dirichlet(1/2, ..., 1/2) distributed,
• m_1, ..., m_k are independent, identically distributed with density φ_{0,τ²} in example GE and with density Gamma(τ, 1/2) in example PE.

Equality (3), with ν′_k in place of ν_k and Θ′_k in place of Θ_k, defines a mixture statistic q_k(X_1^n) in this framework. The second main result is another comparison between the maximum log-likelihood and the mixture statistics in examples GE and PE. Let us introduce, for all n, k ≥ 1,

c′_kn = ( −log [Γ(k/2)/Γ(1/2)] + k(k − 1)/(4n) + 1/(12n) )_+.

Theorem 2 (i.i.d mixture models). Under the assumptions described above, for every integer k ≥ 1 and for every integer n ≥ 1,

GE: 0 ≤ sup_{θ∈Θ′_k} log g_θ(X_1^n) − log q_k(X_1^n) ≤ ((2k − 1)/2) log n + (k/(2τ²)) |X|²_(n) + c′_kn + d_kn;    (6)
PE: 0 ≤ sup_{θ∈Θ′_k} log g_θ(X_1^n) − log q_k(X_1^n) ≤ ((2k − 1)/2) log n + kτ X_(n) + c′_kn + e_kn.    (7)


2.3. Comment

In (4)–(7), the upper bounds are written as the sum of (1/2) dim(Θ_k) log n, a bounded term and a random term which involves the maximum of |X_1|, ..., |X_n|. The following lemmas guarantee that these random terms are bounded in probability at rate log n in example GE and slower than log n in example PE (for HMM and i.i.d mixture models). Indeed, the probability that |X|_(n) or X_(n) exceeds some level u_n may be written as the expectation of the same probability conditionally on the hidden variables. As soon as this conditional probability has an upper bound that does not depend on the hidden variables, the same upper bound holds for the unconditional probability.

Lemma 3. Let {Y_n}_{n≥1} be a sequence of independent Gaussian random variables with variance σ². The mean of Y_n is denoted by m_n. If sup_{n≥1} |m_n| is finite, then for n large enough,

P{ |Y|²_(n) ≥ 5σ² log n } ≤ 1/n^{3/2}.

Lemma 4. Let {Y_n}_{n≥1} be a sequence of independent Poisson random variables. The mean of Y_n is denoted by m_n. If sup_{n≥1} m_n is finite, then for n large enough,

P{ Y_(n) ≥ log n / √(log log n) } ≤ 1/n².

The proofs of Lemmas 3 and 4 are postponed to Appendix A.

2.4. Proof of Theorems 1 and 2

First, let us introduce some notation. For all θ ∈ Θ_k or θ ∈ Θ′_k (any k ≥ 1), as appropriate, and for all x_1^n ∈ X^n, z_0^n = (z_0, ..., z_n) ∈ {1, ..., k}^{n+1}, we denote by g_θ(x_1^n | z_1^n) the density of X_1^n at x_1^n conditionally on Z_1^n = z_1^n. The mixture density q_k(x_1^n | z_1^n) at x_1^n conditionally on Z_1^n = z_1^n is defined as in (3), with a substitution of g_θ(x_1^n | z_1^n) for g_θ(X_1^n). Similarly, we denote by g_θ(x_1^n | z_0) the density of X_1^n at x_1^n conditionally on Z_0 = z_0, and by q_k(· | z_0) the corresponding conditional mixture density. Besides, if P_θ{z_1^n | z_0} is a shorthand for P_θ{Z_1^n = z_1^n | Z_0 = z_0}, then the mixture density q_k(z_1^n | z_0) at z_1^n is defined as in (3), with replacement of g_θ(X_1^n) by P_θ{z_1^n | z_0}. Finally, for every j ≤ k, let us set

n_j = ∑_{i=1}^n 1{z_i = j},    I_j = {i ≤ n : z_i = j}    and    x̄_j = n_j^{−1} ∑_{i∈I_j} x_i whenever n_j > 0.

By convention, we set x̄_j = 0 whenever n_j = 0.

Proof of Theorem 1. Let us set x_1^n ∈ X^n. The left-hand inequalities of (4) and (5) are obvious. Straightforwardly, using twice the inequality ∑_{j≤k} α_j / ∑_{j≤k} β_j ≤ max_{j≤k} α_j/β_j (valid for all non-negative α_1, ..., α_k and positive β_1, ..., β_k) yields

sup_{θ∈Θ_k} log [ g_θ(x_1^n) / q_k(x_1^n) ]
= log k + sup_{θ∈Θ_k} log [ ∑_{z_0≤k} g_θ(x_1^n|z_0) p^o_{z_0} / ∑_{z_0≤k} q_k(x_1^n|z_0) ]
≤ log k + sup_{θ∈Θ_k} max_{z_0≤k} log [ g_θ(x_1^n|z_0) p^o_{z_0} / q_k(x_1^n|z_0) ]
≤ log k + sup_{θ∈Θ_k} max_{z_0≤k} log [ ∑_{z_1^n∈{1,...,k}^n} g_θ(x_1^n|z_1^n) P_θ{z_1^n|z_0} / ∑_{z_1^n∈{1,...,k}^n} q_k(x_1^n|z_1^n) q_k(z_1^n|z_0) ]
≤ log k + sup_{θ∈Θ_k} max_{z_0^n∈{1,...,k}^{n+1}} log [ g_θ(x_1^n|z_1^n) P_θ{z_1^n|z_0} / (q_k(x_1^n|z_1^n) q_k(z_1^n|z_0)) ].    (8)

Now, as shown in Davisson et al. (1981) (see Eqs. (52)–(61) therein),

sup_{θ∈Θ_k} max_{z_0^n∈{1,...,k}^{n+1}} log [ P_θ{z_1^n|z_0} / q_k(z_1^n|z_0) ] ≤ k log [ Γ(n + k/2) Γ(1/2) / (Γ(k/2) Γ(n + 1/2)) ]
≤ k ( ((k − 1)/2) log n − log [Γ(k/2)/Γ(1/2)] + k(k − 1)/(4n) + 1/(12n) ),    (9)

where the second inequality is derived from the following Robbins–Stirling approximation formula, valid for all z > 0:

√(2π) e^{−z} z^{z−1/2} ≤ Γ(z) ≤ √(2π) e^{−z + 1/(12z)} z^{z−1/2}.

This concludes the study of the second ratio in the right-hand term of (8). The last step of the proof is dedicated to bounding the first ratio. The same scheme of proof applies to both examples GE and PE. It is nevertheless simpler to address each of them one at a time.

GE: Conditionally on Z_1^n = z_1^n, the maximum likelihood estimator of m_j is x̄_j for every j ≤ k, so that the following bound holds for every x_1^n ∈ X^n and z_1^n ∈ {1, ..., k}^n:

g_θ(x_1^n | z_1^n) ≤ ∏_{j=1}^k ∏_{i∈I_j} φ_{x̄_j,σ²}(x_i) = (σ√(2π))^{−n} exp( −∑_{j=1}^k [ ∑_{i∈I_j} x_i²/(2σ²) − n_j (x̄_j)²/(2σ²) ] ).    (10)

Besides, simple calculations yield

q_k(x_1^n | z_1^n) = ∏_{j=1}^k ∫ (σ√(2π))^{−n_j} (τ√(2π))^{−1} exp( −m²/(2τ²) − ∑_{i∈I_j} (x_i − m)²/(2σ²) ) dm
= (σ√(2π))^{−n} ∏_{j=1}^k (1 + n_j τ²/σ²)^{−1/2} exp( −∑_{i∈I_j} x_i²/(2σ²) + n_j² (x̄_j)² / (2σ²(n_j + σ²/τ²)) ).    (11)

We now get, as a by-product of (10) and (11),

g_θ(x_1^n | z_1^n) / q_k(x_1^n | z_1^n) ≤ ∏_{j=1}^k √(1 + n_j τ²/σ²) · exp( ∑_{j=1}^k n_j (x̄_j)² / (2σ²(1 + n_j τ²/σ²)) ).

By convexity, the first factor in the right-hand side expression above satisfies

∏_{j=1}^k √(1 + n_j τ²/σ²) ≤ (1 + nτ²/(kσ²))^{k/2},    (12)

while the ratios n_j/(1 + n_j τ²/σ²) are upper bounded by σ²/τ² for all j ≤ k. Therefore,

sup_{θ∈Θ_k} max_{z_1^n∈{1,...,k}^n} log [ g_θ(x_1^n|z_1^n) / q_k(x_1^n|z_1^n) ] ≤ (k/2) log (1 + nτ²/(kσ²)) + (k/(2τ²)) |x|²_(n).    (13)

Combining (8), (9) and (13) yields the result.

PE: The same argument as in example GE implies that, for each j ≤ k, for every x_1^n ∈ X^n and z_1^n ∈ {1, ..., k}^n:

g_θ(x_1^n | z_1^n) ≤ ∏_{j=1}^k ∏_{i∈I_j} π_{x̄_j}(x_i) = P_n exp( −∑_{j=1}^k n_j x̄_j (1 − log x̄_j) ),    (14)

where P_n = 1/∏_{i=1}^n (x_i)!. In particular, the factor associated with some j ≤ k for which x̄_j = 0 equals one. Furthermore, the following can easily be derived:

q_k(x_1^n | z_1^n) = P_n ∏_{j=1}^k √(τ/π) ∫ m^{n_j x̄_j − 1/2} exp(−(n_j + τ)m) dm = P_n ∏_{j=1}^k √(τ/π) Γ(n_j x̄_j + 1/2) / (n_j + τ)^{n_j x̄_j + 1/2}.    (15)

Here, the factor associated with some j ≤ k for which x̄_j = 0 equals √(τ/(n_j + τ)).
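The conjugate Gaussian integral behind (11) can be checked numerically for a single hidden state: the closed form below is the j-th factor of (11) with all observations assigned to one state. The data points and the values of σ and τ are arbitrary choices for the check, not quantities from the paper.

```python
import math

# Arbitrary values for the check.
sigma, tau = 1.0, 2.0
xs = [0.3, -1.1, 0.8, 2.0, -0.4]
n = len(xs)
xbar = sum(xs) / n

def phi(x, m, s2):
    # Gaussian density with mean m and variance s2.
    return math.exp(-(x - m) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def integrand(m):
    # prod_i phi_{m, sigma^2}(x_i) * phi_{0, tau^2}(m)
    val = phi(m, 0.0, tau ** 2)
    for x in xs:
        val *= phi(x, m, sigma ** 2)
    return val

# Closed form: the single-state factor of (11).
closed = ((sigma * math.sqrt(2 * math.pi)) ** (-n)
          / math.sqrt(1 + n * tau ** 2 / sigma ** 2)
          * math.exp(-sum(x ** 2 for x in xs) / (2 * sigma ** 2)
                     + n ** 2 * xbar ** 2
                       / (2 * sigma ** 2 * (n + sigma ** 2 / tau ** 2))))

# Brute-force trapezoid quadrature over a wide interval.
lo, hi, steps = -30.0, 30.0, 50_000
h = (hi - lo) / steps
quad = (sum(integrand(lo + i * h) for i in range(1, steps))
        + 0.5 * (integrand(lo) + integrand(hi))) * h

print(abs(quad - closed) / closed)  # relative discrepancy; should be tiny
```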


A. Chambaz et al. / Journal of Statistical Planning and Inference 139 (2009) 962–977

At this stage, the ratio g_θ(x_1^n | z_1^n) / q_k(x_1^n | z_1^n) is naturally decomposed into the product of k ratios: for each j ≤ k, the right-hand side factor of (14) divided by the right-hand side factor of (15) is upper bounded by

    √(e n_j / τ) · (1 + τ/n_j)^{n_j x̄_j + 1/2},

whether x̄_j = 0 or not. This simple calculation relies again on the lower bound for Γ(n_j x̄_j + 1/2) yielded by the Robbins–Stirling approximation formula. Consequently, the following holds:

    log [g_θ(x_1^n | z_1^n) / q_k(x_1^n | z_1^n)] ≤ ∑_{j≤k} [ (1/2)(1 - log τ) + (1/2) log n_j + τ(x_(n) + 1/2) ]
                                                 ≤ (k/2) log(n/k) + kτ x_(n) + (k/2)(1 + τ - log τ)    (16)

(the second inequality follows by convexity). Combining (8), (9) and (16) (we emphasize that the right-hand term in (16) does not depend on z_0^n nor on θ) gives the result. □

Note that (12) cannot be improved, since equality is attained when the n_j are equal. The scheme of proof for Theorem 2 is similar to that of Theorem 1.

Proof of Theorem 2. Let x_1^n ∈ X^n. Straightforwardly, for every θ ∈ Θ'_k,

    g_θ(x_1^n) = ∑_{z_1^n∈{1,...,k}^n} g_θ(x_1^n | z_1^n) ∏_{j≤k} (p_j^o)^{n_j} ≤ ∑_{z_1^n∈{1,...,k}^n} g_θ(x_1^n | z_1^n) ∏_{j≤k} (n_j/n)^{n_j}.

In addition,

    q_k(x_1^n) = ∑_{z_1^n∈{1,...,k}^n} q_k(x_1^n | z_1^n) ∫_{S_k} ∏_{j≤k} (p_j^o)^{n_j} dν'_k(p)
               = [Γ(k/2)/Γ(n + k/2)] ∑_{z_1^n∈{1,...,k}^n} q_k(x_1^n | z_1^n) ∏_{j≤k} Γ(n_j + 1/2)/Γ(1/2).

Consequently, using the same argument as the one that yielded (8) implies that

    log [g_θ(x_1^n) / q_k(x_1^n)] ≤ sup_{z_1^n∈{1,...,k}^n} ( log [ (Γ(n + k/2)/Γ(k/2)) ∏_{j≤k} (n_j/n)^{n_j} Γ(1/2)/Γ(n_j + 1/2) ] + log [g_θ(x_1^n | z_1^n) / q_k(x_1^n | z_1^n)] ).

Handling the second term in the right-hand side of the display above has already been done in the Proof of Theorem 1. As for the first term, it is bounded by

    log [Γ(n + k/2) Γ(1/2) / (Γ(k/2) Γ(n + 1/2))] ≤ ((k - 1)/2) log n + c'_kn

(by virtue of Davisson et al., 1981, Eqs. (52)–(61) again and the Robbins–Stirling approximation formula). This completes the proof. □

3. Application to order identification

Let k_0 be the sole integer such that the distribution P_0 of process {X_n}_{n≥1} satisfies

    P_0 ∈ {P_θ : θ ∈ Θ_{k_0}} \ {P_θ : θ ∈ Θ_{k_0 - 1}}

(with convention Θ_0 = ∅). By definition, k_0 is the order of P_0. In examples GE and PE, k_0 is the minimal number of Gaussian or Poisson densities needed to describe the distribution P_0. Our goal in this section is to estimate k_0.

Let us denote by pen(n,k) a positively valued increasing function of n,k ≥ 1 such that, for each k ≥ 1, pen(n,k) = o(n). We define hereby the estimators:

    k̂_n^ML = arg min_{k≥1} { -sup_{θ∈Θ_k} log g_θ(X_1^n) + pen(n,k) }

and

    k̂_n^MIX = arg min_{k≥1} { -log q_k(X_1^n) + pen(n,k) }.

Convenient choices of the penalty term involve the following quantities: for every n,k ≥ 1, we introduce the cumulative sums C_kn = ∑_{ℓ≤k} c_{ℓn}, C'_kn = ∑_{ℓ≤k} c'_{ℓn}, D_kn = ∑_{ℓ≤k} d_{ℓn} and E_kn = ∑_{ℓ≤k} e_{ℓn}. All of them are bounded functions of n.

Theorem 5 (Consistency of k̂_n^ML). Set α > 2, and for each n ≥ 3, k ≥ 1,

    pen(n,k) = ∑_{ℓ=1}^k [(D(ℓ) + α)/2] log n + R_kn + S_kn,

where D(k) = dim(Θ_k) = k² and R_kn = C_kn for HMM mixture models, D(k) = dim(Θ'_k) = (2k - 1) and R_kn = C'_kn for i.i.d mixture models and

    GE: S_kn = D_kn + 5σ² k(k+1) log n,
    PE: S_kn = E_kn + k(k+1) log n / √(log log n).

Under the assumptions described above, k̂_n^ML = k_0 eventually P_0-a.s.

Theorem 6 (Consistency of k̂_n^MIX). Set α > 2, and for each n ≥ 3, k ≥ 1,

    pen(n,k) = ∑_{ℓ=1}^{k-1} [(D(ℓ) + α)/2] log n + S_kn,

where D(k) = dim(Θ_k) = k² for HMM mixture models, D(k) = dim(Θ'_k) = (2k - 1) for i.i.d mixture models and

    GE: S_kn = 5σ² k(k+1) log n,
    PE: S_kn = k(k+1) log n / √(log log n).

Under the assumptions described above, k̂_n^MIX = k_0 eventually P_0-a.s.

Theorems 5 and 6 thus guarantee that k̂_n^ML and k̂_n^MIX are consistent estimators of k_0. We emphasize that no prior bound on k_0 is required.

The penalty function satisfies pen(n,k) = O(log n) for every k ≥ 1 in both examples. It is also important to compare the dependency of pen(n,k) with respect to k with that of the BIC criterion. We do not get a single term (1/2)D(k) on the log n scale, but rather a cumulative sum of terms (1/2)[D(ℓ) + α] for ℓ ranging from 1 to k.
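The selection rule defining k̂_n^ML can be sketched in a few lines. This is an illustration, not the paper's code: the bounded remainders R_kn and S_kn are dropped, D(k) = k² (HMM case) is hard-coded, and the profile log-likelihoods below are hypothetical placeholder values standing in for the output of an EM run at each candidate order k.

```python
import math

# Sketch of the penalized maximum likelihood order selector of Theorem 5,
# with pen(n, k) reduced to its leading term (remainders R_kn, S_kn dropped).

def penalty(n, k, alpha=2.5, dim=lambda k: k ** 2):
    # pen(n, k) = sum_{l=1}^{k} (D(l) + alpha)/2 * log n
    return sum((dim(l) + alpha) / 2.0 for l in range(1, k + 1)) * math.log(n)

def select_order(profile_loglik, n, alpha=2.5):
    # argmin over k of -sup_theta log-likelihood + pen(n, k); orders are 1-based.
    crit = [-ll + penalty(n, k + 1, alpha) for k, ll in enumerate(profile_loglik)]
    return 1 + crit.index(min(crit)), crit

# Hypothetical profile log-likelihoods: big gains up to k = 3, then a plateau.
n = 1000
loglik = [-2500.0, -2200.0, -2050.0, -2048.0, -2047.5]
k_hat, crit = select_order(loglik, n)
```

The cumulative-sum shape of the penalty is what distinguishes this rule from BIC: once the log-likelihood plateaus, each extra state costs an additional (D(k) + α)/2 · log n, so the criterion increases past the true order.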


It is well understood that Bayesian estimators naturally take into account the uncertainty on the parameter by integrating it out (Jefferys and Berger, 1992), thus providing an example of auto-penalization. This is illustrated by the equivalence between marginal likelihood and BIC criterion that holds, for instance, in regular models:

    -log q_k(X_1^n) = -log sup_{θ∈Θ_k} g_θ(X_1^n) + (1/2) D(k) log n + O_P(1),

as n goes to infinity, valid for every k ≥ 1. It is proven in Chambaz and Rousseau (2008) that efficient order estimation can be achieved by comparing marginal likelihoods (implicitly, without additional penalization) even in non-regular models (and for instance for mixtures of continuous densities). However, Csiszár and Shields (2000) provide an example where k̂_n^ML is consistent while k̂_n^MIX is not when its penalty term is set to zero. Here, we (over-)penalize q_k(X_1^n), so that the Proofs of Theorems 5 and 6 mainly rely on the mixture inequalities stated in Theorems 1 and 2.

Proof of Theorem 5. In the i.i.d framework, showing that k̂_n^ML ≥ k_0 eventually P_0-a.s. is a rather simple consequence of the strong law of large numbers and min_{θ∈Θ'_{k_0-1}} K(g_{θ_0}, g_θ) > 0 for any θ_0 ∈ Θ'_{k_0} \ Θ'_{k_0-1} (see Leroux, 1992b for a proof of the latter), where

    K(g_{θ_0}, g_θ) = ∫_{x_1∈X} g_{θ_0}(x_1) log [g_{θ_0}(x_1)/g_θ(x_1)] dμ(x_1)

is the P_{θ_0}-a.s. limit of n^{-1} [log g_{θ_0}(X_1^n) - log g_θ(X_1^n)].

In the HMM framework, it is a consequence of Lemma 8 (see Appendix B), which contains a Shannon–Breiman–McMillan theorem for HMM that holds in examples GE and PE (see Leroux, 1992a, Theorem 2) and a useful by-product of the proof of Theorem 3 in the same paper.

The more difficult part is to obtain that k̂_n^ML ≤ k_0 eventually P_0-a.s.

Let P_0 = P_{θ_0} for θ_0 ∈ Θ_{k_0} \ Θ_{k_0-1}. Let us consider a positively valued sequence {t_n}_{n≥3} to be chosen conveniently later on. Let k > k_0 and n ≥ 3. Obviously, if k̂_n^ML = k, then

    log g_{θ_0}(X_1^n) ≤ sup_{θ∈T_k} log g_θ(X_1^n) + pen(n,k_0) - pen(n,k).

Here, T_k equals Θ_k for HMM mixture models and equals Θ'_k for i.i.d mixture models. Consequently, using (4), (5), (6) or (7) (with τ = 1/2 in example GE and τ = 2 in example PE), k̂_n^ML = k yields

    log g_{θ_0}(X_1^n) ≤ log q_k(X_1^n) + Δ_nk    (17)

with

    Δ_nk = pen(n,k_0) - pen(n,k) + (D(k)/2) log n + a_kn + b_kn + 2k U_n,

where U_n = |X|²_(n), b_kn = d_kn in example GE and U_n = X_(n), b_kn = e_kn in example PE, while a_kn = c_kn for HMM mixture models and a_kn = c'_kn for i.i.d mixture models. Let us choose t_n = 5σ² log n in example GE and t_n = log n / √(log log n) in example PE, so that as soon as U_n ≤ t_n, then

    Δ_nk ≤ -(α/2)(k - k_0) log n.    (18)

Obviously, we have

    P_0{k̂_n^ML > k_0} ≤ P_0{k̂_n^ML > k_0, U_n ≤ t_n} + P_0{U_n ≥ t_n}.    (19)

Because q_k defines a probability measure, we have

    P_0{k̂_n^ML = k, U_n ≤ t_n} ≤ ∫_{x_1^n∈X^n} [g_{θ_0}(x_1^n)/q_k(x_1^n)] 1{ log [g_{θ_0}(x_1^n)/q_k(x_1^n)] ≤ Δ_nk, U_n ≤ t_n } q_k(x_1^n) dμ(x_1^n) ≤ exp{ -(α/2)(k - k_0) log n },

hence

    P_0{k̂_n^ML > k_0, U_n ≤ t_n} ≤ ∑_{k>k_0} exp{ -(α/2)(k - k_0) log n } = O(n^{-α/2}).

As a consequence of Lemmas 3 and 4, P_0{k̂_n^ML > k_0} is O(n^{-α/2} + n^{-3/2}) in example GE and O(n^{-α/2} + n^{-2}) in example PE: we apply the Borel–Cantelli lemma to complete the proof. □

The Proof of Theorem 6 uses the following:

Lemma 7. There exists a sequence {ε_n}_{n≥1} of random variables that converges to 0 P_0-a.s. such that, for any n ≥ 1, if k̂_n^MIX


where {a_{k_0 n}}_{n≥1} is a bounded sequence. The definition of the penalty guarantees that, as soon as U_n ≤ t_n, one has (18). Consequently,

    P_0{k̂_n^MIX > k_0 and U_n ≤ t_n} ≤ ∑_{k>k_0} exp{ -(α/2)(k - k_0) log n } = O(n^{-α/2}).

The result follows by virtue of the Borel–Cantelli lemma, the previous bound and Lemmas 3, 4: k̂_n^MIX ≤ k_0 eventually P_0-a.s. □

4. Simulations and experimentation

In this section, we focus on the penalized maximum likelihood estimator k̂_n^ML. In Section 4.1 we investigate the importance of the choice of the penalty term. We first illustrate that the penalty given in Theorems 5 and 6 is heavy enough to obtain a.s. consistency with no prior upper bound. Then we try to understand whether a smaller penalty could be chosen to retain a.s. consistency in the same context. Section 4.2 is dedicated to the presentation of an application to postural analysis in humans within framework GE. In order to compute the maximum likelihood estimates, we use the standard EM algorithm (Baum et al., 1970; Cappé et al., 2005). The algorithm is run with several random starting points, and iterations are stopped whenever the parameter estimates hardly differ from one iteration to the other.

4.1. A simulation study of the penalty calibration

We first propose to illustrate the a.s. convergence of k̂_n^ML in a toy-model of HMM with Poisson emissions. We simulate five samples of distribution P_θ for θ = (p,m) ∈ Θ_6, where m_j = 3j (each j ≤ 6), and p_{6,1} = 1, p_{j,j+1} = 1 - p_{j,1} = 0.9 (each j ≤ 5). As estimator k̂_n^ML requires no upper bound on the order, the question arises to determine at which values of k the penalized maximum likelihood should be evaluated. Fig. 1 illustrates the behavior of criterion -sup_{θ∈Θ_k} log g_θ(x_1^n) + pen(n,k) with a sample size n = 1000 versus the number k of hidden states. The criterion looks very regular: it first decreases rapidly, then stabilizes, and finally increases slowly but systematically. Thus, identifying the minimizer k̂_n^ML is an easy task. The values of k̂_n^ML are displayed in Fig. 2. We emphasize that only under-estimation and never over-estimation occur with our choice of penalty. This may indicate that our penalty is as small as possible.

Fig. 1. For each of five samples of length n = 1000, penalized maximum likelihood criteria -sup_{θ∈Θ_k} log g_θ(x_1^n) + pen(n,k) for k varying from 1 to 20.

Fig. 2. Almost sure convergence of k̂_n^ML. As the sample size grows (x-axis), the values of k̂_n^ML (y-axis) increase to the true order k_0 = 6.

We also study the examples considered in MacKay (2002, Section 5, pp. 582–585). As expected, estimator k̂_n^ML has a good behavior for sample sizes which are large enough. Fig. 3 represents the evolution of the penalized maximum likelihood criteria -sup_{θ∈Θ_k} log g_θ(x_1^n) + pen(n,k) for k ≤ 4 as the sample size n grows for a realization x_1^n of the so-called "well separated, unbalanced" model of order 2 taken from MacKay (2002, Section 5).

Fig. 3. Values of -sup_{θ∈Θ_k} log g_θ(x_1^n) + pen(n,k) (k ≤ 4) as n grows. From top to bottom, for large values of n: k = 4, 1, 3, 2.

For small samples, smaller models are systematically chosen, and this agrees with our presumption that our penalty is too heavy. Note that the BIC criterion suffers from the same defect, as can be seen in MacKay (2002, Fig. 2). In that perspective, one may search for some minimal penalty leading to a consistent estimator. We address this issue by computing the differences [sup_{θ∈Θ_2} log g_θ(x_1^n) - sup_{θ∈Θ_k} log g_θ(x_1^n)] for k = 1, 3, 4, see Fig. 4. For k = 1, the difference grows linearly, so that any sub-linear penalty prevents under-estimation (see also the beginning of the Proof of Theorem 5). For k = 3, 4, the differences seem almost constant in expectation. A convenient penalty should dominate (eventually almost surely) their extreme values. For instance, it is proved in Chambaz (2006) that a log log n penalty guarantees consistency when an upper bound on the order is known. Without such a bound, it remains open whether a log n penalty is optimal or not.

4.2. Application to postural analysis in humans

Maintaining posture efficiently is achieved by dynamically resorting to the best available sensory information. The latter is divided in three categories: vestibular, proprioceptive, and visual information. Every individual has developed his/her own preferences according to his/her sensorimotor experience. Sometimes, a sole kind of information (usually, visual) is processed in all situations. This occurs in healthy individuals, but it is more common in elderly people, in people having suffered from a stroke, in people afflicted by Parkinson disease for instance. Although processing a sole kind of information may be efficient for maintaining posture in one's usual environment, it is likely not to be adapted to new or unexpected situations, and may result in a fall. Therefore, it is of primordial importance to learn how to detect such a sensory typology, so as to propose an adapted reeducation program.
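The toy model of Section 4.1 can be simulated along the following lines. A sketch under our reading of the stated transition structure (p_{j,j+1} = 0.9 and p_{j,1} = 0.1 for j ≤ 5, p_{6,1} = 1, so the chain climbs through the states with occasional resets to state 1); the sampler and the seed are incidental choices.

```python
import math
import random

# Generate one sample path of the 6-state Poisson-emission HMM with means
# m_j = 3j used in the simulation study.

random.seed(1)

def poisson(mean):
    # Knuth's multiplication method; adequate for the small means used here.
    L, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def simulate(n, k0=6):
    means = [3 * j for j in range(1, k0 + 1)]
    z, states, obs = 1, [], []
    for _ in range(n):
        states.append(z)
        obs.append(poisson(means[z - 1]))
        # from state 6 back to 1 surely; otherwise advance with probability 0.9
        z = 1 if z == k0 or random.random() >= 0.9 else z + 1
    return states, obs

states, obs = simulate(1000)
```

A path of length n = 1000 visits all six states many times, which is why the criterion of Fig. 1 separates the orders so cleanly at that sample size.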


Fig. 4. Representation of differences [sup_{θ∈Θ_2} log g_θ(x_1^n) - sup_{θ∈Θ_k} log g_θ(x_1^n)] (for k = 1, 3, 4) as n grows. Top: all curves (from top to bottom, for large n: k = 1, 3, 4). Bottom: curves for k = 3, 4 (from top to bottom for large n: k = 3, 4; note the change in scale along the y-axis).

Postural analysis in humans at stable equilibrium has already been addressed using fractional Brownian motion (see Bardet and Bertrand, 2007 and references therein), or diffusion processes (Rozenholc et al., 2007). We illustrate now how the study of this difficult issue can be addressed within the theoretical framework of HMM with Gaussian emission. Data are collected during a 70-s experiment. Every Δ = 0.025 s, the position where a control subject exerts maximal pressure on a force platform is recorded. We denote by T_n the distance between the latter at time nΔ and a reference position.

The experimental protocol we choose to present here is decomposed into three phases: a phase of 35 s during which the subject's balance is perturbed (by vibratory stimulation of the left tendon, known to force to tilt forward) is preceded by 15 s and followed by 20 s of recording without stimulation. According to the medical background and a preliminary analysis, the process (X_n)_{n≥1} of interest derives from the differenced process (∇T_n)_{n≥1} = (T_{n+1} - T_n)_{n≥1}, which is arguably stationary: for all n ≥ 1,

    X_n = log{(∇T_n)²}

(in any continuous model, ∇T_n = 0 has probability 0). We hereafter assume that (X_n)_{n≥1} is a HMM with Gaussian emission. Heuristically, we focus on the evolution of the volatility of process (T_n)_{n≥1}.

The estimated order k̂_n^ML equals 3. The result coincides with that of the BIC criterion. In order to compute k̂_n^ML, we estimated σ on an independent experiment (same subject, eyes open, no perturbation). We assume that the variance of the volatility process remains the same all over the three-phase experiment. We are also interested in the inference concerning the unobservable sequence of hidden states. We compute the a posteriori most likely sequence of states by the Viterbi algorithm. In other words, we find the sequence z_1^n which maximizes (with respect to ξ_1^n ∈ {1,2,3}^n) the joint conditional probability P_θ̂{Z_1^n = ξ_1^n | X_1^n = x_1^n}, θ̂ ∈ Θ_3 denoting the value of θ output by the EM algorithm on that model. Fig. 5 represents the data and z_1^n. Sequence z_1^n carries (non-distributional) information about the model, and helps interpreting the event "k̂_n^ML = 3".

Fig. 5. Realizations t_1^n (top) and x_1^n (middle), and a posteriori most likely sequence of hidden states z_1^n (bottom). The vertical dotted lines indicate the limits of the vibratory stimulation phase. Top: points (nΔ, t_n). Middle: points (nΔ, x_n). Bottom: points (nΔ, z_n); each of the three postulated hidden states is associated with a particular level of volatility for ∇T_n. Note that the scale on the y-axis is not linear.

The three hidden states HMM proves very satisfactory from a medical point of view. Fig. 5 suggests the following interpretation: a reference behavior in standard conditions of standing up (time intervals [0;15] and [∼65;70]) is a combination of two regimes (indexed by 1 and 2); a learning behavior to adapt to new conditions when standing up corresponds to the third regime (indexed by 3). The first, second, and third regimes are, respectively, associated with medium (m_1 = -3.90), small (m_2 = -6.13), and large (m_3 = -1.52) volatility for process (T_n)_{n≥1}. The empirical proportions π̂_i(ξ) of each regime ξ ∈ {1,2,3} on each phase i ∈ {1,2,3} are as follows: π̂_1(1) = 0.69, π̂_1(2) = 0.31, π̂_1(3) = 0; π̂_2(1) = 0.64, π̂_2(2) = 0.04, π̂_2(3) = 0.32; π̂_3(1) = 0.50, π̂_3(2) = 0.25, π̂_3(3) = 0.25. The whole description (characterization of the three regimes and their succession through the duration of the experiment) coincides with the expectations of the medical team.

Acknowledgment

We want to thank Isabelle Bonan (LNRS, UMR CNRS 7060, Université Paris Descartes, and APHP Lariboisière–Fernand-Widal) for providing us with the postural analysis problem and data set. We also want to thank the associate editor and referees for their suggestions.


Appendix A. Proofs of Lemmas 3 and 4

Proof of Lemma 3. Let m = sup_{n≥1} |m_n| and t_n = √(5σ² log n) (all n ≥ 1). Let n be large enough, so that t_n ≥ m. For every i ≤ n,

    P{|Y_i| ≤ t_n} = P{|m_i + (Y_i - m_i)| ≤ t_n} ≥ P{|Y_i - m_i| ≤ t_n - |m_i|} ≥ P{|Y_i - m_i| ≤ t_n - m}
                   = ∫_{-t_n+m}^{t_n-m} φ_{0,σ²}(y) dy ≥ 1 - (σ/t_n) φ_{0,σ²}(t_n) (1 + o(1)).

Hence, by virtue of the independence of Y_1, ..., Y_n,

    P{|Y|²_(n) ≥ t_n²} = 1 - ∏_{i=1}^n P{|Y_i| ≤ t_n} ≤ 1 - (1 - (σ/t_n) φ_{0,σ²}(t_n) (1 + o(1)))^n
                       = 1 - exp{ -(n exp(-t_n²/(2σ²)) / (t_n √(2π))) (1 + o(1)) }
                       ≤ (n exp(-5σ² log n/(2σ²)) / (√(5σ² log n) √(2π))) (1 + o(1)) ≤ n^{-3/2},

as soon as n is large enough. □

Proof of Lemma 4. Let m = sup_{n≥1} m_n and t_n = log n / √(log log n) (all n ≥ 3). Let Y be a Poisson random variable with mean m. The logarithmic moment generating function Ψ of (Y - m) satisfies Ψ(λ) = log E e^{λ(Y-m)} = m(e^λ - λ - 1) (all λ ≥ 0). Its Legendre transform Ψ* is given for all t ≥ 0 by

    Ψ*(t) = sup_{λ≥0} {λt - Ψ(λ)} = (t + m) log((t + m)/m) - t.

Now, it is obvious that P{Y_i ≥ t} ≤ P{Y ≥ t} (for each i ≤ n and t > m). Therefore, by using the Chernoff bounding method,

    P{Y_(n) ≥ t_n} ≤ n P{Y ≥ t_n} = n P{Y - m ≥ t_n - m} ≤ n exp{-Ψ*(t_n - m)}.    (A.1)

Besides,

    Ψ*(t_n - m) = t_n log(t_n/m) - t_n + m = (log n) √(log log n) (1 + o(1)) ≥ 3 log n

as soon as n is large enough. We conclude by plugging this lower bound into (A.1). □

Appendix B. A useful lemma for HMM mixture models

Lemma 8 (Leroux). For HMM mixture models with bounded parameter sets, both in examples GE and PE, for every k ≥ 1 and θ_0, θ ∈ Θ_k, there exists a constant K_∞(g_{θ_0}, g_θ) ≤ ∞ such that n^{-1} [log g_{θ_0}(X_1^n) - log g_θ(X_1^n)] converges to K_∞(g_{θ_0}, g_θ) P_{θ_0}-almost surely.

Furthermore, every θ in the one-point compactification of Θ_{k_0} admits a neighborhood O_θ and ε > 0 such that inf_{θ'∈O_θ} K_∞(g_{θ_0}, g_{θ'}) > ε. Because Θ_{k_0-1} is precompact, it is covered by the finite union of O_{θ_1}, ..., O_{θ_I} (each of them associated with ε_i > 0) and therefore

    inf_{θ∈Θ_{k_0-1}} K_∞(g_{θ_0}, g_θ) ≥ min_{i≤I} inf_{θ∈O_{θ_i}} K_∞(g_{θ_0}, g_θ) ≥ min_{i≤I} ε_i > 0.

References

Azais, J.-M., Gassiat, E., Mercadier, C., 2006. Asymptotic distribution and power of the likelihood ratio test for mixtures: bounded and unbounded case. Bernoulli 12 (5), 775–799.
Bardet, J.-M., Bertrand, P., 2007. Identification of the multiscale fractional Brownian motion with biomechanical applications. J. Time Ser. Anal. 28, 1–52.
Barron, A.R., Rissanen, J., Yu, B., 1998. The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory 44, 2743–2760.
Baum, L.E., Petrie, T., 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist. 37, 1554–1563.
Baum, L.E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41, 164–171.
Cappé, O., Moulines, E., Rydén, T., 2005. Inference in Hidden Markov Models. Springer, New York.
Chambaz, A., 2006. Testing the order of a model. Ann. Statist. 34 (3), 1166–1203.
Chambaz, A., Rousseau, J., 2008. Bounds for Bayesian order identification with application to mixtures. Ann. Statist. 36 (2), 938–962.
Chen, J., Kalbfleisch, J.D., 1996. Penalized minimum-distance estimates in finite mixture models. Canad. J. Statist. 24 (2), 167–175.
Clarke, B.S., Barron, A.R., 1990. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory 36, 453–471.
Csiszár, I., Shields, P.C., 2000. The consistency of the BIC Markov order estimator. Ann. Statist. 28 (6), 1601–1619.
Csiszár, I., Talata, Z., 2006. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inform. Theory 52 (3), 1007–1016.
Dacunha-Castelle, D., Gassiat, E., 1997. The estimation of the order of a mixture model. Bernoulli 3 (3), 279–299.
Davisson, L.D., McEliece, R.J., Pursley, M.B., Wallace, M.S., 1981. Efficient universal noiseless source codes. IEEE Trans. Inform. Theory 27, 269–279.
Ephraim, Y., Merhav, N., 2002. Hidden Markov processes. IEEE Trans. Inform. Theory 48, 1518–1569.
Finesso, L., 1991. Consistent estimation of the order for Markov and hidden Markov chains. Ph.D. Thesis, University of Maryland.
Garivier, A., 2006. Consistency of the unlimited BIC context tree estimator. IEEE Trans. Inform. Theory 52 (10), 4630–4635.
Gassiat, E., 2002. Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist. 38 (6), 897–906.
Gassiat, E., Boucheron, S., 2003. Optimal error exponents in hidden Markov models order estimation. IEEE Trans. Inform. Theory 49 (4), 964–980.
Gassiat, E., Kéribin, C., 2000. The likelihood ratio test for the number of components in a mixture with Markov regime. ESAIM Probab. Statist. 4, 25–52.
Hansen, M.H., Yu, B., 2001. Model selection and the principle of minimum description length. J. Amer. Statist. Assoc. 96 (454), 746–774.
Henna, J., 1985. On estimating of the number of constituents of a finite mixture of continuous distributions. Ann. Inst. Statist. Math. 37 (2), 235–240.
Hughes, J.P., Guttorp, P., 1994. A class of stochastic models for relating synoptic atmospheric patterns to regional hydrologic phenomena. Water Resources Res. 30, 1535–1546.
Ishwaran, H., James, L.F., Sun, J., 2001. Bayesian model selection in finite mixtures by marginal density decompositions. J. Amer. Statist. Assoc. 96 (456), 1316–1332.
James, L.F., Priebe, C.E., Marchette, D.J., 2001. Consistent estimation of mixture complexity. Ann. Statist. 29 (5), 1281–1296.
Jefferys, W., Berger, J., 1992. Ockham's razor and Bayesian analysis. Amer. Sci. 80, 64–72.
Kaleh, G.K., Vallet, R., 1994. Joint parameter estimation and symbol detection for linear or nonlinear unknown channels. IEEE Trans. Commun. 42, 2406–2413.
Keribin, C., 2000. Consistent estimation of the order of mixture models. Sankhya Ser. A 62 (1), 49–66.
Kieffer, J.C., 1993. Strongly consistent code-based identification and order estimation for constrained finite-state model classes. IEEE Trans. Inform. Theory 39, 893–902.
Koski, T., 2001. Hidden Markov Models for Bioinformatics. Kluwer Academic Publishers Group, Dordrecht.
Leroux, B.G., 1992a. Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl. 40, 127–143.
Leroux, B.G., 1992b. Consistent estimation of a mixing distribution. Ann. Statist. 20 (3), 1350–1360.
Levinson, S.E., Rabiner, L.R., Sondhi, M.M., 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Tech. J. 62, 1035–1074.
Liu, C.C., Narayan, P., 1994. Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. IEEE Trans. Inform. Theory 40, 1167–1180.
MacKay, R.J., 2002. Estimating the order of a hidden Markov model. Canad. J. Statist. 30 (4), 573–589.
Mengersen, K., Robert, C., 1996. Testing for mixtures: a Bayesian entropy approach. In: Berger, J.O., Bernardo, J.M., Dawid, A.P. (Eds.), Bayesian Statistics, vol. 5. Oxford University Press, Oxford.
Moreno, E., Liseo, B., 2003. A default Bayesian test for the number of components in a mixture. J. Statist. Plann. Inference 111 (1–2), 129–142.
Rissanen, J., 1978. Modelling by shortest data description. Automatica 14, 465–471.
Rissanen, J., 1986. Stochastic complexity and modeling. Ann. Statist. 14 (3), 1080–1100.
Rozenholc, Y., Chambaz, A., Bonan, I., 2007. Penalized nonparametric mean square estimation for diffusion processes with application to postural analysis in human. Proceedings of ISI 2007, Lisboa.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6 (2), 461–464.
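The Chernoff bound at the core of the proof of Lemma 4 (Appendix A) can be checked numerically against the exact Poisson tail. A minimal sketch; the mean m and threshold t below are arbitrary illustrative values.

```python
import math

# Check P{Y >= t} <= exp(-Psi*(t - m)) for Y Poisson with mean m and t > m,
# where Psi*(t - m) = t*log(t/m) - t + m (Legendre transform of the log-mgf).

def poisson_tail(t, m):
    # Exact P{Y >= t} for integer t, via the complementary pmf sum.
    p, total = math.exp(-m), 0.0
    for k in range(int(t)):
        total += p
        p *= m / (k + 1)
    return 1.0 - total

def chernoff_bound(t, m):
    return math.exp(-(t * math.log(t / m) - t + m))

m, t = 3.0, 12
exact, bound = poisson_tail(t, m), chernoff_bound(t, m)
```

As expected, the exact tail sits below the Chernoff bound, and the bound itself is already small at moderate thresholds, which is what drives the n^{-2} rate in Lemma 4.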


Journal de la Société Française de Statistique, Volume 150, numéro 1, 2009

Deux modèles de Markov caché pour processus multiples et leur contribution à l'élaboration d'une notion de style postural

Antoine Chambaz¹, Isabelle Bonan², Pierre-Paul Vidal³

Title: Two hidden Markov models for multiple processes and their contribution to the definition of a postural style

Abstract
Two hidden Markov models (HMMs) for multiple processes are introduced and applied to the biomedical study of how people maintain posture. Informally, they model simultaneously multiple processes by making them share the same hidden Markov and observed conditional distributions. One of them incorporates covariates and random effects in the conditional part. We discuss the theoretical issue of order estimation for both models as well as computational issues. Finally, we contribute to the definition of a notion of "postural style".

Keywords: BIC, hidden Markov models (HMM), order estimation, posture
Mathematics Subject Classification: 62M02, 62M05, 62P10

¹ MAP5, Université Paris Descartes, Antoine.Chambaz@parisdescartes.fr
² LNRS, Université Paris Descartes, Hôpitaux Fernand-Widal et Lariboisière, Isabelle.Bonan@chu-rennes.fr
³ LNRS, Université Paris Descartes, Pierre-Paul.Vidal@univ-paris5.fr

Journal de la Société Française de Statistique, 150(1), 73-100, http://smf.emath.fr/Publications/JSFdS/
© Société Française de Statistique et Société Mathématique de France, 2009

1 Introduction

Hidden Markov models (HMMs) are a much appreciated tool for modeling longitudinal data. In these models, a hidden (that is, unobserved) process {Z_t : t = 1, ..., n} coexists with the observed process {Y_t : t = 1, ..., n}. The process {Z_t}, assumed to be a Markov chain, carries all the information relative to dependence: conditionally on {Z_t}, the observed process has independent coordinates; more precisely, conditionally on Z_t, Y_t is independent of {Y_s, Z_s : s ≠ t}. Thus, the distribution of {Y_t} is characterized by the initial law and the transition kernel of {Z_t} on the one hand, and by the conditional laws of Y_t given Z_t on the other hand. When Z_t takes only a finite number of values, the marginal law of Y_t can be written as a finite mixture. In this respect, HMMs offer a way of modeling over-dispersion. Obviously, if {Z_t} is a sequence of independent identically distributed (i.i.d) variables, then {Y_t} is itself a sequence of i.i.d variables.

The literature devoted to the study of HMMs is abundant. We refer the reader to the monograph [6] for a very complete mathematical presentation of the topic. Examples of uses of HMMs are numerous; among many others, we cite [21] for an application to speech recognition, [17] for applications to the study of post-genomic data, and [1] for an application to the study of the number of lesions in a patient suffering from multiple sclerosis.

In general, a single (possibly multidimensional) process {Y_t} is modeled by a hidden Markov model. Yet it would often be relevant to model simultaneously several processes {Y_t^i}, i = 1, ..., N, in the spirit of hidden Markov models. Starting from this observation, Altman [2] developed an extension of HMMs. Called mixed HMMs, this extension rests on the introduction of covariates and random effects in the two parts (the hidden one and the conditional one) of the standard hidden Markov model. Applied to the study of the number of lesions in several patients (and no longer a single one) suffering from multiple sclerosis, the extension allows a much better understanding of the mechanisms of the disease and of the various treatments that are at play.

Our work, motivated by a biomedical application, rests on two hidden Markov models for several processes (hence the expression "HMM for multiple processes") which are special cases of the standard HMM (characterized by a sparse transition matrix) and of the mixed HMM. The biomedical application in question is presented in Section 2. The first statistical model on which the study rests is described in Section 3, and we report in Section 4 the results of its fit to our data. There, we argue for the interest of introducing a second model, presented in Section 5, its contribution to the elaboration of a notion of postural style being the object of Section 6. A discussion of our numerical approach and the proofs of the article are finally deferred to the Appendix.


Chambaz et al., Two HMMs for developing a notion of postural style

2 The search for a notion of postural style: an important challenge

2.1 Biomedical background

An individual's postural control relies on three types of information, encoded by the visual, vestibular (located in the inner ear) and proprioceptive systems (the latter composed of the sensory receptors located near the bones, joints and muscles, which respond to the stimulations produced by the body's movements). The way the central nervous system processes this information varies with each individual's sensorimotor experience. It is shaped in particular by age, by the sports and occupations practiced, and no doubt also by genetic factors. One of the fundamental components of postural control lies in the brain's ability to seize the most relevant sensory information at a given instant in order to react to a given situation.

In this context, one can understand that everyone tends, owing to their sensorimotor experience, to develop a preference for one particular type of sensory information, which is thus called upon predominantly. Visual preference is probably the most frequent, and in any case the best documented. It can be found in healthy subjects, but it is especially common among the elderly and among people suffering from Parkinson's disease, or following a stroke or a vestibular pathology.
Now, while such a systematic selection of one perceptual mode allows an individual to move efficiently in their usual environment, it is clearly ill-suited to responding to new or unexpected situations, not to say dangerous. Such a mode of functioning is potentially more likely to lead to a fall, and the consequences of a fall are often dramatic beyond the age of sixty: falls and their sequelae cause the death of nearly ten thousand people per year. Faced with this problem, the rehabilitation therapist tries to identify, on clinical and empirical grounds, a "sensory preference" in the patient referred to them for postural control disorders. They then strive to correct these disorders by leading the subject to take all of their sensory afferences into account to regulate their posture. "Sensory preference", a concept that remains vague, has not yet been validated by quantitative measurements, nor has its possible modification through rehabilitation.

This work aims to contribute to the development of a notion of "postural style".

2.2 An overview of the state of the art

The study of postural control is the subject of a large number of publications.
Very schematically, they can be divided into two categories, depending on whether they deal with modeling the information integration process (see the recent [18] and its references) or with the statistical analysis of a subject's displacements (see [26] for a pioneering regression-based statistical study of gait, not of posture). Our approach belongs to the second group.

Still schematically, the data analyzed statistically here are obtained as recordings of the small displacements of an individual standing on both feet and subjected to various perturbations. Each type of stimulation aims to probe one of the three information acquisition systems:
– closing the eyes and optokinetic stimulation make it possible to investigate the role of the visual system (see for instance [23] and its references);
– galvanic stimulations make it possible to study the vestibular system (see for instance [13, 19] and their references);
– vibratory stimulations make it possible to examine the role of the proprioceptive system (see for instance [13, 15] and their references).

Postural control is quantified by means of a force platform.
This device, similar to a bathroom scale, records the successive positions of the point of maximal pressure exerted by the subject.

It is by surveying the range of statistical techniques developed to study postural control that one can highlight how our project departs from its predecessors.

The statistical argument of a large number of articles is based on comparisons of mean or extremal quantities (mean displacement, mean velocity, min-max range, etc.). These quantities have the drawback of not sufficiently exploiting the temporal dynamics of the phenomenon.

These dynamics are, by contrast, central in the articles that exploit classical time series models (as notably in [14, 28]). This is all the more true of the works in which the trajectories of the positions of the point of maximal pressure are modeled by Brownian motion (see [8] and its references). As extensions thereof, models based on the fractional Brownian motion introduced in [9] have also been exploited; the authors of [4, 3] thus study how dependence propagates over time within a trajectory (for protocols without stimulation).
Finally, another line of work involves diffusion models, typically of Ornstein-Uhlenbeck type (see in particular [25, 12]).

In our study, we too give a central place to the temporal dynamics of the phenomenon.

2.3 Medical protocols for data acquisition

In this work we exploit only part of the data produced by the medical team with which we partner for this study.

Our choice fell on one of the protocols that have so far been applied to all participants in the study, whether healthy subjects, vestibular patients or hemiplegic patients. It belongs to the category of vibratory stimulation protocols.

The patients stand with their eyes closed on a force platform, the device we have already described as akin to a bathroom scale. This device records at regular intervals the successive positions of the points where the patient's left foot and right foot separately exert maximal pressure. The sampling frequency is 40 Hz (one reading every δ = 25 milliseconds).

The protocol is divided into three experimental phases: during the first 15 seconds, stance is not artificially perturbed; then come 35 seconds during which stance is perturbed by vibratory stimulation of the patient's triceps; during the final 20 seconds, stance is again left unperturbed.

2.4 Brief description of the data

We observe N_(0,0) = 32 healthy patients, N_(1,0) = 23 vestibular patients and N_(0,1) = 16 hemiplegic patients, for a total of N = 71 patients.

The raw observations for one of them are written {U^G_{tδ}, U^D_{tδ}}, where U^G_{tδ} (respectively U^D_{tδ}) is the position where the left (respectively right) foot exerts maximal pressure at time tδ. These positions are tracked through their Cartesian coordinates.

We perform a first reduction of these raw data by considering only the sequence of segment midpoints {U_{tδ} = (1/2)(U^G_{tδ} + U^D_{tδ})}. Heuristically, this sequence describes the trajectory of the projection of the patient's center of gravity onto the force platform.

Having observed jumps at the beginning of the protocol on several trajectories (most of the time during the first second), we do not take the first two seconds of recording into account.

Moreover, a preliminary study confirmed the relevance of summarizing these data by considering only the sequence {X_{tδ}} of the distances separating U_{tδ} from a reference point.
One possibility among several, this reference point is defined as the median value of the U_{tδ} over the first period (that is, before the stimulations).

This same preliminary study finally allowed us to bring out that the volatility of the sequence {X_{tδ}} is particularly interesting. Building on this observation, we perform one last transformation by introducing, for every t, the quantity

  log{(X_{(t+1)δ} − X_{tδ})²}.

We defer to the Appendix an argument justifying the form of this transformation.

Lastly, we only consider a subsample of the transformed data. Formally, setting Δ = 10δ, for every t = 1, ..., n = 272,

  Y_t = log{(X_{(t+1)Δ} − X_{tΔ})²}.

The main purpose of this manipulation is to simplify the implementation of our study. Indeed, we argue below in favor of a quasi-Newton method for optimizing the (exact) likelihood rather than one based on the Expectation-Maximization (EM) algorithm, its popular alternative. The quasi-Newton method converges much faster than the EM algorithm, and the results it produces are, unsurprisingly, quantitatively better (for instance in terms of the norm of the gradient at the end of the optimization). While very easy to program when the sample size is of the order of a few hundred, the computation of the exact likelihood and of its gradient unfortunately becomes very delicate when it is of the order of a few thousand.
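The whole data reduction described above (midpoints, distances to the reference point, subsampling at Δ = 10δ, then log squared increments) can be sketched as follows. This is our own illustration on a toy random walk, not real plate recordings, and the coordinatewise median as the two-dimensional reference point is our reading of the text.

```python
import numpy as np

def reduce_raw(u_left, u_right, first_period, step=10):
    """From raw plate positions to the transformed series {Y_t}.

    u_left, u_right: (T, 2) Cartesian positions of the left/right pressure
    points; first_period: number of samples before the stimulations start."""
    u = 0.5 * (u_left + u_right)                 # midpoints U_{t delta}
    ref = np.median(u[:first_period], axis=0)    # reference point (median)
    x = np.linalg.norm(u - ref, axis=1)          # distances X_{t delta}
    xs = x[::step]                               # subsample: Delta = step * delta
    return np.log(np.diff(xs) ** 2)              # Y_t = log((X_{(t+1)D} - X_{tD})^2)

rng = np.random.default_rng(1)
T = 2721                                         # toy length, roughly 68 s at 40 Hz
u_l = np.cumsum(rng.normal(scale=0.1, size=(T, 2)), axis=0)
u_r = u_l + rng.normal(scale=0.05, size=(T, 2))
y = reduce_raw(u_l, u_r, first_period=600)       # 15 s * 40 Hz = 600 samples
print(len(y))                                    # 273 subsampled points give 272 values
```

With continuously distributed positions the increments are almost surely nonzero, so the logarithm is well defined.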
This study is meant to lay the groundwork; in the near future we will adapt our programs so that they can handle the original datasets. We are nevertheless convinced that the nature of the results obtained will not be overturned.

3 A first hidden Markov model for multiple processes

Let Y^i_t be the observation and Z^i_t the hidden state associated with patient i, i = 1, ..., N, at time t, t = 1, ..., n. We write Y^i and Z^i for the n-dimensional vectors of the observations and hidden states of patient i. The nN-dimensional vectors Y and Z collect all the observations and all the hidden states. The lowercase versions y^i_t, y^i, y and z^i_t, z^i, z denote realizations.

3.1 Construction

For a fixed number of regimes K, the Z^i are Markov chains taking values in {1, ..., K}, with common transition probabilities {P_kl} and initial distribution {π_k}. We further require every chain to start in the state numbered 1: for every i, z^i_1 = 1, or in other words π_1 = 1.

Let θ ∈ Θ_K be a generic vector collecting all the parameters of the model we are describing. Under θ, conditionally on Z, the Y^i_t are independent, with Gaussian distribution f(y^i_t | Z^i_t = k; θ) of variance σ², assumed known, and mean μ^i_{tk}:

  f(y^i_t | Z^i_t = k; θ) = (2πσ²)^{−1/2} exp{−(y^i_t − μ^i_{tk})² / 2σ²},   μ^i_{tk} = m_k + x_i^T β_k,

the covariate x_i being (0, 0)^T for a healthy patient, (1, 0)^T for a vestibular patient and (0, 1)^T for a hemiplegic patient.
Thus, m_k represents the emission mean for a healthy patient in regime k, taken as the reference; the first and second coordinates of β_k correspond to the modulations of this mean for vestibular and hemiplegic patients, respectively.

Hence, each Y^i is an HMM with Gaussian emissions. We speak of an HMM for a multiple process because characterizing the joint law of Y requires introducing the N hidden Markov chains of Z. The model nevertheless remains a standard HMM, characterized by a sparse transition matrix (whose dimension depends on N).

Finally, the log-likelihood of this model reads

  l(θ; y) = Σ_{i=1}^N log { Σ_{z^i} π_{z^i_1} f(y^i_1 | z^i_1; θ) Π_{t=2}^n P_{z^i_{t−1}, z^i_t} f(y^i_t | z^i_t; θ) },

where the argument z^i in the sum ranges over {1, ..., K}^n under the constraint z^i_1 = 1.
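The sum over {1, ..., K}^n in l(θ; y) need not be enumerated: the classical forward recursion computes each inner sum exactly in O(nK²) operations. Here is a minimal log-space sketch (our own implementation, not the authors' code):

```python
import numpy as np
from scipy.special import logsumexp

def hmm_loglik(y, P, pi, mu, sigma):
    """log sum_z pi_{z_1} f(y_1|z_1) prod_{t>=2} P_{z_{t-1} z_t} f(y_t|z_t),
    computed by the forward recursion in log space for one chain Y^i.
    mu[k] plays the role of mu^i_{tk} = m_k + x_i^T beta_k."""
    logf = (-0.5 * ((y[:, None] - mu[None, :]) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2.0 * np.pi))  # log f(y_t | Z_t = k)
    with np.errstate(divide="ignore"):                    # zeros allowed in pi, P
        log_pi, log_P = np.log(pi), np.log(P)
    alpha = log_pi + logf[0]                              # forward variables, t = 1
    for t in range(1, len(y)):
        alpha = logsumexp(alpha[:, None] + log_P, axis=0) + logf[t]
    return float(logsumexp(alpha))
```

The full l(θ; y) is then the sum of this quantity over the N independent chains, each with its own covariate-shifted means μ^i. The constraint π_1 = 1 is handled by passing an initial law with zeros.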


This model also appears as a special case of the mixed HMMs introduced in [2]. Example 1 on page 204 of [2] (called a single, patient-specific random effect) is indeed written the same way, up to the addition of a centered random-effect term u_i (the u_1, ..., u_N being i.i.d. with a density) to the definition of μ^i_{tk}. In this respect, mixed HMMs extend standard HMMs in the spirit of linear mixed-effects models. We have refrained from adopting this model for now because of its computational cost, too high for our dataset. By way of comparison, although the values of N are comparable in [2] and here, our value of n is more than ten times larger than those of [2].

As for the fact that all the hidden Markov chains share the same distribution (characterized by {P_kl} and {π_k}), it stems from an attempt to structure rather rigidly the model we will fit. We could also invoke Ockham's razor, all the more since the hidden structure of the model (a mental construct) is not observed, so that its adequacy to a given distribution cannot be tested (the consequences of this choice can be read off the marginal laws). In this respect, we can safely argue that choosing a parsimonious parametric distribution is good practice. One should bear in mind that the choice of the number of regimes K gives the model family great latitude.
As one example among many, the parameter θ ∈ Θ_3 can be chosen so that the emission marginals of healthy patients have only one common mean, those of vestibular patients have two, and those of hemiplegic patients have three. To conclude, the choice of systematically starting the chains Z^i in the state numbered 1 makes sense. Heuristically, this choice implies the existence of (at least) one reference regime corresponding to postural stance without stimulation, a reasonable hypothesis from the standpoint of biomechanics.

3.2 Estimating the number of regimes

Choosing the number of regimes K is a crucial problem. The literature devoted to estimating the number of components in a mixture (to which our problem belongs) is very vast, and we refer to [5, 7] for a bibliography. We propose here a selection procedure inspired by the celebrated Bayesian Information Criterion (BIC).

Drawing on the pioneering works [11, 22] and directly in line with the development presented in [7] for estimating the number of regimes of an HMM with Gaussian emissions, we indeed obtain a consistency result, stated below.

Write (t)_+ for the positive part of the real number t. Let

  c_{KnN} = −K log( Γ(K/2) / Γ(1/2) ) + K²(K − 1) / (4nN),
  d_{KnN} = (3K/2) log( 5/(2K) + 1/(nN) ) + K / (12nN),

and the cumulated sums C_{KnN} = Σ_{k=1}^K (c_{knN})_+, D_{KnN} = Σ_{k=1}^K (d_{knN})_+. For any arbitrarily fixed α > 2, finally introduce the penalty term

  pen_{KnN} = Σ_{k=1}^K { [(dim(Θ_k) + α) / 2] log nN + k log nN } + C_{KnN} + D_{KnN}.

Write P_0 for the distribution of Y and P_θ for the distribution associated with the parameter θ.
Assume that P_0 ∈ ∪_{K≥1} {P_θ : θ ∈ Θ_K} and let K_0 be the unique integer such that

  P_0 ∈ {P_θ : θ ∈ Θ_{K_0}} \ {P_θ : θ ∈ Θ_{K_0−1}}

(with the convention Θ_0 = ∅). K_0 is called the order of P_0.

Proposition 1. Under the assumptions stated in Sections 3.1 and 3.2, if the parameter spaces Θ_K are compact, then P_0-almost surely,

  K̂ := min argmax_{K≥1} { sup_{θ∈Θ_K} l(θ; y) − pen_{KnN} } = K_0

for n large enough.

What is remarkable about this proposition is that no a priori bound on the order of P_0 is required to guarantee this strong consistency result for K̂, the penalized maximum likelihood estimator. As announced, our penalty pen_{KnN} draws on the BIC criterion, which in our setting would simply read (1/2) dim(Θ_K) log nN. Seen in this light, our penalty is expressed partly, and principally, as a cumulated sum of BIC criteria, and the contribution of this sum to the full penalty is O(K³ log nN). The second contribution to the definition of our penalty (in order of magnitude with respect to K and nN) is the cumulated sum of the terms k log nN, whose contribution to the full penalty is O(K² log nN). It will appear in the proof of Proposition 1 (deferred to the Appendix) that this second cumulated sum allows the penalty to control the maximum of the squared observations, max_{i≤N, t≤n} [Y^i_t]².
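The selection rule of Proposition 1, namely the smallest maximizer over K of the penalized maximized log-likelihood, is straightforward to implement once the values sup_θ l(θ; y) are available for each candidate K. A generic sketch follows; the log-likelihood values and the parameter-count function are invented for illustration, and any penalty (plain BIC as below, or pen_{KnN}) can be plugged in.

```python
import numpy as np

def select_order(loglik, pen):
    """K_hat = smallest maximizer of  sup_theta l(theta; y) - pen(K)  over K.

    loglik: dict {K: maximized log-likelihood}; pen: callable K -> penalty."""
    crit = {K: ll - pen(K) for K, ll in loglik.items()}
    best = max(crit.values())
    return min(K for K, c in crit.items() if c == best)

# Toy illustration with a BIC-type penalty (1/2) dim(Theta_K) log(nN);
# dim_theta is a hypothetical parameter count for the order-K model.
n, N = 272, 71
dim_theta = lambda K: K * (K - 1) + 3 * K
pen_bic = lambda K: 0.5 * dim_theta(K) * np.log(n * N)
loglik = {1: -9000.0, 2: -8500.0, 3: -8420.0, 4: -8410.0}
print(select_order(loglik, pen_bic))  # prints 3
```

Taking the smallest maximizer mirrors the "min argmax" of the proposition: ties are resolved in favor of the more parsimonious model.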
It is worth stressing that the need to control the term max_{i≤N, t≤n} [Y^i_t]² does not, strictly speaking, originate in the proof technique, but rather in the continuous nature of the observations (applied in a setting with discrete observations, the technique does not produce this characteristic term).

To conclude, our experience with the "cumulated BIC" criterion suggests that the orders it estimates coincide with the orders estimated by the BIC criterion when n takes values of the order of several thousand. It has moreover been proved, in a very general HMM framework [16], that the BIC criterion for order estimation is consistent as soon as an a priori bound on the order is known.

4 Contribution to the development of a notion of postural style, part one

Preliminary.

To begin with, we estimate the parameter σ, assumed known, on a dataset independent of the one described in Section 2.4. That dataset corresponds to standing on a force platform with eyes closed and in the absence of stimulations. We thus obtain the value σ = 1.65 (with a standard error of 0.1).


The dataset described in Section 2.4 is split at random into three independent parts of comparable sizes, consisting respectively of the following numbers of healthy, vestibular and hemiplegic patients: (10, 7, 6), (10, 8, 6) and (12, 8, 4). The first subsample is devoted to estimating the order, the second to fitting the model corresponding to the selected order, and the third is reserved for validation purposes.

Admittedly this initial split makes the presentation of the results easier, but the randomness introduced by the random allocation into these three subsamples must be accounted for. To do so, we subsequently applied a bootstrap-type procedure, repeating 100 times the random split and the procedures called fitting and validation below. We do not report the corresponding numerical results, which show good robustness in terms of the selected order (we will argue in favor of the estimator K̂) and of the estimators of the parameters of interest (transition matrices, emission means, and allocation probabilities as introduced in Section 5); the bootstrap standard errors are, however, systematically somewhat larger (typically by about 25%) than those produced by inverting the information matrix.

Fitting.

We report in Table 1 the values of the maximized log-likelihoods l̂_1(k) under the first model for k = 1, ..., 10, together with their penalized versions in the sense of the BIC and "cumulated BIC" criteria.
The order selected by the BIC criterion is K̂_BIC = 6, while the order selected by the "cumulated BIC" criterion is K̂ = 5.

Which of the two should we choose? We fit the models of order 5 and 6 to the second subsample. Inspection of the results gives a clear-cut answer. The maximum likelihood estimator in Θ_{K̂_BIC} exhibits two aberrant values (corresponding to {β_k}) and the observed information matrix is not invertible. By contrast, the fit in Θ_{K̂} makes sense, and the observed information matrix is invertible. We therefore adopt K̂ = 5 and report the corresponding results in Table 3. Figure 1 displays the various emission means.

Note that the first regime of healthy patients differs significantly from the first regimes of the other patients, which in turn do not differ significantly from one another. In this regime, the behavior of healthy patients appears much more concentrated (in the sense of the concentration of a random variable around its mean) than those of the other patients. In other words, the behaviors of vestibular and hemiplegic patients appear much more dispersed than that of healthy patients.

Note also that the behaviors of healthy and hemiplegic patients differ significantly in regime 3, and that the behaviors of healthy and vestibular patients differ significantly in regime 4.

A curiosity, finally: the emission means of the three types of patients are not significantly different in regimes 2 and 5, and these means do not vary significantly from one regime to the other.
The estimated transition matrix shows that regime 2 is of a transitory nature, leading to regime 5 (probability ∼ 90%) and to regime 3 (probability ∼ 10%).

FIGURE 1 – The emission means of the fitted first model of order K̂ = 5. The vertical segments represent the 95% confidence intervals of the emission means for each regime and each type of patient. The segments corresponding to healthy, vestibular and hemiplegic patients are topped with an S, a V and an H, respectively. The point estimates of the means are marked with circles. The vertical bars on the left of the figure give the numerical scale; from left to right, they correspond to the point estimates for healthy, vestibular and hemiplegic patients, respectively.


Validation.

To assess the quality of the fit, we propose to try to classify each member of the validation subsample into one of the three patient groups on the basis of the observed trajectories alone.

Of the 12 healthy patients, 8 are identified as such, 2 as vestibular and 2 as hemiplegic (66% correct identification). Of the 8 vestibular patients, 6 are identified as such, 1 as healthy and 1 as hemiplegic (75% correct identification). Of the 4 hemiplegic patients, two are identified as such, 1 as healthy and 1 as vestibular (50% correct identification). Overall, 66% of the patients are correctly identified.

We find this result disappointing. Perhaps this mixed performance is a hint that the patient types do not characterize the nature of the observed trajectories. In the next section we propose a second model meant to explore the avenue this remark opens.

5 A second hidden Markov model for multiple processes

We argued in Section 4 that the first hidden Markov model for multiple processes built in Section 3 suggests that the three patient groups (healthy, coded by x^T = (0, 0); vestibular, coded by x^T = (1, 0); hemiplegic, coded by x^T = (0, 1)) are not each associated with a characteristic behavior (in terms of response to the medical protocol described in Section 2.3).
In this section we propose an extension of the first model, based on the idea that different behaviors do exist but that they are not rigorously characteristic of the patient groups.

More formally, we assume that each patient i, i = 1, ..., N, has an associated hidden random variable W^i taking values in {1, 2, 3}. Under this new model, there exist up to three types of behavior, and W^i = w if patient i is a representative of group w.

The new model is presented in detail in the next section. The question of order estimation is the subject of the following one. In the rest of this article we write W for the N-dimensional vector of the hidden variables W^i, and w, w^i for the respective realizations.

5.1 Construction

Let ψ = (ψ_vw : v ∈ {(0, 0)^T, (1, 0)^T, (0, 1)^T}, w = 1, 2, 3) ∈ Ψ be an additional parameter with nonnegative coordinates satisfying Σ_{w=1}^3 ψ_vw = 1 for every v. The parameter ψ characterizes the distribution of W according to the groups the patients belong to (that is, according to the x_i). We assume that the W^i are mutually independent and, under a generic parameter ψ ∈ Ψ, that the probability that patient i belongs to group w equals ψ_{x_i w}. Formally,

  P_ψ(W = w) = Π_{i=1}^N ψ_{x_i w^i}.

From now on, under a generic parameter (θ, ψ) ∈ Θ_K × Ψ of a model of order at most K, conditionally on W = w and on Z, the Y^i_t are independent, with Gaussian distribution f(y^i_t | Z^i_t = k, w^i; θ) of variance σ² and mean

  μ^i_{tk} = m_k              if w^i = 1,
  μ^i_{tk} = m_k + (1, 0) β_k if w^i = 2,
  μ^i_{tk} = m_k + (0, 1) β_k if w^i = 3.

The log-likelihood of this second model therefore takes the form

  l(θ, ψ; y) = Σ_{i=1}^N log { Σ_{w^i} ψ_{x_i w^i} ( Σ_{z^i} π_{z^i_1} f(y^i_1 | z^i_1, w^i; θ) Π_{t=2}^n P_{z^i_{t−1}, z^i_t} f(y^i_t | z^i_t, w^i; θ) ) },

where the arguments w^i and z^i in the sums range over {1, 2, 3} and over {1, ..., K}^n under the constraint z^i_1 = 1, respectively.

The model appears as a mixture of models similar to the one introduced in Section 3. Obviously, l(θ, ψ^0; y) = l(θ; y) for every θ ∈ Θ_K if ψ^0 satisfies ψ^0_{v1} = 1 for v = (0, 0)^T, ψ^0_{v2} = 1 for v = (1, 0)^T and ψ^0_{v3} = 1 for v = (0, 1)^T; that is, if the behaviors are rigorously characterized by the groups of membership. Note that the model thus built can again be written as a special case of Example 1 on page 204 of [2].

5.2 Estimating the number of regimes

We still write P_0 for the distribution of Y, and P_{θ,ψ} for the distribution associated with the parameter (θ, ψ). Assume that P_0 ∈ ∪_{K≥1} {P_{θ,ψ} : θ ∈ Θ_K, ψ ∈ Ψ} and let K_0 be the unique integer such that

  P_0 ∈ {P_{θ,ψ} : θ ∈ Θ_{K_0}, ψ ∈ Ψ} \ {P_{θ,ψ} : θ ∈ Θ_{K_0−1}, ψ ∈ Ψ}

(with the convention Θ_0 = ∅).

It turns out that a minor modification of the penalty introduced in Section 3.2 yields a result similar to Proposition 1. Its proof is deferred to the Appendix. Let the complementary penalty term be

  comp_{KnN} = K ( (dim(Ψ)/2) log N + 3 log 2 ).

Proposition 2.
Sous les hypothèses énoncées dans les Sections 5.1 et 5.2, si les espaces <strong>de</strong>paramètres Θ K sont compacts, alors P 0 -presque sûrement,{}̂K := min arg max sup l(θ, ψ;y) − pen KnN − comp KnN = K 0K≥1pour n assez grand.supθ∈Θ K ψ∈ΨJournal <strong>de</strong> la Société Française <strong>de</strong> Statistique, 150(1), 73-100,http://smf.emath.fr/Publications/JSFdS/c○ Société Française <strong>de</strong> Statistique et Société Mathématique <strong>de</strong> France, 2009Journal <strong>de</strong> la Société Française <strong>de</strong> Statistique, 150(1), 73-100,http://smf.emath.fr/Publications/JSFdS/c○ Société Française <strong>de</strong> Statistique et Société Mathématique <strong>de</strong> France, 2009
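The selection rule of Proposition 2 — the smallest maximizer of the penalized log-likelihood — can be sketched as follows. This is a minimal illustration, not the authors' code; the maximized log-likelihoods and penalty values fed to it are hypothetical inputs.

```python
import math  # imported for completeness; only basic arithmetic is needed below

def select_order(max_loglik, pen, comp):
    """Smallest maximizer of K -> max_loglik[K] - pen[K] - comp[K].

    The three arguments map an order K >= 1 to the maximized log-likelihood
    sup l(theta, psi; y), the penalty pen_KnN and the complement comp_KnN
    (hypothetical inputs, for illustration only).
    """
    scores = {K: max_loglik[K] - pen[K] - comp[K] for K in max_loglik}
    best = max(scores.values())
    # "min argmax": among the maximizing orders, keep the smallest one.
    return min(K for K, s in scores.items() if s == best)
```

Taking the minimum mirrors the "min argmax" of Proposition 2: in case of ties, the most parsimonious order wins.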


Chambaz et al., Deux HMMs pour l'élaboration d'une notion de style postural

The remarks that were relevant for the penalty term pen_{KnN} in the first model remain relevant for the completed penalty (that is, the penalty adjusted to the second model) pen_{KnN} + comp_{KnN}. It is worth noting that the BIC criterion would here take the simple form (1/2) dim(Θ_K) log nN + (1/2) dim(Ψ) log N, and that in this respect the completed penalty appearing in Proposition 2 is again expressed partly, and mainly, as a cumulated sum of BIC criteria.

6 Contribution to the development of a notion of postural style, second part

Fitting.

Table 2 reports the maximized log-likelihoods l̂_2(k) under the second model for k = 1, ..., 10, together with their penalized versions in the sense of the BIC and "cumulated BIC" criteria. The order selected by the BIC criterion is K̂_BIC = 6, while the order selected by the "cumulated BIC" criterion is K̂ = 3.

The question of the choice arises again, and we answer it in the same terms: we fit the models of orders 3 and 6 to the second subsample. Inspection of the results gives a categorical answer. The maximum likelihood estimator in Θ_{K̂_BIC} exhibits several aberrant values and the observed information matrix is not invertible; by contrast, the fit in Θ_{K̂} makes sense and the observed information matrix is invertible. We therefore once more adopt K̂ = 3 and report the results in Table 4. Figure 2 displays the various emission means.

Note that the emission means are ranked in the same order within each regime: for all regimes, the first behavior is the one with the least dispersion, the second behavior is intermediate, and the third behavior is characterized by the largest dispersion of the emissions. In the first regime, the mean of the first behavior is significantly different from the other two means. In the second regime, the three means are all significantly different. In the third regime, the mean of the first behavior is significantly different from that of the third.

Finally, reading the estimated probabilities {ψ̂_vw} of allocation to each behavior group according to the patient's type identifies the first behavior as that of the healthy patients, the second as that of the vestibular rather than hemiplegic patients (75% versus 25%), and the third as that of the hemiplegic rather than vestibular patients (66% versus 33%).

Figure 2 – The emission means of the fitted second model of order K̂ = 3. The vertical segments represent the 95% confidence intervals of the emission means for each regime and each type of behavior; the segments corresponding to the first, second and third behaviors are topped by an a, a b and a c respectively. The point estimates of the means are marked by circles. The vertical bars on the left-hand side of the figure give the numerical scale; from left to right, they correspond to the point estimates for behaviors a, b and c respectively.

Extensions.

This work opens up many perspectives for us. We naturally wish to design the procedure that will assign one of the three behaviors to an individual on the basis of the observed trajectory alone. The objective in the slightly longer term is to be able to cluster a set of patients according to the three behaviors. Moreover, we could rightly be reproached for not using the other covariates at our disposal: age, sex, height, weight, profession. This will also be one of our priorities in the near future. Finally, we of course wish to apply this approach to all the trajectories at our disposal, as we observe them (that is, without substituting ∆ for δ).

7 Appendix

The Appendix is divided into four sections: in the first, we present an argument justifying our interest in Y_t = log{(X_{t+1} − X_t)²}; the second gathers numerical aspects; the third is devoted to the proof of Proposition 1 and the fourth to that of Proposition 2.

7.1 A comment on the definition of the Y^i_t

In this article we focus on {Y_t} = {log[(X_{t+1} − X_t)²]}, which we regard as a proxy for the volatility of the process {X_t}. Below, within a nonparametric diffusion model, we develop an argument supporting this interpretation. Although very general, this framework is not compatible with the models built in the article; we nevertheless believe that the argument sheds light on our approach.

An illuminating argument.

Suppose then (in this section only) that the process {X_t} satisfies the stochastic differential equation

    dX_t = b(X_t) dt + σ(X_t) dB_t,

with initial condition X_0, where {B_t} is a Brownian motion, X_0 is a random variable independent of {B_t}, and the drift function b and volatility function σ are of class C² and Lipschitz on R_+.
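Before the formal argument, the approximation this framework is meant to support — for a small sampling step ∆, log{(X_{(i+1)∆} − X_{i∆})²/∆} ≈ log σ²(X_{i∆}) + log ε_i² with ε_i standard Gaussian — can be illustrated by a quick Euler–Maruyama simulation. The drift and volatility functions below are hypothetical choices made for this sketch only, not functions from the article:

```python
import math
import random

def simulate_log_squared_increments(b, sigma, x0, delta, n_steps, seed=0):
    """Euler-Maruyama simulation of dX = b(X)dt + sigma(X)dB, returning pairs
    (log sigma^2(X_{i*delta}), log((X_{(i+1)delta} - X_{i*delta})^2 / delta))."""
    rng = random.Random(seed)
    x, pairs = x0, []
    for _ in range(n_steps):
        eps = rng.gauss(0.0, 1.0)
        x_next = x + b(x) * delta + sigma(x) * math.sqrt(delta) * eps
        pairs.append((math.log(sigma(x) ** 2),
                      math.log((x_next - x) ** 2 / delta)))
        x = x_next
    return pairs

# Illustrative drift/volatility, chosen arbitrarily (not from the article).
b = lambda x: 1.0 - x
sigma = lambda x: 1.0 + 0.5 * math.sin(x)

pairs = simulate_log_squared_increments(b, sigma, x0=1.0, delta=1e-4, n_steps=20000)
gap = sum(y - s for s, y in pairs) / len(pairs)
```

Since E[log ε_i²] = −(γ + log 2) ≈ −1.27 for a standard Gaussian ε_i, the average gap between the two columns should settle near that constant when ∆ is small, confirming that the log squared increments track log σ²(X_{i∆}) up to a known additive noise.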
In particular, there exists a constant K > 0 such that, for all x, y, |σ(x) − σ(y)| ≤ K|x − y| and b²(x) + σ²(x) ≤ K(1 + x²). Diffusion theory then guarantees the existence of a unique solution adapted to the filtration {σ(X_0, B_s : s ≤ t)}. For simplicity, we finally assume that the process {X_t} is stationary and that there exists σ_0 > 0 such that σ(x) ≥ σ_0 for all x.

Let ∆ be a (small) sampling step. Then

    ( (X_{(i+1)∆} − X_{i∆}) / √∆ )²
      = ∆^{-1} ( ∫_{i∆}^{(i+1)∆} b(X_s) ds + ∫_{i∆}^{(i+1)∆} σ(X_s) dB_s )²
      = ∆^{-1} { ( ∫_{i∆}^{(i+1)∆} b(X_s) ds )² + ( ∫_{i∆}^{(i+1)∆} σ(X_s) dB_s )²
                 + 2 ( ∫_{i∆}^{(i+1)∆} b(X_s) ds ) ( ∫_{i∆}^{(i+1)∆} σ(X_s) dB_s ) }
      = σ²(X_{i∆}) ( (B_{(i+1)∆} − B_{i∆}) / √∆ )² + R_i,                                      (1)

for some R_i satisfying, as we show below, E(|R_i|) = O(√∆). Set ε_i = ∆^{-1/2}(B_{(i+1)∆} − B_{i∆}), a standard Gaussian random variable. Then

    log{ ( (X_{(i+1)∆} − X_{i∆}) / √∆ )² } = log{σ²(X_{i∆})} + log{ ε_i² + R_i / σ²(X_{i∆}) },

the second term converging in probability to log{ε_i²} as ∆ tends to 0. The formula above thus brings out the relation between the process {Y_t} and the volatility of {X_t} in our study.

Proof of equality (1).

To begin with, the dominant term comes from the following decomposition:

    ∆^{-1} ( ∫_{i∆}^{(i+1)∆} σ(X_s) dB_s )²
      = σ²(X_{i∆}) ε_i² + ∆^{-1} ( ∫_{i∆}^{(i+1)∆} [σ(X_s) − σ(X_{i∆})] dB_s )²
        + 2 ∆^{-1/2} ε_i σ(X_{i∆}) ( ∫_{i∆}^{(i+1)∆} [σ(X_s) − σ(X_{i∆})] dB_s )
      = σ²(X_{i∆}) ε_i² + R_{i,1} + R_{i,2},

with R_{i,1} ≥ 0. Indeed,

    E(|R_{i,1}|) = ∆^{-1} ∫_{i∆}^{(i+1)∆} E( [σ(X_s) − σ(X_{i∆})]² ) ds
                 ≤ K² ∆^{-1} ∫_{i∆}^{(i+1)∆} E( [X_s − X_{i∆}]² ) ds.                          (2)

Now, for every s ∈ [i∆, (i+1)∆], thanks to the inequality (1/2)(a + b)² ≤ a² + b²,

    (1/2) E( [X_s − X_{i∆}]² ) ≤ E[ ( ∫_{i∆}^{s} b(X_u) du )² ] + E[ ( ∫_{i∆}^{s} σ(X_u) dB_u )² ],   (3)

where, by the Cauchy–Schwarz inequality and b²(x) ≤ K(1 + x²),

    E[ ( ∫_{i∆}^{s} b(X_u) du )² ] ≤ (s − i∆) E( ∫_{i∆}^{s} b²(X_u) du ) ≤ K ∆² (1 + E(X_0²)),        (4)

while

    E[ ( ∫_{i∆}^{s} σ(X_u) dB_u )² ] = E( ∫_{i∆}^{s} σ²(X_u) du ) ≤ K ∆ (1 + E(X_0²)).               (5)

Plugging (4) and (5) into (3), then the resulting inequality into (2), we obtain E(|R_{i,1}|) = O(∆). Moreover, the Cauchy–Schwarz inequality gives

    [E(|R_{i,2}|)]² ≤ 4 E( σ²(X_{i∆}) ε_i² ) E(|R_{i,1}|),

where E(σ²(X_{i∆}) ε_i²) = E(σ²(X_0)) is finite by stationarity and the independence of ε_i from X_{i∆}; it obviously follows that E(|R_{i,2}|) = O(√∆).

The nonnegative residual term R_{i,3} = ∆^{-1} ( ∫_{i∆}^{(i+1)∆} b(X_s) ds )² is in turn controlled in expectation thanks to (4), with E(|R_{i,3}|) = O(∆). As for the last residual term

    R_{i,4} = 2 ∆^{-1} ( ∫_{i∆}^{(i+1)∆} b(X_s) ds ) ( ∫_{i∆}^{(i+1)∆} σ(X_s) dB_s ),

we invoke once more the Cauchy–Schwarz inequality to obtain

    [E(|R_{i,4}|)]² ≤ 4 ∆^{-2} E[ ( ∫_{i∆}^{(i+1)∆} b(X_u) du )² ] E[ ( ∫_{i∆}^{(i+1)∆} σ(X_u) dB_u )² ],

the inequalities (4) and (5) then guaranteeing that E(|R_{i,4}|) = O(√∆).

In summary, the residual term R_i = R_{i,1} + R_{i,2} + R_{i,3} + R_{i,4} indeed satisfies E(|R_i|) = O(√∆), as announced.

7.2 Numerical aspects

Most often, the parameters of an HMM are estimated with the Expectation-Maximization (EM) algorithm. We prefer here an algorithm that maximizes the exact likelihood, computed from its expression in terms of matrix products.

Let A_{i1} be the vector with coordinates π_k f(y^i_1|k; θ) and, for t = 2, ..., n, let A_{it} be the matrix with entries P_{kl} f(y^i_t|l; θ). We write 1 for the K-dimensional vector with all coordinates equal to 1.
The log-likelihood of the first model can equivalently be written as

    l(θ; y) = ∑_{i=1}^{N} log { (A_{i1})^T ∏_{t=2}^{n} A_{it} 1 }.

The matrices do not commute, so the product must of course be read as A_{i2} × ... × A_{in}.

The score can also be computed exactly. Indeed, let B_{i1} = I, the K × K identity matrix, and let B_{is} be the matrix product ∏_{t=2}^{s} A_{it} for s = 2, ..., n. Let ϑ be any one of the coordinates of θ and ∂_ϑ the operator of differentiation with respect to ϑ. The following recursion, valid for every s ≥ 2,

    ∂_ϑ{B_{is}} = ∂_ϑ{B_{i(s−1)} A_{is}} = ∂_ϑ{B_{i(s−1)}} A_{is} + B_{i(s−1)} ∂_ϑ{A_{is}},

yields the derivative ∂_ϑ{B_{in}} from the simple expressions of the ∂_ϑ{A_{it}}. It then follows immediately that

    ∂_ϑ l(θ; y) = ∑_{i=1}^{N} [ ( (∂_ϑ{A_{i1}})^T B_{in} + (A_{i1})^T ∂_ϑ{B_{in}} ) 1 ] / [ (A_{i1})^T B_{in} 1 ].

This numerical approach adapts without any difficulty to the form of the log-likelihood of the second model. The likelihood maximization is thus carried out, from several initial values, by applying the quasi-Newton BFGS method [24] as implemented in R [27].

7.3 Proof of Proposition 1

Proposition 1 is a corollary of the so-called "mixture" inequality stated in Proposition 3, which compares the maximized log-likelihood with the mixture statistic defined below.

For the sake of simplicity, we will use in the sequel the generic notation f(u) for the density (or discrete mass) of the random variable (or vector) U, and f(u|v) for its conditional version given V = v. For instance, we thus have f(z^i_{t+1}|z^i_t; θ) = P_{z^i_t z^i_{t+1}}.

Let τ > 0 be a constant to be chosen later and let K be a fixed number of regimes. Up to a reparametrization, we may assume here that the parameter θ consists of the {P_kl}, the {π_k}, the {m^1_k} (the K emission means for the healthy patients, associated with w = 1), the {m^2_k} (those for the vestibular patients, associated with w = 2) and the {m^3_k} (those for the hemiplegic patients, associated with w = 3). Let ν_K be the prior distribution on Θ_K characterized by the following properties:
– under ν_K, {P_kl}, {m^1_k}, {m^2_k} and {m^3_k} are independent;
– under ν_K, π_1 = 1 (all the Markov chains Z^i start at 1);
– under ν_K, the vectors {P_{1l}}, ..., {P_{Kl}} are i.i.d. with Dirichlet(1/2, ..., 1/2) distribution;
– under ν_K, m^w_1, ..., m^w_K are i.i.d. centered Gaussian with variance τ², for w = 1, 2, 3.

The mixture statistic of order K is finally defined by

    q_K(Y) = ∫_{Θ_K} f(Y; θ) ν_K(dθ).                                                          (6)
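The matrix-product evaluation of the log-likelihood described in Section 7.2 can be sketched as follows. This is a minimal pure-Python illustration of the formula l(θ; y) = Σ_i log{(A_{i1})^T ∏_t A_{it} 1}, not the authors' R implementation; it also omits the rescaling usually needed to avoid underflow on long series.

```python
import math

def gaussian_density(y, mean, var):
    """Density of N(mean, var) at y."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def hmm_loglik_one_series(y, pi, P, means, var):
    """log{(A_1)^T A_2 ... A_n 1} for one series, computed by a forward pass.

    A_1 has coordinates pi[k] f(y_1|k) and A_t has entries P[k][l] f(y_t|l);
    carrying a running row vector gives the same value as the full products.
    """
    K = len(pi)
    alpha = [pi[k] * gaussian_density(y[0], means[k], var) for k in range(K)]
    for t in range(1, len(y)):
        alpha = [sum(alpha[k] * P[k][l] for k in range(K))
                 * gaussian_density(y[t], means[l], var) for l in range(K)]
    return math.log(sum(alpha))

def hmm_loglik(series, pi, P, means, var):
    # Sum over the N independent patients.
    return sum(hmm_loglik_one_series(y, pi, P, means, var) for y in series)
```

As a sanity check, with K = 1 the value reduces to the i.i.d. Gaussian log-likelihood, and for small K and n it agrees with a brute-force sum over all hidden paths.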


Proposition 3. For every number of regimes K ≥ 1,

    0 ≤ sup_{θ∈Θ_K} log [ f(Y; θ) / q_K(Y) ]
      ≤ (dim(Θ_K)/2) log nN + (3K/(2τ²)) max_{i≤N,t≤n} [Y^i_t]²
        + c_{KnN} + (3K/2) log ( τ²/(3Kσ²) + 1/(nN) ).                                         (7)

Let us introduce some notation for the exposition of the proof of Proposition 3. The statistics q_K(y|z) and q_K(z) are defined from (6) by substituting f(y|z; θ) or f(z; θ), respectively, for f(Y; θ). We write I_w for the set of indices i such that patient i is of type w, T^i_k = {t ≤ n : z^i_t = k}, N^w_k = ∑_{i∈I_w} card(T^i_k), and ȳ^w_k for the empirical mean of the y^i_t over i ∈ I_w and t ∈ T^i_k.

Proof of Proposition 3. Thanks to the inequality ∑_{k≤K} α_k / ∑_{k≤K} β_k ≤ max_{k≤K} α_k/β_k (valid for all nonnegative {α_k} and all positive {β_k}), and since

    f(y; θ) = ∑_z f(y|z; θ) f(z; θ)   and   q_K(y) = ∑_z q_K(y|z) q_K(z)

(the argument z in the two sums above, and in the rest of the proof, ranges over ({1, ..., K}^n)^N with the initial conditions z^i_1 = 1), it appears that

    sup_{θ∈Θ_K} log [ f(y; θ) / q_K(y) ]
      ≤ sup_{θ∈Θ_K} max_z { log [ f(z; θ) / q_K(z) ] + log [ f(y|z; θ) / q_K(y|z) ] }.          (8)

Each of the two terms requires a specific treatment. On the one hand, we obtain

    sup_{θ∈Θ_K} max_z log [ f(z; θ) / q_K(z) ]
      ≤ (K(K−1)/2) log nN − K log [ Γ(K/2) / Γ(1/2)^K ] + K²(K−1)/(4nN) + K/(12nN),             (9)

thanks to the choice of the prior on the {P_kl} (see [10]) and to a Robbins–Stirling approximation. On the other hand, for fixed z,

    f(y|z; θ) ≤ f(y|z; θ̂)
      = (2πσ²)^{−nN/2} exp { −(1/(2σ²)) ∑_{w=1}^{3} ∑_{k=1}^{K} ( ∑_{i∈I_w, t∈T^i_k} [y^i_t]² − N^w_k [ȳ^w_k]² ) }   (10)

uniformly in θ ∈ Θ_K (θ̂ is the maximum likelihood estimator), while a computation without any particular difficulty leads to the equality

    q_K(y|z) = (2πσ²)^{−nN/2} ∏_{w=1}^{3} ∏_{k=1}^{K} { (1 + τ²N^w_k/σ²)^{−1/2}
               × exp [ −(1/(2σ²)) ( ∑_{i∈I_w, t∈T^i_k} [y^i_t]² − N^w_k [ȳ^w_k]² / (1 + σ²/(τ²N^w_k)) ) ] }.         (11)

Combining (10) and (11), it follows that

    log [ f(y|z; θ) / q_K(y|z) ]
      ≤ ∑_{w=1}^{3} ∑_{k=1}^{K} { (1/2) log(1 + τ²N^w_k/σ²) + (N^w_k [ȳ^w_k]² / (2σ²)) / (1 + τ²N^w_k/σ²) }
      ≤ ∑_{w=1}^{3} ∑_{k=1}^{K} (1/2) log(1 + τ²N^w_k/σ²) + (3K/(2τ²)) max_{i≤N,t≤n} [y^i_t]²
      ≤ (3K/2) log(1 + τ²nN/(3Kσ²)) + (3K/(2τ²)) max_{i≤N,t≤n} [y^i_t]²,                       (12)

the last inequality following from a convexity argument. The final result is the fruit of (9) and (12). ∎

It is important to note that the first two terms on the right-hand side of inequality (7) dominate the upper bound. They moreover dominate it at the same rate with respect to nN, since (see Lemma 3 of [7]), in our framework,

    P_0 { max_{i≤N,t≤n} [Y^i_t]² ≥ 5σ² log nN } ≤ (nN)^{−3/2}.                                 (13)

Thus, the right-hand side of (13) being the general term of a convergent series in n, the Borel–Cantelli lemma guarantees that, P_0-almost surely, max_{i≤N,t≤n} [Y^i_t]² ≤ 5σ² log nN for n large enough.

We are now in a position to prove:

Proposition 4. P_0-almost surely, K̂ ≤ K_0 for n large enough.

Proof. Let Ω_{nN} be the event whose probability is bounded in (13). We have already argued that P_0{lim sup_n Ω_{nN}} = 0, so it suffices to consider the probability of the event [K̂ > K_0] ∩ Ω^c_{nN}.

Let K > K_0 and P_0 = P_{θ_0} for some θ_0 ∈ Θ_{K_0}. If K̂ = K, then

    l(θ_0; Y) ≤ sup_{θ∈Θ_{K_0}} l(θ; Y) ≤ sup_{θ∈Θ_K} l(θ; Y) + pen_{K_0 nN} − pen_{KnN}
              ≤ log q_K(Y) + ∆_{KnN}

by virtue of Proposition 3 (with τ² = 15σ²/2 — note that this value has no particular significance), where

    ∆_{KnN} = pen_{K_0 nN} − pen_{KnN} + (dim(Θ_K)/2) log nN
              + (K/(5σ²)) max_{i≤N,t≤n} [Y^i_t]² + c_{KnN} + d_{KnN}
            ≤ −(α(K − K_0)/2) log nN + (K/(5σ²)) ( max_{i≤N,t≤n} [Y^i_t]² − 5σ² log nN ).


Thus, on Ω^c_{nN}, ∆_{KnN} ≤ −(α(K − K_0)/2) log nN. The main argument is a change of probability:

    P_0 { [K̂ = K] ∩ Ω^c_{nN} }
      = ∫ [ f(y; θ_0) / q_K(y) ] 1{ [K̂ = K] ∩ Ω^c_{nN} } q_K(y) dy
      ≤ ∫ exp { −(α(K − K_0)/2) log nN } q_K(y) dy = (nN)^{−α(K−K_0)/2}.

Consequently,

    P_0 { [K̂ > K_0] ∩ Ω^c_{nN} } = O( (nN)^{−α/2} ),

and, α being conveniently chosen larger than 2, a second application of the Borel–Cantelli lemma allows us to conclude. ∎

In addition, it also turns out that, by virtue of a law-of-large-numbers type result:

Proposition 5. P_0-almost surely, K̂ ≥ K_0 for n large enough.

Proof. It obviously suffices to prove that, for every K < K_0, P_0-almost surely K̂ ≠ K for n large enough. Now, if K̂ = K < K_0, then sup_{θ∈Θ_K} l(θ; Y) ≥ l(θ_0; Y) + o(n) (with θ_0 ∈ Θ_{K_0} such that P_{θ_0} = P_0). Besides, θ ↦ l(θ; y) is uniformly continuous on Θ_K, and the modulus of continuity can be chosen independent of y. The space Θ_K being compact, there exists for every ε > 0 a finite, deterministic collection {θ_j} ⊂ Θ_K such that sup_{θ∈Θ_K} l(θ; Y) ≤ max_j l(θ_j; Y) + nε.

Now, Lemma 8 of [7] (a result à la Shannon–Breiman–McMillan, itself derived from Theorems 2 and 3 of [20]) guarantees that, for each θ_j, n^{−1}[l(θ_j; Y) − l(θ_0; Y)] converges P_0-almost surely to some ε_j < ε_0 < 0, the upper bound ε_0 being deterministic. The choice ε = −ε_0/2 then allows us to conclude. ∎

Proposition 1 is the combination of Propositions 4 and 5.

7.4 Proof of Proposition 2

Proposition 2 is a corollary of the "mixture" inequality stated in Proposition 6, in which one will recognize an evolution of Proposition 3.

Let K be a fixed number of regimes. We complete the prior distribution ν_K on Θ_K into a prior distribution ν_K ⊗ ν on Θ_K × Ψ, with the following characterization of ν:
– under ν, {ψ_{(0,0)^T w}}, {ψ_{(1,0)^T w}} and {ψ_{(0,1)^T w}} are i.i.d. with Dirichlet(1/2, 1/2, 1/2) distribution.

The new mixture statistic of order K is defined by

    q̄_K(Y) = ∫_{Θ_K×Ψ} f(Y; θ, ψ) ν_K(dθ) ν(dψ).

It is to this statistic that the new maximized log-likelihood is compared in the following proposition.

Proposition 6. For every number of regimes K ≥ 1,

    0 ≤ sup_{θ∈Θ_K} sup_{ψ∈Ψ} log [ f(Y; θ, ψ) / q̄_K(Y) ]
      ≤ (dim(Θ_K)/2) log nN + (dim(Ψ)/2) log N + (3K/(2τ²)) max_{i≤N,t≤n} [Y^i_t]²
        + c_{KnN} + (3K/2) log ( τ²/(3Kσ²) + 1/(nN) ) + 3 log 2.                               (14)

This time, the statistics q̄_K(y|z,w), q̄_K(z) and q̄_K(w) are obtained from the definition of q̄_K by substituting f(y|z,w; θ, ψ), f(z; θ) or f(w; ψ), respectively, for f(Y; θ, ψ). Note that q_K(z) = q̄_K(z).

Proof. We rely this time on the two equalities

    f(y; θ, ψ) = ∑_w ∑_z f(y|z,w; θ, ψ) f(z; θ) f(w; ψ),
    q̄_K(y) = ∑_w ∑_z q̄_K(y|z,w) q̄_K(z) q̄_K(w)

(here and in the sequel, the arguments w and z range over {1, 2, 3}^N and ({1, ..., K}^n)^N respectively, with the initial conditions z^i_1 = 1) to bring out that

    sup_{θ∈Θ_K} sup_{ψ∈Ψ} log [ f(y; θ, ψ) / q̄_K(y) ]
      ≤ sup_{ψ∈Ψ} max_w { log [ f(w; ψ) / q̄_K(w) ] }
        + sup_{θ∈Θ_K} sup_{ψ∈Ψ} max_w max_z { log [ f(z; θ) / q̄_K(z) ] + log [ f(y|z,w; θ, ψ) / q̄_K(y|z,w) ] }.   (15)

An obvious adaptation of the definitions of I_w, T^i_k, N^w_k and ȳ^w_k introduced in Section 7.3 transposes the proof of Proposition 3 and yields the control of the second term on the right-hand side of (15).

Thanks to the choice of the prior on ψ, one also shows that, for every ψ ∈ Ψ and every w,

    log [ f(w; ψ) / q̄_K(w) ] ≤ log { 8 (N_{(0,0)} + 1/2)(N_{(1,0)} + 1/2)(N_{(0,1)} + 1/2) },

where N_{(0,0)}, N_{(1,0)} and N_{(0,1)} are the respective numbers of healthy, vestibular and hemiplegic patients. By concavity of the logarithm, the above bound is smaller than 3 log N + 3 log 2 + 3 log(1/3 + 1/(2N)) ≤ 3 log N + 3 log 2, which concludes the proof. ∎

The proof of Proposition 2 is easily carried out in the same spirit as that of Proposition 1, but starting from Proposition 6 rather than Proposition 3. We omit the details of this easy adaptation.
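The bound on log[f(w; ψ)/q̄_K(w)] rests on a classical property of Dirichlet(1/2, ..., 1/2) mixtures of multinomial likelihoods: for a single 3-cell multinomial with counts summing to N, the ratio sup_ψ f/q is at most 2(N + 1/2), the per-group factor appearing in the product 8 ∏(N_v + 1/2). The sketch below checks this numerically; it is an illustration written for this note, not code from the article.

```python
from math import lgamma, log

def log_max_multinomial(counts):
    """sup over the simplex of the log-likelihood of a sequence with these counts."""
    n = sum(counts)
    return sum(c * log(c / n) for c in counts if c > 0)

def log_dirichlet_mixture(counts, alpha=0.5):
    """log of the Dirichlet(alpha, ..., alpha) mixture of the multinomial likelihood."""
    n, k = sum(counts), len(counts)
    return (lgamma(k * alpha) - k * lgamma(alpha)
            + sum(lgamma(c + alpha) for c in counts) - lgamma(n + k * alpha))

def log_ratio(counts):
    return log_max_multinomial(counts) - log_dirichlet_mixture(counts)

# For three cells and counts summing to N, the ratio is at most 2N + 1;
# equality is attained when all N observations fall into a single cell.
```

The corner case is easy to verify in closed form: with counts (N, 0, 0) the maximized likelihood equals 1 while the mixture equals 1/(2N + 1), so the ratio is exactly 2N + 1.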


Acknowledgements

This work benefits from the financial support of Université Paris Descartes (through its Bonus Qualité Recherche 2008 program) and of the CNRS (through its interdisciplinary program Longévité et vieillissement for the year 2008).

The authors thank Adélaïde Marquer (Hôpital Fernand-Widal) and Céline Quhen (LNRS and Hôpital Lariboisière) for the medical part, and Benjamin Favetto (MAP5), Valentine Genon-Catalot (MAP5), Yves Rozenholc (MAP5) and Adeline Samson (MAP5) for the statistical part. They are also grateful to the reviewers for very pertinent remarks on the first version of the manuscript.

    k              1          2          3          4          5          6          7          8          9          10
    l̂_1(k)     -20964.45  -15095.46  -14302.02  -14118.07  -13895.69  -13790.21  -13787.56  -13731.94  -13695.73  -13686.34
    BIC         -20977.56  -15130.42  -14367.58  -14222.96  -14048.66  -14000.00  -14062.91  -14081.59  -14128.42  -14210.82
    cumul. BIC  -20996.42  -15190.43  -14499.59  -14466.54  -14451.02  -14616.52  -14959.15  -15331.86  -15815.76  -16427.00

Table 1 – Maximized log-likelihoods for the first model (written l̂_1(k) for the model of order k) and their versions penalized by the BIC and "cumulated BIC" criteria.

    k              1          2          3          4          5          6          7          8          9          10
    l̂_2(k)     -16154.30  -14806.19  -14182.05  -14095.74  -13876.33  -13776.84  -13760.86  -13715.88  -13680.90  -13664.92
    BIC         -16176.82  -14850.56  -14257.02  -14210.04  -14038.71  -13996.04  -14045.62  -14074.94  -14123.00  -14198.80
    cumul. BIC  -16197.76  -14924.13  -14414.09  -14490.16  -14489.09  -14672.07  -15012.86  -15407.69  -15904.30  -16520.44

Table 2 – Maximized log-likelihoods for the second model (written l̂_2(k) for the model of order k) and their versions penalized by the BIC and "cumulated BIC" criteria.


    {P̂_kl} = ⎡ 0.917  0.030  0.053  0.000  0.000 ⎤
              ⎢ 0.000  0.000  0.111  0.000  0.889 ⎥
              ⎢ 0.204  0.000  0.089  0.243  0.463 ⎥
              ⎢ 0.009  0.000  0.045  0.920  0.026 ⎥
              ⎣ 0.017  0.752  0.090  0.092  0.049 ⎦

    k                      1               2               3               4               5
    m̂_k              -4.058 ± 0.090  -2.837 ± 0.189  -8.725 ± 0.182  -1.639 ± 0.979  -2.703 ± 0.185
    m̂_k + (1,0)β̂_k    0.404 ± 0.223  -3.386 ± 0.517  -7.917 ± 0.404  -0.959 ± 0.206  -3.019 ± 0.457
    m̂_k + (0,1)β̂_k    0.314 ± 0.204  -2.711 ± 0.439  -7.301 ± 0.391  -1.405 ± 0.218  -3.106 ± 0.403

Table 3 – Results of the fit of the first model of order 5. The coefficients are rounded to the third decimal place. The standard deviations appear after the ± symbol.

    {P̂_kl} = ⎡ 0.860  0.039  0.101 ⎤
              ⎢ 0.026  0.929  0.045 ⎥
              ⎣ 0.556  0.345  0.099 ⎦

    k                      1               2               3
    m̂_k              -3.764 ± 0.078  -1.925 ± 0.065  -8.672 ± 0.179
    m̂_k + (1,0)β̂_k   -2.833 ± 0.168  -0.734 ± 0.138  -7.687 ± 0.375
    m̂_k + (0,1)β̂_k   -2.428 ± 0.190   0.741 ± 0.152  -7.040 ± 0.405

    w                  1       2       3
    ψ̂_{(0,0)^T w}     1     0.000   0.000
    ψ̂_{(1,0)^T w}     0     0.750   0.250
    ψ̂_{(0,1)^T w}     0     0.667   0.333

Table 4 – Results of the fit of the second model of order 3. The coefficients are rounded to the third decimal place. The standard deviations appear after the ± symbol.


References

[1] P. S. Albert, H. F. McFarland, M. E. Smith, and J. A. Frank. Time series for modelling counts from a relapsing-remitting disease: application to modelling disease activity in multiple sclerosis. Statistics in Medicine, (24):453–466, 1994.
[2] R. MacKay Altman. Mixed hidden Markov models: an extension of the hidden Markov model to the longitudinal data setting. J. Amer. Statist. Assoc., 102(477):201–210, 2007.
[3] J.-M. Bardet and P. Bertrand. Identification of the multiscale fractional Brownian motion with biomechanical applications. J. Time Ser. Anal., 28(1):1–52, 2007.
[4] P. Bertrand, J.-M. Bardet, M. Dabonneville, A. Mouzat, and P. Vaslin. Automatic determination of the different control mechanisms in upright position by a wavelet method. IEEE Engineering in Medicine and Biology Society, 2:1163–1166, 2001.
[5] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell., 22(7):719–725, 2000.
[6] O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Series in Statistics. Springer, New York, 2005.
[7] A. Chambaz, A. Garivier, and E. Gassiat. A minimum description length approach to hidden Markov models with Poisson and Gaussian emissions. Application to order identification. J. Statist. Plann. Inference, pages 1–16, 2008.
[8] L. Chiari, A. Cappello, D. Lenzi, and U. Della Croce. An improved technique for the extraction of stochastic parameters from stabilograms. Gait Posture, 12(3):225–234, 2000.
[9] J. J. Collins and C. J. De Luca. Open-loop and closed-loop control of posture: a random-walk analysis of center-of-pressure trajectories. Exp. Brain Res., 1993.
[10] L. D. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace. Efficient universal noiseless source codes. IEEE Trans. Inform. Theory, 27(3):269–279, 1981.
[11] L. Finesso. Consistent estimation of the order for Markov and hidden Markov chains. PhD thesis, University of Maryland, 1991.
[12] T. D. Frank, A. Daffertshofer, and P. J. Beek. Multivariate Ornstein-Uhlenbeck processes with mean-field dependent coefficients: application to postural sway. Phys. Rev. E, 63(1):011905, 2000.
[13] P. A. Fransson, A. Hafström, M. Karlberg, M. Magnusson, A. Tjäder, and R. Johansson. Postural control adaptation during galvanic vestibular and vibratory proprioceptive stimulation. IEEE Trans. Biomed. Eng., 50(12):1310–1319, 2003.
[14] P. A. Fransson, R. Johansson, A. Hafström, and M. Magnusson. Methods for evaluation of postural control adaptation. Gait Posture, 12(1):14–24, 2000.
[15] P. A. Fransson, E. K. Kristinsdottir, A. Hafström, M. Magnusson, and R. Johansson. Balance control and adaptation during vibratory perturbations in middle-aged and elderly humans. Eur. J. Appl. Physiol., 91(5-6):595–603, 2004.
[16] E. Gassiat. Likelihood ratio inequalities with applications to various mixtures. Ann. Inst. H. Poincaré Probab. Statist., 38(6):897–906, 2002. In honor of J. Bretagnolle, D. Dacunha-Castelle and I. Ibragimov.
[17] T. Koski. Hidden Markov models for bioinformatics, volume 2 of Computational Biology Series. Kluwer Academic Publishers, Dordrecht, 2001.
[18] J. Laurens and J. Droulez. Bayesian processing of vestibular information. Biol. Cybern., 96(4):389–404, 2007.
[19] J. C. Lepecq, C. De Waele, S. Mertz-Josse, C. Teyssèdre, P. T. Huy, P. M. Baudonnière, and P.-P. Vidal. Galvanic vestibular stimulation modifies vection paths in healthy subjects. J. Neurophysiol., 95(5):3199–3207, 2006.
[20] B. G. Leroux. Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl., 40(1):127–143, 1992.
[21] S. E. Levinson, L. R. Rabiner, and M. M. Sondhi. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. Bell System Tech. J., 62:1035–1074, 1983.
[22] C. C. Liu and P. Narayan. Order estimation and sequential universal data compression of a hidden Markov source by the method of mixtures. Canad. J. Statist., 30(4):573–589, 1994.
[23] T. Mergner, G. Schweigart, C. Maurer, and A. Blümle. Human postural responses to motion of real and virtual visual environments under different support base conditions. Exp. Brain Res., 167(4):535–556, 2005.
[24] J. C. Nash. Compact numerical methods for computers: linear algebra and function minimisation. John Wiley & Sons Inc., New York, 1979. A Halsted Press Book.
[25] K. M. Newell, S. M. Slobounov, E. S. Slobounova, and P. C. Molenaar. Stochastic processes in postural center-of-pressure profiles. Exp. Brain Res., 113(1):158–164, 1997.
[26] R. A. Olshen, E. N. Biden, M. P. Wyatt, and D. H. Sutherland. Gait analysis and the bootstrap. Ann. Statist., 17(4):1419–1440, 1989.
[27] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. ISBN 3-900051-07-0.
[28] A. M. Sabatini. A statistical mechanical analysis of postural sway using non-Gaussian FARIMA stochastic models. IEEE Trans. Biomed. Eng., 47(9):1219–1227, 2000.


The International Journal of Biostatistics
Volume 7, Issue 1, 2011, Article 11

Targeting the Optimal Design in Randomized Clinical Trials with Binary Outcomes and No Covariate: Simulation Study

Antoine Chambaz, Laboratoire MAP5, Université Paris Descartes and CNRS
Mark J. van der Laan, University of California, Berkeley

Abstract

We undertake here a comprehensive simulation study of the theoretical properties that we derive in a companion article devoted to the asymptotic study of adaptive group sequential designs in the case of randomized clinical trials (RCTs) with binary treatment, binary outcome and no covariate. By adaptive design, we mean in this setting a RCT design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. By adaptive group sequential design, we refer to the fact that group sequential testing methods can be equally well applied on top of adaptive designs.

The simulation study validates the theory.
It notably shows, in the estimation framework, that the confidence intervals we obtain achieve the desired coverage even for moderate sample sizes. In addition, it shows, in the testing framework, that type I error control at the prescribed level is guaranteed and that all sampling procedures only suffer from a very slight increase of the type II error.

A three-sentence take-home message is: "Adaptive designs do learn the targeted optimal design, and inference and testing can be carried out under adaptive sampling as they would under iid sampling from the targeted optimal randomization probability. In particular, adaptive designs achieve the same efficiency as the fixed oracle design. This is confirmed by a simulation study, at least for moderate or large sample sizes, across a large collection of targeted randomization probabilities."

KEYWORDS: adaptive design, coarsening at random, group sequential testing, maximum likelihood, randomized clinical trial

Author Notes: This collaboration took place while Antoine Chambaz was a visiting scholar at UC Berkeley, supported in part by a Fulbright Research Grant and the CNRS. The authors thank the reviewers for their interesting comments.

Recommended Citation: Chambaz, Antoine and van der Laan, Mark J. (2011) "Targeting the Optimal Design in Randomized Clinical Trials with Binary Outcomes and No Covariate: Simulation Study," The International Journal of Biostatistics: Vol. 7: Iss. 1, Article 11. DOI: 10.2202/1557-4679.1310. Available at: http://www.bepress.com/ijb/vol7/iss1/11

©2011 Berkeley Electronic Press. All rights reserved.


Chambaz and van <strong>de</strong>r Laan: Targeting the Optimal Design in RCTs: Simulation Study1 IntroductionThe present article and its companion (Chambaz and van <strong>de</strong>r Laan, 2010b) are<strong>de</strong>votedtotheasymptoticstudyofadaptivegroupsequential<strong>de</strong>signsinthecaseofrandomized clinical trials with binary treatment, binary outcome and no covariate,focusinghereonitsstudybysimulationsandthereon itstheoretical<strong>de</strong>velopment.By adaptive <strong>de</strong>sign, we mean in this setting a clinical trial <strong>de</strong>sign that allowsthe investigator to dynamically modify its course through data-driven adjustmentof the randomization probability based on data accrued so far. We assumethat the protocol specifies a user-supplied optimal unknown choice of randomizationscheme. We consi<strong>de</strong>r here that randomization scheme which minimizes theasymptoticvarianceofour maximumlikelihoo<strong>de</strong>stimator(MLE)of theparameterof interest, known as the Neyman allocation, which is interesting because minimizingthe asymptotic variance of our estimator guarantees narrower confi<strong>de</strong>nceintervals and earlier <strong>de</strong>cision to reject the null for its alternative or not. Yet, targetingthis treatment mechanism in real-life clinical trials my raise ethical issues,since this may result in more patients assigned to the inferior treatment arm. Butwe emphasize that there is nothing special about targeting the Neyman allocation,the whole methodology applying equally well to any choice of targeted treatmentmechanism. 
By adaptive group sequential design, we refer to the fact that group sequential testing methods can be equally well applied on top of adaptive designs.

This article builds upon the seminal technical report (van der Laan, 2008), which paves the way to robust and efficient estimation in randomized clinical trials thanks to adaptation of the design in a variety of settings. A more detailed presentation of adaptive group sequential designs and an overview of the literature can be found in (Chambaz and van der Laan, 2010b, Section 1). In the latter article, we obtain that the adaptive design converges almost surely to the targeted unknown randomization scheme; we derive strong consistency and asymptotic normality results for the MLE of the parameter of interest; we finally investigate the theoretical properties of a group sequential testing procedure. In the present article, the comprehensive simulation study that we undertake validates the theory:

• estimation framework: notably showing that the confidence intervals that we obtain achieve the desired coverage even for moderate sample sizes;
• testing framework: notably showing that type I error control at the prescribed level is guaranteed, and that all sampling procedures only suffer from a very slight increase of the type II error.

A three-sentence take-home message is: "Adaptive designs do learn the targeted optimal design, and inference and testing can be carried out under adaptive sampling as they would under iid sampling from the targeted optimal randomization probability.
Inparticular, adaptive <strong>de</strong>signs achieve the same efficiency as the fixed oracle <strong>de</strong>sign.Thisisconfirmedbyasimulationstudy,atleastformo<strong>de</strong>rateorlargesamplesizes,across alargecollectionoftargeted randomizationprobabilities.”The article is organized as follows. In Section 2, we briefly <strong>de</strong>fine the targetedoptimal <strong>de</strong>sign, <strong>de</strong>scribe how to adapt to it and summarize the results of theasymptoticstudycarried out in thecompanion article(Chambaz and van <strong>de</strong>r Laan,2010b). We present the results of the simulation study in Sections 3 and 4. Section3is<strong>de</strong>dicatedtotheinvestigationofmo<strong>de</strong>rateandlargesamplesizepropertiesof the adaptive <strong>de</strong>sign methodology with respect to estimation and assessment ofuncertainty. Section4is<strong>de</strong>dicatedtotheperformancesoftheadaptive<strong>de</strong>signgroupsequentialtestingmethodology. Inbothsections,ourdata-adaptivemethodologyisappliedto alargecollectionofproblems.Finally, in or<strong>de</strong>r to ease the reading, we highlight throughout the text themost important results. We point out in which terms the simulations validate theconstruction of confi<strong>de</strong>nce intervals and the group sequential testing procedurewhile targeting the optimal <strong>de</strong>sign and thus accruing observations data-adaptively.Moreover, we compare the performances of the targeted optimal <strong>de</strong>sign samplingscheme with those of the oracle iid sampling scheme (i.e. the targeted scheme).Five highlights (numbered from 3 since two highlights appear in (Chambaz andvan<strong>de</strong>rLaan, 2010b))arescattered in thearticle, respectivelyentitled3. empiricalvalidationofcentrallimittheorem (Section 3.2),4. empiricalcoverageof theconfi<strong>de</strong>nceintervals(Section 3.3),5. empiricalwidthsofconfi<strong>de</strong>nceintervals(Section 3.4),6. 
empiricaltypeIandtypeIIerrors(Section 4.2),and7. empiricalsamplesizesat<strong>de</strong>cision(Section 4.3).2 Targetingtheoptimal<strong>de</strong>signIn this section, we briefly summarize the methodology for targeting the optimal<strong>de</strong>sign in randomized clinical trials with binary treatment, binary outcome and nocovariateasitis<strong>de</strong>velopedandstudiedtheoreticallyin(Chambazandvan<strong>de</strong>rLaan,2010b). Werefer to thelatterfor<strong>de</strong>tails.2.1 Statisticalframework.Weconsi<strong>de</strong>rthesimplestexampleofrandomizedtrials,whereanobservationwritesasO =(A,Y),AbeingabinarytreatmentofinterestandY abinaryoutcomeofinter-Published by Berkeley Electronic Press, 20111http://www.bepress.com/ijb/vol7/iss1/11DOI: 10.2202/1557-4679.13102


Chambaz and van <strong>de</strong>r Laan: Targeting the Optimal Design in RCTs: Simulation StudyThe International Journal of Biostatistics, Vol. 7 [2011], Iss. 1, Art. 11est. We postulatetheexistenceofafulldatastructureX = (X(0),X(1))containingthe two counterfactual (or potential) outcomes un<strong>de</strong>r the two possible treatments.TheobserveddatastructureO=(A,X(A)) = (A,Y)onlycontainstheoutcomecorrespondingto the treatment the experimental unit did receive. Therefore O is amissingdatastructureonX withmissingnessvariableA. We<strong>de</strong>notetheconditionalprobability distributions of treatment A by g(a|x) = P(A = a|X = x). It complieswith the coarsening at random (abbreviated to CAR) assumption: g(a|x) = g(a)for all a ∈ {0,1},x ∈ {0,1} 2 . The distribution P X of the full data structure X hastwo marginal Bernoulli distributions characterized by θ = (θ 0 ,θ 1 ) ∈]0,1[ 2 withθ 0 =E PX X(0)and θ 1 =E PX X(1)(theonlyi<strong>de</strong>ntifiablepart ofP X ). SincethelikelihoodofOequals[θA Y(1 − θ A) 1−Y ]g(A),wecan saythattheobserveddatastructureO isobtainedun<strong>de</strong>r (θ,g).Say that the parameter of scientific interest is the log-relative risk Ψ(θ) =logθ 1 −logθ 0 . The theory of semiparametric statistics teaches us that the efficientasymptotic variance un<strong>de</strong>r (θ,g) is minimized, as a function of g, at the optimaltreatmentmechanismg ⋆ (θ) characterized by√g ⋆ θ0 (1 − θ 1 )(θ)(1) = √θ0 (1 − θ 1 )+ √ θ 1 (1 − θ 0 ) . (1)The optimaltreatment is known in the literature as theNeyman allocation (Hu andRosenberger,2006,page13). Interestingly,g ⋆ (θ)(1) ≤ 1 2 whenever θ 0 ≤ θ 1 ,meaningthattheNeymanallocationg⋆ (θ)favorstheinferiortreatment. 
The corresponding optimal efficient asymptotic variance v⋆(θ) then satisfies

    v⋆(θ) = [√((1 − θ₀)/θ₀) + √((1 − θ₁)/θ₁)]² ≤ 2[(1 − θ₀)/θ₀ + (1 − θ₁)/θ₁] = v^b(θ),

where v^b(θ) denotes the efficient asymptotic variance associated with the standard balanced treatment mechanism characterized by g^b(1) = 1/2, hence the relative efficiency criterion R(θ) = v⋆(θ)/v^b(θ) ∈ (1/2, 1].

2.2 Targeted optimal design adaptive sampling, estimation and group sequential testing.

We denote by A_i, X_i = (X_i(0), X_i(1)), Y_i = X_i(A_i) and O_i = (A_i, Y_i) the treatment assignment, full data structure, outcome, and observation for experimental unit i. In addition, we set O_n = (O_1, …, O_n) and O_n(i) = (O_1, …, O_i) (with O_n(0) = ∅). The random variables X_1, …, X_n are assumed iid.

The iid g^b-balanced and iid g⋆(θ)-optimal sampling schemes are fully characterized by the fact that the random variables A_1, …, A_n are independently distributed from either g^b or g⋆(θ), respectively. As for the adaptive data-generating mechanisms targeting the optimal design that we consider in this article, they are best presented by recursion.

Thus, let us set g⋆_1(·|O_n(0)) = g^b. One starts by sampling A_1 from the Bernoulli distribution with parameter g⋆_1(·|O_n(0)), hence the first observation O_1. Assume now that one has sampled O_n(i) with i ≥ 1.
The MLE θ_i = (θ_{i,0}, θ_{i,1}) of θ based on O_n(i) is such that θ_{i,a} = (Σ_{j=1}^{i} Y_j 1{A_j = a}) / (Σ_{j=1}^{i} 1{A_j = a}), as if O_1, …, O_i were iid, with θ_{i,a} = 1/2 as long as Σ_{j=1}^{i} 1{A_j = a} = 0 by convention. It yields the MLE Ψ_i = Ψ(θ_i) of Ψ(θ) and the estimate g⋆(θ_i) of the optimal design g⋆(θ), to which we apply a thresholding in order to avoid the adaptive design stopping a treatment arm with probability tending to one: thus we define

    g⋆_{i+1}(1|O_n(i)) = min{1 − δ, max{δ, g⋆(θ_i)(1)}}   (2)

(sometimes abbreviated to g⋆_{i+1}(1)), where δ > 0 is chosen small enough to guarantee that δ < min_{a∈A} g⋆(θ)(a) (or, equivalently, 1 − δ > max_{a∈A} g⋆(θ)(a)). This characterizes the (random) element g⋆_{i+1}(·|O_n(i)) of the set of fixed CAR designs, from which one samples A_{i+1}, hence the next observation O_{i+1}, hence O_n by recursion. Since the likelihood of O_n equals Π_{i=1}^{n} θ_{A_i}^{Y_i} (1 − θ_{A_i})^{1−Y_i} × Π_{i=1}^{n} g⋆_i(A_i), we can say that O_n is obtained under (θ, g⋆_n)-adaptive sampling, where g⋆_n = (g⋆_1, …, g⋆_n).

An alternative adaptive sampling scheme can be defined similarly. It is characterized iteratively as above by substituting g^a_1(·|O_n(0)) = g^b for g⋆_1(·|O_n(0)) and

    g^a_{i+1}(1|O_n(i)) = argmin_{γ ∈ [δ, 1−δ]} { (1/(i+1)) [Σ_{j=1}^{i} g^a_j(1|O_n(j−1)) + γ] − g⋆(θ_i)(1) }²

for g⋆_{i+1}(1|O_n(i)), for each i ≥ 1. This alternative choice aims at obtaining a balance between the two treatment arms which, at sample size i, closely approximates g⋆(θ), in the sense that (1/i) Σ_{j=1}^{i} 1{A_j = 1} ≃ g⋆(θ_i)(1), the current best guess. This second definition is more aggressive in the pursuit of the optimal treatment mechanism, as it tries to compensate on the fly for early sub-optimal sampling.
In this case, we say that O_n is obtained under (θ, g^a_n)-adaptive sampling, where g^a_n = (g^a_1, …, g^a_n).

It is seen in (Chambaz and van der Laan, 2010b, Section 4, Theorems 1 and 2, Highlight 1) that, under (θ, g⋆_n)-adaptive sampling, the adaptive design g⋆_n converges almost surely to g⋆(θ), that the MLE Ψ_n converges almost surely to Ψ(θ), and that √n (Ψ_n − Ψ(θ)) converges to a centered Gaussian distribution with asymptotic variance consistently estimated (as if sampling were iid) with


Chambaz and van <strong>de</strong>r Laan: Targeting the Optimal Design in RCTs: Simulation Studys 2 n = 1 nn∑i=1()(Y i − θ n,0 ) 21l{A i =0}θn,0 2 g⋆ n(0) 2 +(Y i − θ n,1 ) 21l{A i =1}θn,1 2 g⋆ n(1) 2 . (3)Thus,theconfi<strong>de</strong>nceinterval [Ψ n ± sn √ nξ 1−α/2 ]has asymptoticcoverage (1 − α).Moreover, a so-called targeted optimal<strong>de</strong>sign group sequential testing procedureof“Ψ(θ)= ψ 0 ”against“Ψ(θ) > ψ 0 ”basedonamultidimensionalt-statisticof the form ( √ N k (Ψ Nk − ψ 0 )/s Nk ) k≤K (for some integer K ≥ 1 and well-chosenstopping times (N k ) k≤K ) is <strong>de</strong>scribed and asymptotically studied in (Chambaz andvan<strong>de</strong>rLaan, 2010b,Section 5, Theorem3, Highlight2).3 Simulation study of the performances of targetedoptimal<strong>de</strong>signadaptiveestimationInthissection,wecarryoutasimulationstudyoftheperformancesoftargetedoptimal<strong>de</strong>sign adaptiveprocedures in terms of estimation and uncertainty assessment.The two main questions at stake are “Do the confi<strong>de</strong>nce intervals obtained un<strong>de</strong>rthetargeted optimal<strong>de</strong>signadaptivesamplingschemeguaranteethe<strong>de</strong>sired coverage?”and “How well do they compare with the intervals we would obtain un<strong>de</strong>rthetargeted optimaliidsamplingscheme?”WecarefullypresentthesimulationschemeinSection3.1. Wevalidatewithsimulationsthecentrallimittheoremthatwe<strong>de</strong>rivedtheoreticallyin(Chambazandvan<strong>de</strong>rLaan, 2010b,Section4,Theorem2). ThesectionculminatesinSection3.3with the investigation of the covering properties of the confi<strong>de</strong>nce intervals basedonthedata-drivensamplingschemes. 
Then,weconsi<strong>de</strong>rtheperformancesintermsof widths of the confi<strong>de</strong>nce intervals in Section 3.4, Section 3.5 finally containingan illustrationoftheprocedure.3.1 The simulationscheme.Define ε =0.1and the ε-net Θ 0 = {(iε, jε):1≤i≤ j ≤9}overtheset {(θ 0 ,θ 1 ):ε ≤ θ 0 ≤ θ 1 ≤1 − ε}. It has cardinality #Θ 0 =45. The log-relative risk functionΨ maps Θ 0 onto the set Ψ(Θ 0 ) ⊂ [0;2.1973], see Table 1, which is well <strong>de</strong>scribedby its cumulative distribution function (cdf) plotted in Figure 1. The set R(Θ 0 ) ⊂[0.6097;1] is presented in Table 2. It is also interesting to look in Figure 2 at theleft-handplotof {(Ψ(θ),R(θ)):θ ∈ Θ 0 }. All θ ∈ Θ 0 whichareonthediagonalareassociated with a log-relativerisk Ψ(θ) =0 and arelativeefficiency R(θ) =1 andare therefore represented by the single point (0,1). It is also seen in the left-handplot of Figure 2 that the relative efficiency R(θ) can be significantly lower than 1evenwhen Ψ(θ)is notlarge.The International Journal of Biostatistics, Vol. 7 [2011], Iss. 1, Art. 
θ₀\θ₁   .1     .2     .3     .4     .5     .6     .7     .8     .9
.1       0   0.693  1.099  1.386  1.609  1.792  1.946  2.079  2.197
.2       -     0    0.405  0.693  0.916  1.099  1.253  1.386  1.504
.3       -     -      0    0.288  0.511  0.693  0.847  0.981  1.099
.4       -     -      -      0    0.223  0.405  0.560  0.693  0.811
.5       -     -      -      -      0    0.182  0.336  0.470  0.588
.6       -     -      -      -      -      0    0.154  0.288  0.405
.7       -     -      -      -      -      -      0    0.134  0.251
.8       -     -      -      -      -      -      -      0    0.118
.9       -     -      -      -      -      -      -      -      0

Table 1: Values of Ψ(θ) for θ ∈ Θ₀ (with precision 10⁻³).

θ₀\θ₁   .1     .2     .3     .4     .5     .6     .7     .8     .9
.1       1   0.962  0.904  0.850  0.800  0.753  0.708  0.662  0.610
.2       -     1    0.982  0.945  0.900  0.850  0.796  0.735  0.662
.3       -     -      1    0.988  0.958  0.916  0.862  0.796  0.708
.4       -     -      -      1    0.990  0.962  0.916  0.850  0.753
.5       -     -      -      -      1    0.990  0.958  0.900  0.800
.6       -     -      -      -      -      1    0.988  0.945  0.850
.7       -     -      -      -      -      -      1    0.982  0.904
.8       -     -      -      -      -      -      -      1    0.962
.9       -     -      -      -      -      -      -      -      1

Table 2: Values of R(θ) for θ ∈ Θ₀ (with precision 10⁻³).

Table 3 and the two right-hand plots in Figure 2 are even more interesting, because our search of efficiency relies, for each θ ∈ Θ₀, on targeting its optimal treatment mechanism g⋆(θ). In Table 3 we report the various optimal proportions of treated g⋆(θ)(1). In the two right-hand plots in Figure 2, we represent the optimal proportion of treated g⋆(θ)(1) against the log-relative risk Ψ(θ) (middle plot) and against the relative efficiency R(θ) (rightmost plot). Table 3 and the rightmost plot in Figure 2 both illustrate the closed-form equality

    g⋆(θ)(1) = ½ (1 − √(R(θ)⁻¹ − 1)),

which can be easily derived from (1).
The above equality, related table and figure teach us that more significant gains in terms of relative efficiency R(θ) correspond to smaller optimal proportions of treated g⋆(θ)(1).

Set the sequence of sample sizes n = (100, 250, 500, 750, 1000, 2500, 5000). For every θ ∈ Θ₀, we estimate M = 1000 times the log-relative risk Ψ(θ) based on O^m_{n₇}(n_i), m = 1, …, M, i = 1, …, 7, under

• iid (θ, g^b)-balanced sampling,
• iid (θ, g⋆(θ))-optimal sampling,
• (θ, g⋆_{n₇})-adaptive sampling,
• (θ, g^a_{n₇})-adaptive sampling.

We choose δ = 0.01 in (2).
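To make the simulation scheme concrete, here is a minimal, self-contained Python sketch of one trial under (θ, g⋆_n)-adaptive sampling: treatments are drawn from the current thresholded estimate (2) of the Neyman allocation, and the trial returns the MLE of Ψ(θ) together with the variance estimate (3). This is our own illustration, not the authors' code; in particular, the clamping of the θ-estimates away from 0 and 1 is our safeguard so that (1) stays well defined at early steps.

```python
import math
import random

def neyman(t0, t1):
    """Optimal probability of assigning A = 1, eq. (1)."""
    a = math.sqrt(t0 * (1 - t1))
    b = math.sqrt(t1 * (1 - t0))
    return a / (a + b)

def adaptive_trial(theta, n, delta=0.01, seed=1):
    """One simulated trial of size n under (theta, g*_n)-adaptive sampling."""
    rng = random.Random(seed)
    count, success = [0, 0], [0, 0]
    g1 = 0.5  # g*_1 = g^b, the balanced design
    for _ in range(n):
        A = 1 if rng.random() < g1 else 0        # draw treatment from current design
        Y = 1 if rng.random() < theta[A] else 0  # draw the binary outcome
        count[A] += 1
        success[A] += Y
        # current MLE of (theta_0, theta_1), with the 1/2 convention for empty arms
        t = [success[a] / count[a] if count[a] else 0.5 for a in (0, 1)]
        # clamp the estimates away from 0 and 1 (our safeguard), then apply (2)
        t0, t1 = (min(1 - delta, max(delta, x)) for x in t)
        g1 = min(1 - delta, max(delta, neyman(t0, t1)))
    psi = math.log(t[1]) - math.log(t[0])        # MLE of Psi(theta)
    g = (1 - g1, g1)                             # final design estimate, used as g*_n in (3)
    s2 = sum(count[a] * (1 - t[a]) / (t[a] * g[a] ** 2) for a in (0, 1)) / n
    return psi, s2, g1
```

A 95% confidence interval for Ψ(θ) then reads psi ± 1.96 √(s2/n), mimicking the interval [Ψ_n ± (s_n/√n) ξ_{1−α/2}] of Section 2.2.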


Chambaz and van <strong>de</strong>r Laan: Targeting the Optimal Design in RCTs: Simulation StudyThe International Journal of Biostatistics, Vol. 7 [2011], Iss. 1, Art. 11cdf of Ψ(Θ0) at ψ0.0 0.2 0.4 0.6 0.8 1.00.0 0.5 1.0 1.5 2.0ψFigure1: Cumulative distribution function ψ ↦→ 1#Θ 0∑ θ ∈Θ0 1l{Ψ(θ) ≤ ψ}.θ .1 .2 .3 .4 .5 .6 .7 .8 .9.1 0.500 0.400 0.337 0.290 0.250 0.214 0.179 0.143 0.100.2 - 0.500 0.433 0.380 0.333 0.290 0.247 0.200 0.143.3 - - 0.500 0.445 0.396 0.348 0.300 0.247 0.179.4 - - - 0.500 0.449 0.400 0.348 0.290 0.214.5 - - - - 0.500 0.449 0.396 0.333 0.250.6 - - - - - 0.500 0.445 0.380 0.290.7 - - - - - - 0.500 0.433 0.337.8 - - - - - - - 0.500 0.400.9 - - - - - - - - 0.500Table3: Values of g ⋆ (θ)(1) for θ ∈ Θ 0 (withprecision 10 −3 ).3.2 Empiricaldistribution of maximumlikelihood estimates.In (Chambaz and van <strong>de</strong>r Laan, 2010b, Theorem 2) we proved that a central limitresult holds for Ψ n when targeting the optimal <strong>de</strong>sign, as it is obviously the caseun<strong>de</strong>r iid sampling. In or<strong>de</strong>r to check by simulations that remarkable property andto<strong>de</strong>terminehowquicklythelimitisreached,weproposethefollowingprocedure.Testing the empirical distribution ofmaximum likelihoo<strong>de</strong>stimates.For every θ ∈ Θ 0 , all types of sampling, and each sample size n i , we compare theempiricaldistributionofthe(centered andrescaled) estimatorsof Ψ(θ)√ (ni Ψni (O m nZ(θ) ni,m = 7(n i )) − Ψ(θ) )√ , m =1,...,Mv(θ)R(θ)0.6 0.7 0.8 0.9 1.00.0 0.5 1.0 1.5 2.0Ψ(θ)g ⋆ (θ)(1)0.1 0.2 0.3 0.4 0.50.0 0.5 1.0 1.5 2.0Ψ(θ)g ⋆ (θ)(1)0.1 0.2 0.3 0.4 0.50.6 0.7 0.8 0.9 1.0Figure 2: Plots of {(Ψ(θ),R(θ)) : θ ∈ Θ 0 } (left), {(Ψ(θ),g ⋆ (θ)(1)) : θ ∈ Θ 0 } (middle)and {(R(θ) g ⋆ (θ)(1)): θ ∈ Θ }(right)., 0(wherev(θ) =v b (θ)un<strong>de</strong>rbalancediidsamplingandv(θ) =v ⋆ (θ)otherwise)withitsstandardnormaltheoreticallimitdistributionintermsoftwo-si<strong>de</strong>dKolmogorov-Smirnov goodness-of-fit test. 
This results in a collection of independent p-values {P(θ)^{clt}_{n_i} : θ ∈ Θ₀, i = 1, …, 7} which are uniformly distributed under the null hypothesis stating that all the Z(θ)_{n_i,m} follow the standard normal distribution.

Under the null, {P(θ)^{clt}_{n_i} : θ ∈ Θ₀} contains iid copies of the uniform distribution over [0;1] for every i = 1, …, 7. This statement can be tested in terms of a one-sided Kolmogorov-Smirnov goodness-of-fit procedure, the alternative stating that these iid random variables are stochastically smaller than a uniform random variable, hence 7 final p-values for each sampling scheme, as reported in Table 4.

[Table 4: final p-values for each sampling scheme and each sample size n₁, …, n₇; only the header row of the table is recoverable here.]
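The first stage of this validation rests on the two-sided Kolmogorov-Smirnov statistic. As a self-contained illustration (our own code, not the article's; plain standard normal draws stand in for the standardized estimators Z(θ)_{n_i,m}), the statistic is small for genuinely Gaussian samples and large otherwise:

```python
import math
import random

def std_normal_cdf(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ks_statistic(sample, cdf):
    """Two-sided Kolmogorov-Smirnov statistic sup_x |F_n(x) - F(x)|."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for k, x in enumerate(xs):
        fx = cdf(x)
        d = max(d, abs((k + 1) / n - fx), abs(k / n - fx))
    return d

rng = random.Random(0)
# Stand-ins for the M standardized estimates Z(theta)_{n_i, m}:
z_normal = [rng.gauss(0.0, 1.0) for _ in range(1000)]
z_uniform = [rng.random() for _ in range(1000)]  # deliberately non-Gaussian

d_normal = ks_statistic(z_normal, std_normal_cdf)    # small: the sample is Gaussian
d_uniform = ks_statistic(z_uniform, std_normal_cdf)  # large, about 0.5
```

In the article, each such statistic is turned into a p-value P(θ)^{clt}_{n_i}, and the 45 p-values per sample size are in turn compared to the Uniform[0;1] distribution with a one-sided Kolmogorov-Smirnov test.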


Chambaz and van <strong>de</strong>r Laan: Targeting the Optimal Design in RCTs: Simulation Studythecentral limittheoremisnot rejected un<strong>de</strong>r• iidg b -balanced samplingforanysamplesizen i ≥n 4 =750,• iidg ⋆ -optimalsamplingforanysamplesizen i ≥n 3 =500,• g ⋆ n-adaptivesamplingforanysamplesizen i ≥n 3 =500,• g a n -adaptivesamplingforanysamplesizen i ≥n 6 =2500,adjustingformultipletestingintermsoftheBenjaminiandYekutieliprocedureforcontrollingtheFalseDiscoveryRateat level5%.Less formally, the Gaussian limit theoretically guaranteed by the centrallimittheoremisreached un<strong>de</strong>riidg ⋆ -optimalandg ⋆ n -adaptivesamplingschemesassoonas 500 observationsare accrued. Thelimitisreached as soon as 750observationsare collected when consi<strong>de</strong>ring the iid g b -balanced sampling scheme. This isa very satisfying result for the g ⋆ n-adaptive sampling scheme. On the contrary, thelimitisreachedforasurprisinglylargeminimalsamplesizeun<strong>de</strong>rg a n-adaptivesamplingscheme.This said,we areless interested intheminimalsamplesizerequiredto reach the Gaussian limit than in the minimal sample size required to guaranteethe<strong>de</strong>siredcoveragepropertiestoourconfi<strong>de</strong>nceintervals. Thecoveragepropertiesofourconfi<strong>de</strong>nceintervalsareinvestigatedinSection 3.3. In conclusion,Highlight 3 (empirical validation of central limit theorem). In view of (Chambazand van <strong>de</strong>r Laan, 2010b, Theorem 2), the convergence of √ n(Ψ n − Ψ(θ)) to itslimit Gaussian distribution un<strong>de</strong>r (θ,g ⋆ n)-adaptive sampling scheme is empiricallyreached as soon as 500 observations are accrued. 
This is as good as what we get under the iid (θ, g⋆)-optimal sampling scheme.

Illustrating the convergence.

To give a sense of how well the standard normal limit distribution is reached, it is interesting to consider, for each adaptive sampling scheme and for the corresponding first sample size for which the central limit theorem is not rejected, that empirical cdf which is the farthest from the standard normal limit cdf. How far an empirical cdf is from the standard normal cdf is measured in terms of the p-value of the two-sided Kolmogorov-Smirnov goodness-of-fit test. For sample size n₃ = 500 (the first sample size for which the central limit theorem is not rejected under the g⋆_n-adaptive sampling scheme; that first sample size is n₆ = 2500 for the g^a_n-adaptive sampling scheme), it is also interesting to compare the worst empirical cdf obtained under the g⋆_n-adaptive sampling scheme to the worst empirical cdf obtained under the g^a_n-adaptive sampling scheme.

Thus, we represent in Figure 3 (left) the empirical cdf of the sequence (Z(θ⁻)_{n₃,m})_{m≤M} with θ⁻ = argmin_{θ∈Θ₀} P(θ)^{clt}_{n₃} under adaptive (θ⁻, g⋆_{n₃}) sampling. We obtain θ⁻ = (0.3, 0.9) (for which Ψ(θ⁻) = 1.0986). Even though P(θ⁻)^{clt}_{n₃} ≃ 0.0017, the empirical cdf and its limit are almost superposable.

Similarly, we represent in Figure 3 (middle) the empirical cdf of the sequence (Z(θ′⁻)_{n₆,m})_{m≤M} associated with θ′⁻ = argmin_{θ∈Θ₀} P(θ)^{clt}_{n₆} under adaptive (θ′⁻, g^a_{n₆}) sampling. We obtain θ′⁻ = (0.1, 0.9) (for which Ψ(θ′⁻) = 2.1972). Again, the empirical cdf and its limit are almost superposable.

Finally, we also represent in Figure 3 (right) the empirical cdf of the sequence (Z(θ⁻)_{n₃,m})_{m≤M}, that of the sequence (Z(θ″⁻)_{n₃,m})_{m≤M} associated with θ″⁻ = argmin_{θ∈Θ₀} P(θ)^{clt}_{n₃} under adaptive (θ″⁻, g^a_{n₃}) sampling, that is before the asymptotic distribution is reached for that design, and their common limit.
We obtain θ″⁻ = θ′⁻ = (0.1, 0.9). A logarithmic scale is used on the y-axis in order to enhance the differences occurring at the left tail. The Z(θ″⁻)_{n₃,m}'s are seen to be stochastically (empirically) larger than the Z(θ′⁻)_{n₃,m}'s, themselves slightly stochastically (empirically) larger than a standard normal random variable.

Figure 3: Giving a sense of how well the standard normal limit distribution is reached under each adaptive sampling scheme. Left: under the g⋆_n-adaptive sampling scheme and for sample size n₃ = 500, empirical cdf F_{n₃} (solid line) of the sequence (Z(θ⁻)_{n₃,m})_{m≤M} whose empirical distribution is the farthest from its limit standard normal distribution; the reference limit cdf F₀ is also plotted (dashed). Middle: under the g^a_n-adaptive sampling scheme and for sample size n₆ = 2500, empirical cdf F′_{n₆} (solid line) of the sequence (Z(θ′⁻)_{n₆,m})_{m≤M} whose empirical distribution is the farthest from its limit standard normal distribution; the reference limit cdf F₀ is also plotted (dashed). Right: empirical cdf F_{n₃} (solid line; the same as in the leftmost graph), empirical cdf F″_{n₃} (dotted line) of the sequence (Z(θ″⁻)_{n₃,m})_{m≤M} obtained under the g^a_n-adaptive sampling scheme, whose empirical distribution is the farthest from its limit standard normal distribution, and their common limit cdf F₀ (dashed). In this last graph only, we use a logarithmic scale on the y-axis in order to enhance the differences at the left tail.


Chambaz and van <strong>de</strong>r Laan: Targeting the Optimal Design in RCTs: Simulation StudyThe International Journal of Biostatistics, Vol. 7 [2011], Iss. 1, Art. 113.3 Empiricalcoverageof the confi<strong>de</strong>nce intervals.We invoke the central limit theorem (Chambaz and van <strong>de</strong>r Laan, 2010b, Theorem2) in or<strong>de</strong>r to construct confi<strong>de</strong>nce intervals for the log-relative risk. The empiricalvalidation of the theorem presented in Section 3.2 also provi<strong>de</strong>s us with anindirect validation of the coverage properties of those confi<strong>de</strong>nce intervals. Howeverit is interesting to test directly if the coverage requirements are satisfied. Obviously,Section3.3isthemostimportantsubsectionofSection 3.empirical coverage0.86 0.88 0.90 0.92 0.94 0.96 0.98Testing the empirical coverageofthe confi<strong>de</strong>nce intervals.Set α =5%. For every θ ∈ Θ 0 , all types of sampling, every iteration m and eachsamplesizen i , we estimatetheasymptoticvarianceof theMLE Ψ ni (O m n 7(n i )) withs(θ) 2 n i,m (see(3))and buildtheconfi<strong>de</strong>nce interval[I (θ) ni,m = Ψ ni (O m n 7(n i )) ± s(θ) ]n√ i,mξ 1−α/2 niwhere ξ 1−α/2 is the (1 − α/2)-quantileif thestandard normal distribution. 
We areinterested in the empirical coverage guaranteed by I (θ) ni,m (its width are consi<strong>de</strong>redin Section 3.4).Empiricalcoverageofintervals I (θ) ni,m,m =1,...,M, thatisproportionsc(θ) ni = 1 MM∑m=11l{Ψ(θ) ∈ I (θ) ni,m}, θ ∈ Θ 0 ,i =1,...,7,are reported in (Chambaz and van <strong>de</strong>r Laan, 2010a), see Tables 12 and 13 (fori =1,2,3,4 and i =5,6,7 respectively) for iid g b -balanced sampling, in Tables 14and 15 (for i = 1,2,3,4 and i = 5,6,7 respectively) for iid g ⋆ -optimal sampling,in Tables 16 and 17 (for i = 1,2,3,4 and i = 5,6,7 respectively) for g ⋆ n-adaptivesampling, and in Tables 18 and 19 (for i =1,2,3,4 and i =5,6,7 respectively) forg a n -adaptivesampling.Because those tables are very <strong>de</strong>nse, we invite the rea<strong>de</strong>r to skim throughthem and rather comment on Figure 4 before testing if the empirical coverage behavesasitshould.In Figure 4, the leftmost boxplot at each sample size (associated to iid g b -balanced sampling scheme) serves as a benchmark. There is no striking differencebetween them and the corresponding boxplots associated to iid g ⋆ -optimal sampling.Surprisingly, a rather good coverage is guaranteed at sample sizes n 1 =100,n 2 = 250, i.e. even before the central limit theorem is empirically validatedn 1 n 2 n 3 n 4 n 5 n 6 n 7sample sizeFigure 4: Boxplots representing the empirical coverage proportions {c(θ) ni : θ ∈ Θ 0 } fori =1,...,7(each samplesize) an<strong>de</strong>achsampling scheme: from lefttorightateachsamplesize, iid g b -balanced, iid g ⋆ -optimal, g ⋆ n-adaptive and g a n−adaptive sampling schemes. Everybox features a solid horizontal line showing the mean value, its bottom and top limitscorresponding to the first and third quartiles. Its whiskers extend to the most extreme datapoint which isno more than 1.5 timesthe interquartile range. Anhorizontal line indicatingthe aimed level 95% isad<strong>de</strong>d.(seeSection3.2). 
In contrast, the boxplots associated with the adaptive designs reveal a very poor empirical coverage at the smallest sample sizes n_1 = 100 and n_2 = 250. When the sample size is larger than or equal to n_3, the boxplots associated with the adaptive designs illustrate an empirical coverage that compares well with that of the independent designs. This is in agreement with the empirical validation of the central limit theorem for the g⋆_n-adaptive design, but not for the g^a_n-adaptive design.

More rigorously now, the independent rescaled empirical coverage proportions {M c(θ)_{n_i} : θ ∈ Θ_0} should be distributed according to the Binomial distribution with parameter (M, 1 − a) with a = α, for every i = 1, …, 7. This property can be tested in terms of our tailored test (Chambaz and van der Laan, 2010a, Section A.3), the alternative stating that a > α. This results in a collection of 7 p-values for each sampling scheme, as reported in Table 5.

Empirical validation of the coverage of the confidence intervals.

Considering each sampling scheme (i.e. each row of Table 5) separately, we conclude that the (1 − α)-coverage cannot be declared defective under

• iid g_b-balanced sampling for any sample size n_i,
• iid g⋆-optimal sampling for any sample size n_i,

Published by Berkeley Electronic Press, 2011
http://www.bepress.com/ijb/vol7/iss1/11 (DOI: 10.2202/1557-4679.1310)


Table 5 (excerpt): p-values of the tests of empirical coverage, for each sampling scheme (rows) and sample size (columns).

sampling scheme     n_1     n_2     n_3     n_4     n_5     n_6     n_7
iid g_b-balanced    0.498   0.966   0.923   0.247   0.369   0.045   0.925
iid g⋆-optimal      0.995   0.981   0.769   0.533   0.958   0.586   0.007
g⋆_n-adaptive       p …
[remaining entries truncated in the source]
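The p-values of Table 5 come from the tailored test of (Chambaz and van der Laan, 2010a, Section A.3), which is not reproduced here. As an illustrative stand-in, an exact one-sided binomial test of the non-coverage rate against the alternative a > α can be sketched as follows (function names are hypothetical):

```python
from math import comb

def binom_upper_pvalue(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * (p ** j) * ((1 - p) ** (n - j)) for j in range(k, n + 1))

def coverage_defect_pvalue(miss_count, M=1000, alpha=0.05):
    """One-sided p-value for H0: non-coverage rate a = alpha
    against H1: a > alpha, given the number of intervals missing the truth."""
    return binom_upper_pvalue(miss_count, M, alpha)
```

With M = 1000 and α = 5%, observing 50 misses (exactly the expected count) yields a large p-value, whereas 80 misses yields a very small one, flagging defective coverage.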


Testing the empirical widths of confidence intervals.

Now, the latter qualitative comments are backed by quantitative results obtained in a testing framework. On the one hand, the widths {|I(θ)_{n_i,m}| : m = 1, …, M} of the M confidence intervals obtained under iid g⋆-optimal sampling provide us with an empirical counterpart of a benchmark distribution of optimal width for sample size n_i. On the other hand, the distributions of the widths of the confidence intervals at sample size n_i obtained under both adaptive sampling schemes are the empirical counterparts of two distributions which may be close to the empirical benchmark distribution (at least, the theory teaches us that the empirical distributions under the iid g⋆-optimal and g⋆_n-adaptive sampling schemes converge, as the sample size increases, to the same Dirac probability distribution). Similarly, the rescaled widths {√R(θ) |I(θ)_{n_i,m}| : m = 1, …, M} of the M confidence intervals obtained under iid g_b-balanced sampling give rise to the empirical counterpart of a distribution which should be close to the empirical benchmark distribution (at least again, the theory teaches us that the empirical distributions under the iid optimal and balanced sampling schemes converge, as the sample size increases, to the same Dirac probability distribution).

Therefore we can test at each sample size and across Θ_0, in terms of our tailored test for comparison of widths (Chambaz and van der Laan, 2010a, Section A.4), whether for each intermediate sample size the two distributions of widths under both adaptive sampling schemes coincide with the benchmark distribution (null hypothesis), rather than being stochastically larger (alternative hypothesis).
This yields 14 p-values, all almost equal to one; see Table 6. In other words, no matter the sample size, we cannot conclude that the widths of the confidence intervals obtained under either adaptive sampling scheme are larger than their counterparts obtained under iid g⋆-optimal sampling.

Regarding the comparison of the iid g_b-balanced and g⋆-optimal sampling schemes, we can test at each sample size and across Θ_0, in terms of our tailored test for comparison of widths (Chambaz and van der Laan, 2010a, Section A.4), whether for each intermediate sample size the distribution of rescaled widths under the iid g_b-balanced sampling scheme coincides with the benchmark distribution (null hypothesis), rather than being stochastically smaller (alternative hypothesis). This yields 7 p-values, all smaller than 10^{-6}. In other words, no matter the sample size, we can conclude that the widths of the confidence intervals obtained under the iid g⋆-optimal sampling scheme are stochastically larger than their rescaled (by the corresponding factor √R(θ)) counterparts obtained under iid g_b-balanced sampling for some θ ∈ Θ_0. This is not very surprising: rescaling is meant here to adjust the means, the variances for instance being possibly still different for some θ ∈ Θ_0.

However, we can slightly adapt the procedure we just presented, rescaling more modestly by a sub-optimal factor. We now compare, in the same terms, the empirical benchmark distributions of optimal width with the empirical distributions of {√R(θ)^ρ |I(θ)_{n_i,m}| : m = 1, …, M} under iid g_b-balanced sampling for the arbitrarily chosen ρ = 0.9. We obtain the 7 p-values reported in Table 6.

Table 6 (excerpt):
sampling scheme     n_1   n_2   n_3   n_4   n_5   n_6   n_7
iid g_b-balanced    p …
[remaining entries truncated in the source]
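The tailored comparison-of-widths test is specified in (Chambaz and van der Laan, 2010a, Section A.4) and is not reproduced here. As a generic stand-in for testing "coincides (null) versus stochastically larger (alternative)", a one-sided Mann-Whitney test can be sketched as follows (normal approximation, no tie correction; this is an assumption, not the article's test):

```python
import math
from statistics import NormalDist

def mann_whitney_greater(x, y):
    """Approximate one-sided p-value for H1: the x-sample is stochastically
    larger than the y-sample (normal approximation of the U statistic)."""
    nx, ny = len(x), len(y)
    # U counts the pairs (xi, yj) with xi > yj; ties count 1/2
    u = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mean_u = nx * ny / 2
    sd_u = math.sqrt(nx * ny * (nx + ny + 1) / 12)
    return 1 - NormalDist().cdf((u - mean_u) / sd_u)
```

Here x would hold the widths under an adaptive scheme and y the benchmark widths under iid g⋆-optimal sampling: a large p-value, as in Table 6, means the adaptive widths cannot be declared larger.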


adjusting for multiple testing in terms of the Benjamini and Yekutieli procedure for controlling the False Discovery Rate at level 5%. In conclusion,

Highlight 5 (empirical widths of confidence intervals). In view of (Chambaz and van der Laan, 2010b, Theorem 2) and for any sample size, the widths of the confidence intervals obtained under the (θ, g⋆_n)-adaptive sampling scheme are not significantly greater than the widths of the confidence intervals that we obtain under the iid (θ, g⋆)-optimal sampling scheme.

3.5 Illustrating example.

So far, we have been concerned with results averaged across randomly sampled trajectories and θ's ranging over Θ_0. Here we present, as an illustrating example, four trajectories produced by the iid (θ, g_b) and (θ, g⋆) sampling schemes and the adaptive (θ, g⋆_n) and (θ, g^a_n) sampling schemes for θ = (0.2, 0.6) ∈ Θ_0.

For each of them, we report the point estimates Ψ_{n_i}(O^1_{n_7}(n_i)) of Ψ(θ) = 1.099 at every sample size n_i, as well as the estimated standard deviations s(θ)_{n_i,1}, the confidence intervals I(θ)_{n_i,1}, and the estimates g⋆_{n_i}(1) and g^a_{n_i}(1) of the optimal proportion of treated g⋆(θ)(1) = 0.290 for the two adaptive procedures; see Table 11.

In addition, we exhibit in Figure 5 several plots illustrating (from left to right) how the sequences θ_n, Ψ_n, g_n, and s(θ)²_n evolve as the sample size increases when applying the two adaptive sampling schemes. The most striking feature in the figure, which is representative of all the trajectories we have observed, concerns the adaptive treatment mechanism sequence.
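The two numerical values quoted above can be recovered in closed form. Assuming the standard asymptotic variance of the log-relative risk MLE under a design g, namely v(g) = (1 − θ_1)/(θ_1 g(1)) + (1 − θ_0)/(θ_0 (1 − g(1))) (an assumption of this sketch, not the article's own derivation), the minimizer is a Neyman-type allocation:

```python
import math

def log_relative_risk(theta0, theta1):
    """Psi(theta) = log E(Y|A=1) - log E(Y|A=0) = log(theta1 / theta0)."""
    return math.log(theta1 / theta0)

def optimal_allocation(theta0, theta1):
    """g*(1) minimizing v(g) = (1-theta1)/(theta1*g) + (1-theta0)/(theta0*(1-g)):
    the first-order condition gives g* = sqrt(A) / (sqrt(A) + sqrt(B)),
    with A = (1-theta1)/theta1 and B = (1-theta0)/theta0."""
    a = math.sqrt((1 - theta1) / theta1)
    b = math.sqrt((1 - theta0) / theta0)
    return a / (a + b)
```

For θ = (0.2, 0.6) this gives Ψ(θ) = log 3 ≈ 1.099 and g⋆(θ)(1) ≈ 0.290, matching the values quoted above.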
Estimating (or targeting) the optimal treatment mechanism is the driving force of our new adaptive estimation procedure. It is proven in (Chambaz and van der Laan, 2010b, Theorem 1) that g⋆_n(1) and, therefore, the cumulated mean (1/n) Σ_{i=1}^{n} g⋆_i(1) converge to the optimal proportion of treated g⋆(θ)(1) when the sampling scheme is characterized by (2). We see here that, when considering the (θ, g^a_n) sampling scheme, only the cumulated mean of g^a_n(1) converges to g⋆(θ)(1).

4 Simulation study of the performances of the targeted optimal group sequential testing procedure powered at local alternatives

In this section, we carry out a simulation study of the performances of targeted optimal design adaptive procedures in terms of group sequential testing. The three main questions at stake are "Does the group sequential testing procedure under the targeted optimal design adaptive sampling scheme guarantee the desired type I error?", then "Does it guarantee the desired power?", and lastly "How well does it compare with the group sequential testing procedure under the targeted optimal iid sampling scheme?"

We carefully present the simulation scheme in Section 4.1. The section culminates in Section 4.2 with the investigation of the properties of the adaptive group sequential testing procedure in terms of type I and type II errors. We conclude in Section 4.3 with the simulation study of its performances in terms of sample sizes at decision.
In each section, we compare our setting or results with those of (Zhu and Hu, 2010).

4.1 The simulation scheme (continued).

For each θ ∈ Θ = Θ_0 \ {(.1,.7), (.1,.8), (.1,.9), (.2,.8), (.2,.9), (.3,.9)} (*), we test M = 1000 times the null "ψ = Ψ(θ)" against the alternative "ψ > Ψ(θ)" with asymptotic type I error α = 5% and type II error β = 10% at ψ = Ψ(θ) + Δ(θ), where Δ(θ) = Ψ(θ + (0, η)) − Ψ(θ) = log(1 + η/θ_1) (with η = 0.05) is a small increment. Depending on whether we want to investigate the empirical behaviors of the different testing procedures with respect to type I (i) or type II (ii) errors, we resort for θ ∈ Θ to:

(i) Empirical type I error study:
• iid (θ, g_b) balanced sampling,
• iid (θ, g⋆) optimal sampling,
• (θ, g⋆_n) adaptive sampling,
• (θ, g^a_n) adaptive sampling;

(ii) Empirical type II error study:
• iid (θ + (0, η), g_b) balanced sampling,
• iid (θ + (0, η), g⋆) optimal sampling,
• (θ + (0, η), g⋆_n) adaptive sampling,
• (θ + (0, η), g^a_n) adaptive sampling.

We apply a one-sided group sequential testing procedure (Chambaz and van der Laan, 2010b, Section 5.1). It is based on the proportions (p_1, p_2, p_3, p_4) = (0.25, 0.50, 0.75, 1) and α- and β-spending functions both equal to t ↦ t². The designs (values of Δ(θ), maximum committed information I_max, rejection and futility region bounds) are reported in Table 7.

(*) Those six values are left aside because it would be computationally demanding to consider them too: for θ = (θ_0, θ_1) ∈ Θ_0 \ Θ, the key quantities v⋆(θ) I_max(θ_1) and v_b(θ) I_max(θ_1) (which can be interpreted as average maximal sample sizes at decision) are very large.
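The increment column of Table 7 can be reproduced directly, and the quadratic spending function t ↦ t² translates into per-look error increments; a sketch (the increment rule below is the standard reading of an error-spending function, assumed here):

```python
import math

def delta(theta1, eta=0.05):
    """Increment Delta(theta) = log(1 + eta/theta1) defining the local alternative."""
    return math.log(1 + eta / theta1)

def spending_increments(error, proportions=(0.25, 0.50, 0.75, 1.0)):
    """Error spent at each interim analysis for the spending function
    t -> error * t**2, evaluated at the information proportions p_k."""
    cumulative = [error * t ** 2 for t in proportions]
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
```

Indeed, delta(0.2) ≈ 0.223 and delta(0.1) ≈ 0.405, matching Table 7, and spending_increments(0.05) sums back to α = 5% across the four looks.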


θ_1   Δ(θ) = log(1 + η/θ_1)   I_max(θ_1)   rejection / futility boundaries
.1    0.405                     56.561     (2.734, 2.302, 2.006, 1.716) / (-0.973, 0.136, 0.965, 1.716)
.2    0.223                    186.747     (2.734, 2.301, 2.013, 1.720) / (-0.973, 0.138, 0.963, 1.720)
.3    0.154                    391.320     (2.734, 2.307, 2.008, 1.716) / (-0.973, 0.135, 0.962, 1.716)
.4    0.118                    670.281     (2.734, 2.303, 2.010, 1.718) / (-0.973, 0.145, 0.962, 1.718)
.5    0.095                   1023.632     (2.734, 2.298, 2.008, 1.715) / (-0.973, 0.140, 0.961, 1.715)
.6    0.080                   1451.372     (2.734, 2.305, 2.006, 1.716) / (-0.973, 0.135, 0.962, 1.716)
.7    0.069                   1947.120     (2.734, 2.300, 2.006, 1.715) / (-0.976, 0.133, 0.957, 1.715)
.8    0.061                   2530.022     (2.734, 2.306, 2.006, 1.715) / (-0.973, 0.139, 0.961, 1.715)
.9    0.054                   3180.931     (2.734, 2.299, 2.006, 1.715) / (-0.973, 0.138, 0.963, 1.715)

Table 7: Description of the one-sided sequential testing designs with proportions (p_1, p_2, p_3, p_4) = (0.25, 0.50, 0.75, 1), α- and β-spending functions both equal to t ↦ t², and asymptotic type I error α = 5% and type II error β = 10%. Every θ = (θ_0, θ_1) ∈ Θ is associated with the single entry corresponding to θ_1. For each entry, we provide (with precision 10^{-3}) the value of the increment Δ(θ) = log(1 + η/θ_1) that yields the parameter ψ_1 = Ψ(θ) + Δ(θ) at which the test of "ψ = Ψ(θ)" against "ψ > Ψ(θ)" is powered, the maximum committed information I_max(θ_1), the rejection region bounds (left of the slash) and the futility region bounds (right of the slash); note that, of course, the final rejection and futility bounds coincide.

By contrast, the simulation study carried out in Zhu and Hu (2010) considers a single scenario under the null (with θ = (.5,.5)) and a single scenario under the alternative (with θ = (.5,.625)).
The targeted treatment mechanisms are either the optimal allocation (used to minimize the expected number of failures subject to a power constraint; see (Rosenberger et al., 2001) and (Zhu and Hu, 2010, Example 2 (ii))) or the so-called urn allocation; see (Wei and Durham, 1978) and (Zhu and Hu, 2010, Example 2 (iii)). The total sample size is fixed beforehand, with a total of 500 observations to sample, and the results are evaluated on 5000 replications. Proportions are set to (0.2, 0.5, 1), and three different α-spending functions are considered (early stopping for futility is not permitted).

4.2 Empirical type I and type II errors.

Let us consider here the empirical type I and type II errors. We wish to answer the questions "Does the group sequential testing procedure under the targeted optimal design adaptive sampling scheme guarantee the desired type I error?" and "Does it guarantee the desired power?"

Testing the empirical type I and type II errors.

Empirical type I errors, that is proportions {a(θ) : θ ∈ Θ} such that M a(θ) is the number of times the null was falsely rejected for its alternative by the testing procedure of "ψ = Ψ(θ)" against "ψ > Ψ(θ)" powered at ψ = Ψ(θ + (0, η)), are reported in (Chambaz and van der Laan, 2010a, Table 29).

Empirical type II errors, that is proportions {b(θ) : θ ∈ Θ} such that M b(θ) is the number of times the null was falsely not rejected for its alternative by the testing procedure of "ψ = Ψ(θ)" against "ψ > Ψ(θ)" powered at ψ = Ψ(θ + (0, η)), are reported in (Chambaz and van der Laan, 2010a, Table 30).

In both tables, the numbers are strikingly close to the wished values (0.05 for Table 29, and 0.9 for Table 30).

Here again we rely on testing to assess rigorously whether the requirements on type I and II errors are met.
To this end, we use the fact that the independent rescaled empirical proportions {M a(θ) : θ ∈ Θ} should be distributed according to the Binomial distribution with parameter (M, a) with a = α. This property can be tested in terms of our tailored test, the alternative stating that a > α (Chambaz and van der Laan, 2010a, Section A.3). This results in 4 p-values, as reported in Table 8. Similarly, the independent rescaled empirical proportions {M b(θ) : θ ∈ Θ} should be distributed according to the Binomial distribution with parameter (M, b) = (M, 10%). This property can also be tested in terms of our tailored test, the alternative stating that b > β = 10% (Chambaz and van der Laan, 2010a, Section A.3). This results in 4 p-values, and we also report them in Table 8. The latter p-values teach us that the study is under-powered. It remains to assess whether the study is slightly or strongly under-powered: to this end, we now test the null stating that the independent rescaled empirical proportions {M b(θ) : θ ∈ Θ} are distributed according to the Binomial distribution with parameter (M, b) = (M, 11%) against the alternative stating that b > 11%. The corresponding 4 p-values are also reported in Table 8.

Empirical validation of type I and type II errors.

Considering each sampling scheme (i.e. each row of Table 8) separately, we conclude that


Table 8 (excerpt):
sampling scheme     type I error   type II error (10%)   type II error (11%)
iid g_b-balanced    0.974          p …
[remaining entries truncated in the source]

Table 8 caption (beginning truncated in the source): "… b > 10% and b > 11% respectively, and report the obtained p-values for each sampling scheme. The tailored test used here is presented in (Chambaz and van der Laan, 2010a, Section A.3)."

• the type I error control cannot be declared defective for any sampling procedure or, in less formal terms, the type I error control is guaranteed for both the iid g_b-balanced and g⋆-optimal sampling schemes as well as for both the g⋆_n-adaptive and g^a_n-adaptive sampling schemes;
• the group sequential testing procedures are all slightly under-powered, in the sense that:
  - the type II error control is declared defective for all sampling schemes (for each of them, there exists at least one θ ∈ Θ for which the type II error is likely larger than β = 10%);
  - however, the type II error control cannot be declared defective for any sampling procedure when substituting β′ = 11% for β = 10% or, in less formal terms, an 11% (rather than 10%) control of the type II error is guaranteed for both the iid g_b-balanced and g⋆-optimal sampling schemes as well as for both the g⋆_n-adaptive and g^a_n-adaptive sampling schemes.

This summary notably confirms our conjecture that Theorem 3 in (Chambaz and van der Laan, 2010b, Section 5.2) for group sequential testing procedures at deterministic sample sizes still holds for "real-life" group sequential testing procedures at random sample sizes, as described in (Chambaz and van der Laan, 2010b, Section 5.1). In conclusion,

Highlight 6 (empirical type I and type II errors).
In view of (Chambaz and van der Laan, 2010b, Section 5.1, Theorem 3), the (θ, g⋆_n)-adaptive group sequential testing procedure achieves the desired type I error when testing against local alternatives. It is slightly under-powered in the sense that the type II error control is guaranteed at 89% instead of 90%; but the same holds for the iid (θ, g⋆)-optimal group sequential testing procedure.

Regarding the results obtained in (Zhu and Hu, 2010), it is straightforward to check whether the requirements on type I error control are met when targeting the optimal allocation (Zhu and Hu, 2010, Example 2 (ii)). Surprisingly, simple binomial tests (based on the results reported in (Zhu and Hu, 2010, Table 30)) strongly reject the validity of the empirical type I error control (taking into account multiple testing) in two out of three cases (i.e. for two out of three α-spending functions). This may be due to their allocation strategy (trying to attain the Cramér-Rao lower bounds on the allocation strategies themselves; see also our comment at the end of Section 3.3 in (Chambaz and van der Laan, 2010a)). However, the same conclusion holds for their simulation under iid balanced sampling in one out of three cases, so the most likely explanation is that the total sample size, fixed to 500 beforehand, is not large enough for the Gaussian limit of the test statistic to be achieved (at least at proportion 0.2).

As for the power, it is also straightforward to check whether the testing procedures (one for each α-spending function) behave equally well, as they should, when targeting either the optimal allocation or the urn allocation (Zhu and Hu, 2010, Examples 2 (ii) and (iii)). Surprisingly again, simple tests (based on the results reported in (Zhu and Hu, 2010, Tables 4 and 5)) strongly reject the hypothesis of equally well performing procedures in terms of power (taking into account multiple testing). Indeed, the empirical powers are equal to 0.810, 0.768 and 0.754 (optimal allocation) and 0.811, 0.762 and 0.749 (urn allocation).
Since the same phenomenon occurs under iid balanced sampling, it is likely that the observed lack of power is due to the fixed total sample size, apparently not large enough for the Gaussian limit of the test statistic to be achieved (at least at proportion 0.2).

4.3 Empirical distributions of sample size at decision.

Let us now consider the empirical distributions of sample size at decision, and answer the question "How well does it compare with the group sequential testing procedure under the targeted optimal iid sampling scheme?"

Testing the empirical sample size at decision.

We report in (Chambaz and van der Laan, 2010a, Table 30) the mean sample sizes at decision for each θ ∈ Θ when checking the adequacy of the type I error control of our group sequential testing procedures of "ψ = Ψ(θ)" against "ψ > Ψ(θ)"


powered at ψ = Ψ(θ + (0, η)). We also report in (Chambaz and van der Laan, 2010a, Table 31) the mean sample sizes at decision for each θ ∈ Θ when checking the adequacy of the type II error control of our group sequential testing procedures of "ψ = Ψ(θ)" against "ψ > Ψ(θ)" powered at ψ = Ψ(θ + (0, η)). Inspecting Tables 30 and 31 of Chambaz and van der Laan (2010a) tells us, at least in terms of mean sample sizes at decision and regarding either empirical type I or type II errors, first that the two adaptive group sequential testing procedures perform as well as the iid g⋆-optimal group sequential testing procedure, and second that the three latter procedures perform (sometimes, much) better than the iid g_b-balanced group sequential testing procedure when the balanced and optimal iid procedures differ. As a summary, we provide in Table 9 a comparison of mean sample sizes at decision when resorting to the iid g_b-balanced group sequential testing procedure with respect to the g⋆_n-adaptive group sequential testing procedure. Naturally, the further the percentage is away from the diagonal, the larger is the gain. Sometimes, the gain is dramatic.

Again, we push further the comparison between the empirical distributions of sample size at decision under each group sequential testing procedure (in the same spirit as the comparison of widths in Section 3.4).
On the one hand, the sample sizes at decision {S(θ, g⋆)_m : m = 1, …, M} of the M independent copies of the iid g⋆-optimal group sequential testing procedure provide us with an empirical counterpart of a benchmark distribution of optimal sample size at decision. On the other hand, we also have at hand the empirical distributions of sample sizes at decision obtained under the iid g_b-balanced and both g⋆_n-adaptive and g^a_n-adaptive group sequential testing procedures, which we see as empirical counterparts of distributions that we would like to compare to the aforementioned benchmark distribution.

Regarding the comparison of the iid group sequential testing procedures, we propose to test across Θ, in terms of our tailored test for comparison of sample sizes at decision (Chambaz and van der Laan, 2010a, Section A.4), whether the distribution of sample size at decision under the iid g_b-balanced group sequential testing procedure rescaled by a factor R(θ) coincides with the benchmark distribution (null hypothesis), rather than being stochastically smaller (alternative hypothesis). This yields a p-value smaller than 10^{-6}. In other words, we can conclude that there exists some θ ∈ Θ for which the sample size at decision under iid g⋆-optimal group sequential testing is stochastically larger than the corresponding R(θ)-rescaled sample size at decision under iid g_b-balanced group sequential testing. However, we can slightly adapt the procedure we just presented, rescaling more modestly by a sub-optimal factor. We now compare, in the same terms and with the same benchmark distribution, the distribution of sample size at decision under the iid g_b-balanced group sequential testing procedure rescaled by a factor R(θ)^ρ for the arbitrarily chosen ρ = 0.45.
The two p-values thus obtained are reported in Table 10.

θ_0\θ_1   .1    .2    .3    .4    .5    .6    .7    .8    .9
.1        1%    6%   10%   14%   26%   31%    -     -     -
.2        -     2%    1%    5%   12%   18%   24%    -     -
.3        -     -     2%    0%    6%    9%   16%   24%    -
.4        -     -     -     1%    2%    2%   11%   17%   33%
.5        -     -     -     -     1%   -1%    6%   11%   26%
.6        -     -     -     -     -     2%    1%    3%   13%
.7        -     -     -     -     -     -    -2%    3%   12%
.8        -     -     -     -     -     -     -     3%    4%
.9        -     -     -     -     -     -     -     -     6%
gains when evaluating empirical type I error

θ_0\θ_1   .1    .2    .3    .4    .5    .6    .7    .8    .9
.1        2%    6%   19%   21%   32%   36%    -     -     -
.2        -     0%    4%    8%   14%   21%   30%    -     -
.3        -     -    -4%    2%    9%   13%   22%   33%    -
.4        -     -     -     0%    2%    5%   13%   27%   53%
.5        -     -     -     -     0%    2%   10%   19%   41%
.6        -     -     -     -     -     1%    3%   11%   29%
.7        -     -     -     -     -     -     3%    4%   30%
.8        -     -     -     -     -     -     -     3%   21%
.9        -     -     -     -     -     -     -     -    10%
gains when evaluating empirical type II error

Table 9: Comparing mean sample sizes at decision when resorting to the iid g_b-balanced group sequential testing procedure with respect to the g⋆_n-adaptive group sequential testing procedure. The top table corresponds to (Chambaz and van der Laan, 2010a, Table 30) and the evaluation of the empirical type I error, while the bottom table corresponds to (Chambaz and van der Laan, 2010a, Table 31) and the evaluation of the empirical type II error.
Entries are of the form (S̄(θ, g_b) − S̄(θ, g⋆_n)) / S̄(θ, g⋆_n), where S̄(θ, g_b) (respectively, S̄(θ, g⋆_n)) denotes the empirical mean sample size at decision under iid g_b-balanced sampling (respectively, g⋆_n-adaptive sampling).

Regarding the comparison of the g⋆_n-adaptive and g^a_n-adaptive group sequential testing procedures to the iid g⋆-optimal group sequential testing procedure, we propose to test across Θ, in terms of our tailored test for comparison of sample sizes at decision (Chambaz and van der Laan, 2010a, Section A.4), whether the distributions of sample size at decision under either adaptive group sequential testing procedure coincide with the benchmark distribution (null hypothesis), rather than being stochastically larger (alternative hypothesis). This yields 4 p-values (two when investigating the behaviors with respect to type I error, two with respect to type II error), which we report in Table 10.


sampling scheme     type I error   type II error
iid g_b-balanced    0.625          0.727
g⋆_n-adaptive       0.994          1.000
g^a_n-adaptive      0.898          0.949

Table 10: Comparing across Θ the empirical distributions of sample sizes at decision. First row: we report the p-values derived when comparing, across Θ, the empirical distributions of rescaled sample sizes at decision (by a factor R(θ)^ρ with ρ = 0.45) under iid g_b-balanced sampling to the empirical counterpart of the benchmark distributions of sample sizes at decision obtained under iid g⋆-optimal sampling, in terms of our tailored test for comparison of sample sizes at decision (Chambaz and van der Laan, 2010a, Section A.4), the alternative hypothesis stating that the latter are stochastically larger than the former. Second and third rows: we report the p-values derived when comparing, across Θ, the empirical distributions of sample sizes at decision obtained under g⋆_n-adaptive sampling (second row), or under g^a_n-adaptive sampling (third row), to the empirical counterpart of the benchmark distributions of sample sizes at decision obtained under iid g⋆-optimal sampling, in terms of our tailored test for comparison of sample sizes at decision (Chambaz and van der Laan, 2010a, Section A.4), the alternative hypothesis stating in both cases that the latter are stochastically smaller than the former distributions.
The second column corresponds to data gathered when investigating the behaviors of the group sequential testing procedures in terms of type I error, the third column corresponding to the same investigation but in terms of type II error.

Empirical validation of sample sizes at decision.

Considering each sampling scheme (i.e. each row of the table) separately, we conclude that

• the sample sizes at decision obtained under iid (θ, g_b) balanced group sequential testing and rescaled by the corresponding factor R(θ)^ρ (ρ = 0.45) are not stochastically smaller than the sample sizes at decision obtained under iid (θ, g⋆) optimal group sequential testing;
• the sample sizes at decision obtained under the iid (θ + (0, η), g_b) balanced group sequential testing procedure and rescaled by the corresponding factor R(θ + (0, η))^ρ (ρ = 0.45) are not stochastically smaller than the sample sizes at decision obtained under the iid (θ + (0, η), g⋆) optimal group sequential testing procedure;
• the sample sizes at decision obtained under both the (θ, g⋆_n) and (θ, g^a_n) adaptive group sequential testing procedures are not stochastically larger than the sample sizes at decision obtained under the iid (θ, g⋆) optimal group sequential testing procedure;
• the sample sizes at decision obtained under both the (θ + (0, η), g⋆_n) and (θ + (0, η), g^a_n) adaptive group sequential testing procedures are not stochastically larger than the sample sizes at decision obtained under the iid (θ + (0, η), g⋆) optimal group sequential testing procedure.

Overall, the main message, stated in less formal terms, is that both adaptive group sequential testing procedures perform as well as the optimal iid group sequential testing procedure with respect to sample size at decision,
either under the null or under the alternative.

Highlight 7 (empirical sample sizes at decision). In view of (Chambaz and van der Laan, 2010b, Section 5.1, Theorem 3), the (θ, g⋆_n)-adaptive group sequential testing procedure behaves as the iid (θ, g⋆)-optimal group sequential testing procedure in terms of sample sizes at decision, both under the null and under local alternatives.

5 Discussion

We have carried out in this article a simulation study of the properties (considered from a theoretical point of view in the companion article (Chambaz and van der Laan, 2010b)) of a new adaptive group sequential design methodology for randomized clinical trials with binary treatment and outcome and no covariate (the experimental unit writes as O = (A, Y) ∈ {0,1}², A being the assigned treatment and Y the corresponding outcome). The methodology is adaptive in the sense that it targets a treatment mechanism specified beforehand by the trial protocol. We decided to focus, in (Chambaz and van der Laan, 2010b) and here, on the log-relative risk Ψ = log E(Y | A = 1) − log E(Y | A = 0) and on the design g⋆ which minimizes the asymptotic variance of the MLE of Ψ. Other choices can be treated likewise.

The estimator of g⋆, which appears to be strongly consistent (Chambaz and van der Laan, 2010b, Highlight 1), is alternately used in the process of accruing new data, then updated, and so on. The resulting MLE of Ψ, Ψ_n, is strongly consistent (Chambaz and van der Laan, 2010b, Highlight 1).
It satisfies a central limit theorem, and performs as well (in terms of asymptotic variance) as its counterpart under iid sampling using g⋆ itself (Chambaz and van der Laan, 2010b, Highlight 1). Therefore, one easily constructs confidence intervals which are as narrow as the intervals one would get, had one known g⋆ in advance and used it to sample data independently (Chambaz and van der Laan, 2010b, Highlight 1). We validate those theoretical results here by simulations. Notably, a test across a large collection of data-generating distributions indexed by Θ shows that the limiting Gaussian law is empirically reached by the sequence of laws of √n(Ψ_n − Ψ) as soon as 500 observations are collected. This is as good as what one would get, had one known in

Published by Berkeley Electronic Press, 2011. http://www.bepress.com/ijb/vol7/iss1/11. DOI: 10.2202/1557-4679.1310.


Chambaz and van der Laan: Targeting the Optimal Design in RCTs: Simulation Study. The International Journal of Biostatistics, Vol. 7 [2011], Iss. 1, Art. 11.

advance g⋆ and used it to sample data independently (see Highlight 3). Most importantly, another test across Θ reveals that the desired coverage is achieved as soon as 500 observations are collected. In contrast, a sample size of 100 observations would suffice, had one known g⋆ in advance and used it to sample data independently (see Highlight 4). This is the price to pay for adapting. In conclusion, yet another test across Θ shows that, whenever the sample size exceeds 100, the widths of confidence intervals obtained under adaptive sampling schemes are not significantly greater than the widths of the intervals one would get, had one known g⋆ in advance and used it to sample data independently (see Highlight 5).

Furthermore, we explain in (Chambaz and van der Laan, 2010b) how a group sequential testing procedure can be equally well applied on top of the adaptive sampling methodology (Chambaz and van der Laan, 2010b, Highlight 2). An accompanying theoretical result validates the adaptive group sequential testing procedure in the context of contiguous null and alternative hypotheses. It is shown here that simulations support the latter result. Most importantly, a test across a large collection of pairs of null and local alternative hypotheses indexed by Θ′ demonstrates that the adaptive group sequential testing procedure achieves the desired type I error (see Highlight 6). Moreover, a complementary test across Θ′ reveals that the adaptive group sequential testing procedure is very slightly under-powered.
Interestingly, had one known g⋆ in advance and used it to sample data independently, the resulting group sequential testing procedure would suffer from the same minor lack of power (see Highlight 6). Finally, a last test across Θ′ shows that the laws of sample sizes at decision under the adaptive group sequential testing procedures do not significantly differ from the laws of sample sizes at decision that one would get, had one known g⋆ in advance, used it to sample data independently, and applied the iid group sequential testing procedure (see Highlight 7).

In essence, everything works as predicted by theory. However, theory also warns us that gains cannot be dramatic in the particular setting of clinical trials with binary treatment and outcome without covariate. Nonetheless, this article is important: it provides a theoretical template and tools for the asymptotic analysis of robust adaptive designs in less constrained settings, which we will consider in future work. This notably includes the setting of clinical trials with covariate, binary treatment, and discrete or continuous outcome, or the setting of clinical trials with covariate, binary treatment, and possibly censored time-to-event, among others.
Resorting to targeted maximum likelihood estimation (van der Laan and Rubin, 2006) along with adaptation of the design provides substantial gains in efficiency.

Figure 5: Illustrating how the two adaptive (θ, g⋆_n), top, and (θ, g^a_n), bottom, sampling schemes behave as the sample size increases (on x-axis, logarithmic scale; the vertical grey lines indicate sample sizes n_i, i = 1, ..., 7). From left to right we represent at the same scale over columns the sequences θ_n, Ψ_n = Ψ(θ_n), g⋆_n(1) or g^a_n(1), and s(θ)^2_n. Horizontal grey lines indicate the theoretical limits of the plotted sequences. By convention, θ_n is initiated at (1/2, 1/2) while g⋆_n(1) and g^a_n(1) are initiated at 1/2. Another convention requires that at least 10 observations be collected before computing s(θ)^2_n for the first time, explaining why the corresponding plots start at n = 10 rather than n = 1. The most striking feature is how smoothly g⋆_n(1) converges to g⋆(θ)(1) when g_n = g⋆_n, top, as opposed to when g_n = g^a_n, bottom.


Table 11: Illustrating example. We report the point estimates Ψ_{n_i} of Ψ(θ) = 1.099 (θ = (0.2, 0.6)) at every sample size n_i, the estimated standard deviations s(θ)_{n_i,1}, the confidence intervals I(θ)_{n_i,1}, and the estimates g⋆_{n_i}(1) and g^a_{n_i}(1) of the optimal proportion of treated g⋆(θ)(1) = 0.290, for the two iid and the two adaptive (θ, g⋆_n) and (θ, g^a_n) sampling schemes. Columns correspond to sample sizes n_1, ..., n_7. See also Figure 5.

(θ, g^b), balanced iid:
  Ψ_{n_i}:      1.053  1.285  1.259  1.167  1.118  1.158  1.117
  s(θ)_{n_i,1}: 2.611  2.999  2.993  2.964  2.954  3.164  3.046
  I(θ)_{n_i,1}: [0.541,1.565] [0.913,1.656] [0.997,1.522] [0.955,1.379] [0.935,1.301] [1.034,1.282] [1.033,1.202]
  g^b(1):       1/2 at every n_i

(θ, g⋆), optimal iid:
  Ψ_{n_i}:      0.783  1.131  1.032  1.068  1.097  1.080  1.101
  s(θ)_{n_i,1}: 2.873  2.836  2.798  2.823  2.863  2.831  2.809
  I(θ)_{n_i,1}: [0.220,1.346] [0.779,1.482] [0.787,1.278] [0.865,1.270] [0.919,1.274] [0.969,1.191] [1.023,1.179]
  g⋆(θ)(1):     0.290 at every n_i

(θ, g⋆_n), adaptive:
  Ψ_{n_i}:      1.329  1.212  1.272  1.219  1.238  1.090  1.111
  s(θ)_{n_i,1}: 3.560  2.984  2.928  2.939  2.936  2.804  2.829
  I(θ)_{n_i,1}: [0.631,2.027] [0.842,1.582] [1.015,1.528] [1.008,1.429] [1.056,1.420] [0.980,1.200] [1.033,1.190]
  g_{n_i}(1):   0.281  0.277  0.262  0.273  0.269  0.291  0.288

(θ, g^a_n), adaptive:
  Ψ_{n_i}:      1.216  1.178  1.247  1.369  1.278  1.122  1.136
  s(θ)_{n_i,1}: 2.544  2.709  2.770  3.094  2.990  2.853  2.842
  I(θ)_{n_i,1}: [0.718,1.715] [0.842,1.514] [1.004,1.490] [1.147,1.590] [1.092,1.463] [1.010,1.233] [1.058,1.215]
  g_{n_i}(1):   0.317  0.060  0.367  0.213  0.630  0.750  0.309

References

Chambaz, A. and van der Laan, M. J. (2010a): "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate," U.C. Berkeley Division of Biostatistics Working Paper Series, paper 258.
Chambaz, A. and van der Laan, M. J. (2011): "Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: theoretical study," Int. J. Biostat., 7(1), Article 10.
Hu, F. H. and Rosenberger, W. F. (2006): The Theory of Response-Adaptive Randomization in Clinical Trials, Wiley.
Rosenberger, W. F., Stallard, N., Ivanova, A., Harper, C. N., and Ricks, M. L. (2001): "Optimal adaptive designs for binary response trials," Biometrics, 57, 909–913.
van der Laan, M. J. (2008): "The construction and analysis of adaptive group sequential designs," U.C. Berkeley Division of Biostatistics Working Paper Series, paper 232.
van der Laan, M. J. and Rubin, D. (2006): "Targeted maximum likelihood learning," Int. J. Biostat., 2, Article 11.
Wei, L. J. and Durham, S. D. (1978): "The randomized play-the-winner rule in medical trials," Journal of the American Statistical Association, 73, 840–843.
Zhu, H. and Hu, F. (2010): "Sequential monitoring of response-adaptive randomized clinical trials," The Annals of Statistics.


The International Journal of Biostatistics
Volume 7, Issue 1, 2011, Article 10

Targeting the Optimal Design in Randomized Clinical Trials with Binary Outcomes and No Covariate: Theoretical Study

Antoine Chambaz, Laboratoire MAP5, Université Paris Descartes and CNRS
Mark J. van der Laan, University of California, Berkeley

Abstract

This article is devoted to the asymptotic study of adaptive group sequential designs in the case of randomized clinical trials (RCTs) with binary treatment, binary outcome and no covariate. By adaptive design, we mean in this setting a RCT design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. By adaptive group sequential design, we refer to the fact that group sequential testing methods can be equally well applied on top of adaptive designs. We obtain that, theoretically, the adaptive design converges almost surely to the targeted unknown randomization scheme. In the estimation framework, we obtain that our maximum likelihood estimator of the parameter of interest is a strongly consistent estimator, and it satisfies a central limit theorem. We can estimate its asymptotic variance, which is the same as that it would feature had we known in advance the targeted randomization scheme and independently sampled from it. Consequently, inference can be carried out as if we had resorted to independent and identically distributed (iid) sampling.
In the testing framework, we obtain that the multidimensional t-statistic that we would use under iid sampling still converges to the same canonical distribution under adaptive sampling. Consequently, the same group sequential testing can be carried out as if we had resorted to iid sampling. Furthermore, a comprehensive simulation study that we undertake in a companion article validates the theory.

A three-sentence take-home message is: "Adaptive designs do learn the targeted optimal design, and inference and testing can be carried out under adaptive sampling as they would under iid sampling from the targeted optimal randomization probability. In particular, adaptive designs achieve the same efficiency as the fixed oracle design. This is confirmed by a simulation study, at least for moderate or large sample sizes, across a large collection of targeted randomization probabilities."

KEYWORDS: adaptive design, canonical distribution, coarsening at random, contiguity, group sequential testing, maximum likelihood, randomized clinical trial

Recommended Citation:
Chambaz, Antoine and van der Laan, Mark J. (2011) "Targeting the Optimal Design in Randomized Clinical Trials with Binary Outcomes and No Covariate: Theoretical Study," The International Journal of Biostatistics: Vol. 7: Iss. 1, Article 10. DOI: 10.2202/1557-4679.1247. Available at: http://www.bepress.com/ijb/vol7/iss1/10.

©2011 Berkeley Electronic Press. All rights reserved.


Author Notes: This collaboration took place while Antoine Chambaz was a visiting scholar at UC Berkeley, supported in part by a Fulbright Research Grant and the CNRS. The authors thank the reviewers for their interesting comments.

1 Introduction

The present article and its companion article (Chambaz and van der Laan, 2010b) are devoted to the asymptotic study of adaptive group sequential designs in the case of randomized clinical trials with binary treatment, binary outcome and no covariate, focusing here on its theoretical development and there on its study by simulations. Thus, the experimental unit writes as O = (A, Y) where the treatment A and the outcome Y are dependent Bernoulli random variables. Typical parameters of scientific interest are Ψ+ = E(Y | A = 1) − E(Y | A = 0) (additive scale) or Ψ× = log E(Y | A = 1) − log E(Y | A = 0) (multiplicative scale, which we consider hereafter). One can interpret such parameters causally whenever one is willing to postulate the existence of a full data structure X = (X(0), X(1)) containing the two counterfactual outcomes under the two possible treatments and such that Y = X(A) and A is independent of X. If so indeed, Ψ+ = E(X(1)) − E(X(0)) and Ψ× = log E(X(1)) − log E(X(0)). Let us now explain what we mean by adaptive group sequential design.

1.1 The notion of adaptive group sequential designs.

By adaptive design, we mean in this setting a clinical trial design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial.
This definition is slightly adapted from (Golub, 2006), the introductory article to the proceedings (to which many articles cited below belong) of a workshop entitled "Adaptive clinical trial designs: Ready for prime time?" held in October 2004, and jointly sponsored by the FDA and the Harvard-MIT Division of Health Science and Technology. Using the definition of prespecified sampling plans given in (Emerson, 2006), let us emphasize that we assume that, prior to collection of the data, the trial protocol specifies the parameter of scientific interest, and

• estimation framework: the confidence level to be used in constructing frequentist confidence intervals for the latter parameter, and the related inferential method;
• testing framework: the null and alternative hypotheses regarding the latter parameter, the wished type I and type II errors, the rule for determining the maximal statistical information to be accrued, and the frequentist testing procedure (including conditions for early stopping).
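Purely to fix ideas, the prespecified elements listed above can be collected in a small record. This sketch and all field names are our own illustration, not part of the protocol formalism used in the article:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProtocolSpec:
    """Illustrative summary of what the trial protocol prespecifies (our own naming)."""
    parameter: str             # parameter of scientific interest, e.g. "log-relative risk"
    confidence_level: float    # estimation framework: level of the confidence intervals
    type_I_error: float        # testing framework: wished type I error (alpha)
    type_II_error: float       # testing framework: wished type II error (beta)
    max_information_rule: str  # rule determining the maximal statistical information
    early_stopping: bool       # whether conditions for early stopping are specified

# A hypothetical protocol, for illustration only.
spec = ProtocolSpec("log-relative risk", 0.95, 0.05, 0.10,
                    "fixed maximal information", True)
```

The frozen dataclass mirrors the requirement that these choices be fixed prior to data collection.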


Furthermore, we assume that the protocol specifies a user-supplied optimal unknown choice of randomization scheme: our adaptive design does not belong to the class of prespecified sampling schemes in that it targets the latter optimal unknown choice of randomization scheme, learning it based on accrued data. We focus in this article on maximum likelihood estimation and testing. The considered user-supplied optimal unknown choice of randomization scheme is that which minimizes the asymptotic variance of our maximum likelihood estimator (MLE) of the parameter of interest. This choice, known in the literature as the Neyman allocation, is interesting because minimizing the asymptotic variance of our estimator guarantees narrower confidence intervals and an earlier decision to reject the null for its alternative or not. Yet, targeting this treatment mechanism in real-life clinical trials may raise ethical issues, since this may result in more patients assigned to the inferior treatment arm. But we emphasize that there is nothing special about targeting the Neyman allocation, the whole methodology applying equally well to any choice of targeted treatment mechanism. To be more specific, we could also have chosen to target that treatment mechanism which minimizes the expected number of treatment failures subject to a power constraint (the optimal allocation proposed by Rosenberger, Stallard, Ivanova, Harper, and Ricks (2001)), or any other treatment mechanism of interest (see for instance (Hu and Rosenberger, 2006, Chapter 2) or (Tymofyeyev, Rosenberger, and Hu, 2007)).
Quoting James Hung of the FDA (about design adaptation in general; see (Hung, 2006)), our adaptive design meets clearly stated objectives; it is certainly a "more careful planning, not sloppy planning".

By adaptive group sequential design, we refer to the fact that group sequential testing methods can be equally well applied on top of adaptive designs. According to Hu and Rosenberger (2006), "The basic statistical formulation of a sequential testing procedure requires determining the joint distribution of the sequentially computed test statistics. Under response-adaptive randomization, this is a difficult task. There has been little theoretical work done to this point, nor has there been any evaluation of sequential monitoring in the context of sequential estimation procedures [i.e. targeted adaptive designs] such as the doubly adaptive biased coin design." These authors end their final chapter with a quote from (Rosenberger, 2002): "Surprisingly, the link between response-adaptive randomization and sequential analysis has been tenuous at best, and this is perhaps the logical place to search for open research topics." Indeed, we determine the limit joint distribution of the sequentially computed test statistics based on our results, and provide a theoretical background to rely upon for adaptive design group sequential testing procedures. Simultaneously, a similar result is obtained in (Zhu and Hu, 2010) through a different approach based on the study of the limit distribution of a stochastic process defined over (0, 1].

1.2 Bibliography.

The literature on adaptive designs is vast and we apologize for not including all of it.
Quite misleadingly, the expression "adaptive design" has also been used in the literature for sequential testing and, in general, for designs that allow data-adaptive stopping times for the whole study (or for certain treatment arms) which achieve the wished type I and type II error requirements when testing a null hypothesis against its alternative.

In the literature dedicated to what corresponds to our definition of adaptive design, such adaptive designs are referred to as "response-adaptive randomization" designs (see the quote from (Hu and Rosenberger, 2006) above). Of course, data-adaptive randomization schemes have a long history, which goes back to the 1930s, and we refer to (Hu and Rosenberger, 2006, Section 1.2) and (Jennison and Turnbull, 2000, Section 17.4) to provide the reader with a comprehensive historical perspective.

The organization of Section 1.1 illustrates the fact that we have decided to tackle the group sequential testing problem separately from the data-adaptive determination of the randomization probability in response to data collected so far, a choice justified by the fact that separating the characterization of a group sequential testing procedure from the adaptation of the randomization probability makes perfect sense from a methodological point of view. Resorting to the same organization here is more delicate, because response-adaptive treatment allocation is indebted to early studies in the context of sequential statistical analysis, such as (Armitage, 1975, Chernoff and Roy, 1965, Flehinger and Louis, 1971) among many others. However, Hu and Rosenberger (2006) manage to trace back the idea of incorporating randomization in the context of adaptive treatment allocation designs to (Wei and Durham, 1978).

On the one hand, regarding the adaptation of the randomization probability, data-adaptive
randomization schemes belong to either the "urn model" or the "doubly adaptive biased coin" families (see the quote from (Hu and Rosenberger, 2006) above). Adaptive designs based on urn models (so called because the randomization scheme can be modeled after different ways of pulling various colored balls from an urn) notably include the seminal "randomized play-the-winner rule" from the aforementioned article (Wei and Durham, 1978) or the more recent "drop-the-loser rule" (Ivanova, 2003). The theory of adaptive designs based on homogeneous urn models (referring to the fact that the updating rule does not evolve through time) is presented in detail in (Hu and Rosenberger, 2006, Chapter 4), with a comprehensive bibliography. Nonhomogeneous urn models are adaptive designs based on urn models which target a randomization scheme. Also known in the literature as estimation-adjusted urn models, these adaptive designs involve updating rules of


the urn which rely on the estimation of some parameters based on data accrued so far. Recently, Zhang, Hu, and Cheung (2006) established the consistency and asymptotic normality for both the randomization scheme and the estimator of the parameter of interest in this challenging framework, under weak assumptions. Targeting a specific user-supplied optimal unknown choice of randomization scheme is at the core of adaptive designs based on flipping a (data-adaptively) biased coin. More precisely, the latter targeted randomization scheme is expressed as a function f(θ) of an unknown parameter θ of the response model, and the adaptive design is characterized by the sequence f(θ_n) which is based on updated estimates θ_n of θ as the data accrue. For instance, the targeted randomization scheme we consider (namely, the minimizer of the asymptotic variance of our MLE of the parameter of interest in clinical trials with binary treatment, binary outcome and no covariate) is a function of the two marginal probabilities of success. Again, Hu and Rosenberger (2006) manage to trace back this kind of procedure to (Eisele, 1994). A series of articles including (Rosenberger et al., 2001, Hu and Zhang, 2004b) address the theoretical study of such adaptive designs, or investigate their properties based on simulations (Hu and Rosenberger, 2003). Overall, the most relevant references for our present article certainly are (Hu and Rosenberger, 2006) (already cited many times), which concerns asymptotic theory for likelihood-based estimation (not testing) based on data-adaptive randomization schemes in clinical trials, and (Zhu and Hu, 2010). In the latter article, Zhu and Hu mainly derive the limit joint distribution of their sequentially computed test statistic and carry out a simulation study of its properties.
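As a concrete illustration of a design characterized by the sequence f(θ_n), the following sketch targets the Neyman allocation of Eq. (4) in Section 2.3: each patient is assigned treatment 1 with the current probability g_n(1) = f(θ_n), and θ_n is updated from accrued data. The smoothed initial counts are our own convention to keep the estimates away from 0 and 1; the articles use their own initialization and conventions.

```python
import math
import random

def neyman_allocation(theta0, theta1):
    """Neyman allocation g*(theta)(1) of Eq. (4): the randomization probability
    of arm 1 minimizing the asymptotic variance of the MLE of the log-relative risk."""
    s0 = math.sqrt(theta0 * (1 - theta1))
    s1 = math.sqrt(theta1 * (1 - theta0))
    return s0 / (s0 + s1)

def adaptive_trial(theta, n, seed=0):
    """Data-adaptively biased coin: assign with g_n(1) = f(theta_n), where
    theta_n estimates the per-arm success probabilities from accrued data.
    Starting each arm at 1 success out of 2 trials (illustrative smoothing)
    keeps the estimates strictly inside (0, 1). Returns the path of g_n(1)."""
    rng = random.Random(seed)
    succ, tot = [1, 1], [2, 2]
    g1_path = []
    for _ in range(n):
        g1 = neyman_allocation(succ[0] / tot[0], succ[1] / tot[1])
        g1_path.append(g1)
        a = 1 if rng.random() < g1 else 0        # treatment drawn from g_n
        y = 1 if rng.random() < theta[a] else 0  # outcome drawn from theta
        succ[a] += y
        tot[a] += 1
    return g1_path
```

For θ = (0.2, 0.6), the closed form gives g⋆(θ)(1) ≈ 0.290, the value reported in the companion simulation study, and the path of g_n(1) drifts from its initial value 1/2 toward that target.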
We compare their results with ours in the appropriate sections of (Chambaz and van der Laan, 2010b).

On the other hand, regarding the group sequential testing problem, let us emphasize that we consider the case where one starts with a large up-front commitment of sample size and uses group sequential testing to allow early stopping, rather than starting out with a small commitment of sample size and extending it if necessary; the latter distinction is taken from (Mehta and Patel, 2006). Therefore, negative results obtained e.g. in (Jennison and Turnbull, 2003, Tsiatis and Mehta, 2003) for such procedures (inconveniently referred to as adaptive design methods in (Mehta and Patel, 2006, Tsiatis and Mehta, 2003)) that start out with a small commitment do not apply at all to our procedure. On the contrary, we can build upon the thorough understanding of group sequential methods as exposed in (Jennison and Turnbull, 2000, Proschan, Lan, and Wittes, 2006), and more recently explored in (Lokhnygina and Tsiatis, 2008).

Furthermore, there is also a rich literature on the Bayesian approach to adaptive designs.
The reader is referred to (Berry and Stangl, 1996, Spiegelhalter, Abrams, and Myles, 2004, Berry, 2006, Banerjee and Tsiatis, 2006) for further details.

Finally, this article builds upon the seminal technical report (van der Laan, 2008), which paves the way to robust and efficient estimation in randomized clinical trials thanks to adaptation of the design in a variety of settings.

1.3 Forthcoming results in words.

Following the same presentation as in Section 1.1, we obtain that the adaptive design converges almost surely to the targeted unknown randomization scheme (Theorem 1), and that

• estimation framework: the MLE of the parameter of interest is a strongly consistent estimator (Theorem 1), and it satisfies a central limit theorem (Theorem 2); we can estimate its asymptotic variance, which is the same as that it would feature had we known in advance the targeted randomization scheme and independently sampled from it (Theorem 2); consequently, inference can be carried out as if we had resorted to independent and identically distributed (iid) sampling;
• testing framework: the multidimensional t-statistic that we would use under iid sampling still converges to the same canonical distribution under adaptive sampling (Theorem 3); consequently, the same group sequential testing can be carried out as if we had resorted to iid sampling.

Furthermore, the comprehensive simulation study that we undertake in (Chambaz and van der Laan, 2010b) validates the theory, notably showing that the confidence intervals we obtain achieve the desired coverage even for moderate sample sizes, that type I error control at the prescribed level is guaranteed, and that all
sampling procedures only suffer from a very slight increase of the type II error; see (Chambaz and van der Laan, 2010b) for the details.

A three-sentence take-home message is: "Adaptive designs do learn the targeted optimal design, and inference and testing can be carried out under adaptive sampling as they would under iid sampling from the targeted optimal randomization probability. In particular, adaptive designs achieve the same efficiency as the fixed oracle design. This is confirmed by a simulation study, at least for moderate or large sample sizes, across a large collection of targeted randomization probabilities."

In essence, everything works as predicted by theory. However, theory also warns us that gains cannot be dramatic in the particular setting of clinical trials with binary treatment, binary outcome and no covariate. Nonetheless, this article and its companion article (Chambaz and van der Laan, 2010b) are important: they provide


a theoretical template and tools for the asymptotic analysis of robust adaptive designs in less constrained settings, which we will consider in future work. This notably includes the setting of clinical trials with covariate, binary treatment, and discrete or continuous outcome, or the setting of clinical trials with covariate, binary treatment, and possibly censored time-to-event, among others. Resorting to targeted maximum likelihood estimation (van der Laan and Rubin, 2006) along with adaptation of the design provides substantial gains in efficiency.

Finally, we want to emphasize that the whole adaptive design methodology that we develop here is only relevant for clinical trials in which a substantial number of observations are available before all patients are randomized. From now on, we assume that the clinical trial's time scale permits the application of the adaptive design methodology. It must be noted that one can find in the literature some articles devoted to the study of response-adaptive clinical trials based on possibly delayed responses (see Section 7.1 in (Hu and Rosenberger, 2006), and (Hu and Zhang, 2004a, Zhang, Chan, Cheung, and Hu, 2007)). A typical sufficient condition which allows handling the possible delay is the following: the probability of the event "the outcome of the nth patient is unavailable when the (n+m)th patient is randomized" is upper-bounded by a constant times m^{−γ} for some γ > 0.

1.4 Organization of the article.

The article is organized as follows. We define the targeted optimal design in Section 2, and describe how to adapt to it in Section 3.
The asymptotic study of the MLE of the parameter of interest under adaptive design is addressed in Section 4, where we obtain that the MLE is strongly consistent and asymptotically Gaussian. In Section 5 we show how a group sequential testing procedure can be applied on top of the adaptive design methodology. In Appendix A.1 we present an important building block for consistency results. It consists of a uniform Kolmogorov strong law of large numbers for martingale sums that essentially relies on a maximal inequality for martingales. Another important building block, for central limit theorems, is presented in Appendix A.2, where we derive a central limit theorem for discrete martingales.

Finally, in order to ease the reading, we highlight throughout this article and its companion article (Chambaz and van der Laan, 2010b) the most important results. We notably point out here how to construct confidence intervals and how to apply a group sequential testing procedure while targeting the optimal design and thus accruing observations data-adaptively; see the two highlights entitled

1. pointwise estimation and confidence interval (Section 4),
2. targeted optimal design adaptive group sequential testing (Section 5.1).

2 Balanced versus optimal treatment mechanisms

2.1 The observed data structure and related likelihood.

We consider the simplest example of randomized trials, where an observation writes as O = (A, Y), A being a binary treatment of interest and Y a binary outcome of interest. We postulate the existence of a full data structure X = (X(0), X(1)) containing the two counterfactual (or potential) outcomes under the two possible treatments. The observed data structure O = (A, X(A)) = (A, Y) only contains the outcome corresponding to the treatment the experimental unit did receive.
Therefore O is a missing data structure on X with missingness variable A. We denote the conditional probability distribution of treatment A by g(a|x) = P(A = a|X = x). We assume that the coarsening at random (abbreviated to CAR) assumption holds: for all a ∈ A = {0,1}, x ∈ X = {0,1}²,

    g(a|x) = P(A = a|X(a) = x(a)). (1)

We denote by G the set of such CAR conditional distributions of A given X, referred to as the set of fixed designs. In the framework of this article, (1) is equivalent to

    g(a|x) = g(a)

for all a ∈ A, x ∈ X: g ∈ G if and only if the random variables A and X are independent. We only consider such treatment mechanisms in the rest of Section 2.

The distribution P_X of the full data structure X has two marginal Bernoulli laws characterized by θ = (θ_0,θ_1) ∈ ]0,1[² with θ_0 = E_{P_X} X(0) and θ_1 = E_{P_X} X(1) (the only identifiable part of P_X). Therefore, introducing X(O) = {x ∈ X : x(A) = Y} (the set of full data structure realizations compatible with O), the likelihood of O writes as

    L(O) = ∑_{x ∈ X(O)} P(O, X = x) = ∑_{x ∈ X(O)} P(O|X = x) P(X = x)
         = ∑_{x ∈ X(O)} g(A|x) P(X = x) = g(A) P(X ∈ X(O))
         = θ_A^Y (1 − θ_A)^{1−Y} g(A) = θ(O) g(A),

using the convenient shorthand notation θ(O) = θ_A^Y (1 − θ_A)^{1−Y}. Because of the form of the likelihood, we can say that the observed data structure O is obtained under (θ,g).

http://www.bepress.com/ijb/vol7/iss1/10 · DOI: 10.2202/1557-4679.1247 · Published by Berkeley Electronic Press, 2011
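The factorization L(O) = θ(O)g(A) can be checked on a toy example. The following sketch (the numerical values are illustrative, not from the article) enumerates the four possible observations under a fixed balanced design and compares the likelihood with the direct probability P(A = a)P(X(a) = y):

```python
import itertools

theta = (0.3, 0.6)   # (theta_0, theta_1), assumed true Bernoulli means
g = (0.5, 0.5)       # fixed balanced design: (g(0), g(1))

def likelihood(a, y):
    """L(O) = theta(O) g(A) for the observation O = (a, y)."""
    theta_a = theta[a]
    return (theta_a ** y) * ((1 - theta_a) ** (1 - y)) * g[a]

# The four possible observations exhaust the probability mass...
total = sum(likelihood(a, y) for a, y in itertools.product((0, 1), repeat=2))

# ...and each likelihood equals the direct probability P(A=a) P(X(a)=y).
direct = {(a, y): g[a] * (theta[a] if y == 1 else 1 - theta[a])
          for a, y in itertools.product((0, 1), repeat=2)}
```

Here `total` equals 1 and `likelihood` agrees with `direct` on every observation, as the display above requires.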


2.2 Efficient influence curve and efficient asymptotic variance for the log-relative risk.

Say that the parameter of scientific interest is

    Ψ(θ) = log(θ_1/θ_0),

the log-relative risk; of course, the sequel applies to other choices, such as the excess risk. In a classical randomized trial, we would determine a fixed treatment mechanism g (therefore complying with CAR) and sample as many iid copies of O as necessary. The theory of semiparametric statistics teaches us that the efficient influence curves for parameters θ_0 and θ_1 under (θ,g) are respectively

    D⋆_0(θ,g)(O) = (Y − θ_0) 1l{A = 0}/g(0),  D⋆_1(θ,g)(O) = (Y − θ_1) 1l{A = 1}/g(1). (2)

Then the delta-method (and page 386 in (van der Vaart, 1998)) implies that the efficient influence curve for parameter Ψ(θ) under (θ,g) writes as

    IC(θ,g)(O) = −(1l{A = 0}/(θ_0 g(0))) (Y − θ_0) + (1l{A = 1}/(θ_1 g(1))) (Y − θ_1), (3)

so that the efficient asymptotic variance under (θ,g) is

    E_{θ,g} IC(θ,g)²(O) = σ²(θ)(0)/(θ_0² g(0)) + σ²(θ)(1)/(θ_1² g(1)) = (1 − θ_0)/(θ_0 g(0)) + (1 − θ_1)/(θ_1 g(1)),

with notation σ²(θ)(a) = Var_θ(Y|A = a) = θ_a(1 − θ_a), the conditional variance of Y given A = a.

2.3 A relative efficiency criterion.

Defining OR(θ) = (θ_1/(1−θ_1)) / (θ_0/(1−θ_0)), the efficient asymptotic variance as a function of the treatment mechanism g is minimized at the optimal treatment mechanism characterized by

    g⋆(θ)(1) = 1/(1 + √OR(θ)) = √(θ_0(1−θ_1)) / (√(θ_0(1−θ_1)) + √(θ_1(1−θ_0))), (4)

known as the Neyman allocation (Hu and Rosenberger, 2006, page 13). Interestingly, g⋆(θ)(1) ≤ 1/2 whenever θ_0 ≤ θ_1, meaning that the Neyman allocation g⋆(θ) favors the inferior treatment. The corresponding optimal efficient asymptotic variance v⋆(θ) then satisfies

    v⋆(θ) = ( √((1−θ_0)/θ_0) + √((1−θ_1)/θ_1) )² ≤ 2( (1−θ_0)/θ_0 + (1−θ_1)/θ_1 ) = v_b(θ),

where v_b(θ) denotes the efficient asymptotic variance associated with the standard balanced treatment characterized by g_b(1) = 1/2, hence the relative efficiency criterion

    R(θ) = v⋆(θ)/v_b(θ) = 1/2 + √OR(θ)/(1 + OR(θ)) ∈ (1/2, 1]. (5)

The definition of our relative efficiency criterion illustrates the fact that we decide to consider the balanced treatment mechanism as a benchmark. We emphasize that any fixed design could be chosen as benchmark treatment mechanism, with minor impact on the study we expose below.

It is worth noting that v_b(θ) = v⋆(θ), or in other words that the so-called balanced treatment mechanism is actually optimal, if and only if θ_0 = θ_1. In particular, there is no gain to expect from adapting the treatment mechanism in terms of type I error control when testing the null "Ψ(θ) = 0" against its negation. In addition, the following bound involving the relative efficiency criterion on one side and the log-relative risk on the other holds:

    R(θ) ≤ 1/2 + √(e^{Ψ(θ)})/(1 + e^{Ψ(θ)}) ∈ (1/2, 1].

We present in Figure 1 three curves θ_1 ↦ R(θ) for three different values of θ_0. It notably illustrates that when θ_0 is small, R(θ) can be significantly lower than 1 for values of Ψ(θ) which are not very large. For instance, θ = (1/100, 5/100) yields Ψ(θ) ≃ 1.609, R(θ) ≃ 0.868 and optimal treatment mechanism characterized by g⋆(θ)(1) ≃ 0.305.
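These numerical values can be reproduced directly from (4) and (5); the following minimal sketch (the helper names are ours, not the article's) computes the Neyman allocation and the relative efficiency for the example θ = (1/100, 5/100):

```python
import math

def neyman_allocation(theta0, theta1):
    """g*(theta)(1) from (4): optimal probability of assigning treatment A = 1."""
    num = math.sqrt(theta0 * (1 - theta1))
    return num / (num + math.sqrt(theta1 * (1 - theta0)))

def relative_efficiency(theta0, theta1):
    """R(theta) = v*(theta) / v_b(theta) from (5)."""
    v_star = (math.sqrt((1 - theta0) / theta0)
              + math.sqrt((1 - theta1) / theta1)) ** 2
    v_bal = 2 * ((1 - theta0) / theta0 + (1 - theta1) / theta1)
    return v_star / v_bal

theta0, theta1 = 1 / 100, 5 / 100
psi = math.log(theta1 / theta0)             # log-relative risk, about 1.609
g_star = neyman_allocation(theta0, theta1)  # about 0.305
R = relative_efficiency(theta0, theta1)     # about 0.868

# Closed form of (5): R = 1/2 + sqrt(OR) / (1 + OR).
odds_ratio = (theta1 / (1 - theta1)) / (theta0 / (1 - theta0))
```

The two expressions for R(θ) agree, and the three printed quantities match the figures quoted above.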
Were we given the optimal treatment mechanism in advance, we would obtain confidence intervals (based on the central limit theorem and Slutsky's lemma) whose widths are approximately √R(θ) ≃ 0.931 times those of the corresponding confidence intervals we would have got using the balanced treatment mechanism.

However, the gain could actually be more dramatic than the previous example suggests. Let us consider again the testing setting: we want to test the null "Ψ(θ) = 0" against the alternative "Ψ(θ) > 0" with type I error α and power (1 − β) at some user-defined alternative ψ > 0.

Figure 1: Plot of the relative efficiency R(θ) as a function of θ_1 for different values of θ_0 (θ = (θ_0,θ_1)). The solid, dashed and dotted curves respectively correspond to θ_0 = 1/2, 1/10, 1/100.

By the delta-method, we know that the MLE of Ψ(θ),

    Ψ_n = log( ∑_{i=1}^n Y_i 1l{A_i = 1} / ∑_{i=1}^n 1l{A_i = 1} ) − log( ∑_{i=1}^n Y_i 1l{A_i = 0} / ∑_{i=1}^n 1l{A_i = 0} ),

based on n iid copies O_i = (A_i,Y_i) of O, is asymptotically efficient. It is furthermore natural to refer to I_n = n/s_n², the inverse of the estimated variance of Ψ_n at time n, as the statistical information available at that time. Under ψ, the central limit theorem applies and teaches us that √I_n (Ψ_n − ψ) converges in distribution, as n grows to infinity, to the standard normal distribution.

Deciding to reject the null if √I_n Ψ_n ≥ ξ_{1−α} yields a test with asymptotic type I error α. In order to ensure that its asymptotic power at alternative ψ is (1 − β), it is sufficient that

    n = inf{ t ≥ 1 : I_t ≥ ((ξ_{1−α} − ξ_β)/ψ)² = I_max }, (6)

I_max being the so-called maximum committed information.

For n large enough, I_n ≃ n/v_b(θ) if we use the balanced treatment mechanism, while I_n would have been approximately equal to n/v⋆(θ), had we used the optimal treatment mechanism. Substituting bluntly n/v_b(θ) or n/v⋆(θ) for I_n in (6), we see that the ratio of the testing times n_b (using the balanced treatment mechanism) and n⋆ (using the optimal one) satisfies

    n⋆/n_b ≃ v⋆(θ)/v_b(θ) = R(θ),

the relative efficiency criterion.
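The committed information and the two approximate testing times can be computed from (6) with standard normal quantiles; here is a minimal sketch for the running example θ = (1/100, 5/100) with α = 0.05 and β = 0.1, using only the Python standard library:

```python
import math
from statistics import NormalDist

theta0, theta1 = 1 / 100, 5 / 100
alpha, beta = 0.05, 0.10
psi = math.log(theta1 / theta0)  # alternative at which the test is powered

# Maximum committed information from (6); xi(q) is the standard normal
# q-quantile, so xi(beta) is negative here.
xi = NormalDist().inv_cdf
i_max = ((xi(1 - alpha) - xi(beta)) / psi) ** 2

# Efficient asymptotic variances under the optimal and balanced designs.
v_star = (math.sqrt((1 - theta0) / theta0)
          + math.sqrt((1 - theta1) / theta1)) ** 2
v_bal = 2 * ((1 - theta0) / theta0 + (1 - theta1) / theta1)

# Approximate testing times: I_n ~ n / v, so n ~ I_max * v.
n_star, n_bal = i_max * v_star, i_max * v_bal
```

The output reproduces the numbers quoted in the text: I_max ≃ 3.306, n⋆ ≃ 676.9 and n_b ≃ 780.2, so n⋆/n_b ≃ R(θ) ≃ 0.868.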
In other words, were we given the optimal treatment mechanism in advance, we would on average need to sample R(θ) ∈ (1/2, 1) times the number of observations required when using the balanced treatment mechanism. In the previous example where θ = (1/100, 5/100), setting α = 0.05, β = 0.1 and the alternative parameter ψ = Ψ(θ) ≃ 1.609 > 0, the maximum committed information is I_max ≃ 3.306, n⋆ ≃ 676.901 and n_b ≃ 780.248.

In summary, resorting to the balanced treatment mechanism may be a very poor (inefficient) choice. Since the optimal treatment mechanism g⋆ can be learned from the data, why not use it? Once again, we emphasize that there is nothing special about targeting this specific treatment mechanism. One could have also chosen to target that treatment mechanism which minimizes the expected number of failures subject to a power constraint (Rosenberger et al., 2001).

Of course, targeting the optimal treatment mechanism on the fly implies losing independence between successive observations, making the study of the design more involved. However, we present and study here such a methodology. It is built on the seminal technical report (van der Laan, 2008).

3 Targeting the optimal design

In this section, we describe how we carry out the adaptation of the sampling scheme. For simplicity's sake, the randomization probability is updated at each step (i.e. each time a new observation is sampled). However, it is important to understand that all our results (and their proofs) still hold when adaptation only occurs once c new observations are accrued, where c is a pre-determined integer.
In this view, we use c = 1 in the whole article as well as in the companion article (Chambaz and van der Laan, 2010b), although our R code can handle the general case.

3.1 Adaptive coarsening at random assumption.

We denote by A_i, X_i = (X_i(0),X_i(1)), Y_i = X_i(A_i) and O_i = (A_i,Y_i) the treatment assignment, full data structure, outcome, and observation for experimental unit i. Whereas X_1,...,X_n are assumed iid, the random variables A_1,...,A_n are not independent anymore since we want to adapt the treatment mechanism based on past observations. Defining

    A_n = (A_1,...,A_n), X_n = (X_1,...,X_n), O_n = (O_1,...,O_n),

and, for every i = 0,...,n,

    A_n(i) = (A_1,...,A_i), X_n(i) = (X_1,...,X_i), O_n(i) = (O_1,...,O_i)

(with convention A_n(0) = X_n(0) = O_n(0) = ∅), let g_n(·|X_n) denote the conditional distribution of the design settings A_n given the full data X_n: by the chain rule,

    g_n(A_n|X_n) = ∏_{i=1}^n P(A_i|A_n(i−1), X_n), (7)

hence the additional notation

    g_i(a_i|a(i−1), x) = P(A_i = a_i|A_n(i−1) = a(i−1), X_n = x)

for all 1 ≤ i ≤ n, a = (a_1,...,a_n) ∈ A^n, x = (x_1,...,x_n) ∈ X^n.

In this new setting, we state the following adaptive counterpart of the CAR assumption (1): for all 1 ≤ i ≤ n, a ∈ A^n, x ∈ X^n, letting o_i = (a_i, x_i(a_i)) be the corresponding realization of observation O_i,

    g_i(a_i|a(i−1), x) = P(A_i = a_i|X_i = x_i, O_n(i−1) = o(i−1))
                       = P(A_i = a_i|X_i(a_i) = x_i(a_i), O_n(i−1) = o(i−1)).

With obvious convention, the new adaptive CAR assumption also writes as

    g_i(a_i|a(i−1), x) = g_i(a_i|x_i, o(i−1)) = g_i(a_i|x_i(a_i), o(i−1)) (8)

for all 1 ≤ i ≤ n, a ∈ A^n, x ∈ X^n; it states that, for each i, A_i is conditionally independent of the full data X_n given the observed data O_n(i−1) for the first (i−1) experimental units and the full data X_i for the ith experimental unit, and in addition that the conditional probability of A_i = a_i given X_i and O_n(i−1) actually only depends on the observed part X_i(a_i) and O_n(i−1). In particular, (7) reduces to

    g_n(A_n|X_n) = ∏_{i=1}^n g_i(A_i|X_i(A_i), O_n(i−1)), (9)

which justifies the notation g_n = (g_1,...,g_n).
Note that in the framework of this article, a treatment mechanism g_n complies with the adaptive CAR assumption (9) if and only if, for all 1 ≤ i ≤ n, a ∈ A^n, x ∈ X^n,

    g_i(a_i|a(i−1), x) = g_i(a_i|o(i−1)).

Note finally that we find it useful to consider g_i satisfying (8) as a random element (through O_n(i−1)) of the set G of CAR designs.

3.2 Data generating mechanism for adaptive design and likelihood.

Given the available data O_n(i−1) = (O_1,...,O_{i−1}) at step i, one first draws X_i from P_X independently of X_n(i−1), then one calculates the conditional distribution g_i(·|X_i, O_n(i−1)) and one samples A_i given X_i from it, the next observation finally being O_i = (A_i, X_i(A_i)). Regarding the likelihood of O_n, if

    X(O_n) = {x ∈ X^n : x_i(A_i) = Y_i, i ≤ n} = ⊗_{i=1}^n X(O_i)

(the set of those realizations x of X_n compatible with O_n), then

    L(O_n) = ∑_{x ∈ X(O_n)} P(O_n, X_n = x) = ∑_{x ∈ X(O_n)} g_n(A_n|x) P(X_n = x)
           = ∏_{i=1}^n g_i(A_i|Y_i, O_n(i−1)) P(X_n ∈ X(O_n))
           = ∏_{i=1}^n g_i(A_i|Y_i, O_n(i−1)) ∏_{i=1}^n θ(O_i), (10)

the third and fourth equalities being derived from the adaptive CAR equality (9) and from the independence of X_1,...,X_n respectively. Thus, the likelihood remarkably factorizes into the product of a θ-factor and a g_n-factor. Thanks to the form of the likelihood, we can say that O_n is obtained under (θ,g_n). For convenience, we also sometimes write that O_n is obtained under the g_n-adaptive sampling scheme without specifying the parameter θ.
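The factorization in (10) can be checked mechanically: the log-likelihood of adaptively sampled data splits into a θ-term and a g_n-term. A small sketch with illustrative data and randomization probabilities (not the article's):

```python
import math

# Toy adaptive trial record: (A_i, Y_i) pairs, plus the randomization
# probabilities g_i(1 | O_n(i-1)) actually used before each draw.
data = [(1, 0), (0, 1), (1, 1), (0, 0), (1, 1)]
g_used = [0.5, 0.55, 0.45, 0.6, 0.4]
theta = (0.3, 0.6)

# Joint log-likelihood computed term by term from (10)...
loglik = sum(
    math.log((theta[a] ** y) * ((1 - theta[a]) ** (1 - y)))
    + math.log(g if a == 1 else 1 - g)
    for (a, y), g in zip(data, g_used)
)

# ...equals the sum of a theta-factor and a g_n-factor.
theta_part = sum(math.log((theta[a] ** y) * ((1 - theta[a]) ** (1 - y)))
                 for a, y in data)
g_part = sum(math.log(g if a == 1 else 1 - g)
             for (a, _), g in zip(data, g_used))
```

The split is exact, which is what lets the MLE of θ ignore the g_n-factor entirely.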
Likewise, we later refer to data obtained under the iid g_b-balanced or under the iid g⋆-optimal sampling schemes.

3.3 Strategy.

Let θ_i = (θ_{i,0}, θ_{i,1}) denote for each i ≤ n the MLE of θ = (θ_0,θ_1) ∈ ]0,1[² based on O_n(i) (with convention θ_{i,a} = 1/2 as long as no relevant observation is available). Thanks to the form of the log-likelihood exhibited in (10) and as soon as ∑_{j=1}^i 1l{A_j = a} ≥ 1, θ_{i,a} is the empirical mean

    θ_{i,a} = ∑_{j=1}^i Y_j 1l{A_j = a} / ∑_{j=1}^i 1l{A_j = a},

as if we used a deterministic treatment mechanism (and observations were iid).
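In code, the per-arm MLE is simply a stratified empirical mean with the stated convention; a sketch with illustrative data:

```python
def arm_mle(data, arm):
    """MLE theta_{i,a}: empirical mean of Y over units assigned to `arm`,
    with the convention 1/2 when the arm has no observation yet."""
    ys = [y for a, y in data if a == arm]
    return sum(ys) / len(ys) if ys else 0.5

observations = [(1, 0), (0, 1), (1, 1), (1, 1)]  # (A_j, Y_j), illustrative
theta_hat = (arm_mle(observations, 0), arm_mle(observations, 1))
```

Here arm 0 has a single success out of one observation and arm 1 has two successes out of three, so `theta_hat` is `(1.0, 2/3)`.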


The International Journal of Biostatistics, Vol. 7 [2011], Iss. 1, Art. 10These empirical means yield plug-in estimates σi 2(a)= θ i,a(1 − θ i,a ) ofσ 2 (θ)(a), as well as plug-in estimates of the optimal treatment mechanism g ⋆ (θ)introducedin(4), characterized by g s 1 (1|O n(0)) = 1 2and fori ≥1,√g s i+1 (1|O θi,0 (1 − θ i,1 )n(i)) = √θi,0 (1 − θ i,1 )+ √ (11)θ i,1 (1 − θ i,0 )(sometimes abbreviated to g s i+1(1)), hence a first adaptive CAR treatment mechanismg s n = (gs 1 ,...,gs n ). Another interesting choice is also consi<strong>de</strong>red here, whichwecharacterize iterativelyby g 1 (1|O n (0)) =2 1 andfori ≥1,{ ( ) 21 ig i+1 (1|O n (i)) =argminγ∈(0,1) i+1∑g j (1|O n (j−1))+γ −gi(1)} s . (12)j=1This alternative choice aims at obtaining a balance between the two treatmentswhich,atexperimenti,closelyapproximatesg ⋆ (θ),inthesensethat 1 i ∑i j=1 1l{A j =1} ≃g s i−1(1), the current best guess. This second <strong>de</strong>finition is more aggressive inthepursuitoftheoptimaltreatmentmechanism,as it tries tocompensateon theflyforearly sub-optimalsampling.Atechnicalconditionwasactuallyleftasi<strong>de</strong>inthe<strong>de</strong>finitionofg s n. Becausewe want to exclu<strong>de</strong> the possibility that the adaptive <strong>de</strong>sign stops a treatment armwithprobabilitytendingto1, weimposethatg i+1 (1|O n (i)) ∈ [δ;1 − δ] forasmallδ > 0 (such that δ < min a∈A g ⋆ (θ)(a) and 1 − δ > max a∈A g ⋆ (θ)(a)) by lettingg ⋆ 1 (1|O n(0)) = 1 2and fori ≥1,g ⋆ i+1(1|O n (i)) =min { 1 − δ,max { δ,g s i+1 (1|O n(i)) }} (13)(sometimesabbreviatedtog ⋆ i+1 (1)),thuscharacterizingtheadaptiveCARtreatmentmechanismg ⋆ n = (g⋆ 1 ,...,g⋆ n ). Similarly,wesubstituteg⋆ i (1)togs i(1)andallow γ tovaryin [δ,1−δ]onlyin(12),yieldinganotheradaptiveCARtreatmentmechanismg a n (wherethesuperscriptastandsforaggressive). 
The same kind of δ-thresholding, which makes perfect sense from an applied point of view too, was already suggested in (Tymofyeyev et al., 2007).

It is worth noting that Hu, Zhang, and He (2009) address a very interesting problem that we do not consider here: they show how to target the user-supplied treatment mechanism in such a way that the Cramér-Rao lower bounds on the allocation variances are achieved, thanks to a clever generalization of Efron's celebrated biased coin design (Efron, 1971). Of course, resorting to the same kind of strategy would be possible here, but it is outside the scope of this article.

In the rest of this article, we investigate theoretically the properties of the data-adaptive designs based on g⋆_n only. The theoretical study is backed by the simulation carried out in the companion article (Chambaz and van der Laan, 2010b), where g^a_n is also considered.

4 Asymptotic study

We address in this section the asymptotic statistical study of the method presented in the previous section. We state strong consistency results, then a central limit theorem, from which we derive a rule for constructing confidence intervals.

The following consistency result holds, which teaches us that the method does learn what the optimal design is. Its proof mainly relies on Theorem 6, and we refer to (Chambaz and van der Laan, 2010a) for the details.

Theorem 1. Let θ_n be the MLE of θ ∈ ]0,1[² based on O_n sampled from (θ,g⋆_n), for the adaptive CAR treatment mechanism g⋆_n characterized by (13). Then θ_n converges almost surely to θ.
Consequently, Ψ_n is a strongly consistent estimate of Ψ(θ) and g⋆_n converges to the optimal design g⋆(θ) in the sense that g⋆_n(1) and (1/n) ∑_{i=1}^n g⋆_i(1) both converge almost surely to g⋆(θ)(1).

In order to provide statistical inference, we now need to establish that the vector √n(θ_n − θ) converges in distribution so that one can construct confidence intervals for θ and come up with valid testing procedures. The following central limit theorem actually holds. Its proof mainly involves the multivariate central limit theorem stated in Theorem 8 and resorting to the delta-method; see (Chambaz and van der Laan, 2010a) for the details.

Theorem 2. Let θ_n be the MLE of θ ∈ ]0,1[² based on O_n sampled from (θ,g⋆_n), for the adaptive CAR treatment mechanism g⋆_n defined in (13). Let D⋆_0(θ,g⋆(θ)), D⋆_1(θ,g⋆(θ)) and IC(θ,g⋆(θ)) be as defined in (2) and (3) with g = g⋆(θ). Let D⋆(θ,g⋆(θ)) = (D⋆_0(θ,g⋆(θ)), D⋆_1(θ,g⋆(θ))). Then, under (θ,g⋆_n),

    √n(θ_n − θ) = (1/√n) ∑_{i=1}^n D⋆(θ,g⋆(θ))(O_i) (1 + o_P(1)). (14)

Furthermore, (1/√n) ∑_{i=1}^n D⋆(θ,g⋆(θ))(O_i) is a normalized discrete martingale which converges under (θ,g⋆_n) to a centered Gaussian distribution with covariance matrix

    Σ⋆ = P_{θ,g⋆(θ)} D⋆(θ,g⋆(θ))^⊤ D⋆(θ,g⋆(θ)) = diag( θ_0(1−θ_0)/g⋆(θ)(0), θ_1(1−θ_1)/g⋆(θ)(1) ). (15)


The latter is consistently estimated with its empirical counterpart

    (1/n) ∑_{i=1}^n diag( (Y_i − θ_{n,0})² 1l{A_i = 0}/g⋆_n(0)², (Y_i − θ_{n,1})² 1l{A_i = 1}/g⋆_n(1)² ), (16)

as if the sampling was iid. Thus, under (θ,g⋆_n), we also have

    √n(Ψ_n − Ψ(θ)) = (1/√n) ∑_{i=1}^n IC(θ,g⋆(θ))(O_i) + o_P(1), (17)

and convergence in distribution of √n(Ψ_n − Ψ(θ)) to a centered Gaussian distribution with variance v⋆(θ), the optimal efficient asymptotic variance. The latter is finally consistently estimated with either v⋆(θ_n) or

    (1/n) ∑_{i=1}^n IC(θ_n,g⋆_n)(O_i)² = (1/n) ∑_{i=1}^n ( (Y_i − θ_{n,0})² 1l{A_i = 0}/(θ_{n,0}² g⋆_n(0)²) + (Y_i − θ_{n,1})² 1l{A_i = 1}/(θ_{n,1}² g⋆_n(1)²) ),

as if sampling was iid.

We wish to construct a confidence interval for Ψ(θ) based on O_n sampled under (θ,g⋆_n). Let us denote by s_n² either of the consistent estimates of v⋆(θ) based on O_n introduced in Theorem 2, and let ξ_{1−α/2} be the (1 − α/2)-quantile of the standard normal distribution. By the latter theorem, the following holds.

Highlight 1 (pointwise estimation and confidence interval). In view of Theorems 1 and 2, the estimator Ψ_n of Ψ(θ) obtained under the (θ,g⋆_n)-adaptive sampling scheme is strongly consistent, the estimated probability of being treated g⋆_n(1) also converging almost surely to the optimal probability of being treated g⋆(θ)(1).
In addition, the confidence interval

    [Ψ_n ± ξ_{1−α/2} s_n/√n]

has asymptotic coverage (1 − α).

This theoretical result is validated with simulations in Section 3 of the companion article (Chambaz and van der Laan, 2010b), focusing on the empirical distribution of the MLE in Section 3.2, on the empirical coverage guaranteed by the confidence intervals in Section 3.3 and on their empirical widths in Section 3.4. Moreover, an illustrating example is developed in Section 3.5 of (Chambaz and van der Laan, 2010b).

5 Targeted optimal design group sequential testing

Obviously, the sequence of estimators (Ψ_n)_{n≥1} can be used to carry out the test of the null "Ψ(θ) = ψ_0" against its unilateral alternative "Ψ(θ) > ψ_0" for some ψ_0 ∈ R. We build in this section a group sequential testing procedure, that is, a testing procedure which repeatedly tries to make a decision at intervals, rather than once all data are collected or after every new observation is obtained (such a testing procedure would be said to be fully sequential). We refer to (Jennison and Turnbull, 2000, Proschan et al., 2006) for a general presentation of group sequential testing procedures.

5.1 The targeted optimal design group sequential testing procedure.

Formal description of the targeted optimal design group sequential testing procedure. We wish to test the null "Ψ(θ) = ψ_0" against "Ψ(θ) > ψ_0" with asymptotic type I error α and asymptotic type II error β at some ψ_1 > ψ_0.
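Before the K-look construction, it may help to see the single-look (K = 1) version of this test in code: a hypothetical sketch (data, helper names and inputs are ours) combining the estimator and interval of Highlight 1 with the rejection rule of Section 2.3:

```python
import math
from statistics import NormalDist

def wald_summary(data, g_n, alpha=0.05, psi0=0.0):
    """One-look Wald inference for Psi = log(theta_1 / theta_0).

    `data` is a list of (A_i, Y_i) pairs; `g_n` is the final estimate of the
    probability of being treated, g(1). Both inputs are illustrative.
    """
    n = len(data)
    mean = [sum(y for a, y in data if a == arm)
            / max(1, sum(1 for a, _ in data if a == arm)) for arm in (0, 1)]
    psi_n = math.log(mean[1] / mean[0])
    # Plug-in estimate v*(theta_n) of the asymptotic variance (cf. Theorem 2).
    g = (1 - g_n, g_n)
    s2 = sum((1 - mean[a]) / (mean[a] * g[a]) for a in (0, 1))
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(s2 / n)
    info = n / s2  # statistical information I_n = n / s_n^2
    reject = math.sqrt(info) * (psi_n - psi0) >= NormalDist().inv_cdf(1 - alpha)
    return psi_n, (psi_n - half, psi_n + half), reject

# Illustrative data: 20% success on arm 0, 40% on arm 1, balanced design.
example = [(0, 1)] * 20 + [(0, 0)] * 80 + [(1, 1)] * 40 + [(1, 0)] * 60
psi_n, ci, reject = wald_summary(example, g_n=0.5)
```

Here Ψ_n = log 2 ≈ 0.693, the interval covers it by construction, and the one-sided test rejects the null "Ψ(θ) = 0".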
We intend to proceed group sequentially with K ≥ 2 steps, and we wish to rely on a multidimensional t-statistic of the form

    (T̃_1,...,T̃_K) = ( √N_k (Ψ_{N_k} − ψ_0)/s_{N_k} )_{k≤K}, (18)

where each N_k is a carefully chosen (random) sample size and where s_n² estimates the asymptotic variance of √n(Ψ_n − Ψ(θ)) under (θ,g⋆_n) sampling (see Theorem 2).

To this end, let 0 < p_1 < ... < p_K = 1 be increasingly ordered proportions. Consider the α-spending and β-spending strategies (α_1,...,α_K) and (β_1,...,β_K), i.e. K-tuples of positive numbers such that ∑_{k=1}^K α_k = α and ∑_{k=1}^K β_k = β. One could for instance choose α-spending and β-spending functions f_α, f_β, that are increasing functions from [0,1] to [0,1] such that f_α(0) = f_β(0) = 0 and f_α(1) = f_β(1) = 1, and set ∑_{l=1}^k α_l = f_α(p_k)α, ∑_{l=1}^k β_l = f_β(p_k)β for all k ≤ K.

Now, let (Z_1,...,Z_K) follow the centered Gaussian distribution with covariance matrix C = (√(p_{k∧l}/p_{k∨l}))_{k,l≤K} and let us assume that there exists a unique value I > 0, the so-called maximum committed information, from now on denoted by I_max, such that there exist a rejection boundary (a_1,...,a_K) and a futility boundary (b_1,...,b_K) satisfying a_K = b_K, P(Z_1 ≥ a_1) = α_1, P(Z_1 + (ψ_1 − ψ_0)√(p_1 I) ≤ b_1) = β_1, and, for every 1 ≤ k < K,

    P(∀j ≤ k, b_j < Z_j < a_j; Z_{k+1} ≥ a_{k+1}) = α_{k+1},
    P(∀j ≤ k, b_j < Z_j + (ψ_1 − ψ_0)√(p_j I) < a_j; Z_{k+1} + (ψ_1 − ψ_0)√(p_{k+1} I) ≤ b_{k+1}) = β_{k+1}.
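The canonical covariance C is exactly that of a standardized Brownian motion sampled at the information fractions: Cov(B_{p_k}/√p_k, B_{p_l}/√p_l) = p_{k∧l}/√(p_k p_l) = √(p_{k∧l}/p_{k∨l}). A short sketch with illustrative proportions makes the two expressions concrete:

```python
import math

p = [0.25, 0.5, 1.0]  # illustrative information fractions, with p_K = 1
K = len(p)

# Canonical covariance matrix C = (sqrt(p_{k^l} / p_{kvl}))_{k,l<=K}.
C = [[math.sqrt(min(p[k], p[l]) / max(p[k], p[l])) for l in range(K)]
     for k in range(K)]

# Same matrix via the Brownian representation Z_k = B_{p_k} / sqrt(p_k),
# using Cov(B_s, B_t) = min(s, t).
C_bm = [[min(p[k], p[l]) / math.sqrt(p[k] * p[l]) for l in range(K)]
        for k in range(K)]
```

Both constructions give the same symmetric matrix with unit diagonal, which is what makes the boundary equations above well posed.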


Consider sample sizes n_k of order np_k for each k ≤ K. We wish to test the null against its alternative based on the multidimensional t-statistic

    (T_1,...,T_K) = ( √n_k (Ψ_{n_k} − ψ_0)/s_{n_k} )_{k≤K} (19)

(s_n² estimates the asymptotic variance of √n(Ψ_n − Ψ(θ)) under (θ,g⋆_n); see Theorem 2). Before going any further, we state a crucial theorem which describes how the test statistic converges towards the so-called canonical distribution. A similar result is obtained by Zhu and Hu (2010) through a different approach based on the study of the limit distribution of a stochastic process defined over (0,1].

Theorem 3. Consider h = (h_0,h_1) ∈ R² satisfying both h_1 > h_0 and γh_1 + γ^{−1}h_0 ≠ 0, where γ = (OR(θ) g⋆(θ)(1)/g⋆(θ)(0))^{1/2}. Define

    θ_{h/√n} = ( θ_0(1 + h_0/√n), θ_1(1 + h_1/√n) )

for all n ≥ n_0 large enough to ensure θ_{h/√n} ∈ ]0,1[². The sequence (θ_{h/√n})_{n≥n_0} defines a sequence (ψ_n)_{n≥n_0} of contiguous parameters ("from direction h"), with ψ_n = Ψ(θ_{h/√n}) > Ψ(θ).

Introduce the mean vector µ(h) = (h_1 − h_0)(√p_1,...,√p_K)/√v⋆(θ) and the covariance matrix C = (√(p_{k∧l}/p_{k∨l}))_{k,l≤K}. Then:

(i) under (θ,g⋆_n), (T_1,...,T_K) converges in distribution, as n tends to infinity, to the centered Gaussian distribution with covariance matrix C;

(ii) under (θ_{h/√n},g⋆_n), (T_1,...,T_K) converges in distribution, as n tends to infinity, to the Gaussian distribution with mean µ(h) and covariance matrix C.

Say we want to perform a test with asymptotic type I error α and asymptotic power (1 − β) at the limit of the sequence of contiguous parameters (ψ_n)_{n≥n_0}, i.e. such that (a) the probability of rejecting the null for its alternative under (θ,g⋆_n), and (b) the probability of failing to reject the null for its alternative under (θ_{h/√n},g⋆_n) converge (as n tends to infinity) towards α and β, respectively.

Let us consider the α-spending and β-spending strategies (α_1,...,α_K) and (β_1,...,β_K). It is usually assumed that the next lemma holds:

Lemma 4. In the framework of Theorem 3, let (Z_1,...,Z_K) follow the centered Gaussian distribution with covariance matrix C (as defined in Theorem 3). Assume that α + β < 1. There exists a unique ε > 0, a unique rejection boundary (a_1,...,a_K) and a unique futility boundary (b_1,...,b_K) such that a_K = b_K, P(Z_1 ≥ a_1) = α_1, P(Z_1 + µ(εh)_1 ≤ b_1) = β_1, and, for every 1 ≤ k < K,

    P(∀j ≤ k, b_j < Z_j < a_j; Z_{k+1} ≥ a_{k+1}) = α_{k+1},
    P(∀j ≤ k, b_j < Z_j + µ(εh)_j < a_j; Z_{k+1} + µ(εh)_{k+1} ≤ b_{k+1}) = β_{k+1}.


Proof of Theorem 3. By the continuous mapping theorem, we readily obtain from Lemma 5 the convergence in distribution, under (θ,g⋆_n), of

    ( √n_1(Ψ_{n_1} − Ψ(θ)),..., √n_K(Ψ_{n_K} − Ψ(θ)) )

to the centered Gaussian distribution with covariance matrix (v⋆(θ)√(p_{k∧l}/p_{k∨l}))_{k,l≤K}. Then Slutsky's lemma straightforwardly yields the first convergence (i).

Regarding (ii), we first invoke Lemma 5 and Le Cam's third lemma (see Example 6.7 in (van der Vaart, 1998)) in order to obtain that, under (θ_{h/√n},g⋆_n), the vector (√n_1(Ψ_{n_1} − Ψ(θ)),...,√n_K(Ψ_{n_K} − Ψ(θ))) converges in distribution to the Gaussian distribution with mean √v⋆(θ)µ(h) and covariance matrix (v⋆(θ)√(p_{k∧l}/p_{k∨l}))_{k,l≤K}. In addition, the (θ_{h/√n},g⋆_n) and (θ,g⋆_n) experiments are mutually contiguous, implying that if s_n² estimates v⋆(θ) under (θ,g⋆_n), then it also estimates v⋆(θ) under (θ_{h/√n},g⋆_n). We apply Slutsky's lemma again in order to obtain the second convergence (ii).

Proof of Lemma 5. Let us first consider the log-likelihood ratio of the experiments (θ_{h/√n},g⋆_n) with respect to (θ,g⋆_n). The shorthand notations θ(O) = θ_A^Y(1−θ_A)^{1−Y} and θ_{h/√n}(O) = [θ_A(1+h_A/√n)]^Y [1−θ_A(1+h_A/√n)]^{1−Y} and (10) readily yield

    Λ_n = log ∏_{i=1}^n ( g⋆_i(A_i|Y_i,O_n(i−1)) θ_{h/√n}(O_i) ) / ( g⋆_i(A_i|Y_i,O_n(i−1)) θ(O_i) )
        = ∑_{i=1}^n [ 1l{A_i = 0} ( Y_i log(1 + h_0/√n) + (1−Y_i) log(1 − θ_0h_0/((1−θ_0)√n)) )
                    + 1l{A_i = 1} ( Y_i log(1 + h_1/√n) + (1−Y_i) log(1 − θ_1h_1/((1−θ_1)√n)) ) ]
        = (1/√n) ∑_{i=1}^n L_1(O_i) − (1/2n) ∑_{i=1}^n L_2(O_i) + o_P(1),

with

    L_1(O_i) = 1l{A_i = 0} (Y_i − θ_0)h_0/(1−θ_0) + 1l{A_i = 1} (Y_i − θ_1)h_1/(1−θ_1),
    L_2(O_i) = 1l{A_i = 0} [ Y_i + (1−Y_i)(θ_0/(1−θ_0))² ] h_0² + 1l{A_i = 1} [ Y_i + (1−Y_i)(θ_1/(1−θ_1))² ] h_1².

First, we observe that, since L_2 is bounded (and measurable), (1/n) ∑_{i=1}^n L_2(O_i) = (1/n) ∑_{i=1}^n P_{θ,g⋆_i}L_2 + (1/n) ∑_{i=1}^n [L_2(O_i) − P_{θ,g⋆_i}L_2] = (1/n) ∑_{i=1}^n P_{θ,g⋆_i}L_2 + o_P(1) by virtue of the Kolmogorov law of large numbers. Now, P_{θ,g⋆_i}L_2 = g⋆_i(0) θ_0h_0²/(1−θ_0) + g⋆_i(1) θ_1h_1²/(1−θ_1), hence

    (1/n) ∑_{i=1}^n P_{θ,g⋆_i}L_2 = (1/n) ∑_{i=1}^n g⋆_i(0) θ_0h_0²/(1−θ_0) + (1/n) ∑_{i=1}^n g⋆_i(1) θ_1h_1²/(1−θ_1)
                                  = g⋆(θ)(0) θ_0h_0²/(1−θ_0) + g⋆(θ)(1) θ_1h_1²/(1−θ_1) + o_P(1)

by virtue of Theorem 1. In summary, we obtain that

    Λ_n = (1/√n) ∑_{i=1}^n L_1(O_i) − τ²/2 + o_P(1) (20)

for τ² = g⋆(θ)(0) θ_0h_0²/(1−θ_0) + g⋆(θ)(1) θ_1h_1²/(1−θ_1).

Second, we define Z_i = (1l{i ≤ n_1},...,1l{i ≤ n_K}) and we introduce the bounded (and measurable) function f such that

    f(O_i,Z_i) = ( Z_i IC(θ,g⋆(θ))(O_i), L_1(O_i) ).

Let us show that M_n = (1/n) ∑_{i=1}^n [f(O_i,Z_i) − P_{θ,g⋆_i}f] = (1/n) ∑_{i=1}^n f(O_i,Z_i) (this equality holds because, for all i ≤ n, one has P_{θ,g⋆_i}IC(θ,g⋆(θ)) = P_{θ,g⋆_i}L_1 = 0) is asymptotically Gaussian. In view of Theorem 8, let W_n = ∑_{i=1}^n P_{θ,g⋆_i} f^⊤f and Σ_n = (1/n)EW_n. The entries of W_n take one of the following forms: A_n = ∑_{i=1}^{n_k} P_{θ,g⋆_i} L_1 IC(θ,g⋆(θ)), or B_n = ∑_{i=1}^{n_k∧n_l} P_{θ,g⋆_i} IC(θ,g⋆(θ))², or C_n = ∑_{i=1}^n P_{θ,g⋆_i} L_1².
Now,

• A_n = (h_1/g⋆(θ)(1)) ∑_{i=1}^{n_k} g⋆_i(1) − (h_0/g⋆(θ)(0)) ∑_{i=1}^{n_k} g⋆_i(0), so that (1/n)EA_n = p_k(h_1 − h_0) + o(1) and (1/n)A_n − (1/n)EA_n = o_P(1), since the almost sure convergence of the bounded sequence (1/n) ∑_{i=1}^n g⋆_i(a) towards g⋆(θ)(a) (see Theorem 1) implies its convergence in L¹ norm to the same limit;

• B_n = ((1−θ_1)/(θ_1 g⋆(θ)(1)²)) ∑_{i=1}^{n_k∧n_l} g⋆_i(1) + ((1−θ_0)/(θ_0 g⋆(θ)(0)²)) ∑_{i=1}^{n_k∧n_l} g⋆_i(0), from which it follows that (1/n)EB_n = p_{k∧l} ( (1−θ_1)/(θ_1 g⋆(θ)(1)) + (1−θ_0)/(θ_0 g⋆(θ)(0)) ) + o(1) = p_{k∧l} v⋆(θ) + o(1), hence (1/n)B_n − (1/n)EB_n = o_P(1) for the same reasons as above;


• C_n = (θ_1h_1²/(1−θ_1)) ∑_{i=1}^n g⋆_i(1) + (θ_0h_0²/(1−θ_0)) ∑_{i=1}^n g⋆_i(0), hence (1/n)EC_n = τ² + o(1), and (1/n)C_n − (1/n)EC_n = o_P(1) for the same reasons as above.

Those calculations notably teach us that, setting m = (h_1 − h_0)(p_1,...,p_K) and Σ_0 = (p_{k∧l})_{k,l≤K}, Σ_n converges to

    Σ = [ v⋆(θ)Σ_0   m^⊤
          m           τ²  ].

Is Σ a positive definite covariance matrix? Well, Σ_0 is a positive definite covariance matrix (that of the vector (B_{p_1},...,B_{p_K}) where (B_t)_{t≥0} is a standard Brownian motion), hence the symmetric matrix Σ is a positive definite covariance matrix if and only if its determinant det(Σ) > 0. Subtracting (h_1 − h_0)/v⋆(θ) times the Kth row of Σ from its last row, we get that det(Σ) = v⋆(θ)^K det(Σ_0) × (τ² − (h_1 − h_0)²/v⋆(θ)). Now, using v⋆(θ) = (1−θ_0)/(θ_0 g⋆(θ)(0)) + (1−θ_1)/(θ_1 g⋆(θ)(1)) and γh_1 + γ^{−1}h_0 ≠ 0 (required in Theorem 3) yields

    v⋆(θ)τ² − (h_1 − h_0)² = h_1² ( θ_1 v⋆(θ) g⋆(θ)(1)/(1−θ_1) − 1 ) + h_0² ( θ_0 v⋆(θ) g⋆(θ)(0)/(1−θ_0) − 1 ) + 2h_0h_1
                            = (γh_1 + γ^{−1}h_0)² > 0.

In summary, Σ is a positive definite covariance matrix, the conditions of Theorem 8 are met, and therefore √n M_n = (1/√n) ∑_{i=1}^n f(O_i,Z_i) converges in distribution to the centered Gaussian distribution with covariance matrix Σ.

Let us define the diagonal matrices ∆_n = diag(√(n/n_1),...,√(n/n_K), 1) and ∆ = diag(1/√p_1,...,1/√p_K, 1); obviously, ∆_n = ∆ + o(1) and √n M_n ∆_n = √n M_n ∆ + o_P(1). Invoking (17) in Theorem 2 and (20), it holds that, under (θ,g⋆_n),

    ( √n_1(Ψ_{n_1} − Ψ(θ)),..., √n_K(Ψ_{n_K} − Ψ(θ)), Λ_n )
      = (0,...,0, −τ²/2) + √n M_n ∆_n + o_P(1)
      = (0,...,0, −τ²/2) + √n M_n ∆ + o_P(1).

This entails the convergence in distribution of (√n_1(Ψ_{n_1} − Ψ(θ)),...,√n_K(Ψ_{n_K} − Ψ(θ)), Λ_n) under (θ,g⋆_n) to the Gaussian distribution with mean (0,...,0,−τ²/2) and covariance matrix ∆Σ∆.
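The algebraic identity closing the positivity argument can be sanity-checked numerically, reading γ as the square root (OR(θ)g⋆(θ)(1)/g⋆(θ)(0))^{1/2}, which is what makes the two sides match; the parameter values below are illustrative:

```python
import math

theta0, theta1, h0, h1 = 0.3, 0.6, 1.0, 2.0  # illustrative parameters

# Neyman allocation g*(theta)(1) and the quantities entering the identity.
num = math.sqrt(theta0 * (1 - theta1))
g1 = num / (num + math.sqrt(theta1 * (1 - theta0)))
g0 = 1 - g1
v_star = (1 - theta0) / (theta0 * g0) + (1 - theta1) / (theta1 * g1)
tau2 = (g0 * theta0 * h0 ** 2 / (1 - theta0)
        + g1 * theta1 * h1 ** 2 / (1 - theta1))

odds_ratio = (theta1 / (1 - theta1)) / (theta0 / (1 - theta0))
gamma = math.sqrt(odds_ratio * g1 / g0)

lhs = v_star * tau2 - (h1 - h0) ** 2        # v*(theta) tau^2 - (h1 - h0)^2
rhs = (gamma * h1 + h0 / gamma) ** 2        # (gamma h1 + gamma^{-1} h0)^2
```

Both sides agree to machine precision and are strictly positive, confirming det(Σ) > 0 for this choice of h.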
Simple calculations finally reveal that ∆Σ∆ equals the positive definite covariance matrix given in the lemma.

Chambaz and van der Laan: Targeting the Optimal Design in RCTs with Binary Outcomes

6 Discussion

We have studied in this article the theoretical properties of a new adaptive group sequential design methodology for randomized clinical trials with binary treatment, binary outcome and no covariate (the experimental unit writes as O = (A, Y) ∈ {0,1}^2, A being the assigned treatment and Y the corresponding outcome). In the companion article (Chambaz and van der Laan, 2010b), the study is carried out by simulations.

Prior to accruing data, the trial protocol must specify Ψ, the parameter of interest. Regarding the estimation of Ψ, the trial protocol must specify the confidence level to be used in constructing the confidence interval. Regarding the testing of Ψ, the trial protocol must specify the null and alternative hypotheses, the wished type I error, the alternative parameter at which the test is to be powered, and the related wished type II error. If the investigator wants to resort to a group sequential testing procedure, then the trial protocol must also specify the number of intermediate tests, the related proportions, and the α- and β-spending strategies (then the maximum committed information, rejection and futility boundaries are fully determined). Finally, the trial protocol must specify the (fixed) targeted design. We decided to focus in this article on the log-relative risk Ψ = log E(Y | A = 1) − log E(Y | A = 0) and on that design g⋆ which minimizes the asymptotic variance of the MLE of Ψ. Other choices can be treated likewise.

The methodology is adaptive in the sense that the estimator of g⋆, which appears to be strongly consistent (see Highlight 1), is alternately used in the process of accruing new data, then updated, and so on.
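This adapt-then-accrue loop is easy to mimic in a toy simulation. The sketch below uses a hypothetical adaptation rule (a clipped empirical success rate, standing in for the actual estimator of g⋆) and tracks a normalized martingale sum of the kind studied in the appendix, which must vanish as data accrue:

```python
import random

def simulate_adaptive_trial(n, theta=(0.3, 0.6), seed=1):
    """Toy adapt-then-accrue loop with binary treatment A and outcome Y.
    The randomization probability g_i of arm 1 is re-estimated from past
    data (a hypothetical clipped empirical success rate, standing in for
    the paper's estimator of g*).  Returns the normalized martingale sum
    M_n for f(O) = A*Y, centered at its conditional mean g_i * theta[1]
    given the past."""
    rng = random.Random(seed)
    succ, tot = [1, 1], [2, 2]   # pseudo-counts per arm
    M = 0.0
    for _ in range(n):
        g_i = min(0.9, max(0.1, succ[1] / tot[1]))  # bounded away from 0 and 1
        A = 1 if rng.random() < g_i else 0
        Y = 1 if rng.random() < theta[A] else 0
        M += A * Y - g_i * theta[1]   # f(O_i, Z_i) - P_{theta, g_i} f
        succ[A] += Y
        tot[A] += 1
    return M / n
```

For large n the returned sum is close to 0, as the strong law for discrete martingales recalled in the appendix guarantees.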
The resulting MLE of Ψ, Ψ_n, is strongly consistent (see Highlight 1). It satisfies a central limit theorem, and performs as well (in terms of asymptotic variance) as its counterpart under iid sampling using g⋆ itself (see Highlight 1). Therefore, one easily constructs confidence intervals which are as narrow as the intervals one would get, had one known g⋆ in advance and used it to sample data independently (see Highlight 1). Those theoretical results are validated with simulations in the companion article (Chambaz and van der Laan, 2010b).

Furthermore, we explain how a group sequential testing procedure can be equally well applied on top of the adaptive sampling methodology (see Highlight 2). An accompanying theoretical result validates the adaptive group sequential testing procedure in the context of contiguous null and alternative hypotheses. It is supported by the simulations undertaken in the companion article (Chambaz and van der Laan, 2010b).

As stated in the abstract, a three-sentence take-home message is “Adaptive designs do learn the targeted optimal design, and inference and testing can be carried out under adaptive sampling as they would under the targeted optimal randomization probability iid sampling. In particular, adaptive designs achieve the same efficiency as the fixed oracle design. This is confirmed by a simulation study presented in the companion article (Chambaz and van der Laan, 2010b), at least for moderate or large sample sizes, across a large collection of targeted randomization probabilities.”

In essence, everything works as predicted by theory. However, theory also warns us that gains cannot be dramatic in the particular setting of clinical trials with binary treatment, binary outcome and no covariate. Nonetheless, this article and its companion article (Chambaz and van der Laan, 2010b) are important: they provide a theoretical template and tools for asymptotic analysis of robust adaptive designs in less constrained settings, which we will consider in future work. This notably includes the setting of clinical trials with covariate, binary treatment, and discrete or continuous outcome, or the setting of clinical trials with covariate, binary treatment, and possibly censored time-to-event, among others. Resorting to targeted maximum likelihood estimation (van der Laan and Rubin, 2006) along with adaptation of the design provides substantial gains in efficiency.

A Appendix

Normalized martingale sums of the form M_n(f) = (1/n) Σ_{i=1}^n [f(O_i, Z_i) − P_{θ,g_i} f] with Z_i = Z_i(O_n(i−1)) play a central role in this study. The Kolmogorov strong law of large numbers (see e.g. (Sen and Singer, 1993, Theorem 2.4.2)) guarantees that M_n(f) converges to 0 almost surely for any uniformly bounded function f. However, in order to get consistency results, we need a uniform convergence result for sup_{f∈F} |M_n(f)| for a certain class F.
This issue is addressed in Section A.1. Similarly, in order to get a central limit theorem, we need the convergence of √n M_n(f) to a Gaussian random variable. We derive this result in Section A.2 from a standard central limit theorem for discrete martingales, see e.g. (Sen and Singer, 1993, Theorem 3.3.7).

A.1 Building block for consistency results.

Let O_n be a sequence of successive observations obtained as described in Section 3. We denote by Z_i = Z_i(O_n(i−1)) ∈ Z ⊂ R^d a summary measure of O_n(i−1) of fixed dimension d (for instance Z_i = θ_{i−1} ∈ R^2, the current MLE of θ at step i). Let F be a class of bounded (and measurable) functions of (o, z) = (a, y, z) such that sup_{f∈F} ‖f‖_∞ = U < ∞ (for instance, f(O_i, Z_i) = D(ϑ)(O_i) − P_{θ,g⋆_i} D(ϑ) for some ϑ ∈ [0,1]^2, where the dependency wrt Z_i = θ_{i−1} is conveyed through g⋆_i). Defining

M_n(f) = (1/n) Σ_{i=1}^n [f(O_i, Z_i) − P_{θ,g_i} f]

for all f ∈ F and n ≥ 1, we note that nM_n(f) is a discrete martingale sum. Our uniform convergence result, Theorem 6, essentially relies on a maximal inequality for martingales (van Handel, 2010, Proposition A.2); see (Chambaz and van der Laan, 2010a) for the details. Denote by N(F, ‖·‖_∞, ε) the cardinality of the smallest finite collection (l_j, u_j)_{j≤N}, where l_j ≤ u_j are (measurable) functions of (o, z) = (a, y, z) satisfying ‖u_j − l_j‖_∞ ≤ ε for all j ≤ N, such that for every f ∈ F, there exists j ≤ N for which l_j ≤ f ≤ u_j.

Theorem 6. Recall that sup_{f∈F} ‖f‖_∞ = U < ∞. If

∫_0^{√(2/3) U} √(log N(F, ‖·‖_∞, x)) dx < ∞,

then for all α > 0 there exists c > 0 such that, for n large enough,

P( sup_{f∈F} M_n(f) ≥ α ) ≤ 2e^{−nc}.

Consequently, sup_{f∈F} |M_n(f)| converges to 0 almost surely.

A.2 Building block for central limit theorems.

First, we obtain a central limit theorem for univariate discrete martingale sums as a by-product of a classical theorem, see e.g.
(Sen and Singer, 1993, Theorem 3.3.7). Second, we rely on it and invoke the Cramér–Wold device, see e.g. (Sen and Singer, 1993, Theorem 3.2.4), in order to extend the result to the case of multivariate discrete martingale sums. We refer to (Chambaz and van der Laan, 2010a) for the details of the proofs.

Univariate case.

We use the same framework and notation as those exposed at the beginning of Section A.1. For a given real-valued f ∈ F, let us introduce

w_n(f)^2 = Σ_{i=1}^n P_{θ,g_i} f^2,   s_n(f)^2 = Σ_{i=1}^n E f(O_i, Z_i)^2,   σ_n(f)^2 = s_n(f)^2 / n.

The following univariate central limit theorem holds:


Theorem 7. Assume that for all i = 1, ..., n,

g_i(A_i | A_n(i−1), X_n) = g_i(A_i | X_i(A_i), O_n(i−1))

(by virtue of the adaptive CAR assumption (8)) only depends on O_n(i−1) through Z_i. If liminf σ_n(f)^2 > 0 and (1/n)w_n(f)^2 − (1/n)Ew_n(f)^2 = (1/n)w_n(f)^2 − σ_n(f)^2 converges in probability to 0, then w_n(f)^2/s_n(f)^2 converges to 1 in probability and √n M_n(f)/σ_n(f) converges in distribution to the standard normal distribution.

Furthermore, the empirical mean σ̂_n(f)^2 = (1/n) Σ_{i=1}^n f(O_i, Z_i)^2 mimics σ_n(f)^2 in the sense that σ̂_n(f)^2 − σ_n(f)^2 converges to 0 in probability. Consequently, √n M_n(f)/σ̂_n(f) also converges in distribution to the standard normal distribution.

This result notably teaches us that we can estimate the asymptotic variance of √n M_n(f) by considering O_1, ..., O_n as independent draws from P_{θ,g_i}, treating each g_i as a given deterministic fixed design in G. Therefore, the parametric or nonparametric bootstrap ignoring the dependence structure of O_n consistently estimates the limiting variance.

Multivariate case.

Let us state the multivariate version of Theorem 7. It involves the following multidimensional counterparts of w_n(f)^2, s_n(f)^2 and σ_n(f)^2 in the case that f ∈ F takes values in R^r (expectations are taken componentwise):

W_n(f) = Σ_{i=1}^n P_{θ,g_i} f f^⊤,   S_n(f) = Σ_{i=1}^n E f(O_i, Z_i) f(O_i, Z_i)^⊤,   Σ_n(f) = S_n(f)/n.

For every positive definite symmetric matrix Σ, let Σ^{−1/2} be the positive definite symmetric matrix such that (Σ^{−1/2})^2 is the inverse of Σ. It holds that:

Theorem 8. Assume that for all i = 1, ..., n,

g_i(A_i | A_n(i−1), X_n) = g_i(A_i | X_i(A_i), O_n(i−1))

(by virtue of the adaptive CAR assumption (8)) only depends on O_n(i−1) through Z_i.
If Σ_n(f) converges to a positive definite covariance matrix Σ(f), and (1/n)W_n(f) − (1/n)EW_n(f) = (1/n)W_n(f) − Σ_n(f) converges componentwise in probability to 0, then √n M_n(f) converges in distribution to the centered Gaussian law over R^r with covariance matrix Σ(f).

Furthermore, the empirical mean Σ̂_n(f) = (1/n) Σ_{i=1}^n f(O_i, Z_i) f(O_i, Z_i)^⊤ is such that Σ̂_n(f) converges to Σ(f) in probability. Consequently, Σ̂_n(f) is invertible with probability tending to 1, and √n Σ̂_n(f)^{−1/2} M_n(f) converges in distribution to the standard normal distribution over R^r.

References

Armitage, P. (1975): Sequential Medical Trials, New York: Wiley.
Banerjee, A. and A. A. Tsiatis (2006): “Adaptive two-stage designs in phase II clinical trials,” Statistics in Medicine, 25, 3382–3395.
Berry, D. A. (2006): “Bayesian clinical trials,” Nature Reviews Drug Discovery, 5.
Berry, D. A. and D. K. Stangl (1996): Bayesian Biostatistics, New York: Marcel Dekker.
Chambaz, A. and M. J. van der Laan (2010a): “Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate,” U.C. Berkeley Division of Biostatistics Working Paper Series, paper 258.
Chambaz, A. and M. J. van der Laan (2011): “Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate: simulation study,” Int. J. Biostat., Volume 7, Issue 1, Article 11.
Chernoff, H. and S. N. Roy (1965): “A Bayes sequential sampling inspection plan,” Annals of Mathematical Statistics, 36, 1387–1407.
Efron, B. (1971): “Forcing a sequential experiment to be balanced,” Biometrika, 58, 403–417.
Eisele, J. R. (1994): “The doubly adaptive biased coin design for sequential clinical trials,” Journal of Statistical Planning and Inference, 38, 249–261.
Emerson, S. S.
(2006): “Issues in the use of adaptive clinical trial designs,” Statistics in Medicine, 25, 3270–3296.
Flehinger, B. J. and T. A. Louis (1971): “Sequential treatment allocation in clinical trials,” Biometrika, 58, 419–426.
Golub, H. L. (2006): “The need for more efficient trial designs,” Statistics in Medicine, 25, 3231–3235.
Hu, F. and W. F. Rosenberger (2003): “Optimality, variability, power: evaluating response-adaptive randomization procedures for treatment comparisons,” Journal of the American Statistical Association, 98, 671–678.
Hu, F. and L.-X. Zhang (2004a): “Asymptotic normality of urn models for clinical trials with delayed response,” Bernoulli, 10, 447–463.
Hu, F. and L.-X. Zhang (2004b): “Asymptotic properties of doubly adaptive biased coin designs for multi-treatment clinical trials,” Annals of Statistics, 32, 268–301.
Hu, F., L.-X. Zhang, and X. He (2009): “Efficient randomized-adaptive designs,” The Annals of Statistics, 37, 2543–2560.
Hu, F. H. and W. F. Rosenberger (2006): The Theory of Response-Adaptive Randomization in Clinical Trials, Wiley.
Hung, H.-M. J. (2006): “Discussion,” Statistics in Medicine, 25, 3313–3314.
Ivanova, A. (2003): “A play-the-winner type urn design with reduced variability,” Metrika, 58, 1–13.


J., D., K. R. Abrams, and J. P. Myles (2004): Bayesian Approaches to Clinical Trials and Health-Care Evaluation, Chichester: Wiley.
Jennison, C. and B. W. Turnbull (2000): Group Sequential Methods with Applications to Clinical Trials, Boca Raton, FL: Chapman & Hall/CRC.
Jennison, C. and B. W. Turnbull (2003): “Mid-course sample size modification in clinical trials based on the observed treatment effect,” Statistics in Medicine, 22, 971–993.
Lokhnygina, Y. and A. A. Tsiatis (2008): “Optimal two-stage group sequential designs,” J. Statist. Plann. Inference, 138, 489–499, URL http://dx.doi.org/10.1016/j.jspi.2007.06.011.
Mehta, C. R. and N. R. Patel (2006): “Adaptive, group sequential and decision theoretic approaches to sample size determination,” Statistics in Medicine, 25, 3250–3269.
Proschan, M. A., G. K. K. Lan, and J. T. Wittes (2006): Statistical Monitoring of Clinical Trials: A Unified Approach, Statistics for Biology and Health, New York: Springer.
Rosenberger, W. F. (2002): “Randomized urn models and sequential design,” Sequential Analysis, 21, 1–41 (with discussion).
Rosenberger, W. F., N. Stallard, A. Ivanova, C. N. Harper, and M. L. Ricks (2001): “Optimal adaptive designs for binary response trials,” Biometrics, 57, 909–913.
Sen, P. K. and J. M. Singer (1993): Large Sample Methods in Statistics: An Introduction with Applications, New York: Chapman & Hall.
Tsiatis, A. A. and C. R. Mehta (2003): “On the inefficiency of the adaptive design for monitoring clinical trials,” Biometrika, 90, 367–368.
Tymofyeyev, Y., W. F. Rosenberger, and F. Hu (2007): “Implementing optimal allocation in sequential binary response experiments,” J. Amer. Statist. Assoc., 102, 224–234.
van der Laan, M. J.
(2008): “The construction and analysis of adaptive group sequential designs,” U.C. Berkeley Division of Biostatistics Working Paper Series, paper 232.
van der Laan, M. J. and D. Rubin (2006): “Targeted maximum likelihood learning,” Int. J. Biostat., 2, Art. 11, 40 pp.
van der Vaart, A. W. (1998): Asymptotic Statistics, Cambridge Series in Statistical and Probabilistic Mathematics, volume 3, Cambridge: Cambridge University Press.
van Handel, R. (2010): “On the minimal penalty for Markov order estimation,” Probab. Theory Relat. Fields, to appear.
Wei, L. J. and S. D. Durham (1978): “The randomized play-the-winner rule in medical trials,” Journal of the American Statistical Association, 73, 840–843.
Zhang, L.-X., W. S. Chan, S. H. Cheung, and F. Hu (2007): “A generalized drop-the-loser urn for clinical trials with delayed responses,” Statist. Sinica, 17, 387–409.
Zhang, L.-X., F. Hu, and S. H. Cheung (2006): “Asymptotic theorems of sequential estimation-adjusted urn models,” The Annals of Applied Probability, 16, 340–369.
Zhu, H. and F. Hu (2010): “Sequential monitoring of response-adaptive randomized clinical trials,” The Annals of Statistics.


Chapter 25
Probability of Success of an In Vitro Fertilization Program

Antoine Chambaz

About 9 to 15% of couples have difficulty in conceiving a child, i.e., do not conceive within 12 months of attempting pregnancy (Boivin et al. 2007). In response to subfertility, assisted reproductive technology has developed over the last 30 years, resulting in in vitro fertilization (IVF) techniques (the first “test-tube baby” was born in 1978). Nowadays, more than 40,000 IVF cycles are performed each year in France and more than 63,000 in the USA (Adamson et al. 2006). Yet, how to quantify the success in assisted reproduction still remains a matter of debate. One could, for instance, rely on the number of pregnancies or deliveries per IVF cycle. However, an IVF program often consists of several successive IVF cycles. So, instead of considering each IVF cycle separately, one could rather rely on an evaluation of the whole program. IVF programs are emotionally and physically burdensome. Providing the patients with the most adequate and accurate measure of success is therefore an important issue that we propose to address in this chapter.

25.1 The DAIFI Study

Our contribution is based on the French Devenir Après Interruption de la FIV (DAIFI) study (Soullier et al. 2008; de la Rochebrochard et al. 2008, 2009). In France, the first four IVF cycles are fully reimbursed by the national health insurance system. Therefore, as in the previous references, we conclude that the most adequate measure of success for French couples is the probability of delivery (resulting from embryo transfer) during the course of a program of at most four IVF cycles.
We will refer to this quantity as the probability of success of a program of at most four IVF cycles, or even sometimes as the probability of success.

Data were provided by two French IVF units (Cochin in Paris and Clermont-Ferrand, a medium-sized city in central France). All women who followed their first IVF cycle in these units between 1998 and 2002 and who were under 42 at the start of the cycle were included. Women over 42 were not included, unless they had a normal ovarian reserve and a specific IVF indication. For every enrolled woman, the data were mainly the attended IVF unit and the woman’s date of birth, and for each IVF cycle, its start date, number of oocytes harvested, number of embryos transferred or frozen, indicators of pregnancy, and successful delivery (for a comprehensive description, see de la Rochebrochard et al. 2009). Data collection was discontinued after the woman’s fourth IVF cycle. Since the first four IVF cycles are fully reimbursed, it is reasonable to assume that economic factors do not play a role in the phenomenon of interest. Specifically, whether a couple will abandon the IVF program mid-course without a successful delivery or undergo the whole program does not depend on economic factors (on the contrary, if the IVF cycles were not fully reimbursed, then disadvantaged couples would likely abandon the program mid-course more easily).
Furthermore, successive IVF cycles occur close together in time: hence, the sole age at the start of the first IVF cycle is a relevant summary of the successive ages at the start of each IVF cycle during the program. Likewise, we make the assumption that the number of embryos transferred or frozen during the first IVF cycle is a relevant summary of the successive numbers of harvested oocytes and transferred or frozen embryos associated with each IVF cycle (i.e., a relevant summary measure of the couple’s fertility during the program). Relaxing this assumption will be considered in future work.

Estimating the probability of success of a program of at most four IVF cycles is not easy due to couples who abandon the IVF program mid-course without a successful delivery. Moreover, since those couples have a smaller probability of having a child than couples that undergo the whole program, it would be wrong to ignore the right censoring and simply count the proportion of successes (Soullier et al. 2008), even if the decision to abandon the program is not informed by any relevant factors. In addition, it seems likely that some of the baseline factors, such as baseline fertility, might be predictive of the dropout time (measured on the discrete scale of number of IVF cycles): in statistical terms, we expect that the right-censoring mechanism will be informative.

Three approaches to estimating the probability of success of a program of at most four IVF cycles are considered in Soullier et al. (2008). The most naive approach estimates the probability of success as the ratio of the number of deliveries following the first IVF cycle to the total number of enrolled women, yielding a point estimate of 37% and a 95% confidence interval given by (0.35, 0.38).
This first approach obviously overlooks a lot of information in the data.

A second approach is a standard nonparametric survival analysis based on the Kaplan–Meier estimator. Specifically, Soullier et al. (2008) compute the Kaplan–Meier estimate S_n of the survival function t ↦ P(T ≥ t), where T denotes the number of IVF cycles attempted after the first one till the first successful delivery. The observed data structure is represented as (min(T, C), I(T ≤ C)), with C the right-censoring time. The estimated probability of success is given by 1 − S_n(3). This method resulted in an estimated probability of success equal to 52% and a 95% confidence interval (0.49, 0.55). This much more sensible approach still neglects the baseline covariates and thus assumes that a woman’s decision to abandon the program is not informed by relevant factors that predict future success, such as those


measured at baseline. One could argue that this method should provide an estimated upper bound on the probability of success.

Actually, formulating the problem of interest in terms of survival analysis is a sensible option, and indeed the methods in Part V for right-censored data can be employed to estimate the survival function of T. In Sect. 25.6, we discuss the equivalence between our approach presented here and the survival analysis approach.

In order to improve the estimate of the probability of success, Soullier et al. (2008) finally resort to the so-called multiple imputation methodology (Schafer 1997; Little and Rubin 2002). Based on iteratively estimating missing data using the past, this third approach leads to a point estimate equal to 46%, with a 95% confidence interval given by (0.44, 0.48).

The three methods that we summarized either answer only partially the question of interest (naive approach and nonparametric Kaplan–Meier analysis) or suffer from bias due to reliance on parametric models (multiple-imputation approach). We expose in the next sections how the TMLE methodology paves the way to solving this delicate problem with great consideration for theoretical validity.

25.2 Data, Model, and Parameter

The observed data structure is longitudinal:

O = (L_0, A_0, L_1, A_1, L_2, A_2, L_3 = Y),

where L_0 = (L_{0,1}, L_{0,2}, L_{0,3}, L_{0,4}) denotes the baseline covariates: L_{0,1} indicates the IVF center, L_{0,2} the age of the woman at the start of the first IVF cycle, L_{0,3} the number of embryos transferred or frozen at the first IVF cycle, and L_{0,4} whether the first IVF cycle is successful, i.e., yields a delivery (L_{0,4} = 1), or not (L_{0,4} = 0).
For each 1 ≤ j ≤ 3, A_{j−1} indicates whether the woman completes her jth IVF cycle (A_{j−1} = 1) or not (A_{j−1} = 0), which also encodes dropout, and L_j indicates whether the jth IVF cycle is successful (L_j = 1) or not (L_j = 0). The longitudinal data structure becomes degenerate after a time point t at which either the woman abandons the program (A_t = 0) or has a successful IVF cycle (L_t = 1 for some t). By encoding convention, the data structure O is constrained as follows. (1) If A_{j−1} = 0 for some 1 ≤ j
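To make the data structure concrete, here is a toy generative sketch (every parameter is invented, and the simple dropout rule is deliberately informative through a baseline fertility summary W). It illustrates the point made in Sect. 25.1: the naive proportion of observed successes understates the success probability of the full four-cycle program, obtained here by forcing A_{0:2} = 1:

```python
import random

def simulate_one(rng, intervene=False):
    """One woman's trajectory O = (L0, A0, L1, A1, L2, A2, L3).
    W is a baseline fertility summary; the per-cycle success
    probability p and the continuation probability g both depend on W,
    so dropout is informative.  All parameters are invented.
    With intervene=True the intervention nodes are forced to A_{0:2} = 1,
    i.e., the woman undergoes all four cycles unless she succeeds."""
    W = 1 if rng.random() < 0.5 else 0
    p = 0.2 + 0.3 * W          # per-cycle success probability
    g = 0.9 if W else 0.5      # per-cycle continuation probability
    success = 1 if rng.random() < p else 0    # L_{0,4}: first cycle
    for _ in range(3):                        # cycles 2 to 4
        if success:
            break              # trajectory degenerates after a success
        if not (intervene or rng.random() < g):
            break              # dropout: A_j = 0, later nodes degenerate
        success = 1 if rng.random() < p else 0
    return success

rng = random.Random(7)
n = 20000
naive = sum(simulate_one(rng) for _ in range(n)) / n
truth = sum(simulate_one(rng, intervene=True) for _ in range(n)) / n
# truth is close to E[1 - (1 - p)^4] = 0.764, well above the naive proportion
```

Because the women who drop out are, on average, the less fertile ones, the naive proportion sits far below the intervened success probability, which is the sort of gap the chapter's estimators are designed to handle correctly.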


Pr(Y_{(1,1,1)} = 1) = E_{P_0}( Σ_{ℓ_{1:2} ∈ {0,1}^2} P_0(Y = 1 | A_{0:2} = 1_{0:2}, L_{1:2} = ℓ_{1:2}, L_0)
    × P_0(L_2 = ℓ_2 | A_{0:1} = 1_{0:1}, L_1 = ℓ_1, L_0)
    × P_0(L_1 = ℓ_1 | A_0 = 1, L_0) ).   (25.3)

This equality is an example of the g-computation formula. It relates the probability distribution of Y_{(1,1,1)} to the probability distribution of the observed data structure O. In addition, it teaches us that ψ_0 = Ψ(P_0) can be interpreted (at the cost of a weak assumption) as the probability of a successful outcome after four IVF cycles (the IVF program being interrupted if the woman gives birth mid-course). Note that when L_{0,4} = 1, the sum has only one nonzero term, whereas it has three nonzero terms when L_{0,4} = 0.

We need to emphasize that the counterfactuals whose existence is guaranteed by these theorems in Yu and van der Laan (2002) and Gill and Robins (2001) are not necessarily interesting, nor do they have an interpretation that is causal in the real world. The structural equations framework that we present hereafter makes the definition of counterfactuals explicit and truly causal, since they correspond with intervening on the system of equations (Chap. 2). Alternatively, as in the Neyman–Rubin counterfactual framework discussed in Chap. 21, one defines the counterfactuals in terms of an experiment, and one assumes the consistency and randomization assumptions with respect to these user-supplied definitions of the counterfactuals.

It is possible, at the cost of untestable (and stronger) assumptions, to provide another interpretation of (25.1). This second interpretation is at the core of Pearl (2009). It is of course compatible with the previous one. Let us assume that the random phenomenon of interest has no unmeasured confounders.
A causal graph is equivalent to the following system of structural equations: there exist ten independent random variables (U_{L_0}^1, ..., U_{L_0}^4, U_{A_0}, U_{L_1}, ..., U_{A_2}, U_{L_3}) and ten deterministic functions (f_{L_0}^1, ..., f_{L_0}^4, f_{A_0}, f_{L_1}, ..., f_{A_2}, f_{L_3}) such that

L_{0,1} = f_{L_0}^1(U_{L_0}^1),
L_{0,2} = f_{L_0}^2(L_{0,1}, U_{L_0}^2),
L_{0,3} = f_{L_0}^3(L_{0,2}, L_{0,1}, U_{L_0}^3),
L_{0,4} = f_{L_0}^4(L_{0,3}, L_{0,2}, L_{0,1}, U_{L_0}^4),

and for every 0 ≤ j ≤ 2,

A_j = f_{A_j}(L_{0:j}, A_{0:j−1}, U_{A_j}),
L_{j+1} = f_{L_{j+1}}(A_{0:j}, L_{0:j}, U_{L_{j+1}}).   (25.4)

One can intervene upon this system by setting the intervention nodes A_{0:2} equal to some values a_{0:2} ∈ A. Formally, this simply amounts to substituting the equality A_j = a_j for A_j = f_{A_j}(L_{0:j}, A_{0:j−1}, U_{A_j}) for all 0 ≤ j ≤ 2 in (25.4). This yields a new causal graph, the so-called graph under intervention A_{0:2} = a_{0:2}. The intervened, new, causal graph or system of structural equations describes how Y = L_3 is randomly generated under this intervention. Under the intervention A_{0:2} = a_{0:2}, this last (chronologically speaking) random variable is denoted by Y_{a_{0:2}}, naturally using the same notation. Moreover, it is known (see, for instance, Robins 1986, 1987a; Pearl 2009) that the g-computation formula (25.3) also holds in this nonparametric structural equation model framework, relating the probability distribution of Y_{(1,1,1)} to the probability distribution of the observed data structure O.

Finally, even if one is not willing to rely on the causal assumptions in the SCM, and one is also not satisfied with the definition of an effect in terms of explicitly constructed counterfactuals, there is still a way forward.
Assuming that the time ordering of observed variables L_{0:3} and A_{0:2} is correct (which it indeed is), the target parameter still represents an effect of interest, aiming to get as close to a causal effect as the data allow. In any case, ψ_0 = Ψ(P_0) is a well-defined effect of an intervention on the distribution of the data, which can be interpreted as a variable importance measure (Chaps. 4, 22, and 23).

25.3 The TMLE

It can be shown that Ψ is a pathwise differentiable parameter (Appendix A). Therefore the theory of semiparametric models applies, providing a notion of asymptotically efficient estimation and, in particular, its key ingredient, the efficient influence curve. The TMLE procedure takes advantage of the pathwise differentiability and related properties in order to build an asymptotically efficient substitution estimator of ψ_0 = Ψ(P_0).

Let L_0^2(P) denote the set of measurable functions s mapping the set O (where the observed data structure takes its values) to R, such that Ps = 0 and Ps^2 < ∞.


Let us introduce the shorthand notation Q(L_0; P) = P(L_0), Q(L_1 | A_0, L_0; P) = P(L_1 | A_0, L_0), Q(L_2 | A_{0:1}, L_{0:1}; P) = P(L_2 | A_{0:1}, L_{0:1}), Q(Y | A_{0:2}, L_{0:2}; P) = P(Y | A_{0:2}, L_{0:2}), g(A_0 | X; P) = P(A_0 | L_0), g(A_{0:1} | X; P) = g(A_0 | X; P) × P(A_1 | L_{0:1}, A_0), and g(A_{0:2} | X; P) = g(A_{0:1} | X; P) × P(A_2 | L_{0:2}, A_{0:1}). The likelihood P(O) can be represented as

P(O) = ∏_{j=0}^3 Q(L_j | A_{0:j−1}, L_{0:j−1}; P) × ∏_{j=0}^2 g(A_j = 1 | A_{0:j−1}, L_{0:j}; P)^{A_j} (1 − g(A_j = 1 | A_{0:j−1}, L_{0:j}; P))^{1−A_j},

and thus factorizes as P = Qg. The parameter of interest at P = Qg can be straightforwardly expressed as a function of Q:

Ψ(P) = E_P( Σ_{ℓ_{1:2} ∈ {0,1}^2} Q(Y = 1 | A_{0:2} = 1_{0:2}, L_{1:2} = ℓ_{1:2}, L_0; P)
    × Q(L_2 = ℓ_2 | A_{0:1} = 1_{0:1}, L_1 = ℓ_1, L_0; P)
    × Q(L_1 = ℓ_1 | A_0 = 1, L_0; P) ).

Note that the outer expectation is with respect to the probability distribution Q(L_0; P) of the baseline covariates L_0.

The following proposition states that Ψ is pathwise differentiable and presents its efficient influence curve at P ∈ M.

Proposition 25.1. The functional Ψ is pathwise differentiable at every P ∈ M.
The efficient influence curve D∗(·|P) at P ∈ M is written

D∗(·|P) = Σ_{j=0}^3 D∗_j(·|P),

where

D∗_0(O|P) = E_P(Y_{(1,1,1)} | L_0) − Ψ(P)
  = P(L_1 = 1 | A_0 = 1, L_0)
  + P(L_1 = 0 | A_0 = 1, L_0) × P(L_2 = 1 | A_{0:1} = 1_{0:1}, L_1 = 0, L_0)
  + P(L_1 = 0 | A_0 = 1, L_0) × P(L_2 = 0 | A_{0:1} = 1_{0:1}, L_1 = 0, L_0)
      × P(Y = 1 | A_{0:2} = 1_{0:2}, L_{1:2} = 0_{1:2}, L_0)
  − Ψ(P),

D∗_1(O|P) = [I(A_0 = 1)/g(A_0 = 1 | X; P)] × (L_1 − P(L_1 = 1 | A_0, L_0))
      × {E_{Q_0}(Y_{(1,1,1)} | L_0, A_0 = 1, L_1 = 1) − E_{Q_0}(Y_{(1,1,1)} | L_0, A_0 = 1, L_1 = 0)}
  = [I(A_0 = 1)/g(A_0 = 1 | X; P)] × (1 − P(L_2 = 1 | A_{0:1} = 1_{0:1}, L_1 = 0, L_0)
      × P(Y = 1 | A_{0:2} = 1_{0:2}, L_{1:2} = (0, 1), L_0)
      − P(L_2 = 0 | A_{0:1} = 1_{0:1}, L_1 = 0, L_0)
      × P(Y = 1 | A_{0:2} = 1_{0:2}, L_{1:2} = 0_{1:2}, L_0)) × (L_1 − P(L_1 = 1 | A_0, L_0)),

D∗_2(O|P) = [I(A_{0:1} = 1_{0:1})/g(A_{0:1} = 1_{0:1} | X; P)] × (L_2 − P(L_2 = 1 | A_{0:1}, L_{0:1}))
      × {E_{Q_0}(Y_{(1,1,1)} | L_{0:1}, A_{0:1} = 1, L_2 = 1) − E_{Q_0}(Y_{(1,1,1)} | L_{0:1}, A_{0:1} = 1, L_2 = 0)}
  = [I(A_{0:1} = 1_{0:1})/g(A_{0:1} = 1_{0:1} | X; P)] × (P(Y = 1 | A_{0:2} = 1_{0:2}, L_2 = 1, L_{0:1})
      − P(Y = 1 | A_{0:2} = 1_{0:2}, L_2 = 0, L_{0:1})) × (L_2 − P(L_2 = 1 | A_{0:1}, L_{0:1})),

D∗_3(O|P) = [I(A_{0:2} = 1_{0:2})/g(A_{0:2} = 1_{0:2} | X; P)] × (Y − P(Y = 1 | A_{0:2}, L_{0:2})),

and the latter equalities involve convention (25.2). Furthermore, the efficient influence curve D∗(·|P) is double robust: if P_0 = Q_0 g_0 and P = Qg, then

E_{P_0} D∗(O|P) = 0 implies Ψ(P) = Ψ(P_0)

if either Q = Q_0 or g = g_0.

The theory of semiparametric models teaches us that the asymptotic variance of any regular estimator of ψ_0 is lower-bounded by the variance of the efficient influence curve, E_{P_0} D∗(O|P_0)^2. A regular estimator of ψ_0 having as limit distribution the mean-zero Gaussian distribution with variance E_{P_0} D∗(O|P_0)^2 is therefore said to be asymptotically efficient.

25.3.1 TMLE Procedure

We assume that we observe n independent copies O^{(1)}, ..., O^{(n)} of the observed data structure O.
The TMLE procedure takes advantage of the pathwise differentiability of the parameter of interest and bends an initial estimator, obtained as a substitution estimator Ψ(P_n^0), into an updated substitution estimator Ψ(P_n^*) (with P_n^* an update of P_n^0), which enjoys better properties.

Initial estimate. We start by constructing an initial estimate P_n^0 of the distribution P_0 of O, which could also be used to construct an initial estimate ψ_n^0 = Ψ(P_n^0). The initial estimator of the probability distribution of the baseline covariates is defined as the empirical probability distribution of L_0^{(i)}, i = 1,...,n. The initial estimate of the other factors of P_n^0 is obtained by super learning, using the log-likelihood loss


function for each of the binary conditional distributions in Q_0.

Updating the initial estimate. The optimal theoretical properties enjoyed by a super learner P_n^0 as an estimator of P_0 do not necessarily translate into optimal properties of Ψ(P_n^0) as an estimator of the parameter of interest ψ_0 = Ψ(P_0). In particular, writing P_n^0 = Q_n^0 g_n^0, due to the curse of dimensionality, Ψ(Q_n^0) may still be overly biased because the bias-variance tradeoff was optimized for the infinite-dimensional parameter Q_0 instead of for Ψ(Q_0) itself.

The second step of the TMLE procedure stretches the initial estimate P_n^0 in the direction of the targeted parameter of interest, through a maximum likelihood step. If the initial estimate Ψ(P_n^0) is biased, then this step removes all asymptotic bias for the target parameter whenever the g-factor of P_0, g_0, is estimated consistently: in fact, it maps an inconsistent Ψ(P_n^0) into a consistent TMLE of ψ_0. Hence, the resulting updated estimator is said to be double robust: it is consistent if the initial first-stage estimator of the Q-factor of P_0, Q_0, is consistent or if the g-factor of P_0, g_0, is consistently estimated.

Let us now describe the specific TMLE. We first fluctuate P_n^0 with respect to the conditional distribution of Y given its past (A_{0:2}, L_{0:2}), i.e., construct a fluctuation model {P_n^0(ε) : |ε|


(1) For each 1 ≤ j ≤ 3, D*_j(·|P_n^{3−j}) = D*_j(·|P_n^*).

(2) Q(L_0; P_n^0) = (1/n) ∑_{i=1}^{n} I(L_0^{(i)} = L_0) (expressed in words, the marginal distribution of L_0 is estimated with its empirical distribution), and P_n D*(·|P_n^*) = 0.

(3) The TMLE of ψ_0, ψ_n^* = Ψ(P_n^*), satisfies

  ψ_n^* = (1/n) ∑_{i=1}^{n} ∑_{ℓ_{1:2} ∈ {0,1}²} Q(Y = 1 | A_{0:2} = 1_{0:2}, L_{1:2} = ℓ_{1:2}, L_0^{(i)}; P_n^*)
          × Q(L_2 = ℓ_2 | A_{0:1} = 1_{0:1}, L_1 = ℓ_1, L_0^{(i)}; P_n^*) × Q(L_1 = ℓ_1 | A_0 = 1, L_0^{(i)}; P_n^*).

Item (1) in Proposition 25.2 is an example of the so-called monotonicity property of the clever covariates, which states that the clever covariate of the jth factor in Q_0 only depends on the future (later) factors of Q_0. This monotonicity property implies that the TMLE procedure presented above converges in one single step, "convergence" referring to the iterative nature of the general TMLE procedure. A typical iterative TMLE procedure (Chap. 24 and Appendix A) would use the same logistic regression fluctuation models as presented above, but it would enforce a common ε across the different factors of Q_0, and thus would update all factors simultaneously at each maximum likelihood update step. This iterative TMLE converges very fast: in similar examples, experience shows that convergence is often achieved in two or three steps, and that most of the reduction occurs during the first step. Item (2) is of fundamental importance since it allows us to study the properties of Ψ(P_n^*) from the point of view of the general theory of estimating equations.
Item (3) just states that ψ_n^* is a plug-in estimator Ψ(P_n^*) and provides a simple formula for evaluating ψ_n^*.

25.3.2 Merits of TMLE Procedure

Since the efficient influence curve D*(·|P) is double robust and since P_n^* solves the efficient influence curve estimating equation, the general theory of estimating equations teaches us that the TMLE ψ_n^* enjoys remarkable asymptotic properties under certain assumptions. Stating the latter assumptions is outside the scope of this chapter. One often refers to such conditions as regularity conditions. Let P_n^* = Q_n^* g_n^*. The regularity conditions typically include the requirements that the sequence (ψ_n^* : n = 1,...) must belong to a compact set; that both Q_n^* and g_n^* must converge to some Q_1 and g_1 with at least one of these limits representing the truth; that the estimated efficient influence curve D*(·|P_n^*) must belong to a P_0-Donsker class with P_0-probability tending to one; and that a second-order term that involves a product of Q_n^* − Q_1 and g_n^* − g_1 is o_P(1/√n). See Appendix A for more details.

The following classical result holds:

Proposition 25.3. Under regularity conditions,

(1) The TMLE ψ_n^* consistently estimates ψ_0 as soon as either Q_n^* or g_n^* consistently estimates Q_0 or g_0.

(2) If the TMLE consistently estimates ψ_0, then it is asymptotically linear: there exists D such that

  √n(ψ_n^* − ψ_0) = (1/√n) ∑_{i=1}^{n} D(O^{(i)}) + o_P(1). (25.5)

Equation (25.5) straightforwardly yields that √n(ψ_n^* − ψ_0) is asymptotically Gaussian with mean zero and variance consistently estimated by

  (1/n) ∑_{i=1}^{n} D_n(O^{(i)})²,

where D_n is a consistent estimator of the influence curve D.

If Q_n^* and g_n^* consistently estimate Q_0 and g_0 (hence P_n^* consistently estimates P_0), then D = D*(·|P_0), so that the asymptotic variance is consistently estimated by

  (1/n) ∑_{i=1}^{n} D*(O^{(i)} | P_n^*)²
(25.6)

In this case, the TMLE is asymptotically efficient: its asymptotic variance is as small as possible (in the family of regular estimators).

Furthermore, if g_n^* is a maximum-likelihood-based consistent estimator of g_0, then (25.6) is a conservative estimator of the asymptotic variance of √n(ψ_n^* − ψ_0) (it converges to an upper bound on the latter asymptotic variance).

Proposition 25.3 is the cornerstone of the TMLE methodology. It allows us to build confidence intervals. We assess how well such confidence intervals perform from a practical point of view through a simulation study in Sect. 25.4. For the estimation of the probability of success carried out in Sect. 25.5, we resort to the bootstrap to compute a confidence interval.

We emphasized how the TMLE benefits from advances in the theory of estimating equations. Yet, it enjoys some remarkable advantages over estimating equation methods. Let us briefly mention the most striking in the context of this chapter:

(1) The TMLE is a substitution estimator. Thus, it automatically satisfies any constraint on the parameter of interest (here, that the parameter of interest is a proportion and must therefore belong to the unit interval), and it respects the knowledge that the parameter of interest is a particular function of the data-generating distribution. On the contrary, solutions of an estimating equation may fail to satisfy such constraints.

(2) The TMLE methodology cares about the likelihood. The log-likelihood of the updated estimate of P_0, (1/n) ∑_{i=1}^{n} log P_n^*(O_i), is available, thereby allowing for the C-TMLE extension (Part VII).
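In practice, the confidence intervals built from Proposition 25.3 take a standard Wald form. A minimal sketch follows (the influence curve values would be D*(·|P_n^*) evaluated at the observations; here they are toy numbers):

```python
import math

def wald_ci(psi_hat, ic_values, z=1.96):
    """95% Wald-type confidence interval for psi_0 based on estimated
    influence curve values D_n(O_i): the asymptotic variance of
    sqrt(n)(psi_n - psi_0) is estimated by the empirical mean of
    D_n(O_i)^2, as in (25.6)."""
    n = len(ic_values)
    var = sum(d * d for d in ic_values) / n
    se = math.sqrt(var / n)
    return psi_hat - z * se, psi_hat + z * se

# Toy example with influence curve values of constant magnitude 2:
# the estimated asymptotic variance is 4, so se = sqrt(4/4) = 1.
lo, hi = wald_ci(0.5, [2.0, -2.0, 2.0, -2.0])
```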


25.3.3 Implementing TMLE

The TMLE procedure is implemented following the specification in Sect. 25.3.1. Only the details of the super learning procedure are missing. We chose to rely on a least squares loss function, and on a collection of algorithms containing seven estimation procedures: generalized linear models, elastic net (α = 1), elastic net (α = 0.5), generalized additive models (degree = 2), generalized additive models (degree = 3), DSA, and random forest (ntree = 1000).

25.4 Simulations

The simulation scheme attempts to mimic the data-generating distribution of the DAIFI data set. We start with L_0 drawn from its empirical distribution based on the DAIFI data set and, for j = 0,...,2, successively, A_j ∼ Ber(q_j(L_{0,1}, L_{0,2}, L_{0,3})) and L_{j+1} ∼ Ber(p_{j+1}(L_{0,1}, L_{0,2}, L_{0,3})), where for each j = 0, 1, 2, p_{j+1}(L_{0,1}, L_{0,2}, L_{0,3}) = expit(α_{1,L_{0,1}} + α_{2,L_{0,1}} log L_{0,2} + α_{3,L_{0,1}} log(5 + min(L_{0,3}, 5)^5)), and for each j = 0, 1, 2, q_j(L_{0,1}, L_{0,2}, L_{0,3}) = expit(β_{1,L_{0,1}} + β_{2,L_{0,1}} L_{0,3}). The values of the α- and β-parameters are reported in Table 25.1.

Regarding the empirical distribution of L_0, the IVF unit random variable L_{0,1} follows a Bernoulli distribution with parameter approximately equal to 0.517. Both conditional distributions of age L_{0,2} given the IVF unit are Gaussian-like, with means and standard deviations roughly equal to 33 and 4.4. The marginal distribution of the random number L_{0,3} of embryos transferred or frozen has mean and variance approximately equal to 3.3 and 7.5, with values ranging between 0 and 23 (only 20% of the observed L_{0,3}^{(i)} are larger than 5). We refer to Table 25.3 in Sect.
25.5 for a comparison of the empirical probabilities that A_j = 1 and L_j = 1 computed under the empirical distribution of a simulated data set with 10,000 observations and the empirical distribution of the DAIFI data set.

Table 25.1 Values of the α- and β-parameters used in the simulation scheme

                   IVF cycle j
Parameters     0           1                2                3
100×α_{·,0}    −           (61, −55, 4.5)   (13, −45, 1.2)   (60, −40, 1.5)
100×α_{·,1}    −           (65, −70, 3.3)   (19, −49, 1.7)   (80, −50, 1)
100×β_{·,0}    (40, 10)    (−45, 32)        (−30, 5)         −
100×β_{·,1}    (40, 9)     (−50, 34)        (−40, 6)         −

The super learning library is correctly specified for the estimation of the censoring mechanism, and misspecified for the estimation of the Q-factor. Indeed, L_{0,2} plays a role in p_{j+1}(L_{0,1}, L_{0,2}, L_{0,3}) through its logarithm, and L_{0,3} through log(5 + min(L_{0,3}, 5)^5). We chose this expression because x ↦ log(5 + min(x, 5)^5) cannot be well approximated by a polynomial in x over [0, 23]. Furthermore, the true value of the parameter of interest for this simulation scheme can be estimated with great precision by Monte Carlo. Using a simulated data set of one million observations under the intervention (1,1,1) yields ψ_0^{simul} = 0.652187, with a 95% confidence interval equal to (0.6512535, 0.6531205).

We repeat B = 1000 times the following steps: (1) simulate a data set with sample size n = 3001 according to the simulation scheme presented above and (2) estimate ψ_0^{simul} with Ψ(P*_{n,b}) = ψ*_{n,b}, the bth TMLE based on this bth data set. In order to shed some light on the properties of the TMLE procedure, we also keep track of the initial maximum-likelihood-based substitution estimator Ψ(P^0_{n,b}), based on the bth data set.

Fig. 25.1 MLE and TMLE empirical densities

Table 25.2 Simulation results

Estimator   Bias       MSE       Empirical cover (p-value)
MLE         −0.01571   0.00065   −
TMLE        0.00052    0.00027   0.958 (89%)

We summarize the results of the simulation study in Table 25.2.
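The simulation scheme just described can be sketched in a few lines of Python. The α- and β-values below are those of Table 25.1 for IVF unit L_{0,1} = 0 (divided by 100); the baseline draws for age and embryo count are rough stand-ins for the DAIFI empirical distribution, so this illustrates the mechanism rather than reproducing the study, and, as in the description above, every node is drawn for each unit:

```python
import math
import random

def expit(x):
    return 1 / (1 + math.exp(-x))

# alpha- and beta-parameters for IVF unit L_{0,1} = 0, cycles j = 0..3
# (values of Table 25.1, divided by 100).
ALPHA0 = [None, (0.61, -0.55, 0.045), (0.13, -0.45, 0.012), (0.60, -0.40, 0.015)]
BETA0 = [(0.40, 0.10), (-0.45, 0.32), (-0.30, 0.05), None]

def simulate_unit(rng):
    """Draw one longitudinal observation O = (L0, A0, L1, A1, L2, A2, Y)
    following the scheme of Sect. 25.4 for IVF unit 0. The baseline
    draws (age L_{0,2}, embryo count L_{0,3}) are toy stand-ins for the
    DAIFI empirical distribution."""
    l02 = rng.gauss(33, 4.4)                            # age
    l03 = max(0, int(rng.gauss(3.3, math.sqrt(7.5))))   # embryos
    nodes = []
    for j in range(3):
        b1, b2 = BETA0[j]
        a_j = int(rng.random() < expit(b1 + b2 * l03))  # A_j ~ Ber(q_j)
        a1, a2, a3 = ALPHA0[j + 1]
        p = expit(a1 + a2 * math.log(max(l02, 1.0))
                  + a3 * math.log(5 + min(l03, 5) ** 5))
        l_next = int(rng.random() < p)                  # L_{j+1} ~ Ber(p_{j+1})
        nodes.append((a_j, l_next))
    return (0, l02, l03), nodes

rng = random.Random(1)
sample = [simulate_unit(rng) for _ in range(1000)]
```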
The results are illuminating: the MLE is biased, whereas the TMLE is unbiased. In the process of stretching the initial MLE into the updated TMLE in the direction of the parameter of interest, the update step not only corrects the bias but also diminishes the variance. These key features are well illustrated in Fig. 25.1, where it is also seen that the TMLE is approximately normally distributed.

Let us now investigate the validity of the coverage guaranteed by the 95% confidence intervals based on the central limit theorem satisfied by ψ_n^*, using (25.6) as


an estimate of the asymptotic variance. Since the super learner g_n^* is a consistent estimator of g_0, the latter estimate of the asymptotic variance of the TMLE is sensible, and may be slightly conservative due to the misspecification of Q_n^*. Among the B = 1000 95% confidence intervals, 958 contain the true value ψ_0^{simul}. This strongly supports the conclusion that the confidence intervals do meet their requirement. The probability for a binomial random variable with parameter (B, 95%) to be less than 958 equals 89%.

25.5 Data Application

We observe n = 3001 experimental units. We report in Table 25.3 the empirical probabilities of A_j = 1 (each j = 0, 1, 2) and L_j = 1 (j = 0, 1, 2, 3) for all IVF cycles. It is obvious from these numbers that the censoring mechanism plays a great role in the data-generating experiment. We applied the TMLE methodology and obtained a point estimate of ψ_0 equal to ψ_n^* = 0.505. The corresponding 95% confidence interval based on the central limit theorem, using (25.6) as an estimate of the asymptotic variance of √n(ψ_n^* − ψ_0), is equal to (0.480, 0.530).

However, we have no certainty of the convergence of g_n^* to g_0 (which would guarantee that the confidence interval is conservative). Therefore we also carried out a bootstrap study. Specifically, we iterated B = 1000 times the following procedure. First, draw a data set of n = 3001 observations from the empirical measure P_n; second, compute and store the TMLE ψ*_{n,b} obtained on this data set. This results in the following 95% confidence interval (using the original ψ_n^* as center of the interval): (0.470, 0.540), which is wider than the previous one.

As a side remark, the MLE Ψ(P_n^0) updated during the second step of the TMLE procedure (applied to the original data set) is equal to 0.490.
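The bootstrap procedure just described is straightforward to implement. The sketch below illustrates only the resampling logic, with percentile intervals on a toy estimator (the sample mean) instead of recomputing a full TMLE on each resample:

```python
import random

def bootstrap_ci(data, estimator, b=1000, level=0.95, seed=0):
    """Nonparametric bootstrap confidence interval: resample n
    observations with replacement from the empirical measure P_n,
    recompute the estimator on each resample, and read off the
    percentile interval. (The chapter centers its interval at the
    original estimate; plain percentiles are used here for brevity.)"""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(b)
    )
    lo_idx = int((1 - level) / 2 * b)
    hi_idx = int((1 + level) / 2 * b) - 1
    return stats[lo_idx], stats[hi_idx]

# Toy usage: binary outcomes with empirical success proportion 0.6.
data = [1] * 60 + [0] * 40
mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(data, mean, b=200)
```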
We also note that the TMLE ψ_n^* falls between the estimates obtained in Soullier et al. (2008) by multiple imputation and the Kaplan-Meier method. The probability of success of a program of at most four IVF cycles may be slightly larger than previously thought. In conclusion, future participants in a program of at most four IVF cycles can be informed that approximately half of them may subsequently succeed in having a child.

Table 25.3 Empirical probabilities that A_j = 1 and L_j = 1 based on a simulated data set of 10,000 observations and the DAIFI data set

             Simulated data set          DAIFI data set
             Empirical probability of    Empirical probability of
IVF cycle j  A_j = 1     L_j = 1         A_j = 1     L_j = 1
0            73%         21%             75%         22%
1            57%         32%             59%         32%
2            46%         37%             49%         35%
3            −           40%             −           37%

25.6 Discussion

We studied the performance of IVF programs and provided the community with an accurate estimator of the probability of delivery during the course of a program of at most four IVF cycles in France (abbreviated to probability of success). We first expressed the parameter of interest as a functional Ψ(P_0) of the data-generating distribution P_0 of the observed longitudinal data structure O = (L_0, A_0, L_1, A_1, L_2, A_2, Y). Subsequently, we applied the TMLE. Under regularity conditions, the estimator is consistent as soon as at least one of two fundamental components of P_0 is consistently estimated; moreover, the central limit theorem allowed us to construct a confidence interval. These theoretical properties are illustrated by a simulation study. We obtained a point estimate approximately equal to 50%, with a 95% confidence interval given by (48%, 53%). Earlier results obtained by Soullier et al.
(2008) based on the multiple-imputation methodology were slightly more pessimistic, with an estimated probability of success equal to 46% and (44%, 48%) as 95% confidence interval.

These authors also considered another approach, which phrases the problem of interest as the estimation of a survival function based on right-censored data. The key to this second approach is that the probability of success coincides with the probability P(T ≤ 3), where T is the number of IVF cycles attempted after the first one until the first successful delivery. Our observed longitudinal data structure O is equivalent to the right-censored data structure O′ = (W, min(T, C), I(T ≤ C)), where W = L_0 and C = min{0 ≤ j ≤ 3 : A_j = 0}, with the additional convention A_3 = 0. Neglecting the baseline covariates W and assuming that the dropout time C is independent of T, Soullier et al. (2008) estimated the probability of success by 1 − S_n(3), S_n being the Kaplan-Meier estimate of the survival function of T. Although the TMLE methodology to address the estimation of P(T ≤ 3) (incorporating the baseline covariates) is well understood (Chaps. 17 and 18), we chose to adopt the point of view of a longitudinal data structure rather than that of a right-censored data structure. From the survival analysis point of view, our contribution is to incorporate the baseline covariates in order to improve efficiency and to allow for informative censoring.
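For concreteness, the Kaplan-Meier estimate S_n invoked in that second approach takes the standard product-limit form; the sketch below is a generic textbook implementation on hypothetical toy data, not code from the study:

```python
def kaplan_meier_survival(times, events, t):
    """Kaplan-Meier estimate S_n(t) of P(T > t) from right-censored
    observations (min(T_i, C_i), I(T_i <= C_i)): the product over
    distinct event times u <= t of (1 - d_u / n_u), where d_u is the
    number of events at u and n_u the number still at risk at u."""
    s = 1.0
    for u in sorted(set(x for x, e in zip(times, events) if e and x <= t)):
        at_risk = sum(1 for x in times if x >= u)
        deaths = sum(1 for x, e in zip(times, events) if e and x == u)
        s *= 1 - deaths / at_risk
    return s

# Sanity check: with no censoring, 1 - S_n(3) reduces to the empirical
# proportion of subjects with T <= 3 (here 3 out of 5).
times = [1, 2, 3, 4, 4]
events = [1, 1, 1, 1, 1]
p_success = 1 - kaplan_meier_survival(times, events, 3)
```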
We finally emphasize that an extension of the TMLE procedure presented in this chapter will make it possible, in future work, to take into account the successive numbers of embryos transferred or frozen at each IVF cycle (instead of the sole number at the first IVF cycle), thereby acknowledging the possibility that this time-dependent covariate may yield time-dependent confounding.

Acknowledgements

The author would like to thank E. de la Rochebrochard, J. Bouyer (Ined; Inserm, CESP; Univ Paris-Sud, UMRS 1018), and S. Enjalric-Ancelet (AgroParisTech, Unité Mét@risk) for introducing him to this problem, as well as the Cochin and Clermont-Ferrand IVF units for sharing the DAIFI data set.


Chapter 29
TMLE in Adaptive Group Sequential Covariate-Adjusted RCTs

Antoine Chambaz, Mark J. van der Laan

This chapter is devoted to group sequential covariate-adjusted RCTs analyzed through the prism of TMLE. By adaptive covariate-adjusted design we mean an RCT group sequential design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. Moreover, the patient's baseline covariates may be taken into account for the random treatment assignment. This definition is slightly adapted from Golub (2006). In particular, we assume, following the definition of prespecified sampling plans given in Emerson (2006), that, prior to collection of the data, the trial protocol specifies the parameter of scientific interest, the inferential method, and the confidence level to be used when constructing a confidence interval for the latter parameter.

Furthermore, we assume that the investigator specifies beforehand in the trial protocol a criterion of special interest that yields a notion of optimal randomization scheme, which we therefore wish to target. For instance, the criterion could translate the necessity to minimize the number of patients assigned to their corresponding inferior treatment arm, subject to level and power constraints. Or the criterion could translate the necessity that a result be available as quickly as possible, subject to level and power constraints. The sole restriction on the criterion is that it must yield an optimal randomization scheme that can be approximated from the data accrued so far. The two examples above comply with this restriction.

We choose to consider specifically the second criterion cited above.
Consequently, the optimal randomization scheme is the so-called Neyman allocation, which minimizes the asymptotic variance of the TMLE of the parameter of interest. We emphasize that there is nothing special about targeting the Neyman allocation, the whole methodology applying equally well to a large class of optimal randomization schemes derived from a variety of valid criteria.

By adaptive group sequential design we refer to the possibility of adjusting the randomization scheme only by blocks of c patients, where c ≥ 1 is a prespecified integer (the case where c = 1 corresponds to a fully sequential adaptive design). The expression also refers to the fact that group sequential testing methods can be equally well applied on top of adaptive designs, an extension that we do not consider here. Although all our results (and their proofs) still hold for any c ≥ 1, we consider the case c = 1 in the theoretical part of the chapter for simplicity's sake; the case where c > 1 is considered in the simulation study.

The literature on adaptive designs is vast, and our review is not comprehensive. The expression "adaptive design" has also been used in the literature for sequential testing and, in general, for designs that allow data-adaptive stopping times for the whole study (or for certain treatment arms) which achieve the desired type I and type II error requirements when testing a null hypothesis against its alternative. Data-adaptive randomization schemes go back to the 1930s, and we refer the interested reader to Hu and Rosenberger (2006, Sect. 1.2), Jennison and Turnbull (2000, Sect.
17.4), and Rosenberger (1996) for a comprehensive historical perspective. Many articles have been devoted to the study of "response adaptive designs," an expression implicitly suggesting that those designs only depend on past responses of previous patients and not on the corresponding covariates. We refer readers to Hu and Rosenberger (2006) and Chambaz and van der Laan (2010) for a bibliography on that topic. On the contrary, covariate-adjusted response-adaptive (CARA) randomizations tackle the so-called issue of heterogeneity (i.e., the use of covariates in adaptive designs) by dynamically calculating the allocation probabilities on the basis of previous responses and current and past values of certain covariates. In this view, this chapter studies a new type of CARA procedure. The interest in CARA procedures is more recent, and there is a steadily growing number of articles dedicated to their study, starting with Rosenberger et al. (2001) and Bandyopadhyay and Biswas (2001), then Atkinson and Biswas (2005), Zhang et al. (2007), Zhang and Hu (2009), and Shao et al. (2010), among others. The latter articles are typically concerned with the convergence (almost sure and in law) of the allocation probabilities vector and of the estimator of the parameter in a correctly specified parametric model. The article by Shao et al. (2010) is devoted to the testing issue.

By contrast, the consistency and asymptotic normality results that we obtain here are robust to model misspecification. Thus, they contribute significantly to solving the question raised by the Food and Drug Administration (2006): "When is it valid to modify randomization based on results, for example, in a combined phase 2/3 cancer trial?"
Finally, this chapter mainly relies on Chambaz and van der Laan (2010) and van der Laan (2008b), the latter technical report paving the way to robust and more efficient estimation based on adaptive RCTs in a variety of other settings (including the case where the outcome Y is a possibly censored time-to-event).

29.1 Statistical Framework

This chapter is devoted to the asymptotic study of adaptive group sequential designs in the case of RCTs with covariate, binary treatment, and a one-dimensional primary outcome of interest. Thus, the experimental unit is O = (W, A, Y), where W ∈ W consists of some baseline covariates, A ∈ A = {0, 1} denotes the assigned binary


treatment, and Y ∈ Y is the primary outcome of interest. For example, Y can indicate whether the treatment has been successful or not (Y = {0, 1}), or Y can count the number of times an event of interest has occurred under the assigned treatment during a period of follow-up (Y = ℕ), or Y can measure a quantity of interest after a given time has elapsed (Y = ℝ). Although we will focus on the last case in this chapter, the methodology applies equally well to each example cited above.

Let us denote by P_0 the true distribution of the observed data structure O in the population of interest. We see P_0 as a specific element of the nonparametric set M of all possible observed data distributions. Note that, in order to avoid some technicalities, we assume (or rather impose) that all elements of M are dominated by a common measure. The parameter of scientific interest is the marginal effect of treatment a = 1 relative to treatment a = 0 on the additive scale, or risk difference: ψ_0 = E_{P_0}[E_{P_0}(Y | A = 1, W) − E_{P_0}(Y | A = 0, W)]. Of course, other choices such as the log-relative risk (the counterpart of the risk difference on the multiplicative scale) could be considered, and dealt with along the same lines. The risk difference can be interpreted causally under certain assumptions.

For all P ∈ M, set Q_W(W; P) = P(W), g(A | W; P) = P(A | W), and Q_{Y|A,W}(O; P) = P(Y | A, W). We use the alternative notation P = P_{Q,g} with Q = Q(·; P) ≡ (Q_W(·; P), Q_{Y|A,W}(·; P)) and g = g(·|·; P). Equivalently, P_{Q,g} is the data-generating distribution such that Q(·; P_{Q,g}) = Q and g(·|·; P_{Q,g}) = g. In particular, we denote Q_0 = Q(·; P_0) = (Q_W(·; P_0), Q_{Y|A,W}(·; P_0)).
We also introduce the notation Q = {Q(·; P) : P ∈ M} for the nonparametric set of all possible values of Q, and G = {g(·|·; P) : P ∈ M} for the nonparametric set of all possible values of g. Setting Q̄(A, W; P) = E_P(Y | A, W) and Q̄_0 = Q̄(·; P_0) [with a slight abuse, we also sometimes write Q̄(·; Q) instead of Q̄(·; P_{Q,g})], we define in greater generality

  Ψ(P) = E_P( Q̄(1, W; P) − Q̄(0, W; P) )

over the whole set M, so that ψ_0 can equivalently be written as ψ_0 = Ψ(P_0). This notation also emphasizes the fact that Ψ(P) only depends on P through Q̄(·; P) and Q_W(·; P), justifying the alternative notation Ψ(P_{Q,g}) = Ψ(Q). The following proposition summarizes the most fundamental properties enjoyed by Ψ.

Proposition 29.1. The functional Ψ is pathwise differentiable at every P ∈ M. The efficient influence curve of Ψ at P_{Q,g} ∈ M is characterized by D*(O; P_{Q,g}) = D*_1(W; Q) + D*_2(O; P_{Q,g}), where

  D*_1(W; Q) = Q̄(1, W) − Q̄(0, W) − Ψ(Q), and
  D*_2(O; P_{Q,g}) = (2A − 1)/g(A | W) × (Y − Q̄(A, W)).

The variance Var_P D*(O; P) is the lower bound of the asymptotic variance of any regular estimator of Ψ(P) in the i.i.d. setting. Furthermore, even if Q ≠ Q_0,

  E_{P_0} D*(O; P_{Q,g}) = 0 implies Ψ(Q) = Ψ(Q_0) (29.1)

when g = g(·|·; P_0).

The implication (29.1) is the key to the robustness of the TMLE introduced and studied in this chapter. It is another justification of our interest in the pathwise differentiability of the functional Ψ and its efficient influence curve.

29.2 Data-Generating Mechanism

In order to formally describe the data-generating mechanism, we need to state a starting assumption: during the course of a clinical trial, it is possible to recruit patients independently from a stationary population.
In the counterfactual framework, this is equivalent to supposing that it is possible to sample as many independent copies of the full-data structure as required. Let us denote the ith observed data structure by O_i = (W_i, A_i, Y_i). We also find it convenient to introduce O_n = (O_1,...,O_n) and, for every i = 0,...,n, O_n(i) = (O_1,...,O_i) [with the convention O_n(0) = ∅].

By adjusting the randomization scheme as the data accrue, we mean that the nth treatment assignment A_n is drawn from g_n(·|W_n), where g_n(·|W) is a conditional distribution (or treatment mechanism) given the covariate W, which additionally depends on past observations O_{n−1}. Since the sequence of treatment mechanisms cannot reasonably grow in complexity as the sample size increases, we will only consider data-adaptive treatment mechanisms such that g_n(·|W) depends on O_{n−1} only through a finite-dimensional summary measure Z_n = φ_n(O_{n−1}), where the measurable function φ_n maps O^{n−1} onto ℝ^d (where O is the set from which O takes its values) for some fixed d ≥ 0 [d = 0 corresponds to the case where g_n(·|W) actually does not adapt]. For instance, Z_{n+1} = φ_{n+1}(O_n) ≡ (n^{−1} ∑_{i=1}^{n} Y_i I(A_i = 0), n^{−1} ∑_{i=1}^{n} Y_i I(A_i = 1)) characterizes a proper summary measure of the past, which keeps track of the mean outcome in each treatment arm.
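A data-generating mechanism of this kind is easy to simulate. In the hypothetical sketch below, the summary measure consists of per-arm assignment and success counts, and the adaptation rule (favor the arm with the better observed success rate, with assignment probabilities bounded away from 0 and 1) is an illustration only; it is not the Neyman-targeting rule studied later in this chapter:

```python
import random

def run_adaptive_trial(n, outcome, rng):
    """Draw O_1, ..., O_n sequentially: W_i from a baseline law,
    A_i ~ Ber(g_i(1|W_i)) where g_i depends on the past only through a
    finite-dimensional summary Z_i (here: per-arm counts of assignments
    and successes), and Y_i = outcome(W_i, A_i, rng).  The adaptation
    rule is a hypothetical illustration."""
    successes = [0, 0]
    counts = [1, 1]  # start at 1 to avoid division by zero
    trial = []
    for _ in range(n):
        w = rng.random()
        r0 = successes[0] / counts[0]
        r1 = successes[1] / counts[1]
        # favor the arm with the better observed success rate, with
        # assignment probabilities bounded away from 0 and 1
        g1 = min(max(0.5 + 0.5 * (r1 - r0), 0.1), 0.9)
        a = int(rng.random() < g1)
        y = outcome(w, a, rng)
        counts[a] += 1
        successes[a] += y
        trial.append((w, a, y))
    return trial

# Toy outcome law: success probability 0.3 in arm 0, 0.7 in arm 1.
rng = random.Random(0)
trial = run_adaptive_trial(
    500, lambda w, a, rng: int(rng.random() < (0.3 + 0.4 * a)), rng)
```

Under this toy rule the assignment probability drifts toward the superior arm as its empirical success rate separates from that of the inferior arm.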
Another sequence of mappings φ_n will be at the core of the adaptive methodology that we study in depth in this chapter; see (29.4).

Formally, the data-generating mechanism is specified by the following factorization of the likelihood of O_n:

  ∏_{i=1}^{n} ( Q_W(W_i; P_0) × Q_{Y|A,W}(O_i; P_0) ) × ∏_{i=1}^{n} g_i(A_i | W_i),

which suggests the introduction of g_n = (g_1,...,g_n), referred to as the design of the study, and the expression "O_n is drawn from (Q_0, g_n)." Likewise, the likelihood of O_n under (Q, g_n) [where Q = (Q_W, Q_{Y|A,W}) ∈ Q is a candidate value for Q_0] is

  ∏_{i=1}^{n} ( Q_W(W_i) × Q_{Y|A,W}(O_i) ) × ∏_{i=1}^{n} g_i(A_i | W_i),


where we emphasize that the second factor is known. Thus we will refer, with a slight abuse of terminology, to ∑_{i=1}^{n} log Q_W(W_i) + log Q_{Y|A,W}(O_i) as the log-likelihood of O_n under (Q, g_n). Furthermore, given g_n, we introduce the notation P_{Q_0,g_i} f ≡ E_{P_{Q_0,g_i}}( f(O_i) | O_n(i−1) ) for any possibly vector-valued measurable f defined on O.

Fig. 29.1 A possible causal graph describing the data-generating mechanism

Another equivalent characterization of the data-generating mechanism involves the causal graph shown in Fig. 29.1. It is seen again that W_n is drawn independently from the past O_{n−1}. Secondly, A_n is a deterministic function of W_n, the summary measure Z_n (which depends on O_{n−1}), and a new independent source of randomness [in other words, it is drawn conditionally on (W_n, Z_n) and conditionally independently of the past O_{n−1}]. Thirdly, Y_n is a deterministic function of (A_n, W_n) and a new independent source of randomness [it is drawn conditionally on (A_n, W_n) and conditionally independently of the past O_{n−1}]. Then the next summary measure Z_{n+1} is obtained as a function of O_{n−1} and O_n = (W_n, A_n, Y_n), and so on.

Finally, it is interesting in practice to adapt the design group sequentially. This can be simply formalized. For a given prespecified integer c ≥ 1 (c = 1 corresponds to a fully sequential adaptive design), going forward c-group sequentially simply amounts to imposing φ_{(r−1)c+1}(O_{(r−1)c}) = ... = φ_{rc}(O_{rc−1}) for all r ≥ 1.
Then the c treatment assignments A_{(r-1)c+1}, ..., A_{rc} in the rth c-group are all drawn from the same conditional distribution g_{(r-1)c}(· | W). Yet, although all our results (and their proofs) still hold for any c ≥ 1, we prefer to consider, in the rest of this section and in Sects. 29.4 and 29.5, the case where c = 1 for simplicity's sake. In contrast, the simulation study carried out in Sect. 29.6 involves some c > 1.

29.3 Optimal Design

One of the most important features of the adaptive group sequential design methodology is that it targets a user-supplied specific design of special interest. This specific design is generally an optimal design with respect to a criterion that translates what the investigator is most concerned about. Specifically, one could be most concerned with the well-being of the target population, wishing that a result be available as quickly as possible and aspiring therefore to the highest efficiency (i.e., the ability to reach a conclusion as quickly as possible subject to level and power constraints). Or one could be most concerned with the well-being of the subjects participating in the clinical trial, therefore trying to minimize the number of patients assigned to their corresponding inferior treatment arms, subject to level and power constraints. Obviously, these are only two important examples from a large class of potentially interesting criteria. The sole purpose of the criterion is to generate a random element in G of the form g_n = g_{Z_n}, where Z_n = φ_n(O_{n-1}) is a finite-dimensional summary measure of O_{n-1}, and g_n converges to a desired or optimal fixed treatment mechanism. We decide to focus in this chapter on the first example, but it must be clear that the methodology applies to a variety of other criteria.
See van der Laan (2008b) for other examples.

By Proposition 29.1, the asymptotic variance of any regular estimator of the risk difference Ψ(Q_0) has lower bound $\operatorname{var}_{P_{Q_0,g}} D^*(O; P_{Q_0,g})$ if the estimator relies on data sampled independently from P_{Q_0,g}. We have that

$$\operatorname{var}_{P_{Q_0,g}} D^*(O; P_{Q_0,g}) = E_{Q_0}\big( \bar{Q}_0(1, W) - \bar{Q}_0(0, W) - \Psi(Q_0) \big)^2 + E_{Q_0}\left( \frac{\sigma^2(Q_0)(1, W)}{g(1 \mid W)} + \frac{\sigma^2(Q_0)(0, W)}{g(0 \mid W)} \right),$$

where σ²(Q_0)(A, W) denotes the conditional variance of Y given (A, W) under Q_0. We use the notation E_{Q_0} above (for the expectation with respect to the marginal distribution of W under P_0) in order to emphasize the fact that the treatment mechanism g only appears in the second term of the right-hand side sum. Furthermore, it holds P_0-almost surely that

$$\frac{\sigma^2(Q_0)(1, W)}{g(1 \mid W)} + \frac{\sigma^2(Q_0)(0, W)}{g(0 \mid W)} \geq \big( \sigma(Q_0)(1, W) + \sigma(Q_0)(0, W) \big)^2,$$

with equality if and only if

$$g(1 \mid W) = \frac{\sigma(Q_0)(1, W)}{\sigma(Q_0)(1, W) + \sigma(Q_0)(0, W)}, \qquad (29.2)$$

P_0-almost surely. Therefore, the following lower bound holds for all g ∈ G:

$$\operatorname{var}_{P_{Q_0,g}} D^*(O; P_{Q_0,g}) \geq E_{Q_0}\big( \bar{Q}_0(1, W) - \bar{Q}_0(0, W) - \Psi(Q_0) \big)^2 + E_{Q_0}\big( \sigma(Q_0)(1, W) + \sigma(Q_0)(0, W) \big)^2,$$

with equality if and only if g ∈ G is characterized by (29.2). This optimal design is known in the literature as the Neyman allocation (Hu and Rosenberger 2006, p. 13). This result makes clear that the most efficient treatment mechanism assigns with higher probability a patient with covariate vector W to the treatment arm with the largest variance of the outcome Y, regardless of the mean of the outcome (i.e., regardless of whether the arm is inferior or superior).
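As an illustration (not from the chapter), the Neyman allocation (29.2) and the inequality it saturates can be checked numerically with a minimal NumPy sketch; the function names `neyman_allocation` and `variance_term` are ours, and the standard deviations are hypothetical values for a single covariate value W.

```python
import numpy as np

def neyman_allocation(sigma1, sigma0):
    """Probability of assigning A = 1 under the Neyman allocation (29.2)."""
    return sigma1 / (sigma1 + sigma0)

def variance_term(g1, sigma1, sigma0):
    """Design-dependent term sigma^2(1,W)/g(1|W) + sigma^2(0,W)/g(0|W)."""
    return sigma1**2 / g1 + sigma0**2 / (1.0 - g1)

# Hypothetical conditional standard deviations of Y in the two arms
sigma1, sigma0 = 3.0, 1.0
g_opt = neyman_allocation(sigma1, sigma0)  # favors the noisier arm

# The Neyman allocation attains the lower bound (sigma1 + sigma0)^2 ...
assert np.isclose(variance_term(g_opt, sigma1, sigma0), (sigma1 + sigma0)**2)
# ... and any other allocation does at least as badly
for g in np.linspace(0.05, 0.95, 19):
    assert variance_term(g, sigma1, sigma0) >= (sigma1 + sigma0)**2 - 1e-9
```

Note that the allocation depends only on the ratio of the two standard deviations, not on the arm means, in line with the closing remark above.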


For logistical reasons, it might be preferable to consider only treatment mechanisms that assign treatment in response to a subvector V of the baseline covariate vector W. In addition, if W is complex, targeting the optimal Neyman allocation might be too ambitious. Therefore, we will consider the important case where V is a discrete covariate with finitely many values in the set V = {1, ..., ν}. The covariate V indicates subgroup membership for a collection of ν subgroups of interest. We decide to restrict the search for an optimal design to the set G_1 ⊂ G of those treatment mechanisms that only depend on W through V. The same calculations as above straightforwardly yield that, for all g ∈ G_1,

$$\operatorname{var}_{P_{Q_0,g}} D^*(O; P_{Q_0,g}) \geq E_{Q_0}\big( \bar{Q}_0(1, W) - \bar{Q}_0(0, W) - \Psi(Q_0) \big)^2 + E_{Q_0}\big( \bar{\sigma}(Q_0)(1, V) + \bar{\sigma}(Q_0)(0, V) \big)^2,$$

where $\bar{\sigma}^2(Q_0)(a, V) = E_{Q_0}\big(\sigma^2(Q_0)(a, W) \mid V\big)$ for a ∈ A, with equality if and only if g coincides with g*(Q_0), characterized by

$$g^*(Q_0)(1 \mid V) = \frac{\bar{\sigma}(Q_0)(1, V)}{\bar{\sigma}(Q_0)(1, V) + \bar{\sigma}(Q_0)(0, V)},$$

P_0-almost surely. Hereafter, we refer to g*(Q_0) as the optimal design.

Because g*(Q_0) is characterized as the minimizer over g ∈ G_1 of the variance under P_{Q_0,g} of the efficient influence curve at P_{Q_0,g}, we propose to construct g_{n+1} ∈ G_1 as the minimizer over g ∈ G_1 of an estimator of the latter variance based on past observations O_n. We proceed by recursion. We first set g_1 = g^b, the so-called balanced treatment mechanism such that g^b(1 | W) = 1/2 for all W ∈ W, and assume that O_n has already been sampled from (Q_0, g_n), the sample size being large enough to guarantee $\sum_{i=1}^{n} I(V_i = v) > 0$ for all v ∈ V (if n_0 is the smallest sample size such that the previous condition is met, then we set g_1 = ... = g_{n_0} = g^b). The issue is now to construct g_{n+1}.
Let us assume for the time being that we already know how to construct an estimator Q_n of Q_0 based on O_n [hence the estimators $\bar{Q}_n = \bar{Q}(\cdot; Q_n)$ of $\bar{Q}_0$ and Ψ(Q_n) of Ψ(Q_0) = ψ_0]. The reasoning is not circular by virtue of the chronological ordering as summarized in Fig. 29.1, for instance. Then, for all g ∈ G_1,

$$\begin{aligned} S_n(g) &= \frac{1}{n} \sum_{i=1}^{n} \left( D_1^*(W_i; Q_n)^2 + 2 D_1^*(W_i; Q_n) D_2^*(O_i; P_{Q_n,g}) \frac{g(A_i \mid V_i)}{g_i(A_i \mid V_i)} + D_2^*(O_i; P_{Q_n,g})^2 \frac{g(A_i \mid V_i)}{g_i(A_i \mid V_i)} \right) \\ &\quad - \left( \frac{1}{n} \sum_{i=1}^{n} \left[ D_1^*(W_i; Q_n) + D_2^*(O_i; P_{Q_n,g}) \frac{g(A_i \mid V_i)}{g_i(A_i \mid V_i)} \right] \right)^2 \\ &= \frac{1}{n} \sum_{i=1}^{n} \frac{(Y_i - \bar{Q}_n(A_i, W_i))^2}{g(A_i \mid V_i)\, g_i(A_i \mid V_i)} \\ &\quad + \left\{ \frac{1}{n} \sum_{i=1}^{n} \left( D_1^*(W_i; Q_n)^2 + 2 D_1^*(W_i; Q_n) D_2^*(O_i; P_{Q_n,g_i}) \right) - \left( \frac{1}{n} \sum_{i=1}^{n} \left[ D_1^*(W_i; Q_n) + D_2^*(O_i; P_{Q_n,g_i}) \right] \right)^2 \right\} \end{aligned}$$

estimates $\operatorname{var}_{P_{Q_0,g}} D^*(O; P_{Q_0,g})$ (the weighting provides the adequate tilt of the empirical distribution; it is not necessary to weight the terms corresponding to D*_1 because they do not depend on the treatment mechanism). Now, only the first term in the rightmost expression still depends on g. The same calculation as above straightforwardly yields that S_n(g) is minimized at g_{n+1} ∈ G_1 characterized by

$$g_{n+1}(1 \mid v) = \frac{s_{v,n}(1)}{s_{v,n}(1) + s_{v,n}(0)}$$

for all v ∈ V, where for each (v, a) ∈ V × A

$$s^2_{v,n}(a) = \frac{\displaystyle \frac{1}{n} \sum_{i=1}^{n} \frac{(Y_i - \bar{Q}_n(A_i, W_i))^2}{g_i(A_i \mid V_i)} I\big((V_i, A_i) = (v, a)\big)}{\displaystyle \frac{1}{n} \sum_{i=1}^{n} I(V_i = v)}.$$

Yet, instead of considering the above characterization, we find it more convenient to define

$$g^*_{n+1}(1 \mid v) = \frac{\sigma_{v,n}(1)}{\sigma_{v,n}(1) + \sigma_{v,n}(0)}, \qquad (29.3)$$

for all v ∈ V, where for each (v, a) ∈ V × A

$$\sigma^2_{v,n}(a) = \frac{\displaystyle \frac{1}{n} \sum_{i=1}^{n} \frac{(Y_i - \bar{Q}_n(A_i, W_i))^2}{g_i(A_i \mid V_i)} I\big((V_i, A_i) = (v, a)\big)}{\displaystyle \frac{1}{n} \sum_{i=1}^{n} \frac{I\big((V_i, A_i) = (v, a)\big)}{g_i(a \mid v)}}.$$

Note that s²_{v,n}(a) and σ²_{v,n}(a) share the same numerator, and that the different denominators converge to the same limit.
Substituting σ²_{v,n}(a) for s²_{v,n}(a) is convenient because one naturally interprets the former as an estimator of the conditional variance of Y given (A, V) = (a, v) based on O_n, a fact that we use in Sect. 29.4.2. Finally, we emphasize that g*_{n+1} = g_{Z_{n+1}} for the summary measure of the past O_n

$$Z_{n+1} = \phi_{n+1}(O_n) \equiv \big( (\sigma^2_{v,n}(0), \sigma^2_{v,n}(1)) : v \in V \big). \qquad (29.4)$$

The rigorous definition of the design g*_n = (g*_1, ..., g*_n) follows by recursion, but it is still subject to knowledge about how to construct an estimator Q_n of Q_0 based on O_n. Because this last missing piece of the formal definition of the adaptive group sequential design data-generating mechanism is also the core of the TMLE procedure, we address it in Sect. 29.4.
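The design update just described can be sketched in a few lines of NumPy. This is an illustrative translation of (29.3), not code from the chapter; the function name and its array-based interface (`Qbar_hat[i]` estimating Q̄_0(A_i, W_i) and `g_hist[i] = g_i(A_i | V_i)`) are our own conventions.

```python
import numpy as np

def adaptive_design_update(V, A, Y, Qbar_hat, g_hist, levels):
    """Compute sigma^2_{v,n}(a) as displayed above and return the updated
    design g*_{n+1}(1 | v) of (29.3) for each subgroup level v."""
    n = len(Y)
    resid2 = (Y - Qbar_hat) ** 2 / g_hist      # inverse-weighted squared residuals
    g_next = {}
    for v in levels:
        sigma = {}
        for a in (0, 1):
            mask = (V == v) & (A == a)
            num = resid2[mask].sum() / n               # shared numerator of s^2 and sigma^2
            den = (1.0 / g_hist[mask]).sum() / n       # denominator specific to sigma^2_{v,n}(a)
            sigma[a] = np.sqrt(num / den)
        g_next[v] = sigma[1] / (sigma[1] + sigma[0])
    return g_next

# Toy check under balanced history: arm 1 residuals are twice as large,
# so the updated design should lean toward g(1|v) = 2/3
rng = np.random.default_rng(1)
n = 4000
V = rng.choice([1, 2], size=n)
A = rng.binomial(1, 0.5, size=n)
Y = np.where(A == 1, 2.0, 1.0) * rng.normal(size=n)
g = adaptive_design_update(V, A, Y, np.zeros(n), np.full(n, 0.5), [1, 2])
```

Because the weights 1/g_i appear in both numerator and denominator, the update is invariant to how aggressively past assignments deviated from balance.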


29.4 TMLE Procedure

We assume hereafter that O_n has already been sampled from the (Q_0, g*_n)-adaptive sampling scheme. In this section, we construct an estimator Q_n (actually denoted by Q*_n) of Q_0, thereby yielding the characterization of g*_{n+1} and completing the formal definition of the adaptive design g*_n. In particular, the next data structure O_{n+1} can be drawn from (Q_0, g*_{n+1}), and it makes sense to undertake the asymptotic study of the properties of the TMLE methodology based on adaptive group sequential sampling. As in the i.i.d. framework, the TMLE procedure maps an initial substitution estimator Ψ(Q⁰_n) of ψ_0 into an update ψ*_n = Ψ(Q*_n) by fluctuating the initial estimate Q⁰_n of Q_0.

29.4.1 Initial ML-Based Substitution Estimator

The working model. In order to construct the initial estimate Q⁰_n of Q_0, we consider a working model Q^w_n. With a slight abuse of notation, the elements of Q^w_n are denoted by (Q_W(·; P_n), Q_{Y|A,W}(·; θ)) for some parameter θ ∈ Θ, where Q_W(·; P_n) is the empirical marginal distribution of W. Specifically, the working model Q^w_n is chosen in such a way that

$$Q_{Y|A,W}(O; \theta) = \frac{1}{\sqrt{2\pi \sigma^2_V(A)}} \exp\left\{ -\frac{(Y - m(A, W; \beta_V))^2}{2 \sigma^2_V(A)} \right\}.$$

This implies that for any P_θ ∈ M such that Q_{Y|A,W}(·; P_θ) = Q_{Y|A,W}(·; θ), the conditional mean Q̄(A, W; P_θ), which we also denote by Q̄(A, W; θ), satisfies Q̄(A, W; θ) = m(A, W; β_V), the right-hand side expression being a linear combination of variables extracted from (A, W) and indexed by the regression vector β_V (of dimension b). Defining

$$\theta(v) = (\beta_v, \sigma^2_v(0), \sigma^2_v(1))^\top \in \Theta_v \subset \mathbb{R}^b \times \mathbb{R}^*_+ \times \mathbb{R}^*_+ \qquad (29.5)$$

for each v ∈ V, the complete parameter is given by θ = (θ(1)^⊤, ..., θ(ν)^⊤)^⊤ ∈ Θ, where Θ = ∏_{v=1}^{ν} Θ_v.
We impose the following conditions on the parameterization: the parameter set Θ is compact, and the linear parameterization is identifiable, i.e., for all v ∈ V, if m(a, w; β_v) = m(a, w; β′_v) for all a ∈ A and w ∈ W (compatible with v), then necessarily β_v = β′_v.

Characterizing Q⁰_n. Let us set a reference fixed design g^r ∈ G_1. We now characterize Q⁰_n by letting Q⁰_n = (Q_W(·; P_n), Q_{Y|A,W}(·; θ_n)), where

$$\theta_n = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log Q_{Y|A,W}(O_i; \theta) \, \frac{g^r(A_i \mid V_i)}{g^*_i(A_i \mid V_i)} \qquad (29.6)$$

is a weighted maximum likelihood estimator with respect to the working model. Thus, the vth component θ_n(v) of θ_n satisfies

$$\theta_n(v) = \arg\min_{\theta(v) \in \Theta_v} \sum_{i=1}^{n} \left( \log \sigma^2_v(A_i) + \frac{(Y_i - m(A_i, W_i; \beta_v))^2}{\sigma^2_v(A_i)} \right) \frac{g^r(A_i \mid V_i)}{g^*_i(A_i \mid V_i)} \, I(V_i = v)$$

for every v ∈ V. Note that this initial estimate Q⁰_n of Q_0 yields the initial maximum-likelihood-based substitution estimator Ψ(Q⁰_n) of ψ_0:

$$\Psi(Q^0_n) = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}(1, W_i; \theta_n) - \bar{Q}(0, W_i; \theta_n).$$

Studying Q⁰_n through θ_n. For simplicity, let us introduce, for all θ ∈ Θ, the additional notation $\ell_{\theta,0} = \log Q_{Y|A,W}(\cdot; \theta)$, $\dot{\ell}_{\theta,0} = \frac{\partial}{\partial \theta} \ell_{\theta,0}$, and $\ddot{\ell}_{\theta,0} = \frac{\partial^2}{\partial \theta^2} \ell_{\theta,0}$. The first asymptotic property of θ_n that we derive concerns its consistency (see Theorem 5 in van der Laan 2008b).

Proposition 29.2. Assume that:

A1. There exists a unique interior point θ_0 ∈ Θ such that $\theta_0 = \arg\max_{\theta \in \Theta} P_{Q_0,g^r} \ell_{\theta,0}$;
A2. The matrix $-P_{Q_0,g^r} \ddot{\ell}_{\theta_0,0}$ is positive definite.

Provided that O is a bounded set, θ_n consistently estimates θ_0.

The limit in probability of θ_n has a nice interpretation in terms of the projection of Q_{Y|A,W}(·; P_0) onto {Q_{Y|A,W}(·; θ) : θ ∈ Θ}.
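The weighted criterion (29.6) can be illustrated with a short numerical sketch, not taken from the chapter: within one subgroup, given a design matrix for m(A, W; β_v) and the weights g^r/g*_i, the weighted Gaussian log-likelihood is maximized by alternating weighted least squares for β_v with weighted residual variances for σ²_v(a). The function name and interface are our own assumptions.

```python
import numpy as np

def weighted_mle_subgroup(Y, X, A, w, n_iter=50):
    """Weighted MLE of (beta_v, sigma2_v(0), sigma2_v(1)) in one subgroup v.
    X is the design matrix for m(A, W; beta_v); w[i] = g^r(A_i|V_i)/g*_i(A_i|V_i).
    Alternates WLS for beta (weights w/sigma2(A_i)) with closed-form
    weighted residual variances, the two stationarity conditions of (29.6)."""
    sigma2 = np.ones(2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        omega = w / sigma2[A]                      # effective WLS weights
        WX = X * omega[:, None]
        beta = np.linalg.solve(X.T @ WX, WX.T @ Y)
        resid2 = (Y - X @ beta) ** 2
        for a in (0, 1):
            mask = A == a
            sigma2[a] = np.average(resid2[mask], weights=w[mask])
    return beta, sigma2

# Toy usage under a balanced history (all weights equal to 1)
rng = np.random.default_rng(2)
n = 5000
U = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)
X = np.column_stack([np.ones(n), U, A])
beta_true = np.array([1.0, 2.0, 0.5])
sd = np.where(A == 1, 2.0, 0.5)                   # heteroscedastic arms
Y = X @ beta_true + sd * rng.normal(size=n)
beta_hat, sig2_hat = weighted_mle_subgroup(Y, X, A, np.ones(n))
```

Setting the gradient of the weighted criterion to zero reproduces exactly these two steps, which is why the alternation converges quickly in practice.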
Preferring to discuss this issue in terms of data-generating distributions rather than conditional distributions, let us set Q_{θ_0} = (Q_W(·; P_0), Q_{Y|A,W}(·; θ_0)) and assume that $P_{Q_0,g^r} \log Q_{Y|A,W}(\cdot; P_0)$ is well defined (this weak assumption concerns Q_0, not g^r, and holds for instance when |log Q_{Y|A,W}(·; P_0)| is bounded). Then A1 is equivalent to $P_{Q_{\theta_0},g^r}$ being the unique Kullback–Leibler projection of $P_{Q_0,g^r}$ onto the set

$$\big\{ P \in \mathcal{M} : \exists\, \theta \in \Theta \text{ s.t. } Q_{Y|A,W}(\cdot; P) = Q_{Y|A,W}(\cdot; \theta), \; Q_W(\cdot; P) = Q_W(\cdot; P_0), \text{ and } g(\cdot \mid \cdot; P) = g^r \big\}.$$

In addition to being consistent, θ_n actually satisfies a central limit theorem if supplementary mild conditions are met. The latter central limit theorem is embedded in a more general result that we state in Sect. 29.4.3; see Proposition 29.5.

The cornerstone of the proof of Proposition 29.2 is to interpret θ_n as the solution in θ of the martingale estimating equation $\sum_{i=1}^{n} D_1(\theta)(O_i, Z_i) = 0$, where Z_i is the finite-dimensional summary measure of the past O_{i-1} such that g*_i depends on O_{i-1} only through Z_i (hence the notation g*_i = g_{Z_i}), and $D_1(\theta)(O, Z) = \dot{\ell}_{\theta,0}(O)\, g^r(A \mid V) / g_Z(A \mid V)$ satisfies $P_{Q_0,g^*_i} D_1(\theta_0) = 0$ for all i ≤ n.


29 TMLE in Adaptive Group Sequential Covariate-Adjusted RCTs 505By relying on a Kolmogorov strong law of large numbers for martingales (see Theorem8 in Chambaz and van <strong>de</strong>r Laan 2010), one obtains that n −1P ni=1P Q0,g ∗ i D 1(θ n )converges to zero almost surely. This results in the convergence in probability ofθ n toθ 0 by a Taylor expansion ofθ7! P Q0,g r ˙`θ,0 atθ 0 (hence assumption A2). Thestrong law of large numbers applies because the geometry ofF={D 1 (θ) :θ2Θ} ismo<strong>de</strong>rately complex [heuristically,F can be covered by finitely manyk·k 1 -balls becauseΘisa compact set, (O, Z) is boun<strong>de</strong>d, and the mapping (o, z,θ)7! D 1 (θ)(o, z)is continuous; and the number of such balls of radiusηnee<strong>de</strong>d to coverF does notgrow too fast asηgoes to zero].Furthermore, maximizing a weighted version of the log-likelihood is a technicaltwist that makes the theoretical study of the properties ofθ n easier. In<strong>de</strong>ed, theunweighted maximum likelihood estimator t n = arg max θ2ΘP ni=1log Q Y|A,W (O i ;θ)targets the parameterTḡn (Q 0 )=arg maxθ2ΘnXi=1P Q0,g ∗ log Qi Y|A,W (O i ;θ)=arg max P Q0,ḡn`θ,0,θ2Θwhere ḡ n = 1 nP ni=1g ∗ i . Therefore, t n asymptotically targets the limit, if it exists,of Tḡn (Q 0 ). Assuming that ḡ n converges itself to a fixed <strong>de</strong>sign g 1 2G, then t nasymptotically targets parameter T g1 (Q 0 ). The latter parameter is very difficult tointerpret and to analyze as it <strong>de</strong>pends directly and indirectly (through g 1 ) on Q 0 .29.4.2 Convergence of the Adaptive DesignConsi<strong>de</strong>r the mapping G ∗ fromΘtoG 1 [respectively equipped with the Eucli<strong>de</strong>andistance and, for instance, the distance d(g, g 0 )= P v2V|g(1|v)−g 0 (1|v)|] suchthat, for anyθ2Θ, for any (a, v)2A×VG ∗ (θ)(a|v)=σ v (a)σ v (1)+σ v (0) . (29.7)Equation (29.7) characterizes G ∗ , which is obviously continuous. 
Since g*_n is adapted in such a way that g*_n = G*(θ_n), Proposition 29.2 and the continuous mapping theorem (see Theorem 1.3.6 in van der Vaart and Wellner 1996) straightforwardly imply the following result.

Proposition 29.3. Under the assumptions of Proposition 29.2, the adaptive design g*_n converges in probability to the limit design G*(θ_0).

The convergence of the adaptive design g*_n is a crucial result. It is noteworthy that the limit design G*(θ_0) equals the optimal design g*(Q_0) if the working model is correctly specified (which never happens in practical applications), but not necessarily otherwise. Furthermore, the relationship g*_n = G*(θ_n) also entails the possibility of deriving the convergence in distribution of $\sqrt{n}\,(g^*_n - G^*(\theta_0))$ to a centered Gaussian distribution with known variance by application of the delta method (G* is differentiable) from a central limit theorem on θ_n (Proposition 29.5).

29.4.3 The TMLE

Fluctuating Q⁰_n. The second step of the TMLE procedure stretches the initial estimate Ψ(Q⁰_n) in the direction of the parameter of interest, through a maximum likelihood step over a well-chosen fluctuation of Q⁰_n. The latter fluctuation of Q⁰_n is just a one-dimensional parametric model {Q⁰_n(ε) : ε ∈ E} ⊂ Q indexed by the parameter ε ∈ E, E ⊂ R being a bounded interval that contains a neighborhood of the origin. Specifically, we set, for all ε ∈ E, Q⁰_n(ε) = (Q_W(·; P_n), Q_{Y|A,W}(·; θ_n, ε)), where for any θ ∈ Θ

$$Q_{Y|A,W}(O; \theta, \varepsilon) = \frac{1}{\sqrt{2\pi \sigma^2_V(A)}} \exp\left\{ -\frac{(Y - \bar{Q}(A, W; \theta) - \varepsilon H^*(A, W; \theta))^2}{2 \sigma^2_V(A)} \right\}, \qquad (29.8)$$

with

$$H^*(A, W; \theta) = \frac{2A - 1}{G^*(\theta)(A \mid V)} \, \sigma^2_V(A).$$

In particular, the fluctuation goes through Q⁰_n at ε = 0 (i.e., Q⁰_n(0) = Q⁰_n).
Let P⁰_n(ε) ∈ M be a data-generating distribution such that Q_{Y|A,W}(·; P⁰_n(ε)) = Q_{Y|A,W}(·; θ_n, ε). The conditional mean Q̄(A, W; P⁰_n(ε)), which we also denote by Q̄(A, W; θ_n, ε), is Q̄(A, W; θ_n, ε) = Q̄(A, W; θ_n) + ε H*(A, W; θ_n). Furthermore, the score at ε = 0 of P⁰_n(ε) equals

$$\frac{\partial}{\partial \varepsilon} \log P^0_n(\varepsilon)(O) \Big|_{\varepsilon=0} = \frac{2A - 1}{G^*(\theta_n)(A \mid V)} \big(Y - \bar{Q}(A, W; \theta_n)\big) = D^*_2\big(O; P_{Q^0_n, G^*(\theta_n)}\big),$$

the second component of the efficient influence curve of Ψ at $P_{Q^0_n, G^*(\theta_n)} = P_{Q^0_n, g^*_n}$. Recall that g*_n = G*(θ_n).

Characterizing the TMLE Q*_n. We characterize the update Q*_n of Q⁰_n in the fluctuation {Q⁰_n(ε) : ε ∈ E} by Q*_n = Q⁰_n(ε_n), where

$$\varepsilon_n = \arg\max_{\varepsilon \in E} \sum_{i=1}^{n} \log Q_{Y|A,W}(O_i; \theta_n, \varepsilon) \, \frac{g^*_n(A_i \mid V_i)}{g^*_i(A_i \mid V_i)} \qquad (29.9)$$

is a weighted maximum likelihood estimator with respect to the fluctuation. It is worth noting that ε_n is known in closed form (we assume, without serious loss of generality, that E is large enough for the maximum to be achieved in its interior). Denoting the vth component θ_n(v) of θ_n by (β_{v,n}, σ²_{v,n}(0), σ²_{v,n}(1))^⊤, it holds that


$$\varepsilon_n = \frac{\displaystyle \sum_{i=1}^{n} \big(Y_i - \bar{Q}(A_i, W_i; \theta_n)\big) \frac{2A_i - 1}{g^*_i(A_i \mid V_i)}}{\displaystyle \sum_{i=1}^{n} \frac{\sigma^2_{V_i,n}(A_i)}{g^*_n(A_i \mid V_i)\, g^*_i(A_i \mid V_i)}}.$$

The notation Q*_n for this first update of Q⁰_n is a reference to the fact that the TMLE procedure, which is in greater generality an iterative procedure, converges here in one single step. Indeed, suppose that one fluctuates Q*_n as we fluctuated Q⁰_n, i.e., by introducing Q¹_n(ε) = (Q_W(·; P_n), Q_{Y|A,W}(·; θ_n, ε_n, ε)) with Q_{Y|A,W}(O; θ, ε′, ε) equal to the right-hand side of (29.8), where one substitutes Q̄(A, W; θ, ε′) for Q̄(A, W; θ). In addition, suppose that one then defines the weighted maximum likelihood estimator ε′_n as the right-hand side of (29.9), where one substitutes Q_{Y|A,W}(O_i; θ_n, ε_n, ε) for Q_{Y|A,W}(O_i; θ_n, ε). Then it follows that ε′_n = 0, so that the "updated" Q*_n(ε′_n) = Q*_n. The updated estimator Q*_n of Q_0 maps into the TMLE ψ*_n = Ψ(Q*_n) of the risk difference ψ_0 = Ψ(Q_0):

$$\psi^*_n = \frac{1}{n} \sum_{i=1}^{n} \bar{Q}(1, W_i; \theta_n, \varepsilon_n) - \bar{Q}(0, W_i; \theta_n, \varepsilon_n). \qquad (29.10)$$

The asymptotics of ψ*_n relies on a central limit theorem for (θ_n, ε_n), which we discuss in Sect. 29.5.

29.5 Asymptotics

We now state and comment on a consistency result for the stacked estimator (θ_n, ε_n), which complements Proposition 29.2 (see Theorem 8 in van der Laan 2008b). For simplicity, let us generalize the notation ℓ_{θ,0} introduced in Sect. 29.4.1 by setting, for all (θ, ε) ∈ Θ × E, $\ell_{\theta,\varepsilon} = \log Q_{Y|A,W}(\cdot; \theta, \varepsilon)$. Moreover, let us set, for all (θ, ε) ∈ Θ × E, Q_{θ,ε} = (Q_W(·; P_0), Q_{Y|A,W}(·; θ, ε)).

Proposition 29.4. Suppose that assumptions A1 and A2 from Proposition 29.2 hold. In addition, assume that:
A3. There exists a unique interior point ε_0 ∈ E such that $\varepsilon_0 = \arg\max_{\varepsilon \in E} P_{Q_0,G^*(\theta_0)} \ell_{\theta_0,\varepsilon}$.

Then: (1) it holds that Ψ(Q_{θ_0,ε_0}) = Ψ(Q_0); (2) provided that O is a bounded set, (θ_n, ε_n) consistently estimates (θ_0, ε_0).

We already discussed the interpretation of the limit of θ_n in terms of the Kullback–Leibler projection. Likewise, the limit ε_0 of ε_n enjoys such an interpretation. Let us assume that $P_{Q_0,G^*(\theta_0)} \log Q_{Y|A,W}(\cdot; P_0)$ is well defined [this weak assumption concerns Q_0, not G*(θ_0), and holds for instance when |log Q_{Y|A,W}(·; P_0)| is bounded]. Then A3 is equivalent to $P_{Q_{\theta_0,\varepsilon_0},G^*(\theta_0)}$ being the unique Kullback–Leibler projection of $P_{Q_0,G^*(\theta_0)}$ onto the set

$$\big\{ P \in \mathcal{M} : \exists\, \varepsilon \in E \text{ s.t. } Q(\cdot; P) = Q_{\theta_0,\varepsilon} \text{ and } g(\cdot \mid \cdot; P) = G^*(\theta_0) \big\}.$$

Of course, the most striking property that ε_0 enjoys is (1): even when the working model is misspecified, the fluctuated limit Q_{θ_0,ε_0} still yields the true value Ψ(Q_0) of the parameter of interest. The cornerstone of the proof is again a martingale estimating equation: (θ_n, ε_n) solves $\sum_{i=1}^{n} D(\theta, \varepsilon)(O_i, Z_i) = 0$ in (θ, ε), where

$$D(\theta, \varepsilon)(O, Z) = \left( D_1(\theta)^\top(O, Z), \; \frac{\partial}{\partial \varepsilon} \ell_{\theta,\varepsilon}(O) \, \frac{G^*(\theta)(A \mid V)}{g_Z(A \mid V)} \right)^\top \qquad (29.11)$$

is an extension of the function D_1(θ), which we introduced earlier when summarizing the proof of Proposition 29.2, and satisfies $P_{Q_0,g^*_i} D(\theta_0, \varepsilon_0) = 0$ for all i ≤ n. Here, too, the proof involves the Kolmogorov strong law of large numbers for martingales (Chambaz and van der Laan 2010, Theorem 8), which yields that $n^{-1} \sum_{i=1}^{n} P_{Q_0,g^*_i} D(\theta_n, \varepsilon_n)$ converges to zero almost surely. This results in the convergence in probability of (θ_n, ε_n) to (θ_0, ε_0) by a Taylor expansion of $(\theta, \varepsilon) \mapsto \big(P_{Q_0,g^r} \dot{\ell}^\top_{\theta,0}, \; P_{Q_0,G^*(\theta)} \frac{\partial}{\partial \varepsilon} \ell_{\theta,\varepsilon}\big)$ at (θ_0, ε_0). Note that assumption A3 is a clear counterpart of assumption A1 from Proposition 29.2, but that there is no counterpart of assumption A2 from Proposition 29.2 in Proposition 29.4.
Indeed, it automatically holds in the framework of the proposition that $-P_{Q_0,G^*(\theta_0)} \frac{\partial^2}{\partial \varepsilon^2} \ell_{\theta_0,\varepsilon_0} > 0$, while the proof only requires that the latter quantity be different from zero.

We now state and comment on a central limit theorem for the stacked estimator (θ_n, ε_n) (van der Laan 2008b, Theorem 9). Let us introduce, for all (θ, ε) ∈ Θ × E,

$$\widetilde{D}(\theta, \varepsilon)(O) = \left( \dot{\ell}_{\theta,0}(O) \, g^r(A \mid V), \; \frac{\partial}{\partial \varepsilon} \ell_{\theta,\varepsilon}(O) \, G^*(\theta)(A \mid V) \right),$$

so that D(θ, ε)(O, Z), defined in (29.11), can be represented as $\widetilde{D}(\theta, \varepsilon)(O)/g_Z(A \mid V)$.
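Looping back to the update step of Sect. 29.4.3, the closed-form expression for ε_n and the substitution formula (29.10) translate into a few lines of NumPy. This sketch is ours, not the chapter's; the array-based interface (`Qbar_AW[i]` for Q̄(A_i, W_i; θ_n), `sig2_a[i]` for σ²_{V_i,n}(a), `g1_star[i]` for g*_n(1 | V_i), and `g_hist[i]` for g*_i(A_i | V_i)) is a hypothetical convention.

```python
import numpy as np

def tmle_risk_difference(A, Y, Qbar_AW, Qbar1, Qbar0,
                         sig2_1, sig2_0, g1_star, g_hist):
    """One-step TMLE of the risk difference: closed-form epsilon_n, then
    the fluctuated counterfactual means plugged into (29.10)."""
    g_star_A = np.where(A == 1, g1_star, 1.0 - g1_star)
    sig2_A = np.where(A == 1, sig2_1, sig2_0)
    # Closed-form weighted MLE of the fluctuation parameter
    num = ((Y - Qbar_AW) * (2 * A - 1) / g_hist).sum()
    den = (sig2_A / (g_star_A * g_hist)).sum()
    eps = num / den
    # Fluctuate both counterfactual means with the clever covariate H*(a, W)
    H1 = sig2_1 / g1_star
    H0 = -sig2_0 / (1.0 - g1_star)
    psi = np.mean((Qbar1 + eps * H1) - (Qbar0 + eps * H0))
    return psi, eps

# Sanity check with hypothetical inputs: a perfect fit gives eps = 0,
# so the TMLE reduces to the initial substitution estimator
n = 8
A = np.array([0, 1, 0, 1, 0, 1, 0, 1])
Qbar1 = np.linspace(1.0, 2.0, n)
Qbar0 = np.linspace(0.5, 1.0, n)
Qbar_AW = np.where(A == 1, Qbar1, Qbar0)
psi, eps = tmle_risk_difference(A, Qbar_AW, Qbar_AW, Qbar1, Qbar0,
                                np.ones(n), np.ones(n),
                                np.full(n, 0.6), np.full(n, 0.5))
```

The single-step convergence noted above shows in the code: running the update a second time on the fluctuated means would return eps = 0.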


Proposition 29.5. Suppose that assumptions A1, A2, and A3 from Propositions 29.2 and 29.4 hold. In addition, assume that:

A4. Under Q_0, the outcome Y is not a deterministic function of (A, W).

Then the following asymptotic linear expansion holds:

$$\sqrt{n}\big( (\theta_n, \varepsilon_n) - (\theta_0, \varepsilon_0) \big) = S_0^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} D(\theta_0, \varepsilon_0)(O_i, Z_i) + o_P(1), \qquad (29.12)$$

where

$$S_0 = E_{Q_0,G^*(\theta_0)} \begin{pmatrix} \ddot{\ell}_{\theta_0,0}(O) \dfrac{g^r(A \mid V)}{G^*(\theta_0)(A \mid V)} & 0 \\[2ex] \Big[ \dfrac{\partial^2}{\partial \theta \, \partial \varepsilon} \ell_{\theta,\varepsilon}(O) \, G^*(\theta)(A \mid V) \Big]^\top_{(\theta,\varepsilon)=(\theta_0,\varepsilon_0)} \dfrac{1}{G^*(\theta_0)(A \mid V)} & \dfrac{\partial^2}{\partial \varepsilon^2} \ell_{\theta_0,\varepsilon}(O) \Big|_{\varepsilon=\varepsilon_0} \end{pmatrix}$$

is an invertible matrix. Furthermore, (29.12) entails that $\sqrt{n}\big((\theta_n, \varepsilon_n) - (\theta_0, \varepsilon_0)\big)$ converges in distribution to the centered Gaussian distribution with covariance matrix $S_0^{-1} \Sigma_0 (S_0^{-1})^\top$, where

$$\Sigma_0 = E_{Q_0,G^*(\theta_0)} \left( \frac{\widetilde{D}(\theta_0, \varepsilon_0) \widetilde{D}(\theta_0, \varepsilon_0)^\top(O)}{G^*(\theta_0)(A \mid V)^2} \right)$$

is a positive definite symmetric matrix. Moreover, S_0 is consistently estimated by

$$S_n = \frac{1}{n} \sum_{i=1}^{n} \begin{pmatrix} \ddot{\ell}_{\theta_n,0}(O_i) \dfrac{g^r(A_i \mid V_i)}{g_{Z_i}(A_i \mid V_i)} & 0 \\[2ex] \Big[ \dfrac{\partial^2}{\partial \theta \, \partial \varepsilon} \ell_{\theta,\varepsilon}(O_i) \, G^*(\theta)(A_i \mid V_i) \Big]^\top_{(\theta,\varepsilon)=(\theta_n,\varepsilon_n)} \dfrac{1}{g_{Z_i}(A_i \mid V_i)} & \dfrac{\partial^2}{\partial \varepsilon^2} \ell_{\theta_n,\varepsilon}(O_i) \Big|_{\varepsilon=\varepsilon_n} \dfrac{G^*(\theta_n)(A_i \mid V_i)}{g_{Z_i}(A_i \mid V_i)} \end{pmatrix},$$

and Σ_0 is consistently estimated by

$$\Sigma_n = \frac{1}{n} \sum_{i=1}^{n} D(\theta_n, \varepsilon_n) D(\theta_n, \varepsilon_n)^\top (O_i, Z_i).$$

We will investigate how the above central limit theorem translates into a central limit theorem for the TMLE. The proof of Proposition 29.5 still relies on the fact that (θ_n, ε_n) solves the martingale estimating equation $\sum_{i=1}^{n} D(\theta, \varepsilon)(O_i, Z_i) = 0$. It involves the Taylor expansion of D(θ, ε)(O, Z) at (θ_0, ε_0), a multidimensional central limit theorem for martingales, and again the Kolmogorov strong law of large numbers (Chambaz and van der Laan 2010, Theorems 8 and 10). Assumption A4 guarantees that Σ_0 is positive definite.

TMLE is consistent and asymptotically Gaussian. In the first place, the TMLE ψ*_n is robust: it is a consistent estimator even when the working model is misspecified.
Proposition 29.6. Suppose that assumptions A1, A2, and A3 from Propositions 29.2 and 29.4 hold. Then the TMLE ψ*_n consistently estimates the risk difference ψ_0.

If the design of the RCT were fixed (and, consequently, the first n observations were i.i.d.), then the TMLE would be a robust estimator of ψ_0: even if the working model is misspecified, the TMLE still consistently estimates ψ_0 because the treatment mechanism is known (or can be consistently estimated, if one wants to gain in efficiency). Thus, the robustness of the TMLE stated in Proposition 29.6 is the expected counterpart of the TMLE's robustness in the latter i.i.d. setting: expected because the TMLE solves a martingale estimating function that is unbiased for ψ_0 at misspecified Q and correctly specified g_i, i = 1, ..., n.

The proof of Proposition 29.6 is twofold. Setting Q^∼_n = (Q_W(·; P_0), Q_{Y|A,W}(·; θ_n, ε_n)), a continuity argument and the convergence in probability of the stacked estimator (θ_n, ε_n) to (θ_0, ε_0) entail the convergence in probability of Ψ(Q^∼_n) to Ψ(Q_{θ_0,ε_0}) = ψ_0 [see (1) in Proposition 29.4]. The conclusion follows because ψ*_n − Ψ(Q^∼_n) converges almost surely to zero by the Glivenko–Cantelli theorem [which, roughly speaking, guarantees that P_n f converges almost surely to P_0 f uniformly in f ∈ F = {Q̄(1, ·; θ, ε) − Q̄(0, ·; θ, ε) : (θ, ε) ∈ Θ × E} because the set F is moderately complex].

The TMLE ψ*_n is also asymptotically linear and therefore satisfies a central limit theorem. To see this, let us introduce the real-valued function φ on Θ × E such that φ(θ, ε) = Ψ(Q_{θ,ε}). Because the function φ is differentiable on the interior of Θ × E, we denote its gradient at (θ, ε) by φ′_{θ,ε}.
The latter gradient satisfies

$$\varphi'_{\theta,\varepsilon} = E_{Q_{\theta,\varepsilon},G^*(\theta)} \left\{ D^*\big(O; P_{Q_{\theta,\varepsilon},G^*(\theta)}\big) \left( \frac{\partial}{\partial \theta} \ell^\top_{\theta,\varepsilon}(O), \; \frac{\partial}{\partial \varepsilon} \ell_{\theta,\varepsilon}(O) \right)^\top \right\}.$$

Note that the right-hand-side expression cannot be computed explicitly because the marginal distribution Q_W(·; P_0) is unknown. By the law of large numbers (independent case), we can build an estimator φ′_n of φ′_{θ_0,ε_0} as follows. For B a large number (say B = 10⁴), simulate B independent copies Õ_b of O from the data-generating distribution $P_{Q^\sim_n,G^*(\theta_n)}$, and compute

$$\varphi'_n = \frac{1}{B} \sum_{b=1}^{B} D^*\big(\tilde{O}_b; P_{Q^\sim_n,G^*(\theta_n)}\big) \left( \frac{\partial}{\partial \theta} \ell^\top_{\theta,\varepsilon_n}(\tilde{O}_b) \Big|_{\theta=\theta_n}, \; \frac{\partial}{\partial \varepsilon} \ell_{\theta_n,\varepsilon}(\tilde{O}_b) \Big|_{\varepsilon=\varepsilon_n} \right)^\top.$$

Proposition 29.7. Suppose that assumptions A1, A2, A3, and A4 from Propositions 29.2, 29.4, and 29.5 hold. Then the following asymptotic linear expansion holds:

$$\sqrt{n}\,(\psi^*_n - \psi_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} IC(O_i, Z_i) + o_P(1), \qquad (29.13)$$

where

$$IC(O, Z) = D^*_1(W; Q_{\theta_0,\varepsilon_0}) + \varphi'^\top_{\theta_0,\varepsilon_0} S_0^{-1} D(\theta_0, \varepsilon_0)(O, Z). \qquad (29.14)$$

Furthermore, (29.13) entails that $\sqrt{n}\,(\psi^*_n - \psi_0)$ converges in distribution to the centered Gaussian distribution with a variance consistently estimated by


$$s^2_n = \frac{1}{n} \sum_{i=1}^{n} D^*_1(W_i; Q^*_n)^2 + \frac{2}{n} \sum_{i=1}^{n} D^*_1(W_i; Q^*_n)\, \varphi'^\top_n S_n^{-1} D(\theta_n, \varepsilon_n)(O_i, Z_i) + \big(\varphi'^\top_n S_n^{-1}\big) \Sigma_n \big(\varphi'^\top_n S_n^{-1}\big)^\top.$$

Proposition 29.7 is the backbone of the statistical analysis of adaptive group sequential RCTs as constructed in Sect. 29.4. In particular, denoting the (1 − α)-quantile of the standard normal distribution by ξ_{1−α}, the proposition guarantees that the asymptotic level of the confidence interval

$$\left[ \psi^*_n \pm \frac{s_n}{\sqrt{n}} \, \xi_{1-\alpha/2} \right] \qquad (29.15)$$

for the risk difference ψ_0 is (1 − α).

The proof of (29.13) relies again on writing $\sqrt{n}\,(\psi^*_n - \psi_0) = \sqrt{n}\,(\psi^*_n - \Psi(Q^\sim_n)) + \sqrt{n}\,(\Psi(Q^\sim_n) - \psi_0)$. It is easy to derive the asymptotic linear expansion of the first term [the influence function is D*_1(·; Q_{θ_0,ε_0})]. Moreover, the delta method and (29.12) provide the asymptotic linear expansion of the second term. Thus, the influence function IC is known in closed form. A central limit theorem for martingales (Chambaz and van der Laan 2010, Theorem 9) applied to (29.13) yields the stated convergence and validates the use of s²_n as an estimator of the asymptotic variance.

Extensions. We conjecture that the influence function IC computed at (O, Z), (29.14), is equal to

$$D^*_1(W; Q_{\theta_0,\varepsilon_0}) + D^*_2\big(O; P_{Q_{\theta_0,\varepsilon_0},G^*(\theta_0)}\big) \frac{G^*(\theta_0)(A \mid V)}{g_Z(A \mid V)}.$$

This conjecture is backed by the simulations that we carry out and present in Sect. 29.6. We will tackle the proof of the conjecture in future work. Let us assume for the moment that the conjecture is true. Then the asymptotic linear expansion (29.13) implies that the asymptotic variance of $\sqrt{n}\,(\psi^*_n - \psi_0)$ can be consistently estimated by
$$s^{*2}_n = \frac{1}{n} \sum_{i=1}^{n} \left( D^*_1(W_i; Q^*_n) + D^*_2\big(O_i; P_{Q^*_n,G^*(\theta_n)}\big) \frac{G^*(\theta_n)(A_i \mid V_i)}{g_{Z_i}(A_i \mid V_i)} \right)^2,$$

an independent argument showing that s*²_n converges toward

$$\operatorname{var}_{Q_0,G^*(\theta_0)} D^*\big(O; P_{Q_{\theta_0,\varepsilon_0},G^*(\theta_0)}\big),$$

i.e., the variance under the fixed design P_{Q_0,G*(θ_0)} of the efficient influence curve at $P_{Q_{\theta_0,\varepsilon_0},G^*(\theta_0)}$.

Furthermore, the most essential characteristic of the joint methodologies of design adaptation and TMLE is certainly the central role played by the likelihood. The targeted maximized log-likelihood of the data,

$$\sum_{i=1}^{n} \big( \log Q_W(W_i; P_n) + \log Q_{Y|A,W}(O_i; \theta_n, \varepsilon_n) \big),$$

provides us with a quantitative measure of the quality of the fit of the TMLE of Q_0 (targeted toward the parameter of interest). It is therefore possible, for example, to use that quantity for the sake of selection among different working models for Q_0. As with TMLE for i.i.d. data, we can use likelihood-based cross-validation to select among more general initial estimators indexed by fine-tuning parameters. The study of the validity of such TMLEs for the group sequential adaptive designs considered here is outside the scope of this chapter.

29.6 Simulations

We characterize the component Q_0 = Q(·; P_0) of the true distribution P_0 of the data structure O = (W, A, Y) as follows. The baseline covariate is W = (U, V), where U is uniformly distributed over the unit interval [0, 1], and the subgroup membership covariate V ∈ V = {1, 2, 3} (hence ν = 3) satisfies P_0(V = 1) = 1/2, P_0(V = 2) = 1/3, and P_0(V = 3) = 1/6.
The conditional distribution of Y given (A, W) is the gamma distribution characterized by the conditional mean

$$\bar{Q}_0(A, W) = 2U^2 + 2U + 1 + AV + \frac{1 - A}{1 + V}$$

and the conditional standard deviation

$$\sqrt{\operatorname{var}_{P_0}(Y \mid A, W)} = U + A(1 + V) + \frac{1 - A}{1 + V}.$$

The risk difference ψ_0 = Ψ(Q_0), our parameter of interest, is known in closed form, ψ_0 = 91/72 ≈ 1.264, as is the variance $v^b(Q_0) = \operatorname{var}_{Q_0,g^b} D^*(O; P_{Q_0,g^b})$ of the efficient influence curve under balanced sampling. The numerical value of v^b(Q_0) is reported in Table 29.1.

We target the design that (a) depends on the baseline covariate W = (U, V) only through V (i.e., belongs to G_1) and (b) minimizes the variance of the efficient influence curve of the parameter of interest Ψ. The latter treatment mechanism g*(Q_0) and the optimal efficient asymptotic variance $v^*(Q_0) = \operatorname{var}_{Q_0,g^*(Q_0)} D^*(O; P_{Q_0,g^*(Q_0)})$ are also known in closed form, and numerical values are reported in Table 29.1.

Table 29.1 Numerical values of the allocation probabilities and of the variance of the efficient influence curve. The ratio of the variances of the efficient influence curve under targeted optimal and balanced sampling schemes satisfies R(Q_0) = v*(Q_0)/v^b(Q_0) ≈ 0.762

  Sampling scheme (Q_0, g)      g(1|v=1)   g(1|v=2)   g(1|v=3)   var_{Q_0,g} D*(O; P_{Q_0,g})
  (Q_0, g^b)-balanced           1/2        1/2        1/2        23.864
  (Q_0, g*(Q_0))-optimal        0.707      0.799      0.849      18.181

Let n = (100, 250, 500, 750, 1000, 2500, 5000) be a sequence of sample sizes. We estimate M = 1000 times the risk difference ψ_0 = Ψ(Q_0) based on O^m(n_i), m = 1, ..., M, i = 1, ..., 7, under i.i.d. (Q_0, g^b)-balanced sampling, i.i.d. (Q_0, g*(Q_0))-optimal sampling, and (Q_0, g*)-adaptive sampling. Finally, we emphasize that the data structure O = (W, A, Y) is not bounded, whereas O is assumed bounded in Propositions 29.2–29.7.

For each v ∈ V, let us denote θ(v) = (β_v, σ²_v(0), σ²_v(1))^⊤ ∈ Θ_v, where Θ_v ⊂ R³ × R*_+ × R*_+ is compact, and β_v = (β_{v,1}, β_{v,2}, β_{v,3}) [b = 3 in (29.5)] is the vector of regression coefficients. Let θ = (θ(1)^⊤, θ(2)^⊤, θ(3)^⊤)^⊤ ∈ Θ = Θ_1 × Θ_2 × Θ_3. Following the description in Sect. 29.4.1, the working model Q^w_n that the TMLE methodology relies on is characterized by the conditional likelihood of Y given (A, W):

$$Q_{Y|A,W}(O; \theta) = \frac{1}{\sqrt{2\pi \sigma^2_V(A)}} \exp\left\{ -\frac{(Y - m(A, W; \beta_V))^2}{2 \sigma^2_V(A)} \right\},$$

where the conditional mean Q̄(A, W; θ) of Y given (A, W) is modeled as

$$\bar{Q}(A = a, W = w; \theta) = m(a, w; \beta_v) = \beta_{v,1} + \beta_{v,2} u + \beta_{v,3} a$$

for all a ∈ A and w = (u, v) ∈ W = R × V. As required, the parameterization condition is met. Obviously, the working model is heavily misspecified: a Gaussian conditional likelihood is used instead of a gamma conditional likelihood, and the parametric forms of the conditional expectation and variance are wrong, too.

Regarding the choice of a reference fixed design g^r ∈ G_1 (Sect. 29.4.1), we select g^r = g^b, the balanced design. The parameter θ_0 only depends on Q_0 and the working model, but its estimator θ_n depends on g^r, which may negatively affect its performance. Therefore, we propose to dilute the impact of the choice of g^r as an initial reference design as follows.
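Separately from the reference-design discussion above, the simulation scheme of this section can be sketched as follows. The sketch is ours (the chapter does not provide code): it samples O = (W, A, Y) with the stated marginal of W, a design that depends on W through V only, and a gamma conditional law matched to the stated conditional mean and standard deviation, and it verifies the closed-form value ψ_0 = 91/72.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw(n, g1):
    """Sample n observations O = (W, A, Y): U ~ Unif[0,1], V in {1,2,3}
    with probabilities (1/2, 1/3, 1/6), A | V ~ Bernoulli(g1[V]), and
    Y | (A, W) gamma with the stated conditional mean and sd."""
    U = rng.uniform(size=n)
    V = rng.choice([1, 2, 3], size=n, p=[1/2, 1/3, 1/6])
    A = rng.binomial(1, np.array([g1[v] for v in V]))
    mean = 2*U**2 + 2*U + 1 + A*V + (1 - A)/(1 + V)
    sd = U + A*(1 + V) + (1 - A)/(1 + V)
    shape, scale = (mean/sd)**2, sd**2/mean   # gamma matched by mean and sd
    Y = rng.gamma(shape, scale)
    return U, V, A, Y

U, V, A, Y = draw(1000, {1: 0.5, 2: 0.5, 3: 0.5})  # balanced design g^b

# Qbar_0(1,W) - Qbar_0(0,W) = V - 1/(1+V), so psi_0 = E[V - 1/(1+V)] = 91/72
psi0 = sum(p * (v - 1/(1 + v)) for v, p in [(1, 1/2), (2, 1/3), (3, 1/6)])
assert abs(psi0 - 91/72) < 1e-12
```

Under the balanced design, A is independent of W, so the raw difference of arm means is an unbiased (if inefficient) check on ψ_0 in large samples.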


29 TMLE in Adaptive Group Sequential Covariate-Adjusted RCTs 513

Let n = (100, 250, 500, 750, 1000, 2500, 5000) be a sequence of sample sizes. We estimate M = 1000 times the risk difference ψ0 = Ψ(Q0) based on O^m(n_i), m = 1, ..., M, i = 1, ..., 7, under i.i.d. (Q0, g^b)-balanced sampling, i.i.d. (Q0, g*(Q0))-optimal sampling, and (Q0, g*ₙ)-adaptive sampling. Finally, we emphasize that the data structure O = (W, A, Y) is not bounded, whereas O is assumed bounded in Propositions 29.2–29.7.

For each v ∈ 𝒱, let us denote θ(v) = (β_v, σ²_v(0), σ²_v(1))⊤ ∈ Θ_v, where Θ_v ⊂ ℝ³ × ℝ*₊ × ℝ*₊ is compact, and β_v = (β_{v,1}, β_{v,2}, β_{v,3}) [b = 3 in (29.5)] is the vector of regression coefficients. Let θ = (θ1⊤, θ2⊤, θ3⊤)⊤ ∈ Θ = Θ1 × Θ2 × Θ3. Following the description in Sect. 29.4.1, the working model 𝒬ʷₙ that the TMLE methodology relies on is characterized by the conditional likelihood of Y given (A, W),
\[
\frac{1}{\sqrt{2\pi \sigma_V^2(A)}} \exp\left\{ -\frac{(Y - m(A, W; \beta_V))^2}{2 \sigma_V^2(A)} \right\},
\]
where the conditional mean Q̄(Y; A, W; θ) of Y given (A, W) is modeled as
\[
\bar{Q}(Y; A = a, W = w; \theta) = m(a, w; \beta_v) = \beta_{v,1} + \beta_{v,2} u + \beta_{v,3} a,
\]
for all a ∈ 𝒜 and w = (u, v) ∈ 𝒲 = ℝ × 𝒱. As required, the parameterization condition is met. Obviously, the working model is heavily misspecified: a Gaussian conditional likelihood is used instead of a gamma conditional likelihood, and the parametric forms of the conditional expectation and variance are wrong, too.

Regarding the choice of a reference fixed design g^r ∈ 𝒢1 (Sect. 29.4.1), we select g^r = g^b, the balanced design. The parameter θ0 depends only on Q0 and the working model, but its estimator θₙ depends on g^r, which may negatively affect its performance. Therefore, we propose to dilute the impact of the choice of g^r as an initial reference design as follows.
For a given sample size n, we first compute a first estimate θ¹ₙ of θ0 as in (29.6), but with ⌈n/4⌉ (the smallest integer not smaller than n/4) substituted for n in the sum. Then θₙ is computed as in (29.6), but this time with G*(θ¹ₙ)(A_i | V_i) substituted for g^r(A_i | V_i). The proofs can be adapted to incorporate this modification of the procedure. We refer the interested reader to van der Laan (2008b, Section 8.5).

We update the design each time c = 25 new observations are sampled. In addition, the first update only occurs once there are at least 5 completed observations in each treatment arm and for all V-strata. Thus, the minimal sample size at the first update is 30, and it can be shown that, under the balanced design, the expected sample size at the first sample size at which there are at least 5 observations in each arm equals 75. Finally, as a precautionary measure, we systematically apply a thresholding to the updated treatment mechanism: using the notation of Sect. 29.4, we substitute max{δ, min{1−δ, g*_i(A_i | V_i)}} for g*_i(A_i | V_i) in all computations. We arbitrarily choose δ = 0.01.

We now invoke the central limit theorem stated in Proposition 29.7 to construct confidence intervals for the risk difference. Let us introduce, for all types of sampling and each sample size n_i, the confidence intervals
\[
I_{n_i,m} = \left[ \psi^*_{n_i}(O^m(n_i)) \pm \xi_{1-\alpha/2} \sqrt{\frac{v_{n_i}(O^m(n_i))}{n_i}} \right], \quad m = 1, \ldots, M,
\]
where the definition of the variance estimator v_n(O^m(n)) based on the n first observations O^m(n) depends on the sampling scheme. Under i.i.d. (Q0, g^b)-balanced sampling, v_n(O^m(n)) is the estimator of the asymptotic variance of the TMLE Ψ(Q*_{n,iid}):
\[
v_n(O^m(n)) = \frac{1}{n} \sum_{i=1}^{n} D^*(O^m_i; P_{Q^*_{n,\mathrm{iid}}, g^b})^2. \tag{29.16}
\]
Under i.i.d. (Q0, g*(Q0))-optimal sampling, v_n(O^m(n)) is defined as in (29.16), replacing g^b with g*(Q0). Lastly, under (Q0, g*ₙ)-adaptive sampling, v_n(O^m(n)) = s*²ₙ(O^m(n)), the estimator of the conjectured asymptotic variance of √n(ψ*ₙ(O^m(n)) − ψ0) computed on the n first observations O^m(n). We are interested in the empirical coverage (reported in Table 29.2, top rows)
\[
c_{n_i} = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{\psi_0 \in I_{n_i,m}\}
\]
guaranteed for each sampling scheme and every i = 1, ..., 7 by {I_{n_i,m} : m = 1, ..., M}. The rescaled empirical coverage proportions M c_{n_i} should follow a binomial distribution with parameters (M, 1−a) and a = α for every i = 1, ..., 7. This property can be tested with a standard binomial test, the alternative hypothesis stating that a > α. This results in a collection of seven p-values for each sampling scheme, as reported in Table 29.2 (bottom rows).

Considering each sampling scheme (i.e., each row of Table 29.2) separately, we conclude that the (1−α)-coverage cannot be declared defective under i.i.d. (Q0, g^b)-balanced sampling for any sample size n_i ≥ n_3 = 500, under i.i.d. (Q0, g*(Q0))-optimal sampling for any sample size n_i ≥ n_2 = 250, and under (Q0, g*ₙ)-adaptive sampling for any sample size n_i ≥ n_1 = 100, adjusting for multiple testing with the Benjamini–Yekutieli procedure for controlling the false discovery rate at level 5%.

This is a remarkable result that not only validates the theory but also provides insight into the finite sample properties of the TMLE procedure based on adaptive sampling. The fact that the TMLE procedure behaves better under an adaptive sampling scheme than under a balanced i.i.d. sampling scheme at sample size n_1 = 100 is probably not due to mere chance.
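The binomial check of the empirical coverage can be reproduced with an exact one-sided binomial test. The chapter does not specify an implementation; the pure-Python sketch below is ours.

```python
from math import comb

def binom_sf(k, n, p):
    """Exact upper tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def coverage_pvalue(coverage, M=1000, alpha=0.05):
    """One-sided p-value for H0: miss probability a = alpha against
    H1: a > alpha, given the empirical coverage c_{n_i} over M intervals."""
    misses = round(M * (1 - coverage))
    return binom_sf(misses, M, alpha)
```

For instance, a coverage of 0.913 at M = 1000 corresponds to about 87 misses, far in the upper tail of Binomial(1000, 0.05), whereas a coverage of 0.95 is exactly what the null predicts.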
Although the TMLE procedure based

Table 29.2 Checking the validity of the coverage of our simulated confidence intervals: values c_{n_i} (top row) and p-values (bottom row, between parentheses)

Sampling scheme                  n_1     n_2     n_3     n_4     n_5     n_6     n_7
i.i.d. (Q0, g^b)-balanced        0.913   0.925   0.939   0.934   0.945   0.940   0.946


on an adaptive sampling scheme is initiated under the balanced sampling scheme (so that each stratum consists at the beginning of comparable numbers of patients assigned to each treatment arm, allowing one to estimate, at least roughly, the required parameters), it starts deviating from it [as soon as every (A, V)-stratum counts 5 patients] each time 25 new observations are accrued. The poor performance of the TMLE procedure based on an optimal i.i.d. sampling scheme at sample size n_1 is certainly due to the fact that, by starting directly from the optimal sampling scheme (a choice we would not recommend in practice), too few patients from stratum V = 3 are assigned to treatment arm A = 0 among the n_1 first subjects. At larger sample sizes, the TMLE procedure performs equally well under the adaptive sampling scheme and under both i.i.d. schemes in terms of coverage.

Now that we know that the TMLE-based confidence intervals obtained under (Q0, g*ₙ)-adaptive sampling are valid confidence regions, it is of interest to compare their widths with those of their counterparts obtained under the i.i.d. (Q0, g^b)-balanced or (Q0, g*(Q0))-optimal sampling schemes.
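The width comparisons that follow rely on the two-sample Kolmogorov–Smirnov test. As a concrete reference, here is a pure-Python sketch of the two-sample KS statistic; the p-value computation and the one-sided variant used in the text are omitted, and this is our illustration rather than the authors' code.

```python
def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic sup_x |F_xs(x) - F_ys(x)|,
    where F denotes the empirical cumulative distribution function."""
    xs, ys = sorted(xs), sorted(ys)
    points = xs + ys  # the supremum is attained at a sample point

    def ecdf(sample, x):
        return sum(1 for s in sample if s <= x) / len(sample)

    return max(abs(ecdf(xs, x) - ecdf(ys, x)) for x in points)
```

Fully separated samples give the maximal value 1, identical samples give 0.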
For this purpose, we compare, for each sample size n_i, the empirical distribution of {√v_n(O^m(n_i)) : m = 1, ..., M}, as in (29.16) [i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size n_i obtained under i.i.d. (Q0, g^b)-balanced sampling, up to the factor 2ξ_{1−α/2}/√n_i], to the empirical distribution of {s*ₙ(O^m(n_i)) : m = 1, ..., M} [i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size n_i obtained under (Q0, g*ₙ)-adaptive sampling, up to the factor 2ξ_{1−α/2}/√n_i] in terms of the two-sample Kolmogorov–Smirnov test, where the alternative states that the confidence intervals obtained under adaptive sampling are stochastically smaller than their counterparts under i.i.d. balanced sampling. This results in seven p-values, all equal to zero, which we nonetheless report in Table 29.3 (bottom row). In order to get a sense of how much narrower the confidence intervals obtained under adaptive sampling are, we also compute and report in Table 29.3 (top row) the ratios of empirical average widths,
\[
\frac{\frac{1}{M}\sum_{m=1}^{M} s^*_n(O^m(n_i))}{\frac{1}{M}\sum_{m=1}^{M} \sqrt{v_n(O^m(n_i))}}, \tag{29.17}
\]
for each sample size n_i. Informally, this shows a 12% gain in width.

On the other hand, we also compare, for each sample size n_i, the empirical distribution of {√v_n(O^m(n_i)) : m = 1, ..., M}, as in (29.16) but replacing g^b by g*(Q0) [i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size n_i obtained under i.i.d.
(Q0, g*(Q0))-optimal sampling, up to the factor 2ξ_{1−α/2}/√n_i] to the empirical distribution of {s*ₙ(O^m(n_i)) : m = 1, ..., M} [i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size n_i obtained under (Q0, g*ₙ)-adaptive sampling, up to the factor 2ξ_{1−α/2}/√n_i] in terms of the two-sample Kolmogorov–Smirnov test, the alternative stating that the confidence intervals obtained under adaptive sampling are stochastically larger than their counterparts under i.i.d. optimal sampling. This results in seven p-values that we report in Table 29.3 (bottom row). In order to get a sense of how similar the confidence intervals obtained under adaptive and i.i.d. optimal sampling schemes

Table 29.3 Comparing the widths of our confidence intervals: ratios of average widths as defined in (29.17) (top rows) and p-values (bottom rows, between parentheses)

Comparison                         n_1      n_2      n_3      n_4      n_5      n_6      n_7
(Q0, g*ₙ) vs. (Q0, g^b)            0.856    0.871    0.879    0.880    0.878    0.877    0.876
                                   (0)      (0)      (0)      (0)      (0)      (0)      (0)
(Q0, g*ₙ) vs. (Q0, g*(Q0))         0.962    0.977    0.992    0.995    0.997    1.000    1.000
                                   (0.144)  (0.236)  (0.100)  (0.060)  (0.407)  (0.236)  (0.144)

are, we also compute and report for each sample size n_i in Table 29.3 (top row) the ratios of empirical average widths as in (29.17), replacing again g^b by g*(Q0) in the definition (29.16) of v_n(O^m(n)). Informally, this shows that the confidence intervals obtained under adaptive sampling are even slightly narrower on average than their counterparts obtained under i.i.d. optimal sampling.

Illustration.
So far we have been concerned with distributional results; we now investigate a particular, arbitrarily selected simulated trajectory of the TMLE, the confidence intervals, and the adaptive design g*ₙ as a function of sample size. Some interesting features of the selected simulated trajectory are apparent in Fig. 29.2 and Table 29.4. For instance, we can follow the convergence of the TMLE ψ*ₙ toward the true risk difference ψ0 in the top plot of Fig. 29.2 and in the fifth column of Table 29.4. Similarly, the middle plot of Fig. 29.2 and the second to fourth columns of Table 29.4 illustrate the convergence of g*ₙ toward G*(θ0), as stated in Proposition 29.3. What these plots and columns also teach us is that, in spite of the misspecified working model, the learned design G*(θ0) seems very close to the optimal treatment mechanism g*(Q0) for the simulation scheme and working model used in our study. Moreover, the last column of Table 29.4 illustrates how the confidence intervals [ψ*ₙ ± s*ₙ ξ_{1−0.05/2}/√n] shrink around the true risk difference ψ0 as the sample size increases.

Yet the bottom plot of Fig. 29.2 may be the most interesting of the three. It obviously illustrates the convergence of s*²ₙ toward var_{Q0,G*(θ0)} D*(O; P_{Q_{θ0,ε0},G*(θ0)}), i.e., toward the variance under the fixed design P_{Q0,G*(θ0)} of the efficient influence curve at P_{Q_{θ0,ε0},G*(θ0)}. Hence, it also teaches us that the latter limit seems very close to the optimal asymptotic variance v*(Q0) for the chosen simulation scheme and working model. More importantly, s*²ₙ strikingly converges to v*(Q0) from below. This finite sample characteristic may reflect the fact that the true finite sample variance of √n(ψ*ₙ − ψ0) might be lower than v*(Q0).
Studying this issue in depth is certainly very delicate and goes beyond the scope of this chapter.


Fig. 29.2 Illustrating the TMLE procedure under the (Q0, g*ₙ)-adaptive sampling scheme. Top plot: the sequence ψ*ₙ(O¹(n)); the horizontal gray line indicates the true value of the risk difference ψ0. Middle plot: the three sequences g*ₙ(1 | 1) = G*(θₙ(O¹(n)))(1 | 1) (bottom curve), g*ₙ(1 | 2) = G*(θₙ(O¹(n)))(1 | 2) (middle curve), and g*ₙ(1 | 3) = G*(θₙ(O¹(n)))(1 | 3) (top curve); the three horizontal gray lines indicate the optimal allocation probabilities g*(Q0)(1|1) (bottom line), g*(Q0)(1|2) (middle line), and g*(Q0)(1|3) (top line). Bottom plot: the sequence s*²ₙ of estimated asymptotic variances of √n(ψ*ₙ − ψ0); the horizontal gray line indicates the value of the optimal variance v*(Q0). (The x-axis is on a logarithmic scale in all plots)

Concluding Remarks

We developed the TMLE in group sequential adaptive designs for the data structure O = (W, A, Y). This generalizes the TMLE for the i.i.d. design for this data structure O, as covered in depth in the first part of this book. In addition, we showed that targeted learning goes beyond targeted estimation and starts with the choice of design. Our targeted adaptive designs combined with the TMLE provide a fully targeted methodology, including design, for statistical inference with respect to a causal effect of interest.
In the previous two chapters, we demonstrated (1) how to fully integrate the state of the art in machine learning while fully preserving the CLT for statistical inference and (2) the application of TMLE for the purpose of obtaining a targeted posterior distribution of the target parameter of interest, thereby improving on the current standard in Bayesian learning.

This book demonstrated that targeted learning with TMLE represents a unified optimal approach for learning from data that can be represented as realizations of n i.i.d. random variables. However, these last three chapters provide insight into the enormous reach of targeted learning, covering Bayesian learning, integrating the most adaptive machine learning algorithms, targeted designs, and statistical inference for targeted adaptive designs that generate dependent data. Further exciting research in these areas is needed, but it appears that targeted learning based on super learning and TMLE provides a road map for developing optimal tools to attack upcoming statistical challenges.

Table 29.4 Simulation results of the TMLE procedure under a (Q0, g*ₙ)-adaptive sampling scheme

Sample size   g*ₙ(1|1)   g*ₙ(1|2)   g*ₙ(1|3)   ψ*ₙ      (ψ*ₙ ± s*ₙ ξ_{1−0.05/2}/√n)
n_1           0.589      0.764      0.766      1.252    (0.722, 1.783)
n_2           0.624      0.775      0.707      1.388    (0.974, 1.802)
n_3           0.679      0.767      0.795      1.361    (1.037, 1.685)
n_4           0.677      0.757      0.813      1.341    (1.068, 1.615)
n_5           0.670      0.760      0.806      1.250    (1.012, 1.488)
n_6           0.677      0.788      0.835      1.288    (1.126, 1.451)
n_7           0.694      0.793      0.834      1.273    (1.157, 1.389)


hal-00576070, version 1 - 11 Mar 2011

Classification in postural style

Antoine Chambaz and Christophe Denis∗
MAP5, Université Paris Descartes and CNRS

March 11, 2011

Abstract

This article contributes to the search for a notion of postural style, focusing on the issue of classifying subjects in terms of how they maintain posture. In the longer term, the hope is to make it possible to determine on a case-by-case basis which sensorial information is prevalent in postural control, and to improve or adapt protocols for functional rehabilitation among those who show deficits in maintaining posture, typically seniors. Here, we specifically tackle the statistical problem of classifying subjects sampled from a two-class population. Each subject (enrolled in a cohort of 54 participants) undergoes four experimental protocols which are designed to evaluate potential deficits in maintaining posture. This results in four complex trajectories, from which we decide to extract four small-dimensional summary measures. Because undergoing several protocols is at least unpleasant, and sometimes painful, we try to limit the number of protocols needed for the classification. Therefore, we first rank the protocols by decreasing order of relevance, then we derive four plug-in classifiers which involve the best (i.e., most informative), the two best, the three best and all four protocols. This two-step procedure relies on the cutting-edge methodologies of targeted maximum likelihood learning (a novel methodology for robust and efficient estimation and testing) and super-learning (a machine learning procedure for aggregating various estimation procedures into a single better estimation procedure). A simulation study is carried out.
The performance of the procedure applied to the real dataset (and evaluated by the leave-one-out rule) goes as high as an 87% rate of correct classification (47 out of 54 subjects correctly classified), using only the best protocol.

1 Introduction

This article contributes to the search for a notion of postural style, focusing on the issue of classifying subjects in terms of how they maintain posture.

Posture is fundamental to all activities, including locomotion and prehension. Posture is the fruit of a dynamic analysis by the brain of visual, proprioceptive and vestibular information. Proprioceptive information stems from the ability to sense the position, location, orientation and movement of the body and its parts. Vestibular information roughly relates to the sense of equilibrium. Naturally, every individual develops his or her own preferences according to his or her sensorimotor experience. Sometimes, a single kind of information (usually visual) is processed in all situations. Although this kind of processing may be efficient for maintaining posture in one's usual environment, it is likely not adapted to reacting to new or unexpected situations. Such situations may result in falling, the consequences of a fall being particularly bad in seniors. In the longer term, the hope is to make it possible to determine on a case-by-case basis which sensorial information is prevalent in postural control, and to improve or adapt protocols for functional rehabilitation among those who show deficits in maintaining posture, typically seniors.

As in earlier studies (see [2, 5] and references therein), our approach to characterizing postural control involves the use of a force-platform. Subjects standing on a force-platform are exposed to different perturbations, following different experimental protocols (or simply protocols in the sequel).
The force-platform records over time the center-of-pressure of each foot, that is, "the position of the global ground reaction forces that accommodates the sway of the body" [10]. A protocol is divided into three phases: a first phase without perturbation, followed by a second phase with perturbation, itself followed by a last phase without perturbation. Different kinds of perturbations are considered. They can be characterized as visual, proprioceptive or vestibular, depending on which sensorial system is perturbed.

We specifically tackle the statistical problem of classifying subjects sampled from a two-class population. The first class regroups subjects who do not show any deficit in postural control. The second class regroups hemiplegic subjects, who suffer from a proprioceptive deficit. Even though differentiating two subjects from the two groups is relatively easy by visual inspection, it is a much more delicate task when relying on some general baseline covariates and the trajectories provided by a force-platform. Furthermore, since undergoing several protocols is at least unpleasant, and sometimes painful (some sensitive subjects have to lie down for 15 minutes in order to recover from dizziness after a series of protocols), we also try to limit the number of protocols used for classifying.

Our classification procedure relies on cutting-edge statistical methodologies. In particular, we come up with a useful preliminary ranking of the four protocols (in view of how much we can learn from them about postural control) which involves the targeted maximum likelihood methodology [16, 12], a novel statistical procedure for robust and efficient estimation and testing.
The targeted maximum likelihood methodology itself relies on the super-learning procedure, a machine learning methodology for aggregating various estimation procedures (or simply estimators) into a single better estimation procedure/estimator [15, 12]. In addition to being a key element of the targeted maximum likelihood ranking of the protocols, the super-learning procedure also plays a crucial role in the construction of our classification procedure itself.

We show that it is possible to achieve an 87% rate of correct classification (47 out of 54 subjects correctly classified; the performance is evaluated by the leave-one-out rule), using only the most informative protocol. Our classification procedure is easy to generalize (we actually provide an example of generalization), so we reasonably hope that even better results are within reach (especially considering that more data should soon augment our small dataset). The interest of the article goes beyond the specific application. It nicely illustrates the versatility and power of the targeted maximum likelihood and super-learning methodologies. It also shows that retrieving and comparing small-dimensional summary measures from complex trajectories may be convenient for classifying them.

The article is organized as follows. In Section 2, we describe the dataset which is at the core of the study. The classification procedure is formally presented in Section 3, and its performance, evaluated by simulations, is discussed in Section 4. We report in Section 5 the results obtained by applying the latter classification procedure to the real dataset.
We relegate to the Appendix a self-contained presentation of the super-learning procedure as it is used here, and the description of an estimation procedure/estimator that plays a great role in the super-learning procedure applied to the construction of our classification procedure.

2 Data description

The dataset, collected at the Center for the study of sensorimotor functioning (CESEM) of the University Paris Descartes, is described in Section 2.1. We motivate the introduction of a summarized version of each observation, and present its construction, in Section 2.2.

∗The authors thank I. Bonan (Service de Médecine Physique et de Réadaptation, CHU Rennes) and P-P. Vidal (CESEM, Université Paris Descartes) for introducing them to this interesting problem and providing the dataset. They also warmly thank A. Samson (MAP5, Université Paris Descartes) for several fruitful discussions.


2.1 Original dataset

Each subject undergoes four protocols that are designed to evaluate potential deficits in maintaining posture. The specifics of these protocols are presented in Table 1. Protocols 1 and 2 respectively perturb the processing of visual and proprioceptive information by the brain. Protocol 3 cumulates both perturbations. Protocol 4 relies on perturbing the processing of vestibular information by the brain through a visual stimulation.

A total of n = 54 subjects are enrolled. For each of them, the age, gender, laterality (the preference that most humans show for one side of their body over the other), height and weight are collected. Among the 54 subjects, 22 are hemiplegic (due to a cerebrovascular accident), and therefore suffer from a proprioceptive deficit in postural control. Initial medical examinations concluded that the 32 other subjects show no pronounced deficits in postural control. We will refer to those subjects as normal subjects.

protocol   1st phase (0→15s)    2nd phase (15→50s)                   3rd phase (50→70s)
1          no perturbation      eyes closed                          no perturbation
2          no perturbation      muscular stimulation                 no perturbation
3          no perturbation      eyes closed + muscular stimulation   no perturbation
4          no perturbation      optokinetic stimulation              no perturbation

Table 1: Specifics of the four protocols designed to evaluate potential deficits in postural control. A protocol is divided into three phases: a first phase without perturbation of the posture is followed by a second phase with perturbations, which is itself followed by a last phase without perturbation. Different kinds of perturbations are considered.
They can be characterized as visual (closing the eyes), proprioceptive (muscular stimulation) or vestibular (optokinetic stimulation), depending on which sensorial information is perturbed.

For each protocol, the center-of-pressure of each foot is recorded over time. Thus each protocol results in a trajectory (X_t)_{t∈T} = (L_t, R_t)_{t∈T}, where L_t = (L¹_t, L²_t) ∈ ℝ² (respectively, R_t = (R¹_t, R²_t)) gives the position of the center-of-pressure of the left (respectively, right) foot on the force-platform at time t, for each t in T = {kδ : 1 ≤ k ≤ 2800}, where the time-step δ = 0.025 seconds (the protocol lasts 70 seconds). We represent in Figure 1 two such trajectories (X_t)_{t∈T}, associated with a normal subject and a hemiplegic subject, both undergoing the third protocol (see Table 1). Note that we do not take into account the first few seconds of the recording that a generic subject needs to reach a stationary behavior.

Figure 1 confirms the intuition that the structure of a generic trajectory (X_t)_{t∈T} is complicated, and that a mere visual inspection is, at least in this example, of little help for differentiating the normal and hemiplegic subjects. Although several articles investigate how to model and use such trajectories directly [2, 5], we choose instead to rely on a summary measure of (X_t)_{t∈T} rather than on (X_t)_{t∈T} itself.

2.2 The making of the summary measure

The summary measure that we construct is actually a summary measure of a one-dimensional trajectory (C_t)_{t∈T} that we initially derive from (X_t)_{t∈T}. First, we introduce the trajectory of barycenters, (B_t)_{t∈T} = (½(L_t + R_t))_{t∈T}. Second, we evaluate a reference position b, defined as the componentwise median value of (B_t)_{t∈T∩[0,15]} (that is, the median value over the first phase of the protocol).
Third, we set C_t = ‖B_t − b‖₂ for all t ∈ T, the Euclidean distance between B_t and the reference position b, which provides a relevant description of the sway of the body during the course of the protocol. We plot in Figure 2 two examples of (C_t)_{t∈T}, corresponding to two different protocols undergone by a hemiplegic subject.

Now, it is arguably in the neighborhood (in time) of the beginning and ending of the second phase that the most characteristic features of a trajectory should be sought. As an illustration, it is striking that one easily recovers the beginning and ending times of the second phase of the third protocol from the right-hand plot in Figure 2, but not those of protocol 1 from the left-hand plot of the same figure. Therefore, we decide to focus on the following finite-dimensional summary measure of (X_t)_{t∈T} (through (C_t)_{t∈T}):
\[
Y = (\bar{C}_1^+ - \bar{C}_1^-, \; \bar{C}_2^- - \bar{C}_1^+, \; \bar{C}_2^+ - \bar{C}_2^-), \tag{1}
\]
where
\[
\bar{C}_1^- = \frac{\delta}{5} \sum_{t\in T\cap[10,15[} C_t, \quad \bar{C}_1^+ = \frac{\delta}{5} \sum_{t\in T\cap]15,20]} C_t, \quad \bar{C}_2^- = \frac{\delta}{5} \sum_{t\in T\cap[45,50[} C_t, \quad \bar{C}_2^+ = \frac{\delta}{5} \sum_{t\in T\cap]50,55]} C_t
\]
are the averages of C_t computed over the intervals [10, 15[, ]15, 20], [45, 50[ and ]50, 55] (that is, over the last/first 5 seconds before/after the beginning/ending of the second phase of the protocol of interest). We arbitrarily choose this 5-second threshold. Note that it is not necessary to consider the remaining differences C̄₂⁻ − C̄₁⁻ = Y₁ + Y₂, C̄₂⁺ − C̄₁⁺ = Y₂ + Y₃ and C̄₂⁺ − C̄₁⁻ = Y₁ + Y₂ + Y₃, since they are all linear combinations of the components of Y.
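The construction of (C_t) and Y can be sketched as follows. This is our reading of Eq. (1): the half-open interval endpoints are handled approximately (all four averaging windows are taken as [lo, hi[), and the helper names are ours, not the authors'.

```python
DELTA = 0.025  # time step; t = k * DELTA for k = 1..2800 (the protocol lasts 70 s)

def summary_measure(L, R):
    """Compute the summary measure Y of Eq. (1) from the left/right
    center-of-pressure trajectories, given as lists of (x, y) pairs."""
    n = len(L)
    times = [DELTA * (k + 1) for k in range(n)]
    # Barycenter trajectory B_t = (L_t + R_t) / 2.
    B = [((lx + rx) / 2, (ly + ry) / 2) for (lx, ly), (rx, ry) in zip(L, R)]

    # Reference position b: componentwise median over the first phase [0, 15].
    first = [p for p, t in zip(B, times) if t <= 15]

    def median(vals):
        vals = sorted(vals)
        mid = len(vals) // 2
        return vals[mid] if len(vals) % 2 else (vals[mid - 1] + vals[mid]) / 2

    b = (median([p[0] for p in first]), median([p[1] for p in first]))
    # C_t: Euclidean distance between B_t and the reference position b.
    C = [((px - b[0]) ** 2 + (py - b[1]) ** 2) ** 0.5 for px, py in B]

    def avg(lo, hi):  # average of C_t over t in [lo, hi[
        vals = [c for c, t in zip(C, times) if lo <= t < hi]
        return sum(vals) / len(vals)

    c1m, c1p = avg(10, 15), avg(15, 20)  # around the start of phase 2
    c2m, c2p = avg(45, 50), avg(50, 55)  # around the end of phase 2
    return (c1p - c1m, c2m - c1p, c2p - c2m)
```

A perfectly still subject (constant trajectories) yields Y = (0, 0, 0), as expected from the definition.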
We refer to Figure 3 for a visual representation of the definition of the summary measure Y.

3 Classification procedure

We describe hereafter our two-step classification procedure. We formally introduce the statistical framework in Section 3.1. The first step of the classification procedure consists of ranking the protocols from the most to the least informative with respect to some criterion; see Section 3.2. The second step consists of the classification itself; see Section 3.3.


the more knowledge on A we expect to gain from the observation of W and the summary measure Y^j_i (i.e., by comparing the averages of (C_t)_{t∈T} computed over the time intervals corresponding to index i; see Section 2.2). For instance, say that Ψ²₁(P₀) > 0: this means that (in P₀-average) the variation in mean of the mean postures C̄₁⁻ and C̄₁⁺ of a hemiplegic subject, computed before and after the beginning of the muscular perturbation, is larger than that of a normal subject. In words, the postural control of a hemiplegic subject is more affected by the beginning of the muscular perturbation than the postural control of a normal subject.

3.2 Targeted maximum likelihood ranking of the protocols

Figure 2: Representing the trajectories t ↦ C_t over T which correspond to two different protocols undergone by a hemiplegic subject (protocol 1 on the left, protocol 3 on the right).

Our ranking of the four protocols relies on testing the null hypotheses
\[
\text{``}\Psi^j_i(P_0) = 0\text{''}, \quad (i,j) \in \{1,2,3\} \times \{1,2,3,4\},
\]

Figure 3: Visual representation of the definition of the finite-dimensional summary measure Y of (X_t)_{t∈T}. The four horizontal segments (solid lines) represent, from left to right, the averages C̄₁⁻, C̄₁⁺, C̄₂⁻, C̄₂⁺ of (C_t)_{t∈T} over the intervals [10, 15[, ]15, 20], [45, 50[, ]50, 55]. The three vertical segments (solid lines ending with an arrow) represent, from top to bottom, the components Y₁, Y₂ and Y₃ of Y.
Two additional verticallines indicate the beginning and ending of the second phase of the consi<strong>de</strong>red protocol.3.1 Statistical frameworkThe observed data structureOwrites asO=(W,A,Y 1 ,Y 2 ,Y 3 ,Y 4 ), where•W∈ R×{0, 1} 2 ×R 2 is the vector of baseline covariates (corresponding to initial age, gen<strong>de</strong>r,laterality, height and weight, see Section 2.1);•A∈{0, 1} indicates the subject’s class (with conventionA = 1 for hemiplegic subjects andA = 0 for normal subjects);• for eachj∈{1, 2, 3, 4},Y j ∈ R 3 is the summary measure (as <strong>de</strong>fined in (1)) associated withthejth protocol.We <strong>de</strong>note byP 0 the true distribution ofO. Since we do not know much aboutP 0 , we simplysee it as an element of the non-parametric setMof all possible distributions ofO.We need a criterion to rank the four protocols from the most to the less informative in viewof the subject’s class. To this end, we introduce the functional Ψ :M→R 12 such that, for anyP∈M, Ψ(P) = (Ψ j (P)) 1≤j≤4 where{ })Ψ j (P) =(E P E P [Y ji |A = 1,W]−E P [Y ji |A = 0,W] .1≤i≤3The component Ψ j i (P) is known in the literature as the variable importance measure ofAon thesummary measureY ji controlling forW [12]. Un<strong>de</strong>r causal assumptions, it can be interpreted as theeffect ofAonY ji . More generally, we are interested in Ψj i (P 0) because the further it is from zero,hal-00576070, version 1 - 11 Mar 2011against their two-si<strong>de</strong>d alternatives. Heuristically, rejecting “Ψ j i (P 0) = 0" tells us that the valueof theith coordinate of the summary measureY j provi<strong>de</strong>s helpful information for the sake of<strong>de</strong>termining whetherA = 0 orA = 1.We consi<strong>de</strong>r tests based on the targeted maximum likelihood methodology [16, 12]. It is a novelprocedure for robust and efficient estimation and testing. 
Because presenting a self-contained introduction to the methodology would significantly lengthen the article, we only provide below a very succinct description of it. The targeted maximum likelihood methodology itself relies on the super-learning procedure, a machine learning methodology for aggregating various estimation procedures (or simply estimators) into a single, better estimation procedure/estimator [15, 12], based on the cross-validation principle. Since super-learning also plays a crucial role in our classification procedure (see Section 3.3), and because it is possible to present a relatively short self-contained introduction to the construction of a super-learner, we propose such an introduction in Section A.1.

Let O^(1), ..., O^(n) be n independent copies of the observed data structure O. For every (i, j) ∈ {1, 2, 3} × {1, 2, 3, 4}, we compute the targeted maximum likelihood estimator (TMLE) Ψ^j_{i,n} of Ψ^j_i(P_0) based on O^(1), ..., O^(n), together with an estimator σ^j_{i,n} of its asymptotic standard deviation σ^j_i(P_0). The methodology applies because Ψ^j_i is a "smooth" parameter. It notably involves the super-learning of the conditional means Q^j_i(P_0)(A, W) = E_{P_0}(Y^j_i | A, W) and of the conditional distribution g(P_0)(A | W) = P_0(A | W) (the collection of estimators aggregated by super-learning is given in Section A.1). Under some regularity conditions, the estimator Ψ^j_{i,n} of Ψ^j_i(P_0) is consistent when either Q^j_i(P_0) or g(P_0) is consistently estimated, and it satisfies a central limit theorem. In addition, if g(P_0) is consistently estimated by a maximum-likelihood-based estimator, then σ^j_{i,n} is a conservative estimator of σ^j_i(P_0). This finally yields, for every (i, j) ∈ {1, 2, 3} × {1, 2, 3, 4}, a t-statistic

    T^j_{i,n} = √n Ψ^j_{i,n} / σ^j_{i,n}.

Now, we rank the four protocols by comparing the 3-dimensional vectors of test statistics (T^j_{1,n}, T^j_{2,n}, T^j_{3,n}) for 1 ≤ j ≤ 4.
Several criteria for comparing the vectors were considered. They all relied on the fact that the larger |T^j_{i,n}| is, the less likely the null "Ψ^j_i(P_0) = 0" is to be true. Since the results were only slightly affected by the choice of criterion, we focus here on a single one. Thus, we decide that protocol j is more informative than protocol j′ if

    ∑_{i=1}^{3} (T^j_{i,n})^2 ≥ ∑_{i=1}^{3} (T^{j′}_{i,n})^2.
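In code, this ranking criterion amounts to sorting the protocols by decreasing sum of squared t-statistics. A minimal sketch (the t-statistics themselves would come from the TMLE procedure above; the array below is a made-up stand-in):

```python
import numpy as np

def rank_protocols(T):
    """Rank protocols by the criterion sum_i T[j, i]**2, decreasing.

    T: array of shape (4, 3), with T[j - 1, i - 1] holding the t-statistic
    T^j_{i,n} for protocol j and coordinate i. Returns the protocol labels,
    most informative first, together with the criterion values."""
    scores = (T ** 2).sum(axis=1)
    order = np.argsort(-scores)          # indices sorted by decreasing criterion
    return [int(j) + 1 for j in order], scores

# Illustration with made-up t-statistics:
T = np.array([[1.0, 1.0, 1.0],
              [3.0, 3.0, 3.0],
              [0.5, 0.0, 0.0],
              [2.0, 0.0, 0.0]])
ranking, scores = rank_protocols(T)      # ranking == [2, 4, 1, 3]
```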

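For intuition about the parameter Ψ^j_i of Section 3.1, here is a naive plug-in estimator: regress the summary coordinate on (A, W), then average the contrast between the predictions under A = 1 and A = 0 over the observed W. This is only a sketch, not the article's method; the TMLE adds a targeting step to such an initial estimator and comes with the inferential guarantees described above. The linear least-squares fitter below is a placeholder for the super-learned regression.

```python
import numpy as np

def plug_in_psi(W, A, Y, fit_regression):
    """Naive plug-in estimate of Psi = E[ E(Y | A=1, W) - E(Y | A=0, W) ]."""
    Q_hat = fit_regression(np.column_stack([A, W]), Y)   # (a, w) -> E(Y | a, w)
    n = len(A)
    X1 = np.column_stack([np.ones(n), W])                # counterfactual A = 1
    X0 = np.column_stack([np.zeros(n), W])               # counterfactual A = 0
    return float(np.mean(Q_hat(X1) - Q_hat(X0)))

def fit_linear(X, Y):
    """Toy regression fitter: least squares with an intercept."""
    Xd = np.column_stack([X, np.ones(len(Y))])
    beta, *_ = np.linalg.lstsq(Xd, Y, rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ beta

# Sanity check on synthetic data where the true effect of A equals 2:
rng = np.random.default_rng(0)
n = 500
W = rng.normal(size=(n, 1))
A = rng.integers(0, 2, size=n)
psi = plug_in_psi(W, A, 2.0 * A + W[:, 0], fit_linear)
```

On this noise-free synthetic example the plug-in estimate recovers the value 2 exactly, since the regression model is well specified.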

3.3 Classifying a new subject

We now tackle the main goal of this article: building a classifier φ for determining whether A = 0 or A = 1 based on the baseline covariates W and the summary measures (Y^1, Y^2, Y^3, Y^4). In order to study the influence of the ranking on the classification, we actually build four different classifiers φ^1, φ^2, φ^3, φ^4 which respectively use only the best (most informative) protocol, the two best, the three best, and all four protocols. So φ^J is a function of W and of J among the four vectors Y^1, Y^2, Y^3, Y^4.

Say that 𝒥 ⊂ {1, 2, 3, 4} has J elements. First, we build an estimator h^𝒥_n(W, Y^j, j ∈ 𝒥) of P_0(A = 1 | W, Y^j, j ∈ 𝒥) based on O^(1), ..., O^(n), relying again on the super-learning methodology (the self-contained introduction to the construction of a super-learner in Section A.1 is augmented by the description of the collection of estimators involved in the super-learning). Second, we define

    φ^𝒥(W, Y^j, j ∈ 𝒥) = 1{h^𝒥_n(W, Y^j, j ∈ 𝒥) ≥ 1/2}

and decide to classify a new subject with information (W, Y^j, j ∈ 𝒥) into the group of hemiplegic subjects if φ^𝒥(W, Y^j, j ∈ 𝒥) = 1, or into the group of normal subjects otherwise.

Thus, the classifier φ^𝒥 relies on a plug-in rule, in the sense that the Bayes decision rule 1{P_0(A = 1 | W, Y^j, j ∈ 𝒥) ≥ 1/2} is mimicked by the empirical version in which an estimator of P_0(A = 1 | W, Y^j, j ∈ 𝒥) is substituted for the latter regression function. Such classifiers can converge at fast rates under a complexity assumption on the regression function and the so-called margin condition [1].

4 Simulation study

In this section, we carry out a simulation study of the performances of the classification procedure described in Section 3 and report its results.
The details of the simulation scheme are presented in Section 4.1, and the results are reported and evaluated in Section 4.2.

4.1 Simulation scheme

Instead of simulating (W, A) and the four complex trajectories (X^1_t)_{t∈T}, (X^2_t)_{t∈T}, (X^3_t)_{t∈T}, (X^4_t)_{t∈T} associated with four fictitious protocols (see Section 2.1), we directly generate (W, A) and the summary measures Y^1, Y^2, Y^3, Y^4 that one would have derived from (X^1_t)_{t∈T}, ..., (X^4_t)_{t∈T} (see Section 2.2). Three different scenarios/probability distributions P^1_0, P^2_0, P^3_0 are considered. They only differ from each other with respect to the conditional distributions g(P^1_0), g(P^2_0), g(P^3_0) (see Table 2 for their characterization).

For each k = 1, 2, 3, a generic observed data structure O = (W, A, Y^1, Y^2, Y^3, Y^4) drawn from P^k_0 meets the following constraints:

1. W is drawn from a slightly perturbed version of the empirical distribution of W as obtained from the original dataset (the same for all k = 1, 2, 3);

2. conditionally on W, A is drawn from g(P^k_0)(· | W);

3. conditionally on (A, W) and for each (i, j) ∈ {1, 2, 3} × {1, 2, 3, 4}, Y^j_i is drawn from the Gaussian distribution with mean Q^j_i(A, W) (the same for all k = 1, 2, 3; see Table 3 for the definition of the conditional means) and common standard deviation σ ∈ {0.5, 1}.

Although that may not be clear when looking at Table 2, the difficulty of the classification problem should vary from one scenario to the other. When using the first conditional distribution g(P^1_0), the conditional probability of A = 1 given W is concentrated around 1/2, as seen in Figure 4 (solid line), with P^1_0(g(P^1_0)(1 | W) ∈ [0.48, 0.54]) ≃ 1. In words, the covariate provides little information for predicting the class A. On the other hand, estimating g(P^1_0) from the data is easy since logit g(P^1_0)(A = 1 | W) is a simple linear function of W. The conditional probabilities of A = 1 given W under g(P^2_0) and g(P^3_0) are less concentrated around 1/2, as seen in Figure 4 (dashed and dotted lines, respectively). Thus, the covariates may provide valuable information for predicting the class. But this time, logit g(P^2_0) and logit g(P^3_0) are tricky functions of W.

Likewise, the family of conditional means Q^j_i(A, W) of Y^j_i given (A, W) that we use in the simulation scheme is meant to cover a variety of situations with regard to how difficult it is to estimate each of them and how much they tell about the class prediction. Instead of representing the latter conditional means, we find it more relevant to provide the reader with the values (computed by Monte Carlo simulations) of

    S^j(P^k_0) = ∑_{i=1}^{3} ( Ψ^j_i(P^k_0) / σ^j_i(P^k_0) )^2

for (j, k) ∈ {1, 2, 3, 4} × {1, 2, 3} and σ ∈ {0.5, 1}, see Table 4. Indeed, n S^j(P^k_0) should be interpreted as a theoretical counterpart to the criterion ∑_{i=1}^{3} (T^j_{i,n})^2. In particular, we derive from Table 4 the theoretical ranking of the protocols: for every scenario P^k_0 and σ ∈ {0.5, 1}, the protocols ranked by decreasing order of informativeness are protocols 3, 2, 1, 4.

Figure 4: Visual representation of the three conditional distributions considered in the simulation scheme. We plot the empirical cumulative distribution functions of {g^k(A = 1 | W_(l)) : l = 1, ..., L} for k = 1 (solid line), k = 2 (dashed line) and k = 3 (dotted line), where W_(1), ..., W_(L) are independent copies of W drawn from the marginal distribution of W under P^k_0 (which does not depend on k), and L = 10^5.

4.2 Leave-one-out evaluation of the performances of the classification procedure

We rely on the leave-one-out rule to evaluate the performances of the classification procedure. Specifically, we repeat independently B = 100 times the following steps for k = 1, 2, 3:

1. Draw independently O_(1,b), ..., O_(n,b) from P^k_0, with n = 54; we denote by A_(l,b) the group membership indicator associated with O_(l,b), and by O′_(l,b) the observed data structure O_(l,b) deprived of A_(l,b).

2. For each l ∈ {1, ..., n},

   (a) set S_(l,b) = {O_(l′,b) : l′ ≠ l, l′ ≤ n};
   (b) based on S_(l,b), rank the protocols (see Section 3.2), then build four different classifiers φ^1_(l,b), φ^2_(l,b), φ^3_(l,b) and φ^4_(l,b) (see Section 3.3), which respectively use only the best (most informative), the two best, the three best and all four protocols (thus φ^J_(l,b) is a function of the covariate W and of J among the four vectors Y^1, Y^2, Y^3, Y^4);


   (c) classify O_(l,b) according to the four different classifiers φ^1_(l,b)(O′_(l,b)), φ^2_(l,b)(O′_(l,b)), φ^3_(l,b)(O′_(l,b)), φ^4_(l,b)(O′_(l,b)).

3. Compute Perf^J_b = (1/n) ∑_{l=1}^{n} 1{A_(l,b) = φ^J_(l,b)(O′_(l,b))} for J = 1, 2, 3, 4.

scenario 1: logit g(P^1_0)(A = 1 | W) = W_1/50 + W_2/50 − W_3/10 − W_4/2000 + W_5
scenario 2: logit g(P^2_0)(A = 1 | W) = cos(W_1 + W_5) + sin(W_1 + W_5)
scenario 3: logit g(P^3_0)(A = 1 | W) = ⌊10 cos(W_1 + W_3)⌋ + √(5 cos(W_1 + W_3) − ⌊5 cos(W_1 + W_3)⌋) + (π/50) sin(10 cos(W_1 + W_3))

Table 2: Characterization of the three conditional distributions g(P^k_0), k = 1, 2, 3, considered in the simulation scheme.

From this, we compute for each J ∈ {1, 2, 3, 4} the mean and standard deviation of the sample (Perf^J_1, ..., Perf^J_B). There is no real need to report the values in a table, because they present a very clear pattern. First, all the standard deviations are approximately equal to 5%. Second, for every value of σ ∈ {0.5, 1}, the performance Perf^J depends only slightly on J (i.e., on the number of protocols taken into account in the classification procedure), without any significant difference for J = 1, 2, 3, 4. Third, the latter performances all equal approximately 80% when σ = 1, and increase to approximately 90% when σ = 0.5. This increase is the expected illustration of the fact that the larger the variability of the summary measures, the more difficult the classification problem. On the contrary, it is a little bit surprising that the conditional distributions g(P^1_0), g(P^2_0), g(P^3_0) do not significantly affect the performances.
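The leave-one-out loop of steps 1–3 can be sketched as follows (the training routine is kept abstract here; in the article it performs the ranking of Section 3.2 and the super-learned classification of Section 3.3, and a trivial majority-vote trainer stands in for it):

```python
def leave_one_out_performance(data, train_classifier):
    """Proportion of correct leave-one-out classifications.

    data: list of pairs (x_l, a_l), where x_l stands for the observation
    deprived of its group label a_l; train_classifier maps a training
    sample to a classifier x -> predicted label."""
    hits = 0
    for l, (x_l, a_l) in enumerate(data):
        training_set = data[:l] + data[l + 1:]     # S_(l): all but the l-th
        phi = train_classifier(training_set)       # rank protocols, build phi
        hits += int(phi(x_l) == a_l)               # compare to the true label
    return hits / len(data)

# Illustration with a placeholder majority-vote "classifier":
def train_majority(sample):
    majority = int(sum(a for _, a in sample) >= len(sample) / 2)
    return lambda x: majority

perf = leave_one_out_performance([(0, 1), (0, 1), (0, 1), (0, 0)], train_majority)
```

On this toy sample the minority observation is always misclassified, so the leave-one-out performance equals 3/4.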
Anecdotally, the estimated ranking of the protocols always coincides with the ranking that we derived from Table 4.

j = 1:  Q^1_1(A, W) = 2[A sin(W_1 + W_4) + (1 − A) cos(W_1 + W_5)]
        Q^1_2(A, W) = 3[(1 − 6A)X^5 − AX^4 + X^3 − (1 − A/2)X^2 + AX], where X = (1 − 2A)W_5 + (A/160)W_4
        Q^1_3(A, W) = A tan(W_4) + (1 − A) tan(W_5 + W_1 W_2)
j = 2:  Q^2_1(A, W) = (1/120)[A(W_1 + W_2 + W_3 + W_5 + W_1 W_2) + (1 − A)(W_5 + W_2 W_3 W_4)]
        Q^2_2(A, W) = 5[A sin(W_1 + W_4) + (1 − A) cos(W_1 + W_4)]
        Q^2_3(A, W) = (1/20)[A(2W_1 + (3/2)W_3) + (1 − A)W_5]
j = 3:  Q^3_1(A, W) = A log(2W_1 + (3/2)W_3) + (1 − A) log(W_5)
        Q^3_2(A, W) = (1/45)(X + 7)(X + 2)(X − 7)(X − 3), where X = (W_4 + W_5)/145 + A W_1
        Q^3_3(A, W) = π[A sin(X) + (1 − A) cos(X)](⌊2X⌋ + √(2X − ⌊2X⌋)), where X = cos(W_3 + W_4 + W_5)
j = 4:  Q^4_1(A, W) = (1/100)(2X^3 + X^2 − X − 1), where X = (A W_2 + W_4 + W_5)/30
        Q^4_2(A, W) = (1/60)(A + W_1 + W_2 + W_3 + W_5)
        Q^4_3(A, W) = (1/1000)[W_1 W_3 W_4 + (1 − A)(W_1^3 + W_3 W_4) + A W_2 W_5]

Table 3: Conditional means Q^j_i(A, W) of Y^j_i given (A, W) used in the three different scenarios of the simulation scheme.

5 Application to the real dataset

We present here the results of the classification procedure of Section 3 applied to the real dataset. Thus, we first rank the protocols from the most to the least informative regarding postural control, see Section 5.1; then we construct the four classifiers and rely on the leave-one-out rule to evaluate their performances, see Section 5.2. A natural extension of the classification procedure is finally considered and applied in Section 5.3, yielding significantly better results.

5.1 Targeted maximum likelihood ranking of the protocols over the real dataset

It is a known medical fact that hemiplegic subjects are sensitive to muscular stimulations, and also that they tend to compensate for their proprioceptive deficit by developing a preference for visual information in order to maintain posture [3].
This suggests that protocols involving muscular and/or visual stimulations should rank high. What do the data tell us?

We derive and report in Table 5 the results of the ranking of the protocols using the entire dataset. Table 5 teaches us that the most informative protocol is protocol 3 (visual and muscular stimulations), and that the three next protocols ranked by decreasing order of informativeness are protocols 2 (muscular stimulation), 1 (visual stimulation), and 4 (optokinetic stimulation). Apparently, protocols 3 and 2 (which have in common that muscular stimulations are involved) are highly relevant for differentiating normal and hemiplegic subjects based on postural control data. On the contrary (and perhaps surprisingly, given the introductory remark), protocols 1 and 4 seem to provide significantly less information for the same purpose.

fictitious    scenario 1         scenario 2         scenario 3
protocol      σ = 0.5   σ = 1    σ = 0.5   σ = 1    σ = 0.5   σ = 1
j = 1         0.14      0.04     0.11      0.03     0.14      0.04
j = 2         0.86      0.37     0.74      0.31     0.85      0.37
j = 3         2.94      1.12     2.49      0.93     2.90      1.11
j = 4         0.06      0.01     0.04      0.01     0.06      0.01

Table 4: Values of S^j(P^k_0) for (j, k) ∈ {1, 2, 3, 4} × {1, 2, 3} and σ ∈ {0.5, 1}. They notably teach us that, for every scenario P^k_0 and σ ∈ {0.5, 1}, the protocols ranked by decreasing order of informativeness are protocols 3, 2, 1, 4.

protocol                               j = 3    j = 2    j = 1    j = 4
criterion ∑_{i=1}^{3} (T^j_{i,n})^2    75.51    33.13    6.80     5.53

Table 5: Ranking the four protocols using the entire real dataset. We report the realizations of the criteria ∑_{i=1}^{3} (T^j_{i,n})^2 obtained for protocols j = 1, 2, 3, 4. These values teach us that the most informative protocol is protocol 3, and that the three next protocols ranked by decreasing order of informativeness are protocols 2, 1, and 4.

5.2 Classification procedures applied to the real dataset

In order to evaluate the performances of the classification procedure applied to the real dataset, we carry out steps 2a, 2b, 2c from the leave-one-out rule described in Section 4.2, where we


substitute the real dataset O^(1), ..., O^(n) for the simulated one. We actually do it twice. The first time, the super-learning methodology involves a large collection of estimators (see Section A.1). Then we justify resorting to a smaller collection (see Section A.1 again), hence a second round of performance evaluation. We report the results in Table 6, where the second and third rows respectively correspond to the first (larger collection) and second (smaller collection) rounds of performance evaluation.

Let us consider first the performances of the classification procedure relying on the larger collection. The proportion of subjects correctly classified (evaluated by the leave-one-out rule) equals only 70% (38 out of the 54 subjects are correctly classified) when the sole most informative protocol (i.e., protocol 3) is exploited. This rate jumps to 80% (43 out of 54 subjects are correctly classified) when the two most informative protocols (i.e., protocols 3 and 2) are exploited. Interestingly, including one or two of the remaining protocols decreases the performances.

                              J = 1         J = 2         J = 3         J = 4
Perf^J (larger collection)    0.70 (38/54)  0.80 (43/54)  0.74 (40/54)  0.78 (42/54)
Perf^J (smaller collection)   0.74 (40/54)  0.81 (44/54)  0.78 (42/54)  0.85 (46/54)

Table 6: Leave-one-out performances Perf^J of the classification procedure using the real dataset. Performance Perf^J corresponds to the classifier based on J among the four vectors Y^1, Y^2, Y^3, Y^4 (those associated with the J most informative protocols) and either using all estimators (second row) or only two of them (third row) in the super-learner (see Section A.1 for details).

Now, the theoretical properties of the super-learning procedure [15, 12] are asymptotic, i.e., valid when the sample size n is large, which is arguably not the case in this study. Even though this is contradictory to the philosophy of the super-learning methodology, it is tempting to reduce the number of estimators involved in the super-learning. We therefore keep only two of them (see Section A.1 for the details), and run again steps 2a, 2b, 2c from the leave-one-out rule described in Section 4.2 where we substitute the real dataset O^(1), ..., O^(n) for the simulated one. Results are reported in Table 6 (third row). We obtain better performances: for each value of J (i.e., each number of protocols taken into account in the classification procedure), the second classifier outperforms the first one. The best performance is achieved when all four protocols are used, yielding a rate of correct classification equal to 85% (46 out of the 54 subjects are correctly classified). This is encouraging, notably because one can reasonably expect that performances will improve when a larger cohort is available.

Yet, this is not the end of the story. We have built a general methodology that can be easily extended, for instance by enriching the small-dimensional summary measure derived from each complex trajectory. We explore the effects of such an extension in the next section.

5.3 Extension

Thus, let us enrich the small-dimensional summary measure initially defined in Section 2.2. Since it mainly involves distances from a reference point, the most natural extension is to add information pertaining to orientation. Relying on polar coordinates of the trajectory (B_t)_{t∈T} poses some technical issues. Instead, we propose to fit simple linear models y(B_t) = v x(B_t) + u (where x(B_t) and y(B_t) are the abscissa and ordinate of B_t) based on the datasets {B_t : t ∈ T ∩ [10, 15[}, {B_t : t ∈ T ∩ [15, 20[}, {B_t : t ∈ T ∩ [20, 45[}, {B_t : t ∈ T ∩ [45, 50[} and {B_t : t ∈ T ∩ [50, 55[}, and to use the slope estimates as summary measures of an average orientation over each time interval.

The observed data structure and parameter of interest still write as O = (W, A, Y^1, Y^2, Y^3, Y^4) and Ψ(P) = (Ψ^j(P))_{1≤j≤4}, but Y^j and Ψ^j(P) now belong to R^8 (and not R^3 anymore). The ranking of the protocols now relies on the criterion ∑_{i=1}^{8} (T^j_{i,n})^2, whose definition straightforwardly extends that of the criterion introduced in Section 3.2. The values of the criteria are reported in Table 7. The ranking of the protocols remains unchanged, but the discrepancies between the values for protocol 2 on the one hand and for protocols 1 and 4 on the other hand are smaller.

protocol                               j = 3    j = 2    j = 1    j = 4
criterion ∑_{i=1}^{8} (T^j_{i,n})^2    83.64    43.61    14.92    12.60

Table 7: Ranking the four protocols using the entire real dataset and the extended small-dimensional summary measure of the complex trajectories. We report the realizations of the criteria ∑_{i=1}^{8} (T^j_{i,n})^2 obtained for protocols j = 1, 2, 3, 4. The ranking is the same as that derived from Table 5.

We finally apply once again steps 2a, 2b, 2c from the leave-one-out rule described in Section 4.2, where we substitute the real dataset O^(1), ..., O^(n) for the simulated one, and use either all estimators or only two of them in the super-learner (we exposed our motives in the previous section; see Section A.1 for details on the super-learning procedure). The results are reported in Table 8.

                              J = 1         J = 2         J = 3         J = 4
Perf^J (larger collection)    0.82 (44/54)  0.80 (43/54)  0.80 (43/54)  0.78 (42/54)
Perf^J (smaller collection)   0.87 (47/54)  0.85 (46/54)  0.80 (43/54)  0.82 (44/54)

Table 8: Leave-one-out performances Perf^J of the classification procedure using the real dataset and the extended small-dimensional summary measure of the complex trajectories. Performance Perf^J corresponds to the classifier based on J among the four vectors Y^1, Y^2, Y^3, Y^4 (those associated with the J most informative protocols) and either using all estimators (second row) or only two of them (third row) in the super-learner (see Section A.1 for details).

When we include all estimators in the super-learner, the classification procedure that relies on the extended small-dimensional summary measure of the complex trajectories outperforms, for every value of J (i.e., each number of protocols taken into account in the classification procedure), the classification procedure that relies on the initial summary measure. The performances are even better when we only include two estimators. Remarkably, the best performance is achieved using only the most informative protocol, with a proportion of subjects correctly classified (evaluated by the leave-one-out rule) equal to 87% (47 out of the 54 subjects are correctly classified).

A Appendix

We gather in Section A.1 a short and self-contained description of the construction of a super-learner, as well as of the estimation procedures that we choose to involve. One of those estimation procedures, a variant of the top-scoring pairs classification procedure, is presented in Section A.2.

A.1 Specifics of the super-learning procedures

We refer to [15, 12] for a general presentation of the super-learning methodology, which is a method for combining/aggregating, by V-fold cross-validation, a family of candidate estimation procedures (or simply estimators) of a regression function.

Let us denote by X ∈ R^d the vector of predictors and by Y ∈ R the outcome of interest (in the classification framework, Y ∈ {0, 1}), and let (X^(1), Y^(1)), ..., (X^(n), Y^(n)) be n independent copies of (X, Y). The empirical distribution of the whole sample is denoted by P_n.
For each k ∈ K, a finite set of cardinality K, f_k(P_n) is an estimator of the regression function E(Y | X) built from P_n. The objective is to aggregate the K estimators into a single one which enjoys some optimality/oracle properties with respect to a certain criterion of interest. The criterion of interest drives the choice of a loss function L, which maps a generic observation (X, Y) and a candidate estimator f of the regression function to a real number L((X, Y), f). We use two different loss functions: one for the estimation of the conditional means Q^j_i(P_0) and the conditional distribution g(P_0) as needed to rank the protocols (see Section 3.2), and one for the classification of subjects (see Section 3.3). Regarding how to combine (f_k(P_n))_{k∈K}, we decide to consider convex combinations.
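A minimal sketch of such a convex-combination aggregation, with the weights selected by V-fold cross-validation as detailed in the next paragraph. For simplicity the simplex S_K is searched over a coarse grid; an actual implementation solves the corresponding optimization problem properly, and the fitters and data below are mere placeholders:

```python
import numpy as np

def super_learner_weights(X, Y, fitters, V=10, loss=lambda y, p: (y - p) ** 2, seed=0):
    """Select convex weights alpha over the fitters by minimizing the
    V-fold cross-validated risk of the combination sum_k alpha_k f_k.
    The simplex is searched over a coarse grid (fine for small K only)."""
    rng = np.random.default_rng(seed)
    n, K = len(Y), len(fitters)
    folds = rng.permutation(n) % V                  # balanced v-strata
    preds = np.empty((n, K))                        # cross-validated predictions
    for v in range(V):
        train, test = folds != v, folds == v
        for k, fit in enumerate(fitters):
            preds[test, k] = fit(X[train], Y[train])(X[test])
    grid = [a for a in np.ndindex(*(11,) * K) if sum(a) == 10]
    alphas = np.array(grid) / 10.0                  # coarse grid on S_K
    risks = [loss(Y, preds @ a).mean() for a in alphas]
    return alphas[int(np.argmin(risks))]

# Placeholder example: an oracle fitter versus a trivial one.
fit_good = lambda X, Y: (lambda Xn: Xn[:, 0])       # predicts the truth
fit_zero = lambda X, Y: (lambda Xn: np.zeros(len(Xn)))
X = np.linspace(0.0, 1.0, 100).reshape(-1, 1)
alpha = super_learner_weights(X, X[:, 0], [fit_good, fit_zero])
```

Here the selected weight vector puts all its mass on the oracle fitter; the super-learner then applies these weights to the fitters retrained on the whole sample.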


We use the value V = 10, and therefore draw an n-tuple (V(1), ..., V(n)) ∈ {1, ..., V}^n such that max_{v,v′≤V} |∑_{i=1}^{n} 1{V(i) = v} − ∑_{i=1}^{n} 1{V(i) = v′}| ≤ 1 (the v-strata have the same cardinality, up to one unit) and, in the classification framework, min_{v≤V, y=0,1} ∑_{i=1}^{n} 1{Y^(i) = y, V(i) = v} ≥ 1 (each v-stratum contains at least one observation from each class). For every v ≤ V, let us denote by P^v_n the empirical distribution of those observations for which V(i) ≠ v; the training set made of the latter observations yields K estimators (f_k(P^v_n))_{k∈K}. Now, we define the minimizer of the V-fold cross-validated risk

    α*(P_n) = arg min_{α∈S_K} ∑_{v≤V} ∑_{i=1}^{n} 1{V(i) = v} L( (X^(i), Y^(i)), ∑_{k∈K} α_k f_k(P^v_n) )

(where S_K = {u ∈ R^K_+ : ∑_{k≤K} u_k = 1}), which finally yields the super-learner obtained as the α*(P_n)-convex combination of the K estimators (f_k(P_n))_{k∈K} trained on the whole sample, f*(P_n) = ∑_{k∈K} α*_k(P_n) f_k(P_n).

We now turn to the description of the loss function L and of the estimators (f_k)_{k∈K} specifically used in this article.

• Estimation of the conditional means Q^j_i(P_0) and the conditional distribution g(P_0). We choose the squared error loss function, characterized by L_2((X, Y), f) = (Y − f(X))^2, whose expectation is minimized, over the set of measurable functions of X, at the targeted regression function E(Y | X).

  Regarding the estimators (f_k)_{k∈K}:

  – conditional means Q^j_i(P_0): we rely on (in alphabetical order) the elastic net [13, 14], generalized additive models [8], linear models, loess local regressions [9], random forests [4], and support vector machines [6]. Different values of the various tuning parameters are considered.

  – conditional distribution g(P_0): we rely on (in alphabetical order) the k-nearest neighbors, logistic linear regressions, random forests, and support vector machines.
Different values of the various tuning parameters are considered.

• Classification of subjects. We choose the loss function characterized by L_3((X, Y), f) = (Y − expit{β(f(X) − 1/2)})^2, with β = 200 a large positive number. This loss function is meant to provide a trade-off between the loss functions characterized by L_1((X, Y), f) = (Y − 1{f(X) ≥ 1/2})^2 and L_2((X, Y), f) = (Y − f(X))^2. Since L_3 is very close to L_1, the super-learner optimizes the convex-combination parameter α*(P_n) to be applied to the collection (f_k(P_n))_{k∈K} of estimators in view of the plug-in rule that will ultimately be used (see Section 3.3). Yet L_3 is smoother than L_1 (which takes its values in {0, 1}), and thus, in that sense, not so far from L_2, which makes the numerical computation of α*(P_n) for deriving the super-learner easier.

  Regarding the estimators (f_k)_{k∈K}, we rely on (in alphabetical order) the k-nearest neighbors, logistic regressions, random forests, and the top-scoring pairs classification procedure (see Section A.2). Different values of the tuning parameters are considered for the k-nearest neighbors and random forests.
We also consider the smaller collection of estimators that reduces to random forests (with a single choice of the tuning parameters) and the top-scoring pairs classification procedure.

Interestingly, the top-scoring pairs classification procedure involves pairwise comparisons 1{X_j ≤ X_{j′}} (1 ≤ j ≠ j′ ≤ d) of predictors: how relevant such comparisons may be for the sake of classification is not taken into account by our method for ranking protocols by decreasing order of informativeness relative to postural control.

A.2 The top-scoring pairs classification procedure

The top-scoring pair (abbreviated TSP) classification procedure was introduced in [7] for the purpose of molecular classification based on genetic information. It is parameter-free and simply relies on pairwise comparisons of predictors, hence its remarkable robustness. Even though the context of the present article greatly differs from that of [7] (mostly in that the number of predictors is small here whereas it is huge there), the very good performances enjoyed by the TSP classifier for molecular classification motivate making a variant of the TSP classification procedure one of the candidate estimators in the super-learner.

Let us denote by X = (X_1, ..., X_d) ∈ R^d the vector of predictors and by Y ∈ {0, 1} the class membership indicator. The objective is to estimate the regression function P(Y = 1 | X) based on n independent copies (X^(1), Y^(1)), ..., (X^(n), Y^(n)) of (X, Y) ∼ P.
One first derives the so-called TSP

    (k_0, l_0) = arg max_{1≤k≠l≤d} | ∑_{i=1}^{n} 1{X_k^(i) ≤ X_l^(i), Y^(i) = 1} / ∑_{i=1}^{n} 1{Y^(i) = 1} − ∑_{i=1}^{n} 1{X_k^(i) ≤ X_l^(i), Y^(i) = 0} / ∑_{i=1}^{n} 1{Y^(i) = 0} |,

then introduces

    p^-_{0,n} = ∑_{i=1}^{n} 1{X_{k_0}^(i) ≤ X_{l_0}^(i), Y^(i) = 1} / ∑_{i=1}^{n} 1{X_{k_0}^(i) ≤ X_{l_0}^(i)}   and   p^+_{0,n} = ∑_{i=1}^{n} 1{X_{k_0}^(i) > X_{l_0}^(i), Y^(i) = 1} / ∑_{i=1}^{n} 1{X_{k_0}^(i) > X_{l_0}^(i)}.

Our TSP estimator of the regression function P(Y = 1 | X) finally writes as

    P^TSP_n(Y = 1 | X) = p^-_{0,n} 1{X_{k_0} ≤ X_{l_0}} + p^+_{0,n} 1{X_{k_0} > X_{l_0}}.

An alternative definition (closer to the original TSP procedure from [7]) could have been to introduce

    π^y_{0,n} = ∑_{i=1}^{n} 1{X_{k_0}^(i) ≤ X_{l_0}^(i), Y^(i) = y} / ∑_{i=1}^{n} 1{Y^(i) = y}   for y = 0, 1,

and to estimate the regression function P(Y = 1 | X) by

    π^1_{0,n} 1{π^1_{0,n} ≥ π^0_{0,n}} + (1 − π^0_{0,n}) 1{π^1_{0,n} < π^0_{0,n}}.
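A sketch of a TSP-style rule along these lines (a sketch, not the article's exact implementation): select the pair (k_0, l_0) whose comparison best separates the two classes, then estimate P(Y = 1 | X) on each side of the comparison 1{X_{k_0} ≤ X_{l_0}}.

```python
import numpy as np

def fit_tsp(X, Y):
    """Sketch of a top-scoring-pair rule: pick the pair (k0, l0)
    maximizing the between-class difference of the empirical frequency
    of {X_k <= X_l}, then estimate P(Y = 1 | X) on each side of the
    comparison 1{X_k0 <= X_l0}."""
    n, d = X.shape
    best_score, k0, l0 = -1.0, 0, 1
    for k in range(d):
        for l in range(d):
            if k == l:
                continue
            ind = X[:, k] <= X[:, l]
            score = abs(ind[Y == 1].mean() - ind[Y == 0].mean())
            if score > best_score:
                best_score, k0, l0 = score, k, l
    ind = X[:, k0] <= X[:, l0]
    p_minus = Y[ind].mean()      # estimates P(Y = 1 | X_k0 <= X_l0)
    p_plus = Y[~ind].mean()      # estimates P(Y = 1 | X_k0 >  X_l0)
    return lambda x: p_minus if x[k0] <= x[l0] else p_plus

# Toy data where the comparison X_1 <= X_2 separates the classes:
X = np.array([[0.0, 1.0, 5.0], [0.0, 2.0, 5.0], [3.0, 1.0, 5.0], [4.0, 1.0, 5.0]])
Y = np.array([1, 1, 0, 0])
predict = fit_tsp(X, Y)
```

Being based only on within-observation comparisons, the rule is invariant under any monotone transformation applied to all predictors, which is the source of the robustness mentioned above.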


[6] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2000.

[7] D. Geman, C. d'Avignon, D. Q. Naiman, and R. L. Winslow. Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol., 3, 2004. Article 19.

[8] T. J. Hastie and R. J. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman and Hall Ltd., London, 1990.

[9] C. Loader. Local Regression and Likelihood. Statistics and Computing. Springer, New York, 1999.

[10] K. M. Newell, S. M. Slobounov, E. S. Slobounova, and P. C. M. Molenaar. Stochastic processes in postural center-of-pressure profiles. Experimental Brain Research, 113:158–164, 1997.

[11] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. ISBN 3-900051-07-0.

[12] S. Rose and M. J. van der Laan, editors. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Verlag, 2011.

[13] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.

[14] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33(1), 2010.

[15] M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Stat. Appl. Genet. Mol. Biol., 6: Art. 25, 23 pp. (electronic), 2007.

[16] M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. Int. J. Biostat., 2: Art. 11, 40 pp., 2006.


hal-00577883, version 1 - 17 Mar 2011

Threshold regression models adapted to case-control studies, and the risk of lung cancer due to occupational exposure to asbestos in France

A. Chambaz 1,∗, D. Choudat 2,†, C. Huber 1, J-C. Pairon 3, M. J. van der Laan 4

1 MAP5, Université Paris Descartes and CNRS
2 Assistance Publique – Hôpitaux de Paris and Université Paris Descartes
3 INSERM U955 and Université Paris-Est Créteil
4 University of California, Berkeley

March 17, 2011

Abstract

Asbestos has been known for many years as a powerful carcinogen. Our purpose is to quantify the relationship between an occupational exposure to asbestos and an increase of the risk of lung cancer. Furthermore, we wish to tackle the very delicate question of the evaluation, in subjects suffering from a lung cancer, of how much the amount of exposure to asbestos explains the occurrence of the cancer. For this purpose, we rely on a recent French case-control study. We build a large collection of threshold regression models, data-adaptively select a better model in it by multi-fold likelihood-based cross-validation, then fit the resulting better model by maximum likelihood. A necessary preliminary step to eliminate the bias due to the case-control sampling design is made possible because the probability distribution of being a case can be computed beforehand based on an independent study. The implications of the fitted model in terms of a notion of maximum number of years of life guaranteed free of lung cancer are discussed.

Keywords: case-control study; cross-validation; threshold regression model.

1 Introduction

Asbestos has been known for many years as a powerful carcinogen [1].
Our purpose is to quantify the relationship between an occupational exposure to asbestos and an increase of the risk of lung cancer. Furthermore, we wish to tackle the very delicate question of the evaluation, in subjects suffering from a lung cancer, of how much the amount of exposure to asbestos explains the occurrence of the cancer.
For this purpose, we rely on a recent French case-control study on lung cancer [9]. For a sample of approximately 2,000 participants, a range of information is available, including information pertaining to lifetime tobacco consumption and a longitudinal description of occupational exposure to asbestos. Each employment is associated with its duration and an original qualitative description of the exposure to asbestos into 28 categories.

∗ This collaboration took place while Antoine Chambaz was a visiting scholar at UC Berkeley, supported in part by a Fulbright Research Grant and the CNRS. A. Chambaz would like to thank M-L. Ting Lee for interesting discussions on threshold regression models.
† This work was supported by a grant from the Agence nationale de sécurité sanitaire de l'alimentation, de l'environnement et du travail (ES 2005-006).

We decide to model the age at incident lung cancer as the first time that a time-indexed continuous stochastic process (which should be interpreted as an amount of health relative to lung cancer, initially positive and featuring a negative trend) reaches 0. This justifies the expression first hitting time model, but the expression threshold regression model is often preferred.
Such models have been playing an important role in survival analysis for some years now, and we refer the reader to [6, 7] for a bibliographical overview. The model is designed in such a way that occupational exposure to asbestos may accelerate the reference time, so that incident lung cancer may occur sooner in the presence of exposure to asbestos than it would in the absence of any such exposure. This actually yields a very large collection of threshold regression models, the largest (i.e., least constrained) one containing thousands of smaller threshold regression models (obtained for instance by reducing the original 28-category description of exposure to asbestos to a description with fewer categories).
As mentioned earlier, the dataset has been obtained following a case-control study design, which is convenient for a rare disease like lung cancer (since it makes it possible to sample known cases of lung cancer). In that sense, case-control sampling is a biased sampling method. In our example, approximately one out of two participants is a case, i.e., is diagnosed with an incident lung cancer, a proportion which is of course much larger than in the population of interest, whose prevalence proportion is known and approximately equal to five cases out of 10,000 persons [2]. Knowing (actually: estimating based on the independent study [2]) beforehand the probability distribution of being a case is of crucial importance, as it makes it possible to eliminate the bias induced by the case-control sampling design, as shown in [13, 12]. Indeed, we manage to data-adaptively select a better model in our large collection of threshold regression models by relying on multi-fold likelihood-based cross-validation [15].
Then, we fit the latter better model to the data by maximum likelihood, thereby obtaining a quantitative understanding of how an exposure to asbestos is related to an increase of the risk of lung cancer.
The evaluation of how much the amount of exposure to asbestos explains the occurrence of an incident lung cancer in a case is a recurring issue. It has important implications in public-health policy-making and might be used in the design of legal compensation schemes (as in the United States, unlike in France). In this view, a mathematical notion of probability of causation has been formalized and studied in [10]. The authors of [10] soon overcame the shortcomings of the latter notion, which they had underlined, by showing that expected years of life lost due to hazardous exposure can sometimes be estimated, and how to estimate them when possible [11, 8]. In this article, we explain and take advantage of the fact that resorting to threshold regression modeling makes it very easy to come up with a notion of maximal number of years of life guaranteed free of lung cancer (heuristically, a number of years of life which a subject living infinitely would enjoy before developing an incident lung cancer).
Once the selected model has been fitted, elementary algebra maps deterministically (conditionally on observed age at incident lung cancer, history of occupational exposure, and parameter estimates) an age at incident lung cancer and a longitudinal description of occupational exposure to asbestos into a number which can be interpreted as a maximal number of years of life guaranteed free of lung cancer.
We emphasize that, although the central issues studied in this article are causal by their very nature, we cautiously used for their statement two expressions (how an exposure is related


to an increase; how much the amount of exposure explains the cancer) which belong to the semantic field of associations. This wariness is notably motivated by the fact that smoking is also a well-known risk factor of lung cancer [3], so that reaching a causal conclusion would require unraveling the intertwined effects of asbestos exposure and smoking, an impossible task with the dataset at hand. For this reason among others, the above-mentioned notion of maximal number of years of life guaranteed free of lung cancer cannot be interpreted causally.
The article is organized as follows. The dataset and the original qualitative description of the exposure to asbestos into 28 categories are carefully described in Section 2. The case-control estimation problem is formalized in Section 3, and some asymptotic properties of the resulting case-control weighted maximum likelihood estimator are briefly exposed in Section 4 (elements of proofs are relegated to the appendix). We develop the threshold regression modeling in Section 5. This includes the formal definition of the maximal number of years of life guaranteed free of lung cancer. Section 6 is dedicated to the application itself. This includes the computation of the quantities required to eliminate the bias due to the case-control sampling design, the details of the model selection procedure, the description of the better model fitted to the data, and its implications regarding the maximal number of years of life guaranteed free of lung cancer. A brief discussion is finally developed in Section 7.

2 Dataset

A case-control study.
The dataset was built following a case-control sampling scheme. The study took place between 1999 and 2002 in four Parisian hospitals.
Case and control subjects were retrospectively recruited at the end of each year 1999 to 2002 among the patients of these hospitals who were free of lung cancer at the beginning of the corresponding year. The case subjects were diagnosed with incident lung cancer during the period of the study. They were matched with control subjects on the basis of gender, age at end of calendar year (up to ±2.5 years), hospital, and race. Control subjects were recruited among patients of the departments of ophthalmology and of general and orthopedic surgery, and were by definition free of lung cancer at the time of their enrollment.
The one-to-one matching (i.e., the pattern of who is matched with whom) and race are not available. We come up with an artificial valid matching pattern (based on gender, age and hospital) and make sure that our results do not depend on this particular choice. We exclude every subject with missing information. The full data set then counts n = 860 cases and 901 controls, resulting in n + 901 = 1,761 observations.
The population sampled from during the study is arguably stationary. Therefore, the observed data structures on experimental units made of pairs of case and matched control can be modeled as independent and identically distributed (iid) random variables. This simple fact is the cornerstone of the study undertaken here. Following the seminal article [13], we invoke this fact in order to derive the valid likelihood function which is the backbone of the study.
The fundamental reasoning is fully developed with care in Section 3.
Finally, we emphasize that the results we obtain in this article, based on this dataset, can be interpreted as results relative to France under the additional assumption that sampling from the four Parisian hospitals that participated in the study is stochastically equivalent to sampling from the population of France. This assumption is also carefully stated in Section 3.

Non-professional information.
Each subject included in the study is associated with his/her date of birth, gender, date of incident lung cancer diagnosis (for cases) or interview (for controls), and a binary indicator of occurrence of lung cancer in close family.
Information pertaining to tobacco consumption is also collected. We know for each subject whether he/she ever smoked. For those subjects who once were smokers, the beginning and ending dates of the smoking period are given, as well as the lifetime tobacco use.
We will however summarize this information by only considering a discretized version of the lifetime tobacco use. Our motivation is twofold. First, a relevant tobacco history would be dynamic whereas we only have cumulated information. Second, such a time-dependent tobacco history would yield time-dependent confounding. Furthermore, the previous argument also implies that the results we obtain in this article cannot be interpreted in causal terms. It is well known that tobacco is a serious risk factor of lung cancer [3]. Reaching a causal interpretation would require that we unravel the intricate synergies between tobacco use and occupational exposures to asbestos, a difficult task that we cannot even try to address given the data at hand.

Occupational information.
Occupational information on subjects is longitudinal.
Every employment (with duration at least 6 months) is associated with its start and end dates as well as with an original description of the exposure to asbestos, a known carcinogen.
This description is a triplet referred to as "probability/frequency/intensity", each coordinate taking values in {1, 2, 3}: for the considered employment, the probability of exposure, its frequency and its intensity are evaluated as low/mild/high, respectively coded by 1, 2, 3. Hence, the set E of categories of exposure has 27 + 1 = 28 elements (we add a category 0 = (0,0,0) for no exposure), each of them corresponding to a particular rate of exposure. Note that we will use from now on either the notation ε = (ε1, ε2, ε3) or, more simply, the notational shortcut ε = ε1ε2ε3.
We report in Table 1 the overall number of employments associated with each possible "probability/frequency/intensity" description. Although computed over a total of 8,432 employments, Table 1 strikingly exhibits many zeros, showing that the latter description is over-parametrized.
We also report in Table 2 the overall number of employments that feature a particular value of each coordinate of the "probability/frequency/intensity" description. Notice that, of course, the sums over rows coincide.
The generic longitudinal description of occupational exposure to asbestos is denoted by ā. It belongs to Ē, the set of functions from the nonnegative real line to E such that a(t) = 0 for t small or large enough (before the age at first employment or when no further information is available; this constraint is just a convenience, as we will make clear in Section 5). It is understood that the value of ā at t is denoted by a(t) while ā(t) stands for the restriction of ā to [0, t].
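The category set E and a longitudinal description ā can be made concrete as follows (a minimal sketch; the employment records below are invented for illustration, and the actual rate-of-exposure mapping is only developed in Section 5):

```python
from itertools import product

# The 28 exposure categories: 0 = (0,0,0) for no exposure, plus the 27
# "probability/frequency/intensity" triplets with coordinates in {1, 2, 3}.
E = [(0, 0, 0)] + list(product((1, 2, 3), repeat=3))
assert len(E) == 28  # 27 + 1 = 28 elements

def shortcut(eps):
    """Notational shortcut: render (e1, e2, e3) as e1e2e3, e.g. (3,1,2) -> '312'."""
    return "".join(str(c) for c in eps)

# A longitudinal description ā, encoded as employments (start age, end age,
# category), with a(t) = 0 = (0,0,0) outside any employment period.
employments = [(18.0, 25.0, (1, 2, 1)), (25.0, 40.0, (3, 2, 2))]

def a(t, employments):
    """Value a(t) of the longitudinal exposure description at age t."""
    for start, end, eps in employments:
        if start <= t < end:
            return eps
    return (0, 0, 0)
```

Here `employments` and the ages are hypothetical placeholders standing in for one subject's occupational record.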
Thus a(t) = a corresponds to an occupational position held at age t and characterized by asbestos exposure a.
One of the central issues we deal with in this article is how to associate each description in E with a rate of exposure. We propose an original solution which heavily exploits the underlying multiplicative nature of the "probability/frequency/intensity" encoding. Indeed,


it is the product of "probability", "frequency" and "intensity" which is relevant in terms of rate of exposure.

ε     nb. of emp.     ε     nb. of emp.     ε     nb. of emp.
111   213             211   53              311   138
112   167             212   64              312   105
113   3               213   6               313   24
121   150             221   59              321   136
122   46              222   36              322   189
123   3               223   3               323   22
131   0               231   2               331   1
132   0               232   0               332   3
133   0               233   0               333   0

Table 1: Overall number of employments associated with each possible "probability/frequency/intensity" description. The total number of employments is 8,432. Only 1,423 of them feature a description in E \ {0}.

              1     2     3
probability   582   223   618
frequency     773   644   6
intensity     752   610   61

Table 2: Overall number of employments that feature a particular value of each coordinate of the "probability/frequency/intensity" description.

3 Formulation of the case-control estimation problem

This section builds upon the seminal study [13]. Following its strategy:
(i) We derive from the description of our case-control study the characterization of the prospective sampling scheme one would have liked to follow, had the probability of being an incident case of lung cancer not been so small. This mainly amounts to defining an observed data structure O⋆ under prospective sampling, whose distribution P⋆0 presents features of interest.
(ii) In view of the latter, we characterize the observed data structure O under matched case-control sampling and its distribution P0. Then we show how to make inference on the features of P⋆0 from data sampled under P0.

Prospective sampling.
We first set a calendar time τ (expressed in years), and consider a generic subject sampled at time τ. We denote by W = (W0, W1, W2, W3) ∈ W his/her explanatory covariate taking values in W = {1,2,3,4} × {0,1}² × {0,1,2,3}, with:
• W0 indicating from which hospital the generic subject is sampled;
• W1 = 0 for men and W1 = 1 for women;
• W2 = 0 if no lung cancer occurred in close family and W2 = 1 otherwise;
• W3 = 0 for never-smokers, W3 = 1 for a lifetime tobacco use comprised between 1 and 25 pack-years, W3 = 2 for a lifetime tobacco use comprised between 26 and 45 pack-years, and W3 = 3 otherwise.
Note that the boundaries we chose for defining W3 yield strata of comparable sizes (371 subjects with W3 = 0, and respectively 468, 469 and 453 subjects with W3 = 1, 2, 3). Let T denote his/her age at incident lung cancer (set to infinity if no lung cancer ever occurs), and let X = X(τ) denote his/her age at time τ. They are associated with Z = min{T, X} and Y = 1{T ≤ X}. Finally, the occupational information collected at time τ is encoded in Ā(X).
Now, as explained in Section 2, sampling occurred at times τ0, τ1 = τ0 + 1, τ2 = τ0 + 2, τ3 = τ0 + 3 (where τ0 stands for the initial sampling date, January 1st, 2000). Obviously the reference population depends on time. Denoting by P⋆(τ) the distribution of the reference population at time τ, we make the following stationarity assumption:

∀ 1 ≤ k ≤ 3, P⋆(τk) = P⋆(τ0) ≡ P⋆0.   (1)

This assumption is reasonable due to the influx and the outflow featured by the population of the Parisian region over the period of investigation. We emphasized in Section 2 that interpreting the results obtained in the present article as results relative to France requires that one be willing to assume that sampling from the four Parisian hospitals that participated in the study is stochastically equivalent to sampling from the population of France. Formally, this amounts to assuming that the distribution of the observed data structure O⋆ sampled from the whole French population equals P⋆0.
Had the prospective sampling been undertaken, we would have observed n0 (respectively n1, n2, n3) independent observed data structures O⋆i sampled at time points τ0 (respectively τ1, τ2, τ3), therefore collecting an iid sample (O⋆1, ..., O⋆N) of size N = n0 + n1 + n2 + n3 from the distribution P⋆0 by virtue of our stationarity assumption. This justifies the final definition of the observed data structure in a prospective sampling scheme:

O⋆ = (W, X, Ā(X), Y, Z)   (2)

with:
• W the explanatory covariate;
• X the age of the subject associated with the unit when it is sampled;
• Ā(X) the occupational information up to age X related to asbestos;
• Y = 1 if and only if (iff) T = Z ≤ X (the subject is then called a case) and Y = 0 iff T > Z = X (the subject is then called a control).
The likelihood of O⋆ under P⋆0 finally writes as

P⋆0(O⋆) = P⋆0(W) P⋆0(X|W) P⋆0(Ā(X)|X, W)
          × dP⋆0(Z = T | T ≥ X − 1, Ā(X), X, W)^Y
          × P⋆0(T > X | T ≥ X − 1, Ā(X), X, W)^{1−Y},   (3)

where dP⋆0(t | T ≥ X − 1, Ā(X), X, W) is the conditional density of T at time t given the event [T ≥ X − 1, Ā(X), X, W].


Matched case-control sampling.
Such a prospective sampling scheme would have been impractical and ineffective because the probability P⋆0(Y = 1) of being an incident case of lung cancer is very small. In order to recruit some cases in the sample, one would have to sample a huge number of observations. This is the main motivation for using a case-control sampling scheme.
Let us now describe our observed data structure in this framework. We introduce the categorical matching variable V ∈ V obtained by concatenating W0 (subject's hospital when sampled), W1 (subject's gender) and a discretized version of the age at sampling X over bins of length five years. In the sequel, we repeatedly use the convenient (though redundant) notation (V, W).
The matched case-control sampling scheme can be described as follows:
• One first samples a case by sampling
(V1, O1⋆) = (V1, W1, X1, Ā1(X1), Y1 = 1, Z1)
from the conditional distribution of (V, O⋆) given Y = 1.
• Subsequently, one samples J controls
(V0,j, O0,j⋆) = (V0,j, W0,j, X0,j, Ā0,j(X0,j), Y0,j = 0, Z0,j)
from the conditional distribution of (V, O⋆) given Y = 0 and V0,j = V1, for all j ≤ J.
This results in the observed data structure

O = ((V1, O1⋆), (V0,j, O0,j⋆), j = 1, ..., J) ∼ P0,

whose true distribution P0 can be deduced from P⋆0 and the two-step description above. Interestingly, the method naturally allows J to be random, thus varying per experimental unit. This makes it possible to exploit all our observations, even though we have fewer cases than controls. Note that each control is only taken into account once.

Case-control weighting of the log-likelihood loss function developed for prospective sampling.
It is remarkable that the log-likelihood loss function developed for prospective sampling can be adapted to the case-control sampling scheme by appropriate weighting. This weighting relies on the prior knowledge of the following probabilities: for each (y, v) ∈ {0,1} × V,

q0 = P⋆0(Y = 1),   (4)
q0(y|v) = P⋆0(Y = y|V = v),   (5)
q0(v|y) = P⋆0(V = v|Y = y),   (6)

namely, the marginal probability of being a case (4), the conditional probabilities of being a case or a control given the matching variable at level v (5), and the conditional probabilities of observing level v for the matching variable given being a case or a control (6). Indeed, it is possible to compute the latter key quantities based on the independent study [2]; see Section 6.1.
For this purpose, let us define for all v ∈ V the quantities

q̄0(v) = q0 P⋆0(Y = 0|V = v)/P⋆0(Y = 1|V = v) = q0 q0(0|v)/q0(1|v),   (7)

and introduce the following case-control weighted log-likelihood loss function for the density p⋆0 of P⋆0 under sampling of O ∼ P0:

l(O|p⋆) = q0 log p⋆(V1, O1⋆) + q̄0(V1) (1/J) ∑_{j=1}^{J} log p⋆(V1, O0,j⋆).   (8)

It is worth noting that, even though q0 appears in both terms in (8), we prefer to consider l(O|p⋆) as defined above rather than q0^{−1} l(O|p⋆). This choice guarantees that the weighted log-likelihood l(O|p⋆) is on the same scale as the log-likelihood log P⋆0(O⋆) under prospective sampling.

Proposition 1. Let p⋆0 be the density of the observed data structure O⋆ under prospective sampling. Consider a model P⋆ for p⋆0 such that the integrals ∫ log p⋆(o⋆) dP⋆0(o⋆, Y = y) are properly defined for all p⋆ ∈ P⋆ and y = 0, 1. If P⋆ is well-specified (i.e., if p⋆0 ∈ P⋆), then the density

arg max_{p⋆ ∈ P⋆} E_{P0} l(O|p⋆),

which maximizes the expectation under P0 of the weighted loss function (8) over P⋆, is unique and coincides with p⋆0.

The proof is relegated to the appendix.
In Section 5 we propose a parametric model for the conditional distribution of O⋆ given (W, X, Ā, Y), that is, the conditional distribution of Z given (W, X, Ā, Y). The parametric model is sound in the sense that the conditional distribution of Z given (W, X, Ā, Y) only depends on (W, X, Ā(X), Y). The latter parametric model is combined with a nonparametric model for the conditional distribution of (W, X, Ā) given Y, both yielding a semiparametric model P⋆ for p⋆0 because we know beforehand q0, the true probability of being a case.
Let p⋆_θ be the density of O⋆ under parameter θ. Assuming that the true density p⋆0 is "projected" (in terms of Kullback-Leibler divergence) onto p⋆_{θ0} for some θ0, or in other terms that the mapping θ ↦ KL(p⋆0, p⋆_θ) achieves its minimum at the unique θ0, we focus hereafter on the maximum likelihood estimation of θ0. Following the lines of the proof of Proposition 1, the case-control weighted maximum likelihood estimator

θn = arg max_θ ∑_{i=1}^{n} l(Oi|p⋆_θ)   (9)

does estimate θ0. We briefly consider its asymptotic properties in the next section.

4 Asymptotic properties of the case-control weighted maximum likelihood estimator

Let us denote for convenience

Ω = (W, X, Ā(X), Y)


so that O⋆ = (Ω, Z), and in the same spirit,

Ω1 = (W1, X1, Ā1(X1), Y1 = 1),
Ω0,j = (W0,j, X0,j, Ā0,j(X0,j), Y0,j = 0),

hence O1⋆ = (Ω1, Z1) and O0,j⋆ = (Ω0,j, Z0,j). We consider a semiparametric model such that the likelihood of O⋆ under θ ∈ Θ writes as

p⋆_θ(O⋆) = p⋆_θ(Z|Ω) η(Ω),

η(Ω) being here the likelihood of Ω, which we assume without serious loss of generality to be bounded away from 0. Therefore the weighted log-likelihood loss function for p⋆0 under sampling of O ∼ P0 can be decomposed as

l(O|p⋆_θ) = q0 log p⋆_θ(V1, O1⋆) + q̄0(V1) (1/J) ∑_{j=1}^{J} log p⋆_θ(V1, O0,j⋆)
          = q0 log p⋆_θ(Z1|Ω1, V1) + q̄0(V1) (1/J) ∑_{j=1}^{J} log p⋆_θ(Z0,j|Ω0,j, V1) + rem(O),   (10)

where rem(O) is a random term independent of θ. We set ˜l(O|θ) = l(O|p⋆_θ) − rem(O) and note that one can substitute ˜l(Oi|θ) for l(Oi|p⋆_θ) in the definition (9) of θn without modifying the resulting estimator:

θn = arg max_{θ∈Θ} ∑_{i=1}^{n} ˜l(Oi|θ),

thereby avoiding any consideration of η while estimating the parameter of interest.
We recall that the class F = {l(·|p⋆_θ) : θ ∈ Θ} is P0-Glivenko-Cantelli if sup_{θ∈Θ} |(1/n) ∑_{i=1}^{n} l(Oi|p⋆_θ) − E_{P0} l(O|p⋆_θ)| = o_{P0}(1). The following classical consistency result holds (see Theorem 5.7 and Example 19.8 in [14]).

Proposition 2. Assume that E_{P⋆0} log p⋆0(O⋆) is well-defined and that the mapping θ ↦ KL(p⋆0, p⋆_θ) from Θ to the nonnegative real numbers attains its minimum uniquely at θ0 ∈ int(Θ) (p⋆_{θ0} is the Kullback-Leibler projection of p⋆0 upon {p⋆_θ : θ ∈ Θ}). If F is P0-Glivenko-Cantelli, then θn converges in probability to θ0. This is for instance the case if Θ is a compact metric space, if θ ↦ l(o⋆|p⋆_θ) is continuous for every o⋆, and if F admits an integrable envelope function with respect to P0.

It is accompanied by an asymptotic normality result (inspired by classical results of asymptotic normality, see Theorem 5.23 in [14]; we omit the measurability conditions).

Proposition 3 (first part). In the context of Propositions 1 and 2, assume in addition that θ ↦ log p⋆_θ(Z|Ω) is twice differentiable at θ0, P⋆0-almost surely, with first and second derivatives ˙l⋆_{θ0}(O⋆) and ¨l⋆_{θ0}(O⋆) such that ∫ l(o⋆) dP⋆0(o⋆, Y = y) are properly defined for l = ˙l⋆_{θ0}, ¨l⋆_{θ0} and y = 0, 1, and introduce accordingly the weighted versions

˙˜l(O|θ0) = q0 ˙l⋆_{θ0}(V1, O1⋆) + q̄0(V1) (1/J) ∑_{j=1}^{J} ˙l⋆_{θ0}(V1, O0,j⋆),
¨˜l(O|θ0) = q0 ¨l⋆_{θ0}(V1, O1⋆) + q̄0(V1) (1/J) ∑_{j=1}^{J} ¨l⋆_{θ0}(V1, O0,j⋆).

Suppose also that, for every θ1, θ2 in a neighborhood of θ0 and a function ṁ such that E_{P⋆0} ṁ(O⋆)² < ∞, P⋆0-almost surely

|log p⋆_{θ1}(Z|Ω) − log p⋆_{θ2}(Z|Ω)| ≤ ṁ(O⋆) ‖θ1 − θ2‖.

Furthermore, assume that θ ↦ E_{P0} l(O|p⋆_θ) = E_{P⋆0} log p⋆_θ(O⋆) (by virtue of Proposition 1) admits a second-order Taylor expansion at θ0 with nonsingular symmetric second-derivative matrix S_{θ0} = E_{P0} ¨˜l(O|θ0). Then S_{θ0} = E_{P⋆0} ¨l⋆_{θ0}(O⋆) and

√n (θn − θ0) = −S_{θ0}^{−1} (1/√n) ∑_{i=1}^{n} ˙˜l(Oi|θ0) + o_{P0}(1).   (11)

In particular, the sequence √n (θn − θ0) is asymptotically Gaussian with mean zero and covariance matrix Σ = S_{θ0}^{−1} E_{P0}[˙˜l(O|θ0) ˙˜l(O|θ0)ᵀ] S_{θ0}^{−1}.

We emphasize that we purposely do not use the same convention to denote the first and second order derivatives at θ0 of θ ↦ log p⋆_θ(Z|Ω) (respectively ˙l⋆_{θ0}(O⋆) and ¨l⋆_{θ0}(O⋆)) and the derivatives of θ ↦ ˜l(O|θ) (respectively ˙˜l(O|θ0) and ¨˜l(O|θ0)): we intend to stress that the former are related to the prospective sampling observed data structure O⋆ whereas the latter are related to the case-control sampling observed data structure O.
A natural question arises: how does the asymptotic covariance matrix Σ compare with the asymptotic covariance matrix one would have got under prospective sampling? We give in the second part of Proposition 3 a very simple answer, but for a prospective sampling under a modified version of P⋆0. Introduce for clarity of exposition the notation

q̄0(o⋆) = q0 q0(y|v)/q0(1|v)   (12)

such that q̄0(o⋆) = q0 if y = 1 (o⋆ corresponds to a case) and q̄0(o⋆) = q̄0(v) if y = 0 (o⋆ corresponds to a control). By setting

dP⋆1/dP⋆0 (o⋆) = 1/(2 q̄0(o⋆)),   (13)

we define a probability distribution P⋆1 for the observed data structure O⋆ (indeed, o⋆ ↦ q̄0(o⋆) is positive and E_{P⋆0} (2 q̄0(O⋆))^{−1} = 1).
Moreover:
• under P⋆1, being a case is as likely as being a control (equivalently, P⋆1(Y = 1) = 1/2);
• the marginal distribution of the matching variable V under P⋆1 equals the conditional distribution of V under P⋆0, conditionally on being a case (equivalently, P⋆1(V = v) = q0(v|1) for all v ∈ V);
• given (V, Y), O⋆ has the same distribution under P⋆1 as under P⋆0 (indeed, q̄0(o⋆) depends on o⋆ through (v, y) only).
Furthermore, since obviously

2 E_{P⋆1} q̄0(O⋆) log p⋆_θ(O⋆) = E_{P⋆0} log p⋆_θ(O⋆),

θ ↦ q̄0(O⋆) log p⋆_θ(O⋆) is a proper loss function for the purpose of estimating θ0 under prospective sampling of O⋆ ∼ P⋆1.
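The asymptotic covariance of Proposition 3 has the familiar sandwich form S^{−1} E[˙l ˙l^T] S^{−1}, with case-control weighted scores built from q0 and q̄0(v) as in (7). A minimal numpy sketch of its empirical counterpart (all numbers, scores and Hessians below are hypothetical placeholders, not the fitted model of Section 6):

```python
import numpy as np

def qbar0(q0, q0_control_given_v, q0_case_given_v):
    # qbar0(v) = q0 * q0(0|v) / q0(1|v), cf. (7).
    return q0 * q0_control_given_v / q0_case_given_v

def weighted_score(score_case, scores_controls, q0, qbar0_v):
    # Case-control weighted score, mirroring the weighted versions of
    # Proposition 3: q0 * score(case) + qbar0(v) * (1/J) * sum_j score(control_j).
    return q0 * score_case + qbar0_v * np.mean(scores_controls, axis=0)

def sandwich(weighted_scores, weighted_hessians):
    # Empirical sandwich: S_bar^{-1} (n^{-1} sum_i s_i s_i^T) S_bar^{-1},
    # with S_bar the average weighted Hessian and s_i the weighted scores.
    scores = np.asarray(weighted_scores)
    S_bar = np.mean(weighted_hessians, axis=0)
    meat = scores.T @ scores / len(scores)
    S_inv = np.linalg.inv(S_bar)
    return S_inv @ meat @ S_inv

# Toy run in dimension d = 2 with made-up per-unit inputs.
rng = np.random.default_rng(1)
n, d, J = 200, 2, 2
q0 = 5e-4
qb = qbar0(q0, 0.999, 0.001)
w_scores = [weighted_score(rng.normal(size=d),
                           rng.normal(size=(J, d)), q0, qb)
            for _ in range(n)]
w_hessians = [-np.eye(d)] * n  # placeholder Hessians
Sigma = sandwich(w_scores, w_hessians)
```

The resulting matrix is symmetric and positive semidefinite by construction; in an actual analysis the scores and Hessians would come from the fitted parametric model of Section 5.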


hal-00577883, version 1 - 17 Mar 2011

Proposition 3 (second part). Define S′_{θ_0} = E_{P⋆_1} q̄_0(O⋆) l̈⋆_{θ_0}(O⋆). Suppose that the model is well-specified (or equivalently that KL(p⋆_0, p⋆_{θ_0}) = 0). Assume in addition that for all o^{1⋆} = (z^1, ω^1), the class of derivatives of p⋆_θ(z^1|ω^1) with respect to θ is uniformly bounded (in θ) by an integrable function (of z^1). Then the covariance matrix Σ satisfies

Σ = (1/2) S′_{θ_0}^{−1} E_{P⋆_1}[q̄_0(O⋆)² l̇⋆_{θ_0}(O⋆) l̇⋆_{θ_0}(O⋆)^⊤] S′_{θ_0}^{−1}.

In particular, 2Σ can be interpreted as the asymptotic covariance matrix of the M-estimator of θ_0 based on the loss function θ ↦ q̄_0(O⋆) log p⋆_θ(O⋆) and n iid observations drawn from P⋆_1. The n observations under P_0-case-control sampling correspond to 2 × n observations under P⋆_1-prospective sampling, each of them in the former counting for two in the latter. Elements of proof are relegated to the appendix.

5 Threshold regression parametric modeling

5.1 Health as a stochastic process

We adopt the threshold regression approach (see [6, 7] and references therein), that is (quoting the title of [6]) we model the time to event of interest (development of an incident lung cancer) as a stochastic process reaching a boundary. The latter stochastic process represents here the amount of health relative to lung cancer. As long as it stays above zero (the so-called boundary), no lung cancer occurs. Crossing the boundary for the first time corresponds to developing an incident lung cancer.

Let B be a Brownian motion. For any real numbers h > 0 and µ ≤ 0, define

T(h, µ) = inf{t ≥ 0 : h + µt + B_t ≤ 0}, (14)

the first time the drifted Brownian motion (h + µt + B_t, t ≥ 0) hits zero. The distribution of T(h, µ) is known as the inverse Gaussian distribution with parameter (h, µ). It is characterized by its cumulative distribution function (cdf)

F(t; h, µ) = 1 + e^{−2hµ} Φ((µt − h)t^{−1/2}) − Φ((µt + h)t^{−1/2}),

where Φ is the standard normal cdf.

It is well known (see for instance [4]) that the drifted Brownian motion (h + µt + B_t, t ≥ 0) will almost surely eventually reach the boundary (i.e., T(h, µ) < ∞) because µ ≤ 0. Therefore T(h, µ) is also characterized by its density

f(t; h, µ) = h (2πt³)^{−1/2} exp(−(h − |µ|t)²/(2t)).

Describing what happens in presence of exposures involves the introduction of an acceleration function R, that is a nondecreasing continuous function on the nonnegative real line such that R(t) ≥ t for all t. Given such a function R, we define

T(h, µ, R) = inf{t ≥ 0 : h + µR(t) + B_{R(t)} ≤ 0}, (15)

the first time the drifted Brownian motion (h + µt + B_t, t ≥ 0) hits zero along the modified time scale derived from R. Obviously, T(h, µ, R) = T(h, µ) when R is the identity, but in general T(h, µ, R) ≤ T(h, µ). Furthermore,

T(h, µ, R) ≥ t if and only if T(h, µ) ≥ R(t), (16)

so that the cdf of T(h, µ, R) at t is F(R(t); h, µ), and its density at t is R′(t)f(R(t); h, µ) as soon as R is differentiable.

More importantly here, by virtue of the factorization of the likelihood exhibited in (3), the conditional survival function and density of T(h, µ, R) at t ≥ x − 1 given [T(h, µ, R) ≥ x − 1] are respectively

G(t; h, µ, R) = (1 − F(R(t); h, µ)) / (1 − F(R(x − 1); h, µ)) (17)

and

g(t; h, µ, R) = R′(t) f(R(t); h, µ) / (1 − F(R(x − 1); h, µ)).
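A quick numerical sanity check of the inverse Gaussian cdf F and density f above is sketched below; the values h = 2 and µ = −0.5 are arbitrary test values, not estimates from the study.

```python
# Sanity check of the inverse Gaussian cdf F(.; h, mu) and density f(.; h, mu)
# of Section 5.1, with arbitrary test values h = 2, mu = -0.5.
import math

def Phi(x):                        # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F(t, h, mu):                   # cdf of T(h, mu), t > 0
    s = math.sqrt(t)
    return 1.0 + math.exp(-2.0 * h * mu) * Phi((mu * t - h) / s) - Phi((mu * t + h) / s)

def f(t, h, mu):                   # density of T(h, mu), t > 0
    return h / math.sqrt(2.0 * math.pi * t ** 3) * math.exp(-(h - abs(mu) * t) ** 2 / (2.0 * t))

h, mu = 2.0, -0.5

# F is a proper cdf: ~0 near 0 and ~1 in the far tail (mu < 0 makes hitting certain)
assert F(1e-8, h, mu) < 1e-6 and abs(F(200.0, h, mu) - 1.0) < 1e-8

# f is the derivative of F (central finite difference)
t = 3.0
assert abs((F(t + 5e-4, h, mu) - F(t - 5e-4, h, mu)) / 1e-3 - f(t, h, mu)) < 1e-5

# E[T(h, mu)] = h / |mu| (crude Riemann sum of t * f(t) over (0, 200])
dt = 1e-3
mean = sum(s * f(s, h, mu) * dt for s in (k * dt for k in range(1, 200001)))
assert abs(mean - h / abs(mu)) < 1e-2
```

The last assertion checks the mean formula h/|µ| recalled in the text.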
Recall also that T(h, µ) has mean h/|µ| whenever µ < 0, and that here (h + µt + B_t, t ≥ 0) models the amount of health relative to lung cancer in the absence of exposure to asbestos, so that an incident lung cancer eventually occurs at time T(h, µ). This presentation of the model we are building yields a nice interpretation of the parameter (h, µ): h plays the role of an initial amount of health relative to lung cancer, and µ a rate of decay of that amount of health. The time T(h, µ, R), in contrast, should be interpreted as the time to incident lung cancer under a history of exposure to asbestos compatible with R, a notion we investigate in the next section.

5.2 Calendar versus biological ages: modeling the ageing acceleration due to occupational exposure to asbestos

There is a nice interpretation of the acceleration function device. Admitting that the reference time scale (that is, that of the Brownian motion B) corresponds to the chronological/calendar time scale, the new time scale formed by an acceleration function R may be understood as a biological time scale. This interpretation acknowledges the fact that the ageing phenomenon related to lung cancer is stronger in presence of noxious exposure than in its absence.

We present now an original class of acceleration functions tailored to our particular description of occupational exposures. Let us define

M = {(M_0, (M_{k,l})_{k,l≤3}) ∈ R_+ × M_{3,3}(R_+) : 0 ≤ M_{k,1} ≤ M_{k,2} ≤ M_{k,3} = 1, k = 1, 2, 3}. (19)

Then the rate yielded by description ε = (ε_1, ε_2, ε_3) ∈ E \ {0} for M ∈ M writes as

M(ε) = 1 + M_0 × M_{1,ε_1} × M_{2,ε_2} × M_{3,ε_3}, (20)

and that of ε = 0 is set to M(0) = 1. Notably, M_0 is to be interpreted as the factor of acceleration of time for the highest exposure, which we recall is encoded by ε = (3, 3, 3). Rates M(ε) range from 1 to M(3,3,3) = 1 + M_0 and (with convention 0/0 = 1)

(M(ε) − 1)/M_0 = M_{1,ε_1} × M_{2,ε_2} × M_{3,ε_3}:


an exposure characterized by a “probability/frequency/intensity” description ε = (ε_1, ε_2, ε_3) achieves a fraction M_{1,ε_1} × M_{2,ε_2} × M_{3,ε_3} of the maximal acceleration.

Note that we only need 7 parameters in order to fully describe the 28 possibly different rates of acceleration. Furthermore, it is easily seen that this parametrization is identifiable: if M, M′ ∈ M satisfy M(ε) = M′(ε) for all ε ∈ E then M = M′.

Consider M ∈ M and a generic longitudinal description ε̄ as presented in Section 2. Let us now define the function r̃ over the nonnegative real line such that, for all t ≥ 0,

r̃(t; M, ε̄) = M(ε(t)).

The function r̃ is piecewise constant, but we can construct a continuous version r such that, if r̃ = Σ_{l=1}^{L} ρ_l 1]t_l; t_{l+1}] + 1]t_{L+1}; ∞), then r(t_l) = ρ_l for l = 1, ..., L and ‖r̃ − r‖_∞ ≤ α for a small α > 0 chosen a priori. Because we are willing to add a constraint M_0 ≤ C_1 to the definition of M and to impose that a generic longitudinal description cannot have more than C_2 breakpoints (that is, to upper bound a priori the number of employments that a subject can have in a lifetime), the mapping r̃ ↦ r can even guarantee sup_{M,ε̄} ‖r̃ − r‖_∞ ≤ α. We will denote hereafter by r(·; M, ε̄) the continuous function associated with M and ε̄, and proceed as if α = 0.

Finally, every pair (M, ε̄) gives rise to the differentiable (because r(·; M, ε̄) is continuous) acceleration function characterized by

R(t; M, ε̄) = ∫_0^t r(s; M, ε̄) ds.

In particular, if ε(t) = 0 for all t ≥ 0 (that is, in absence of exposure throughout lifetime), then R(t; M, ε̄) = t for all t ≥ 0: in other words, the chronological and biological time scales coincide.

Now, given parameters h > 0, µ, M ∈ M and covariate ā, we obtain R_a = R(·; M, ā), which yields in turn the time to incident lung cancer T(h, µ, R_a).

5.3 A notion of maximal number of years of life guaranteed free of lung cancer

Equivalence (16) has another important straightforward consequence:

T(h, µ) = R(T(h, µ, R)).

In particular, given parameters h > 0, µ, M ∈ M and covariate ā, T(h, µ) = R_a(T(h, µ, R_a)). In words, all things (gender, occurrence of lung cancer in close family, lifetime tobacco use) being equal, the age at incident lung cancer in the absence of occupational exposure to asbestos can be deduced deterministically from the (observed) age at incident lung cancer and history of occupational exposure to asbestos of a case. Furthermore, the nonnegative quantity

R_a(T(h, µ, R_a)) − T(h, µ, R_a)

(with convention R_a(∞) = ∞ and ∞ − ∞ = 0) can be interpreted as a maximal number of years of life guaranteed free of lung cancer. The expression conveys the notion that R_a(T(h, µ, R_a)) − T(h, µ, R_a) is different from the remaining number of years of life, as death may occur anytime after T(h, µ, R_a) even in the absence of occupational exposure to asbestos. Heuristically, it is the number of years of life which a subject living infinitely long would enjoy before developing an incident lung cancer.

5.4 Case-control weighted log-likelihood loss function

We derive in this section a valid log-likelihood loss function based on the threshold regression parametric modeling introduced in Sections 5.1 and 5.2. By Section 3, we know that it suffices to model the distribution of the observed data structure O⋆ under prospective sampling. As explained in Section 3, we wish to model parametrically the conditional distribution of O⋆ given Ω = (W, X, Ā(X), Y), leaving the conditional distribution of Ω given Y unspecified. For this purpose, we state that under θ = (α, β, M) ∈ Θ = R⁴ × R¹⁶ × M, the conditional distribution of T (the possibly unobserved age at incident lung cancer of the subject associated with O⋆) given Ω is that of T(h, µ, R_a) with

log h = α_{1+(0,1,2,0)W^⊤} (21)

(each level of (W_1, W_2) ∈ {0,1}² is associated with a unique positive initial health h),

log(−µ) = β_{1+(0,1,2,4)W^⊤} (22)

(each level of (W_1, W_2, W_3) ∈ W = {0,1}² × {0,1,2,3} is associated with a unique negative drift µ), and

R_a = R(·; M, Ā(X)).

Therefore,

log p⋆_θ(Z|Ω) = Y log g(Z; θ) + (1 − Y) log G(Z; θ)

with convention

G(Z; θ) = G(Z; h, µ, R(·; M, Ā(X))), g(Z; θ) = g(Z; h, µ, R(·; M, Ā(X))),

the functions G and g being defined in (17) and (18). Finally, the relevant part of the resulting case-control weighted log-likelihood at θ ∈ Θ writes as

Σ_{i=1}^{n} { q_0 log g(Z_i^1; θ) + q̄_0(V_i^1) (1/J_i) Σ_{j=1}^{J_i} log G(Z_i^{0,j}; θ) } = P_n l̃(·|θ), (23)

where l̃(O|θ) = l(O|p⋆_θ) − rem(O) (see equation (10)) and P_n = Σ_{i=1}^{n} δ_{O_i} denotes the empirical measure.

5.5 Multi-fold likelihood-based cross-validation

It makes no doubt that the model we have built so far is over-dimensioned. The “probability/frequency/intensity” description with its 28 different levels is itself certainly too rich (see again Table 1), or at least difficult to establish and prone to errors. We rather consider the model Θ described so far as a “maximal” model giving rise to a large collection of sub-models Θ_k obtained by adding constraints on the “maximal” parameter θ = (α, β, M) ∈ Θ. The number of such sub-models is large indeed: there are (1 + 7³) = 344 sub-models defined by adding only constraints on M (of the type M_0 = 0, or M_0 > 0 and, for any k = 1, 2, 3, 0 = M_{k,1} or M_{k,1} = M_{k,2} or M_{k,2} = 1 or 0 = M_{k,1} = M_{k,2} or M_{k,1} = M_{k,2} = 1 or (0 = M_{k,1}, M_{k,2} = 1)), hence the total number of sub-models equals 2² × 2³ × 344 = 11,008. It is out of the question to explore the whole collection of sub-models. Instead, we propose to


(i) define a large collection {Θ_k : k ∈ K} of sub-models of interest,

(ii) let the data select a better sub-model Θ_k̂ in the latter collection based on a multi-fold likelihood-based cross-validation criterion.

It is shown in [15] that, under mild assumptions, the multi-fold likelihood-based cross-validation criterion will select a better model comparing favorably with the oracle model of the collection (whose definition involves the true distribution of the data). By this we mean that the likelihood risk of the better model will not be much bigger than that of the oracle model. Although we cannot invoke rigorously this remarkable property here, it motivates the procedure that we describe below.

The likelihood risk of θ ∈ Θ is by definition

R(θ) = −E_{P_0} l̃(O|θ),

which is closely related to minus the Kullback-Leibler divergence between the density p⋆_0 of P⋆_0 and p⋆_θ, as explained in Section 3. Let us denote by θ_{n,k}(P_n) the case-control weighted maximum likelihood estimator defined in (9) with θ ranging over Θ_k. Given the collection {θ_{n,k}(P_n) : k ∈ K}, we wish to select the estimator θ_{n,k̄}(P_n) that minimizes R, where k̄ itself depends on P_n. Because the definition of R involves the true distribution P_0, we must estimate R(θ_{n,k}(P_n)), and we choose to do so by multi-fold cross-validation. Details follow.

We split the data randomly into a training sample and a validation sample. Given an integer V (later set to V = 10), each observed data structure O_i is associated with a label lab_i = 1 + (i mod V). The collection of labels {lab_i : i ≤ n} ⊂ {1, ..., V} is such that max_{l,l′≤V} |Σ_{i=1}^n 1{lab_i = l} − Σ_{i=1}^n 1{lab_i = l′}| ≤ 1. The splitting random variable S = (S_1, ..., S_n) ∈ {0,1}^n is drawn independently of O_1, ..., O_n in such a way that, for each 1 ≤ l ≤ V, S = (1{lab_1 = l}, ..., 1{lab_n = l}) with probability V^{−1}. Conditionally on S, the observed data structure O_i belongs to the training sample if S_i = 0 (there are approximately n(V − 1)/V such O_i's); otherwise it belongs to the validation sample. The empirical distribution of those O_i's for which S_i = 0 (respectively, S_i = 1) is P⁰_{n,S} (respectively, P¹_{n,S}). The empirical distribution of those O_i's for which lab_i = l (respectively, lab_i ≠ l) is P^l_n (respectively, P^{−l}_n).

Each Θ_k yields a maximum likelihood estimator θ_{n,k}(P⁰_{n,S}) based on the training sample only. Its risk, averaged over the splits, writes as

crit(k) = E_S R(θ_{n,k}(P⁰_{n,S})) = −(1/V) Σ_{l=1}^{V} E_{P_0} l̃(O|θ_{n,k}(P^{−l}_n)).

The value k̃ that minimizes k ↦ crit(k) over K is called the oracle because it depends both on P_n and on P_0. In our attempt to reach that k̃, which is a good proxy for k̄, we estimate crit(k) by

ĉrit(k) = −E_S E_{P¹_{n,S}} l̃(O|θ_{n,k}(P⁰_{n,S})) = −(1/V) Σ_{l=1}^{V} E_{P^l_n} l̃(O|θ_{n,k}(P^{−l}_n)),

and propose to use the value k̂ that minimizes k ↦ ĉrit(k) over K, whose definition is postponed to Section 6.2. In conclusion, the final estimator is θ_{n,k̂}(P_n).

a | q_0(1|0, a) | q_0(1|1, a)
1 | 2.058932e-06 | 1.663324e-06
2 | 1.859944e-05 | 1.460473e-05
3 | 6.803086e-05 | 4.461827e-05
4 | 2.586692e-04 | 1.184914e-04
5 | 6.484864e-04 | 1.947058e-04
6 | 1.192778e-03 | 2.542976e-04
7 | 1.854668e-03 | 3.294062e-04
8 | 2.331553e-03 | 3.588764e-04
9 | 2.928415e-03 | 4.466062e-04
10 | 3.686216e-03 | 5.312313e-04
11 | 3.608302e-03 | 5.332930e-04
12 | 3.636995e-03 | 5.395069e-04
13 | 2.171286e-03 | 3.234775e-04

q_0 = 470.0682e-06

a | q̄_0(0, a) | q̄_0(1, a)
1 | 228.3063171 | 282.6071669
2 | 25.2727716 | 32.1855444
3 | 6.9091613 | 10.5348607
4 | 1.8167860 | 3.9666376
5 | 0.7243998 | 2.4137787
6 | 0.3936251 | 1.8480261
7 | 0.2529813 | 1.4265470
8 | 0.2011416 | 1.3093632
9 | 0.1600496 | 1.0520638
10 | 0.1270504 | 0.8843954
11 | 0.1298040 | 0.8809745
12 | 0.1287763 | 0.8708223
13 | 0.2160229 | 1.4527010

Table 3: Estimating the probability distribution of being a case, based on the independent study [2]. Left: Estimates of q_0(1|w_1, v_2), as defined in (5). Middle: Estimate of q_0, as defined in (4). Right: Estimates of q̄_0(w_1, v_2), as defined in (7). Here, w_1 = 0 for men and w_1 = 1 for women, and v_2 = a if the age at sampling x belongs to [t_a; t_{a+1}), where t_0 = 0, t_a = 30 + 5(a − 1) for 1 ≤ a ≤ 12 and t_13 = ∞.

6 Results

6.1 Conditional distribution of being a case

Estimating the marginal probability of being a case q_0 (4), the conditional probabilities of being a case or a control conditionally on the matching variable (q_0(1|v))_{v∈V} (5), and the weights (q̄_0(v))_{v∈V} (7) is made possible thanks to [2], an independent study of cancer incidence and mortality in France over the period 1980–2005.
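The right panel of Table 3 can be recovered from the left and middle panels through definition (7), q̄_0(v) = q_0 q_0(0|v)/q_0(1|v). The sketch below checks this on the men's column, with the values copied from Table 3.

```python
# Recomputing the case-control weights of Table 3 (right panel) from the
# marginal probability q0 and the conditional probabilities q0(1|v) of the
# left panel, via qbar0(v) = q0 * q0(0|v) / q0(1|v), q0(0|v) = 1 - q0(1|v).
q0 = 470.0682e-06

# q0(1 | w1 = 0, v2 = a) for men, a = 1..13 (Table 3, left panel, first column)
q1_men = [2.058932e-06, 1.859944e-05, 6.803086e-05, 2.586692e-04,
          6.484864e-04, 1.192778e-03, 1.854668e-03, 2.331553e-03,
          2.928415e-03, 3.686216e-03, 3.608302e-03, 3.636995e-03,
          2.171286e-03]

def qbar0(q1v):
    return q0 * (1.0 - q1v) / q1v

weights = [qbar0(q) for q in q1_men]

# first and last weights of the right panel, men's column
assert abs(weights[0] - 228.3063171) < 1e-3
assert abs(weights[-1] - 0.2160229) < 1e-4
```

The recomputed weights match the published ones, which also illustrates how inhomogeneous the weights are across age classes (a factor of more than 1,000 between a = 1 and a = 10).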
However, we must assume either (i) that the data from [2], which are collected over the whole French population, are representative of the Parisian population of interest, or (ii), as underlined in Section 2, that sampling from the four Parisian hospitals that participate in the study is stochastically equivalent to sampling from the population of France.

We first estimate these quantities for each year from 1999 to 2002 separately. In agreement with our stationarity assumption (1), we remark that the various estimates are very consistent over the years. In order to gain precision, we average the estimates over the years. The final estimates are presented in Table 3. We emphasize that the weights (q̄_0(v))_{v∈V} are far from being homogeneous.

6.2 Model selection procedure in action

We explain in Section 5.5 how the best model index k̂ (with related best model Θ_k̂) is obtained in a pre-determined collection K of sub-model indices (with related sub-models Θ_k, k ∈ K). The latter collections are constructed by recursion, as presented below.

We first initialize Θ_0 = Θ and K_{−1} = ∅ with convention max ∅ = 0.

At a given step ν ≥ 0, a sub-model Θ_ν is defined as a subset of Θ meeting ν independent


one-dimensional constraints on M ∈ M (i.e., constraints of the type M_{k,l−1} = M_{k,l} for some k = 1, 2, 3 and l = 1, 2, 3, with convention M_{k,0} = 0). Start with c(ν + 1) = −∞ and Θ_{ν+1} = ∅. The following rule is applied to Θ_{ν+1}:

Rule 1. For every possible sub-model Θ′ ⊂ Θ_ν derived from Θ_ν by adding another one-dimensional constraint on M as described above (all such models share the same dimension), evaluate the corresponding maximum log-likelihood criterion

l(Θ′) = max_{θ∈Θ′} P_n l̃(·|θ).

If l(Θ′) ≥ c(ν + 1), update c(ν + 1) = l(Θ′) and Θ_{ν+1} = Θ′.

Applying Rule 1 as long as possible yields 7 sets Θ_ν, ν = 0, ..., 6. Their description is given in Table 4:

Θ_0 = Θ,
Θ_1 = {θ ∈ Θ_0 : M_{1,1} = 0}, (24)
Θ_2 = {θ ∈ Θ_1 : M_{2,1} = M_{2,2}}, (25)
Θ_3 = {θ ∈ Θ_2 : M_{2,2} = 1}, (26)
Θ_4 = {θ ∈ Θ_3 : M_{3,1} = M_{3,2}}, (27)
Θ_5 = {θ ∈ Θ_4 : M_{1,2} = 1}, (28)
Θ_6 = {θ ∈ Θ_5 : M_{3,1} = 0}. (29)

In words: (24) low probability does not differ from no exposure at all; (25) moreover, low and mild frequencies do not differ; (26) moreover, mild and high frequencies do not differ; (27) moreover, low and mild intensities do not differ; (28) moreover, mild and high probabilities do not differ; (29) moreover, low intensity does not differ from no exposure.

Table 4: Descriptions of Θ_0, ..., Θ_6. The collection of parameter sets is nested. For instance, Θ_3 is the set of those θ ∈ Θ such that M_{1,1} = 0 and M_{2,1} = M_{2,2} = 1. Regarding dimensions, it trivially holds that dim(Θ_ν) = 27 − ν for each 0 ≤ ν ≤ 6.

At a given step ν ≥ 0, a set K_ν of successive integers is defined. Start with K_ν = {max K_{ν−1} + 1} (a set initially containing a single element) and define Θ_{ν, max K_{ν−1}+1} = Θ_ν. The following second rule is applied to K_ν:

Rule 2. For every possible constraint “ϕ(θ) = 0” on θ ∈ Θ_ν of the form “α and β independent of W_l” for some l = 1, 2, 3 (the lth coordinate of W does not affect the value of the initial health and drift parameters h and µ, see (21) and (22)), update K_ν = K_ν ∪ {max K_ν + 1} and define Θ_{ν, max K_ν} = {θ ∈ Θ_ν : ϕ(θ) = 0}.

(Note that each Θ_ν therefore gives rise to 2³ = 8 sub-models Θ_{ν,l}.)

We apply Rule 2 for ν = 0, ..., 6, and finally define

K = ∪_{ν=0}^{6} K_ν = {1, 2, 3, ..., 56}.

For every k ∈ K there exists a unique ν = 0, ..., 6 such that Θ_{ν,k} is defined: setting Θ_k = Θ_{ν,k} concludes the definition of the collection {Θ_k : k ∈ K} of sub-models of interest.

The best model Θ_k̂ (according to our multi-fold likelihood-based cross-validation criterion) is a subset of Θ_2, featuring 16 degrees of freedom. Its complete description follows:

• the initial health parameter depends on W only through gender (hence not on the indicator of occurrence of lung cancer in close family);

• the drift parameter depends on W only through gender and lifetime tobacco use (hence not on the indicator of occurrence of lung cancer in close family);

• exposure to asbestos is significantly noxious; there is no difference between low probability and no exposure to asbestos at all (in view of (19), M_{1,1} = 0), and no difference either between low and mild frequencies (in terms of (19), M_{2,1} = M_{2,2}).

6.3 Fitting the best model

The best model Θ_k̂ ⊂ Θ_2 described in Section 6.2 is first fitted in terms of maximum likelihood on the whole dataset. Regarding the derivation of confidence intervals, we decide to rely on the bootstrap instead of a central limit theorem (such as Proposition 3). The particulars of the bootstrap procedure follow. We set α = 2.5%, B = 1,000 and p = 5%; then, for b ranging from 1 to B, we repeatedly resample without replacement n(1 − p) = 817 observed data structures, yielding the bootstrapped empirical measure P^b_{n(1−p)}, in order to compute and store the corresponding maximum likelihood estimate θ_{n(1−p),k̂}(P^b_{n(1−p)}) of θ ∈ Θ_k̂. The mean and median values of θ^B_{n,k̂} = {θ_{n(1−p),k̂}(P^b_{n(1−p)}) : b ≤ B} only very slightly differ from each other (moreover, they are very close to the maximum likelihood estimate θ_{n,k̂}(P_n) computed on the whole dataset). The componentwise α/16- and (1 − α/16)-quantiles of θ^B_{n,k̂} are used as lower- and upper-bounds of confidence intervals, which simultaneously provide a (1 − 2α) = 95%-coverage by the applied Bonferroni correction. Specifically:

• initial health:

W_1 | h
0 | 23.82 [23.42; 24.13]
1 | 25.09 [24.86; 25.40]

It is seen in particular that women are associated with a significantly larger initial health than men.

• drift:

−100µ
W_3 | W_1 = 0 | W_1 = 1
0 | 0.69 [0.08; 1.46] | 0.02 [0.01; 0.03]
1 | 7.70 [6.91; 8.28] | 6.63 [5.73; 7.68]
2 | 13.89 [13.25; 14.46] | 10.55 [9.63; 11.80]
3 | 17.67 [17.11; 18.38] | 14.79 [13.65; 17.77]

Two main features arise:


– For each level of lifetime tobacco use, the absolute value of the drift is significantly larger for men than for women (actually, the confidence intervals for W_3 = 3 slightly overlap). Combined with the already mentioned fact that women are associated with a larger initial health, this implies that, for any given history of exposure to asbestos and for every level of lifetime tobacco use, the distribution of age at incident lung cancer in men is stochastically dominated by the distribution of age at incident lung cancer in women. In other words, given a man and a woman sharing the same history of exposure to asbestos and lifetime tobacco use, and given an age t, the man is more likely to have developed an incident lung cancer at age t than the woman.

Note that there is no clear consensus in the literature on whether there exist differences in lung cancer risk between men and women or not (for instance, it is argued in [16] that women are more susceptible to tobacco carcinogens, but it is seen in [5] that either men or women are more susceptible to tobacco carcinogens, depending on ethnic and racial group).

– Both in men and women, the absolute value of the drift significantly increases with lifetime tobacco use. This implies that, both in men and women, for any given history of exposure to asbestos and for every 0 ≤ w < w′ ≤ 3, the distribution of age at incident lung cancer for lifetime tobacco use equal to w′ is stochastically dominated by the distribution of age at incident lung cancer for lifetime tobacco use equal to w.
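The gender comparison can be illustrated numerically from the cdf F of Section 5.1 and the fitted point estimates above (here for W_3 = 1 and no occupational exposure to asbestos, so that R is the identity). The values of h and µ below are the rounded estimates, used purely for illustration: a smaller h and a larger |µ| place the men's cdf above the women's at every age.

```python
# Numerical illustration of the stochastic comparison between men and women:
# with a smaller initial health h and a larger |mu|, the cdf of the age at
# incident lung cancer for men lies above that for women at every age.
import math

def Phi(x):                        # standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def F(t, h, mu):                   # inverse Gaussian cdf from Section 5.1
    s = math.sqrt(t)
    return 1.0 + math.exp(-2.0 * h * mu) * Phi((mu * t - h) / s) - Phi((mu * t + h) / s)

h_men, mu_men = 23.82, -0.0770     # rounded fitted values, W1 = 0, W3 = 1
h_women, mu_women = 25.09, -0.0663 # rounded fitted values, W1 = 1, W3 = 1

for age in (40, 50, 60, 70, 80):
    assert F(age, h_men, mu_men) > F(age, h_women, mu_women)
```

This is only a pointwise check at a few ages; the dominance itself follows from the monotonicity of the hitting time in h and |µ|.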
In other words, given two persons sharing the same gender and history of exposure to asbestos, the person with the larger lifetime tobacco use is more likely to have developed an incident lung cancer at age t than the other. This is in agreement with the general scientific consensus [3].

• exposure to asbestos:

M_0: 1.19 [0.34; 2.00]
M_{1,1} = 0 | M_{1,2}: 0.97 [0.96; 0.99] | M_{1,3} = 1
M_{2,1} = M_{2,2} | M_{2,2}: 0.93 [0.90; 0.98] | M_{2,3} = 1
M_{3,1}: 0.02 [0.00; 0.09] | M_{3,2}: 0.09 [0.00; 0.27] | M_{3,3} = 1

We notably derive from the above table the values of (M(ε) − 1) (which can be interpreted as a factor of acceleration of time due to an exposure of level ε, see (20)) and related confidence intervals for each level of exposure ε ∈ E \ {0}; see Table 5.

6.4 Application to the maximal number of years of life guaranteed free of lung cancer

In view of Section 5.3 and the notion of maximal number of years of life guaranteed free of lung cancer, the results of the previous section provide us with a way of evaluating the latter number on a case-by-case basis. Arguably, we mostly care for a pointwise estimation of, and a confidence lower-bound on, the maximal number of years of life guaranteed free of lung cancer.
In order to address this issue, let us compute a counterpart of Table 5 based on the componentwise 2α/5-quantiles of θ^B_{n,k̂} (which simultaneously provide a (1 − 2α) = 95%-coverage for the parameter M on its own by the applied Bonferroni correction, since M has 5 degrees of freedom there); see Table 6.

ε | M(ε) − 1 | ε | M(ε) − 1 | ε | M(ε) − 1
111 | 0 | 211 | 0.026 [0.000; 0.171] | 311 | 0.026 [0.000; 0.173]
112 | 0 | 212 | 0.092 [0.001; 0.530] | 312 | 0.094 [0.001; 0.537]
113 | 0 | 213 | 1.078 [0.297; 1.939] | 313 | 1.108 [0.309; 1.964]
121 | 0 | 221 | 0.026 [0.000; 0.171] | 321 | 0.026 [0.000; 0.173]
122 | 0 | 222 | 0.092 [0.001; 0.530] | 322 | 0.094 [0.001; 0.537]
123 | 0 | 223 | 1.078 [0.297; 1.939] | 323 | 1.108 [0.309; 1.964]
131 | 0 | 231 | 0.027 [0.000; 0.174] | 331 | 0.028 [0.000; 0.176]
132 | 0 | 232 | 0.099 [0.001; 0.539] | 332 | 0.101 [0.001; 0.546]
133 | 0 | 233 | 1.159 [0.330; 1.971] | 333 | 1.192 [0.344; 1.998]

Table 5: Estimated values (with precision 10⁻³) of the factor of acceleration of time (M(ε) − 1) and related confidence intervals for each level of exposure ε ∈ E \ {0}. Recall that M(0) = 1.

ε | M(ε) − 1 | ε | M(ε) − 1 | ε | M(ε) − 1
111 | 0 | 211 | 0.026 [0.001; ∞) | 311 | 0.026 [0.001; ∞)
112 | 0 | 212 | 0.092 [0.004; ∞) | 312 | 0.094 [0.004; ∞)
113 | 0 | 213 | 1.078 [0.374; ∞) | 313 | 1.108 [0.389; ∞)
121 | 0 | 221 | 0.026 [0.001; ∞) | 321 | 0.026 [0.001; ∞)
122 | 0 | 222 | 0.092 [0.004; ∞) | 322 | 0.094 [0.004; ∞)
123 | 0 | 223 | 1.078 [0.374; ∞) | 323 | 1.108 [0.389; ∞)
131 | 0 | 231 | 0.027 [0.001; ∞) | 331 | 0.028 [0.002; ∞)
132 | 0 | 232 | 0.099 [0.004; ∞) | 332 | 0.101 [0.004; ∞)
133 | 0 | 233 | 1.159 [0.414; ∞) | 333 | 1.192 [0.431; ∞)

Table 6: Estimated values (with precision 10⁻³) of the factor of acceleration of time (M(ε) − 1) and related one-sided (right-open) confidence intervals for each level of exposure ε ∈ E \ {0}. Recall that M(0) = 1. A Bonferroni correction ensures that the confidence regions simultaneously guarantee a (1 − 2α) = 95%-coverage (for {M(ε) − 1 : ε ∈ E} on its own).
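For concreteness, the parametrization (20) and the per-period arithmetic used in the examples of this subsection can be sketched as follows. The factors are the rounded point estimates of Section 6.3 and Table 5 (so values recomputed from (20) may differ from the tables in the last digits), and `accel_rate`/`years_free` are illustrative helpers, not code from the study.

```python
# Acceleration rates (20): M(eps) = 1 + M0 * M1[eps1] * M2[eps2] * M3[eps3],
# with the rounded point estimates of Section 6.3 (illustrative values only).
M0 = 1.19
M1 = {1: 0.0, 2: 0.97, 3: 1.0}    # "probability" dimension (M_{1,1} = 0)
M2 = {1: 0.93, 2: 0.93, 3: 1.0}   # "frequency" dimension (M_{2,1} = M_{2,2})
M3 = {1: 0.02, 2: 0.09, 3: 1.0}   # "intensity" dimension

def accel_rate(eps):
    """M(eps) for eps = (eps1, eps2, eps3); M(0) = 1 by convention."""
    if eps == 0:
        return 1.0
    e1, e2, e3 = eps
    return 1.0 + M0 * M1[e1] * M2[e2] * M3[e3]

assert accel_rate(0) == 1.0
assert abs(accel_rate((3, 3, 3)) - 1.0 - M0) < 1e-12   # cap reached at (3,3,3)
assert accel_rate((1, 2, 3)) == 1.0                    # low probability = none

# Per-period evaluation of the maximal number of years of life guaranteed free
# of lung cancer: sum over employment periods of duration * (M(eps) - 1).
# Here we plug in the Table 5 factors directly, to match the worked examples.
factor = {(3, 3, 2): 0.101, (3, 2, 2): 0.094, (1, 2, 1): 0.0,
          (2, 2, 2): 0.092, (2, 1, 3): 1.078, (2, 2, 3): 1.078}

def years_free(history):
    """history: list of (duration in years, exposure level) pairs."""
    return sum(d * factor[eps] for d, eps in history)

# the three examples of Section 6.4 (point estimates)
assert abs(years_free([(30, (3, 3, 2))]) - 3.03) < 1e-9
assert abs(years_free([(10, (3, 2, 2)), (5, (1, 2, 1)), (2, (2, 2, 2))]) - 1.124) < 1e-9
assert abs(years_free([(10, (2, 1, 3)), (15, (2, 2, 3))]) - 26.95) < 1e-9
```

The confidence lower-bounds of the examples are obtained the same way, by substituting the Table 6 lower bounds for the Table 5 point estimates.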


Elementary algebra permits to compute an evaluation c(t, ā(t)) of, and a confidence lower-bound c⁻(t, ā(t)) on, the maximal number of years of life guaranteed free of lung cancer for any couple (t, ā(t)) of age t at incident lung cancer and history ā(t) of occupational exposure to asbestos till t. Let us consider three examples:

• Consider a case of incident lung cancer at age t who spent, till that age, 30 years with an occupational exposure to asbestos ε = 332: one evaluates a maximal number of c(t, ā(t)) = 30 × 0.101 = 3.03 years of life guaranteed free of lung cancer, with its 95%-confidence lower bound c⁻(t, ā(t)) = 30 × 0.004 = 0.12 years (approximately 44 days). This is quite an extreme case, since only 3 out of the 8,432 employments described in the dataset achieve the description ε = 332.

• Consider a case of incident lung cancer at age t who spent, till that age, 10 years (then later 5 years and 2 years) with an occupational exposure to asbestos ε = 322 (then later ε = 121 and ε = 222): one evaluates a maximal number of c(t, ā(t)) = 10 × 0.094 + 5 × 0 + 2 × 0.092 = 1.124 years of life guaranteed free of lung cancer, with its 95%-confidence lower bound c⁻(t, ā(t)) = 10 × 0.004 + 5 × 0 + 2 × 0.004 = 0.048 years (approximately 17.5 days). Note that 150, 36 and 189 out of the 8,432 employments described in the dataset achieve the descriptions ε = 121, ε = 222 and ε = 322.

• Consider a case of incident lung cancer at age t who spent, till that age, 10 years (then later 15 years) with an occupational exposure to asbestos ε = 213 (then later ε = 223): one evaluates a maximal number of c(t, ā(t)) = 10 × 1.078 + 15 × 1.078 = 26.95 years of life guaranteed free of lung cancer, with its 95%-confidence lower bound c⁻(t, ā(t)) = 10 × 0.374 + 15 × 0.374 = 9.350 years. This is quite an extreme case, since only 6 and 3 out of the 8,432 employments described in the dataset achieve the descriptions ε = 213 and ε = 223.

Among the n = 860 cases of our dataset, only 259 (i.e., 30%) cases are associated with a positive maximal number of years of life guaranteed free of lung cancer. We report in Table 7 the quartiles, mean and extreme values of the maximal number of years of life guaranteed free of lung cancer as computed on those 259 cases.

| min. | 25% | 50% | mean | 75% | max.
max. number of years free of lung cancer | 0.026 | 0.289 | 0.769 | 2.467 | 2.408 | 36.577
95%-lower bound | 0.001 | 0.014 | 0.037 | 0.555 | 0.102 | 12.832

Table 7: Quartiles, mean and extreme values of the maximal number of years of life guaranteed free of lung cancer and the corresponding 95%-confidence lower-bound (with precision 10⁻³), as computed on those 259 cases (i.e., 30% of all cases) for whom the evaluated maximal number of years of life guaranteed free of lung cancer is positive.

• The maximum value is reached by a male who accumulated through his professional life a total of 33 years with occupational exposure to asbestos equal to ε = 313 and was diagnosed a lung cancer at 70 years old. Although this is not relevant as far as the evaluation of the potential years of life free of lung cancer is concerned, his lifetime tobacco use equals 45 pack-years.

• The minimum value is reached by 4 women who accumulated through their professional lives a total of 1 year with occupational exposure to asbestos ε ∈ {211, 221} and were diagnosed a lung cancer at 51 (for two of them), 59 and 68 years old. Although this is not relevant as far as the evaluation of the potential years of life free of lung cancer is concerned, their lifetime tobacco uses equal 25, 30, 32 and 55 pack-years.

• The median value is reached by a man who accumulated through his professional life a total of 4 years (respectively, 5 and 7) with occupational exposure to asbestos equal to ε = 111 (respectively, ε = 211 and ε = 212) and was diagnosed a lung cancer at 71 years old. Although this is not relevant as far as the evaluation of the potential years of life free of lung cancer is concerned, his lifetime tobacco use equals 55 pack-years.

We represent in Figure 1 the empirical cdf of the maximal number of years of life guaranteed free of lung cancer (and the corresponding 95%-confidence lower bounds) for the 259 cases for whom it is positive.

7 Discussion

We have developed a collection of threshold regression models (see Section 5), and have data-adaptively selected a better model in it by relying on multi-fold likelihood-based cross-validation (see Section 6.2 for the descriptions of the model selection procedure and the derived better model). The latter better threshold regression model has been fitted by maximum likelihood, and bootstrapped confidence intervals have been obtained (see Section 6.3).
The statistical procedure has been adjusted in order to eliminate the bias induced by the case-control sampling design used to collect the dataset. This necessary preliminary step was made possible because the probability distribution of being a case in the population of interest can be computed beforehand based on an independent study (see Section 6.1). We have discussed the implications of the fitted threshold regression model in terms of the notion of maximal number of years of life guaranteed free of lung cancer which is naturally attached to it (see Section 6.4).

We believe that, even though they cannot be interpreted causally, the results presented in this article contribute significantly to the quantitative understanding of how an occupational exposure to asbestos is related to an increased risk of lung cancer, and to the evaluation, in subjects suffering from a lung cancer, of how much the amount of exposure to asbestos explains the occurrence of the cancer.

We finally acknowledge a limitation of the approach undertaken in this article: the link between the occupational exposure to asbestos and age at incident lung cancer is well defined in the context of the proposed threshold regression models, but we do not extend it beyond them. The parameter we aim for is therefore difficult to comprehend (it is related to the Kullback-Leibler projection of the true distribution of the data onto a threshold regression model), and the inference procedure certainly fails to estimate optimally/efficiently what we really care about, which would be a measure of the strength of the link between the occupational exposure to asbestos and age at incident lung cancer defined non- or semiparametrically. We intend to go further in that direction in future work.
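The discussion refers to multi-fold likelihood-based cross-validation for model selection. A generic sketch of that idea, under stated assumptions (this is not the article's code; each candidate model is represented here as a hypothetical pair of a fitting routine and a log-likelihood function):

```python
# Sketch of multi-fold likelihood-based cross-validation for model selection.
# A candidate model is a pair (fit, loglik): `fit(data)` returns fitted
# parameters and `loglik(params, data)` the log-likelihood of `data` under
# those parameters. Both are illustrative assumptions, not the article's API.

def cross_validated_loglik(candidate, folds):
    fit, loglik = candidate
    total = 0.0
    for k, validation in enumerate(folds):
        # Fit on all folds but the k-th, evaluate on the held-out k-th fold.
        training = [o for j, fold in enumerate(folds) if j != k for o in fold]
        total += loglik(fit(training), validation)
    return total

def select_model(candidates, folds):
    # Pick the candidate maximizing the cross-validated log-likelihood.
    return max(candidates, key=lambda c: cross_validated_loglik(c, folds))
```

As a toy usage example, with Gaussian-type candidates `(fit, loglik)` where `fit` returns the training mean and `loglik` the negative sum of squared residuals, `select_model` prefers the candidate whose fitted mean tracks the held-out data.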


A Appendix: elements of proof

Proof of Proposition 1. On one hand, note that
$$
E_{P_0}\, q_0 \log p^\star(V^1, O^{1\star})
= \int q_0 \log p^\star(v^1, o^{1\star})\, dP_0^\star(v^1, o^{1\star} \mid y = 1)
= \int \log p^\star(v^1, o^{1\star})\, dP_0^\star(v^1, o^{1\star}, y = 1)
= \int \log p^\star(o^\star)\, dP_0^\star(o^\star, y = 1). \quad (30)
$$
On the other hand, for each $j \leq J$,
$$
E_{P_0}\, \bar q_0(V^1) \log p^\star(V^1, O^{0,j\star})
= E_{P_0}\, \bar q_0(V^1)\, E_{P_0}\big[\log p^\star(V^1, O^{0,j\star}) \mid V^1\big]
= E_{P_0}\, \bar q_0(V^1) \int \log p^\star(V^1, o^\star)\, dP_0^\star(o^\star \mid V^1, y = 0)
= \int \bar q_0(v^1) \log p^\star(v^1, o^\star)\, dP_0^\star(o^\star \mid v^1, y = 0)\, dP_0(v^1).
$$
Furthermore, for each $v \in \mathcal{V}$, $dP_0(v) = dP_0^\star(v \mid y = 1) = q_0(v \mid 1)\,\delta_v(v)$ (we use the same shorthand notation as in (4), (5), (6), (7), and denote by $\delta_v$ the Dirac mass at $v$), hence
$$
\bar q_0(v)\, dP_0(v) = q_0\, \frac{q_0(0 \mid v)}{q_0(1 \mid v)}\, q_0(v \mid 1)\,\delta_v = q_0(0 \mid v)\, P_0^\star(v)\,\delta_v(v) = dP_0^\star(v, y = 0).
$$
Consequently, we obtain
$$
E_{P_0}\, \bar q_0(V^1) \log p^\star(V^1, O^{0,j\star})
= \int \log p^\star(v^1, o^\star)\, dP_0^\star(o^\star \mid v^1, y = 0)\, dP_0^\star(v^1, y = 0)
= \int \log p^\star(v^1, o^\star)\, dP_0^\star(v^1, o^\star, y = 0)
= \int \log p^\star(o^\star)\, dP_0^\star(o^\star, y = 0) \quad (31)
$$
(which does not depend on $j$). Combining (30) and (31) finally yields
$$
E_{P_0}\, l(O \mid p^\star) = \int \log p^\star(o^\star)\, dP_0^\star(o^\star) = E_{P_0^\star} \log p^\star(O^\star).
$$
The conclusion is straightforward, because
$$
E_{P_0^\star} \log p^\star(O^\star) - E_{P_0^\star} \log p_0^\star(O^\star) = -\mathrm{KL}(p_0^\star, p^\star),
$$
the opposite of the Kullback-Leibler divergence between $p_0^\star$ and $p^\star$, which is positive for $p^\star \neq p_0^\star$ and equals zero otherwise.

Figure 1: Empirical distributions of the maximal number of years of life guaranteed free of lung cancer and of the related confidence lower bound. The rightmost curve with bullets (respectively, the leftmost curve with triangles) represents the empirical cdf of the maximal number of years guaranteed free of lung cancer (respectively, of the 95%-confidence lower bound on that number) for those cases for whom it is positive, that is, the empirical cdf of $\{c(T_i^1, \bar A^1(T_i^1)) : c(T_i^1, \bar A^1(T_i^1)) > 0,\ i \leq n\}$ (respectively, $\{c^-(T_i^1, \bar A^1(T_i^1)) : c(T_i^1, \bar A^1(T_i^1)) > 0,\ i \leq n\}$). Only 30% of the cases are concerned. The x-axis scale is logarithmic.

Proof of Proposition 3. The expansion (11) and the related distributional limit result are a consequence of Theorem 5.23 in [14]. The fact that $S_{\theta_0} = E_{P_0^\star} \ddot l^\star_{\theta_0}(O^\star)$ is obtained by adapting


slightly the proof of Proposition 1. Regarding $E_{P_0}[\dot{\tilde l}(O \mid \theta_0)\, \dot{\tilde l}(O \mid \theta_0)^\top]$, let us abbreviate $xx^\top$ to $x^2$ and note that
$$
\dot{\tilde l}(O \mid \theta_0)\, \dot{\tilde l}(O \mid \theta_0)^\top
= \Big[ q_0\, \bar q_0(O^{1\star})\, \dot l^\star_{\theta_0}(V^1, O^{1\star})^2
+ \bar q_0(V^1) \Big( \tfrac1J \textstyle\sum_j \bar q_0(O^{0,j\star})\, \dot l^\star_{\theta_0}(V^1, O^{0,j\star}) \Big)^2 \Big]
+ \Big[ q_0\, \bar q_0(V^1)\, \dot l^\star_{\theta_0}(V^1, O^{1\star}) \Big( \tfrac1J \textstyle\sum_j \bar q_0(O^{0,j\star})\, \dot l^\star_{\theta_0}(V^1, O^{0,j\star}) \Big)^\top
+ q_0\, \bar q_0(V^1) \Big( \tfrac1J \textstyle\sum_j \bar q_0(O^{0,j\star})\, \dot l^\star_{\theta_0}(O^{0,j\star}) \Big)\, \dot l^\star_{\theta_0}(O^{1\star})^\top \Big].
$$
The $P_0$-expected value of the first term between brackets is $E_{P_0^\star} \ddot l^\star_{\theta_0}(O^\star)$, as another simple adaptation of the proof of Proposition 1 straightforwardly yields. Moreover,
$$
E_{P_0}\, q_0\, \bar q_0(V^1)\, \dot l^\star_{\theta_0}(V^1, O^{1\star}) \Big( \tfrac1J \textstyle\sum_j \bar q_0(O^{0,j\star})\, \dot l^\star_{\theta_0}(V^1, O^{0,j\star}) \Big)^\top
= E_{P_0}\Big[ q_0\, \bar q_0(V^1)\, E_{P_0}\Big( \dot l^\star_{\theta_0}(V^1, O^{1\star}) \Big( \tfrac1J \textstyle\sum_j \bar q_0(O^{0,j\star})\, \dot l^\star_{\theta_0}(V^1, O^{0,j\star}) \Big)^\top \,\Big|\, V^1 \Big) \Big]
= E_{P_0}\Big[ q_0\, \bar q_0(V^1)\, E_{P_0}\big( \dot l^\star_{\theta_0}(V^1, O^{1\star}) \mid V^1 \big)\, E_{P_0}\Big( \tfrac1J \textstyle\sum_j \bar q_0(O^{0,j\star})\, \dot l^\star_{\theta_0}(V^1, O^{0,j\star}) \,\Big|\, V^1 \Big)^\top \Big]
$$
by conditional independence. Denote by $\Pi = E_{P_0}(\dot l^\star_{\theta_0}(V^1, O^{1\star}) \mid O^{1\star} \setminus Z^1)$ the conditional expectation of $\dot l^\star_{\theta_0}(V^1, O^{1\star})$ given every component of $O^{1\star}$ but $Z^1$, that is, given $\Omega^1$ (compatible with $V^1$). The projection $\Pi$ can be written as a measurable function of $\Omega^1$ times
$$
\int \dot l^\star_{\theta_0}(z, \Omega^1)\, p^\star_{\theta_0}(z \mid \Omega^1)\, dz = \int \frac{\partial p^\star_\theta(z \mid \Omega^1)}{\partial \theta}\Big|_{\theta = \theta_0}\, dz = 0,
$$
provided that the order of differentiation and integration can be reversed. This is ensured by the stated constraint on the derivatives of $p^\star_\theta(z \mid \Omega^1)$ with respect to $\theta$. Consequently, the $P_0$-expected value of the second term between brackets in the first display is zero, hence the validity of the alternative version of $\Sigma$. The conclusion simply follows from another application of Theorem 5.23 in [14] in the classical iid framework associated with $P_1^\star$.

References

[1] IARC monographs on the evaluation of the carcinogenic risk of chemicals to man: asbestos, volume 14. IARC, 1977.

[2] A. Belot, P. Grosclaude, N. Bossard, E. Jougla, E. Benhamou, P. Delafosse, A. V. Guizard, F. Molinié, A. Danzon, S. Bara, A. M. Bouvier, B. Trétarre, F. Binder-Foucard, M. Colonna, L. Daubisse, G. Hédelin, G. Launoy, N. Le Stang, M. Maynadié, A. Monnereau, X. Troussard, J. Faivre, A. Collignon, I. Janoray, P. Arveux, A. Buemi, N. Raverdy, C. Schvartz, M. Bovet, L. Chérié-Challine, J. Estève, L. Remontet, and M. Velten. Cancer incidence and mortality in France over the period 1980-2005. Rev. Epidemiol. Santé Publique, 56(3), 2008. Detailed results and comments [online] http://www.invs.sante.fr/surveillance/cancers/estimations cancers/default.htm.

[3] H. K. Biesalski, B. B. de Mesquita, A. Chesson, F. Chytil, R. Grimble, R. J. Hermus, J. Kohrle, R. Lotan, K. Norpoth, U. Pastorino, and D. Thurnham. European consensus statement on lung cancer: risk factors and prevention. Lung cancer panel. CA Cancer J. Clin., 48(3):167–176, 1998.

[4] R. S. Chhikara and J. L. Folks. The inverse Gaussian distribution: theory, methods and applications. Marcel Dekker: New York, 1989.

[5] C. A. Haiman, D. O. Stram, L. R. Wilkens, M. C. Pike, L. N. Kolonel, B. E. Henderson, and L. Le Marchand. Ethnic and racial differences in the smoking-related risk of lung cancer. N. Engl. J. Med., 354(4):333–342, 2006.

[6] M.-L. T. Lee and G. A. Whitmore. Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Statist. Sci., 21(4):501–513, 2006.

[7] M.-L. T. Lee and G. A. Whitmore. Proportional hazards and threshold regression: their theoretical and practical connections. Lifetime Data Anal., 16(2):196–214, 2010.

[8] P. Morfeld. Years of life lost due to exposure: causal concepts and empirical shortcomings. Epidemiol. Perspect. Innov., 1(1), 2004.

[9] J.-C. Pairon, B. Legal-Régis, J. Ameille, J.-M. Brechot, B. Lebeau, D. Valeyre, I. Monnet, M. Matrat, B. Chamming's, and S. Housset. Occupational lung cancer: a multicentric case-control study in Paris area. European Respiratory Society, 19th Annual Congress, Vienna, 2009.

[10] J. Robins and S. Greenland. The probability of causation under a stochastic model for individual risk. Biometrics, 45(4):1125–1138, 1989.

[11] J. Robins and S. Greenland. Estimability and estimation of expected years of life lost due to a hazardous exposure. Stat. Med., 10(1):79–93, 1991.

[12] S. Rose and M. J. van der Laan. Simple optimal weighting of cases and controls in case-control studies. Int. J. Biostat., 4:Art. 19, 24, 2008.

[13] M. J. van der Laan. Estimation based on case-control designs with known incidence probability. U.C. Berkeley Division of Biostatistics Working Paper Series, 2008. Paper 234.

[14] A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.

[15] A. W. van der Vaart, S. Dudoit, and M. J. van der Laan. Oracle inequalities for multi-fold cross validation. Statist. Decisions, 24(3):351–371, 2006.

[16] E. A. Zang and E. L. Wynder. Differences in lung cancer risk between men and women: examination of the evidence. J. Natl. Cancer Inst., 88(3-4):183–192, 1996.


hal-00582753, version 1 - 4 Apr 2011

Estimation and testing in targeted group sequential covariate-adjusted randomized clinical trials

A. Chambaz¹ and M. J. van der Laan²
¹ MAP5, Université Paris Descartes and CNRS
² University of California, Berkeley

April 4, 2011

Abstract

This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials analyzed through the prism of the semiparametric methodology of targeted maximum likelihood estimation (TMLE). We show how to build, as the data accrue group-sequentially, a sampling design which targets a user-supplied optimal design. We also show how to carry out a sound TMLE statistical inference based on such an adaptive sampling scheme (thereby extending some results known so far in the i.i.d. setting only), and how group-sequential testing applies on top of it. The procedure is robust (i.e., consistent even if the working model is misspecified). A simulation study confirms the theoretical results, and validates the conjecture that the procedure may also be efficient.

Keywords: adaptive design; asymptotic normality; canonical distribution; clinical trial; contiguity; group-sequential testing; robustness; targeted maximum likelihood methodology.

1 Introduction

This article is devoted to the construction and asymptotic study of adaptive group sequential covariate-adjusted randomized clinical trials (RCTs) analyzed through the prism of the targeted maximum likelihood estimation (TMLE) methodology. The considered observed data structure writes as O = (W, A, Y), where W is a vector of baseline covariates, A is the treatment assignment and Y the primary outcome of interest.
We assume that A is binary and that Y is one-dimensional. Typical parameters of scientific interest are Ψ⁺ = E{E(Y|A = 1, W) − E(Y|A = 0, W)} (additive scale, which we consider hereafter) or Ψ× = log E{E(Y|A = 1, W)} − log E{E(Y|A = 0, W)} (multiplicative scale). Such parameters can be interpreted causally whenever one is willing to assume the existence of a full data structure X = (W, Y(0), Y(1)) containing the vector of baseline covariates and the two counterfactual outcomes under the two possible treatments, and such that Y = Y(A) and A is conditionally independent of X given W. If so indeed, Ψ⁺ = EY(1) − EY(0) and Ψ× = log EY(1) − log EY(0). Let us now explain what we mean by adaptive group sequential covariate-adjusted RCT.

Adaptive group sequential covariate-adjusted RCT.

By adaptive covariate-adjusted design, we mean in the setting of this article a clinical trial design that allows the investigator to dynamically modify its course through data-driven adjustment of the randomization probability based on data accrued so far, without negatively impacting the statistical integrity of the trial. Moreover, the patient's baseline covariates may be taken into account for the random treatment assignment. This definition is slightly adapted from (Golub, 2006).

∗ This collaboration was initiated while Antoine Chambaz was a visiting scholar at UC Berkeley, supported in part by a Fulbright Research Grant and the CNRS.
In particular, we assume, with a view to the definition of pre-specified sampling plans given in (Emerson, 2006), that, prior to collection of the data, the trial protocol specifies the parameter of scientific interest, the inferential method and the confidence level to be used when constructing a confidence interval for the latter parameter.

Furthermore, we assume that the investigator specifies beforehand in the trial protocol a criterion of special interest which yields a notion of optimal randomization scheme that we therefore wish to target. For instance, the criterion could translate the necessity to minimize the number of patients assigned to their corresponding inferior treatment arm, subject to level and power constraints. Or the criterion could translate the necessity that a result be available as quickly as possible, subject to level and power constraints. The sole restriction on the criterion is that it must yield an optimal randomization scheme which can be approximated from the data accrued so far. The two examples above comply with this restriction.

We focus in this article on targeted maximum likelihood estimation, as introduced by van der Laan and Rubin (2006) in the independent identically distributed (i.i.d.) setting. The extension to the setting of adaptive RCTs was first considered in (van der Laan, 2008), upon which this article relies. In addition, we choose to consider specifically the second criterion cited above. Consequently, the optimal randomization scheme is the so-called Neyman allocation, which minimizes the asymptotic variance of the TMLE of the parameter of interest.
We emphasize that there is nothing special about targeting the Neyman allocation, the whole methodology applying equally well to a large class of optimal randomization schemes derived from a variety of valid criteria.

By adaptive group sequential design, we refer to the possibility to adjust the randomization scheme only by blocks of c patients, where c ≥ 1 is a pre-specified integer (the case c = 1 corresponds to a fully sequential adaptive design). The expression also refers to the fact that group sequential testing methods can be equally well applied on top of adaptive designs, an extension that we do not consider here. Although all our results (and their proofs) still hold for any c ≥ 1, we consider the case c = 1 in the theoretical part of the article for simplicity's sake, but the case c > 1 is considered in the simulation study.

Short bibliography.

The literature on adaptive designs is vast and our review is not comprehensive.
Quite misleadingly, the expression "adaptive design" has also been used in the literature for sequential testing and, in general, for designs that allow data-adaptive stopping times for the whole study (or for certain treatment arms) which achieve the wished type I and type II error requirements when testing a null hypothesis against its alternative.

Of course, data-adaptive randomization schemes have a long history that goes back to the 1930s, and we refer to Section 1.2 in (Hu and Rosenberger, 2006), Section 17.4 in (Jennison and Turnbull, 2000) and (Rosenberger, 1996) for a comprehensive historical perspective.

Many articles are devoted to the study of "response adaptive designs", an expression implicitly suggesting that those designs only depend on past responses of previous patients and not on the corresponding covariates. We refer to (Hu and Rosenberger, 2006; Chambaz and van der Laan, 2011) for a bibliography on that topic. On the contrary, covariate-adjusted response adaptive (CARA) randomizations tackle the so-called issue of heterogeneity (i.e., the use of covariates in adaptive designs), by dynamically calculating the allocation probabilities on the basis of previous responses and current and past values of certain covariates. In this view, this article studies a new type of CARA procedure. The interest in CARA procedures is more recent, and there is a steadily growing number of articles dedicated to their study, starting with (Rosenberger et al., 2001; Bandyopadhyay and Biswas, 2001), then (Atkinson and Biswas, 2005; Zhang et al., 2007; Zha; Shao et al., 2010) among others.
The latter articles are typically concerned with the convergence (almost sure and in law) of the allocation probabilities vector and of the estimator of the parameter in a correctly specified parametric model ((Shao et al., 2010) is devoted to the testing issue). By contrast, the consistency and asymptotic normality results that we obtain in this article are robust to model misspecification. They thereby contribute significantly to solving the question raised by the Food & Drug Administration (2006):


When is it valid to modify randomization based on results, for example, in a combined phase 2/3 cancer trial?

Finally, this article mainly relies on (Chambaz and van der Laan, 2009, 2011; van der Laan, 2008), the latter technical report paving the way to robust and more efficient estimation based on adaptive RCTs in a variety of other settings (including the case that the outcome Y is a possibly censored time-to-event).

Organization.

We first set the statistical framework in Section 2. The rationale of our adaptive covariate-adjusted designs is presented in Section 3, and we complete their formal definition in Section 4, where we also detail how the TMLE methodology operates. The asymptotic study is carried out in Section 5 (estimation) and in Section 6 (group sequential testing). The simulation study is developed in Section 7 (estimation) and in Section 8 (group sequential testing). The proofs are relegated to the Appendix.

Proposition 1 (efficient influence curve). The functional Ψ is pathwise differentiable at every P ∈ M relative to the maximal tangent space. The efficient influence curve of Ψ at P_{Q,g} ∈ M is characterized by
$$
D^\star(P_{Q,g})(O) = D^\star_1(Q)(W) + D^\star_2(P_{Q,g})(O), \quad (2)
$$
where
$$
D^\star_1(Q)(W) = \bar Q(1, W) - \bar Q(0, W) - \Psi(Q), \quad\text{and}\quad
D^\star_2(P_{Q,g})(O) = \frac{2A - 1}{g(A \mid W)}\,\big(Y - \bar Q(A, W)\big).
$$
The variance $\mathrm{Var}_P\, D^\star(P)(O)$ is the lower bound of the asymptotic variance of any regular estimator of Ψ(P) in the i.i.d. setting. Furthermore, even if Q ≠ Q₀,
$$
E_{P_0} D^\star(P_{Q,g})(O) = 0 \text{ implies } \Psi(Q) = \Psi(Q_0) \quad (3)
$$
when g = g(P₀).

2 Statistical framework

We tackle the asymptotic study of adaptive group sequential designs in the case of RCTs with covariate, binary treatment and one-dimensional primary outcome of interest.
Thus, the observed data structure writes as O = (W, A, Y), where W ∈ W consists of some baseline covariates, A ∈ A = {0, 1} denotes the assigned binary treatment, and Y ∈ Y is the primary outcome of interest. For example, Y can indicate whether the treatment has been successful or not (Y = {0, 1}); or Y can count the number of times an event of interest has occurred under the assigned treatment during a period of follow-up (Y = N); or Y can measure a quantity of interest after a given time has elapsed (Y = R). Although we will focus on the last case in this article, the methodology applies equally well to each example cited above.

Let us denote by P₀ the true distribution of the observed data structure O in the population of interest. We see P₀ as a specific element of the non-parametric set M of all possible observed data distributions. Note that, in order to avoid some technicalities, we assume (or rather: impose) that all elements of M are dominated by a common measure. The parameter of scientific interest is the marginal effect of treatment a = 1 relative to treatment a = 0 on the additive scale, or risk difference: ψ₀ = E_{P₀}{E_{P₀}[Y|A = 1, W] − E_{P₀}[Y|A = 0, W]}. Of course, other choices such as the log-relative risk (the counterpart of the risk difference on the multiplicative scale) could be considered, and dealt with along the same lines. The risk difference can be interpreted causally, for instance in the counterfactual framework.

For all P ∈ M, let us introduce the shorthand notation Q_W(P)(W) = P[W], g(P)(A|W) = P[A|W], Q_{Y|A,W}(P)(O) = P[Y|A, W]. We use the alternative notation P = P_{Q,g} with Q = Q(P) ≡ (Q_W(P), Q_{Y|A,W}(P)) and g = g(P). Equivalently, P_{Q,g} is the data-generating distribution such that Q(P_{Q,g}) = Q and g(P_{Q,g}) = g. In particular, we denote Q₀ = Q(P₀) = (Q_W(P₀), Q_{Y|A,W}(P₀)).
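The risk difference ψ₀ and the efficient influence curve of Proposition 1 both admit direct empirical counterparts. As an illustration only (a naive plug-in sketch, not the article's TMLE procedure; the function `qbar(a, w)` is an assumed, already-fitted estimate of E(Y|A = a, W = w), and all names are hypothetical):

```python
import random

# Sketch: naive plug-in ("G-computation") estimate of the risk difference
# psi_0 = E{E(Y|A=1,W) - E(Y|A=0,W)}, and the efficient influence curve
# D* = D*_1 + D*_2 of Proposition 1 evaluated at a fitted (qbar, g).

def plugin_risk_difference(qbar, W):
    return sum(qbar(1, w) - qbar(0, w) for w in W) / len(W)

def efficient_influence_curve(qbar, g, obs):
    """obs: list of (w, a, y); returns the list of D*(P_{Q,g})(O_i) values."""
    psi = plugin_risk_difference(qbar, [w for (w, _, _) in obs])
    values = []
    for w, a, y in obs:
        d1 = qbar(1, w) - qbar(0, w) - psi              # D*_1(Q)(W)
        d2 = (2 * a - 1) / g(a, w) * (y - qbar(a, w))   # D*_2(P_{Q,g})(O)
        values.append(d1 + d2)
    return values

# Toy check: known qbar, balanced randomization g = 1/2.
rng = random.Random(0)
qbar = lambda a, w: 0.2 * a + 0.1 * w
obs = []
for _ in range(1000):
    w = rng.gauss(0, 1)
    a = rng.randint(0, 1)
    obs.append((w, a, qbar(a, w) + rng.gauss(0, 0.1)))
print(plugin_risk_difference(qbar, [w for (w, _, _) in obs]))  # close to 0.2
```

With the toy `qbar`, the treatment effect is constant in W, so the plug-in estimate recovers 0.2, and the empirical mean of the influence-curve values is close to zero, consistent with (3).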
We also introduce the notation Q = {Q(P) : P ∈ M} for the non-parametric set of all possible values of Q, and G = {g(P) : P ∈ M} for the non-parametric set of all possible values of g. Setting Q̄(P)(A, W) = E_P[Y|A, W] and Q̄₀ = Q̄(P₀) (with a slight abuse, we also sometimes write Q̄(Q) instead of Q̄(P_{Q,g})), we define in greater generality
$$
\Psi(P) = E_P\big\{ \bar Q(P)(1, W) - \bar Q(P)(0, W) \big\} \quad (1)
$$
over the whole set M, so that ψ₀ equivalently writes as ψ₀ = Ψ(P₀). This notation also emphasizes the fact that Ψ(P) only depends on P through Q̄(P) and Q_W(P), justifying the alternative notation Ψ(P_{Q,g}) = Ψ(Q). The following proposition summarizes the most fundamental properties enjoyed by Ψ.

The implication (3) is the key to the robustness of the targeted maximum likelihood estimator introduced and studied in this article. It is another justification of our interest in the pathwise differentiability of the functional Ψ and its efficient influence curve.

3 Data generating mechanism for adaptive design

The purpose of adaptive group sequential design as we consider it in this article is to adjust the randomization scheme as the data accrue. We first formally describe in Section 3.1 the data generating mechanism through the expression of the likelihood function and also in terms of causal graphs.
Then we discuss the general issue of choosing an optimal design to target in Section 3.2, before specifically describing an optimal design of interest in Section 3.3 and showing how to target it in Section 3.4.

3.1 Data generating mechanism and related likelihood

In order to formally describe the data generating mechanism, we need to state a starting assumption: during the course of the clinical trial, it is possible to recruit the patients independently from a stationary population. In the counterfactual framework, this is equivalent to supposing that it is possible to sample as many independent copies of the full data structure as required.

Let us denote by O_i = (W_i, A_i, Y_i) the ith observed data structure. We also find it convenient to introduce O_n = (O_1, ..., O_n) and, for every i = 0, ..., n, O_n(i) = (O_1, ..., O_i) (with the convention O(0) = ∅). We denote by O the set where the observed data structure O takes its values.

By adjusting the randomization scheme as the data accrue, we mean that the nth treatment assignment A_n is drawn from g_n(·|W_n), where g_n(·|W) is a conditional distribution (or treatment mechanism) given the covariate W which additionally depends on the past observations O_{n−1}. Since the sequence of treatment mechanisms cannot reasonably grow in complexity as the sample size increases, we will only consider data-adaptive treatment mechanisms such that g_n(·|W) depends on O_{n−1} only through a finite-dimensional summary measure Z_n = φ_n(O_{n−1}), where the measurable function φ_n maps O_{n−1} onto R^d for some fixed d ≥ 0 (d = 0 corresponds to the case that g_n(·|W) actually does not adapt).
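For concreteness, here is a sketch of one possible finite-dimensional summary measure (an illustrative variant tracking the within-arm empirical means; the names and the data layout are assumptions, not the article's code):

```python
# Sketch: a finite-dimensional summary measure Z_{n+1} = phi_{n+1}(O_n) with
# d = 2, keeping track of the mean outcome observed so far in each treatment
# arm. The next treatment mechanism g_{n+1} may depend on the past only
# through this pair of numbers.

def summary_measure(past):
    """past: list of observations (w, a, y).
    Returns (mean Y in arm 0, mean Y in arm 1), with 0.0 for an empty arm."""
    means = []
    for arm in (0, 1):
        ys = [y for (_, a, y) in past if a == arm]
        means.append(sum(ys) / len(ys) if ys else 0.0)
    return tuple(means)

past = [(0.3, 0, 1.0), (0.7, 1, 2.0), (0.1, 0, 3.0)]
print(summary_measure(past))  # (2.0, 2.0)
```

The point of the restriction is visible here: however many observations accrue, the treatment mechanism only ever consults a vector of fixed dimension d.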
For instance, $Z_{n+1} = \phi_{n+1}(O_n) \equiv \big( n^{-1}\sum_{i=1}^n Y_i \mathbf{1}\{A_i = 0\},\ n^{-1}\sum_{i=1}^n Y_i \mathbf{1}\{A_i = 1\} \big)$ characterizes a proper summary measure of the past, which keeps track of the mean outcome in each treatment arm. Another sequence of mappings φ_n will be at the core of the adaptive methodology that we study in depth in this article, see (9).

Formally, the data generating mechanism is specified by the following factorization of the likelihood of O_n:
$$
\prod_{i=1}^n \big\{ Q_W(P_0)(W_i) \times Q_{Y|A,W}(P_0)(O_i) \big\} \times \prod_{i=1}^n g_i(A_i \mid W_i),
$$


ability to reach a conclusion as quickly as possible subject to level and power constraints). Or one could care the most for the well-being of the subjects participating in the clinical trial (therefore trying to minimize the number of patients assigned to their corresponding inferior treatment arms, subject to level and power constraints). Obviously, these are only two important examples from a large class of potentially interesting criteria. The sole purpose of the criterion is to generate a random element in G of the form g_n = g_{Z_n}, where Z_n = φ_n(O_{n−1}) is a finite-dimensional summary measure of O_{n−1}.

We decide to focus in this article on the first example, but it must be clear that the methodology applies to a variety of other criteria (see (van der Laan, 2008) for other examples).

Figure 1: A causal graph describing the data generating mechanism for adaptive group sequential design. An arrow pointing from nodes Λ_1, ..., Λ_r to node Υ means that there exists a deterministic function f and an independent source of randomness U such that Υ = f(Λ_1, ..., Λ_r, U).

which suggests the introduction of g_n = (g_1, ..., g_n), referred to as the design of the study, and the expression "O_n is drawn from (Q_0, g_n)". Likewise, the likelihood of O_n under (Q, g_n) (where Q = (Q_W, Q_{Y|A,W}) ∈ Q is a candidate value for Q_0) is
$$
\prod_{i=1}^n \big\{ Q_W(W_i) \times Q_{Y|A,W}(O_i) \big\} \times \prod_{i=1}^n g_i(A_i \mid W_i), \quad (4)
$$
where we emphasize that the second factor is known.
Thus we will refer, with a slight abuse of terminology, to $\sum_{i=1}^n \log Q_W(W_i) + \log Q_{Y|A,W}(O_i)$ as the log-likelihood of O_n under (Q, g_n). Furthermore, given g_n, we introduce the notation $P_{Q_0,g_i} f \equiv E_{Q_0,g_i}[f(O_i) \mid O_n(i-1)]$ for any possibly vector-valued measurable f defined on O.

Another equivalent characterization of the data generating mechanism involves the causal graph of Fig. 1. It is seen again, firstly, that W_n is drawn independently from the past O_{n−1}; secondly, that A_n is a deterministic function of W_n, the summary measure Z_n (which depends on O_{n−1}), and a new independent source of randomness (in other words, it is drawn conditionally on (W_n, Z_n) and conditionally independently of the past O_{n−1}); thirdly, that Y_n is a deterministic function of (A_n, W_n) and a new independent source of randomness (in other words, it is drawn conditionally on (A_n, W_n) and conditionally independently of the past O_{n−1}); then that the next summary measure Z_{n+1} is obtained as a function of O_{n−1} and O_n = (W_n, A_n, Y_n) (i.e., as a function of O_n; here, the causal graph grants access to a new independent source of randomness, but it is useless in our setting), and so on.

Finally, it is interesting in practice to adapt the design group sequentially. This can be simply formalized. For a given pre-specified integer c ≥ 1 (c = 1 corresponds to a fully sequential adaptive design), proceeding c-group sequentially simply amounts to imposing $\phi_{(r-1)c+1}(O_{(r-1)c}) = \ldots = \phi_{rc}(O_{rc-1})$ for all r ≥ 1.
Then the c treatment assignments A_{(r−1)c+1}, ..., A_{rc} in the rth c-group are all drawn from the same conditional distribution g_{(r−1)c}(·|W). Yet, although all our results (and their proofs) still hold for any c ≥ 1, we prefer to consider in the rest of this section and in Sections 4 and 5 the case c = 1 for simplicity's sake. On the contrary, the simulation study carried out in Section 7 involves some c > 1.

3.2 On the user-supplied design to target

One of the most important features of the adaptive group sequential design methodology is that it targets a user-supplied specific design of special interest. This specific design is generally an optimal design with respect to a criterion which translates what the investigator cares for the most. Specifically, one could care the most for the well-being of the target population, wishing that a result be available as quickly as possible and aspiring therefore to the highest efficiency (i.e., the

3.3 From the Neyman allocation to the optimal design

So, the objective is now clearly identified: we wish to adapt the design as the data accrue in order to learn from the data, then mimic a specific design which guarantees the highest efficiency, our (user-supplied) optimal design.

By Proposition 1, the asymptotic variance of any regular estimator of the risk difference Ψ(Q_0) has lower bound $\mathrm{Var}_{P_{Q_0,g}} D^\star(P_{Q_0,g})(O)$ if the estimator relies on data sampled independently from P_{Q_0,g}. Now,
$$
\mathrm{Var}_{P_{Q_0,g}} D^\star(P_{Q_0,g})(O) = E_{Q_0}\big( \bar Q_0(1,W) - \bar Q_0(0,W) - \Psi(Q_0) \big)^2
+ E_{Q_0}\left( \frac{\sigma^2(Q_0)(1,W)}{g(1 \mid W)} + \frac{\sigma^2(Q_0)(0,W)}{g(0 \mid W)} \right),
$$
where σ²(Q_0)(A, W) denotes the conditional variance of Y given (A, W) under Q_0.
We use the notation E_{Q_0} above (for the expectation with respect to the marginal distribution of W under P_0) in order to emphasize the fact that the treatment mechanism g only appears in the second term of the right-hand-side sum. Furthermore, it holds P_0-almost surely that
$$
\frac{\sigma^2(Q_0)(1,W)}{g(1 \mid W)} + \frac{\sigma^2(Q_0)(0,W)}{g(0 \mid W)} \geq \big( \sigma(Q_0)(1,W) + \sigma(Q_0)(0,W) \big)^2,
$$
with equality if and only if
$$
g(1 \mid W) = \frac{\sigma(Q_0)(1,W)}{\sigma(Q_0)(1,W) + \sigma(Q_0)(0,W)} \quad (5)
$$
P_0-almost surely. Therefore, the following lower bound holds for all g ∈ G:
$$
\mathrm{Var}_{P_{Q_0,g}} D^\star(P_{Q_0,g})(O) \geq E_{Q_0}\big( \bar Q_0(1,W) - \bar Q_0(0,W) - \Psi(Q_0) \big)^2
+ E_{Q_0}\big( \sigma(Q_0)(1,W) + \sigma(Q_0)(0,W) \big)^2,
$$
with equality if and only if g ∈ G is characterized by (5). This optimal design is known in the literature as the Neyman allocation (see (Hu and Rosenberger, 2006), page 13). This result notably makes clear that the most efficient treatment mechanism assigns with higher probability a patient with covariate vector W to the treatment arm such that the variance of the outcome Y in this arm is the largest, regardless of the mean of the outcome (i.e., whether the arm is inferior or superior).

Due to logistical reasons, it might be preferable to consider only treatment mechanisms that assign treatment in response to a subvector V of the baseline covariate vector W. In addition, if W is complex, targeting the optimal Neyman allocation might be too ambitious. Therefore, we will consider the important case where V is a discrete covariate with finitely many values in the set V = {1, ..., ν}. The covariate V indicates subgroup membership for a collection of ν subgroups of interest. We decide to restrict the search of an optimal design to the set G_1 ⊂ G of those treatment mechanisms which only depend on W through V. The same calculations as above


yield straightforwardly that, for all g ∈ G_1,

  Var_{P_{Q_0,g}} D⋆(P_{Q_0,g})(O) ≥ E_{Q_0}(Q̄_0(1,W) − Q̄_0(0,W) − Ψ(Q_0))² + E_{Q_0}(σ̄(Q_0)(1,V) + σ̄(Q_0)(0,V))²,

where σ̄²(Q_0)(a,V) = E_{Q_0}[σ²(Q_0)(a,W)|V] for a ∈ A, with equality if and only if g coincides with g⋆(Q_0), characterized by

  g⋆(Q_0)(1|V) = σ̄(Q_0)(1,V) / (σ̄(Q_0)(1,V) + σ̄(Q_0)(0,V))

P_0-almost surely. Hereafter, we refer to g⋆(Q_0) as the optimal design.

3.4 Targeting the optimal design

Because g⋆(Q_0) is characterized as the minimizer over g ∈ G_1 of the variance under P_{Q_0,g} of the efficient influence curve at P_{Q_0,g}, we propose to construct g_{n+1} ∈ G_1 as the minimizer over g ∈ G_1 of an estimator of the latter variance based on past observations O_n.

We proceed by recursion. We first set g_1 = g^b, the so-called balanced treatment mechanism such that g^b(1|W) = 1/2 for all W ∈ W, and assume that O_n has already been sampled from (Q_0, g_n) as described in Section 3.1, the sample size being large enough to guarantee Σ_{i=1}^n 1{V_i = v} > 0 for all v ∈ V (if n_0 is the smallest sample size such that the previous condition is met, then we set g_1 = ... = g_{n_0} = g^b).

The issue is now to construct g_{n+1}. Let us assume for the time being that we already know how to construct an estimator Q_n of Q_0 based on O_n (hence the estimators Q̄_n = Q̄(Q_n) of Q̄_0 and Ψ(Q_n) of Ψ(Q_0) = ψ_0).¹
Then, for all g ∈ G_1,

  S_n(g) = (1/n) Σ_{i=1}^n { D⋆_1(Q_n)(W_i)² + 2 D⋆_1(Q_n)(W_i) D⋆_2(P_{Q_n,g})(O_i) g(A_i|V_i)/g_i(A_i|V_i)
      + D⋆_2(P_{Q_n,g})(O_i)² g(A_i|V_i)/g_i(A_i|V_i) }
    − ( (1/n) Σ_{i=1}^n D⋆_1(Q_n)(W_i) + D⋆_2(P_{Q_n,g})(O_i) g(A_i|V_i)/g_i(A_i|V_i) )²
  = (1/n) Σ_{i=1}^n (Y_i − Q̄_n(A_i,W_i))² / (g(A_i|V_i) g_i(A_i|V_i))
    + (1/n) Σ_{i=1}^n { D⋆_1(Q_n)(W_i)² + 2 D⋆_1(Q_n)(W_i) D⋆_2(P_{Q_n,g_i})(O_i) }
    − ( (1/n) Σ_{i=1}^n D⋆_1(Q_n)(W_i) + D⋆_2(P_{Q_n,g_i})(O_i) )²    (6)

estimates Var_{P_{Q_0,g}} D⋆(O; P_{Q_0,g}) (the weighting provides the adequate tilt of the empirical distribution; it is not necessary to weight the terms corresponding to D⋆_1 because they do not depend on the treatment mechanism). Now, only the first term in the rightmost expression still depends on g. The same calculations as above straightforwardly yield that S_n(g) is minimized at g⋆_{n+1} ∈ G_1 characterized by

  g_{n+1}(1|v) = s_{v,n}(1) / (s_{v,n}(1) + s_{v,n}(0)),    (7)

for all v ∈ V, where for each (v,a) ∈ V × A,

  s²_{v,n}(a) = [ (1/n) Σ_{i=1}^n (Y_i − Q̄_n(A_i,W_i))²/g_i(A_i|V_i) 1{(V_i,A_i) = (v,a)} ] / [ (1/n) Σ_{i=1}^n 1{V_i = v} ].

Yet, instead of considering the above characterization, we find it more convenient to define

  g⋆_{n+1}(1|v) = σ_{v,n}(1) / (σ_{v,n}(1) + σ_{v,n}(0)),    (8)

for all v ∈ V, where for each (v,a) ∈ V × A,

  σ²_{v,n}(a) = [ (1/n) Σ_{i=1}^n (Y_i − Q̄_n(A_i,W_i))²/g_i(A_i|V_i) 1{(V_i,A_i) = (v,a)} ] / [ (1/n) Σ_{i=1}^n 1{(V_i,A_i) = (v,a)}/g_i(A_i|V_i) ].

Note that s²_{v,n}(a) and σ²_{v,n}(a) share the same numerator, and that the different denominators converge to the same limit. Substituting σ²_{v,n}(a) for s²_{v,n}(a) is convenient because one naturally interprets the former as an estimator of the conditional variance of Y given (A,V) = (a,v) based on O_n, a fact that we use in Section 4.2.

¹ The reasoning is not circular by virtue of the chronological ordering as it is summarized in Fig. 1 for instance.
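For concreteness, the update (8) is cheap to compute from the past observations. The sketch below is ours, not the paper's: each record is assumed to carry a patient's stratum v, assigned arm a, outcome y, the fitted conditional mean Q̄_n(a, w), and the probability g_i(a|v) under which the arm was actually drawn.

```python
def update_design(records, strata):
    """One adaptive-design update, as in (8): estimate the conditional variance
    of Y in each (stratum, arm) cell by inverse-weighted squared residuals,
    then allocate proportionally to the estimated standard deviations."""
    n = len(records)
    design = {}
    for v in strata:
        sigma2 = {}
        for arm in (0, 1):
            cell = [(y, qbar, g) for (vi, a, y, qbar, g) in records
                    if vi == v and a == arm]
            # Numerator and denominator of sigma^2_{v,n}(a); both are
            # empirical means weighted by 1 / g_i(a | v).
            num = sum((y - qbar) ** 2 / g for (y, qbar, g) in cell) / n
            den = sum(1.0 / g for (y, qbar, g) in cell) / n
            sigma2[arm] = num / den
        s1, s0 = sigma2[1] ** 0.5, sigma2[0] ** 0.5
        design[v] = s1 / (s1 + s0)  # probability of assigning arm 1 in stratum v
    return design
```

With equal residual variances in both arms the update returns the balanced 1/2, and a noisier arm receives more patients, which is exactly the Neyman-type allocation targeted in Section 3.3.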
Finally, we emphasize that g⋆_{n+1} = g_{Z_{n+1}} for the summary measure of the past O_n

  Z_{n+1} = φ_{n+1}(O_n) ≡ ((σ²_{v,n}(0), σ²_{v,n}(1)) : v ∈ V).    (9)

The rigorous definition of the design g⋆_n = (g⋆_1, ..., g⋆_n) follows by recursion, but it is still subject to knowledge about how to construct an estimator Q_n of Q_0 based on O_n. Because this last missing piece of the formal definition of the adaptive group sequential design data generating mechanism is also the core of the TMLE procedure, we address it in Section 4.

4 TMLE in adaptive covariate-adjusted RCTs

We assume hereafter that O_n has already been sampled from the (Q_0, g⋆_n)-adaptive sampling scheme. In this section, we construct an estimator Q_n (actually denoted by Q*_n) of Q_0, therefore yielding the characterization of g⋆_{n+1}, and completing the formal definition of the adaptive design g⋆_n that we initiated in Section 3. In particular, the next observed data structure O_{n+1} can be drawn from (Q_0, g⋆_{n+1}), and it makes sense to undertake the asymptotic study of the properties of the TMLE methodology based on adaptive group sequential sampling.

As in the i.i.d framework, the TMLE procedure maps an initial substitution estimator Ψ(Q⁰_n) of ψ_0 into an update ψ*_n = Ψ(Q*_n) by fluctuating the initial estimate Q⁰_n of Q_0. The construction of Q⁰_n is presented and studied in Section 4.1. In Section 4.2, a straightforward application of the main result of Section 4.1 shows that the adaptive design converges. How to fluctuate Q⁰_n and stretch it optimally to Q*_n is presented and studied in Section 4.3.

4.1 Initial maximum likelihood based substitution estimator

Working model.

In order to construct the initial estimate Q⁰_n of Q_0, we consider a working model Q^w_n.
With a slight abuse of notation, the elements of Q^w_n are denoted by (Q_W(P_n), Q_{Y|A,W}(θ)) for some parameter θ ∈ Θ, where Q_W(P_n) is the empirical marginal distribution of W. Specifically, the working model Q^w_n is chosen in such a way that

  Q_{Y|A,W}(θ)(O) = 1/√(2πσ²_V(A)) exp{ −(Y − m(A,W; β_V))² / (2σ²_V(A)) }.

This notably implies that for any P_θ ∈ M such that Q_{Y|A,W}(P_θ) = Q_{Y|A,W}(θ), the conditional mean Q̄(P_θ)(A,W), which we also denote by Q̄(θ), satisfies Q̄(θ)(A,W) = m(A,W; β_V), the right-hand side expression being a linear combination of variables extracted from (A,W) and indexed by the regression vector β_V (of dimension b). Defining

  θ(v) = (β_v, σ²_v(0), σ²_v(1))⊤ ∈ Θ_v ⊂ R^b × R*_+ × R*_+    (10)

for each v ∈ V, the complete parameter is given by θ = (θ(1)⊤, ..., θ(ν)⊤)⊤ ∈ Θ, where Θ = ∏_{v=1}^ν Θ_v. We impose the following condition on the parameterization:


PARAM. The parameter set Θ is compact. Furthermore, the linear parameterization is identifiable: for all v ∈ V, if m(a,w; β_v) = m(a,w; β′_v) for all a ∈ A and w ∈ W (compatible with v), then necessarily β_v = β′_v.

Characterizing Q⁰_n.

Let us set a reference fixed design g^r ∈ G_1. We now characterize Q⁰_n by letting

  Q⁰_n = (Q_W(P_n), Q_{Y|A,W}(θ_n)),    (11)

where

  θ_n = arg max_{θ∈Θ} Σ_{i=1}^n log Q_{Y|A,W}(θ)(O_i) g^r(A_i|V_i)/g⋆_i(A_i|V_i)    (12)

is a weighted maximum likelihood estimator with respect to the working model. Thus, the vth component θ_n(v) of θ_n satisfies

  θ_n(v) = arg min_{θ(v)∈Θ_v} Σ_{i=1}^n ( log σ²_v(A_i) + (Y_i − m(A_i,W_i; β_v))²/σ²_v(A_i) ) g^r(A_i|V_i)/g⋆_i(A_i|V_i) 1{V_i = v}

for every v ∈ V. Note that this initial estimate Q⁰_n of Q_0 yields the initial maximum likelihood based substitution estimator Ψ(Q⁰_n) of ψ_0:

  Ψ(Q⁰_n) = (1/n) Σ_{i=1}^n Q̄(θ_n)(1,W_i) − Q̄(θ_n)(0,W_i).

Studying Q⁰_n through θ_n.

For simplicity, let us introduce, for all θ ∈ Θ, the additional notations

  l_{θ,0} = log Q_{Y|A,W}(θ),  l̇_{θ,0} = ∂/∂θ l_{θ,0},  l̈_{θ,0} = ∂²/∂θ² l_{θ,0}.

The first asymptotic property of θ_n that we derive concerns its consistency (see Theorem 5 in (van der Laan, 2008)).

Proposition 2 (consistency of θ_n). Assume that:

A1. There exists a unique interior point θ_0 ∈ Θ such that θ_0 = arg max_{θ∈Θ} P_{Q_0,g^r} l_{θ,0}.

A2. The matrix −P_{Q_0,g^r} l̈_{θ_0,0} is positive definite.

Provided that O is a bounded set, θ_n consistently estimates θ_0.

The proof of Proposition 2 is given in Section A.1.

The limit in probability of θ_n has a nice interpretation in terms of projection of Q_{Y|A,W}(P_0) onto {Q_{Y|A,W}(θ) : θ ∈ Θ}. Preferring to discuss this issue in terms of data generating distribution rather than conditional distribution, let us set Q_{θ_0} = (Q_W(P_0), Q_{Y|A,W}(θ_0)) and assume that P_{Q_0,g^r} log Q_{Y|A,W}(P_0) is well defined (this weak assumption concerns Q_0, not g^r, and holds for instance when |log Q_{Y|A,W}(P_0)| is bounded). Then A1 is equivalent to P_{Q_{θ_0},g^r} being the unique Kullback-Leibler projection of P_{Q_0,g^r} onto the set

  {P ∈ M : ∃θ ∈ Θ s.t. Q_{Y|A,W}(P) = Q_{Y|A,W}(θ), and Q_W(P) = Q_W(P_0), g(P) = g^r}.

In addition to being consistent, θ_n actually satisfies a central limit theorem if supplementary mild conditions are met. The latter central limit theorem is embedded in a more general result that we state in Section 5.2, see Proposition 5.

Furthermore, maximizing a weighted version of the log-likelihood is a technical twist that makes the theoretical study of the properties of θ_n easier. Indeed, the unweighted maximum likelihood estimator

  t_n = arg max_{θ∈Θ} Σ_{i=1}^n log Q_{Y|A,W}(θ)(O_i)

targets the parameter

  T_{ḡ_n}(Q_0) = arg max_{θ∈Θ} Σ_{i=1}^n P_{Q_0,g⋆_i} log Q_{Y|A,W}(θ) = arg max_{θ∈Θ} P_{Q_0,ḡ_n} l_{θ,0},

where ḡ_n = n⁻¹ Σ_{i=1}^n g⋆_i and P_{Q_0,ḡ_n} f ≡ E_{P_{Q_0,ḡ_n}}[f(O_n) | O_{n−1}] for any measurable f defined on O. Therefore, t_n asymptotically targets the limit, if it exists, of T_{ḡ_n}(Q_0). Assuming that ḡ_n converges itself to a fixed design g_∞ ∈ G, then t_n asymptotically targets the parameter T_{g_∞}(Q_0). The latter parameter is very difficult to interpret and to analyze, as it depends directly and indirectly (through g_∞) on Q_0.

4.2 Convergence of the adaptive design

Consider the mapping G⋆ from Θ to G_1 (respectively equipped with the Euclidean distance and, for instance, the distance d(g,g′) = Σ_{v∈V} |g(1|v) − g′(1|v)|) such that, for any θ ∈ Θ, for any (a,v) ∈ A × V,

  G⋆(θ)(a|v) = σ_v(a) / (σ_v(1) + σ_v(0)).    (13)

Equation (13) characterizes G⋆, which is obviously continuous. Since g⋆_n is adapted in such a way that g⋆_n = G⋆(θ_n) (see (8)), Proposition 2 and the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) straightforwardly imply the following result (the convergence in probability yields the convergence in L¹ because g⋆_n is uniformly bounded).

Proposition 3 (convergence of g⋆_n). Under the assumptions of Proposition 2, the adaptive design g⋆_n converges in probability and in L¹ to the limit design G⋆(θ_0).

The convergence of the adaptive design g⋆_n is a crucial result. It is noteworthy that the limit design G⋆(θ_0) equals the optimal design g⋆(Q_0) if the working model is correctly specified (which never happens in real-life applications), but not necessarily otherwise. Furthermore, the relationship g⋆_n = G⋆(θ_n) also entails the possibility to derive the convergence in distribution of √n(g⋆_n − G⋆(θ_0)) to a centered Gaussian distribution with known variance by application of the delta-method (G⋆ is differentiable) from a central limit theorem on θ_n (see Proposition 5 and Theorem 3.1 in (van der Vaart, 1998)).

4.3 TMLE

Fluctuating Q⁰_n.

The second step of the TMLE procedure stretches the initial estimate Ψ(Q⁰_n) in the direction of the parameter of interest, through a maximum likelihood step over a well-chosen fluctuation of Q⁰_n. The latter fluctuation of Q⁰_n is just a one-dimensional parametric model {Q⁰_n(ε) : ε ∈ E} ⊂ Q indexed by the parameter ε ∈ E, E ⊂ R being a bounded interval which contains a neighborhood of the origin. Specifically, we set for all ε ∈ E:

  Q⁰_n(ε) = (Q_W(P_n), Q_{Y|A,W}(θ_n, ε)),    (14)

where for any θ ∈ Θ,

  Q_{Y|A,W}(θ,ε)(O) = 1/√(2πσ²_V(A)) exp{ −(Y − Q̄(θ)(A,W) − εH*(θ)(A,W))² / (2σ²_V(A)) },    (15)


with

  H*(θ)(A,W) = (2A − 1)/G⋆(θ)(A|V) σ²_V(A).

In particular, the fluctuation goes through Q⁰_n at ε = 0 (i.e., Q⁰_n(0) = Q⁰_n). Let P⁰_n(ε) ∈ M be a data generating distribution such that Q_{Y|A,W}(P⁰_n(ε)) = Q_{Y|A,W}(θ_n, ε). The conditional mean Q̄(P⁰_n(ε)), which we also denote by Q̄(θ_n, ε), is

  Q̄(θ_n, ε)(A,W) = Q̄(θ_n)(A,W) + εH*(θ_n)(A,W).

Furthermore, the score at ε = 0 of P⁰_n(ε) equals

  ∂/∂ε log P⁰_n(ε)[O] |_{ε=0} = (2A − 1)/G⋆(θ_n)(A|V) (Y − Q̄(θ_n)(A,W)) = D⋆_2(P_{Q⁰_n,G⋆(θ_n)})(O),

the second component of the efficient influence curve of Ψ at P_{Q⁰_n,G⋆(θ_n)} = P_{Q⁰_n,g⋆_n}, see (2) in Proposition 1 (recall that g⋆_n = G⋆(θ_n)).

Characterizing Q*_n yields the TMLE.

We characterize the update Q*_n of Q⁰_n in the fluctuation {Q⁰_n(ε) : ε ∈ E} by

  Q*_n = Q⁰_n(ε_n),

where

  ε_n = arg max_{ε∈E} Σ_{i=1}^n log Q_{Y|A,W}(θ_n, ε)(O_i) g⋆_n(A_i|V_i)/g⋆_i(A_i|V_i)    (16)

is a weighted maximum likelihood estimator with respect to the fluctuation. It is worth noting that ε_n is known in closed form (we assume, without serious loss of generality, that E is large enough for the maximum to be achieved in its interior). Denoting the vth component θ_n(v) of θ_n by (β_{v,n}, σ²_{v,n}(0), σ²_{v,n}(1)), it holds that

  ε_n = [ Σ_{i=1}^n (Y_i − Q̄(θ_n)(A_i,W_i)) (2A_i − 1)/g⋆_i(A_i|V_i) ] / [ Σ_{i=1}^n σ²_{V_i,n}(A_i) / (g⋆_n(A_i|V_i) g⋆_i(A_i|V_i)) ].

The notation Q*_n for this first update of Q⁰_n is a reference to the fact that the TMLE procedure, which is in greater generality an iterative procedure, converges here in one single step. Indeed, say that one fluctuates Q*_n as we fluctuate Q⁰_n, i.e., by introducing

  Q¹_n(ε) = (Q_W(P_n), Q_{Y|A,W}(θ_n, ε_n, ε)),

with Q_{Y|A,W}(θ, ε′, ε) equal to the right-hand side of (15) where one substitutes Q̄(θ, ε′) for Q̄(θ). Say that one then defines the weighted maximum likelihood ε′_n as the right-hand side of (16) where one substitutes Q_{Y|A,W}(θ_n, ε_n, ε) for Q_{Y|A,W}(θ_n, ε). Then it follows that ε′_n = 0 so that the "updated" Q*_n(ε′_n) = Q*_n. The updated estimator Q*_n of Q_0 maps into the TMLE ψ*_n = Ψ(Q*_n) of the risk difference ψ_0 = Ψ(Q_0):

  ψ*_n = (1/n) Σ_{i=1}^n Q̄(θ_n, ε_n)(1,W_i) − Q̄(θ_n, ε_n)(0,W_i).    (17)

The asymptotic study of ψ*_n relies on a central limit theorem for (θ_n, ε_n), which we discuss in Section 5.3.

5 Asymptotics

5.1 Studying Q*_n through (θ_n, ε_n): consistency

We now state and comment on a consistency result for the stacked estimator (θ_n, ε_n) which complements Proposition 2 (see Theorem 8 in (van der Laan, 2008)). For simplicity, let us generalize the notation l_{θ,0} introduced in Section 4.1 by setting, for all (θ,ε) ∈ Θ × E,

  l_{θ,ε} = log Q_{Y|A,W}(θ, ε).

Moreover, let us set, for all (θ,ε) ∈ Θ × E,

  Q_{θ,ε} = (Q_W(P_0), Q_{Y|A,W}(θ, ε)).    (18)

Proposition 4 (consistency of (θ_n, ε_n)). Suppose that A1 and A2 from Proposition 2 hold. In addition, assume that:

A3. There exists a unique interior point ε_0 ∈ E such that ε_0 = arg max_{ε∈E} P_{Q_0,G⋆(θ_0)} l_{θ_0,ε}.

(i) It holds that Ψ(Q_{θ_0,ε_0}) = Ψ(Q_0).

(ii) Provided that O is a bounded set, (θ_n, ε_n) consistently estimates (θ_0, ε_0).

The proof of Proposition 4 is given in Section A.1.

We already discussed the interpretation of the limit in probability of θ_n in terms of Kullback-Leibler projection. Likewise, the limit in probability ε_0 of ε_n enjoys such an interpretation. Let us assume that P_{Q_0,G⋆(θ_0)} log Q_{Y|A,W}(P_0) is well defined (this weak assumption concerns Q_0, not G⋆(θ_0), and holds for instance when |log Q_{Y|A,W}(P_0)| is bounded). Then A3 is equivalent to P_{Q_{θ_0,ε_0},G⋆(θ_0)} being the unique Kullback-Leibler projection of P_{Q_0,G⋆(θ_0)} onto the set

  {P ∈ M : ∃ε ∈ E s.t. Q(P) = Q_{θ_0,ε} and g(P) = G⋆(θ_0)}.

Of course, the most striking property that ε_0 enjoys is (i): even if Q̄_0 ∉ {Q̄(θ,ε) : (θ,ε) ∈ Θ × E}, it holds that Ψ(Q_{θ_0,ε_0}) = Ψ(Q_0). This remarkable equality and the convergence of (θ_n, ε_n) to (θ_0, ε_0) are evidently the keys to the consistency of ψ*_n = Ψ(Q*_n). We investigate in Section 5.3 how the consistency result stated in Proposition 4 translates into the consistency of the TMLE.

5.2 Studying Q*_n through (θ_n, ε_n): central limit theorem

We now state and comment on a central limit theorem for the stacked estimator (θ_n, ε_n) (see also Theorem 9 in (van der Laan, 2008)). Let us introduce, for all (θ,ε) ∈ Θ × E,

  D(θ,ε)(O,Z) = 1/g_Z(A|V) ( l̇⊤_{θ,0}(O) g^r(A|V), ∂/∂ε l_{θ,ε}(O) G⋆(θ)(A|V) )⊤,    (19)

and D̃(θ,ε)(O) = g_Z(A|V) D(θ,ε)(O,Z).

Proposition 5 (central limit theorem for (θ_n, ε_n)). Suppose that A1, A2 and A3 from Propositions 2 and 4 hold. In addition, assume that:

A4. If a deterministic function F is such that F(O) = 0 P_{Q_0,G⋆(θ_0)}-almost surely, then F = 0.

Then the following asymptotic linear expansion holds:

  √n((θ_n, ε_n) − (θ_0, ε_0)) = S₀⁻¹ (1/√n) Σ_{i=1}^n D(θ_0, ε_0)(O_i, Z_i) + o_P(1),    (20)


where

  S_0 = E_{Q_0,G⋆(θ_0)} ( g^r(A|V) l̈_{θ_0,0}(O)/G⋆(θ_0)(A|V)                                            0
        (∂²/∂θ∂ε l_{θ,ε}(O))⊤ G⋆(θ)(A|V)/G⋆(θ_0)(A|V) |_{(θ,ε)=(θ_0,ε_0)}    ∂²/∂ε² l_{θ_0,ε}(O)|_{ε=ε_0}/G⋆(θ_0)(A|V) )    (21)

is an invertible matrix. Furthermore, (20) entails that √n((θ_n, ε_n) − (θ_0, ε_0)) converges in distribution to the centered Gaussian distribution with covariance matrix S₀⁻¹ Σ_0 (S₀⁻¹)⊤, where

  Σ_0 = E_{Q_0,G⋆(θ_0)} [ D̃(θ_0, ε_0) D̃(θ_0, ε_0)⊤(O) / G⋆(θ_0)(A|V)² ]    (22)

is a positive definite symmetric matrix. Moreover, S_0 is consistently estimated by

  S_n = (1/n) Σ_{i=1}^n ( l̈_{θ_n,0}(O_i) g^r(A_i|V_i)/g_{Z_i}(A_i|V_i)                                            0
        (∂²/∂θ∂ε l_{θ,ε}(O_i))⊤ G⋆(θ)(A_i|V_i)/g_{Z_i}(A_i|V_i) |_{(θ,ε)=(θ_n,ε_n)}    ∂²/∂ε² l_{θ_n,ε}(O_i)|_{ε=ε_n}/g_{Z_i}(A_i|V_i) )

and Σ_0 is consistently estimated by

  Σ_n = (1/n) Σ_{i=1}^n D(θ_n, ε_n) D(θ_n, ε_n)⊤(O_i, Z_i).    (23)

The proof of Proposition 5 is given in Section A.2. We investigate in Section 5.3 how the above central limit theorem translates into a central limit theorem for the TMLE.

5.3 Consistency and asymptotic normality of the TMLE

In this section, we finally state and comment on the asymptotic properties of the TMLE ψ*_n.

TMLE is consistent and asymptotically Gaussian.

In the first place, the TMLE ψ*_n is robust: it is a consistent estimator even when the working model is misspecified (which is always the case in real-life applications).

Proposition 6 (consistency of ψ*_n). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the TMLE ψ*_n consistently estimates the risk difference ψ_0.

If the design of the clinical trial was fixed (and consequently, the n first observations were i.i.d), then the TMLE would be a robust estimator of ψ_0: Even if the working model is misspecified, the TMLE still consistently estimates ψ_0 because the treatment mechanism is known (or can be consistently estimated, if one wants to gain in efficiency).
Thus, the robustness of the TMLE stated in Proposition 6 is the expected counterpart of the TMLE's robustness in the latter i.i.d setting: Expected because the TMLE solves a martingale estimating function that is unbiased for ψ_0 at misspecified Q and correctly specified g_i, i = 1, ..., n.

In the second place, the TMLE ψ*_n is asymptotically linear, and therefore satisfies a central limit theorem. To see this, let us introduce the real-valued function φ on Θ × E such that φ(θ,ε) = Ψ(Q_{θ,ε}) (see (18) for the definition of Q_{θ,ε}). The function φ is differentiable on the interior of Θ × E, and we denote φ′_{θ,ε} its gradient at (θ,ε). The latter gradient satisfies

  φ′_{θ,ε} = E_{Q_{θ,ε},G⋆(θ)} { D⋆(P_{Q_{θ,ε},G⋆(θ)})(O) ( ∂/∂θ l⊤_{θ,ε}(O), ∂/∂ε l_{θ,ε}(O) )⊤ }.    (24)

Note that the right-hand side expression cannot be computed explicitly because the marginal distribution Q_W(P_0) is unknown. By the law of large numbers (independent case), we can build an estimator φ′_n of φ′_{θ_0,ε_0} as follows. For B a large number (say B = 10⁴), simulate B independent copies Õ_b of O from the data generating distribution P_{Q*_n,G⋆(θ_n)}, then compute

  φ′_n = (1/B) Σ_{b=1}^B D⋆(P_{Q*_n,G⋆(θ_n)})(Õ_b) ( ∂/∂θ l⊤_{θ,ε_n}(Õ_b)|_{θ=θ_n}, ∂/∂ε l_{θ_n,ε}(Õ_b)|_{ε=ε_n} )⊤.    (25)

Proposition 7 (central limit theorem for ψ*_n). Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Then the following asymptotic linear expansion holds:

  √n(ψ*_n − ψ_0) = (1/√n) Σ_{i=1}^n IC(O_i, Z_i) + o_P(1),    (26)

where

  IC(O,Z) = D⋆_1(Q_{θ_0,ε_0})(W) + φ′⊤_{θ_0,ε_0} S₀⁻¹ D(θ_0, ε_0)(O,Z).    (27)
Furthermore, (26) entails that √n(ψ*_n − ψ_0) converges in distribution to the centered Gaussian distribution with a variance consistently estimated by

  s²_n = (1/n) Σ_{i=1}^n D⋆_1(Q*_n)(W_i)² + (2/n) Σ_{i=1}^n D⋆_1(Q*_n)(W_i) φ′⊤_n S_n⁻¹ D(θ_n, ε_n)(O_i, Z_i) + (φ′⊤_n S_n⁻¹) Σ_n (φ′⊤_n S_n⁻¹)⊤.

Proposition 7 is the backbone of the statistical analysis of adaptive group sequential RCTs as constructed in Section 4. In particular, denoting the (1 − α)-quantile of the standard normal distribution by ξ_{1−α}, the proposition guarantees that the asymptotic level of the confidence interval

  [ψ*_n ± (s_n/√n) ξ_{1−α/2}]    (28)

for the risk difference ψ_0 is (1 − α). The proofs of Propositions 6 and 7 are given in Section A.3.

Extensions.

We conjecture that the influence function IC computed at (O,Z), see (27), is equal to

  D⋆_1(Q_{θ_0,ε_0})(W) + D⋆_2(P_{Q_{θ_0,ε_0},G⋆(θ_0)})(O) G⋆(θ_0)(A|V)/g_Z(A|V).

This conjecture is backed by the simulations that we carry out and present in Section 7. We will tackle the proof of the conjecture in future work. Let us assume for the moment that the conjecture is true. Then the asymptotic linear expansion (26) now implies that the asymptotic variance of √n(ψ*_n − ψ_0) can be consistently estimated by

  s*²_n = (1/n) Σ_{i=1}^n ( D⋆_1(Q*_n)(W_i) + D⋆_2(P_{Q*_n,G⋆(θ_n)})(O_i) G⋆(θ_n)(A_i|V_i)/g_{Z_i}(A_i|V_i) )²,

another independent argument also showing that s*²_n converges toward

  Var_{Q_0,G⋆(θ_0)} D⋆(P_{Q_{θ_0,ε_0},G⋆(θ_0)})(O),

i.e., the variance under the fixed design P_{Q_0,G⋆(θ_0)} of the efficient influence curve at P_{Q_{θ_0,ε_0},G⋆(θ_0)}. Furthermore, the most essential characteristic of the joint methodologies of design adaptation and targeted maximum likelihood estimation is certainly the utmost importance of the role played by the likelihood.
In this view, the targeted maximized log-likelihood of the data

  Σ_{i=1}^n { log Q_W(P_n)(W_i) + log g⋆_i(A_i|W_i) + log Q_{Y|A,W}(θ_n, ε_n)(O_i) }

provides us with a quantitative measure of the quality of the fit (targeted toward the parameter of interest). It is therefore possible, for example, to use that quantity for the sake of selection of different working models for Q_0. As with TMLE for i.i.d data, we can use likelihood based cross-validation to select among more general initial estimators indexed by fine-tuning parameters. The validity of such TMLEs for group sequential adaptive designs as studied here is outside the scope of this article.
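As a small worked illustration of interval (28): given the TMLE ψ*_n, the standard-error ingredient s_n and the sample size n, the interval is symmetric around ψ*_n with half-width s_n/√n times the Gaussian quantile. A minimal sketch (the function name is ours):

```python
from statistics import NormalDist

def tmle_confidence_interval(psi_star, s_n, n, alpha=0.05):
    """Confidence interval (28): psi*_n +/- (s_n / sqrt(n)) * xi_{1 - alpha/2},
    with asymptotic coverage 1 - alpha by Proposition 7."""
    xi = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = s_n / n ** 0.5 * xi
    return psi_star - half_width, psi_star + half_width
```

For instance, larger s_n or smaller n widen the interval, which is why targeting the variance-minimizing design shortens the trials of Section 6.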


6 Application to group sequential testing

We derive in this section a group sequential testing procedure, that is, a testing procedure which repeatedly attempts to reach a decision at intervals, rather than once all data are collected or after every new observation is obtained (such a testing procedure would be said fully sequential). We refer to (Jennison and Turnbull, 2000; Proschan et al., 2006) for a general presentation of group sequential testing procedures. The TMLE group sequential testing procedure is formally described in Section 6.1, and some arguments justifying why it should work well (as validated by the simulation study carried out in Section 8) are presented in Section 6.2.

6.1 Description of the TMLE group sequential testing procedure

The problem at stake is to test the null "Ψ(Q) = ψ_0" against "Ψ(Q) > ψ_0" with asymptotic type I error α and asymptotic type II error β at some ψ_1 > ψ_0. We want to proceed group sequentially with K ≥ 2 steps, based on the multi-dimensional t-statistic

  (T*_1, ..., T*_K) = ( √N_k (ψ*_{N_k} − ψ_0) / s_{N_k} )_{k≤K}.    (29)

Here, N_1, ..., N_K are random sample sizes whose realizations on a specific trajectory depend on how fast the information accrues as the data are collected. In order to quantify this notion of information, we decide to consider the inverse n/s²_n of the estimated variance of the TMLE ψ*_n based on the n first observations O_n (as a proxy to its true, finite sample, inverse variance). Given a reference maximum committed information I_max and K increasingly ordered proportions 0 < p_1 < ...
< p_K = 1, we set for every k ≤ K

  N_k = inf{ n ≥ 1 : n/s²_n ≥ p_k I_max }.

The characterization of I_max depends on how we wish to "spend" the type I and type II errors at each step of the group sequential procedure and on how demanding the power requirement is (i.e., how close ψ_1 is to ψ_0). Say that our spending strategies are summarized by the K-tuples of positive numbers (α_1, ..., α_K) and (β_1, ..., β_K) such that Σ_{k=1}^K α_k = α and Σ_{k=1}^K β_k = β.

Now, let (Z_1, ..., Z_K) be distributed from the centered Gaussian distribution with covariance matrix C = (√(p_{k∧l}/p_{k∨l}))_{k,l≤K}. We assume that there exists a unique value I > 0, our I_max, such that there exist a rejection boundary (a_1, ..., a_K) and a futility boundary (b_1, ..., b_K) satisfying a_K = b_K, P(Z_1 ≥ a_1) = α_1, P(Z_1 + (ψ_1 − ψ_0)√(p_1 I) ≤ b_1) = β_1, and for every 1 ≤ k < K,

  P(∀j ≤ k, b_j < Z_j < a_j and Z_{k+1} ≥ a_{k+1}) = α_{k+1},
  P(∀j ≤ k, b_j < Z_j + (ψ_1 − ψ_0)√(p_j I) < a_j and Z_{k+1} + (ψ_1 − ψ_0)√(p_{k+1} I) ≤ b_{k+1}) = β_{k+1}.

Note that the closer ψ_1 is to ψ_0, the larger is I_max (actually, ψ_1 ↦ ψ_1 √I_max is both upper bounded and bounded away from zero).
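Each N_k is the first sample size at which the estimated information n/s²_n crosses the threshold p_k I_max. A sketch of that bookkeeping (names are ours; the information trajectory is assumed to be recorded as the trial proceeds):

```python
def interim_sample_sizes(info_trajectory, proportions, I_max):
    """info_trajectory[n - 1] holds the estimated information n / s_n^2 after n
    observations. Returns N_k = inf{n >= 1 : n / s_n^2 >= p_k * I_max} for each
    proportion p_k, or None when the threshold is never reached."""
    sizes = []
    for p in proportions:
        target = p * I_max
        hit = next((n for n, info in enumerate(info_trajectory, start=1)
                    if info >= target), None)
        sizes.append(hit)
    return sizes
```

Because the information trajectory is random, so are the interim sample sizes N_1, ..., N_K, as emphasized above.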
Heuristically, the closer ψ_1 is to ψ_0, the more difficult it is to decide between the null and its alternative while preserving the required type II error at ψ_1, and the more information is needed to proceed.

The targeted maximum likelihood group sequential testing procedure finally goes as follows: starting from k = 1,

  if T*_k ≥ a_k then reject the null and stop accruing data,
  if T*_k ≤ b_k then fail rejecting the null and stop accruing data,
  if b_k < T*_k < a_k then set k ← k + 1 and repeat.

If (T*_1, ..., T*_K) had the same distribution as (Z_1, ..., Z_K), then the latter rule would yield a testing procedure with the required type I error and type II error at the specified alternative parameter.

Clearly, our decision to target the optimal design G⋆(θ_0) which reduces as much as possible (over G_1 and for our choice of working model) the asymptotic variance of the TMLE ψ*_n guarantees,
In particular, it is possible to <strong>de</strong>rive the limit distribution of (T 1 , ...,T K )un<strong>de</strong>r the null and un<strong>de</strong>r a sequence of contiguous alternatives. The limit distribution is called thecanonical distribution (see Theorems 11 in 12 in (van <strong>de</strong>r Laan, 2008), Theorem 3 in (Chambazand van <strong>de</strong>r Laan, 2011) and Theorem 2.1 in (Zhu and Hu, 2010), where a similar result is obtainedthrough a different approach based on the study of the limit distribution of a stochastic process<strong>de</strong>fined over (0, 1]).Proposition 8. Suppose that A1, A2, A3 and A4 from Propositions 2, 4 and 5 hold. Consi<strong>de</strong>rf ∈ L 2 (P Q0,G ⋆ (θ0)), f ≠ 0, with P Q0,G ⋆ (θ0)f = 0 and a fluctuation {Q 0 (h) : h ∈ H} ⊂ Q of Q 0with score f. Set Q f/√ n = Q 0 (1/ √ n) for all n ≥ n 0 (n 0 is such that 1/ √ n0 ∈ H). Assume also thatA5. The score f is boun<strong>de</strong>d, it is not proportional to D ⋆ (P Q0,G ⋆ (θ0)), E PQ0 ,G ⋆ (θ 0[f(O)|A, W] = 0,)P Q0,G ⋆ (θ0)D ⋆ (P Q0,G ⋆ (θ0))f > 0, and P Q0,G ⋆ (θ0)ICf > 0.The sequence (Q √ f/ n ) n≥n0 <strong>de</strong>fines a sequence (ψ n ) n≥n0 of contiguous parameters (“from directionf”, which only fluctuates the conditional distribution of Y given (A, W)), with ψ n =Ψ(Q √ f/ n ) > Ψ(Q 0 ) for n large enough. Introduce µ(f) = ( √ p 1 , ..., √ p K ) P Q 0 ,G ⋆ (θ 0 )ICf√PQ0 ,G ⋆ (θ 0 )IC 2.(i) Un<strong>de</strong>r (Q 0 ,g n )-adaptive sampling, (T 1 , ...,T K ) converges in distribution to the centeredGaussian distribution with covariance matrix C.(ii) Un<strong>de</strong>r (Q f/√ n ,g n )-adaptive sampling, (T 1 , ...,T K ) converges in distribution to the Gaussiandistribution with mean µ(f) and covariance matrix C.The proof of Proposition 8 is given in Section A.4.The rationale follows. 
Say that one concludes from Proposition 8 that (i) (T*_1, ..., T*_K) is also approximately distributed as (Z_1, ..., Z_K) under (Q_0, g⋆_n)-adaptive sampling. Likewise, say that (√N_k (ψ*_{N_k} − ψ_1)/s_{N_k})_{k≤K} is also approximately distributed as (Z_1, ..., Z_K) under (Q_1, g⋆_n)-adaptive sampling, with Ψ(Q_1) = ψ_1 > ψ_0. Say that one is willing to substitute p_k I_max for N_k/s²_{N_k} for each k ≤ K. Then (ii) (T*_1, ..., T*_K) is approximately distributed as √I_max (ψ_1 − ψ_0)(√p_1, ..., √p_K) + (Z_1, ..., Z_K) under (Q_1, g⋆_n)-adaptive sampling. It appears that (i) and (ii) suffice to guarantee the desired asymptotic control of type I and type II errors. The rationale is validated by the simulation study undertaken in Section 8.

7 Simulation study of the performances of TMLE in adaptive covariate-adjusted RCTs

In this section, we carry out a simulation study of the performances of TMLE in adaptive group sequential RCTs as exposed in the previous sections. We present the simulation scheme in Section 7.1. The working model upon which the TMLE methodology relies is described in Section 7.2. How the TMLE-based confidence intervals behave is considered in Section 7.3 (empirical coverage) and in Section 7.4 (empirical width). An illustrating example is finally presented in Section 7.5.
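Before turning to the numerical results, note that the risk difference of the simulation scheme of Section 7.1 can be checked exactly: Q̄_0(1,W) − Q̄_0(0,W) = ρ(V − 1/(1+V)) does not involve U, so ψ_0 only averages over the distribution of V. The verification below is ours:

```python
from fractions import Fraction

# P_0(V = 1) = 1/2, P_0(V = 2) = 1/3, P_0(V = 3) = 1/6, and rho = 1.
p_V = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
rho = 1
# psi_0 = sum_v P_0(V = v) * rho * (v - 1 / (1 + v))
psi_0 = sum(p * rho * (Fraction(v) - Fraction(1, 1 + v)) for v, p in p_V.items())
```

This reproduces the closed-form value ψ_0 = 91/72 ≃ 1.264 used throughout Section 7.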


Sampling scheme (Q_0, g)         Allocation probabilities                  Variance
                                 g(1|v=1)   g(1|v=2)   g(1|v=3)            Var_{Q_0,g} D^*(O; P_{Q_0,g})
(Q_0, g^b)-balanced              1/2        1/2        1/2                 23.864
(Q_0, g^*(Q_0))-optimal          0.707      0.799      0.849               18.181

Table 1: Numerical values of the allocation probabilities and of the variance of the efficient influence curve under either the balanced $g^b$ or the targeted optimal $g^\star(Q_0)$ sampling scheme. The ratio of the variances of the efficient influence curve under the targeted optimal and balanced sampling schemes satisfies $R(Q_0) = v^\star(Q_0)/v^b(Q_0) \simeq 0.762$.

hal-00582753, version 1 - 4 Apr 2011

7.1 Simulation scheme

We characterize the component $Q_0 = Q(P_0)$ of the true distribution $P_0$ of the observed data structure $O = (W, A, Y)$ as follows:

• the baseline covariate $W = (U, V)$, where $U$ is uniformly distributed over the unit interval $[0,1]$ and the subgroup membership covariate $V \in \mathcal{V} = \{1,2,3\}$ (hence $\nu = 3$) satisfies
$$P_0(V=1) = \tfrac12, \quad P_0(V=2) = \tfrac13 \quad \text{and} \quad P_0(V=3) = \tfrac16;$$

• the conditional distribution of $Y$ given $(A, W)$ is the Gamma distribution characterized by the conditional mean
$$\bar{Q}_0(A, W) = 2U^2 + 2U + 1 + \rho\Big(AV + \frac{1-A}{1+V}\Big) \tag{30}$$
with $\rho = 1$ (we will set $\rho$ to another value in Section 8) and the conditional variance
$$\mathrm{Var}_{P_0}(Y \mid A, W) = \Big(U + A(1+V) + \frac{1-A}{1+V}\Big)^2.$$

In particular, the risk difference $\psi_0 = \Psi(Q_0)$, our parameter of interest, is known in closed form:
$$\psi_0 = \frac{91}{72} \simeq 1.264,$$
and so is the variance
$$v^b(Q_0) = \mathrm{Var}_{Q_0,g^b} D^\star(O; P_{Q_0,g^b})$$
of the efficient influence curve under balanced sampling. The numerical value of $v^b(Q_0)$ is reported in Table 1.

We target the design which (a) depends on the baseline covariate $W = (U, V)$ only through $V$ (i.e., belongs to $\mathcal{G}_1$) and (b) minimizes the variance of the efficient influence curve of the parameter of interest $\Psi$. The latter treatment mechanism $g^\star(Q_0)$ and optimal efficient asymptotic variance
$$v^\star(Q_0) = \mathrm{Var}_{Q_0,g^\star(Q_0)} D^\star(O; P_{Q_0,g^\star(Q_0)})$$
are also known in closed form, and their numerical values are reported in Table 1.

Let $n = (100, 250, 500, 750, 1000, 2500, 5000)$ be a sequence of sample sizes. We estimate $M = 1000$ times the risk difference $\psi_0 = \Psi(Q_0)$ based on $O^m_{n_7}(n_i)$, $m = 1, \ldots, M$, $i = 1, \ldots, 7$, under

• i.i.d $(Q_0, g^b)$-balanced sampling,
• i.i.d $(Q_0, g^\star(Q_0))$-optimal sampling,
• $(Q_0, g_n^\star)$-adaptive sampling.

7.2 Specifying the working model, reference design and a few twists

On the working model.
For each $v \in \mathcal{V}$, let us denote $\theta(v) = (\beta_v, \sigma_v^2(0), \sigma_v^2(1)) \in \Theta_v$, $\Theta_v \subset \mathbb{R}^3 \times \mathbb{R}^*_+ \times \mathbb{R}^*_+$ being compact, where the regression vector $\beta_v = (\beta_{v,1}, \beta_{v,2}, \beta_{v,3})$ ($b = 3$ in (10)); then $\theta = (\theta_1, \theta_2, \theta_3) \in \Theta = \Theta_1 \times \Theta_2 \times \Theta_3$.

Following the description in Section 4.1, the working model $Q^w_n$ that the TMLE methodology relies on is characterized by the conditional likelihood of $Y$ given $(A, W)$:
$$Q_{Y|A,W}(\theta)(O) = \frac{1}{\sqrt{2\pi\sigma_V^2(A)}} \exp\left\{-\frac{(Y - m(A, W; \beta_V))^2}{2\sigma_V^2(A)}\right\},$$
with the specific choice of conditional mean $\bar{Q}(\theta)(A, W)$ of $Y$ given $(A, W)$:
$$\bar{Q}(\theta)(a, w) = m(a, w; \beta_v) = \beta_{v,1} + \beta_{v,2}\, u + \beta_{v,3}\, a$$
for all $a \in \mathcal{A}$ and $w = (u, v) \in \mathcal{W} = \mathbb{R} \times \mathcal{V}$. As required, condition PARAM is met. Obviously, the working model is heavily misspecified:

• a Gaussian conditional likelihood is used instead of a Gamma conditional likelihood,
• the parametric forms of the conditional expectation and variance are wrong too.

On the reference design.
Regarding the choice of a reference fixed design $g^r \in \mathcal{G}_1$ (see Section 4.1), we select $g^r = g^b$ (the balanced design).
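The data-generating mechanism of Section 7.1 can be sketched as follows. This is a minimal Python illustration, not the authors' code; the function and variable names are ours, and the Gamma draws are parameterized by matching the mean and variance displays above (shape $m^2/s^2$, scale $s^2/m$):

```python
import numpy as np

rng = np.random.default_rng(1)
RHO = 1.0                        # rho = 1 in Section 7; another value in Section 8
P_V = {1: 1/2, 2: 1/3, 3: 1/6}   # marginal law of the subgroup covariate V

def cond_mean(a, u, v):
    # conditional mean \bar{Q}_0(a, w) of display (30)
    return 2 * u**2 + 2 * u + 1 + RHO * (a * v + (1 - a) / (1 + v))

def cond_var(a, u, v):
    # conditional variance Var_{P_0}(Y | A, W)
    return (u + a * (1 + v) + (1 - a) / (1 + v)) ** 2

def sample(n, g1=lambda v: 0.5):
    """Draw n observations O = (W, A, Y) under a fixed design A | V ~ g1(V)."""
    u = rng.uniform(size=n)
    v = rng.choice([1, 2, 3], size=n, p=[P_V[1], P_V[2], P_V[3]])
    a = rng.binomial(1, np.array([g1(x) for x in v]))
    m, s2 = cond_mean(a, u, v), cond_var(a, u, v)
    # Gamma(shape=m^2/s2, scale=s2/m) has mean m and variance s2, as required
    y = rng.gamma(shape=m**2 / s2, scale=s2 / m)
    return u, v, a, y

# closed-form risk difference: psi_0 = rho * E[V - 1/(1+V)] = 91/72
psi_0 = RHO * sum(p * (v - 1 / (1 + v)) for v, p in P_V.items())
```

Under the balanced design $g^b(1|v) = 1/2$ for every $v$; one checks that `psi_0` equals $91/72 \simeq 1.264$, as stated above.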
The parameter $\theta_0$ only depends on $Q_0$ and the working model, but its estimator $\theta_n$ depends on $g^r$, which may negatively affect its performance. Therefore, we propose to dilute the impact of the choice of $g^r$ as an initial reference design as follows. For a given sample size $n$, we first compute a first estimate $\theta_n^1$ of $\theta_0$ as in (12), but with $\lceil n/4 \rceil$ (the smallest integer not smaller than $n/4$) substituted for $n$ in the sum. Then $\theta_n$ is computed as in (12), but this time with $G^\star(\theta_n^1)(A_i|V_i)$ substituted for $g^r(A_i|V_i)$. The proofs can be adapted in order to incorporate this modification of the procedure. We refer the interested reader to Section 8.5 in (van der Laan, 2008).

On additional details.
We decide arbitrarily to update the design each time $c = 25$ new observations are sampled. In addition, the first update only occurs once there are at least five completed observations in each treatment arm and for all $V$-strata. Thus, the minimal sample size at first update is 30. It can be shown that, under initialization using the balanced design, the expected value of the first sample size at which there are at least 5 observations in each arm equals 75. Finally, as a precautionary measure, we systematically apply a thresholding to the updated treatment mechanism: using the notation of Section 4, $\max\{\delta, \min\{1 - \delta, g_i^\star(A_i|V_i)\}\}$ is substituted for $g_i^\star(A_i|V_i)$ in all computations. We arbitrarily choose $\delta = 0.01$.

7.3 Empirical coverage of the TMLE confidence intervals

We now invoke the central limit theorem stated in Proposition 7 to construct confidence intervals for the risk difference.
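Generically, such a Wald-type interval takes the form $[\psi_n^\star \pm \xi_{1-\alpha/2}\sqrt{v_n/n}]$. A minimal helper (illustrative Python, assuming SciPy; the function name is ours):

```python
from scipy.stats import norm

def wald_ci(psi_star, v_n, n, alpha=0.05):
    """Wald-type confidence interval psi* +/- xi_{1-alpha/2} * sqrt(v_n / n),
    where v_n estimates the asymptotic variance of sqrt(n) * (psi* - psi_0)."""
    half_width = norm.ppf(1 - alpha / 2) * (v_n / n) ** 0.5
    return psi_star - half_width, psi_star + half_width
```

For instance, with $\psi_n^\star = 1.273$ and $n = 5000$, a variance estimate of about $17.5$ reproduces the order of magnitude of the last interval of Table 4 (half-width $\simeq 0.116$).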
(We emphasize that the observed data structure $O = (W, A, Y)$ is not bounded here, whereas $O$ is assumed bounded in Propositions 2, 3, 4, 5, 6 and 7.)

Let us introduce, for all types of sampling and each sample size $n_i$, the confidence intervals
$$I_{n_i,m} = \left[\psi_{n_i}^\star(O^m_{n_7}(n_i)) \pm \xi_{1-\alpha/2}\sqrt{\frac{v_{n_i}(O^m_{n_7}(n_i))}{n_i}}\right], \quad m = 1, \ldots, M,$$
where the definition of the variance estimator $v_n(O^m_{n_7}(n))$ based on the $n$ first observations $O^m_{n_7}(n)$ depends on the sampling scheme:


Sampling scheme                      Sample size
                                     n_1           n_2           n_3           n_4
i.i.d (Q_0, g^b)-balanced            0.913         0.925         0.939         0.934
                                     (p < 0.001)   (p < 0.001)   (0.067)       (0.015)
i.i.d (Q_0, g^*(Q_0))-optimal        0.894         0.941         0.940         0.953
                                     (p < 0.001)   (0.111)       (0.087)       (0.688)
(Q_0, g_n^*)-adaptive                0.934         0.939         0.956         0.945
                                     (0.015)       (0.067)       (0.827)       (0.253)

Sampling scheme                      Sample size
                                     n_5           n_6           n_7
i.i.d (Q_0, g^b)-balanced            0.945         0.940         0.946
                                     (0.253)       (0.087)       (0.300)
i.i.d (Q_0, g^*(Q_0))-optimal        0.954         0.947         0.947
                                     (0.739)       (0.351)       (0.351)
(Q_0, g_n^*)-adaptive                0.943         0.933         0.952
                                     (0.172)       (0.011)       (0.634)

Table 2: Checking the adequacy of the coverage guaranteed by our simulated confidence intervals. We test whether the rescaled empirical coverage Binomial random variables $Mc_{n_i}$ have parameter $(M, 1-\alpha)$, the alternative stating that they have parameter $(M, 1-a)$ with $a > \alpha$. We report the values $c_{n_i}$ (top rows) and the corresponding p-values (bottom rows, between parentheses) of the Binomial test for each sample size and each sampling scheme.

• under i.i.d $(Q_0, g^b)$-balanced sampling, $v_n(O^m_{n_7}(n))$ is the estimator of the asymptotic variance of the TMLE $\Psi(Q^*_{n,\mathrm{iid}})$:
$$v_n(O^m_{n_7}(n)) = \frac{1}{n}\sum_{i=1}^n D^\star(P_{Q^*_{n,\mathrm{iid}},\,g^b})(O^m_i)^2, \tag{31}$$

• under i.i.d $(Q_0, g^\star(Q_0))$-optimal sampling, $v_n(O^m_{n_7}(n))$ is defined as in (31), replacing $g^b$ with $g^\star(Q_0)$,

• under $(Q_0, g_n^\star)$-adaptive sampling, $v_n(O^m_{n_7}(n)) = s_n^{\star 2}(O^m_{n_7}(n))$, the estimator of the conjectured asymptotic variance of $\sqrt{n}(\psi_n^\star(O^m_{n_7}(n)) - \psi_0)$ computed on the $n$ first observations $O^m_{n_7}(n)$.

We are interested in the empirical coverage (reported in Table 2, top rows)
$$c_{n_i} = \frac{1}{M}\sum_{m=1}^M \mathbf{1}\{\psi_0 \in I_{n_i,m}\}$$
guaranteed for each sampling scheme and every $i = 1, \ldots, 7$ by $\{I_{n_i,m} : m = 1, \ldots, M\}$. The rescaled empirical coverage proportions $Mc_{n_i}$ should have a Binomial distribution with parameter $(M, 1-a)$ with $a = \alpha$ for every $i = 1, \ldots, 7$.
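A check of this Binomial property can be sketched as follows (illustrative Python with SciPy, feeding in two entries of Table 2 by hand; not the authors' code):

```python
from scipy.stats import binomtest

M, alpha = 1000, 0.05

# empirical coverage 0.913 at n_1 under balanced sampling (Table 2):
# test H0: coverage probability = 1 - alpha against the one-sided
# alternative that it is lower (i.e., a > alpha)
res1 = binomtest(k=913, n=M, p=1 - alpha, alternative="less")

# empirical coverage 0.934 at n_1 under adaptive sampling
res2 = binomtest(k=934, n=M, p=1 - alpha, alternative="less")
```

The first p-value is far below 0.001, while the second is of the order of the 0.015 reported in Table 2 (small differences may come from the exact test variant used).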
This property can be tested by a standard binomial test, the alternative stating that $a > \alpha$. This results in a collection of 7 p-values for each sampling scheme, as reported in Table 2 (bottom rows).

Considering each sampling scheme (i.e., each row of Table 2) separately, we conclude that the $(1-\alpha)$-coverage cannot be declared defective under

• i.i.d $(Q_0, g^b)$-balanced sampling for any sample size $n_i \ge n_3 = 500$,
• i.i.d $(Q_0, g^\star(Q_0))$-optimal sampling for any sample size $n_i \ge n_2 = 250$,
• $(Q_0, g_n^\star)$-adaptive sampling for any sample size $n_i \ge n_1 = 100$,

adjusting for multiple testing by the Benjamini and Yekutieli procedure for controlling the False Discovery Rate at level 5%.

This is a remarkable result that not only validates the theory but also provides us with insight into the finite sample properties of the TMLE procedure based on adaptive sampling. The fact that the TMLE procedure behaves better under the adaptive sampling scheme than under the balanced i.i.d sampling scheme at sample size $n_1 = 100$ may not be due to mere chance. Although the TMLE procedure based on an adaptive sampling scheme is initiated under the balanced sampling scheme (so that each stratum initially consists of comparable numbers of patients assigned to each treatment arm, allowing one to estimate, at least roughly, the required parameters), it starts deviating from it (as soon as every $(A, V)$-stratum counts 5 patients) each time 25 new observations are accrued.
The poor performance of the TMLE procedure based on the optimal i.i.d sampling scheme at sample size $n_1$ is certainly due to the fact that, by starting directly from the optimal sampling scheme (a choice we would not recommend in practice), too few patients from stratum $V = 3$ are assigned to treatment arm $A = 0$ among the $n_1$ first subjects. At larger sample sizes, the TMLE procedure performs equally well under the adaptive sampling scheme and under both i.i.d schemes in terms of coverage.

7.4 Empirical width of the TMLE confidence intervals

Now that we know that the TMLE-based confidence intervals based on $(Q_0, g_n^\star)$-adaptive sampling are valid confidence regions, it is of interest to compare their widths with those of their counterparts obtained under the i.i.d $(Q_0, g^b)$-balanced or $(Q_0, g^\star(Q_0))$-optimal sampling schemes.

For this purpose, we compare on one hand, for each sample size $n_i$, the empirical distribution of $\{\sqrt{v_n(O^m_{n_7}(n_i))} : m = 1, \ldots, M\}$ as in (31) (i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size $n_i$ obtained under i.i.d $(Q_0, g^b)$-balanced sampling, up to the factor $2\xi_{1-\alpha/2}/\sqrt{n_i}$) to the empirical distribution of $\{s_n^\star(O^m_{n_7}(n_i)) : m = 1, \ldots, M\}$ (i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size $n_i$ obtained under $(Q_0, g_n^\star)$-adaptive sampling, up to the factor $2\xi_{1-\alpha/2}/\sqrt{n_i}$) in terms of the two-sample Kolmogorov-Smirnov test, where the alternative states that the confidence intervals obtained under adaptive sampling are stochastically smaller than their counterparts under i.i.d balanced sampling.
This results in 7 p-values, all equal to zero, which we nonetheless report in Table 3 (bottom row). In order to get a sense of how much narrower the confidence intervals obtained under adaptive sampling are, we also compute and report in Table 3 (top row) the ratios of empirical average widths
$$\frac{\frac{1}{M}\sum_{m=1}^M s_n^\star(O^m_{n_7}(n_i))}{\frac{1}{M}\sum_{m=1}^M \sqrt{v_n(O^m_{n_7}(n_i))}} \tag{32}$$
for each sample size $n_i$. Informally, this shows a 12% gain in width.

On the other hand, we also compare, for each sample size $n_i$, the empirical distribution of $\{\sqrt{v_n(O^m_{n_7}(n_i))} : m = 1, \ldots, M\}$ as in (31), but replacing $g^b$ by $g^\star(Q_0)$ (i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size $n_i$ obtained under i.i.d $(Q_0, g^\star(Q_0))$-optimal sampling, up to the factor $2\xi_{1-\alpha/2}/\sqrt{n_i}$) to the empirical distribution of $\{s_n^\star(O^m_{n_7}(n_i)) : m = 1, \ldots, M\}$ (i.e., the empirical distribution of the width of the TMLE-based confidence intervals at sample size $n_i$ obtained under $(Q_0, g_n^\star)$-adaptive sampling, up to the factor $2\xi_{1-\alpha/2}/\sqrt{n_i}$) in terms of the two-sample Kolmogorov-Smirnov test, where the alternative states that the confidence intervals obtained under adaptive sampling are stochastically larger than their counterparts under i.i.d optimal sampling. This results in 7 p-values that we report in Table 3 (bottom row). In order to get a sense of how similar the confidence intervals obtained under the adaptive and i.i.d optimal sampling schemes are, we also compute and report for each sample size $n_i$ in Table 3 (top row) the ratios of empirical average widths as in (32), again replacing $g^b$ by $g^\star(Q_0)$ in the definition (31) of $v_n(O^m_{n_7}(n))$.
Informally, this shows that the confidence intervals obtained under adaptive sampling are even slightly narrower on average than their counterparts obtained under i.i.d optimal sampling.
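The one-sided two-sample Kolmogorov-Smirnov comparisons above can be sketched as follows. This is illustrative Python with SciPy run on synthetic stand-in widths, not the study's data; note the somewhat counter-intuitive SciPy convention for the `alternative` argument:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
# synthetic stand-ins for the M simulated width factors (illustrative numbers)
widths_adaptive = rng.normal(loc=4.2, scale=0.15, size=1000)  # s*_n values
widths_balanced = rng.normal(loc=4.9, scale=0.15, size=1000)  # sqrt(v_n) values

# alternative='greater': the alternative is that the CDF of the first sample
# lies ABOVE that of the second, i.e. the first sample is stochastically
# SMALLER -- the direction used when comparing adaptive to balanced widths
res = ks_2samp(widths_adaptive, widths_balanced, alternative="greater")

# ratio of empirical average widths, as in display (32)
ratio = widths_adaptive.mean() / widths_balanced.mean()
```

With well-separated synthetic samples like these, the p-value is essentially zero and the ratio is well below 1, mimicking the first row of Table 3.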


Comparison                                    Sample size
                                              n_1       n_2       n_3       n_4
(Q_0, g_n^*) vs (Q_0, g^b)                    0.856     0.871     0.879     0.880
                                              (0)       (0)       (0)       (0)
(Q_0, g_n^*) vs (Q_0, g^*(Q_0))               0.962     0.977     0.992     0.995
                                              (0.144)   (0.236)   (0.100)   (0.060)

Comparison                                    Sample size
                                              n_5       n_6       n_7
(Q_0, g_n^*) vs (Q_0, g^b)                    0.878     0.877     0.876
                                              (0)       (0)       (0)
(Q_0, g_n^*) vs (Q_0, g^*(Q_0))               0.997     1.000     1.000
                                              (0.407)   (0.236)   (0.144)

Table 3: Comparing the widths of our confidence intervals. On one hand, we test for each sample size $n_i$ whether the TMLE-based confidence intervals obtained under $(Q_0, g_n^\star)$-adaptive sampling are stochastically narrower than the TMLE-based confidence intervals obtained under i.i.d $(Q_0, g^b)$-balanced sampling, in terms of the two-sample Kolmogorov-Smirnov test. On the other hand, we test for each sample size $n_i$ whether the TMLE-based confidence intervals obtained under $(Q_0, g_n^\star)$-adaptive sampling are stochastically wider than the TMLE-based confidence intervals obtained under i.i.d $(Q_0, g^\star(Q_0))$-optimal sampling, in terms of the two-sample Kolmogorov-Smirnov test. We report the p-values (bottom rows, between parentheses). In addition, we report for each sample size $n_i$ the ratios of average widths as defined in (32).

7.5 Illustrating example

So far we have been concerned with distributional results, answering the questions: Does the confidence interval $[\psi_n^\star \pm s_n^\star \xi_{1-\alpha/2}/\sqrt{n}]$ provide the desired coverage? (yes, even at moderate sample sizes: see Section 7.3); How does its width compare with that of the confidence interval obtained under either i.i.d sampling scheme? (well: see Section 7.4).
In this section, we focus on a particular simulated trajectory (we arbitrarily select the first one, associated with $O^1_{n_7}$) for the sake of illustration.

Some interesting features of the selected simulated trajectory are apparent in Fig. 2 and Table 4. For instance, we can follow the convergence of the TMLE $\psi_n^\star$ toward the true risk difference $\psi_0$ in the top plot of Fig. 2 and in the fifth column of Table 4. Similarly, the middle plot of Fig. 2 and the second to fourth columns of Table 4 illustrate the convergence of $g_n^\star$ toward $G^\star(\theta_0)$, as stated in Proposition 3. What these plots and columns also teach us is that, in spite of the misspecified working model, the learned design $G^\star(\theta_0)$ seems very close to the optimal treatment mechanism $g^\star(Q_0)$ for the simulation scheme and working model used in our simulation study. Moreover, the last column of Table 4 illustrates how the confidence intervals $[\psi_n^\star \pm s_n^\star \xi_{1-0.05/2}/\sqrt{n}]$ shrink around the true risk difference $\psi_0$ as the sample size increases.

Yet, the bottom plot of Fig. 2 may be the most interesting of the three. It obviously illustrates the convergence of $s_n^{\star 2}$ toward $\mathrm{Var}_{Q_0,G^\star(\theta_0)} D^\star(P_{Q_{\theta_0,\varepsilon_0},\,G^\star(\theta_0)})(O)$, i.e., toward the variance under the fixed design $P_{Q_0,G^\star(\theta_0)}$ of the efficient influence curve at $P_{Q_{\theta_0,\varepsilon_0},\,G^\star(\theta_0)}$. Hence, it also teaches us that the latter limit seems very close to the optimal asymptotic variance $v^\star(Q_0)$ for the simulation scheme and working model used in our simulation study. More importantly, $s_n^{\star 2}$ strikingly converges to $v^\star(Q_0)$ from below. This finite sample characteristic may reflect the fact that the true finite sample variance of $\sqrt{n}(\psi_n^\star - \psi_0)$ might be lower than $v^\star(Q_0)$.
Studying this issue further in depth is certainly very delicate, and goes beyond the scope of this article.

Figure 2: Illustrating the TMLE procedure under the $(Q_0, g_n^\star)$-adaptive sampling scheme. These three plots illustrate how the TMLE procedure behaves (on a simulated trajectory) as the sample size (on the x-axis, logarithmic scale; the vertical grey lines indicate the sample sizes $n_i$, $i = 1, \ldots, 7$) increases. Top plot: we represent the sequence $\psi_n^\star(O^1_{n_7}(n))$ at intermediate sample sizes $n$. The horizontal grey line indicates the true value of the risk difference $\psi_0$. Middle plot: we represent the three sequences $g_n^\star(1|1) = G^\star(\theta_n(O^1_{n_7}(n)))(1|1)$ (bottom curve), $g_n^\star(1|2) = G^\star(\theta_n(O^1_{n_7}(n)))(1|2)$ (middle curve) and $g_n^\star(1|3) = G^\star(\theta_n(O^1_{n_7}(n)))(1|3)$ (top curve) at intermediate sample sizes $n$. The three horizontal grey lines indicate the optimal allocation probabilities $g^\star(Q_0)(1|1)$ (bottom line), $g^\star(Q_0)(1|2)$ (middle line) and $g^\star(Q_0)(1|3)$ (top line). Bottom plot: we represent the sequence $s_n^{\star 2}$ of estimated asymptotic variances of $\sqrt{n}(\psi_n^\star - \psi_0)$ at intermediate sample sizes $n$. The horizontal grey line indicates the value of the optimal variance $v^\star(Q_0)$. See also Table 4.


Sample size   Allocation probabilities                      TMLE        Confidence interval
n             g_n^*(1|1)   g_n^*(1|2)   g_n^*(1|3)          psi_n^*     [psi_n^* +/- s_n^* xi_{1-0.05/2}/sqrt(n)]
n_1           0.589        0.764        0.766               1.252       [0.722; 1.783]
n_2           0.624        0.775        0.707               1.388       [0.974; 1.802]
n_3           0.679        0.767        0.795               1.361       [1.037; 1.685]
n_4           0.677        0.757        0.813               1.341       [1.068; 1.615]
n_5           0.670        0.760        0.806               1.250       [1.012; 1.488]
n_6           0.677        0.788        0.835               1.288       [1.126; 1.451]
n_7           0.694        0.793        0.834               1.273       [1.157; 1.389]

Table 4: Illustrating the TMLE procedure under the $(Q_0, g_n^\star)$-adaptive sampling scheme. This table illustrates how the TMLE procedure behaves (on a simulated trajectory) as the sample size increases. We report, at each sample size $n_i$, the updated adapted design $g_n^\star = G^\star(\theta_n(O^1_{n_7}(n)))$ (columns two to four), the current TMLE $\psi_n^\star(O^1_{n_7}(n))$ (column five) and the 95% confidence interval $I_{n,1}$ as in (28), with $s_n^\star(O^1_{n_7}(n))$ substituted for $s_n$. The true risk difference $\psi_0 \simeq 1.264$ belongs to all confidence intervals. See also Fig. 2.

8 Simulation study of the performances of the TMLE group sequential testing procedure

In this section, we resume the simulation study undertaken in Section 7 in order to investigate the performances of the TMLE group sequential testing procedure presented in Section 6. We describe the simulation scheme in Section 8.1, then evaluate how the testing procedure behaves in terms of empirical type I and type II errors in Section 8.2, and how it behaves in terms of the empirical distribution of the sample size at decision in Section 8.3.

8.1 The simulation scheme, continued

We wish to test the null "$\Psi(P) = \psi_0$" against its alternative "$\Psi(P) > \psi_0$" ($\psi_0 = \frac{91}{72}$) with prescribed type I error $\alpha = 5\%$ and type II error $\beta = 10\%$ at the alternative $\psi_1 = \psi_0 + 0.4$.
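For orientation, the classical fixed-sample (single-look) benchmark for the information required by such a one-sided test is $I = ((\xi_{1-\alpha} + \xi_{1-\beta})/(\psi_1 - \psi_0))^2$; a group sequential design necessarily inflates this benchmark, which is consistent with the value of $I_{\max}$ reported in Table 5. A quick check (illustrative Python with SciPy):

```python
from scipy.stats import norm

alpha, beta = 0.05, 0.10
delta = 0.4  # psi_1 - psi_0

# single-look information needed to detect delta with one-sided level alpha
# and power 1 - beta
info_fixed = ((norm.ppf(1 - alpha) + norm.ppf(1 - beta)) / delta) ** 2
# info_fixed is about 53.5; Table 5's I_max = 57.927 corresponds to an
# inflation factor of roughly 1.08 for the 4-look design
```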
Depending on whether we want to investigate the empirical behavior of the TMLE group sequential testing procedure with respect to (i) type I or (ii) type II errors, we test $M = 1000$ times the above hypotheses based on $3 \times M$ independent datasets obtained under

(i) empirical type I error study:
• i.i.d $(Q_0, g^b)$-balanced sampling,
• i.i.d $(Q_0, g^\star(Q_0))$-optimal sampling,
• $(Q_0, g_n^\star)$-adaptive sampling;

(ii) empirical type II error study:
• i.i.d $(Q_1, g^b)$-balanced sampling,
• i.i.d $(Q_1, g^\star(Q_0))$-optimal sampling,
• $(Q_1, g_n^\star)$-adaptive sampling,

where $Q_1$ is defined as $Q_0$ in Section 7.1, but with $\rho = \psi_1/\psi_0$ (so that $\Psi(Q_1) = \psi_1$).

We decide to proceed in $K = 4$ steps, at proportions $(p_1, p_2, p_3, p_4) = (0.25, 0.50, 0.75, 1)$. We choose the $\alpha$- and $\beta$-spending strategies characterized by the equalities $\sum_{l=1}^k \alpha_l = p_k^2\,\alpha$ and $\sum_{l=1}^k \beta_l = p_k^2\,\beta$, $k = 1, \ldots, K$. This set of conditions characterizes the whole group sequential testing procedure; see Table 5 for the resulting numerical values.

psi_0 = 1.264,  psi_1 = 1.664,  I_max = 57.927

Step k                 1          2          3          4
Rejection boundary     2.734      2.305      2.005      1.715
alpha-spending         0.3125%    0.9375%    1.5625%    2.1875%

Step k                 1          2          3          4
Futility boundary      -0.976     0.132      0.961      1.715
beta-spending          0.625%     1.875%     3.125%     4.375%

Table 5: Specifics of the TMLE group sequential testing procedure.

8.2 Empirical type I and type II errors of the TMLE group sequential testing procedure

We report in Table 6 the empirical type I and type II errors obtained during the course of the simulation study. The values are strikingly close to each other in both cases. Even better, the empirical type I and type II errors obtained under $(Q, g_n^\star)$-adaptive sampling are the lowest in both cases.

Since the number of times the null was falsely rejected in favor of its alternative is Binomial with parameter $(M, a)$, it is possible to test rigorously whether $a = \alpha$ (as it should) against $a > \alpha$. This yields three p-values that we report in Table 6. Because there are fewer than 50 wrong decisions for all sampling schemes, we naturally get large p-values, confirming (if necessary) that the type I error is under control.

Similarly, the number of times the null was falsely not rejected in favor of its alternative is also Binomial with parameter $(M, b)$. Thus, it is possible to test rigorously whether $b = \beta$ (as it should) against $b > \beta$. This yields three small p-values ($p < 0.001$ for the i.i.d balanced sampling scheme, 0.002 for the i.i.d optimal sampling scheme and 0.003 for the adaptive sampling scheme). This confirms the impression that the numbers of wrong decisions are all significantly larger than what random deviations from the reference distribution are likely to allow. Put bluntly, the type II error is not under control. However, there is no real need to worry. Indeed, if one rather tests $b = 12\%$ against $b > 12\%$, then the three p-values (reported in Table 6) are large.

In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 performs at least as well under the $(Q, g_n^\star)$-adaptive sampling scheme as under both the i.i.d $(Q, g^b)$-balanced and $(Q, g^\star(Q))$-optimal sampling schemes. The type I error control meets the requirements. The type II error control is defective, but only slightly so.

                                 Type I error (Q = Q_0)       Type II error (Q = Q_1)
Sampling scheme                  Empirical error (p-value)    Empirical error (p-value)
i.i.d (Q, g^b)-balanced          0.040 (0.919)                0.132 (0.113)
i.i.d (Q, g^*(Q))-optimal        0.043 (0.827)                0.128 (0.203)
adaptive (Q, g_n^*)              0.040 (0.919)                0.126 (0.261)

Table 6: Checking the adequacy of the type I and type II error controls of our simulated TMLE group sequential testing procedures. We test whether the Binomial random numbers of times the null is falsely rejected have parameter $(M, \alpha)$, the alternative stating that they have parameter $(M, a)$ with $a > \alpha$. We also test whether the Binomial random numbers of times the null is falsely not rejected have parameter $(M, 12\%)$, the alternative stating that they have parameter $(M, b)$ with $b > 12\%$. We report the empirical type I and type II errors and the corresponding p-values.
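The quadratic spending schedule $\sum_{l \le k} \alpha_l = p_k^2\,\alpha$ (and likewise for $\beta$) determines the per-step spending increments reported in Table 5; a quick Python check (illustrative only):

```python
import numpy as np

alpha, beta = 0.05, 0.10
p = np.array([0.25, 0.50, 0.75, 1.0])  # information proportions p_k

# cumulative error spent by step k: p_k^2 * alpha (resp. p_k^2 * beta)
alpha_spent = p**2 * alpha
beta_spent = p**2 * beta

# per-step increments alpha_k and beta_k
alpha_k = np.diff(alpha_spent, prepend=0.0)
beta_k = np.diff(beta_spent, prepend=0.0)
# alpha_k -> 0.3125%, 0.9375%, 1.5625%, 2.1875%  (matches Table 5)
# beta_k  -> 0.625%, 1.875%, 3.125%, 4.375%      (matches Table 5)
```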


8.3 Empirical distribution of the sample size at decision of the TMLE group sequential testing procedure

Now that we know that the TMLE group sequential testing procedure performs at least as well under the adaptive sampling scheme as under both i.i.d sampling schemes, let us compare the performances in terms of sample size at decision.

For this purpose, we simply report the average sample sizes at decision for each sampling scheme when evaluating the type I and type II error controls; see Table 7. The gain in average sample size at decision obtained by resorting to the $(Q, g_n^\star)$-adaptive sampling scheme instead of the i.i.d $(Q, g^b)$-balanced sampling scheme is dramatic: on average, one needs approximately 16% fewer observations in order to reach a conclusion under the $(Q, g_n^\star)$-adaptive sampling scheme relative to the i.i.d $(Q, g^b)$-balanced sampling scheme. Furthermore, it appears that it is even more efficient to resort to the i.i.d $(Q, g^\star(Q))$-optimal sampling scheme: on average, one needs approximately 6% more observations in order to reach a conclusion under the $(Q, g_n^\star)$-adaptive sampling scheme relative to the i.i.d $(Q, g^\star(Q))$-optimal sampling scheme.

In summary, for the considered scenario, the TMLE group sequential testing procedure described in Section 6.1 reaches a decision more quickly under the $(Q, g_n^\star)$-adaptive sampling scheme than under the i.i.d $(Q, g^b)$-balanced sampling scheme, with a gain in average sample size at decision of approximately 16%.
Furthermore, the TMLE group sequential testing procedure reaches a decision more slowly under the $(Q, g_n^\star)$-adaptive sampling scheme than under the i.i.d $(Q, g^\star(Q))$-optimal sampling scheme, but the loss in average sample size at decision reduces to approximately 6%.

                                 Type I error (Q = Q_0)    Type II error (Q = Q_1)
Sampling scheme                  Average sample size       Average sample size
i.i.d (Q, g^b)-balanced          786.63                    895.64
i.i.d (Q, g^*(Q))-optimal        620.71                    713.47
adaptive (Q, g_n^*)              661.14                    746.72

Table 7: Comparing the average sample sizes at decision for each sampling scheme when evaluating the type I and type II error controls.

A Proofs

Let us introduce for convenience the notation $\theta^\varepsilon = (\theta, \varepsilon)$, $\theta_0^\varepsilon = (\theta_0, \varepsilon_0)$ and $\theta_n^\varepsilon = (\theta_n, \varepsilon_n)$. Moreover, let us recall that if $\mu$ is a probability measure and $f$ a measurable function, then $\mu f$ denotes the integral $\int f \, d\mu$.

A.1 Proof of Propositions 2 and 4, and a useful remark on Proposition 3

Proof of Proposition 2.
The cornerstone of the proof is to interpret $\theta_n$ as the solution in $\theta$ of the martingale estimating equation $\sum_{i=1}^n D_1(\theta)(O_i, Z_i) = 0$, where $Z_i$ (defined in (9)) is the finite dimensional summary measure of the past observations $O_n(i-1)$ such that $g_i^\star$ depends on $O_n(i-1)$ only through $Z_i$ (hence the notation $g_i^\star = g_{Z_i}$), and $D_1(\theta)(O, Z) = \dot{\ell}_{\theta,0}(O)\, g^r(A|V)/g_Z(A|V)$.

Note first that, for all $i \le n$,
$$P_{Q_0,g_i^\star} D_1(\theta_0) = E_{P_{Q_0,g^r}}\big[\dot{\ell}_{\theta_0,0}(O_i) \,\big|\, O_n(i-1)\big] = 0$$
by A1 (changing the order of differentiation and integration is permitted by the dominated convergence theorem, because $O$ is bounded and $\Theta$ is compact; see PARAM).
Observe then that, by definition of $\theta_n$, we have:
$$\left\|\frac{1}{n}\sum_{i=1}^n P_{Q_0,g_i^\star} D_1(\theta_n)\right\| = \left\|\frac{1}{n}\sum_{i=1}^n D_1(\theta_n)(O_i, Z_i) - P_{Q_0,g_i^\star} D_1(\theta_n)\right\| \le \sup_{\theta \in \Theta}\left\|\frac{1}{n}\sum_{i=1}^n D_1(\theta)(O_i, Z_i) - P_{Q_0,g_i^\star} D_1(\theta)\right\| \equiv M_n.$$
Now, since (a) $\sup_{\theta \in \Theta}\|D_1(\theta)\|_\infty < \infty$ (because $O$ is bounded and $\Theta$ is compact) and (b) the standard entropy of $\mathcal{F} = \{D_1(\theta) : \theta \in \Theta\}$ for the supremum norm satisfies $H(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = O(\log 1/\varepsilon)$ (see (van der Vaart, 1998), example 19.7), we can apply (componentwise) the Kolmogorov strong law of large numbers for martingales as in Theorem 8 of (Chambaz and van der Laan, 2009). This yields the almost sure convergence of $M_n$, hence of $n^{-1}\sum_{i=1}^n P_{Q_0,g_i^\star} D_1(\theta_n)$, to 0.

By a Taylor expansion of $\theta \mapsto P_{Q_0,g^r}\,\dot{\ell}_{\theta,0}$ at $\theta_0$ (changing the order of differentiation and integration is permitted here too, for the same reasons as above), it holds that, for all $i \le n$ (recall that $P_{Q_0,g_i^\star} D_1(\theta_0) = 0$),
$$P_{Q_0,g_i^\star} D_1(\theta_n) = (P_{Q_0,g^r}\,\ddot{\ell}_{\theta_0,0})(\theta_n - \theta_0) + o_P(\|\theta_n - \theta_0\|).$$
From this we deduce that $(P_{Q_0,g^r}\,\ddot{\ell}_{\theta_0,0})(\theta_n - \theta_0)$ converges in probability to 0. Because $P_{Q_0,g^r}\,\ddot{\ell}_{\theta_0,0}$ is positive definite by A2, this implies the result.

Proof of Proposition 4.
The proof of (i) is very simple and typical of robust statistical studies. Indeed,
$$0 = P_{Q_0,G^\star(\theta_0)} \frac{\partial}{\partial\varepsilon}\,\ell_{\theta_0,\varepsilon}\Big|_{\varepsilon=\varepsilon_0},$$
the latter expression also writing as
$$E_{Q_0,G^\star(\theta_0)}\left\{\frac{2A-1}{G^\star(\theta_0)(A|V)}\big(Y - \bar{Q}(\theta_0^\varepsilon)(A, W)\big)\right\} = E_{Q_0,G^\star(\theta_0)}\left\{\big(\bar{Q}_0(1, W) - \bar{Q}(\theta_0^\varepsilon)(1, W)\big) - \big(\bar{Q}_0(0, W) - \bar{Q}(\theta_0^\varepsilon)(0, W)\big)\right\} = \Psi(Q_0) - \Psi(Q_{\theta_0^\varepsilon}),$$
hence the result.

The proof of (ii) fundamentally relies on the fact that $\theta_n^\varepsilon$ solves the martingale estimating equation $\sum_{i=1}^n D(\theta^\varepsilon)(O_i, Z_i) = 0$, where $D(\theta^\varepsilon)$ is the extension of $D_1(\theta)$ defined in (19).
Here too, $P_{Q_0,g_i^\star} D(\theta_0^\varepsilon) = 0$ for all $i \le n$, and applying the Kolmogorov strong law of large numbers for martingales (see Theorem 8 in (Chambaz and van der Laan, 2009)) yields that $n^{-1}\sum_{i=1}^n P_{Q_0,g_i^\star} D(\theta_n^\varepsilon)$ converges to zero almost surely. This entails the convergence in probability of $\theta_n^\varepsilon$ to $\theta_0^\varepsilon$ by a Taylor expansion of $\theta^\varepsilon \mapsto (P_{Q_0,g^r}\,\dot{\ell}_{\theta,0}^\top, P_{Q_0,G^\star(\theta)} \frac{\partial}{\partial\varepsilon}\,\ell_{\theta,\varepsilon})$ at $\theta_0^\varepsilon$. Note that A3 is a clear counterpart to A1 from Proposition 2, but that there is no counterpart in Proposition 4 to A2 from Proposition 2. Indeed, it automatically holds in the framework of the proposition that $-P_{Q_0,G^\star(\theta_0)} \frac{\partial^2}{\partial\varepsilon^2}\,\ell_{\theta_0,\varepsilon}|_{\varepsilon=\varepsilon_0} > 0$, the proof only requiring that the latter quantity be different from zero.

A useful remark on Proposition 3.
We deduce in Proposition 3 the convergence in probability and in $L^1$ of the adaptive design $g_n^\star$ to the limit design $G^\star(\theta_0)$ from the convergence in probability of $\theta_n$ to $\theta_0$ as obtained in Proposition 2. It is crucial for us that $n^{-1}\sum_{i=1}^n g_i^\star$ and $n^{-1}\sum_{i=1}^n (g_i^\star)^{-1}$ also converge to $G^\star(\theta_0)$ and $1/G^\star(\theta_0)$ (with an obvious notational shortcut) in probability and in $L^1$. Fortunately, this is true because (a) the positive random variables $g_i^\star$ are uniformly bounded away from 0, and (b) if a sequence $X_n$ converges in $L^1$ to 0, then $n^{-1}\sum_{i=1}^n X_i$ also converges in $L^1$ to 0 (by convexity of the $L^1$-norm). Let us put this in a lemma:


hal-00582753, version 1 - 4 Apr 2011

Lemma 1. For all $(a,v) \in \mathcal{A} \times \mathcal{V}$, $n^{-1}\sum_{i=1}^{n} g^\star_i(a|v)$ and $n^{-1}\sum_{i=1}^{n} (g^\star_i(a|v))^{-1}$ converge in probability and in $L^1$ to $G^\star(\theta_0)(a|v)$ and $G^\star(\theta_0)(a|v)^{-1}$.

A.2 Proof of Proposition 5

The proof of Proposition 5 still relies on the facts that (a) $\theta^\varepsilon_n$ solves the martingale estimating equation $\sum_{i=1}^{n} D(\theta^\varepsilon)(O_i, Z_i) = 0$ and (b) $P_{Q_0,g^\star_i} D(\theta^\varepsilon_0) = 0$ for all $i \le n$.

Consider the following equality:
$$\mathrm{LHT} \equiv \frac{1}{n}\sum_{i=1}^{n} \left[D(\theta^\varepsilon_n)(O_i,Z_i) - D(\theta^\varepsilon_0)(O_i,Z_i)\right] = -\frac{1}{n}\sum_{i=1}^{n} \left[D(\theta^\varepsilon_0)(O_i,Z_i) - P_{Q_0,g^\star_i} D(\theta^\varepsilon_0)\right]. \tag{33}$$

A Taylor expansion of LHT at $\theta^\varepsilon_0$ first yields that
$$\mathrm{LHT} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)\Big|_{\theta^\varepsilon=\theta^\varepsilon_0} (\theta^\varepsilon_n - \theta^\varepsilon_0) + o_P(\|\theta^\varepsilon_n - \theta^\varepsilon_0\|) = A_n(\theta^\varepsilon_n - \theta^\varepsilon_0) + B_n(\theta^\varepsilon_n - \theta^\varepsilon_0) + o_P(\|\theta^\varepsilon_n - \theta^\varepsilon_0\|),$$
where $A_n = E[n^{-1}\sum_{i=1}^{n} \frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)|_{\theta^\varepsilon=\theta^\varepsilon_0}]$ and $B_n = n^{-1}\sum_{i=1}^{n} \frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)|_{\theta^\varepsilon=\theta^\varepsilon_0} - A_n$.

The matrix $A_n$ satisfies $A_n = n^{-1} E[\sum_{i=1}^{n} P_{Q_0,g^\star_i} \frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)|_{\theta^\varepsilon=\theta^\varepsilon_0}]$. Now, for each $i \le n$,
$$P_{Q_0,g^\star_i} \frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)\Big|_{\theta^\varepsilon=\theta^\varepsilon_0} = S_0, \tag{34}$$
where the latter matrix, defined in (21) and independent of $i$, is deterministic (this is due to the weighting). Thus, $A_n = S_0$. Moreover, $A_n = S_0$ is invertible because its determinant equals the product of $P_{Q_0,G^\star(\theta_0)} \frac{\partial^2}{\partial \varepsilon^2} l_{\theta_0,\varepsilon}|_{\varepsilon=\varepsilon_0} < 0$ and
$$\det E_{Q_0,G^\star(\theta_0)}\left[\ddot{l}_{\theta_0,0}(O)\, \frac{g^r(A|V)}{G^\star(\theta_0)(A|V)}\right] = \det P_{Q_0,g^r}\, \ddot{l}_{\theta_0,0},$$
which is negative by A2. Furthermore, (34) and the fact that $A_n = S_0$ is deterministic also imply that $B_n$ can be rewritten as
$$B_n = \frac{1}{n}\sum_{i=1}^{n} \left[\frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)(O_i,Z_i)\Big|_{\theta^\varepsilon=\theta^\varepsilon_0} - P_{Q_0,g^\star_i} \frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)\Big|_{\theta^\varepsilon=\theta^\varepsilon_0}\right],$$
and applying (componentwise) a standard law of large numbers for martingales (see for instance Theorem 2.4.2 in (Sen and Singer, 1993)) yields that $B_n$ converges to 0 almost surely (note that $\sup_{\theta^\varepsilon \in \Theta \times \mathcal{E}} \|\frac{\partial}{\partial \theta^\varepsilon} D(\theta^\varepsilon)\|_\infty < \infty$). We emphasize that this proves the convergence in probability of $S_n$ to $S_0$ as stated in the proposition. Consequently, $\mathrm{LHT} = S_0(\theta^\varepsilon_n - \theta^\varepsilon_0) + o_P(\|\theta^\varepsilon_n - \theta^\varepsilon_0\|)$, and (33) entails
$$\sqrt{n}(\theta^\varepsilon_n - \theta^\varepsilon_0) = -S_0^{-1} M_n + o_P(\sqrt{n}\|\theta^\varepsilon_n - \theta^\varepsilon_0\|) \tag{35}$$
with $M_n = n^{-1/2}\sum_{i=1}^{n} [D(\theta^\varepsilon_0)(O_i,Z_i) - P_{Q_0,g^\star_i} D(\theta^\varepsilon_0)]$. It mainly remains to show that $M_n$ satisfies a central limit theorem.

For this purpose, we apply Theorem 10 in (Chambaz and van der Laan, 2009). Define $C_n = \sum_{i=1}^{n} P_{Q_0,g^\star_i} D(\theta^\varepsilon_0) D(\theta^\varepsilon_0)^\top$. It holds that
$$\frac{1}{n} C_n = \frac{1}{n}\sum_{i=1}^{n} E_{Q_0,G^\star(\theta_0)}\left[\frac{\tilde{D}(\theta^\varepsilon_0)\tilde{D}(\theta^\varepsilon_0)^\top(O)}{g^\star_i(A|V)\, G^\star(\theta_0)(A|V)} \,\Big|\, O_n(i-1)\right] = \sum_{(a,v) \in \mathcal{A} \times \mathcal{V}} E_{Q_0,G^\star(\theta_0)}\left[\frac{\tilde{D}(\theta^\varepsilon_0)\tilde{D}(\theta^\varepsilon_0)^\top(O)}{G^\star(\theta_0)(A|V)}\, 1\{(A,V) = (a,v)\}\right] \frac{1}{n}\sum_{i=1}^{n} \frac{1}{g^\star_i(a|v)}. \tag{36}$$

Now, by Lemma 1, (36) implies that $n^{-1} E C_n = \Sigma_0(1 + o(1))$, where $\Sigma_0$ (defined in (22)) is positive definite when A4 is met. Indeed, assume on the contrary that there exists a vector $u \ne 0$ such that $u^\top \Sigma_0 u = 0$. Then necessarily $\tilde{D}(\theta^\varepsilon_0)^\top u = 0$ $P_{Q_0,G^\star(\theta_0)}$-almost surely, which contradicts A4. Using (36) again, we also see that
$$\frac{1}{n}(C_n - E C_n) = \sum_{(a,v) \in \mathcal{A} \times \mathcal{V}} E_{Q_0,G^\star(\theta_0)}\left[\frac{\tilde{D}(\theta^\varepsilon_0)\tilde{D}(\theta^\varepsilon_0)^\top(O)}{G^\star(\theta_0)(A|V)}\, 1\{(A,V) = (a,v)\}\right] \times \left(\frac{1}{n}\sum_{i=1}^{n} \frac{1}{g^\star_i(a|v)} - E\left[\frac{1}{n}\sum_{i=1}^{n} \frac{1}{g^\star_i(a|v)}\right]\right),$$
from which we deduce that $n^{-1}(C_n - E C_n)$ converges to 0 in probability by Lemma 1. Consequently, $M_n$ converges in distribution to the centered Gaussian law with covariance matrix $\Sigma_0$. In addition, $n^{-1}\sum_{i=1}^{n} D(\theta^\varepsilon_0) D(\theta^\varepsilon_0)^\top(O_i,Z_i)$ converges in probability to $\Sigma_0$. Since (a) $\theta^\varepsilon \mapsto D(\theta^\varepsilon) D(\theta^\varepsilon)^\top$ is differentiable at $\theta^\varepsilon_0$, (b) its derivative is bounded, and (c) we already know that $\|\theta^\varepsilon_n - \theta^\varepsilon_0\| = o_P(1)$, yet another Taylor expansion allows to derive that $\Sigma_n$ (defined in (23)) equals $\Sigma_0(1 + o_P(1))$.

Let us go back to (35), knowing now that $M_n$ satisfies a central limit theorem. We derive from it that $\sqrt{n}\|\theta^\varepsilon_n - \theta^\varepsilon_0\| = O_P(1) + o_P(\sqrt{n}\|\theta^\varepsilon_n - \theta^\varepsilon_0\|)$, hence $\sqrt{n}\|\theta^\varepsilon_n - \theta^\varepsilon_0\| = O_P(1)$. Thus, (35) does entail (20) since $P_{Q_0,g^\star_i} D(\theta^\varepsilon_0) = 0$ for all $i \le n$. The stated central limit theorem on $\sqrt{n}(\theta^\varepsilon_n - \theta^\varepsilon_0)$ readily follows.

A.3 Proof of Propositions 6 and 7

Proof of Proposition 6. The proof of Proposition 6 is twofold and relies on the decomposition
$$\psi^*_n - \psi_0 = (\psi^*_n - \Psi(Q^\sim_n)) + (\Psi(Q^\sim_n) - \psi_0), \tag{37}$$
where
$$Q^\sim_n = (Q_W(P_0), Q_{Y|A,W}(\theta^\varepsilon_n)). \tag{38}$$

First, a continuity argument and the convergence in probability of the stacked estimator $\theta^\varepsilon_n$ to $\theta^\varepsilon_0$ entail the convergence in probability of $\Psi(Q^\sim_n)$ to $\Psi(Q_{\theta^\varepsilon_0}) = \psi_0$ (the equality holds by (i) in Proposition 4). The conclusion follows because $\psi^*_n - \Psi(Q^\sim_n)$ converges almost surely to zero. Indeed,
$$\psi^*_n - \Psi(Q^\sim_n) = (P_{W,n} - P_{W,0})\, q_{\theta^\varepsilon_n}, \tag{39}$$
where, for all $\theta^\varepsilon \in \Theta \times \mathcal{E}$, $q_{\theta^\varepsilon} = \bar{Q}(\theta^\varepsilon)(1,\cdot) - \bar{Q}(\theta^\varepsilon)(0,\cdot)$ and, with a slight abuse of notation, $P_{W,n}$ (respectively, $P_{W,0}$) denotes the empirical (respectively, true) marginal distribution of $W$. Thus,
$$|\psi^*_n - \Psi(Q^\sim_n)| \le \sup_{\theta^\varepsilon \in \Theta \times \mathcal{E}} |(P_{W,n} - P_{W,0})\, q_{\theta^\varepsilon}|, \tag{40}$$
where (a) $\sup_{\theta^\varepsilon \in \Theta \times \mathcal{E}} \|q_{\theta^\varepsilon}\|_\infty < \infty$ and (b) the standard entropy of $\mathcal{F} = \{q_{\theta^\varepsilon} : \theta^\varepsilon \in \Theta \times \mathcal{E}\}$ for the supremum norm satisfies $H(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) = O(\log 1/\varepsilon)$ (see (van der Vaart, 1998), Example 19.7). Therefore, the class $\mathcal{F}$ is $P_{W,0}$-Glivenko-Cantelli and the right-hand side term of (40) converges to 0 $P_{W,0}$-almost surely by the Glivenko-Cantelli theorem (i.i.d. framework; see for instance Theorem 19.4 in (van der Vaart, 1998)).

Proof of Proposition 7. The proof of (26) relies again on (37). It is easy to derive the asymptotic linear expansion of the first term. Indeed, we can rewrite (39) as
$$\sqrt{n}(\psi^*_n - \Psi(Q^\sim_n)) = \sqrt{n}(P_{W,n} - P_{W,0})\, q_{\theta^\varepsilon_0} + \sqrt{n}(P_{W,n} - P_{W,0})(q_{\theta^\varepsilon_n} - q_{\theta^\varepsilon_0}) = \sqrt{n} P_{W,n}(q_{\theta^\varepsilon_0} - P_{W,0}\, q_{\theta^\varepsilon_0}) + o_P(1) = \sqrt{n} P_{W,n} D_1(Q_{\theta^\varepsilon_0}) + o_P(1), \tag{41}$$


which holds subject to checking that $\sqrt{n}(P_{W,n} - P_{W,0})(q_{\theta^\varepsilon_n} - q_{\theta^\varepsilon_0}) = o_P(1)$. Since the class $\mathcal{F}$ introduced previously for the proof of Proposition 6 is also $P_{W,0}$-Donsker, this is a consequence of Lemma 19.24 in (van der Vaart, 1998), provided that $X_n = P_{W,0}(q_{\theta^\varepsilon_n} - q_{\theta^\varepsilon_0})^2$ converges in probability to 0. Now, $\theta^\varepsilon_n$ converges to $\theta^\varepsilon_0$ in probability, and the function $(w, \theta^\varepsilon) \mapsto q_{\theta^\varepsilon}(w)$ continuously maps $\mathcal{W} \times \Theta \times \mathcal{E}$ onto $\mathbb{R}$ (thus it is uniformly bounded). Consequently, $X_n$ (which is obviously non-negative) is upper bounded, so that it is equivalent to show that $X_n$ converges to 0 in $L^1$. By the Fubini theorem,
$$E X_n = \int E\left[(q_{\theta^\varepsilon_n}(w) - q_{\theta^\varepsilon_0}(w))^2\right] dP_{W,0}(w).$$
By the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)), $(q_{\theta^\varepsilon_n}(w) - q_{\theta^\varepsilon_0}(w))^2$ converges to 0 in probability (hence in $L^1$, the variables being bounded) for all $w \in \mathcal{W}$. Therefore, the integrand of the right-hand side integral in the previous display converges pointwise to 0. Since it is also bounded, we conclude by the dominated convergence theorem that $E X_n$ converges to 0. This completes the study of the first term of the decomposition (37).

Regarding the second term of the latter decomposition, we derive its asymptotic linear expansion by the delta-method (see Theorem 3.1 in (van der Vaart, 1998)) from that of $\sqrt{n}(\theta^\varepsilon_n - \theta^\varepsilon_0)$, see (20). Specifically,
$$\sqrt{n}(\Psi(Q^\sim_n) - \psi_0) = \sqrt{n}(\phi(\theta^\varepsilon_n) - \phi(\theta^\varepsilon_0)) = \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} \frac{1}{\sqrt{n}}\sum_{i=1}^{n} D(\theta^\varepsilon_0)(O_i,Z_i) + o_P(1). \tag{42}$$
Combining (41) and (42) yields the wished asymptotic linear expansion (26) and the closed-form expression of the related influence function IC as given in (27).

A central limit theorem for real-valued martingales (see Theorem 9 in (Chambaz and van der Laan, 2009)) applied to (26) yields the stated convergence and validates the use of $s^2_n$ as an estimator of the asymptotic variance. To see this, note that $P_{Q_0,g^\star_i} \mathrm{IC} = 0$ for all $i \le n$ and define $c_n = \sum_{i=1}^{n} P_{Q_0,g^\star_i} \mathrm{IC}^2$. Now we emphasize that, for every $i \le n$, firstly
$$P_{Q_0,g^\star_i} D^\star_1(Q_{\theta^\varepsilon_0})^2 = P_{Q_0,G^\star(\theta_0)} D^\star_1(Q_{\theta^\varepsilon_0})^2,$$
secondly
$$2 P_{Q_0,g^\star_i} D^\star_1(Q_{\theta^\varepsilon_0})\, \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} D(\theta^\varepsilon_0) = 2 \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} P_{Q_0,g^\star_i} D^\star_1(Q_{\theta^\varepsilon_0}) D(\theta^\varepsilon_0) = 2 \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} P_{Q_0,G^\star(\theta_0)}\left[\frac{D^\star_1(Q_{\theta^\varepsilon_0})\, \tilde{D}(\theta^\varepsilon_0)}{G^\star(\theta_0)}\right],$$
and thirdly
$$P_{Q_0,g^\star_i} (\phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} D(\theta^\varepsilon_0))^2 = \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} \left(P_{Q_0,g^\star_i} D(\theta^\varepsilon_0) D(\theta^\varepsilon_0)^\top\right) (\phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1})^\top.$$
By combining the last three equalities, we thus obtain that
$$\frac{1}{n} c_n = P_{Q_0,G^\star(\theta_0)} D^\star_1(Q_{\theta^\varepsilon_0})^2 + 2 \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} P_{Q_0,G^\star(\theta_0)}\left[\frac{D^\star_1(Q_{\theta^\varepsilon_0})\, \tilde{D}(\theta^\varepsilon_0)}{G^\star(\theta_0)}\right] + \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} \frac{1}{n} C_n\, (\phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1})^\top, \tag{43}$$
where $C_n$ was introduced in the proof of Proposition 5. Since we already showed in the latter proof that $n^{-1} E C_n = \Sigma_0(1 + o_P(1))$, (43) notably yields the convergence of $n^{-1} E c_n$ toward
$$P_{Q_0,G^\star(\theta_0)} D^\star_1(Q_{\theta^\varepsilon_0})^2 + 2 \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} P_{Q_0,G^\star(\theta_0)}\left[\frac{D^\star_1(Q_{\theta^\varepsilon_0})\, \tilde{D}(\theta^\varepsilon_0)}{G^\star(\theta_0)}\right] + \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} \Sigma_0\, (\phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1})^\top = P_{Q_0,G^\star(\theta_0)}\left(D^\star_1(Q_{\theta^\varepsilon_0}) + \phi'^{\top}_{\theta^\varepsilon_0} S_0^{-1} \frac{\tilde{D}(\theta^\varepsilon_0)}{G^\star(\theta_0)}\right)^2 \equiv s^2,$$
which is positive when A4 is met. In addition, (43) also entails that $n^{-1}(c_n - E c_n)$ converges to 0 in probability, therefore implying the convergence in distribution of $\sqrt{n}(\psi^*_n - \psi_0)$ to the centered Gaussian distribution with variance $s^2$ as well as the convergence in probability of $n^{-1}\sum_{i=1}^{n} \mathrm{IC}(O_i,Z_i)^2$ to $s^2$.
Since (a) $\theta^\varepsilon \mapsto D^\star_1(Q_{\theta^\varepsilon})$ and $\theta^\varepsilon \mapsto D(\theta^\varepsilon)$ are differentiable at $\theta^\varepsilon_0$, (b) both with uniformly bounded derivatives, (c) we already know that $\|\theta^\varepsilon_n - \theta^\varepsilon_0\| = o_P(1)$, and (d) $\phi'_{\theta^\varepsilon_0}$ and $S_0^{-1}$ are consistently estimated by $\phi'_n$ and $S_n^{-1}$, yet another Taylor expansion (and the continuous mapping theorem, see for instance Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) finally yields the stated convergence in probability of $s^2_n$ to $s^2$, therefore completing the proof.

A.4 Proof of Proposition 8

The fact that $\Psi(Q_{f/\sqrt{n}}) > \Psi(Q_0)$ for $n$ large enough is a consequence of the expansion $\Psi(Q_{f/\sqrt{n}}) = \Psi(Q_0) + P_{Q_0,G^\star(\theta_0)} D^\star(P_{Q_0,G^\star(\theta_0)})\, f/\sqrt{n} + o(1/\sqrt{n})$, which holds because $\Psi$ is pathwise differentiable at $P_{Q_0,g}$ (for any $g$) relative to the maximal tangent space, with efficient influence curve $D^\star(Q_0, g)$. The rest of Proposition 8 is a corollary of the following lemma.

Lemma 2. Let $\Lambda_n$ be the log-likelihood ratio of the $(Q_{f/\sqrt{n}}, g^\star_n)$ experiment relative to the $(Q_0, g^\star_n)$ experiment. Under $(Q_0, g^\star_n)$, the vector $(\sqrt{n_1}(\psi^*_{n_1} - \Psi(Q_0)), \ldots, \sqrt{n_K}(\psi^*_{n_K} - \Psi(Q_0)), \Lambda_n)$ converges in law to the Gaussian distribution with mean $(0, \ldots, 0, -\tfrac{1}{2} P_{Q_0,G^\star(\theta_0)} f^2)$ and covariance matrix
$$\begin{pmatrix} P_{Q_0,G^\star(\theta_0)} \mathrm{IC}^2 \times C & P_{Q_0,G^\star(\theta_0)} \mathrm{IC} f \times (\sqrt{p_1}, \ldots, \sqrt{p_K})^\top \\ P_{Q_0,G^\star(\theta_0)} \mathrm{IC} f \times (\sqrt{p_1}, \ldots, \sqrt{p_K}) & P_{Q_0,G^\star(\theta_0)} f^2 \end{pmatrix}.$$
In particular, the $(Q_{f/\sqrt{n}}, g^\star_n)$ and $(Q_0, g^\star_n)$ experiments are mutually contiguous.

The limiting distribution of $(T_1, \ldots, T_K)$ under $(Q_0, g^\star_n)$ is easily derived from Lemma 2. Le Cam's third lemma (see Example 6.7 in (van der Vaart, 1998)) solves the problem of obtaining the limiting distribution of $(T_1, \ldots, T_K)$ under $(Q_{f/\sqrt{n}}, g^\star_n)$ from that under $(Q_0, g^\star_n)$. Only the asymptotic means are different.
We do not repeat the proof here, as it is exactly the same (up to some minor variations in notation) as the proof of Lemma 5 in (Chambaz and van der Laan, 2011).

Proof of Lemma 2. Since $f$ is bounded and $E_{P_{Q_0,G^\star(\theta_0)}}[f(O)|A,W] = 0$ (here, one could equivalently use the notation $E_{Q_0}$), we can assume without loss of generality that the fluctuation $\{Q_0(h) : h \in H\}$ is characterized by
$$\frac{dQ_0(h)}{dQ_0} = 1 + h f(O)$$
for all $h \in H = (-1/\|f\|_\infty, 1/\|f\|_\infty)$. Let us denote by $q_n$ and $q_0$ the conditional densities of $Y$ given $(A,W)$ under $Q_{f/\sqrt{n}}$ and $Q_0$.

The log-likelihood ratio satisfies $\Lambda_n = \sum_{i=1}^{n} \log q_n/q_0(O_i)$, by conditional independence of $O_1, \ldots, O_n$ given $((A_1,W_1), \ldots, (A_n,W_n))$ and because the marginal distribution of $W$ is the same under $Q_{f/\sqrt{n}}$ as under $Q_0$ (see (4) and note that the $g^\star_i(A_i|W_i)$ factors cancel out). The fluctuation is chosen in such a way that $q_n/q_0 = 1 + f/\sqrt{n}$, hence $\Lambda_n = n^{-1/2}\sum_{i=1}^{n} f(O_i) - \tfrac{1}{2} n^{-1}\sum_{i=1}^{n} f^2(O_i) + n^{-1}\sum_{i=1}^{n} f^2(O_i) R(f(O_i)/\sqrt{n})$, where the function $R$ is characterized by $\log(1+x) = x - \tfrac{1}{2}x^2 + x^2 R(x)$ (all $x > -1$). In particular, $R$ is increasing and $R(x) \to R(0) = 0$ when $x \to 0$. Since $f$ is bounded, the expansion is valid for $n$ large enough. Moreover, the last term satisfies
$$\left|\frac{1}{n}\sum_{i=1}^{n} f^2(O_i) R(f(O_i)/\sqrt{n})\right| \le \frac{1}{n}\sum_{i=1}^{n} f^2(O_i) \times R(\|f\|_\infty/\sqrt{n}),$$
so that if $n^{-1}\sum_{i=1}^{n} f^2(O_i) = O_P(1)$ then it is $o_P(1)$. Furthermore, the Kolmogorov law of large numbers for martingales implies that $n^{-1}\sum_{i=1}^{n} f^2(O_i) = n^{-1}\sum_{i=1}^{n} P_{Q_0,g^\star_i} f^2 + o_P(1) = P_{Q_0,\bar{g}^\star_n} f^2 + o_P(1)$, where $\bar{g}^\star_n = n^{-1}\sum_{i=1}^{n} g^\star_i$. Denote $\mathrm{Var}(f)(A,W) = E_{Q_0}[f^2(O)|A,W]$. We have that
$$P_{Q_0,\bar{g}^\star_n} f^2 \equiv E_{Q_0,\bar{g}^\star_n}[f^2(O_n)|O_n(n-1)] = E_{Q_0,\bar{g}^\star_n}[\mathrm{Var}(f)(A,W)(O_n)|O_n(n-1)] = E_{Q_0}[\bar{g}^\star_n(1|V)\mathrm{Var}(f)(1,W) + \bar{g}^\star_n(0|V)\mathrm{Var}(f)(0,W)\,|\,O_n(n-1)]$$
for $V$ the relevant subvector of $W$ upon which each $g^\star_i$ depends. By Lemma 1, $\bar{g}^\star_n$ converges in probability to $G^\star(\theta_0)$. Consequently, the continuous mapping theorem (see Theorem 1.3.6 in (van der Vaart and Wellner, 1996)) yields that $P_{Q_0,\bar{g}^\star_n} f^2 = P_{Q_0,G^\star(\theta_0)} f^2 + o_P(1) = O_P(1)$. In summary, we obtain the following asymptotic linear expansion:
$$\Lambda_n = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} f(O_i) - \frac{1}{2} P_{Q_0,G^\star(\theta_0)} f^2 + o_P(1). \tag{44}$$

Introduce now $Z'_i = (1\{i \le n_1\}, \ldots, 1\{i \le n_K\})$ and the bounded (and measurable) function $F$ such that $F(O_i,Z_i,Z'_i) = (Z'_i \mathrm{IC}(O_i,Z_i), f(O_i))$. We show that $M_n = n^{-1}\sum_{i=1}^{n} [F(O_i,Z_i,Z'_i) - P_{Q_0,g^\star_i} F] = n^{-1}\sum_{i=1}^{n} F(O_i,Z_i,Z'_i)$ (since $P_{Q_0,g^\star_i} F = P_{Q_0,g^\star_i} f = 0$ for all $i \le n$) satisfies a central limit theorem. The proof is very similar to the corresponding part of the proof of Lemma 5 in (Chambaz and van der Laan, 2011), hence we only give an outline, focusing on the differences. First, it holds for every $1 \le k, l \le K$ that $n^{-1}\sum_{i=1}^{n_k} P_{Q_0,g^\star_i} \mathrm{IC} f = p_k P_{Q_0,G^\star(\theta_0)} \mathrm{IC} f + o_P(1)$, $n^{-1}\sum_{i=1}^{n_k \wedge n_l} P_{Q_0,g^\star_i} \mathrm{IC}^2 = p_{k \wedge l} P_{Q_0,G^\star(\theta_0)} \mathrm{IC}^2 + o_P(1)$, and $n^{-1}\sum_{i=1}^{n} P_{Q_0,g^\star_i} f^2 = P_{Q_0,G^\star(\theta_0)} f^2 + o_P(1)$. This entails that the matrix $E\, n^{-1}\sum_{i=1}^{n} P_{Q_0,g^\star_i} F^\top F$ converges to
$$\begin{pmatrix} P_{Q_0,G^\star(\theta_0)} \mathrm{IC}^2 \times (p_{k \wedge l})_{k,l \le K} & P_{Q_0,G^\star(\theta_0)} \mathrm{IC} f \times (p_1, \ldots, p_K)^\top \\ P_{Q_0,G^\star(\theta_0)} \mathrm{IC} f \times (p_1, \ldots, p_K) & P_{Q_0,G^\star(\theta_0)} f^2 \end{pmatrix},$$
which is positive definite if and only if its determinant is positive.
The latter <strong>de</strong>terminant equals apositive constant times [P Q0,G ⋆ (θ0)f 2 − (P Q0,G ⋆ (θ0)ICf) 2 /P Q0,G ⋆ (θ0)IC 2 ], which is positive too bythe Cauchy-Schwarz inequality (f is not proporional to D ⋆ (P Q0,G ⋆ (θ0))). Therefore, Theorem 8 in(Chambaz and van <strong>de</strong>r Laan, 2011) applies and √ nM n converges in law to the centered Gaussiandistribution with the covariance matrix given in the above display. The conclusion readily follows.ReferencesA. C. Atkinson and A. Biswas. Adaptive biased-coin <strong>de</strong>signs for skewing the allocation proportionin clinical trials with normal responses. Stat. Med., 24(16):2477–2492, 2005.hal-00582753, version 1 - 4 Apr 2011C. Jennison and B. W. Turnbull. Group Sequential Methods with Applications to Clinical Trials.Chapman & Hall/CRC, Boca Raton, FL, 2000.M. A. Proschan, G. K. K. Lan, and J. T. Wittes. Statistical Monitoring of Clinical Trials: AUnified Approach. Statistics for biology and health. Springer, New-York, 2006.W. F Rosenberger, A. N. Vidyashankar, and D. K. Agarwal. Covariate-adjusted response-adaptive<strong>de</strong>signs for binary response. J. Biopharm. Statist., 11(227-236), 2001.W.F. Rosenberger. New directions in adaptive <strong>de</strong>signs. Statist. Sci., 11:137–149, 1996.P. K. Sen and J. M. Singer. Large sample methods in statistics. Chapman & Hall, New York,1993. An introduction with applications.J. Shao, X. Yu, and B. Bob Zhong. A theory for testing hypotheses un<strong>de</strong>r covariate-adaptiverandomization. Biometrika, 2010.M.J. van <strong>de</strong>r Laan. The construction and analysis of adaptive group sequential <strong>de</strong>signs. Technicalreport 232, Division of Biostatistics, University of California, Berkeley, March 2008.M.J. van <strong>de</strong>r Laan and D. Rubin. Targeted maximum likelihood learning. Int. J. Biostat., 2(1),2006.A. W. van <strong>de</strong>r Vaart. Asymptotic Statistics. Cambridge University Press, 1998.A. W. 
van <strong>de</strong>r Vaart and J. A. Wellner. Weak Convergence and Emprical Processes. Springer-Verlag New York, 1996.Li-Xin Zhang, Feifang Hu, Siu Hung Cheung, and Wai Sum Chan. Asymptotic properties ofcovariate-adjusted response-adaptive <strong>de</strong>signs. Ann. Statist., 35(3):1166–1182, 2007.H. Zhu and F. Hu. Sequential monitoring of response-adaptive randomized clinical trials. Ann.Statist., 38(4):2218–2241, 2010.U. Bandyopadhyay and A. Biswas. Adaptive <strong>de</strong>signs for normal responses with prognostic factors.Biometrika, 88(2):409–419, 2001.A. Chambaz and M. J. van <strong>de</strong>r Laan. Targeting the optimal <strong>de</strong>sign in randomized clinical trialswith binary outcomes and no covariate. Technical report, Division of Biostatistics, Universityof California, Berkeley, 2009.A. Chambaz and M. J. van <strong>de</strong>r Laan. Targeting the optimal <strong>de</strong>sign in randomized clinical trialswith binary outcomes and no covariate: Theoretical study. Int. J. Biostat., 7(1), 2011.S. S. Emerson. Issues in the use of adaptive clinical trial <strong>de</strong>signs. Stat. Med., 25:3270–3296, 2006.The Food & Drug Administration. Critical path opportunities list. Technical report, U.S. Departmentof Health and Human Services, Food and Drug Administration, 2006.H. L. Golub. The need for mor efficient trial <strong>de</strong>signs. Stat. Med., 25:3231–3235, 2006.F. Hu and W.F. Rosenberger. The theory of response adaptive randomization in clinical trials.New York Wiley, 2006.3132


hal-00629899, version 1 - 6 Oct 2011

Estimation of a non-parametric variable importance measure of a continuous exposure

Antoine Chambaz 1, Pierre Neuvial 2, Mark J. van der Laan 3

1 MAP5, Université Paris Descartes and CNRS
2 Laboratoire Statistique et Génome, Université d'Évry Val d'Essonne, UMR CNRS 8071 – USC INRA
3 Division of Biostatistics, UC Berkeley

September 30, 2011

Abstract

We define a new measure of variable importance of an exposure on a continuous outcome, accounting for potential confounders. The exposure features a reference level $x_0$ with positive mass and a continuum of other levels. For the purpose of estimating it, we fully develop the semi-parametric estimation methodology called targeted minimum loss estimation (TMLE) [23, 22]. We cover the whole spectrum of its theoretical study (convergence of the iterative procedure which is at the core of the TMLE methodology; consistency and asymptotic normality of the estimator), practical implementation, simulation study and application to a genomic example that originally motivated this article. In the latter, the exposure $X$ and response $Y$ are, respectively, the DNA copy number and expression level of a given gene in a cancer cell. Here, the reference level is $x_0 = 2$, that is the expected DNA copy number in a normal cell. The confounder is a measure of the methylation of the gene. The fact that there is no clear biological indication that $X$ and $Y$ can be interpreted as an exposure and a response, respectively, is not problematic.

1 Introduction

Consider the following statistical problem: One observes the data structure $O = (W,X,Y)$ on an experimental unit of interest, where $W \in \mathcal{W}$ stands for a vector of baseline covariates, and $X \in \mathbb{R}$ and $Y \in \mathbb{R}$ respectively quantify an exposure and a response; the exposure features a reference level $x_0$ with positive mass (there is a positive probability that $X = x_0$) and a continuum of other levels (a first source of difficulty); one wishes to investigate the relationship between $X$ and $Y$, accounting for $W$ (a second source of difficulty) and making few assumptions on the true data-generating distribution (a third source of difficulty). Taking $W$ into account is desirable when one knows (or cannot rule out the possibility) that it contains confounding factors, i.e., common factors upon which the exposure $X$ and the response $Y$ may simultaneously depend.

We illustrate our presentation with an example where the experimental unit is a set of cancer cells, the relevant baseline covariate $W$ is a measure of DNA methylation, and the exposure $X$ and response $Y$ are, respectively, the DNA copy number and expression level of a given gene. Here, the reference level is $x_0 = 2$, that is the expected copy number in a normal cell. The fact that there is no clear biological indication that $X$ and $Y$ can be interpreted as an exposure and a response, respectively, is not problematic. Associations between DNA copy numbers and expression levels in genes have already been considered in the literature (see e.g., [11, 26, 1, 17, 10]). In contrast to these earlier contributions, we do exploit the fact that $X$ features both a reference level and a continuum of other levels, instead of discretizing it or considering it as a purely continuous exposure.

We focus on the case that there is very little prior knowledge on the true data-generating distribution $P_0$ of $O$, although we know/assume that (i) $O$ takes its values in the bounded set $\mathcal{O}$ (we will denote $\|O\| = \max\{|W|,|X|,|Y|\}$), (ii) $P_0(X \ne x_0) > 0$, and finally (iii) $P_0(X \ne x_0|W) > 0$ $P_0$-almost surely. Accordingly, we see $P_0$ as a specific element of the non-parametric set $\mathcal{M}$ of all possible data-generating distributions of $O$ satisfying the latter constraints. We define the parameter of interest as $\Psi(P_0)$, for the non-parametric variable importance measure $\Psi : \mathcal{M} \to \mathbb{R}$ characterized by
$$\Psi(P) = \arg\min_{\beta \in \mathbb{R}} E_P\left\{(E_P(Y|X,W) - E_P(Y|X=x_0,W) - \beta(X-x_0))^2\right\} \tag{1}$$
for all $P \in \mathcal{M}$. The methodology presented in this article straightforwardly extends to situations where one would prefer to replace the expression $\beta(X-x_0)$ in (1) by $\beta f(X)$ for any $f$ such that $f(x_0) = 0$ and $E_P\{f(X)^2\} > 0$ for all $P \in \mathcal{M}$. We emphasize that we do not assume a semi-parametric model (which would write here as $Y = \beta(X-x_0) + \eta(W) + U$ with unspecified $\eta$ and $U$ such that $E_P(U|X,W) = 0$), in contrast to [15, 14, 28, 21, 20]. This fact bears important implications. The parameter of interest, $\Psi(P_0)$, is universally defined (therefore justifying the expression "non-parametric variable importance measure of a continuous exposure" in the title), no matter what properties the unknown true data-generating distribution $P_0$ enjoys, or does not enjoy.

Parameter $\Psi$ quantifies the influence of $X$ on $Y$ on a linear scale, using the reference level $x_0$ as a pivot (note that this expression conveys the notion that the roles of $X$ and $Y$ are not symmetric).
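To fix ideas, such a data structure can be mimicked in a short simulation. All distributional choices below (the laws of $W$, of $X$ given $W$, and of $Y$ given $(X,W)$) are hypothetical and only meant to produce an exposure with a point mass at a reference level (taken here to be 0 for convenience) and a continuum of other levels:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating mechanism (all choices illustrative).
# W plays the role of the confounder (e.g., methylation), X the exposure
# with reference level x0 = 0, and Y the response.
W = rng.uniform(0.0, 1.0, n)
p_ref = 0.4 - 0.2 * W                  # P(X = x0 | W): mass at the reference level
at_ref = rng.uniform(size=n) < p_ref
X = np.where(at_ref, 0.0, rng.normal(2.0 * W, 1.0))  # continuum of other levels
Y = 1.5 * X + np.sin(3.0 * W) + rng.normal(0.0, 1.0, n)

# The exposure indeed mixes a point mass and a continuous part:
share_ref = float(np.mean(X == 0.0))   # close to E[P(X = 0 | W)] = 0.3
```

Both sources of difficulty are visible in this sketch: $X$ is neither discrete nor purely continuous, and $W$ drives both $X$ and $Y$.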
As its name suggests, $\Psi$ belongs to the family of variable importance measures (a family that includes the excess risk), which was introduced in [21]. However, its case is not covered by the latter article because $X$ is continuous (we will see how $\Psi$ naturally relates to an excess risk when $X$ takes only two distinct values). Our purpose here is to fully develop the semi-parametric estimation methodology called targeted minimum loss estimation (TMLE) [23, 22]. We cover the whole spectrum of its theoretical study, practical


implementation, simulation study, and application to the aforementioned genomic example. In Section 2, we study the fundamental properties of parameter $\Psi$. In Section 3 we provide an overview of the TMLE methodology tailored for the purpose of estimating $\Psi(P_0)$. In Section 4, we state and comment on important theoretical properties enjoyed by the TMLE (convergence of the iterative updating procedure at the core of its definition; its consistency and asymptotic normality). The specifics of the TMLE procedure are presented in Section 5. The properties considered in Section 4 are illustrated by a simulation study inspired by the problem of assessing the importance of DNA copy number variations on expression level in genes, accounting for their methylation (the real data application we are ultimately interested in), as described in Section 6. All proofs are postponed to the appendix.

We assume from now on, without loss of generality, that $x_0 = 0$. For any measure $\lambda$ and measurable function $f$, $\lambda f = \int f \, d\lambda$. We set $L^2_0(P) = \{s \in L^2(P) : Ps = 0\}$. Moreover, the following notation are used throughout the article: for all $P \in \mathcal{M}$, $\theta(P)(X,W) = E_P(Y|X,W)$, $\mu(P)(W) = E_P(X|W)$, $g(P)(0|W) = P(X=0|W)$, and $\sigma^2(P) = E_P\{X^2\}$. In particular, $\Psi(P)$ can also be written as
$$\Psi(P) = \arg\min_{\beta \in \mathbb{R}} E_P\left\{(\theta(P)(X,W) - \theta(P)(0,W) - \beta X)^2\right\}.$$

2 The non-parametric variable importance parameter

It is of paramount importance to study the parameter of interest in order to better estimate it. Parameter $\Psi$ actually enjoys the following properties [see Chapter 25 in 25, for definitions].

Proposition 1. For all $P \in \mathcal{M}$,
$$\Psi(P) = \frac{E_P\{X(\theta(P)(X,W) - \theta(P)(0,W))\}}{E_P\{X^2\}}. \tag{2}$$
Parameter $\Psi$ is pathwise differentiable at every $P \in \mathcal{M}$ with respect to the maximal tangent set $L^2_0(P)$. Its efficient influence curve at $P$ is $D^\star(P) = D^\star_1(P) + D^\star_2(P)$, where $D^\star_1(P) = D^\star_1(\sigma^2(P), \theta(P), \Psi(P))$ and $D^\star_2(P) = D^\star_2(\sigma^2(P), \theta(P), \mu(P), g(P))$ are two $L^2_0(P)$-orthogonal components characterized by
$$D^\star_1(\sigma^2, \theta, \psi)(O) = \frac{1}{\sigma^2}\, X\left(\theta(X,W) - \theta(0,W) - X\psi\right),$$
$$D^\star_2(\sigma^2, \theta, \mu, g)(O) = \frac{1}{\sigma^2}\, (Y - \theta(X,W))\left(X - \frac{\mu(W)\, 1\{X = 0\}}{g(0|W)}\right).$$
Furthermore, the efficient influence curve is double-robust: for any $(P,P') \in \mathcal{M}^2$, if either $\theta(P')(0,\cdot) = \theta(P)(0,\cdot)$ or ($\mu(P') = \mu(P)$ and $g(P') = g(P)$) holds, then $PD^\star(P') = 0$ implies $\Psi(P') = \Psi(P)$.

The proof of Proposition 1 is relegated to Section A.2.

Let us emphasize again that we do not assume a semi-parametric model $Y = \beta X + \eta(W) + U$ (with unspecified $\eta$ and $U$ such that $E_P(U|X,W) = 0$). Setting $R(P,\beta)(X,W) = \theta(P)(X,W) - \theta(P)(0,W) - \beta X$ for all $(P,\beta) \in \mathcal{M} \times \mathbb{R}$, the latter semi-parametric model holds for $P \in \mathcal{M}$ if there exists a unique $\beta(P) \in \mathbb{R}$ such that $R(P,\beta(P)) = 0$. Note that $\beta$ is always solution to the equation $\beta E_P\{X^2\} = E_P\{X(\theta(P)(X,W) - \theta(P)(0,W) - R(P,\beta)(X,W))\}$. In particular, if the semi-parametric model holds for a certain $P \in \mathcal{M}$, then $\beta(P) = \Psi(P)$ by (2). On the contrary, if the semi-parametric model does not hold for $P$, then it is not clear what $\beta(P)$ could even mean, whereas $\Psi(P)$ is still a well-defined parameter worth estimating. We discuss in Section 4.2 what happens if one estimates $\beta(P)$ when assuming wrongly that the semi-parametric model holds (the discussion allows to identify the awkward non-parametric extension of parameter $\beta(P)$ that one therefore estimates).

Equality (2) also teaches us that
$$\Psi(P) = F(P) - \frac{E_P\{\mu(P)(W)\,\theta(P)(0,W)\}}{\sigma^2(P)} \tag{3}$$
for the functional $F : \mathcal{M} \to \mathbb{R}$ characterized by
$$F(P) = \arg\min_{\beta \in \mathbb{R}} E_P\{(Y - \beta X)^2\} \equiv \frac{E_P\{XY\}}{\sigma^2(P)} \tag{4}$$
(all $P \in \mathcal{M}$). In that view, the second term in the right-hand side of (3) is a correction term added to $F(P)$ in order to take $W$ into account for the purpose of quantifying the influence of $X$ on $Y$ on a linear scale. Whereas the roles of $X$ and $Y$ are symmetric in the numerator of $F(P)$, they are obviously not in that of the correction term. Less importantly, (2) also makes clear that there is a connection between $\Psi$ and an excess risk. Indeed, consider $P \in \mathcal{M}$ such that $P(X \in \{0, x_1\}) = 1$ for $x_1 \ne 0$. Then $\Psi(P)$ satisfies
$$\Psi(P) = \frac{E_P\{(\theta(P)(x_1,W) - \theta(P)(0,W))\, h(P)(W)\}}{\sigma^2(P)}$$
for $h(P)(W) = P(X = x_1|W)$, i.e., $\Psi(P)$ appears as a weighted excess risk (the classical excess risk would be here $E_P\{\theta(P)(x_1,W) - \theta(P)(0,W)\}$).

Since $\Psi$ is pathwise differentiable, the theory of semi-parametric estimation applies, providing a notion of asymptotically efficient estimation. Remarkably, the asymptotic variance of a regular estimator of $\Psi(P_0)$ is lower-bounded by the variance $\mathrm{Var}_{P_0} D^\star(P_0)(O)$ under $P_0$ of the efficient influence curve at $P_0$ (a consequence of the convolution theorem). The TMLE procedure takes advantage of the properties of $\Psi$ described in Proposition 1 in order to build a consistent and possibly asymptotically efficient substitution estimator of $\Psi(P_0)$. In view of (3), this is a challenging statistical problem because, whereas estimating $F(P_0)$ is straightforward (the ratio of the empirical means of $XY$ and $X^2$ is an efficient estimator of $F(P_0)$), estimating the correction term in (3) is more delicate, notably because this necessarily involves estimating the infinite-dimensional features $\theta(P_0)(0,\cdot)$ and $\mu(P_0)$.
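As a sanity check, formula (2) can be verified numerically on a hypothetical data-generating distribution for which $\theta(P)(x,w) = E_P(Y|X=x, W=w)$ is available in closed form. Here we pick $\theta(x,w) = \beta x + \sin(w)$, which is linear in $x$, so that (2) gives $\Psi(P) = \beta$ exactly; this is a minimal sketch of the identity, not the estimation procedure studied in the article (which does not assume $\theta$ known):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta = 1.5  # with theta linear in x, formula (2) yields Psi(P) = beta

# Hypothetical distribution: W ~ U(0,1); X = 0 with probability 0.3
# (the reference level), else X | W ~ N(W, 1).
W = rng.uniform(0.0, 1.0, n)
at_ref = rng.uniform(size=n) < 0.3
X = np.where(at_ref, 0.0, rng.normal(W, 1.0))

def theta(x, w):
    # Known conditional mean E[Y | X = x, W = w] for this toy distribution.
    return beta * x + np.sin(w)

# Monte-Carlo version of (2): E[X (theta(X,W) - theta(0,W))] / E[X^2].
psi = np.mean(X * (theta(X, W) - theta(0.0, W))) / np.mean(X ** 2)
```

Since $\theta(X,W) - \theta(0,W) = \beta X$ here, the ratio reduces to $\beta$ up to floating-point rounding; replacing `theta` by a function that is non-linear in $x$ makes $\Psi(P)$ the best linear summary pivoted at the reference level rather than any single slope.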


hal-00629899, version 1 - 6 Oct 20113 Overview of the TMLE procedure tailored to the estimationof the non-parametric variable importance measureWe assume now that we observe n in<strong>de</strong>pen<strong>de</strong>nt copies O (1) = (W (1) , X (1) , Y (1) ), ...,O (n) =(W (n) , X (n) , Y (n) ) of the observed data structure O ∼ P 0 ∈ M. The empirical measure is<strong>de</strong>noted by P n . The TMLE procedure iteratively updates an initial substitution estimatorψ 0 n = Ψ(P 0 n) of Ψ(P 0 ) (based on an initial estimator P 0 n of the data-generating distributionP 0 ), building a sequence {ψ k n = Ψ(P k n)} k≥0 (with P k n the kth update of P 0 n) which convergesto the targeted minimum loss estimator (TMLE) ψ ∗ n as k increases. This iterative scheme isvisually illustrated in Figure 1, and we invite the rea<strong>de</strong>r to consult its caption now.We <strong>de</strong>termine what initializing the TMLE procedure boils down to in Section 3.1. Ageneral one-step targeted updating procedure is <strong>de</strong>scribed in Section 3.2. How to conductspecifically these initialization and update (as well as two alternative tailored two-step updatingprocedures) is addressed in Section 5.3.1 Initial estimatorIn this subsection, we <strong>de</strong>scribe what it takes to construct an initial substitution estimator ofΨ(P 0 ). Of course, how one <strong>de</strong>rives the substitution estimator Ψ(P) from the <strong>de</strong>scription of(certain features of) P is relevant even if P is not literally an initial estimator of P 0 .By (2), building an initial substitution estimator Ψ(P 0 n) of Ψ(P 0 ) requires the estimationof θ(P 0 ), of σ 2 (P 0 ), and of the marginal distribution of (W, X) un<strong>de</strong>r P 0 . 
Given P 0 n, initialestimator of P 0 with known θ(P 0 n), σ 2 (P 0 n) > 0 and marginal distribution of (W, X) un<strong>de</strong>rP 0 n, Ψ(P 0 n) can in<strong>de</strong>ed be obtained (or, more precisely, evaluated accurately) by the lawof large numbers, as discussed below. We emphasize that such an initial estimator mayvery well be biased. In other words, one would need strong assumptions on the true datageneratingdistribution P 0 (which we are not willing to make; typically, assuming that P 0belongs to a given regular parametric mo<strong>de</strong>l) and adapting the construction of P 0 n based onthose assumptions (typically, relying on maximum likelihood estimation) in or<strong>de</strong>r to obtainthe consistency of Ψ(P 0 n).For B a large integer (say B = 10 5 ), evaluating accurately (rather than computing exactly)the initial substitution estimator Ψ(P 0 n) of Ψ(P 0 ) boils down to simulating B in<strong>de</strong>pen<strong>de</strong>ntcopies ( ˜W (b) , ˜X (b) ) of (W, X) un<strong>de</strong>r P 0 n, then using the approximation∑ψn 0 = Ψ(Pn) 0 = B−1 B ˜X b=1 (b) (θ(Pn)( 0 ˜X (b) , ˜W (b) ) − θ(Pn)(0, 0 ˜W (b) ))σ 2 (Pn)0 + O(B −1/2 ). (5)Knowing the marginal distribution of (W, X) un<strong>de</strong>r P 0 n amounts to knowing (i) themarginal distribution of W un<strong>de</strong>r P 0 n, (ii) the conditional distribution of Z ≡ 1{X = 0}given W un<strong>de</strong>r P 0 n, and (iii) the conditional distribution of X given (W, X ≠ 0) un<strong>de</strong>rP 0 n. Firstly, we advocate for estimating initially the marginal distribution of W un<strong>de</strong>r P 0hal-00629899, version 1 - 6 Oct 2011by its empirical version, or put in terms of likelihood, to build P 0 n in such a way thatP 0 n(W) = n −1 ∑ ni=1 1{W (i) = W }. 
Secondly, the conditional distribution of Z given Wun<strong>de</strong>r P 0 n is the Bernoulli law with parameter 1 − g(P 0 n)(0|W), so it is necessary that g(P 0 n)be known too (and such that, P 0 n-almost surely, g(P 0 n)(0|W) ∈ (0,1)). Thirdly, the conditionaldistribution of X given (W, X ≠ 0) un<strong>de</strong>r P 0 n can be any (finite variance) distribution,whose conditional mean can be <strong>de</strong>duced from µ(P 0 n):E P 0 n(X|X ≠ 0, W) =µ(P 0 n)(W)1 − g(P 0 n)(0|W) , (6)and whose conditional second or<strong>de</strong>r moment E P 0 n(X 2 |X ≠ 0, W) satisfiesE P 0 n{(1 − g(P0n )(0|W))E P 0 n(X 2 |X ≠ 0, W) } = σ 2 (P 0 n). (7)In particular, it is also necessary that µ(P 0 n) be known too.In summary, the only features of P 0 n we really care for in or<strong>de</strong>r to evaluate accurately(rather than compute exactly) ψ 0 n = Ψ(P 0 n) are θ(P 0 n), µ(P 0 n), g(P 0 n), σ 2 (P 0 n), and the marginaldistribution of W un<strong>de</strong>r P 0 n, which respectively estimate θ(P 0 ), µ(P 0 ), g(P 0 ), σ 2 (P 0 ), and themarginal distribution of W un<strong>de</strong>r P 0 . We could for instance rely on a working mo<strong>de</strong>l wherethe conditional distribution of X given (W, X ≠ 0) is chosen as the Gaussian distributionwith conditional mean as in (6) and any conditional second or<strong>de</strong>r moment (which is nothingbut a measurable function of W) such that (7) holds. Let us emphasize that we do usehere expressions from the semantical field of choice, and not from that of assumption; aworking mo<strong>de</strong>l is just a tool we use in the construction of the initial estimator, and we donot necessarily assume that it is well-specified. 
Although such a Gaussian working model would be a perfectly correct choice, we advocate using another one for computational convenience, as presented in Section 5.1.

3.2 A general one-step updating procedure of the initial estimator

The next step consists in iteratively updating $\psi_n^0 = \Psi(P_n^0)$. Assuming that one has already built $(k-1)$ updates $P_n^1, \ldots, P_n^{k-1}$ of $P_n^0$, resulting in $(k-1)$ updated substitution estimators $\psi_n^1 = \Psi(P_n^1), \ldots, \psi_n^{k-1} = \Psi(P_n^{k-1})$, it is formally sufficient to describe how the $k$th update $P_n^k$ is derived from its predecessor $P_n^{k-1}$ in order to fully determine the iterative procedure. Note that the values of $\psi_n^1 = \Psi(P_n^1), \ldots, \psi_n^{k-1} = \Psi(P_n^{k-1})$ are derived, as $\psi_n^0 = \Psi(P_n^0)$ is, by following (5) in Section 3.1 with $P_n^1, \ldots, P_n^{k-1}$ substituted for $P_n^0$.

We present here a general one-step updating procedure (two alternative tailored two-step updating procedures are also presented in Section 5.2). We invite again the reader to refer to Figure 1 for its visual illustration.

Set $\rho \in (0,1)$ a constant close to 1 and consider the path $\{P_n^{k-1}(\varepsilon) : |\varepsilon| \leq \rho \|D^\star(P_n^{k-1})\|_\infty^{-1}\}$ characterized by
$$\frac{dP_n^{k-1}(\varepsilon)}{dP_n^{k-1}}(O) = 1 + \varepsilon D^\star(P_n^{k-1})(O), \qquad (8)$$


where $D^\star(P_n^{k-1})$ is the current estimator of the efficient influence curve at $P_0$, obtained as the efficient influence curve at $P_n^{k-1}$. The path is a one-dimensional parametric model that fluctuates $P_n^{k-1}$ (i.e., $P_n^{k-1}(0) = P_n^{k-1}$) in the direction of $D^\star(P_n^{k-1})$ (i.e., the score of the path at $\varepsilon = 0$ equals $D^\star(P_n^{k-1})$). Here, we choose minus the log-likelihood function as loss function (i.e., we choose $L : \mathcal{M} \times \mathcal{O} \to \mathbb{R}$ characterized by $L(P)(O) = -\log P(O)$). Consequently, the optimal update of $P_n^{k-1}$ is indexed by the maximum likelihood estimator (MLE)
$$\varepsilon_n^{k-1} = \mathop{\arg\max}_{|\varepsilon| \leq \rho \|D^\star(P_n^{k-1})\|_\infty^{-1}} \sum_{i=1}^{n} \log P_n^{k-1}(\varepsilon)(O^{(i)}) = \mathop{\arg\max}_{|\varepsilon| \leq \rho \|D^\star(P_n^{k-1})\|_\infty^{-1}} \sum_{i=1}^{n} \log\big(1 + \varepsilon D^\star(P_n^{k-1})(O^{(i)})\big).$$

Figure 1 (diagram not reproduced here): Illustration of the TMLE procedure (with its general one-step updating procedure). We purposely represent the initial estimator $P_n^0$ closer to $P_0$ than its $k$th and $(k+1)$th updates $P_n^k$ and $P_n^{k+1}$, heuristically because $P_n^0$ is as close to $P_0$ as one can possibly get (given $P_n$ and the specifics of the super-learning procedure) when targeting $P_0$ itself. However, this obviously does not necessarily imply that $\Psi(P_n^0)$ performs well when targeting $\Psi(P_0)$ (instead of $P_0$), which is why we also purposely represent $\Psi(P_n^{k+1})$ closer to $\Psi(P_0)$ than $\Psi(P_n^0)$. Indeed, $P_n^{k+1}$ is obtained by fluctuating its predecessor $P_n^k$ "in the direction of $\Psi$", i.e., taking into account the fact that we are ultimately interested in estimating $\Psi(P_0)$. More specifically, the fluctuation $\{P_n^k(\varepsilon) : |\varepsilon| < \eta_n^k\}$ of $P_n^k$ is a one-dimensional parametric model (hence its curvy shape in the large model $\mathcal{M}$) such that (i) $P_n^k(0) = P_n^k$, and (ii) its score at $\varepsilon = 0$ equals the efficient influence curve $D^\star(P_n^k)$ at $P_n^k$ (hence the dotted arrow). An optimal stretch $\varepsilon_n^k$ is determined (e.g., by maximizing the likelihood on the fluctuation), yielding the update $P_n^{k+1} = P_n^k(\varepsilon_n^k)$.

The MLE $\varepsilon_n^{k-1}$ is uniquely defined (and possibly equal to $\pm \rho \|D^\star(P_n^{k-1})\|_\infty^{-1}$, hence the introduction of the constant $\rho$ in the definition of the path) provided for instance that
$$\max_{i \leq n} |D^\star(P_n^{k-1})(O^{(i)})| > 0$$
(this statement is to be understood conditionally on $P_n$, i.e., it is a statement about the sample). Under mild assumptions on $P_0$, $\varepsilon_n^{k-1}$ targets $\varepsilon_0^{k-1}$ such that $P_n^{k-1}(\varepsilon_0^{k-1})$ is the Kullback-Leibler projection of $P_0$ onto the path $\{P_n^{k-1}(\varepsilon) : |\varepsilon| \leq \rho \|D^\star(P_n^{k-1})\|_\infty^{-1}\}$. We now set $P_n^k = P_n^{k-1}(\varepsilon_n^{k-1})$, thus concluding the description of the iterative updating step of the TMLE procedure. Finally, the TMLE $\psi_n^\ast$ is defined as $\psi_n^\ast = \lim_{k \to \infty} \psi_n^k$, assuming that the limit exists, or more generally as $\psi_n^{k_n}$ for a conveniently chosen sequence $\{k_n\}_{n \geq 0}$ (see Sections 4.1 and 4.2 regarding this issue).

This is a very general way of dealing with the updating step of the TMLE methodology. The key is that it is possible to determine how the fundamental features of $P_n^k(\varepsilon)$ (i.e., the components of $P_n^k(\varepsilon)$ involved in the definition of $D^\star(P_n^k(\varepsilon))$ and in the definition of $\Psi$) behave (exactly) as functions of $\varepsilon$ relative to their counterparts at $\varepsilon = 0$ (i.e., with respect to (wrt) $P_n^k$), as shown in the next lemma (its proof is relegated to Section A.2).

Lemma 1. Set $s \in L_0^2(P)$ with $\|s\|_\infty < \infty$ and consider the path $\{P_\varepsilon : |\varepsilon| < \|s\|_\infty^{-1}\} \subset \mathcal{M}$ characterized by
$$\frac{dP_\varepsilon}{dP}(O) = 1 + \varepsilon s(O). \qquad (9)$$
The path has score function $s$.
For all $|\varepsilon| < \|s\|_\infty^{-1}$ and all measurable functions $f$ of $W$,
$$\theta(P_\varepsilon)(X, W) = \frac{\theta(P)(X, W) + \varepsilon E_P(Y s(O) \mid X, W)}{1 + \varepsilon E_P(s(O) \mid X, W)}, \qquad (10)$$
$$\mu(P_\varepsilon)(W) = \frac{\mu(P)(W) + \varepsilon E_P(X s(O) \mid W)}{1 + \varepsilon E_P(s(O) \mid W)}, \qquad (11)$$
$$g(P_\varepsilon)(0|W) = \frac{g(P)(0|W) + \varepsilon E_P(\mathbf{1}\{X = 0\} s(O) \mid W)}{1 + \varepsilon E_P(s(O) \mid W)}, \qquad (12)$$


$$\sigma^2(P_\varepsilon) = \sigma^2(P) + \varepsilon E_P\{X^2 s(O)\}, \qquad (13)$$
$$E_{P_\varepsilon}\{f(W)\} = E_P\{f(W)(1 + \varepsilon E_P(s(O) \mid W))\}. \qquad (14)$$

Regarding the computation of $\Psi(P_n^k)$, it is also required to know how to sample independent copies of $(W, X)$ under $P_n^k(\varepsilon)$, see Section 3.1. Finally, we emphasize that by (14), the marginal distribution of $W$ under $P_n^k$ typically deviates from its counterpart under $P_n^0$ (i.e., from its empirical counterpart).

TMLE and one-step estimation methodologies.

By being based on an iterative scheme, the TMLE methodology naturally evokes the one-step estimation methodology introduced by Le Cam [8] (see [25, Sections 5.7 and 25.8] for a recent account). The latter estimation methodology draws its inspiration from the Newton-Raphson method in numerical analysis, and basically consists in updating an initial estimator by relying on a linear approximation to the original estimating equation. Yet, some differences between the TMLE and one-step estimation methodologies are particularly striking. Most importantly, because the TMLE methodology only involves substitution estimators, how one updates (in the parameter space $\mathbb{R}$) the initial estimator $\psi_n^0 = \Psi(P_n^0)$ of $\Psi(P_0)$ into $\psi_n^1 = \Psi(P_n^1)$ is the consequence of how one updates (in model $\mathcal{M}$) the initial estimator $P_n^0$ of $P_0$ into $P_n^1$. In contrast, the one-step estimator is naturally presented as an update (in the parameter space $\mathbb{R}$) of the initial estimator, for the sake of solving a linear approximation (in $\Psi(P)$) to the estimating equation $P_n D^\star(P) = 0$. The TMLE methodology does not involve such a linear approximation; it nevertheless guarantees by construction $P_n D^\star(P_n^k) \approx 0$ for large $k$ (see Section 4.1 on that issue). Furthermore, on a more technical note, the asymptotic study of the TMLE $\psi_n^\ast$ does not require that the initial estimator $\psi_n^0 = \Psi(P_n^0)$ be $\sqrt{n}$-consistent (i.e., that $\sqrt{n}(\psi_n^0 - \Psi(P_0))$ be uniformly tight), whereas that of the one-step estimator typically does. However, there certainly exist interesting relationships between the TMLE and one-step estimation methodologies too. Such relationships are not obvious, and we will investigate them in future work.

4 Convergence and asymptotics

In this section, we state and comment on important theoretical properties enjoyed by the TMLE. In Section 4.1, we study the convergence of the iterative updating procedure which is at the core of the TMLE procedure. In Section 4.2, we derive the consistency and asymptotic normality of the TMLE. Building on the statement of consistency, we also argue why it is more interesting to estimate our non-parametric variable importance measure $\Psi(P_0)$ than its semi-parametric counterpart.

4.1 On the convergence of the updating procedure

Studying the convergence of the updating procedure has several aspects to it. We focus on the general one-step procedure of Section 3.2. All proofs are relegated to Section A.4.

On one hand, the following result (very similar to Result 1 in [23]) trivially holds:

Lemma 2. Assume (i) that all the paths we consider are included in $\mathcal{M}' \subset \mathcal{M}$ such that $\sup_{P \in \mathcal{M}'} \|D^\star(P)\|_\infty = M < \infty$, and (ii) that their fluctuation parameters $\varepsilon$ are restricted to $[-\rho, \rho]$ for $\rho = (2M)^{-1}$. If $\lim_{k \to \infty} \varepsilon_n^k = 0$ then $\lim_{k \to \infty} P_n D^\star(P_n^k) = 0$.

Condition (i) is weak, and we refer to Lemma 4 for a set of conditions which guarantee that it holds. Lemma 2 is of primary importance. It teaches us that if the TMLE procedure "converges" (in the sense that $\lim_{k \to \infty} \varepsilon_n^k = 0$) then its "limit" is a solution of the efficient influence curve equation (in the sense that for any arbitrarily small deviation from 0, it is possible to guarantee $P_n D^\star(P_n^k) \approx 0$ by choosing $k$ large enough). This is the key to the proofs of consistency and asymptotic linearity, see Section 4.2. Actually, the condition $\lim_{k \to \infty} \varepsilon_n^k = 0$ can be replaced by a more explicit condition on the class of the considered data-generating distributions, as shown in the next lemma.

Lemma 3. Under the assumptions of Lemma 2, let us suppose additionally that the sample satisfies (iii) $\inf_{k \geq 0} P_n D^\star(P_n^k)^2 > 0$, and (iv) that the log-likelihood of the data is uniformly bounded on $\mathcal{M}'$: $\sup_{P \in \mathcal{M}'} \sum_{i=1}^{n} \log P(O^{(i)}) < \infty$. Then it holds that $\lim_{k \to \infty} \varepsilon_n^k = 0$ and $\lim_{k \to \infty} P_n D^\star(P_n^k) = 0$.

On the other hand, it is possible to obtain another result pertaining to the "convergence" of the updating procedure directly put in terms of the convergence of the sequences $\{P_n^k\}_{k \geq 0}$ and $\{\psi_n^k\}_{k \geq 0}$, provided that $\{\varepsilon_n^k\}_{k \geq 0}$ goes to 0 quickly enough. Specifically:

Lemma 4. Suppose that $P_n^0(\|O\| \leq C) = 1$ for some finite $C > 0$. Then obviously $P_n^k(\|O\| \leq C) = P_n^k(|\theta(P_n^k)(X, W)| \leq C) = P_n^k(|\mu(P_n^k)(W)| \leq C) = 1$ for all $k \geq 0$. Suppose moreover that for all $k \geq 0$, $g(P_n^k)(0|W) \geq c > 0$ and $\sigma^2(P_n^k) \geq c$ are bounded away from 0. Then condition (i) of Lemma 2 holds. Assume now that $\sum_{k \geq 0} |\varepsilon_n^k| < \infty$. Then the sequence $\{P_n^k\}_{k \geq 0}$ converges in total variation (hence in law) to a data-generating distribution $P_n^\ast$. Simultaneously, the sequence $\{\psi_n^k\}_{k \geq 0}$ converges to $\Psi(P_n^\ast)$.

It is necessary to bound $g(P_n^k)$ and $\sigma^2(P_n^k)$ away from 0 because conditions (i) and (ii) of Lemma 2 only imply that $g(P_n^k)(0|W) \geq g(P_n^0)(0|W)((1-\rho)/(1+\rho))^k$ and $\sigma^2(P_n^k) \geq \sigma^2(P_n^0)(1-\rho)^k$. Now, it makes perfect sense from a computational point of view to resort to lower-thresholding in order to ensure that $g(P_n^k)(0|W)$ and $\sigma^2(P_n^k)$ cannot be smaller than a fixed constant. Assuming that the series $\sum_{k \geq 0} |\varepsilon_n^k|$ converges ensures that $\{P_n^k\}_{k \geq 0}$ converges in total variation rather than weakly only. Interestingly, we do draw advantage from this stronger type of convergence in order to derive the second part of the lemma. In conclusion, note that Newton-Raphson-type algorithms converge at a $k^{-2}$-rate, which suggests that the condition $\sum_{k \geq 0} |\varepsilon_n^k| < \infty$ is not too demanding.
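Concretely, each stretch $\varepsilon_n^{k-1}$ of the one-step procedure of Section 3.2 results from a one-dimensional likelihood maximization along the path (8). A minimal numeric sketch, with made-up values of $D^\star(P_n^{k-1})(O^{(i)})$ and a crude grid search standing in for a proper line search:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up values D*(P_n^{k-1})(O^(i)) of the current efficient influence
# curve at the n observations, centered so that they average to zero.
d = rng.normal(size=500)
d -= d.mean()

rho = 0.9
bound = rho / np.max(np.abs(d))      # rho * ||D*(P_n^{k-1})||_inf^{-1}

def loglik(eps):
    # Empirical log-likelihood along the path: sum_i log(1 + eps * d_i);
    # |eps| <= bound guarantees every factor stays positive.
    return np.sum(np.log1p(eps * d))

# One-dimensional maximization over [-bound, bound] by a fine grid,
# a simple stand-in for a proper line search.
grid = np.linspace(-bound, bound, 10001)
eps_mle = grid[int(np.argmax([loglik(e) for e in grid]))]
```

The constraint $|\varepsilon| \leq \rho \|D^\star(P_n^{k-1})\|_\infty^{-1}$ is what keeps every likelihood factor $1 + \varepsilon D^\star(P_n^{k-1})(O^{(i)})$ bounded away from 0, as in Lemma 2.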


4.2 Consistency and asymptotic normality

Let us now investigate the statistical properties of the TMLE $\psi_n^\ast$. We actually consider a slightly modified version of the TMLE in order to circumvent the issue of the convergence of the sequence $\{\psi_n^k\}_{k \geq 0}$ as $k$ goes to infinity. The modified version is perfectly fine from a practical point of view. All proofs are relegated to Section A.5.

Consistency.

Under mild assumptions, the TMLE is consistent. Specifically:

Proposition 2 (consistency). We assume (i) that there exist finite $C > c > 0$ such that $\|\theta(P_n^{k_n})\|_\infty \leq C$, $g(P_n^{k_n})(0|W) \geq c$ and $\sigma^2(P_n^{k_n}) \geq c$ for all $n \geq 1$; (ii) that $\theta(P_n^{k_n})$, $\mu(P_n^{k_n})$, $g(P_n^{k_n})$ and $\sigma^2(P_n^{k_n})$ respectively converge to $\theta_0$ such that $\|\theta_0\|_\infty \leq C$, $\mu_0$, $g_0$ and $\sigma_0^2 \geq c$, in such a way that $P_0(\theta(P_n^{k_n}) - \theta_0)^2 = o_P(1)$, $P_0(\theta(P_n^{k_n})(0, \cdot) - \theta_0(0, \cdot))^2 = o_P(1)$, $P_0(\mu(P_n^{k_n}) - \mu_0)^2 = o_P(1)$, $P_0(g(P_n^{k_n})(0|\cdot) - g_0(0|\cdot))^2 = o_P(1)$ and $\sigma^2(P_n^{k_n}) = \sigma_0^2 + o_P(1)$; and (iii) that $D_1^\star(P_n^{k_n})$ and $D_2^\star(P_n^{k_n})$ belong to a $P_0$-Donsker class with $P_0$-probability tending to 1. In addition, we suppose that all assumptions of Lemma 3 are met, and that the (possibly random) integer $k_n \geq 0$ is chosen so that $P_n D^\star(P_n^{k_n}) = o_P(1/\sqrt{n})$.

Define $\tilde{\psi}_n^\ast = \psi_n^{k_n} = \Psi(P_n^{k_n})$. If the limits satisfy either $\theta_0(0, \cdot) = \theta(P_0)(0, \cdot)$ or ($\mu_0 = \mu(P_0)$ and $g_0 = g(P_0)$), then $\tilde{\psi}_n^\ast$ consistently estimates $\Psi(P_0)$.

It is remarkable that the consistency of the TMLE $\tilde{\psi}_n^\ast = \Psi(P_n^{k_n})$ is granted essentially whenever the estimators $\theta(P_n^{k_n})$, $\mu(P_n^{k_n})$, $g(P_n^{k_n})$, $\sigma^2(P_n^{k_n})$ converge and one only of the limits $\theta_0(0, \cdot)$ of $\theta(P_n^{k_n})(0, \cdot)$ and $(\mu_0, g_0)$ of $(\mu(P_n^{k_n}), g(P_n^{k_n}))$ coincides with the corresponding truth $\theta(P_0)(0, \cdot)$ or $(\mu(P_0), g(P_0))$.
This property is mostly inherited from the double-robustness of the efficient influence curve $D^\star$ of the parameter $\Psi$ (i.e., $P D^\star(P') = 0$ implies $\Psi(P') = \Psi(P)$) and from the fact that the TMLE solves the efficient influence curve equation (i.e., $P_n D^\star(P_n^{k_n}) \approx 0$).

Merit of the non-parametric variable importance measure over its semi-parametric counterpart.

Let us repeat that we do not assume a semi-parametric model $Y = \beta X + \eta(W) + U$ (with unspecified $\eta$ and $U$ such that $E_P(U \mid X, W) = 0$). However, if $P \in \mathcal{M}$ is such that $\theta(P)(X, W) = \beta(P) X + \theta(P)(0, W)$ (i.e., if the semi-parametric model holds under $P$) then $\Psi(P) = \beta(P)$. Let us denote by $\mathcal{M}_{SP} \subset \mathcal{M}$ the set of all such data-generating distributions. It is known (see for instance [28]) that $\beta : \mathcal{M}_{SP} \to \mathbb{R}$ is a pathwise differentiable parameter (wrt the corresponding maximal tangent space), and that its efficient influence curve at $P \in \mathcal{M}_{SP}$ is given by
$$D_{SP}^\star(P)(O) = \frac{Y - \beta(P) X - \theta(P)(0, W)}{v^2(P)(X, W)} \left( X - \frac{E_P\!\left(\frac{X}{v^2(P)(X, W)} \,\Big|\, W\right)}{E_P\!\left(\frac{1}{v^2(P)(X, W)} \,\Big|\, W\right)} \right),$$
where $v^2(P)(X, W) = E_P((Y - \theta(P)(X, W))^2 \mid X, W)$ denotes the conditional variance of $Y$ given $(X, W)$ under $P$. Note that the second factor in the right-hand-side expression reduces to $(X - \mu(P)(W))$ whenever $v^2(P)(X, W)$ only depends on $W$.

For the purpose of emphasizing the merit of the non-parametric variable importance measure over its semi-parametric counterpart, say that one estimates $\beta(P_0)$ assuming (temporarily) that $P_0 \in \mathcal{M}_{SP}$ (hence $\Psi(P_0) = \beta(P_0)$). Say that one builds $P_{n,SP}^\ast \in \mathcal{M}_{SP}$ such that (i) $v^2(P_{n,SP}^\ast)(X, W)$ does not depend on $(X, W)$, and (ii) $P_n D_{SP}^\star(P_{n,SP}^\ast) = 0$.
Let us assume that $\beta(P_{n,SP}^\ast)$, $v^2(P_{n,SP}^\ast)$, $\mu(P_{n,SP}^\ast)$ and $\theta(P_{n,SP}^\ast)$ respectively converge to $\beta_1$, $v_1^2 > 0$, $\mu_1$ and $\theta_1$ (such that $\theta_1(X, W) = \beta_1 X + \theta_1(0, W)$), and finally that one solves in the limit the efficient influence curve equation:
$$E_{P_0}\{(Y - \beta_1 X - \theta_1(0, W))(X - \mu_1(W))\} = 0 \qquad (15)$$
(this is typically derived from (ii) above; see the proof of Proposition 2 for a typical derivation). Then (by double-robustness of $D_{SP}^\star$), the estimator $\beta(P_{n,SP}^\ast)$ of $\beta(P_0)$ is consistent (i.e., $\beta_1 = \beta(P_0)$) if either $\theta_1 = \theta(P_0)$ (that is obvious) or $\mu_1 = \mu(P_0)$. For example, let us suppose that $\mu_1 = \mu(P_0)$. In particular, one can deduce from the equalities $E_{P_0}\{X(X - \mu(P_0)(W))\} = E_{P_0}\{(X - \mu(P_0)(W))^2\}$ and (15) that
$$\beta_1 = \frac{E_{P_0}\{(\theta(P_0)(X, W) - \theta_1(0, W))(X - \mu(P_0)(W))\}}{E_{P_0}\{(X - \mu(P_0)(W))^2\}}$$
(provided that $X$ does not coincide with $\mu(P_0)(W)$ under $P_0$). Equivalently, $\beta_1 = b(P_0)$ for the functional $b : \mathcal{M}' = \mathcal{M} \setminus \{P \in \mathcal{M} : X = \mu(P)(W)\} \to \mathbb{R}$ such that, for every $P \in \mathcal{M}'$,
$$b(P) = \mathop{\arg\min}_{\beta \in \mathbb{R}} E_P\big\{[\theta(P)(X, W) - \theta_1(0, W) - \beta(X - \mu(P)(W))]^2\big\}.$$
Note that one can interpret the parameter $b$ as a non-parametric extension of the semi-parametric parameter $\beta$ (non-parametric, because its definition does not involve a semi-parametric model anymore). Now, we want to emphasize that $b$ arguably defines a sensible target if $\theta_1(0, \cdot) = \theta(P_0)(0, \cdot)$ (in addition to $\mu_1 = \mu(P_0)$), but not otherwise!
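The closed form for $\beta_1$ above is exactly the no-intercept least-squares coefficient defining $b(P_0)$. A numeric sketch checking the two expressions against each other on a made-up data-generating distribution (the regression, $\mu(P_0)$, and the misspecified limit $\theta_1(0, \cdot)$ are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up ingredients, all hypothetical: draws of (X, W) under P_0 with
# mu(P_0)(W) = 0.5 * W, a regression theta(P_0) of semi-parametric form
# with beta(P_0) = 1.5, and a deliberately misspecified limit theta_1(0, .).
n = 100_000
W = rng.normal(size=n)
X = 0.5 * W + rng.normal(size=n)
mu = 0.5 * W
theta = 1.5 * X + W ** 2                 # theta(P_0)(X, W)
theta1_0 = np.cos(W)                     # theta_1(0, W), misspecified

resid = theta - theta1_0
centered = X - mu
# Closed form for beta_1 ...
beta_ratio = np.mean(resid * centered) / np.mean(centered ** 2)
# ... versus the least-squares definition of b(P_0) (no intercept).
beta_lsq = np.linalg.lstsq(centered[:, None], resid, rcond=None)[0][0]
```

In this toy setting the two agree (up to numerical precision) and recover $\beta(P_0) = 1.5$ despite the misspecified $\theta_1(0, \cdot)$, since $\mu_1 = \mu(P_0)$; this is the double-robustness at work.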
This illustrates the danger of relying on a semi-parametric model when it is not absolutely certain that it holds, thus underlining the merit of targeting the non-parametric variable importance measure rather than its semi-parametric counterpart.

Asymptotic normality.

In addition to being consistent under mild assumptions, the TMLE is also asymptotically linear, and thus satisfies a central limit theorem. Let us start with a partial result:

Proposition 3. Suppose that the assumptions of Proposition 2 are met. If $\sigma^2(P_n^{k_n}) = \sigma_0^2 + O_P(1/\sqrt{n})$ then it holds that


$$\tilde{\psi}_n^\ast - \Psi(P_0) = (P_n - P_0) D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0)) + P_0 D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0))(1 + o_P(1)) + o_P(1/\sqrt{n}). \qquad (16)$$

Expansion (16) sheds some light on the first-order properties of the TMLE $\tilde{\psi}_n^\ast$. It notably makes clear that the convergence of $\tilde{\psi}_n^\ast$ is affected by how fast the estimators $\theta(P_n^{k_n})$, $\mu(P_n^{k_n})$ and $g(P_n^{k_n})$ converge to their limits (see the second term). If the rates of convergence are collectively so slow that they only guarantee $P_0 D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0)) = O_P(1/n^r)$ for some $r \in [0, 1/2)$, then expansion (16) becomes
$$\tilde{\psi}_n^\ast - \Psi(P_0) = P_0 D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0)) + o_P(1/n^r)$$
and asymptotic linearity fails to hold. On the contrary, we easily deduce from Proposition 3 what happens when $\theta_0(0, \cdot) = \theta(P_0)(0, \cdot)$, $\mu_0 = \mu(P_0)$, $g_0 = g(P_0)$, with fast rates of convergence:

Corollary 1 (asymptotic normality). Suppose that the assumptions of Proposition 3 are met. If in addition it holds that $\theta_0(0, \cdot) = \theta(P_0)(0, \cdot)$, $\mu_0 = \mu(P_0)$, $g_0 = g(P_0)$ and
$$P_0(\theta(P_n^{k_n})(0, \cdot) - \theta_0(0, \cdot))^2 \times \big(P_0(\mu(P_n^{k_n}) - \mu_0)^2 + P_0(g(P_n^{k_n})(0|\cdot) - g_0(0|\cdot))^2\big) = o_P(1/n)$$
then
$$\tilde{\psi}_n^\ast - \Psi(P_0) = (P_n - P_0) D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0)) + o_P(1/\sqrt{n}),$$
i.e., the TMLE $\tilde{\psi}_n^\ast$ is asymptotically linear with influence function $D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0))$. Thus, $\sqrt{n}(\tilde{\psi}_n^\ast - \Psi(P_0))$ is asymptotically distributed from a centered Gaussian law with variance $P_0 D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0))^2$. In particular, if $\theta_0 = \theta(P_0)$ then the TMLE $\tilde{\psi}_n^\ast$ is efficient.

Corollary 1 covers a simple case in the sense that, by being $o_P(1/\sqrt{n})$, the second right-hand-side term in (16) does not significantly contribute to the linear asymptotic expansion, i.e., the influence curve actually is $D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0))$. Depending on how $\theta(P_n^0)$, $\mu(P_n^0)$ and $g(P_n^0)$ are obtained (again, we recommend relying on super-learning), the contribution to the linear asymptotic expansion may be significant (but determining this contribution would be a very difficult task to address on a case-by-case basis when relying on super-learning).

5 Specifics of the TMLE procedure tailored to the estimation of the non-parametric variable importance measure

In this section, we present practical details on how we conduct the initialization and updating steps of the TMLE procedure as described in Section 3. We introduce in Section 5.1 a working model for the conditional distribution of $X$ given $(W, X \neq 0)$ which proves very efficient in computational terms. In Section 5.2, we introduce two alternative two-step updating procedures which can be substituted for the general one-step updating procedure presented in Section 3.2. Finally, we describe carefully all the features of interest of $P_0$ that must be considered for the purpose of targeting the parameter of ultimate interest, $\Psi(P_0)$, via the construction of the TMLE.

5.1 Working model for the conditional distribution of $X$ given $(W, X \neq 0)$

The working model for the conditional distribution of $X$ given $(W, X \neq 0)$ under $P_n^0$ that we build relies on two ideas:

- we link the conditional second-order moment $E_{P_n^0}(X^2 \mid X \neq 0, W)$ to the conditional mean $E_{P_n^0}(X \mid X \neq 0, W)$ (both under $P_n^0$) through the equality
$$E_{P_n^0}(X^2 \mid X \neq 0, W) = \varphi_{n,\lambda}\big(E_{P_n^0}(X \mid X \neq 0, W)\big), \qquad (17)$$
where $\varphi_{n,\lambda}(t) = \lambda t^2 + (1 - \lambda)(t(m_n + M_n) - m_n M_n)$ (with $m_n = \min_{i \leq n} X^{(i)}$, $M_n = \max_{i \leq n} X^{(i)}$), and $\lambda \in [0,1]$ is a fine-tuning parameter;

- under $P_n^0$ and conditionally on $(W, X \neq 0)$, $X$ takes its values in the set $\{X^{(i)} : i \leq n\} \setminus \{0\}$ of the observed $X$'s different from 0.

Since the conditional distribution of $X$ given $(W, X \neq 0)$ under $P_n^0$ is subject to two constraints, $X$ cannot take fewer than three different values in general. Elegantly, it is possible (under a natural assumption on $P_n^0$) to fine-tune $\lambda$ and to select three values in $\{X^{(i)} : i \leq n\} \setminus \{0\}$ in such a way that $X$ only takes the latter values:

Lemma 5. Assume that $P_n^0$ guarantees that $\sigma^2(P_n^0) > 0$, $P_n^0(X \neq 0) > 0$, $g(P_n^0)(0|W) \in (0,1)$ $P_n^0$-almost surely, and $X \in [m_n + c, M_n - c]$ for some $c > 0$ when $X \neq 0$. It is possible to construct $P_n^{00} \in \mathcal{M}$ in such a way that (i) $W$ has the same marginal distribution under $P_n^{00}$ and $P_n^0$, $\mu(P_n^{00}) = \mu(P_n^0)$, $g(P_n^{00}) = g(P_n^0)$, $\sigma^2(P_n^{00}) = \sigma^2(P_n^0)$, and (ii) for all $W \in \mathcal{W}$, there exist three different values $x^{(1)}, x^{(2)}, x^{(3)} \in \{X^{(i)} : i \leq n\} \setminus \{0\}$ and three non-negative weights $p_1, p_2, p_3$ summing up to 1 such that, conditionally on $(W, X \neq 0)$ under $P_n^{00}$, $X = x^{(k)}$ with conditional probability $p_k$.

Hence, we directly construct a $P_n^0$ of the same form as $P_n^{00}$.
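The link function $\varphi_{n,\lambda}$ of (17) interpolates, as $\lambda$ ranges over $[0,1]$, between the smallest second moment compatible with a given conditional mean $t$ (namely $t^2$) and the largest one for a variable supported on $[m_n, M_n]$ with mean $t$ (namely $t(m_n + M_n) - m_n M_n$, attained by the two-point law on $\{m_n, M_n\}$). A small check, with made-up values of $m_n$ and $M_n$:

```python
import numpy as np

# Working-model link (17): phi_{n,lambda}(t) = lambda * t^2
# + (1 - lambda) * (t*(m + M) - m*M), with m = min_i X^(i),
# M = max_i X^(i) (made-up values here).
m, M = -2.0, 3.0

def phi(t, lam):
    return lam * t ** 2 + (1.0 - lam) * (t * (m + M) - m * M)

t = np.linspace(m, M, 101)
# lambda = 1 gives t^2, the smallest second moment compatible with mean t;
# lambda = 0 gives t*(m + M) - m*M, the largest one for a variable with
# mean t supported on [m, M]; phi is linear in lambda between the two.
low, high = phi(t, 1.0), phi(t, 0.0)
```

The gap $\varphi_{n,0}(t) - \varphi_{n,1}(t) = (t - m_n)(M_n - t)$ is non-negative exactly on $[m_n, M_n]$, which is why tuning $\lambda$ sweeps out every admissible conditional second moment.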
Note that, by (8), because the conditional distribution of $X$ given $(W, X \neq 0)$ under $P_n^0$ has its support included in $\{X^{(i)} : i \leq n\} \setminus \{0\}$, so do the conditional distributions of $X$ given $(W, X \neq 0)$ under $P_n^k$ (all $k \geq 1$) obtained by following the general one-step updating procedure of Section 3.2. Similarly, because we initially estimate the marginal distribution of $W$ under $P_0$ by its empirical counterpart, the marginal distributions of $W$ under $P_n^0$ and $P_n^k$ (all $k \geq 1$) have their supports included in $\{W^{(i)} : i \leq n\}$.

We discuss in Section 5.4 why it is computationally more interesting to consider such a working model (instead of a Gaussian working model for instance). We emphasize that assuming $X \in [m_n + c, M_n - c]$ when $X \neq 0$ (for a possibly tiny $c > 0$) is hardly a constraint, and that the latter must be accounted for while estimating $\mu(P_0)$, $g(P_0)$, and $\sigma^2(P_0)$. The proof of the lemma is relegated to Section A.2.

5.2 Two tailored alternative two-step updating procedures

We presented in Section 3.2 a general one-step updating procedure. Alternatively, it is also possible to decompose each update into a first update of the conditional distribution of $Y$ given $(W, X)$, followed by a second update of the marginal distribution of $(W, X)$.

First update: fluctuating the conditional distribution of $Y$ given $(W, X)$.

We actually propose two different fluctuations for that purpose: a Gaussian fluctuation on one hand and a logistic fluctuation on the other hand, depending on what one knows or wants to impose.

Gaussian fluctuation. In this case too, minus the log-likelihood function is used as a loss function. Specifically, we first fluctuate only the conditional distribution of $Y$ given $(W, X)$, by introducing the path $\{P_{n,1}^{k-1}(\varepsilon) : \varepsilon \in \mathbb{R}\}$ such that (i) $(W, X)$ has the same distribution under $P_{n,1}^{k-1}(\varepsilon)$ as under $P_n^{k-1}$, and (ii) under $P_{n,1}^{k-1}(\varepsilon)$ and given $(W, X)$, $Y$ is distributed from the Gaussian law with conditional mean $\theta(P_n^{k-1})(X, W) + \varepsilon H(P_n^{k-1})(X, W)$ and conditional variance 1, where the so-called clever covariate $H(P)$ is characterized for any $P \in \mathcal{M}$ by
$$H(P)(X, W) = \frac{1}{\sigma^2(P)}\left(X - \frac{\mu(P)(W)}{g(P)(0|W)}\, \mathbf{1}\{X = 0\}\right).$$
This definition guarantees that the path fluctuates $P_n^{k-1}$ (i.e., $P_{n,1}^{k-1}(0) = P_n^{k-1}$) in the direction of $D_2^\star(P_n^{k-1})$ (i.e., the score of the path at $\varepsilon = 0$ equals $D_2^\star(P_n^{k-1})$), provided that $Y$ is conditionally Gaussian given $(W, X)$ under $P_n^0$. Introducing the MLE
$$\varepsilon_{n,1}^{k-1} = \mathop{\arg\max}_{\varepsilon \in \mathbb{R}} \sum_{i=1}^{n} \log P_{n,1}^{k-1}(\varepsilon)(O^{(i)}) = \frac{\sum_{i=1}^{n} \big(Y^{(i)} - \theta(P_n^{k-1})(X^{(i)}, W^{(i)})\big) H(P_n^{k-1})(X^{(i)}, W^{(i)})}{\sum_{i=1}^{n} H(P_n^{k-1})(X^{(i)}, W^{(i)})^2},$$
the first intermediate update bends $P_n^{k-1}$ into $P_{n,2}^{k-1} = P_{n,1}^{k-1}(\varepsilon_{n,1}^{k-1})$.

Logistic fluctuation. There is yet another interesting option in the case that $Y \in [a, b]$ is bounded (or in the case that one wishes to impose $Y \in [a, b]$, typically then with $a = \min_{i \leq n} Y^{(i)}$ and $b = \max_{i \leq n} Y^{(i)}$), which allows to incorporate this known fact (or wish) into the procedure. Let us assume that $\theta(P_0)$ takes its values in $]a, b[$ and also that $\theta(P_n^{k-1})$ is constrained in such a way that $\theta(P_n^{k-1})(X, W) \in\, ]a, b[$. Introduce for clarity the function on the real line characterized by $F_{a,b}(t) = (t - a)/(b - a)$. Here, we choose the loss function characterized by $-L_{a,b}(P)(O) = F_{a,b}(Y) \log F_{a,b} \circ \theta(P)(X, W) + (1 - F_{a,b}(Y)) \log(1 - F_{a,b} \circ \theta(P)(X, W))$, with the convention $L_{a,b}(P)(O) = +\infty$ if $\theta(P)(X, W) \in \{a, b\}$. Note that the loss $L_{a,b}(P)$ depends on the conditional distribution of $Y$ given $(W, X)$ under $P$ only through its conditional mean $\theta(P)$. This straightforwardly implies that in order to describe a fluctuation $\{P_{n,1}^{k-1}(\varepsilon) : \varepsilon \in \mathbb{R}\}$ of $P_n^{k-1}$, it is only necessary to detail the form of the marginal distribution of $(W, X)$ under $P_{n,1}^{k-1}(\varepsilon)$ and how $\theta(P_{n,1}^{k-1}(\varepsilon))$ depends on $\theta(P_n^{k-1})$ and $\varepsilon$. Specifically, we first fluctuate only the conditional distribution of $Y$ given $(W, X)$, by making $P_{n,1}^{k-1}(\varepsilon)$ be such that (i) $(W, X)$ has the same distribution under $P_{n,1}^{k-1}(\varepsilon)$ as under $P_n^{k-1}$, and (ii)
$$\theta(P_{n,1}^{k-1}(\varepsilon))(X, W) = F_{a,b}^{-1} \circ \mathrm{expit}\big(\mathrm{logit} \circ F_{a,b} \circ \theta(P_n^{k-1})(X, W) + \varepsilon H(P_n^{k-1})(X, W)\big).$$
Now, introduce the $L_{a,b}$-minimum-loss estimator
$$\varepsilon_{n,1}^{k-1} = \mathop{\arg\min}_{\varepsilon \in \mathbb{R}} \sum_{i=1}^{n} L_{a,b}(P_{n,1}^{k-1}(\varepsilon))(O^{(i)}),$$
which finally yields the first intermediate update $P_{n,2}^{k-1} = P_{n,1}^{k-1}(\varepsilon_{n,1}^{k-1})$. The following lemma (whose proof is relegated to Section A.2) justifies our interest in the loss function $L_{a,b}$ and the fluctuation $\{P_{n,1}^{k-1}(\varepsilon) : \varepsilon \in \mathbb{R}\}$:

Lemma 6. Assume that the conditions stated above are met. Then $L_{a,b}$ is a valid loss function for the purpose of estimating $\theta(P_0)$ in the sense that
$$\theta(P_0) = \mathop{\arg\min}_{\theta(P),\, P \in \mathcal{M}} P_0 L_{a,b}(P).$$
Moreover, it holds that
$$\frac{\partial}{\partial \varepsilon} L_{a,b}(P_{n,1}^{k-1}(\varepsilon))\Big|_{\varepsilon = 0}(O) = -D_2^\star(P_n^{k-1})(O).$$
The second equality is the counterpart of the fact that, when using the Gaussian fluctuation, the score of the path at $\varepsilon = 0$ equals $D_2^\star(P_n^{k-1})$.

Second update: fluctuating the marginal distribution of $(W, X)$.

Next, we preserve the conditional distribution of $Y$ given $(W, X)$ and only fluctuate the marginal distribution of $(W, X)$, by introducing the path $\{P_{n,2}^{k-1}(\varepsilon) : |\varepsilon| \leq \rho \|D_1^\star(P_{n,2}^{k-1})\|_\infty^{-1}\}$ such that (i) $Y$ has the same conditional distribution given $(W, X)$ under $P_{n,2}^{k-1}(\varepsilon)$ as under $P_{n,2}^{k-1}$, and (ii) the marginal distribution of $(W, X)$ under $P_{n,2}^{k-1}(\varepsilon)$ is characterized by
$$\frac{dP_{n,2}^{k-1}(\varepsilon)}{dP_{n,2}^{k-1}}(X, W) = 1 + \varepsilon D_1^\star(P_{n,2}^{k-1})(X, W). \qquad (18)$$
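The closed-form MLE of the Gaussian fluctuation above is nothing but a one-covariate, no-intercept least-squares fit of the current residuals on the clever covariate. A numeric sketch with made-up residuals and clever-covariate values:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up current fit: residuals Y^(i) - theta(P_n^{k-1})(X^(i), W^(i)) and
# values of the clever covariate H(P_n^{k-1}) at the n observations.
n = 1000
H = rng.normal(size=n)
resid = 0.3 * H + rng.normal(size=n)

# Closed-form MLE of the Gaussian fluctuation: the no-intercept
# least-squares coefficient of the residuals on H.
eps = np.sum(resid * H) / np.sum(H ** 2)

def sse(e):
    # The Gaussian conditional log-likelihood is, up to constants, minus
    # half this sum of squares, so maximizing it minimizes sse.
    return np.sum((resid - e * H) ** 2)
```

Because the criterion is an exact quadratic in $\varepsilon$, no line search is needed here, which is part of the computational appeal of the Gaussian fluctuation; the logistic fluctuation, by contrast, requires a one-dimensional numerical minimization of the $L_{a,b}$-loss.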


This second path fluctuates $P_{n,2}^{k-1}$ (i.e., $P_{n,2}^{k-1}(0) = P_{n,2}^{k-1}$) in the direction of $D_1^\star(P_{n,2}^{k-1})$ (i.e., the score of the path at $\varepsilon = 0$ equals $D_1^\star(P_{n,2}^{k-1})$). Consider again minus the log-likelihood as loss function, and introduce the MLE
$$\varepsilon_{n,2}^{k-1} = \mathop{\arg\max}_{|\varepsilon| \leq \rho \|D_1^\star(P_{n,2}^{k-1})\|_\infty^{-1}} \sum_{i=1}^{n} \log P_{n,2}^{k-1}(\varepsilon)(O^{(i)}):$$
the second update bends $P_{n,2}^{k-1}$ into $P_n^k = P_{n,2}^{k-1}(\varepsilon_{n,2}^{k-1})$, concluding the description of how we can alternatively build $P_n^k$ based on $P_n^{k-1}$.

Note that, by (18), because the conditional distribution of $X$ given $(W, X \neq 0)$ under $P_n^0$ has its support included in $\{X^{(i)} : i \leq n\} \setminus \{0\}$ (a consequence of our choice of working model, see Section 5.1), so do the conditional distributions of $X$ given $(W, X \neq 0)$ under $P_n^k$ (all $k \geq 1$) obtained by following either one of the tailored two-step updating procedures. Furthermore, it still holds that the marginal distributions of $W$ under $P_n^0$ and $P_n^k$ (all $k \geq 1$) have their supports included in $\{W^{(i)} : i \leq n\}$ (because we initially estimate the marginal distribution of $W$ under $P_0$ by its empirical counterpart).

5.3 Super-learning of the features of interest

It still remains to specify how we wish to carry out the initial estimation and updating of the features of interest $\theta(P_0)$, $\mu(P_0)$, $g(P_0)$, and $\sigma^2(P_0)$. As for $\sigma^2(P_0) = E_{P_0}\{X^2\}$, we simply estimate it by its empirical counterpart, i.e., construct $P_n^0$ in such a way that $\sigma^2(P_n^0) = n^{-1} \sum_{i=1}^{n} (X^{(i)})^2$. The three other features $\theta(P_0)$, $\mu(P_0)$ and $g(P_0)$ are estimated by super-learning, and $P_n^0$ is constructed in such a way that $\theta(P_n^0)$, $\mu(P_n^0)$ and $g(P_n^0)$ equal their corresponding estimators.
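The convex-aggregation step at the heart of super-learning can be sketched as follows, on made-up cross-validated predictions (the base learners, the outcome, and the use of `scipy.optimize.nnls` are all illustrative assumptions, not the exact implementation used here):

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(4)

# Made-up cross-validated predictions of three hypothetical base learners
# (columns of Z) for an outcome y; everything here is illustrative only.
n = 500
y = rng.normal(size=n)
Z = np.column_stack([
    y + rng.normal(scale=0.5, size=n),   # an accurate learner
    y + rng.normal(scale=2.0, size=n),   # a noisy learner
    rng.normal(size=n),                  # an uninformative learner
])

# Non-negative least squares, then normalization: the super-learner is the
# convex combination of the base predictors with these weights.
alpha, _ = nnls(Z, y)
weights = alpha / alpha.sum()
sl_pred = Z @ weights
```

As expected, the accurate learner dominates the combination; the oracle inequalities of [24, 22] make rigorous the heuristic that the aggregate performs at least as well as the best base predictor.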
hal-00629899, version 1 - 6 Oct 2011

Super-learning is a cross-validation based aggregation method that builds a predictor as a convex combination of base predictors [24, 22] (we briefly describe in Section 6.5 the specifics of the super-learning procedure that we implement for our application to simulated and real data). The weights of the convex combination are chosen so as to minimize the prediction error, which is expressed in terms of the non-negative least squares (NNLS) loss function [7] and estimated by V-fold cross-validation. Heuristically, the obtained predictor is by construction at least as good as the best of the base predictors (this statement has a rigorous form involving oracle inequalities; see [24, 22]).

Lemma 1 teaches us what additional features of $P_n^{k-1}$ must be known in order to derive the kth update $P_n^k$ from its predecessor $P_n^{k-1}$, starting from k = 1. Specifically, if we rely on the general one-step updating procedure of Section 3.2 then we need to know:

- $E_{P_n^{k-1}}(Y D^\star(P_n^{k-1})(O)|X, W)$ and $E_{P_n^{k-1}}(D^\star(P_n^{k-1})(O)|X, W)$ for the update of $\theta(P_n^{k-1})$ (see (10));
- $E_{P_n^{k-1}}(D^\star(P_n^{k-1})(O)|W)$ for the updates of $\mu(P_n^{k-1})$, $g(P_n^{k-1})$, and the marginal distribution of W under $P_n^{k-1}$ (see the right-hand side denominators in (11), (12), (14));
- $E_{P_n^{k-1}}(X D^\star(P_n^{k-1})(O)|W)$ for the update of $\mu(P_n^{k-1})$ (see the right-hand side numerator in (11));
- $E_{P_n^{k-1}}(1\{X = 0\} D^\star(P_n^{k-1})(O)|W)$ for the update of $g(P_n^{k-1})$ (see the right-hand side numerator in (12));
- $E_{P_n^{k-1}}\{X^2 D^\star(P_n^{k-1})(O)\}$ for the update of $\sigma^2(P_n^{k-1})$ (see (13)).

It is noteworthy that if either one of the two-step updating procedures of Section 5.2 is used, then the first two conditional expectations need not be known, because updating $\theta(P_n^{k-1})$ relies on the clever covariate $H(P_n^{k-1})$, which is entirely characterized by the current estimators $\mu(P_n^{k-1})$, $g(P_n^{k-1})$, and $\sigma^2(P_n^{k-1})$ of the features $\mu(P_0)$, $g(P_0)$, and $\sigma^2(P_0)$, respectively. In the sequel of this sub-section, we focus on the general one-step updating procedure of Section 3.2. How to proceed when relying on either of the two-step updating procedures of Section 5.2 is easily deduced from that case.

Once $\theta(P_n^0)$, $\mu(P_n^0)$, $g(P_n^0)$, and $\sigma^2(P_n^0)$ are determined (see the first paragraph of this sub-section), hence $D^\star(P_n^0)$ is known, we also estimate by super-learning the conditional expectations $E_{P_0}(Y D^\star(P_n^0)(O)|X, W)$, $E_{P_0}(D^\star(P_n^0)(O)|X, W)$, $E_{P_0}(D^\star(P_n^0)(O)|W)$, $E_{P_0}(X D^\star(P_n^0)(O)|W)$, $E_{P_0}(1\{X = 0\} D^\star(P_n^0)(O)|W)$; as for $E_{P_0}\{X^2 D^\star(P_n^0)(O)\}$, we simply estimate it by its empirical counterpart. Then we constrain $P_n^0$ in such a way that the conditional expectations $E_{P_n^0}(Y D^\star(P_n^0)(O)|X, W)$, $E_{P_n^0}(D^\star(P_n^0)(O)|X, W)$, $E_{P_n^0}(D^\star(P_n^0)(O)|W)$, $E_{P_n^0}(X D^\star(P_n^0)(O)|W)$, $E_{P_n^0}(1\{X = 0\} D^\star(P_n^0)(O)|W)$, and the expectation $E_{P_n^0}\{X^2 D^\star(P_n^0)(O)\}$ equal their corresponding estimators. This completes the construction of $P_n^0$, and suffices for characterizing the features $\theta(P_n^1)$, $\mu(P_n^1)$, $g(P_n^1)$ and $\sigma^2(P_n^1)$ of the first update $P_n^1$.

Now, if one wished to follow exactly the conceptual road consisting in relying on Lemma 1 in order to derive the second update $P_n^2$ from its predecessor $P_n^1$, one would have to describe how each conditional (and unconditional) expectation of the above list behaves, as a function of ε, on the path $\{P_n^1(\varepsilon) : |\varepsilon| \le \rho \|D^\star(P_n^1)\|_\infty^{-1}\}$. This would in turn enlarge the above list of the features of interest of $P_0$ that one would have to consider in the initial construction of $P_n^0$. Note that the length of the list would increase quadratically in the number of updates.
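For intuition, the cross-validated convex aggregation at the heart of super-learning can be sketched in a few lines. This is an illustrative Python sketch under assumed interfaces, not the SuperLearner R implementation actually used here; `super_learner_weights` and the learner signature are hypothetical names. Each base learner produces out-of-fold predictions, and the NNLS coefficients over those predictions, normalized to sum to one, give the convex weights.

```python
import numpy as np
from scipy.optimize import nnls

def super_learner_weights(X, y, learners, V=5, seed=0):
    """Cross-validated convex aggregation of base predictors (sketch).

    Each learner is a function fit(X_train, y_train) -> predict(X_test).
    The out-of-fold predictions of all learners are stacked into Z, and
    the weights minimizing ||Z w - y||^2 subject to w >= 0 (NNLS) are
    normalized so that they define a convex combination.
    """
    n = len(y)
    folds = np.random.default_rng(seed).integers(0, V, size=n)
    Z = np.zeros((n, len(learners)))
    for v in range(V):
        train, test = folds != v, folds == v
        for j, fit in enumerate(learners):
            predict = fit(X[train], y[train])
            Z[test, j] = predict(X[test])
    alpha, _ = nnls(Z, y)        # non-negative weights
    return alpha / alpha.sum()   # normalize to a convex combination
```

Heuristically, the learners that track the signal receive most of the weight, which is the sense in which the aggregate is at least as good as the best base predictor.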
Instead, once $D^\star(P_n^{k-1})$ is known, we estimate by super-learning the conditional expectations $E_{P_0}(Y D^\star(P_n^{k-1})(O)|X, W)$, $E_{P_0}(D^\star(P_n^{k-1})(O)|X, W)$, $E_{P_0}(D^\star(P_n^{k-1})(O)|W)$, $E_{P_0}(X D^\star(P_n^{k-1})(O)|W)$, $E_{P_0}(1\{X = 0\} D^\star(P_n^{k-1})(O)|W)$; as for $E_{P_0}\{X^2 D^\star(P_n^{k-1})(O)\}$, we simply estimate it by its empirical counterpart. Then we proceed as if the conditional expectations $E_{P_n^{k-1}}(Y D^\star(P_n^{k-1})(O)|X, W)$, $E_{P_n^{k-1}}(D^\star(P_n^{k-1})(O)|X, W)$, $E_{P_n^{k-1}}(D^\star(P_n^{k-1})(O)|W)$, $E_{P_n^{k-1}}(X D^\star(P_n^{k-1})(O)|W)$, $E_{P_n^{k-1}}(1\{X = 0\} D^\star(P_n^{k-1})(O)|W)$, and $E_{P_n^{k-1}}\{X^2 D^\star(P_n^{k-1})(O)\}$ were equal to their corresponding estimators. By doing so, the length of the list of the features of interest of $P_0$ stays fixed no matter how many steps of the updating procedure are carried out. Arguably, following this alternative road has little if any effect relative to following exactly the conceptual road consisting in relying on Lemma 1, because only second- (or higher-) order expressions in ε are involved.


5.4 Merit of the working model for the conditional distribution of X given (W, X ≠ 0)

Let us explain here why (a) initially estimating the marginal distribution of W under $P_0$ by its empirical counterpart and (b) relying on the working model for the conditional distribution of X given (W, X ≠ 0) that we described in Section 5.1 is computationally very interesting. The key is that, under $P_n^0$ and its successive updates $P_n^k$ (all k ≥ 1), the distributions of (W, X) have their supports included in $\{(W^{(i)}, X^{(j)}) : 1 \le i, j \le n\}$ (we say they are "parsimonious").

Indeed, Lemma 1 and a simple induction yield that, for each k ≥ 1, a single call to $\theta(P_n^k)$, $\mu(P_n^k)$ or $g(P_n^k)$ involves a number of (nested) calls to the "past" features of interest $\theta(P_n^{k'})$, $\mu(P_n^{k'})$ and $g(P_n^{k'})$ (0 ≤ k′ < k) which is O(k). Furthermore, the evaluation of $\Psi(P_n^k)$ (following (5) with $P_n^k$ substituted for $P_n^0$) requires in turn B calls (assuming for simplicity that the functions are not vectorized) to $\theta(P_n^k)$ (in order to evaluate the numerator of the right-hand side term of (5)), and to $\mu(P_n^k)$ and $g(P_n^k)$ (in order to simulate $\{(\tilde{W}^{(b)}, \tilde{X}^{(b)}) : b \le B\}$). Overall, at least O(Bk) calls to the set of all features of interest are performed at the kth updating step of the TMLE procedure. In practice (even if the functions are vectorized), this leads to a large memory footprint and a prohibitive running time of the algorithm, as each of these calls consists in the prediction of the corresponding feature, as described in Section 5.3. By taking advantage of the "parsimony" of the distributions of (W, X) under the successive $P_n^k$ (k ≥ 0), we manage to dramatically alleviate the time and memory requirements of our implementation.
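To make the gain concrete, here is a minimal illustrative sketch (in Python, with a hypothetical name `precompute_feature`; the authors' implementation is in R and is not reproduced here): because every updated distribution only puts (W, X)-mass on observed values, each feature of interest can be tabulated once per update on the n × n support grid, after which all later evaluations are table lookups rather than nested function calls.

```python
import numpy as np

def precompute_feature(feature, X_obs, W_obs):
    """Tabulate feature(x, w) once on the observed n x n support grid.

    Because the updated distributions only put (W, X)-mass on observed
    values, every later evaluation reduces to a lookup table[i, j]
    instead of a nested chain of feature calls.
    """
    table = np.empty((len(X_obs), len(W_obs)))
    for i, x in enumerate(X_obs):
        for j, w in enumerate(W_obs):
            table[i, j] = feature(x, w)
    return table
```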
Indeed, the "parsimony" implies that, at the kth step of the TMLE procedure (k ≥ 0), it is only required to compute and store O(n²) quantities (including, but not limited to, $\theta(P_n^k)(X^{(i)}, W^{(j)})$, $\mu(P_n^k)(W^{(i)})$ and $g(P_n^k)(W^{(j)})$ for all 1 ≤ i, j ≤ n; see Section 5.3). In particular, the evaluation of $\Psi(P_n^k)$ now requires retrieving O(B) values from a handful of vectors instead of performing O(Bk) memory- and time-consuming (nested) function calls.

6 Application

We first present the genomic problem that motivated this study, in Section 6.1, and earlier contributions on the same topic, in Section 6.2. Two real datasets are described in Section 6.3. They play a central role in this article. We both (a) draw inspiration from one of them and (b) use it in order to set up our simulation study, as presented in Section 6.4. We also apply the TMLE methodology directly to the other. The specifics of the TMLE procedures that we undertake both on simulated and real data are given in Section 6.5, and their results are summarized in Section 6.6, for the simulation study, and in Section 6.7, for the real data application.

6.1 Association between DNA copy number and gene expression in cancers

The activity of a gene in a cell is directly related to its expression level, that is, the number of messenger RNA (mRNA) fragments corresponding to this gene. Cancer cells are characterized by changes in their gene expression patterns. Such alterations have been shown to be caused directly or indirectly by genetic events, such as changes in the number of DNA copies, and epigenetic events, such as DNA methylation. Some changes in DNA copy number have been reported to be positively associated with gene expression levels [11].
Conversely, DNA methylation is a chemical transformation of cytosines (one of the four types of DNA nucleotides) which is thought to lead to gene expression silencing [5]. Therefore, DNA methylation levels are generally negatively associated with gene expression levels.

We propose to apply the methodology developed in the previous sections to the search for genes for which there exists an association between DNA copy number variation and gene expression level, accounting for DNA methylation.

6.2 Related works

In the context of cancer studies, various methods have been proposed in order to find associations between DNA copy number and gene expression at the level of genes. Because we cannot cite all of them, we try here to cite one relevant publication for each broad type of method. Most of them can be classified into two groups, depending on whether DNA copy number is viewed as a continuous or a discrete variable. When DNA copy number is viewed as a continuous variable, associations between X and Y are generally quantified using a correlation coefficient [11]. When it is viewed as a discrete variable, associations are typically quantified using a test of differential expression between DNA copy number states [26]. A common limitation of these two types of methods is that they are generally good at identifying genes that were already known, but less so at finding novel candidates. This is not surprising: for correlation-based methods, high correlation between X and Y requires both X and Y to vary substantially, in which case it is likely that these (marginal) variations have already been reported.
For methods based on differential expression between copy number states, the latter often correspond to biological or clinical groups which are already known and for which differential expression analyses have already been carried out.

In the present paper, we acknowledge the fact that while DNA copy number is observed as a quantitative variable, the copy-neutral state (two copies of DNA) generally has positive mass, in the sense that for a given gene, a positive proportion of samples have two copies of DNA.

Another major difference between our method and the ones cited above is that we explicitly incorporate DNA methylation into the analysis. Several papers where DNA copy number, gene expression and DNA methylation are combined have been published recently, but they typically analyze one dimension of (W, X, Y) at a time, and then use an ad hoc rule to


merge or intersect the results [1, 17]. The CNAmet method [10] relies on two scores: a score of differential expression between copy number levels on the one hand, and between DNA methylation levels on the other hand. Then both scores are summed. In the method proposed here, the three dimensions are studied jointly.

6.3 Datasets

We exploit glioblastoma multiforme (GBM, the most common type of primary adult brain cancer) and ovarian cancer (OvCa, a cancerous growth arising from the ovary) data from The Cancer Genome Atlas (TCGA) project [2], a collaborative initiative to better understand several types of cancers using existing large-scale whole-genome technologies. TCGA has recently completed a comprehensive genomic characterization of these types of tumor, including DNA copy number (X), gene expression (Y), and DNA methylation (W) microarray experiments [18, 19].

Probe-level normalized GBM and OvCa data can be downloaded from the TCGA repository at http://tcga-data.nci.nih.gov/tcga/. In order to study associations between X, Y and W at the level of genes, these probe-level measurements first need to be aggregated into gene-level summaries.
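In outline, such probe-to-gene aggregation might look as follows (an illustrative Python sketch with hypothetical inputs and a hypothetical function name, not TCGA's actual pre-processing pipeline, which produces the locus-specific summaries defined below):

```python
import numpy as np
from collections import defaultdict

def aggregate_probes(probe_genes, probe_values):
    """Average probe-level signals into one summary per gene (sketch).

    probe_genes: one gene identifier per probe measurement;
    probe_values: the matching probe-level signals.
    Returns a dict mapping each gene to the mean of its probe signals.
    """
    buckets = defaultdict(list)
    for gene, value in zip(probe_genes, probe_values):
        buckets[gene].append(value)
    return {gene: float(np.mean(values)) for gene, values in buckets.items()}
```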
We choose to define X, Y and W as follows for a given gene:

- DNA methylation W is the proportion of "methylated" signal at a CpG locus in the gene's promoter region;
- DNA copy number X is a locally smoothed total copy number relative to a set of reference samples;
- expression Y is the "unified" gene expression level across three microarray platforms, as defined by [27].

After this pre-processing step, each gene is represented by a 3 × n matrix, where 3 is the number of data types and n is the number of samples. Figure 2(a) represents DNA methylation, DNA copy number, and gene expression data for one particular gene, EGFR, which is known to be altered in GBM. The association between copy number and expression is non-linear, and high methylation levels are associated with low expression levels.

6.4 Simulation scheme

[Figure 2 about here — panels (a) "real dataset" and (b) "simulated dataset": kernel density estimates, pairwise scatter plots, and Pearson correlations for DNA methylation, DNA copy number, and gene expression.]

Figure 2: Illustrating DNA methylation, DNA copy number, and gene expression data. In both graphics, we represent kernel density estimates (diagonal panels), pairwise plots (lower panels), and report the pairwise Pearson correlation coefficients (upper panels). (a). Real dataset corresponding to the EGFR gene in 187 GBM tumor samples. For 130 among the 187 samples, only DNA copy number and gene expression data were available (circles in lower middle plot). (b). Simulated dataset consisting of n = 200 independent copies of the synthetic observed data structure described in Section 6.6.
Note that the constant $O_2^X$ is added to each value of X so that graphics corresponding to real and simulated data can be more easily compared.

Because association patterns between copy number, expression and methylation are generally non-linear, setting up a realistic simulation model is a difficult task. We design here a simulation strategy based on perturbations of real observed data structures, which mimics situations such as the one observed in Figure 2(a) for the EGFR gene in GBM. This strategy implements the following constraints:


- there are generally up to three copy number classes: normal regions, and regions of copy number gains and losses;
- in normal regions, expression is negatively correlated with methylation;
- in regions of copy number alteration, copy number and expression are positively correlated.

Our simulation scheme relies on three real observed data structures $O_1 = (O_1^W, O_1^X, O_1^Y)$, $O_2 = (O_2^W, O_2^X, O_2^Y)$, $O_3 = (O_3^W, O_3^X, O_3^Y)$ corresponding to three samples from different copy number classes: loss (class 1), normal (class 2), and gain (class 3). We simulate a synthetic observed data structure $O = (W, X, Y) \sim P^s$ as follows. Given a vector $p = (p_1, p_2, p_3)$ of proportions such that $p_1 + p_2 + p_3 = 1$, we first draw a class assignment U from the multinomial distribution with parameter (1, p) (in other words, U = u with probability $p_u$). Conditionally on U, a measure W of DNA methylation is drawn randomly as a perturbation of the DNA methylation in the corresponding real observed data structure $O_U$: given a vector $\omega = (\omega_1, \omega_2, \omega_3)$ of positive numbers,
$$W = \mathrm{expit}\left(\mathrm{logit}(O_U^W) + \omega_U Z\right),$$
where Z is a standard normal random variable independent of U. Finally, a couple (X, Y) of DNA copy number and DNA expression is drawn conditionally on (U, W) as a perturbation of the couple $(O_U^X, O_U^Y)$ in the corresponding real observed data structure $O_U$ (with an additional centering applied to X so that the pivot value be equal to 0): given $\sigma^2 > 0$, two variance-covariance $2 \times 2$-matrices $\Sigma_1$ and $\Sigma_3$ and a non-increasing mapping $\lambda_0 : [0, 1] \to [0, 1]$,

- if U = 2, then $(X, Y) = (0, O_2^Y + \lambda_0(W) + \sigma^2 Z')$, where Z′ is a standard normal random variable independent of (U, W);
- if U ≠ 2, then (X, Y) is drawn conditionally on (U, W) from the bivariate Gaussian distribution with mean $(O_U^X - O_2^X, O_U^Y)$ and variance-covariance matrix $\Sigma_U$.

In particular, the reference/pivot value $x_0 = 0$. Note that $\lambda_0$ is chosen non-increasing in order to account for the negative association between DNA expression and methylation. Furthermore, the synthetic observed data structure O drawn from $P^s$ is not bounded.

We easily derive closed-form expressions for the features of interest $\theta(P^s)$, $\mu(P^s)$, $g(P^s)$, and $\sigma^2(P^s)$, which we report in the Appendix (see Lemma 7). Relying on Lemma 7 makes it possible to evaluate the value of $\Psi(P^s)$, by following the procedure described in Section 3.1 (see details in Section 6.6).

Finally, we provide in Figure 2(b), for the sake of illustration, a visual summary of a simulation run with n = 200 independent copies of the synthetic observed data structure O drawn from $P^s$ and based on real observed data structures from two GBM samples for the EGFR gene, which are described in Table 1. The parameters for this simulation were chosen as follows: $p = (0, 1/2, 1/2)$, $\omega = (0, 3, 3)$, $\lambda_0 : w \mapsto -w$, $\sigma^2 = 1$, $\Sigma_3 = \begin{pmatrix} 9.96 & 1 \\ 1 & 0.43 \end{pmatrix}$.

6.5 Library of algorithms for super-learning

We explain in Section 5.3 that we rely on super-learning [24, 22] in order to estimate some relevant infinite-dimensional features of $P_0$, including (but not limited to) $\theta(P_0)$, $\mu(P_0)$ and $g(P_0)$. This algorithmic challenge is easily overcome, thanks to the remarkable R-package SuperLearner [12] and the possibility to rely on the library of R-packages [13] built by the statistical community. The base predictors involve (in alphabetical order):

- Generalized additive models: we use the gam R-package [4], with its default values.
- Generalized linear models: we use the glm R-function with identity link (for learning $\theta(P_0)$ and $\mu(P_0)$) and logit link (for learning $g(P_0)$), and with linear combinations of (1, X, W) or (1, X, W, XW) (for learning $\theta(P_0)$) and linear combinations of (1, W) or (1, W, W²) (for learning $\mu(P_0)$ and $g(P_0)$).
- Piecewise linear splines: we use the polymars R-function from the polspline R-package [6], with its default values.
- Random forests: we use the randomForest R-package [9], with its default values.
- Support vector machines: we use the svm R-function from the e1071 R-package [3], with its default values.

Note that none of the statistical models associated with the above estimation procedures contains $P^s$ (see Lemma 7).

6.6 Simulation study

We conduct twice a simulation study where $B' = 10^3$ datasets of n = 200 independent observed data structures are (independently) generated under $P^s$ (i.e., under the simulation scheme described in Section 6.4). In each simulation study and for every simulated dataset, we perform the TMLE methodology for the purpose of estimating the target parameter $\Psi(P^s)$. From one simulation study to the other, we only change the set-up of the super-learning procedure, by modifying the library of algorithms involved in the super-learning of the features of interest:

- the first time, we proceed exactly as described in Section 6.5 (we say that the full-SL is undertaken);
- the second time, we decide to include only algorithms based on generalized linear models (we say that the light-SL is undertaken).
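For illustration, the perturbation scheme of Section 6.4 can be sketched as follows (an illustrative Python sketch with hypothetical names; the paper's software is R-based). The `sigma2 * rng.normal()` term mirrors the σ²Z′ noise term as written in the text, and the class-1 entries in the usage below are unused dummies since $p_1 = 0$.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def simulate(n, O, p, omega, lam0, sigma2, Sigma, rng):
    """Draw n copies of O = (W, X, Y) by perturbing real data structures.

    O maps each class u in {1, 2, 3} (loss, normal, gain) to a real
    observed triple (O_u^W, O_u^X, O_u^Y); Sigma maps u in {1, 3} to a
    2 x 2 variance-covariance matrix.  W perturbs the real methylation
    on the logit scale; X is centred so the copy-neutral pivot is 0.
    """
    out = np.zeros((n, 3))
    U = rng.choice([1, 2, 3], size=n, p=p)
    for i, u in enumerate(U):
        OW, OX, OY = O[u]
        W = expit(logit(OW) + omega[u - 1] * rng.normal())
        if u == 2:
            # copy-neutral class: X = 0 exactly, Y depends on W through lam0
            X, Y = 0.0, OY + lam0(W) + sigma2 * rng.normal()
        else:
            # altered classes: bivariate Gaussian around (O_u^X - O_2^X, O_u^Y)
            X, Y = rng.multivariate_normal([OX - O[2][1], OY], Sigma[u])
        out[i] = (W, X, Y)
    return out
```

Calling it with the parameters of the text, $p = (0, 1/2, 1/2)$, $\omega = (0, 3, 3)$, $\lambda_0(w) = -w$, $\sigma^2 = 1$, $\Sigma_3$ as above, and the two real GBM samples of Table 1, reproduces the key qualitative feature of the scheme: the copy-neutral state X = 0 has positive mass.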


We do not use any index to refer to the super-learning set-up (full-SL or light-SL), for the sake of alleviating notations.

In each simulation study (i.e., for each set-up of the super-learning procedure, full-SL and light-SL) and for each b ≤ B′, we record the values $\psi^k_{n,b} = \Psi(P^k_{n,b})$ of the initial substitution estimator (k = 0) and subsequent updated substitution estimators (k = 1, 2, 3) targeting $\Psi(P^s)$, as derived on the bth simulated dataset (whose empirical measure is denoted by $P_{n,b}$). The targeted update steps rely on the Gaussian fluctuations presented in Section 5.2 (the results are very similar when one applies either the general one-step updating procedure of Section 3.2 or the second tailored alternative two-step updating procedure of Section 5.2). We do not record the next updates because the ad hoc stopping criterion that we devise systematically indicates that this is not necessary (heuristically, the criterion elaborates on the gains in likelihood and the variations in the resulting estimates).

The value of $\Psi(P^s)$ is evaluated by simulations, following (5) in Section 3.1 with $P^s$ substituted for $P_n^0$ (we rely on $B = 10^5$ simulated observed data structures, whose empirical measure is denoted by $P_B$; the features $\theta(P^s)$ and $\sigma^2(P^s)$ are explicitly known, see Lemma 7). In order to get a sense of how accurate our evaluation of $\Psi(P^s)$ is, we also use the same large simulated dataset to evaluate $\mathrm{Var}_{P^s} D^\star(P^s)(O)$ (as the empirical variance $\mathrm{Var}_{P_B} D^\star(P^s)(O)$; again, $D^\star(P^s)$ is known explicitly by Lemma 7).
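A (1 − α)-accuracy interval of the form $\psi_B \pm \xi_{1-\alpha/2}\sqrt{v_B/N}$ is computed in a couple of lines (illustrative Python; `accuracy_interval` is a hypothetical name):

```python
import numpy as np
from scipy.stats import norm

def accuracy_interval(psi_B, v_B, N, alpha=0.05):
    """(1 - alpha)-accuracy interval psi_B +/- xi_{1 - alpha/2} * sqrt(v_B / N)."""
    half = norm.ppf(1 - alpha / 2) * np.sqrt(v_B / N)
    return psi_B - half, psi_B + half
```

With $\psi_B(P^s) = 0.2345$ and $v_B(P^s) = 0.05980232$, taking N = 200 gives [0.2006; 0.2684] and N = 10⁵ gives [0.2329; 0.2360], matching the values reported in Table 2.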
Denoting by $\psi_B(P^s)$ and $v_B(P^s)$ the latter evaluations, we interpret the intervals $[\psi_B(P^s) \pm \xi_{1-\alpha/2}\sqrt{v_B(P^s)/n}]$ and $[\psi_B(P^s) \pm \xi_{1-\alpha/2}\sqrt{v_B(P^s)/B}]$ as (1 − α)-accuracy intervals for the evaluation of $\Psi(P^s)$ based on n = 200 and $B = 10^5$ independent observed data structures. The gray intervals in Figure 3 represent these accuracy intervals for α = 5%, n = 200 (light gray) and $B = 10^5$ (dark gray). Note that (by the convolution theorem) the length of $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/n}]$ is the optimal length of a 95%-confidence interval based on an efficient (regular) estimator of $\Psi(P^s)$ relying on n observations (assuming that the asymptotic regime is reached). The numerical values are reported in Table 2.

sample name            methylation $O_i^W$   copy number $O_i^X$   expression $O_i^Y$
TCGA-02-0001 (i = 2)   0.05                  2.72                  -0.46
TCGA-02-0003 (i = 3)   0.01                  9.36                  1.25

Table 1: Real methylation, copy number and expression data used as a baseline for simulating the dataset according to the simulation scheme presented in Section 6.6. A visual of the simulated dataset is provided in Figure 2(b).

$\psi_B(P^s)$   $v_B(P^s)$   $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/N}]$, N = 200   N = 10^5
0.2345          0.05980232   [0.2006; 0.2684]                                            [0.2329; 0.2360]

Table 2: Values of $\psi_B(P^s)$ and $v_B(P^s)$, estimators of $\Psi(P^s)$ and $\mathrm{Var}_{P^s} D^\star(P^s)(O)$, and 95%-accuracy intervals $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/n}]$, $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/B}]$ (n = 200, $B = 10^5$).

[Figure 3 about here: panels (a) full-SL and (b) light-SL.]

Figure 3: Empirical distribution of $\{\psi^k_{n,b} : b \le B'\}$ based on n = 200 independent observed data structures, for k = 0 (initial estimator) and k iterations of the updating procedure (k = 1, 2, 3), as obtained from $B' = 10^3$ independent replications of the simulation study (using a Gaussian kernel density estimator). (a). The super-learning procedure involves all algorithms described in Section 6.5. (b). The super-learning procedure only involves algorithms based on generalized linear models. In both graphics, gray rectangles represent 95%-accuracy intervals $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/n}]$ and $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/B}]$ for the true parameter $\Psi(P^s)$ based on n = 200 observed data structures (light gray) and $B = 10^5$ observed data structures (dark gray). The length of $[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/n}]$ is the optimal length of a 95%-confidence interval based on an efficient (regular) estimator of $\Psi(P^s)$ relying on n observations (assuming that the asymptotic regime is reached).

The results of this joint simulation study are summarized by Figure 3 (which shows kernel density estimates of the empirical distributions of $\{\psi^k_{n,b} : b \le B'\}$ for 0 ≤ k ≤ 3) and Table 3. They illustrate some of the fundamental characteristics of the TMLE estimator and related confidence intervals: convergence of the iterative updating procedure, robustness, asymptotic normality, and coverage.

Convergence of the iterative updating procedure, and robustness. A substantial bias in the initial estimation is revealed by the location of the mode of $\{\psi^0_{n,b} : b \le B'\}$ in Figure 3, both for the full-SL and light-SL procedures. We see that the full-SL initial estimator is less biased than its light-SL counterpart. As one can judge visually, or by the first rows of Tables 3(a) and 3(b), this initial bias is diminished (if not perfectly corrected) at the first updating step of the TMLE procedure, illustrating the robustness of the targeted estimator. The empirical distributions of $\{\psi^k_{n,b} : b \le B'\}$ for k = 1, 2, 3 are not (visually) markedly different, an empirical indication that the TMLE procedure converges quickly.

Asymptotic normality. In order to check the asymptotic normality of the TMLE estimator (e.g., under the conditions of Corollary 1), we first perform Lilliefors tests of normality based on the empirical distributions of $\{\psi^k_{n,b} : b \le B'\}$ for k = 0, 1, 2, 3 (i.e., we perform Kolmogorov-Smirnov tests of normality without specification of the means and variances under the null). We report the values of the test statistics and corresponding p-values in the third and fourth rows of Tables 3(a) and 3(b). If we take into account the multiplicity of tests, there is no clear indication that the limit distributions are not Gaussian.

Second, we test the fit of the empirical distributions of $\{\psi^k_{n,b} : b \le B'\}$ to a Gaussian distribution with mean and variance given by the estimates $\psi_B(P^s)$ and $v_B(P^s)$ (which are independent of $\{\psi^k_{n,b} : b \le B'\}$). We report in the fifth rows of Tables 3(a) and 3(b) the obtained values of the KS test statistics. Although all p-values are smaller than $10^{-4}$, one notices that the test statistics are strikingly smaller for k ≥ 1 than for k = 0. Performing Anderson-Darling tests of normality with only the null mean or the null variance specified (i.e., KS tests of normality with specified null mean, equal to $\psi_B(P^s)$, and unspecified null variance, or specified null variance, equal to $v_B(P^s)$, and unspecified null mean) teaches us that it is mainly the small remaining bias, and not the choice of the variance under the null, that makes the KS tests have such small p-values [values not shown].

Coverage. The theoretical convergence in distribution of the TMLE estimator to a Gaussian limit (e.g., under the conditions of Corollary 1) promotes the use of the intervals $[\psi^k_{n,b} \pm \xi_{1-\alpha/2} s^k_{n,b}/\sqrt{n}]$ as (1 − α)-confidence intervals for $\Psi(P^s)$ (k = 1, 2, 3), with $(s^k_{n,b})^2 = \mathrm{Var}_{P_{n,b}} D^\star(P^k_{n,b})(O)$. Interestingly, the theoretical result of Corollary 1 does not guarantee that it is safe to estimate the limit variance by $(s^k_{n,b})^2$ (additional assumptions on the construction and convergence of $\theta(P_n^{k_n})$, $\mu(P_n^{k_n})$ and $g(P_n^{k_n})$ would be required to get such a result). We nonetheless check whether the latter intervals provide the wished coverage or not. For this purpose, we compute and report in the sixth and seventh rows of Tables 3(a) and 3(b) the empirical coverages $c^k_n = \frac{1}{B'}\sum_{b=1}^{B'} 1\{\psi_B(P^s) \in [\psi^k_{n,b} \pm \xi_{1-\alpha/2} s^k_{n,b}/\sqrt{n}]\}$ and their optimistic counterparts $c^{k+}_n = \frac{1}{B'}\sum_{b=1}^{B'} 1\{[\psi_B(P^s) \pm \xi_{0.975}\sqrt{v_B(P^s)/B}] \cap [\psi^k_{n,b} \pm \xi_{1-\alpha/2} s^k_{n,b}/\sqrt{n}] \ne \emptyset\}$ (the latter incorporate the remaining uncertainty about the true value of $\Psi(P^s)$). We conclude that the provided coverage is good for the light-SL procedure (with excellent optimistic coverage), but disappointing for the full-SL procedure (even for the optimistic coverage). The results might have been better had one relied on the bootstrap in order to estimate the asymptotic variance of the TMLE. We will investigate this issue in future work.

6.7 Real data application

For the real data application, we focus on all 130 genes g ∈ G of chromosome 18 in the OvCa dataset. This choice is notably motivated by the associated sample size, approximately equal to 500 (thus much larger than the sample size associated with the GBM dataset).
We estimate the non-parametric variable importance measure of X on Y accounting for W for each gene separately (i.e., $\Psi(P_0^g)$ where $P_0^g \in \mathcal{M}$ is the true distribution of O = (W, X, Y) for gene g), following exactly one of the statistical methodologies developed in the simulation study. Specifically, the targeted update step relies on the Gaussian fluctuations presented in Section 5.2, and the super-learning involves the library of algorithms that we report in Section 6.5. In particular, we estimate for each gene g the asymptotic variance of the TMLE $\psi_n^{g,\star}$ of $\Psi(P_0^g)$ with the empirical variance $(s_n^{g,\star})^2$ of the efficient influence curve at $P_n^{g,\star}$. In a future work solely devoted to this real data application, we will use the bootstrap in order to derive a more robust estimator of the asymptotic variance (again, Corollary 1 requires some conditions on $P_0^g$ and $P_n^{g,\star}$ in order to guarantee that $(s_n^{g,\star})^2$ is a consistent estimator). We will also "extend" W, by adding to the DNA methylation of the gene of interest the DNA methylations, DNA copy numbers and gene expressions of its neighboring genes.

[Figure 4 about here.]

Figure 4: Real data application to the 130 genes of chromosome 18 in the OvCa dataset (ovarian cancers). We represent the test statistics $\sqrt{n}(\psi_n^{g,3} - \psi_{\mathrm{ref}}^g)/s_n^{g,3}$ for $\psi_{\mathrm{ref}}^g = 0$ (left graphic) and $\psi_{\mathrm{ref}}^g = F(P_n^g)$ (right graphic) along the position of gene g on the genome. We report the names of the genes such that $\sqrt{n}|\psi_n^{g,3}|/s_n^{g,3} > 45$ (left graphic) and $\sqrt{n}|\psi_n^{g,3} - F(P_n^g)|/s_n^{g,3} > 6$ (right graphic), the cut-offs being arbitrarily chosen.

We only briefly summarize the results of the real data application. For this purpose, we report in Figure 4 the values of the test statistics $\sqrt{n}(\psi_n^{g,3} - \psi_{\mathrm{ref}}^g)/s_n^{g,3}$ (g ∈ G) derived from the TMLE after three updates, using two different reference values $\psi_{\mathrm{ref}}^g \in \{0, F(P_n^g)\}$.


Here, $F(P_n^g) = \sum_{i=1}^n X^{(i)} Y^{(i)} / \sum_{i=1}^n (X^{(i)})^2$ is the least squares (substitution, asymptotically efficient) estimator of the parameter $F(P_0^g)$, see (4), a parameter which overlooks the role potentially played by W while quantifying the influence of X on Y. We are aware that $F(P_n^g)$ is not independent of $\psi_n^{g,3}$ and $s_n^{g,3}$, and will make sure in a future work solely devoted to this real data application that our estimator of $F(P_0^g)$ is derived from an independent dataset (or we will undertake a cross-validated procedure). The reference value $\psi_{\mathrm{ref}}^g = 0$ is a natural null value to rely on from a testing perspective. Using $\psi_{\mathrm{ref}}^g = F(P_n^g)$ as another null value is relevant because it allows us to identify those genes for which the (possibly intricate) role played by W in quantifying the influence of X on Y is especially important and results in a stark deviation of $\Psi(P_0^g)$ from $F(P_0^g)$.

Looking at the left graphic in Figure 4 teaches us that a majority of the $\Psi(P_0^g)$ (g ∈ G) are likely positive. Eight genes stand out (by having a test statistic $\sqrt{n}\psi_n^{g,3}/s_n^{g,3} > 45$): two genes at 18p11.32 (USP14 and THOC1), a cluster of five genes at 18q11.2 (SNRPD1, RBBP8, RIOK3, NPC1, SS18), and gene MBP at 18q23. This suggests that the region 18q11.2 (especially 19-24 Mb) is of particular relevance in this set of ovarian cancers.
Seven out of the latter eight genes (specifically: all of them but gene NPC1) also stand out in the right graphic of Figure 4: six out of the latter seven genes standing out in both graphics (specifically: all of them but gene MBP) exhibit a significantly small test statistic (by having $\sqrt{n}(\psi_n^{g,3} - F(P_n^g))/s_n^{g,3} < -6$), as does the additional gene SERPINB2, while gene MBP exhibits a significantly large test statistic (by having $\sqrt{n}(\psi_n^{g,3} - F(P_n^g))/s_n^{g,3} > 6$), as do eight additional genes (MBD1, TXNL1, LMAN1, WDR7, NARS, ZNF236, ATP9B, TXNL4A). All genes standing out in the right graphic of Figure 4 are located at 18q2 (41-76 Mb).

Acknowledgments

The topic of this article originates from a presentation [16] by Terry Speed (Department of Statistics, UC Berkeley) in the UC Berkeley Statistics and Genomics Seminar. We would like to thank him for a series of instructive discussions that followed. We also would like to thank The Cancer Genome Atlas project [2] for kindly providing the datasets.

A Appendix

A.1 Miscellanea

Recall that $P^s$ denotes the data-generating distribution of the synthetic observed data structure $O = (W, X, Y)$ described in Section 6.6. We easily derive the following closed-form expressions for the features of interest $\theta(P^s)$, $\mu(P^s)$, $g(P^s)$, and $\sigma^2(P^s)$.

Lemma 7. Let $\varphi$ denote the density of the standard normal distribution. The following equalities hold:
\[
\theta(P^s)(X, W) = \bigl(O_2^Y + \lambda_0(W)\bigr) P^s(U = 2 \mid X, W)
+ \sum_{u=1,3} \Bigl( O_u^Y + \frac{\Sigma_u(1,2)}{\Sigma_u(1,1)} \bigl( X - (O_u^X - O_2^X) \bigr) \Bigr) P^s(U = u \mid X, W),
\]
\[
\mu(P^s)(W) = \sum_{u=1}^{3} (O_u^X - O_2^X) P^s(U = u \mid W),
\qquad
g(P^s)(0 \mid W) = P^s(U = 2 \mid W),
\]
\[
\sigma^2(P^s) = \sum_{u=1,3} p_u \bigl( \Sigma_u(1,1) + (O_u^X - O_2^X)^2 \bigr),
\]
where, for each $u = 1, 2, 3$,
\[
P^s(U = 2 \mid X, W) \propto \frac{p_2}{\omega_2}\, \varphi\Bigl( \frac{\mathrm{logit}(W) - \mathrm{logit}(O_2^W)}{\omega_2} \Bigr) \mathbf{1}\{X = 0\},
\]
\[
P^s(U = u \mid X, W) \propto \frac{p_u}{\omega_u}\, \varphi\Bigl( \frac{\mathrm{logit}(W) - \mathrm{logit}(O_u^W)}{\omega_u} \Bigr) \times \varphi\Bigl( \frac{X - (O_u^X - O_2^X)}{\sqrt{\Sigma_u(1,1)}} \Bigr),
\]
\[
P^s(U = u \mid W) \propto \frac{p_u}{\omega_u}\, \varphi\Bigl( \frac{\mathrm{logit}(W) - \mathrm{logit}(O_u^W)}{\omega_u} \Bigr).
\]

A.2 Proofs of Lemmas 1, 6 and Proposition 1

Proof of Lemma 1. Let us consider (10). For any non-negative measurable function $f$ of $(X, W)$, it holds that
\[
E_{P_\varepsilon}\{Y f(X, W)\} = E_P\{Y f(X, W)(1 + \varepsilon s(O))\}
= E_P\{\theta(P)(X, W) f(X, W)\} + \varepsilon E_P\{Y f(X, W) s(O)\}
\]
\[
= E_P\{(\theta(P)(X, W) + \varepsilon E_P(Y s(O) \mid X, W)) f(X, W)\}
= E_{P_\varepsilon}\{h(X, W) f(X, W)\}
\]
for $h(X, W)$ equal to the right-hand side expression of (10), since (9) implies
\[
\frac{dP_\varepsilon}{dP}(X, W) = 1 + \varepsilon E_P(s(O) \mid X, W).
\]
The function $f$ being arbitrarily chosen, the latter equalities yield (10). The remaining relationships are easily proven in the same spirit.

Proof of Lemma 6. Note that
\[
P_0 L_{a,b}(P) = E_{P_0}\{\mathrm{KL}(F_{a,b} \circ \theta(P_0)(X, W), F_{a,b} \circ \theta(P)(X, W))\} + c(P_0),
\]
where $\mathrm{KL}(p, q)$ is the Kullback-Leibler divergence between the Bernoulli distributions of parameters $p, q \in\, ]0,1[$ and $c(P_0)$ is a constant depending on $P_0$ only. Since $\mathrm{KL}(p, q) \geq 0$ with


equality iff $p = q$, we obtain that $\theta(P_0)$ minimizes $P \mapsto P_0 L_{a,b}(P)$ and also that any other minimizer must satisfy $\theta(P)(X, W) = \theta(P_0)(X, W)$ $P_0$-almost surely. The second equality is easily obtained by differentiating.

Proof of Proposition 1. By expanding the squared sum in (1), we obtain that
\[
\Psi(P) = \arg\min_{\beta \in \mathbb{R}} \bigl\{ -2\beta\, E_P\{X(\theta(P)(X, W) - \theta(P)(0, W))\} + \beta^2 E_P\{X^2\} \bigr\},
\]
which straightforwardly yields (2). It is easily seen that $P D_1^\star(P) D_2^\star(P) = 0$, or in other words that the two components are orthogonal in $L_0^2(P)$.

Regarding the pathwise differentiability, it is sufficient to consider paths of the form (9) for arbitrarily chosen $s \in L_0^2(P)$ with $\|s\|_\infty < \infty$. Set such an $s$ and $|\varepsilon| < \|s\|_\infty^{-1}$, $\varepsilon \neq 0$. Using the telescopic equality $a_1/b_1 - a_0/b_0 = (a_1 - a_0)/b_1 - (a_0/b_0)(b_1 - b_0)/b_1$ yields
\[
\varepsilon^{-1}(\Psi(P_\varepsilon) - \Psi(P)) = \frac{T_\varepsilon^1}{\sigma^2(P_\varepsilon)} - \Psi(P)\, \frac{T_\varepsilon^2}{\sigma^2(P_\varepsilon)}, \tag{19}
\]
with
\[
T_\varepsilon^1 = \varepsilon^{-1}\bigl( E_{P_\varepsilon}\{X(\theta(P_\varepsilon)(X, W) - \theta(P_\varepsilon)(0, W))\} - E_P\{X(\theta(P)(X, W) - \theta(P)(0, W))\} \bigr),
\]
\[
T_\varepsilon^2 = \varepsilon^{-1}(\sigma^2(P_\varepsilon) - \sigma^2(P)) = E_P\{s(O) X^2\} \tag{20}
\]
by (13). Now, the same telescopic equality also yields that
\[
T_\varepsilon^1 = E_P\{X(\theta(P_\varepsilon)(X, W) - \theta(P_\varepsilon)(0, W)) s(O)\}
+ E_P\bigl\{ X \bigl( \varepsilon^{-1}(\theta(P_\varepsilon)(X, W) - \theta(P)(X, W)) - \varepsilon^{-1}(\theta(P_\varepsilon)(0, W) - \theta(P)(0, W)) \bigr) \bigr\}.
\]
By (10) and the dominated convergence theorem (indeed, $\{\|\theta(P_\varepsilon)\|_\infty : |\varepsilon| < \|s\|_\infty^{-1}\}$ is bounded),
\[
T_\varepsilon^1 = E_P\{X(\theta(P)(X, W) - \theta(P)(0, W)) s(O)\} + o(\varepsilon)
+ E_P\bigl\{ X \bigl( \varepsilon^{-1}(\theta(P_\varepsilon)(X, W) - \theta(P)(X, W)) - \varepsilon^{-1}(\theta(P_\varepsilon)(0, W) - \theta(P)(0, W)) \bigr) \bigr\}.
\]
Furthermore, (10) also yields that
\[
\varepsilon^{-1}(\theta(P_\varepsilon)(X, W) - \theta(P)(X, W)) = E_P((Y - \theta(P)(X, W)) s(O) \mid X, W) + o(\varepsilon).
\]
Consequently, applying the dominated convergence theorem finally yields (by using the above telescopic equality and (10), one easily checks that $\{\sup_{O \in \mathcal{O}} \varepsilon^{-1} |\theta(P_\varepsilon)(X, W) - \theta(P)(X, W)| : |\varepsilon| < \|s\|_\infty^{-1}\}$ is bounded)
\[
T_\varepsilon^1 = E_P\{X(\theta(P)(X, W) - \theta(P)(0, W)) s(O)\}
+ E_P\{E_P(X(Y - \theta(P)(X, W)) s(O) \mid X, W)
- X E_P((Y - \theta(P)(X, W)) s(O) \mid X = 0, W)\} + o(\varepsilon), \tag{21}
\]
where we emphasize that
\[
E_P((Y - \theta(P)(X, W)) s(O) \mid X = 0, W) = E_P\Bigl( \frac{\mathbf{1}\{X = 0\}}{g(P)(0 \mid W)} (Y - \theta(P)(X, W)) s(O) \,\Bigm|\, W \Bigr).
\]
Combining (19), (20), (21) and (13) teaches us that, for all $s \in L_0^2(P)$ with $\|s\|_\infty < \infty$,
\[
\varepsilon^{-1}(\Psi(P_\varepsilon) - \Psi(P)) = E_P\{D^\star(P)(O) s(O)\} + o(\varepsilon),
\]
where $D^\star(P)$ is defined in the statement of the proposition. In particular, $\Psi$ is pathwise differentiable at $P$ wrt the described collection of paths, and $D^\star(P)$ is a gradient of $\Psi$ at $P$. Since the related tangent space is $L_0^2(P)$ itself, it is necessarily the efficient influence curve.

It remains to prove that $D^\star(P)$ is double-robust. For this purpose, note that
\[
\sigma^2(P') P D^\star(P') - \sigma^2(P)(\Psi(P) - \Psi(P'))
\]
\[
= E_P\{X(\theta(P')(X, W) - \theta(P')(0, W)) - X(\theta(P)(X, W) - \theta(P)(0, W))\}
+ E_P\Bigl\{ \Bigl( X - \mu(P')(W) \frac{\mathbf{1}\{X = 0\}}{g(P')(0 \mid W)} \Bigr) E_P(Y - \theta(P')(X, W) \mid X, W) \Bigr\}
\]
\[
= E_P\{X(\theta(P')(X, W) - \theta(P')(0, W)) - X(\theta(P)(X, W) - \theta(P)(0, W))\}
+ E_P\Bigl\{ \Bigl( X - \mu(P')(W) \frac{\mathbf{1}\{X = 0\}}{g(P')(0 \mid W)} \Bigr) (\theta(P)(X, W) - \theta(P')(X, W)) \Bigr\}
\]
\[
= E_P\Bigl\{ X(\theta(P)(0, W) - \theta(P')(0, W)) - \mu(P')(W) \frac{g(P)(0 \mid W)}{g(P')(0 \mid W)} (\theta(P)(0, W) - \theta(P')(0, W)) \Bigr\}
\]
\[
= E_P\Bigl\{ (\theta(P)(0, W) - \theta(P')(0, W)) \Bigl( \mu(P)(W) - \mu(P')(W) \frac{g(P)(0 \mid W)}{g(P')(0 \mid W)} \Bigr) \Bigr\}.
\]
Now, the right-hand side expression vanishes as soon as either $\theta(P')(0, \cdot) = \theta(P)(0, \cdot)$ or ($\mu(P') = \mu(P)$ and $g(P') = g(P)$). The conclusion readily follows.

A.3 Proof of Lemma 5

Proof of Lemma 5. Assume for the time being that, for all $W \in \mathcal{W}$, there exists $\lambda_n$ such that (17) holds with $\lambda_n$ substituted for $\lambda$. Then, for all $W \in \mathcal{W}$, the point with coordinates $(E_{P_n^0}(X \mid X \neq 0, W), \phi_{n,\lambda_n}(E_{P_n^0}(X \mid X \neq 0, W)))$ lies in the convex envelope of the set $\{(X^{(i)}, X^{(i)2}) : i \leq n\} \setminus \{(0, 0)\}$.
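The convex-envelope membership just invoked can be made concrete: a point $(m_1, m_2)$ in the convex hull of three points $(x_k, x_k^2)$ is their barycenter, and the weights solve a small linear system. A toy sketch (the support points and target moments below are ours, chosen so that the solution is a genuine probability vector):

```python
import numpy as np

# Three distinct nonzero support points and a target (mean, second moment)
# assumed to lie in the convex envelope of {(x, x^2) : x in support}.
x = np.array([1.0, 2.0, 4.0])
m1, m2 = 2.0, 5.0

# Solve sum(p) = 1, sum(p*x) = m1, sum(p*x^2) = m2 for the weights p.
A = np.vstack([np.ones(3), x, x ** 2])
p = np.linalg.solve(A, np.array([1.0, m1, m2]))

# Non-negative weights matching both moments, i.e. the discrete
# distribution sum_k p_k Dirac(x_k) appearing in the proof.
```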
Equivalently, there exist for all $W \in \mathcal{W}$ three non-negative weights $p_1, p_2, p_3$ summing up to 1 and three different values $x^{(1)}, x^{(2)}, x^{(3)} \in \{X^{(i)} : i \leq n\} \setminus \{0\}$ such that
\[
E_{P_n^0}(X \mid X \neq 0, W) = \sum_{k=1}^3 p_k x^{(k)},
\qquad
E_{P_n^0}(X^2 \mid X \neq 0, W) = \sum_{k=1}^3 p_k x^{(k)2},
\]
the right-hand side expressions being, respectively, the mean and second order moment of the distribution $\sum_{k=1}^3 p_k \mathrm{Dirac}(x^{(k)})$. Thus, there exists $P_n^{00} \in \mathcal{M}$ such that (i) and (ii) hold.


Set $W \in \mathcal{W}$. Combining (6), (7) and (17) yields that if there exists $\lambda_n$ such that (17) holds with $\lambda_n$ substituted for $\lambda$, then it must be equal to $l_n = (T_n^1 - \sigma^2(P_n^0))/(T_n^1 - T_n^2)$, where $T_n^1 = (m_n + M_n) E_{P_n^0}\{\mu(P_n^0)(W)\} - m_n M_n E_{P_n^0}\{1 - g(P_n^0)(0|W)\}$ and $T_n^2 = E_{P_n^0}\{\mu(P_n^0)(W)^2/(1 - g(P_n^0)(0|W))\}$. In order to conclude, it is therefore sufficient to check that $l_n \in [0, 1]$.

By the Jensen inequality, it holds that $E_{P_n^0}(X^2 \mid X \neq 0, W) \geq E_{P_n^0}(X \mid X \neq 0, W)^2$, which yields in turn with (6) and (7) that $\sigma^2(P_n^0) \geq T_n^2$. Finally, using again (6) and (7), $\sigma^2(P_n^0) - T_n^1$ equals
\[
E_{P_n^0}\bigl\{(1 - g(P_n^0)(0|W)) \bigl( E_{P_n^0}(X^2 \mid X \neq 0, W) - (m_n + M_n) E_{P_n^0}(X \mid X \neq 0, W) + m_n M_n \bigr)\bigr\}
\]
\[
= E_{P_n^0}\bigl\{(1 - g(P_n^0)(0|W))\, E_{P_n^0}((X - m_n)(X - M_n) \mid X \neq 0, W)\bigr\} \leq -c^2 P_n^0(X \neq 0),
\]
hence $T_n^2 \leq \sigma^2(P_n^0) < T_n^1$. Thus, $l_n \in [0, 1]$, which completes the proof.

A.4 Proofs of Lemmas 2, 3 and 4

Proof of Lemma 2. It is sufficient to verify that, under the stated assumptions,
\[
\limsup_{(\varepsilon, k) \to (0, \infty)} \Bigl| P_n \frac{D^\star(P_n^k)}{1 + \varepsilon D^\star(P_n^k)} - P_n D^\star(P_n^k) \Bigr| = 0.
\]
Now, the absolute value above is straightforwardly upper-bounded by
\[
P_n \Bigl| \frac{\varepsilon M^2}{1 + \varepsilon D^\star(P_n^k)} \Bigr| = \varepsilon M^2 P_n (1 + \varepsilon D^\star(P_n^k))^{-1} \leq \varepsilon M^2/(1 - \rho M) = 2 \varepsilon M^2.
\]
This trivially entails the wished convergence, hence the result.

Let us introduce, for all $k \geq 0$ and $|\varepsilon| \leq \rho$, $l_n^k(\varepsilon) = n^{-1} \sum_{i=1}^n \log P_n^k(\varepsilon)(O^{(i)})$ and
\[
A_n^k(\varepsilon) = -P_n \frac{D^\star(P_n^k)^2}{(1 + \varepsilon D^\star(P_n^k))^2}.
\]
Obviously, the normalized log-likelihood $l_n^k(\varepsilon)$ under $P_n^k(\varepsilon)$ is twice differentiable wrt $\varepsilon$, with first derivative at $\varepsilon = 0$ equal to $P_n D^\star(P_n^k)$ and second derivative at $\varepsilon$ equal to $A_n^k(\varepsilon)$.

Proof of Lemma 3, first part. Let us first show, by contradiction, that $\lim_{k \to \infty} P_n D^\star(P_n^k) = 0$ under the stated assumptions.
Suppose that $P_n D^\star(P_n^k)$ does not converge to 0 as $k \to \infty$: there exist $\eta > 0$ and an increasing function $\phi : \mathbb{N} \to \mathbb{N}$ such that, for all $k \geq 0$,
\[
|P_n D^\star(P_n^{\phi(k)})| \geq \eta > 0. \tag{22}
\]
We show that necessarily $\lim_{k \to \infty} \varepsilon_n^{\phi(k)} = 0$, hence by Lemma 2 that $\lim_{k \to \infty} P_n D^\star(P_n^{\phi(k)}) = 0$, contradicting (22).

Set $k \geq 0$. For any $\varepsilon_n'^{\phi(k)} \in [0, \varepsilon_n^{\phi(k)}]$, a Taylor expansion of $l_n^{\phi(k)}(\varepsilon)$ yields the existence of $\varepsilon' \in [0, \varepsilon_n'^{\phi(k)}]$ such that
\[
l_n^{\phi(k)}(\varepsilon_n^{\phi(k)}) - l_n^{\phi(k)}(0) \geq l_n^{\phi(k)}(\varepsilon_n'^{\phi(k)}) - l_n^{\phi(k)}(0)
= \varepsilon_n'^{\phi(k)} P_n D^\star(P_n^{\phi(k)}) + \frac{(\varepsilon_n'^{\phi(k)})^2}{2} A_n^{\phi(k)}(\varepsilon'). \tag{23}
\]
By assumption (iii) and since
\[
-4M^2 \leq \inf_{k' \geq 0} \inf_{|\varepsilon| \leq \rho} A_n^{k'}(\varepsilon) \leq \sup_{k' \geq 0} \sup_{|\varepsilon| \leq \rho} A_n^{k'}(\varepsilon) \leq -\frac{4}{9} \inf_{k' \geq 0} P_n D^\star(P_n^{k'})^2, \tag{24}
\]
there exists a constant $\kappa > 0$ (depending on $P_n$) such that the right-hand side term of (24) is upper-bounded by $-\kappa$, hence $A_n^{k'}(\varepsilon) \leq -\kappa$ simultaneously for all $k' \geq 0$ and $|\varepsilon| \leq \rho$.

The function $\varepsilon \mapsto \frac{\partial}{\partial \varepsilon} l_n^{\phi(k)}(\varepsilon)$ being decreasing and equal to $P_n D^\star(P_n^{\phi(k)}) \neq 0$ at $\varepsilon = 0$, it necessarily holds that $\varepsilon_n^{\phi(k)} P_n D^\star(P_n^{\phi(k)}) > 0$ (i.e., $\varepsilon_n^{\phi(k)}$ and $P_n D^\star(P_n^{\phi(k)})$ share the same sign), hence $\varepsilon_n'^{\phi(k)} P_n D^\star(P_n^{\phi(k)}) > 0$ too. Furthermore, combining (23) and the left-hand side of (24) yields
\[
l_n^{\phi(k)}(\varepsilon_n^{\phi(k)}) - l_n^{\phi(k)}(0) \geq \varepsilon_n'^{\phi(k)} P_n D^\star(P_n^{\phi(k)}) - 2M^2 (\varepsilon_n'^{\phi(k)})^2
\geq |\varepsilon_n'^{\phi(k)}| \eta - 2M^2 (\varepsilon_n'^{\phi(k)})^2. \tag{25}
\]
The conclusion is now at hand. Assume that the sequence $\{\varepsilon_n^{\phi(k)}\}_{k \geq 0}$ does not converge to 0: there exist $c > 0$ and another increasing function $\psi : \mathbb{N} \to \mathbb{N}$ such that, for all $k \geq 0$, $|\varepsilon_n^{\psi \circ \phi(k)}| \geq c > 0$. Note that $c$ can be chosen small enough to guarantee in addition that $c\eta - 2M^2 c^2 > 0$. Let us impose now $|\varepsilon_n'^{\psi \circ \phi(k)}| = c$ for all $k \geq 0$ (this uniquely defines $\varepsilon_n'^{\psi \circ \phi(k)} \in [0, \varepsilon_n^{\psi \circ \phi(k)}]$). According to (25), for all $k \geq 0$,
\[
l_n^{\psi \circ \phi(k)}(\varepsilon_n^{\psi \circ \phi(k)}) - l_n^{\psi \circ \phi(k)}(0) \geq c\eta - 2M^2 c^2 > 0.
\]
Using (a) $l_n^{k'}(\varepsilon_n^{k'}) - l_n^{k'}(0) \geq 0$ for all $k' \geq 0$ and (b) $l_n^{k'}(0) = l_n^{k'-1}(\varepsilon_n^{k'-1})$ for every $k' \geq 1$, one obtains that for all $k \geq 0$,
\[
l_n^{\psi \circ \phi(k)}(\varepsilon_n^{\psi \circ \phi(k)}) - l_n^0(0) \geq k(c\eta - 2M^2 c^2).
\]
This contradicts assumption (iv). So the sequence $\{\varepsilon_n^{\phi(k)}\}_{k \geq 0}$ must converge to 0, Lemma 2 applies, and (22) is contradicted.

Proof of Lemma 3, second part. For all $k \geq 0$, another Taylor expansion of $l_n^k(\varepsilon)$ yields the existence of $\varepsilon_n'^k \in [0, \varepsilon_n^k]$ such that
\[
0 \leq l_n^k(\varepsilon_n^k) - l_n^k(0) = \varepsilon_n^k P_n D^\star(P_n^k) + \frac{(\varepsilon_n^k)^2}{2} A_n^k(\varepsilon_n'^k).
\]
We derive from these inequalities that
\[
0 \leq (\varepsilon_n^k)^2 \kappa \leq (\varepsilon_n^k)^2 |A_n^k(\varepsilon_n'^k)| \leq 2 \varepsilon_n^k P_n D^\star(P_n^k) \leq \rho |P_n D^\star(P_n^k)|,
\]
where the right-hand side converges to 0 as $k \to \infty$ by virtue of the first part of the lemma. This completes the proof.
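The two facts about $l_n^k$ used above, namely that its first derivative at $\varepsilon = 0$ equals $P_n D^\star(P_n^k)$ and that its second derivative $A_n^k(\varepsilon)$ is negative, are easy to check numerically on a toy fluctuation; the bounded values below stand in for $D^\star(P_n^k)(O^{(i)})$ and are simulated by us.

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.uniform(-0.5, 0.5, size=1000)  # stand-in for D*(P_n^k)(O_i), |D| <= M

def ell(eps):
    # normalized log-likelihood gain of the fluctuation (1 + eps * D)
    return np.mean(np.log1p(eps * D))

h = 1e-4
first = (ell(h) - ell(-h)) / (2 * h)              # close to mean(D) = P_n D*
second = (ell(h) - 2 * ell(0) + ell(-h)) / h**2   # close to -mean(D^2) < 0
```

The concavity (negative second derivative) is what makes the one-dimensional likelihood climbs of the TMLE iterations well behaved.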


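The proof of Lemma 4, which follows, rests on the fact that the density updates $f_n^k = \prod_{k' < k}(1 + \varepsilon_n^{k'} D^\star(P_n^{k'}))$ converge uniformly as soon as the fluctuation functions are bounded and the step sizes are summable. A toy numeric illustration (the bounded functions and the geometric step sizes below are made up for the sketch):

```python
import numpy as np

grid = np.linspace(-1.0, 1.0, 101)  # stand-in for the observation space
M = 1.0
fluct = [M * np.sin((k + 1) * grid) for k in range(30)]  # |D_k| <= M
eps = [0.5 ** (k + 1) for k in range(30)]                # sum |eps_k| < 1

f = np.ones_like(grid)  # density of P_n^0 wrt itself
sup_steps = []
for e, d in zip(eps, fluct):
    f_new = f * (1 + e * d)
    sup_steps.append(np.max(np.abs(f_new - f)))  # sup-norm of each update
    f = f_new
```

Since $\sup|f_{k+1} - f_k| \leq |\varepsilon_k| M \sup f_k$ and the $f_k$ stay bounded, the sup-norm increments are summable, so the sequence is uniformly Cauchy, which is the mechanism the proof exploits.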
Proof of Lemma 4. We first show that the sequence $\{P_n^k\}_{k \geq 0}$ converges in total variation. For this purpose, note that $P_n^k$ is dominated by $P_n^0$, with a density $f_n^k$ characterized by $f_n^k(O) = \prod_{k'=0}^{k-1} (1 + \varepsilon_n^{k'} D^\star(P_n^{k'})(O))$. Since (a) the functions $D^\star(P_n^{k'})$ are uniformly bounded by a common constant $M$, and (b) the series $\sum_{k \geq 0} |\varepsilon_n^k|$ converges, the sequence of densities (wrt $P_n^0$) $\{f_n^k\}_{k \geq 0}$ converges wrt the $\|\cdot\|_\infty$-norm to a limit density (wrt $P_n^0$) that we denote $f_n^*$. Density $f_n^*$ gives rise to a data-generating distribution $P_n^*$, the limit of $P_n^k$ in total variation (hence its weak limit too).

Now, it holds that $\psi_n^k = \Psi(P_n^k) = (E_{P_n^k}\{XY\} - E_{P_n^k}\{X \theta(P_n^k)(0, W)\})/E_{P_n^k}\{X^2\}$. The observed data structure $O$ being bounded, the functions $O = (W, X, Y) \mapsto XY$ and $O = (W, X, Y) \mapsto X^2$ are continuous and bounded, hence $E_{P_n^k}\{XY\}$ and $E_{P_n^k}\{X^2\}$ respectively converge to $E_{P_n^*}\{XY\}$ and $E_{P_n^*}\{X^2\} \geq c$ as $k \to \infty$ by weak convergence. Furthermore, the convergence of $f_n^k$ to $f_n^*$ wrt the $\|\cdot\|_\infty$-norm trivially entails the pointwise convergence of $\theta(P_n^k)$ to $\theta(P_n^*)$, then the wished convergence of $E_{P_n^k}\{X \theta(P_n^k)(0, W)\}$ to $E_{P_n^*}\{X \theta(P_n^*)(0, W)\}$ by the dominated convergence theorem. This completes the proof.

A.5 Proof of Propositions 2 and 3 and of Corollary 1

Denote $D^\star(\sigma^2, \theta, \mu, g, \psi) = D_1^\star(\sigma^2, \theta, \psi) + D_2^\star(\sigma^2, \theta, \mu, g)$, let $D_1(\sigma^2, \theta)$ be characterized by $D_1(\sigma^2, \theta)(O) = X(\theta(X, W) - \theta(0, W))/\sigma^2$, and define $D_1(P) = D_1(\sigma^2(P), \theta(P))$. We use the notation $a \lesssim b$ for "$a$ smaller than $b$ up to a multiplicative constant". Let us start with a useful lemma.

Lemma 8. Suppose that the assumptions of Proposition 2 are met. There exists $\psi_0 \in \mathbb{R}$ such that $\tilde\psi_n^* = \psi_0 + o_P(1)$ (i.e., the TMLE converges in probability).
Moreover, it holds that
\[
P_0 (D_1^\star(\sigma^2(P_n^{k_n}), \theta(P_n^{k_n}), \tilde\psi_n^*) - D_1^\star(\sigma_0^2, \theta_0, \psi_0))^2 = o_P(1), \tag{26}
\]
\[
P_0 (D_2^\star(P_n^{k_n}) - D_2^\star(\sigma_0^2, \theta_0, \mu_0, g_0))^2 = o_P(1), \text{ hence} \tag{27}
\]
\[
P_0 (D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0))^2 = o_P(1). \tag{28}
\]

Proof. Recall that $\|O\|$ is bounded under $P_0$ and that $\sigma^2(P_n^{k_n}), \sigma_0^2 \geq c$. Using repeatedly the telescopic equality $a_1/b_1 - a_0/b_0 = (a_1 - a_0)/b_1 - (a_0/b_0)(b_1 - b_0)/b_1$ and the inequality $(a+b)^2 \leq 2(a^2 + b^2)$ yields that, under $P_0$, $(D_1(P_n^{k_n}) - D_1(\sigma_0^2, \theta_0))(O)^2 \lesssim (\theta(P_n^{k_n}) - \theta_0)(O)^2 + (\theta(P_n^{k_n})(0, \cdot) - \theta_0(0, \cdot))(O)^2$, and therefore that
\[
P_0 (D_1(P_n^{k_n}) - D_1(\sigma_0^2, \theta_0))^2 = o_P(1). \tag{29}
\]
Similarly, the same tricks as above and the facts that (a) both $|(Y - \theta(P_n^{k_n})(X, W))/\sigma^2(P_n^{k_n})|$ and $|(Y - \theta(P_0)(X, W))/\sigma_0^2|$ are upper-bounded under $P_0$, and (b) $g(P_n^{k_n})(0|W), g_0(0|W) \geq c$ imply that, under $P_0$, $(D_2^\star(P_n^{k_n}) - D_2^\star(\sigma_0^2, \theta_0, \mu_0, g_0))(O)^2 \lesssim (\mu(P_n^{k_n}) - \mu_0)(O)^2 + (g(P_n^{k_n})(0|\cdot) - g_0(0|\cdot))(O)^2$, hence (27).

Now, let us rewrite $P_n D^\star(P_n^{k_n}) = o_P(1)$ as
\[
\tilde\psi_n^* \frac{E_{P_n}\{X^2\}}{\sigma^2(P_n^{k_n})} = (P_n - P_0)(D_1(P_n^{k_n}) + D_2^\star(P_n^{k_n}))
+ P_0 (D_1(P_n^{k_n}) + D_2^\star(P_n^{k_n}) - D_1(\sigma_0^2, \theta_0) - D_2^\star(\sigma_0^2, \theta_0, \mu_0, g_0))
+ P_0 (D_1(\sigma_0^2, \theta_0) + D_2^\star(\sigma_0^2, \theta_0, \mu_0, g_0)) + o_P(1) \tag{30}
\]
and consider the two first right-hand side terms. Because $D_1(\sigma^2, \theta)(O) = D_1^\star(\sigma^2, \theta, \psi)(O) + X^2\psi/\sigma^2$ and the class $\{O \mapsto X^2\psi/\sigma^2 : (\psi, \sigma^2) \in \mathbb{R} \times [c, \infty]\}$ is $P_0$-Donsker, it holds that both $D_1(P_n^{k_n})$ and $D_2^\star(P_n^{k_n})$ belong to a $P_0$-Donsker class with $P_0$-probability tending to 1, hence so does $D_1(P_n^{k_n}) + D_2^\star(P_n^{k_n})$. Therefore, by (29), (27) and Lemma 19.24 in [25], the first term is $O_P(1/\sqrt{n}) = o_P(1)$. Combining (29) and (27) with the Cauchy-Schwarz inequality yields in turn that the second term is $o_P(1)$. Finally, the law of large numbers and the fact that $\sigma_0^2 \geq c$ entail that $E_{P_n}\{X^2\}/\sigma^2(P_n^{k_n}) = \sigma^2(P_0)/\sigma_0^2 \times (1 + o_P(1))$. Consequently, we deduce from (30) that there exists $\psi_0 \in \mathbb{R}$ such that $\tilde\psi_n^* = \psi_0 + o_P(1)$.

Because $D_1^\star(\sigma^2, \theta, \psi)(O) = D_1(\sigma^2, \theta)(O) - X^2\psi/\sigma^2$, (26) easily follows from (29) and the convergence in probability of $\tilde\psi_n^*$ and $\sigma^2(P_n^{k_n})$ to $\psi_0$ and $\sigma_0^2$. Finally (26) and (27) imply (28), thus concluding the proof.

Proof of Proposition 2. Let us first rewrite $P_n D^\star(P_n^{k_n}) = o_P(1/\sqrt{n})$ as
\[
P_0 D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0) = -(P_n - P_0) D^\star(P_n^{k_n})
- P_0 (D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0)) + o_P(1/\sqrt{n}). \tag{31}
\]
Since $D_1^\star(P_n^{k_n})$ and $D_2^\star(P_n^{k_n})$ belong to a $P_0$-Donsker class with $P_0$-probability tending to 1, so does $D^\star(P_n^{k_n})$. Therefore, (28) of Lemma 8 and Lemma 19.24 in [25] yield that the first right-hand side term in (31) is $O_P(1/\sqrt{n}) = o_P(1)$. Moreover, (28) of Lemma 8 and the Cauchy-Schwarz inequality imply that the second right-hand side term is $o_P(1)$. Consequently, the deterministic quantity $P_0 D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0)$ is equal to 0, and the conditions on $(\theta_0, \mu_0, g_0)$ ensure that necessarily $\psi_0 = \Psi(P_0)$, i.e., that the TMLE $\tilde\psi_n^*$ is consistent.

Proof of Proposition 3. Let us resume the previous proof where we left it. The fundamental relationship of this proof, derived from the equalities $P_0 D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0) = 0$ and $P_n D^\star(P_n^{k_n}) = o_P(1/\sqrt{n})$, is
\[
-P_0 (D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0)) = (P_n - P_0) D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*)
+ (P_n - P_0)(D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*))
+ P_0 (D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*)) + o_P(1/\sqrt{n}), \tag{32}
\]
where the left-hand side term obviously equals $(\tilde\psi_n^* - \psi_0)\sigma^2(P_0)/\sigma_0^2$. Let us consider now the first right-hand side term in (32). Since (a) $\{D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi) : \psi \in \mathbb{R}\}$ is a $P_0$-Donsker class and (b) $P_0 (D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0))^2 = (\tilde\psi_n^* - \psi_0)^2 E_{P_0}\{X^4\}/\sigma_0^4 = o_P(1)$, it holds that $(P_n - P_0) D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*) = (P_n - P_0) D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \psi_0) + o_P(1/\sqrt{n})$


by Lemma 19.24 in [25]. Regarding the second right-hand side term in (32), note (a) that
\[
(D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*))(O)
= (D_1(P_n^{k_n}) + D_2^\star(P_n^{k_n}))(O) + (1/\sigma_0^2 - 1/\sigma^2(P_n^{k_n})) X^2 \tilde\psi_n^*
- (D_1(\sigma_0^2, \theta_0) + D_2^\star(\sigma_0^2, \theta_0, \mu_0, g_0))(O),
\]
(b) that we have already shown that the first random function between parentheses belongs to a $P_0$-Donsker class with $P_0$-probability tending to 1, (c) that the second random function belongs to the $P_0$-Donsker class $\{O \mapsto (1/\sigma_0^2 - 1/\sigma^2) X^2 \psi : (\psi, \sigma^2) \in \mathbb{R} \times [c, \infty]\}$, and (d) that the last function of the decomposition is deterministic. Therefore, $D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*)$ belongs to a $P_0$-Donsker class with $P_0$-probability tending to 1. Now, by applying repeatedly the inequality $(a+b)^2 \leq 2(a^2 + b^2)$ we deduce that
\[
P_0 (D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*))^2
\lesssim P_0 (D_1(P_n^{k_n}) - D_1(\sigma_0^2, \theta_0))^2 + P_0 (D_2^\star(P_n^{k_n}) - D_2^\star(\sigma_0^2, \theta_0, \mu_0, g_0))^2
+ (\tilde\psi_n^*)^2 E_{P_0}\{(1/\sigma_0^2 - 1/\sigma^2(P_n^{k_n}))^2 X^4\}.
\]
But $\|O\|$ is bounded under $P_0$ and $\sigma^2(P_n^{k_n}), \sigma_0^2 \geq c$, so that $E_{P_0}\{(1/\sigma_0^2 - 1/\sigma^2(P_n^{k_n}))^2 X^4\} \lesssim (\sigma^2(P_n^{k_n}) - \sigma_0^2)^2 = o_P(1)$. This fact combined with (29), (27) and $\tilde\psi_n^* = O_P(1)$ yields that $P_0 (D^\star(P_n^{k_n}) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*))^2 = o_P(1)$. Consequently, Lemma 19.24 in [25] implies that the second right-hand side term in (32) is $o_P(1/\sqrt{n})$.

Let us turn now to the last right-hand side term in (32). It is easily seen that
\[
D^\star(P_n^{k_n})(O) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \tilde\psi_n^*)(O)
= D^\star(\sigma^2(P_n^{k_n}), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0))(O) - D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \Psi(P_0))(O)
- (1/\sigma^2(P_n^{k_n}) - 1/\sigma^2(P_0)) X^2 (\tilde\psi_n^* - \Psi(P_0)),
\]
where $(1/\sigma^2(P_n^{k_n}) - 1/\sigma^2(P_0))(\tilde\psi_n^* - \Psi(P_0)) = O_P(1/\sqrt{n}) \times o_P(1) = o_P(1/\sqrt{n})$.
Using that $P_0 D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \Psi(P_0)) = 0$, the previous display yields that the third right-hand side term in (32) equals
\[
P_0 D^\star(\sigma^2(P_n^{k_n}), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0)) + o_P(1/\sqrt{n})
= P_0 D^\star(\sigma_0^2, \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0))(1 + o_P(1)) + o_P(1/\sqrt{n}).
\]
In summary, we just showed that
\[
(\tilde\psi_n^* - \Psi(P_0))\sigma^2(P_0)/\sigma_0^2 = (P_n - P_0) D^\star(\sigma_0^2, \theta_0, \mu_0, g_0, \Psi(P_0))
+ P_0 D^\star(\sigma_0^2, \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0))(1 + o_P(1)) + o_P(1/\sqrt{n}),
\]
hence the stated relationship.

Proof of Corollary 1. This result relies on the decomposition
\[
P_0 D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0))
= P_0 \bigl( D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu(P_n^{k_n}), g(P_n^{k_n}), \Psi(P_0)) - D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu_0, g_0, \Psi(P_0)) \bigr)
+ P_0 \bigl( D^\star(\sigma^2(P_0), \theta(P_n^{k_n}), \mu_0, g_0, \Psi(P_0)) - D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0)) \bigr),
\]
where we use that $P_0 D^\star(\sigma^2(P_0), \theta_0, \mu_0, g_0, \Psi(P_0)) = 0$. Following the lines of the proof of Lemma 8 and using the Cauchy-Schwarz inequality yield that the first term of this decomposition is upper-bounded (up to a multiplicative constant) by the square root of
\[
P_0 (\theta(P_n^{k_n})(0, \cdot) - \theta(P_0)(0, \cdot))^2 \times \bigl( P_0 (\mu(P_n^{k_n}) - \mu_0)^2 + P_0 (g(P_n^{k_n})(0|\cdot) - g_0(0|\cdot))^2 \bigr),
\]
while the second term equals zero. Thus the left-hand side expression is $o_P(1/\sqrt{n})$ by assumption, (32) yields the asymptotic linear expansion, and the central limit theorem completes the proof.

References

[1] J. Andrews, W. Kennette, J. Pilon, A. Hodgson, A. B. Tuck, A. F. Chambers, and D. I. Rodenhiser. Multi-platform whole-genome microarray analyses refine the epigenetic signature of breast cancer metastasis with gene expression and copy number. PLoS ONE, 5(1):e8665, January 2010.

[2] F. S. Collins and A. D. Barker. Mapping the cancer genome. Scientific American, 296(3):50-57, March 2007.

[3] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien, 2011. URL http://cran.r-project.org/web/packages/e1071/index.html. R package version 1.6.

[4] T. Hastie. gam: Generalized additive models, 2011. URL http://cran.r-project.org/web/packages/gam/index.html. R package version 1.04.1.

[5] P. A. Jones and S. B. Baylin. The epigenomics of cancer. Cell, 128(4):683-692, February 2007.

[6] C. Kooperberg. polspline: Polynomial spline routines, 2010. URL http://cran.r-project.org/web/packages/polspline/index.html. R package version 1.1.5.

[7] C. L. Lawson and R. J. Hanson. Solving least squares problems, volume 15. Society for Industrial Mathematics, 1995.

[8] L. M. Le Cam. Théorie asymptotique de la décision statistique. Séminaire de Mathématiques Supérieures, No. 33 (Été, 1968). Les Presses de l'Université de Montréal, Montreal, Que., 1969.

[9] A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002. URL http://CRAN.R-project.org/doc/Rnews/.

[10] R. Louhimo and S. Hautaniemi. CNAmet: an R package for integrating copy number, methylation and expression data. Bioinformatics, 27(6):887, 2011.


[11] J. R. Pollack, T. Sørlie, C. M. Perou, C. A. Rees, S. S. Jeffrey, P. E. Lonning, R. Tibshirani, D. Botstein, A.-L. Børresen-Dale, and P. O. Brown. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A, 99(20):12963-12968, October 2002.

[12] E. Polley and M. J. van der Laan. SuperLearner, 2011. URL http://CRAN.R-project.org/package=SuperLearner. R package version 2.0-4.

[13] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010. URL http://www.R-project.org. ISBN 3-900051-07-0.

[14] J. M. Robins and A. Rotnitzky. Comment on "Inference for semiparametric models: some questions and an answer" by P. J. Bickel and J. Kwon. Statistica Sinica, 11:920-935, 2001.

[15] J. M. Robins, S. D. Mark, and W. K. Newey. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48(2):479-495, 1992.

[16] T. Speed. From expression profiling to putative master regulators. UC Berkeley Statistics and Genomics Seminar, February 5th, 2009.

[17] Z. Sun, Y. W. Asmann, K. R. Kalari, B. Bot, J. E. Eckel-Passow, T. R. Baker, J. M. Carr, I. Khrebtukova, S. Luo, L. Zhang, et al. Integrated analysis of gene expression, CpG island methylation, and gene copy number in breast cancer cells by deep sequencing. PLoS ONE, 6(2):e17490, 2011.

[18] The Cancer Genome Atlas (TCGA) Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455:1061-1068, 2008.

[19] The Cancer Genome Atlas (TCGA) Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474(7353):609-615, 2011.

[20] C. Tuglus and M. J. van der Laan. Targeted methods for biomarker discovery. In Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Verlag, 2011.

[21] M. J. van der Laan. Statistical inference for variable importance. Int. J. Biostat., 2:Article 2, 2006.

[22] M. J. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Verlag, 2011.

[23] M. J. van der Laan and D. Rubin. Targeted maximum likelihood learning. Int. J. Biostat., 2:Article 11, 2006.

[24] M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Stat. Appl. Genet. Mol. Biol., 6:Article 25, 2007.

[25] A. W. van der Vaart. Asymptotic statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.

[26] W. N. van Wieringen and M. A. van de Wiel. Nonparametric testing for DNA copy number induced differential mRNA gene expression. Biometrics, 5(1):19-29, March 2008.

[27] X. V. Wang, R. G. W. Verhaak, E. Purdom, P. T. Spellman, and T. P. Speed. Unifying gene expression measures from multiple platforms using factor analysis. PLoS ONE, 6(3):e17691, 2011.

[28] Z. Yu and M. J. van der Laan. Measuring treatment effects using semiparametric models. Technical report, Division of Biostatistics, University of California, Berkeley, 2003.


(a) full-SL

  iteration of the TMLE procedure        k = 0    k = 1    k = 2    k = 3
  gain in relative error                 0        0.0469   0.0625   0.0335
  gain in relative mean square error     0        0.0365   0.0369   0.0035
  Lilliefors test statistic              0.0183   0.0269   0.0298   0.0282
  Lilliefors test p-value                0.5718   0.0861   0.0365   0.0582
  KS test statistic                      0.1566   0.0782   0.0743   0.0786
  empirical coverage                     -        0.896    0.905    0.898
  empirical coverage (optimistic)        -        0.914    0.920    0.916

(b) light-SL

  iteration of the TMLE procedure        k = 0    k = 1    k = 2    k = 3
  gain in relative error                 0        0.2871   0.2837   0.2866
  gain in mean square error              0        0.2352   0.2293   0.2305
  Lilliefors test statistic              0.0253   0.0224   0.0218   0.0295
  Lilliefors test p-value                0.1251   0.2620   0.2999   0.0400
  KS test statistic                      0.4227   0.1327   0.1451   0.1377
  empirical coverage                     -        0.936    0.938    0.929
  empirical coverage (optimistic)        -        0.945    0.948    0.941

Table 3: Testing the asymptotic normality of $\psi_n^k$ and the validity of the coverage provided by $[\psi_n^k \pm \xi_{1-\alpha/2}\, s_n^k/\sqrt{n}]$, with $(s_n^k)^2 = \mathrm{Var}_{P_n} D^\star(P_n^k)(O)$ for $k = 0, 1, 2, 3$, (a) for the full-SL procedure and (b) for the light-SL procedure. We report the gains in relative error and mean square error (first and second rows), the test statistics and corresponding p-values of Lilliefors tests of normality (third and fourth rows), the test statistics of the KS test of normality with null mean and variance equal to $\psi_B(P^s)$ and $v_B(P^s)$ (fifth rows; the corresponding p-values are all smaller than $10^{-4}$), and finally the empirical coverages $c_n^k = \frac{1}{B'} \sum_{b=1}^{B'} \mathbf{1}\{\psi_B(P^s) \in [\psi_{n,b}^k \pm \xi_{1-\alpha/2}\, s_{n,b}^k/\sqrt{n}]\}$ as well as their optimistic counterparts $c_n^{k+} = \frac{1}{B'} \sum_{b=1}^{B'} \mathbf{1}\{[\psi_B(P^s) \pm \xi_{0.975} \sqrt{v_B(P^s)/B}] \cap [\psi_{n,b}^k \pm \xi_{1-\alpha/2}\, s_{n,b}^k/\sqrt{n}] \neq \emptyset\}$ (sixth and seventh rows).
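The empirical coverage $c_n^k$ reported in Table 3 is simply the proportion of simulated confidence intervals that contain the target. A minimal sketch of that computation (all numbers below are simulated stand-ins, not the article's values):

```python
import numpy as np

rng = np.random.default_rng(1)
B_prime, n = 1000, 500
psi_true = 2.0           # stand-in for psi_B(P^s)
xi = 1.96                # approx. 97.5% standard normal quantile (alpha = 5%)

# B' replications of a root-n consistent, asymptotically normal estimator
# together with (here: exact) standard-error estimates s_b.
psi_hat = psi_true + rng.normal(scale=1.0, size=B_prime) / np.sqrt(n)
s_hat = np.ones(B_prime)

# c_n = (1/B') * sum_b 1{psi_true in [psi_hat_b +/- xi * s_b / sqrt(n)]}
half = xi * s_hat / np.sqrt(n)
coverage = np.mean(np.abs(psi_hat - psi_true) <= half)
```

With valid intervals at level $\alpha = 5\%$, the empirical coverage should fluctuate around $0.95$, which is the benchmark against which the table's values (e.g., 0.896 to 0.948) are judged.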


<strong>Antoine</strong> Chambaz, décembre 2011
