
Radically Elementary Probability and Statistics

By

Charles J. Geyer

Technical Report No. 657
School of Statistics
University of Minnesota

April 16, 2007


Radically Elementary Probability and Statistics

Charles J. Geyer

Copyright 2004–2007 by Charles J. Geyer

April 16, 2007


Contents

Preface

Part I  Nonstandard Analysis

1  Introduction

2  The Natural Numbers
   2.1  Axioms
   2.2  Arithmetic
   2.3  Order
   2.4  Illegal Set Formation

3  Real Numbers
   3.1  Order
   3.2  Arithmetic
   3.3  Summary of Arithmetic
   3.4  Overspill

4  Calculus
   4.1  Convergence
   4.2  Continuity
   4.3  Summation
   4.4  Integration
   4.5  Derivability

Part II  Probability

5  Radically Elementary Probability Theory
   5.1  Introduction
   5.2  Unconditional Probability
        5.2.1  Probability
        5.2.2  Expectation
   5.3  Conditional Probability
        5.3.1  Conditional Expectation
        5.3.2  Conditional Probability
   5.4  Distribution Functions
   5.5  Probability Measures

6  More Radically Elementary Probability Theory
   6.1  Almost Surely
        6.1.1  Infinitesimal Almost Surely
        6.1.2  Limited Almost Surely
   6.2  L¹ Random Variables
        6.2.1  The Radon-Nikodym Theorem
        6.2.2  The Lebesgue Theorem
        6.2.3  Lᵖ Random Variables
        6.2.4  Conditional Expectation
        6.2.5  The Fubini Theorem

7  Stochastic Convergence
   7.1  Almost Sure Convergence
   7.2  Convergence in Probability
   7.3  Almost Sure Near Equality
   7.4  Near Equivalence
   7.5  Convergence in Distribution

8  The Central Limit Theorem
   8.1  Independent and Identically Distributed
   8.2  The De Moivre-Laplace Theorem

9  Near Equivalence in Metric Spaces
   9.1  Metric Spaces
   9.2  Probability Measures
   9.3  The Prohorov Metric
   9.4  Near Equivalence and the Prohorov Metric
   9.5  The Portmanteau Theorem
   9.6  Continuous Mapping
   9.7  Product Spaces

10  Distribution Functions
    10.1  The Lévy Metric
    10.2  Near Equivalence
    10.3  General Normal Distributions

11  Characteristic Functions
    11.1  Definitions
    11.2  Convergence I
    11.3  The Discrete Fourier Transform
    11.4  The Double Exponential Distribution
    11.5  Convolution
    11.6  Convergence II
         11.6.1  One-Dimensional
         11.6.2  Limited-Dimensional

Part III  Statistics

12  De Finetti's Theorem
    12.1  Exchangeability
    12.2  A Simple De Finetti Theorem
    12.3  Philosophical Interpretation
    12.4  A Fancier De Finetti Theorem

13  Almost Sure Convergence
    13.1  The Law of Large Numbers
    13.2  The Glivenko-Cantelli Theorem
    13.3  Prohorov Consistency
    13.4  Discussion
         13.4.1  Kolmogorov-Style Pathology
         13.4.2  Near Tightness and Topology
         13.4.3  Prohorov Consistency and Glivenko-Cantelli
         13.4.4  Statistical Inference


Preface

The book Radically Elementary Probability Theory (Nelson, 1987, Preface)

    is an attempt to lay new foundations for probability theory using a
    tiny bit of nonstandard analysis. The mathematical background is
    little more than that which is taught in high school, and it is my
    hope that it will make deep results from the modern theory of
    stochastic processes readily available to anyone who can add,
    multiply, and reason.

The "tiny bit of nonstandard analysis" is really "tiny," entailing little more than a rigorous notion of infinitesimals. Nelson (1987) introduces it in three brief chapters. We give a more belabored treatment (our Part I), but almost everything one needs to know about nonstandard analysis is the arithmetic rules for infinitesimal, appreciable, and unlimited numbers (a number is unlimited if its reciprocal is infinitesimal, and a number is appreciable if it is neither infinitesimal nor unlimited) given in the tables in our Section 3.3, the principle of external induction — an axiom of the nonstandard analysis used in Nelson (1987) and this book (Axiom IV in Section 2.1) — and the principle of overspill (our Section 3.4).

With this tiny bit of nonstandard analysis in hand Nelson (1987) radically simplifies probability theory, adopting two principles (explained in our Section 5.1). All probability spaces have

(i) finite sample space and

(ii) no nonempty events of probability zero.

These have the consequence that expectations always exist and are given by finite sums, conditional expectations are unique and given by a simple formula, also involving a finite sum, our equation (5.5), and no measure theory is necessary.

One might think that this "radical" simplification is too radical — throwing the baby out with the bathwater — but Nelson (1987) and this book provide some evidence that this is not so. Our theory has no continuous random variables, nor even discrete random variables with infinite sample space, hence no normal, exponential, Poisson, and so forth random variables. Nevertheless, we shall see that finite approximations satisfactorily take their place.


Consider a Binomial(n, p) random variable X such that neither p nor 1 − p is infinitesimal and n is unlimited. Then (the Nelson-style analog of) the central limit theorem says that (X − np)/√(np(1 − p)) has a distribution that is "nearly normal" in the sense that the distribution function of this random variable differs from the distribution function of the normal distribution in conventional probability only by an infinitesimal amount at any point (our Theorem 8.2).

Consider a Binomial(n, p) random variable X such that p is infinitesimal but np is appreciable. Then X has a distribution that is "nearly Poisson" in the sense that the distribution function of this random variable differs from the distribution function of the Poisson(np) distribution in conventional probability only by an infinitesimal amount at any point (unfortunately this result is in a yet to be written chapter of this book, but it is easily proved).

Consider a Geometric(p) random variable X such that p is infinitesimal, and choose a (necessarily infinitesimal) number ɛ such that Y = ɛX has appreciable expectation µ. Then Y has a distribution that is "nearly exponential" in the sense that the distribution function of this random variable differs from the distribution function of the Exponential(1/µ) distribution in conventional probability only by an infinitesimal amount at any point.

Thus although Nelson-style probability theory does not have continuous random variables, it does have discrete analogs that take their place. One is entitled to ask: what is the point? Nelson makes one point in the text quoted above: the level of mathematical difficulty is much lower. In teaching probability at levels below the Ph. D. we are used to starting with Kolmogorov's axioms and Venn diagrams and mutually exclusive events and — just when the students are thoroughly confused — dropping the whole subject. When expectation is introduced, it is entirely the nineteenth-century, pre-Kolmogorov notion. Many topics are discussed with allusions to Kolmogorov-style theory but without rigor, because measure-theoretic abstraction is deemed too difficult for the level of the course. Nelson-style theory avoids the necessity for all this handwaving. It provides a different kind of rigor that is understandable, perhaps (as Nelson claims) even by high school students.

We would not like to make the wrong impression. As Nelson (1987, p. 13) says, all mathematics involves abstractions. The notion of an unlimited number (which is larger than any number you could actually name) is just as much an abstraction as an infinite sequence. The question is which abstraction we wish to use. If we choose to use actually infinite sequences, then measure theory is unavoidable. If we choose to use nonstandard analysis, then the law of large numbers (Nelson, 1987, Chapter 16) can be stated without reference to measure theory. The mathematical content is not exactly the same, but the same purposes are served. Both laws of large numbers (Kolmogorov and Nelson) serve equally well in applications, such as justifying statistical inference. Neither flavor of abstraction makes the proof of the theorem trivial.
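Returning for a moment to the binomial approximations described at the start of this preface: they are easy to illustrate numerically. Here is a small Python sketch (ours, not part of Nelson's development or of this book); it uses scipy, the values of n and p are arbitrary choices, and a large finite n merely stands in for an "unlimited" one.

    # Compare the Binomial(n, p) distribution function with its normal and
    # Poisson approximations at every point, as in the preface discussion.
    import numpy as np
    from scipy import stats

    n, p = 10_000, 0.3            # stand-in for unlimited n, appreciable p
    k = np.arange(n + 1)
    z = (k - n * p) / np.sqrt(n * p * (1 - p))
    print(np.max(np.abs(stats.binom.cdf(k, n, p) - stats.norm.cdf(z))))

    n, p = 10_000, 2e-4           # stand-in for infinitesimal p, appreciable np
    k = np.arange(50)
    print(np.max(np.abs(stats.binom.cdf(k, n, p) - stats.poisson.cdf(k, n * p))))

Both printed sup-distances are small and shrink as n grows, which is the finite shadow of "differs only by an infinitesimal amount at any point."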
Charlie's law of conservation of mathematical difficulty says that, no matter what abstraction you use, at the end of the day you still have to prove something is less than ɛ, and that calculation is always essentially the same. Neither Nelson (1987) nor this book is easy reading. I admit that most American undergraduates (never mind high school students) are not comfortable with mathematical arguments that run over several pages with many steps. Regardless of the level of abstraction, such arguments are unavoidable. Nevertheless Nelson (1987) is a remarkable tour de force. In 79 small-format pages he goes from nothing to what he calls the de Moivre-Laplace-Lindeberg-Feller-Wiener-Lévy-Doob-Erdös-Kac-Donsker-Prohorov theorem, which

    is a version of the de Moivre-Laplace central limit theorem that
    contains Lindeberg's theorem on the sufficiency of his condition,
    Feller's theorem on its necessity, Wiener's theorem on the continuity
    of the trajectories for his process, the Lévy-Doob characterization
    of it as the only normalized martingale with continuous trajectories,
    and the invariance principle of Erdös and Kac as extended by Donsker
    and Prohorov.

Nelson ends by saying

    This is an arbitrary stopping point. More can be done. I hope
    that someone will write a truly elementary book on stochastic
    processes along these lines, complete with exercises and
    applications.

I cannot claim that this is that book. At least this book is "along these lines." We follow Nelson in using only "radically elementary" nonstandard analysis based on four axioms from Nelson (1987) (our Section 2.1) and only "radically elementary" probability theory based on principles (i) and (ii) above. We have filled in many details left untouched by Nelson, in particular the relationship of the central limit theorem mentioned above, which is the climax of Nelson (1987), to x ↦ exp(−x²/2)/√(2π). We also do weak convergence on arbitrary metric spaces, the Prohorov metric, the Lévy metric, the portmanteau theorem, Slutsky's theorem, the continuous mapping theorem, and the Glivenko-Cantelli theorem.

However, my interests are not in stochastic processes, except for spatial point processes and Markov chains, but in statistics. The original idea of this book was to rewrite statistics in "radically elementary" fashion, but I have only barely begun that task. The reason this incomplete draft is being turned into a technical report is so that the thesis of Bernardo Borba de Andrade, which deals with the "radically elementary" approach to Markov chains, can cite some of the results in the part of my planned book that now exists (this technical report), in particular, the relationship between characteristic functions and convergence in distribution (Chapter 11) that he uses to prove the central limit theorem for alpha-mixing stationary processes. I hope that some of the planned additions to this book will eventually be written.

Whether the "radically elementary" or "Nelson-style" approach is better or worse than the "conventional" or "Kolmogorov-style" approach is something we cannot yet say. I can only hope that a reader of Nelson (1987) and this book will concede that our approach has promise. It is a huge task to rewrite all of probability theory in the new style. Many people will not think it worth the effort even if the new style is simpler, more elegant, and easier to understand. Moreover, it must be conceded that the new style is not always more elegant. In particular, I had a great deal of difficulty proving the Lévy continuity theorem (our Theorem 11.8) because characteristic functions are messier objects when only discrete random variables are allowed: compare the form (11.11) of the characteristic function of the discrete double exponential distribution to t ↦ 1/(1 + t²/α²) of its continuous analog, and the inelegant condition (11.16) that is required for the former to be well approximated by the latter. Other characteristic functions are worse. Generally one has no closed form expression, messy or otherwise.

The new style also takes a lot of getting used to. It is hard to abandon the unique normal distribution of conventional theory. In our new theory we call "normal" any distribution whose distribution function differs from the distribution function of the normal distribution in conventional probability only by an infinitesimal amount at any point. This huge class of distributions is very like the conventional normal distribution in some respects and very unlike it in others. In particular, the moments need be nothing like those of a conventional normal distribution. Even if we add the condition that our "Nelson-style" normal distributions be (Nelson-style) L² so that first and second moments agree (to within an infinitesimal amount) with those of the conventional normal distribution (our Theorem 10.7), higher moments need not agree.

Now is this a good thing or a bad thing? Originally, the normal distribution was not thought of as a real distribution but only as a limit arising in the de Moivre-Laplace theorem. Gradually, through the work of Gauss, Quetelet, and others the normal distribution was reified, so we now think of it as a real distribution. But much of late twentieth century statistics, especially nonparametrics and robustness, had the objective of knocking the normal distribution off its pedestal. Is it part of the problem or part of the solution that E(X⁴) = 3σ⁴ for the normal distribution? Confidence intervals for the population variance based on the F distribution give incorrect results, even asymptotically incorrect, unless the population fourth central moment is exactly three times the population second central moment squared. We have no reason to believe that will be true in real applications. A well-known introductory textbook (Moore and McCabe, 1989, pp. 568–569) said

    The F test and other procedures for inference about variances
    are so lacking in robustness as to be of little use in practice.

reproducing figures from Pearson and Please (1975) as evidence. So is the uniqueness of the normal distribution in conventional theory a benefit or a trap for the unwary? Here it seems to be a trap. The theory of "exact" confidence intervals based on the F distribution (assuming "exact" normality) is elegant but worthless in application precisely because "exact" normality is too much to ask. In Nelson-style theory the unique continuous normal distribution goes back to being only a limit that is never reached and doesn't really exist, so we are not tempted to take it too seriously.
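The non-robustness Moore and McCabe describe is easy to reproduce by simulation. The sketch below (our illustration, not from this book or from Pearson and Please) checks the coverage of the standard normal-theory interval for a single variance, the one-sample chi-square cousin of the F procedures; the sample size, error rate, and the choice of a t distribution with 5 degrees of freedom are arbitrary.

    # Coverage of the normal-theory confidence interval for a variance,
    # under normal data and under heavy-tailed t(5) data.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    n, alpha, reps = 50, 0.05, 10_000
    lo_q = stats.chi2.ppf(alpha / 2, df=n - 1)
    hi_q = stats.chi2.ppf(1 - alpha / 2, df=n - 1)

    def coverage(draw, true_var):
        hits = 0
        for _ in range(reps):
            s2 = draw(n).var(ddof=1)   # sample variance
            hits += (n - 1) * s2 / hi_q <= true_var <= (n - 1) * s2 / lo_q
        return hits / reps

    print(coverage(lambda size: rng.normal(size=size), 1.0))          # about 0.95
    print(coverage(lambda size: rng.standard_t(5, size=size), 5 / 3)) # well below 0.95

The second coverage stays far from the nominal 95% no matter how large n is, because the fourth central moment of the t(5) distribution is not 3σ⁴.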


Many other cases could be considered where conventional theory "takes infinity too seriously." Marginalization paradoxes arising from the use of improper priors in Bayesian inference (Dawid, et al., 1973) are an example. Nelson-style improper priors do not exist because any positive measure on a finite sample space can be renormalized so that it is a probability measure. There are Nelson-style analogs to improper priors. For example, the uniform distribution on a grid of points that has infinitesimal spacing and extends to unlimited numbers in both positive and negative directions behaves in some respects just like Lebesgue measure as an improper prior. But since the Nelson-style "improper" prior is in fact proper (a probability distribution), no paradoxes can arise. Here the supposed simplicity of conventional theory is not simple at all. It leads only to confusion and incoherence. The extreme technical difficulty of deciding when an improper prior leads to sensible inference (Eaton, 1992) is notorious. In Nelson-style theory, the issue simply does not arise.

The same point has been made about using finitely additive probability theory by Sudderth (1980). Nelson-style "radically elementary" probability theory is in this respect and some other respects — especially in Nelson's definition of "almost surely" (see our Section 6.1) — much like finitely additive probability theory because, of course, if the sample space is finite, then countable additivity is vacuous. In other respects, Nelson-style theory is not much like finitely additive theory. As we have stressed, in Nelson-style theory the level of abstraction is much lower than in conventional Kolmogorov-style theory. The level of abstraction of finitely additive theory tends to be much higher than that of conventional theory.

For a long time I have wondered why probability theory is the way it is. Why is it based on Kolmogorov (1933)? As mentioned above, finitely additive theory has been studied, although nowhere near as intensively as the Kolmogorov-style countably additive theory. Kolmogorov himself (along with others) initiated other approaches, such as the so-called algorithmic complexity approach (Li and Vitanyi, 1997). There is a remarkable paper by Solovay (1970) which shows that it is possible to make set-theoretic assumptions (incompatible with the axiom of choice) so that every subset of the real numbers is Lebesgue measurable. So-called Loeb measure (Loeb, 1975) uses Robinson-style nonstandard analysis (see our Chapter 1 for a brief explanation of that) to construct probability measures. The emphasis of probability over expectation in conventional theory can be reversed, as in Whittle (2005). Finally, as everyone is well aware, probability theory had nearly 300 years of history before 1933 in which many important theorems were proved without measure theory. So why do we do probability theory the way we do?

This is not just a question of the history and sociology of mathematics. In statistics, especially, theorems often only have heuristic value. The central limit theorem does not assure that normal approximation is good in any particular application, only that it will be good for some sufficiently large sample size, perhaps billions and billions of times larger than in any application. If it were really the case that sample sizes of billions and billions were required, then no one would care about the theorem, despite its beauty. Statisticians who worry about the validity of normal approximation sometimes do simulations to check, and they find that it sometimes works well and sometimes not, depending on the details of the application. So the theorem does, at least sometimes, provide a useful guide to practice, even if it provides no guarantee. When we change the abstractions we use, we often change the atmospherics, hence we often change the heuristics, even though the mathematical assertions of the theorems are very similar. Nelson's central limit theorem is very similar to the conventional central limit theorem in what it says, but because the limiting distribution is not unique (it is unique up to near equivalence, of course), it feels different. The normal distribution, as we said above, doesn't really exist in Nelson-style theory. Hence we take some of its properties, such as its fourth moment, much less seriously.

I started working on "radically elementary" probability and statistics merely out of curiosity. I had to struggle to unlearn some ingrained habits from conventional theory. But I have been richly rewarded. Conventional theory looks different to me now. I have a broader perspective, a wider context. To me, that is even more important than simplicity and elegance.


Part I

Nonstandard Analysis


Chapter 1

Introduction

This part of the book introduces so-called "nonstandard analysis," which is the branch of mathematics that makes rigorous use of infinitesimals. The early history of infinitesimals starts with the invention of calculus by Newton and Leibniz in the late seventeenth century. From then until the late nineteenth century all calculus and analysis (the mathematician's term for advanced calculus, as in the title of Cauchy's Cours d'Analyse de l'École Royale Polytechnique) used infinitesimals, although questions about rigor persisted throughout this period. In the late nineteenth century, the delta-epsilon definitions of limits, continuity, derivatives, and integrals still in use today provided the first rigorous foundations for calculus. From then until the middle of the twentieth century infinitesimals were banished for lack of rigor.

The reasons for banishment of infinitesimals disappeared with the publication of Robinson (1996, first edition 1966). Interestingly, Robinson's work grew out of the process of formalization of mathematics that made infinitesimals seem nonrigorous in the first place. What goes around comes around.

Model theory is the branch of mathematical logic that deals with models for axiomatic systems. Given a set of axioms, a model for those axioms is a mathematical object that satisfies the axioms. If the axioms in question are those for a vector space, then a model is just a vector space. This may seem trivial, but it is profound. For one thing, the existence of a model means the axioms are consistent, meaning they imply no contradiction (statements that follow from the axioms and contradict each other), because no model can satisfy a contradiction. For another thing, it leads to the question of what sort of models can satisfy the axioms. Again, if the axioms in question are those for a vector space, then the question is what vector spaces exist and what are their properties? So this question encompasses the whole subject of linear algebra. If the axioms in question are those for the real numbers, then the question is what real number systems can exist? We are taught in conventional mathematics about the real number system. But model-theoretic study of the axioms for the real numbers shows that uniqueness of the real numbers cannot be proved, and that "nonstandard" models exist. Robinson exploited this fact to bring infinitesimals back into rigorous mathematics. He constructed, in conventional mathematics, nonstandard models of the real numbers having elements that could be interpreted as infinitesimals. For a recent, fairly elementary treatment of such model-theoretic constructions, see Goldblatt (1998).

Nelson (1977) introduced axiomatic nonstandard analysis in which the behavior of nonstandard concepts is derived from axioms rather than by explicit construction of models. In some ways this is simpler than the approach of Robinson (1996), since it requires no knowledge of model theory. Many other axiomatic theories of nonstandard analysis have since been developed. Gordon, Kusraev and Kutateladze (2002) and Kanovei and Reeken (2004) give encyclopedic coverage of them. These various versions of nonstandard analysis can be considered different formalizations of the same subject. Although axiomatic nonstandard analysis following Nelson (1977) requires no model theory, it does require a lot of set theory. In this book we are going to use a very simple version of nonstandard analysis found in Nelson (1987), which requires very little set theory.

In this version of nonstandard analysis, we consider the real number system we use to be the same as the real number system of conventional mathematics in the sense that every conventional mathematical theorem about it remains the same. Those theorems are silent about infinitesimals. They neither prove infinitesimals exist nor prove they do not exist. Hence we may assume that infinitesimals do exist and are real numbers just like any other real numbers. We might worry that this will cause trouble, but we shall be very careful. Our mathematics with infinitesimals will be based on axioms (taken from Nelson, 1987), which are theorems of the version of nonstandard analysis in Nelson (1977); see p. 80 in Nelson (1987). A theorem in Nelson (1977), attributed to Powell, asserts that the theory of that paper and conventional mathematics (as formalized in Zermelo-Fraenkel set theory with the axiom of choice, usually abbreviated ZFC) are equiconsistent (meaning both theories are consistent or both are inconsistent). Since the celebrated incompleteness theorem of Gödel says that the consistency of ZFC cannot be proved within ZFC, this is all that can be said. Our nonstandard analysis is consistent if conventional mathematics is itself consistent.


Chapter 2

The Natural Numbers

The objects of this version of nonstandard analysis are the objects of conventional mathematics: the integers, the real numbers, and so forth. No new objects are added. Nothing in conventional mathematics is changed. Every theorem of conventional mathematics remains true.

In particular, the natural numbers, the set N = {0, 1, 2, . . .}, remains the same as it is in conventional mathematics.

2.1 Axioms

Our only addition to conventional mathematics is an undefined property standard and four axioms that govern the use of this property.

(I) 0 is standard.

(II) If n ∈ N is standard, then so is n + 1.

(III) There exists an n ∈ N that is not standard.

(IV) If A is any property such that A(0) holds and A(n) → A(n + 1) holds for all standard n ∈ N, then A(n) holds for all standard n ∈ N.

We define nonstandard to mean not standard. Hence we usually say axiom (III) asserts the existence of a nonstandard n ∈ N. We also define limited as a synonym of standard and unlimited as a synonym of nonstandard. (In the next chapter these terms will become nonsynonymous, when limited and unlimited can be applied to real numbers and standard and nonstandard cannot.) We could replace "standard" everywhere it appears in the axioms by "limited," but we do not do so, partly to follow Nelson (1987) and partly to distinguish between the undefined term "standard," which has no meaning other than that given by its appearance in these axioms, and the term "limited," which is defined in terms of "standard."

The complicated axiom (IV) is called external induction. We also, of course, inherit conventional mathematical induction from conventional mathematics.


To explain the difference, we need the following terminology. A property from conventional mathematics (defined without any reference, direct or indirect, to the property "standard") is called internal. A property that is not internal is external. Our only examples of external properties so far are standard, nonstandard, limited, and unlimited, but later on we shall meet infinitesimal, nearly continuous, and many more. Following Nelson, we extend this terminology to mathematics itself and sometimes say internal mathematics instead of "conventional mathematics" and internal induction instead of "conventional mathematical induction." Internal induction is the following, which is a theorem of set theory.

(V) If A is any internal property such that A(0) holds and A(n) → A(n + 1) holds for all n ∈ N, then A(n) holds for all n ∈ N.

Note that internal induction can only be applied to internal properties (conventional mathematical induction can only be applied to conventional mathematical properties). If (V) could be applied with A(n) meaning "n is standard," then together with axioms (I) and (II) it would imply that every natural number is standard, but that would contradict axiom (III), and our axioms would be inconsistent when combined with those of conventional mathematics (ZFC). Note that external induction (IV) applied with A(n) meaning "n is standard" only produces the tautology that every standard n is standard.

So these axioms produce no immediately obvious contradiction. As was mentioned at the end of the last chapter, our theory is consistent if conventional mathematics is consistent. Thus we need not worry about contradictions, obvious or not.

Theorem 2.1. The numbers 1, 2, . . ., 10 are limited.

The proof is obvious. Axioms (I) and (II) together imply that 1 is limited. Having established that, another application of axiom (II) shows that 2 is limited. Then another shows that 3 is limited. And so forth. Clearly the process need not stop at 10. The theorem would be just as true if we replaced 10 by a thousand or a million. (There would just be more steps to the proof.) We cannot, however, strengthen Theorem 2.1 by removing the upper bound entirely. That would conflict with axiom (III).

2.2 Arithmetic

If we want to increase the upper bound in Theorem 2.1, we can make our proof more efficient by using external induction.

Theorem 2.2. If m and n are limited natural numbers, then so is m + n.

Proof. Fix a limited m ∈ N and let A(n) be the property "m + n is limited." By axiom (II) and external induction A(n) holds for all limited n.

Now we can count limited numbers a little faster: 10 + 10 = 20 is limited, 20 + 20 = 40 is limited, and so forth.


Theorem 2.3. If m and n are limited natural numbers, then so is m · n.

Proof. Fix a limited m ∈ N and let A(n) be the property "m · n is limited." By Theorem 2.2 and external induction A(n) holds for all standard n.

Now we can count limited numbers faster still: 10 · 10 = 100 is limited, 100 · 100 = 10,000 is limited, and so forth.

Theorem 2.4. If m and n are limited natural numbers, then so is m^n.

Proof. Fix a limited m ∈ N and let A(n) be the property "m^n is limited." By Theorem 2.3 and external induction A(n) holds for all standard n.

Now we can count limited numbers much faster: 10^10 is limited, 10^(10^10) is limited, and so forth. We could accelerate this process further with further applications of external induction, but we have run out of familiar mathematical operations (what comes after addition, multiplication, exponentiation?) and so will be content to stop here.

2.3 Order

We now fill in the gaps in our counting.

Theorem 2.5. If n and ν are natural numbers, n limited and ν unlimited, then n < ν.

Proof. Fix an unlimited ν, and let A(n) be the property n < ν. Since ν is unlimited, it is not zero, hence A(0) holds. If n is limited and A(n) holds, then n + 1 is limited and n + 1 ≤ ν, and since equality is impossible when n + 1 is limited and ν is unlimited, we actually have A(n + 1). Thus external induction implies n < ν for all limited n.

Hence we could now improve Theorem 2.1 to say that 1, 2, . . ., 10^(10^10) are limited, but we won't bother with a theorem number and a formal statement.

2.4 Illegal Set Formation

It is a principle of internal mathematics that properties can be used to define sets (the subset axiom or the axiom of specification of ZFC).

(VI) For any internal property A and any set S, there exists a set B such that x ∈ B if and only if x ∈ S and A(x).

The set B is unique (by the axiom of extension of ZFC) and is usually denoted

    { x ∈ S : A(x) }        (2.1)

But nothing in internal mathematics says that external properties can be used to define sets in this way (internal mathematics having no external properties).


Moreover, none of our four axioms of nonstandard analysis allows the use of (2.1) when A is an external property. Nelson calls attempts to use (2.1) with A external illegal set formation.

This explains why we only distinguish between internal and external properties and not between internal and external objects. From a foundational point of view, everything is a set. Natural numbers are sets: zero is just another name for the empty set, one is the set {0}, two is the set {0, 1}, and so forth. Signs are identified with {0, 1} and integers with ordered pairs of a sign and a natural number. Rational numbers are identified with ordered pairs of integers. Real numbers are identified with Dedekind cuts of rationals (pairs of sets of rationals). Functions are identified with their graphs, which are subsets of the Cartesian product of the domain and range. And so forth. The rule against illegal set formation disallows the use of (2.1) when A is external. Thus there is no way to get any new sets that are not already present in internal mathematics, hence no new mathematical objects of any sort. All objects are internal. Only properties can be internal or external.

This is a bit confusing; one must make an effort to keep this distinction clear. When we say a natural number ν is unlimited, we are asserting A(ν) holds where A is the external property defined by A(n) means "n is unlimited," but ν itself is an internal object. From a foundational point of view, ν is the set {0, . . . , ν − 1}, and the principles of set theory allow us to form this set for any natural number ν.

The rule about illegal set formation is very important because ignoring it is like money laundering: it destroys the distinction between internal and external properties and internal and external induction. If A is a property and B is a set such that A(x) holds if and only if x ∈ B, then the property A′ defined by A′(x) meaning x ∈ B is equivalent to A and is an internal property, because B is an internal set (all sets being internal) and ∈ is an internal relation.

Theorem 2.6. There does not exist a subset B of N such that n ∈ B if and only if n is limited.

Proof. Suppose the set B described by the theorem does exist. Then the property A defined by A(n) means "n ∈ B" is an internal property, and we can apply internal induction to it. Axiom (I) implies A(0), and Axiom (II) implies A(n) → A(n + 1). We conclude A(n) holds for all n ∈ N, which says that every natural number is limited, but that contradicts Axiom (III). Hence no such set exists.

The rule about illegal set formation does not deny existence; it merely denies a particular justification of existence. Let A(n) be the property "n is limited and n < 6." At first sight A appears to be an external property because it involves the external property "limited." However, we know from Theorem 2.1 that the elements of the set B = {0, 1, 2, 3, 4, 5} are all limited, and thus we see that A is equivalent to the internal property A′ defined by A′(n) meaning either n ∈ B or n < 6.


Thus even though the rule about illegal set formation forbids us to use the subset axiom as a justification of the existence of this set, it nevertheless exists. If we want to show that no set corresponds to an external property, then we need a theorem (like Theorem 2.6). Without a theorem proving either existence or non-existence, we do not know whether any set corresponds to an external property. All we know is that there is no rule that asserts it automatically exists and that we can't just blithely use the notation (2.1) as if there were such a rule.


Chapter 3

Real Numbers

Any real number x lies between two integers ⌊x⌋ and ⌈x⌉, which are called the floor and ceiling of x and which are the greatest integer less than or equal to x and the least integer greater than or equal to x, respectively. If x is an integer, then ⌊x⌋ = ⌈x⌉. Otherwise, ⌊x⌋ + 1 = ⌈x⌉.

Note that any two consecutive natural numbers n and n + 1 are either both limited or both unlimited. If n is limited, then so is n + 1, by Axiom II. If n is unlimited, then so is n + 1, by Theorem 2.5.

Thus we extend the notions of limited and unlimited to real numbers as follows.

• A nonnegative real number x is limited if and only if ⌊x⌋ and ⌈x⌉ are standard.

• A negative real number x is limited if and only if |x| is limited.

Unlimited means not limited.

We use the concept "limited" to define a new concept "infinitesimal."

• 0 is infinitesimal.

• A nonzero real number x is infinitesimal if and only if 1/x is unlimited.

Another useful auxiliary concept is "appreciable."

• A real number is appreciable if it is limited and not infinitesimal.

It is important to remember that limited, unlimited, infinitesimal, non-infinitesimal, and appreciable are external properties.

Theorem 3.1. A nonzero real number x is infinitesimal if and only if 1/x is unlimited and is appreciable if and only if 1/x is appreciable.

Proof. The first assertion just restates the definition of infinitesimal. The second assertion then follows because when x is appreciable 1/x can be neither infinitesimal nor unlimited.


With these definitions come new notation. In the following x and y are any real numbers.

• x ≃ y means x − y is infinitesimal.

• x ≪ y means x ≤ y and x ≄ y.

• x ≲ y means x ≤ y or x ≃ y.

• x ≃ ∞ means x is positive and unlimited.

• x ≪ ∞ means x ≄ ∞.

• x ≫ y means y ≪ x.

• x ≳ y means y ≲ x.

• x ≃ −∞ means −x ≃ ∞.

• x ≫ −∞ means −x ≪ ∞.

And it is important to remember that all of the symbolic notations above express external properties.

These notations are read as follows.

• x ≃ y is read x and y are nearly equal.

• x ≪ y is read x is appreciably less than y or x is strongly less than y.

• x ≲ y is read x is weakly less than y.

Please note, to avoid any misunderstanding, that the notation x ≃ ∞ does not (despite appearances) assert that there are two objects x and ∞ that are nearly equal. It is just shorthand for "x is positive and unlimited" and says nothing about an object ∞, not even that such an object exists.

3.1 Order

Theorem 3.2. If ξ, x, ɛ, y, and η are real numbers, ξ is negative and unlimited, x is negative and appreciable, ɛ is infinitesimal, y is positive and appreciable, and η is positive and unlimited, then ξ < x < ɛ < y < η.

Proof. y < η is immediate from Theorem 2.5 and the definition of limited real numbers. Then ɛ < y follows from the fact that for positive x and y we have x < y if and only if 1/y < 1/x (and from nonpositive numbers being less than positive numbers). The other inequalities follow from 0 < x < y if and only if −y < −x < 0.


3.2 Arithmetic

Theorem 3.3. If x and y are real numbers, then x + y is limited if both x and y are limited, and |x| + |y| is limited only if both x and y are limited.

Proof. The forward part follows from Theorem 2.2 by the order properties of addition and the definitions of limited and unlimited real numbers. The converse part follows from max(|x|, |y|) ≤ |x| + |y|.

Theorem 3.4. If x and y are real numbers, then x · y is limited if both x and y are limited.

Proof. This follows from Theorem 2.3 by the order properties of multiplication and the definitions of limited and unlimited real numbers.

Corollary 3.5. If x₁, . . ., xₙ are limited real numbers and n is a limited natural number, then x₁ + · · · + xₙ is limited and x₁ × · · · × xₙ is limited.

Proof. Apply external induction to Theorem 3.3 or Theorem 3.4.

Theorem 3.6. If x and y are real numbers, then x · y is unlimited if x is non-infinitesimal and y is unlimited.

Proof. Suppose to get a contradiction that z = x · y is limited. Then, by Theorem 3.4, y = z · (1/x) is limited.

Theorem 3.7. If x and y are real numbers, then x + y is infinitesimal if both x and y are infinitesimal, and |x| + |y| is infinitesimal only if both x and y are infinitesimal.

Proof. The direct part is trivial when either x or y is zero, and because of |x + y| ≤ |x| + |y| we may assume without loss of generality that x and y are positive, in which case we have

    1/(x + y) ≥ (1/2) · (1/max(x, y)) ≃ ∞

by the definition of infinitesimal and Theorem 3.6.

The converse part follows from max(|x|, |y|) ≤ |x| + |y| and Theorem 3.2.

Corollary 3.8. If x₁, . . ., xₙ are infinitesimal real numbers and n is a limited natural number, then x₁ + · · · + xₙ is infinitesimal.

Proof. Apply external induction to Theorem 3.7.

Corollary 3.9. The external relation ≃ is an equivalence.

An equivalence relation is symmetric, reflexive, and transitive. The corollary asserts that ≃ has these properties. We emphasize that it is an external relation. Hence { (x, y) ∈ R × R : x ≃ y } and { x ∈ R : x ≃ y } are illegal set formation. We can prove facts about ≃ but we cannot consider it a mathematical object (that is, a set), nor can we define objects, such as equivalence classes, in terms of it.


Proof. Reflexivity, x ≃ x for all x, is obvious. Symmetry, x ≃ y −→ y ≃ x, is also obvious. Transitivity, x ≃ y and y ≃ z −→ x ≃ z, follows from the fact that the sum of infinitesimals is infinitesimal (Theorem 3.7).

Theorem 3.10. If x and y are real numbers, then x · y is infinitesimal if x is infinitesimal and y is limited.

Proof. This is trivial when either x or y is zero, and because of |xy| = |x| · |y| we may assume without loss of generality that 0 < x ≤ y, in which case we have

    1/(x · y) = (1/x) · (1/y) ≃ ∞

by Theorem 3.6, because 1/x is unlimited and 1/y is non-infinitesimal.

Theorem 3.11. exp(x) ≪ ∞ if and only if x ≪ ∞.

Proof. Since exp(x) ≤ 1 when x ≤ 0 and exp(x) ≤ 3^⌈x⌉ when x ≥ 0, the "if" direction follows from Theorems 2.4 and 3.2.

The "only if" direction follows from exp(x) ≥ 1 + x (which is obvious from the Maclaurin series for exp).

Theorem 3.12. exp(x) ≃ 1 if and only if x ≃ 0.

Proof. Suppose x ≃ 0. By the law of the mean there exists an x* between 0 and x such that

    exp(x) − exp(0) = exp(x*) · x.

By Theorems 3.2, 3.10 and 3.11 the right hand side is infinitesimal. That proves one direction.

Suppose exp(x) ≃ 1. By the law of the mean there exists an x* between 1 and exp(x) such that

    log(exp(x)) − log(1) = (1/x*) · (exp(x) − 1),

and the left hand side is just x. Because x* lies between 1 and exp(x) ≃ 1, it is appreciable, so by Theorems 3.1, 3.2 and 3.10 the right hand side is infinitesimal. That proves the other direction.

3.3 Summary of Arithmetic

Robert (1988) summarizes the arithmetic of nonstandard analysis in several tables. This seems like a good idea, and we have copied it, making some modifications.

Let δ and ɛ be infinitesimal real numbers, u and v appreciable real numbers, and X and Y unlimited real numbers. Then we have the following results about addition (and subtraction).


    infinitesimal          appreciable             unlimited
    δ, ɛ                   u, v                    X, Y
    -------------------------------------------------------------
    δ + ɛ                  δ + u, |u| + |v|        δ + X, u + X
    [------- u + v -------]                        |X| + |Y|
    [------------------------- X + Y -------------------------]

The wide boxes (rendered here as bracketed entries spanning columns) indicate lack of more specific information. We only know that u + v is limited. It may be infinitesimal (for example, if v = −u). And we can't say anything about X + Y. It may be infinitesimal, appreciable, or unlimited.

And we have the following results about multiplication and division.

    infinitesimal          appreciable             unlimited
    δ, ɛ                   u, v                    X, Y
    -------------------------------------------------------------
    δ · ɛ, δ · u           u · v                   u · X, X · Y
    δ/u, δ/X, u/X          u/v                     u/δ, X/u, X/δ
    [------------------------- δ · X -------------------------]
    [----------------------- δ/ɛ, X/Y -----------------------]

(Again the wide box contains results we can't say anything about in general.)

When we consider exponentiation, the identities

    x^(−y) = (1/x)^y = 1/x^y

allow us to calculate x^y for all positive real x and all real y if we are only given x^y for x ≥ 1 and y ≥ 0. Thus we make a table only covering that case. If δ, ɛ, u, v, X, and Y are all nonnegative, we have the following results about powers.

    infinitesimal                  appreciable         unlimited
    δ, ɛ                           u, v                X, Y
    ------------------------------------------------------------------
    result ≃ 1                     1 ≪ result ≪ ∞      result ≃ ∞
    (1 + δ)^ɛ, (1 + δ)^u,          (1 + u)^v           (1 + u)^X, X^u, X^Y
    (1 + u)^δ
    [------------------- (1 + δ)^X, X^δ -------------------]

Exercise 3-1. Verify all entries in the summary tables about addition, subtraction, multiplication, and division. For "wide boxes" not only show that the result is in the wide box but also give examples showing that the result can be in each part of the wide box.

Exercise 3-2. Prove the assertions of each row of the following table: x has the external property in the left column if and only if exp(x) has the external property in the right column and same row.

    x                      y = exp(x)
    ----------------------------------------
    x ≥ 0 and x ≃ 0        y ≥ 1 and y ≃ 1
    0 ≪ x ≪ ∞              1 ≪ y ≪ ∞
    x ≃ ∞                  y ≃ ∞

Also state and prove the analogous theorem about x and log(x).
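Before Exercise 3-3, a purely numerical warm-up to Exercise 3-1 can be suggestive (our sketch, not part of the text; floating-point numbers are of course not actual infinitesimals, and a large number merely plays the role of an unlimited X):

    # With X playing "unlimited", different choices of an "infinitesimal"
    # delta land delta * X in each part of the wide box.
    X = 1e16
    for delta in (1 / X**2, 1 / X, 1 / X**0.5):
        print(delta * X)   # ~1e-16, 1.0, 1e8: "infinitesimal",
                           # appreciable, "unlimited"

The same game with δ/ɛ (take ɛ equal to δ², δ, or √δ) suggests why no general statement about a ratio of infinitesimals is possible.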


Exercise 3-3. Use the results of Exercises 3-1 and 3-2 to verify the summary table about exponentiation. As in Exercise 3-1, provide examples showing that results in the wide box can be either infinitesimal, appreciable, or unlimited.

3.4 Overspill

Illegal set formation is not just a complication that makes nonstandard analysis hard to understand. It is involved in an important proof technique called overspill.

We have divided the real numbers into "parts" with external properties, but these parts are not sets.

Theorem 3.13. There does not exist a set that contains all and only the ______ real numbers, where the blank is filled in with any of the following: limited, unlimited, infinitesimal, non-infinitesimal, appreciable, or any of these modified by positive or negative.

Proof. If the limited reals constituted a set L, then L ∩ N would be the set B that Theorem 2.6 asserts does not exist. If the unlimited reals constituted a set U, then its complement would be L, which doesn't exist. If the infinitesimals constituted a set I, then { x ∈ R \ {0} : 1/x ∈ I } would be U, which doesn't exist. If the non-infinitesimal reals constituted a set, then its complement would be I, which doesn't exist. If the appreciable reals constituted a set A, then [−1, 1] \ A would be I, which doesn't exist.

If any of these modified by positive or negative constituted a set S, then S′ = { −x : x ∈ S } would be the same modified by negative or positive, respectively, and S ∪ S′ or (for infinitesimals) S ∪ S′ ∪ {0} would be a set that we have already proved does not exist.

Now we explain overspill. Suppose A is an internal property such that A(x) holds for all x in any one of the "non-sets" mentioned in the theorem, for concreteness, say the infinitesimals, that is, we are assuming A(x) holds for all infinitesimal x. Then because A is internal,

    B = { x ∈ R : A(x) }        (3.1)

is not illegal set formation. Now we know by assumption that A(x) holds for all infinitesimal x. If it held only for infinitesimal x, then B would be the set I which the proof shows does not exist. We conclude that A(x) holds for some non-infinitesimal x. In picturesque language, A spills over from infinitesimals to non-infinitesimals. In short, if A(x) holds for all infinitesimal x, then by overspill, it holds for some non-infinitesimal x (the words "by overspill" alluding to Theorem 3.13).

There are three important things to note about this technique. The key issue is that A must be internal. This method of proof is completely bogus when applied to an external property, because then its starting point (3.1) is illegal set formation.

The second issue is that the scope of the technique is not limited to the infinitesimals. As we hinted above, any of the "non-sets" mentioned in the theorem will work just as well; for example, if A(x) holds for all unlimited x, then, by overspill, it holds for some limited x.

The third issue is that the technique works when the free variable ranges over the natural numbers, in which case "by overspill" alludes to Theorem 2.6 instead of Theorem 3.13. For example, if A(n) is an internal property that holds for all unlimited n, then it holds for some limited n.

As simple examples of overspill, we have the following.

    x ≃ 0 ←→ (∀ɛ ≫ 0)(|x| ≤ ɛ)
    x ≃ ∞ ←→ (∀y ≪ ∞)(y ≤ x)

To see the first one, let A(ɛ) be the property |x| ≤ ɛ. If x is infinitesimal, then A(ɛ) holds for every ɛ ≫ 0 by Theorem 3.2. Conversely, if A(ɛ) holds for every ɛ ≫ 0, then, by overspill, it also holds for some infinitesimal ɛ, which implies x is infinitesimal (by Theorem 3.2 every number smaller than an infinitesimal is infinitesimal).


Chapter 4

Calculus

This chapter isn't really about calculus, despite its title. What it is about is concepts from nonstandard analysis that compete with concepts from calculus and real analysis.¹

¹ It is possible to redo calculus and real and functional analysis using nonstandard analysis, but "radically elementary nonstandard analysis" is not the right vehicle. In order to define derivatives and integrals as mathematical objects, one needs what we have called the "full theory," Nelson's IST. Nelson (1977) and Gordon, et al. (2002, Chapter 2) give brief sketches of that theory. Robert (1988) gives a more complete discussion covering all of elementary calculus.

4.1 Convergence

A sequence x₁, x₂, . . . of nonrandom real numbers nearly converges to a real number x if

    xₙ ≃ x,    n ≃ ∞        (4.1a)

(Nelson, 1987, Chapter 6).

At first sight, this appears very different from the conventional notion of convergence. But the following definition, which looks more like the conventional notion, is equivalent. A sequence . . . nearly converges . . . if

    (∀ɛ ≫ 0)(∃N ≪ ∞)(∀n ≥ N)(|xₙ − x| ≤ ɛ).        (4.1b)

The reason these two definitions are equivalent is that they are both equivalent to the following: a sequence . . . nearly converges . . . if

    (∀n ≥ N)(|xₙ − x| ≤ ɛ),    ɛ ≫ 0, N ≃ ∞.        (4.1c)

That (4.1a) implies (4.1c) is clear. For the opposite direction, fix an n ≃ ∞. Then (4.1c) implies |xₙ − x| ≤ ɛ holds for all ɛ ≫ 0 and hence for some ɛ ≃ 0 by overspill. Hence (4.1a) holds for this n and (since n ≃ ∞ was arbitrary) for all unlimited n.


That (4.1b) implies (4.1c) is also clear. For the opposite direction, fix an ε ≫ 0. Then (4.1c) implies (∀n ≥ N)(|x_n − x| ≤ ε) holds for all N ≃ ∞ and hence for some N ≪ ∞ by overspill. Hence (4.1b) holds.

Two more issues of note should be mentioned. First, the limit is not unique. Indeed (4.1a) and x ≃ y imply by Corollary 3.9 that x_n nearly converges to y. Second, and this is what is important for Nelson-style probability theory, the concept applies quite nicely to finite sequences.

Nelson (1987), after pointing out that near convergence is not exactly the same as ordinary convergence, drops the "near" because ordinary convergence is of no interest in Nelson-style probability theory. We will follow his lead. When we say "convergent," we always mean "nearly convergent" unless it is clear from the context that we are talking about the notion from conventional mathematics, although sometimes we say "nearly convergent" for emphasis.

An important aspect of this notion of convergence is that in many contexts it makes sequences irrelevant. A sequence x_n is (nearly) convergent if x_m ≃ x_n for all unlimited m and n. So it is enough to understand (the external equivalence relation) near equality. We never need to deal with the whole sequence; we always deal with two elements at a time (is x_m ≃ x_n or not?).

In some respects, near convergence is very different from conventional convergence. Consider a double sequence x_ij and suppose

    x_ij → y_j,  i → ∞
    x_ij → z_i,  j → ∞

If we are talking about the conventional notion of convergence, this does not imply anything about joint convergence, the behavior of x_{i_n j_n} as n → ∞ for arbitrary subsequences i_n and j_n. But if we are talking about near convergence, then the situation is very different;

    x_ij ≃ y_j,  i ≃ ∞
    x_ij ≃ z_i,  j ≃ ∞

implies

    x_ij ≃ x_mn,  i, j, m, n ≃ ∞

because

    x_ij ≃ y_j ≃ x_mj ≃ z_m ≃ x_mn

whenever all of the indices are unlimited: that's just how equivalence relations (Corollary 3.9) behave.

Thus we see that near convergence and conventional convergence are in some respects similar, but in other respects near convergence is a much stronger and more useful property.
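Near convergence itself involves infinitesimals, which no computer represents, but the two-elements-at-a-time picture is easy to mimic numerically. A minimal sketch (plain Python; the large ν and the tolerance are stand-ins chosen for illustration, not part of the theory):

```python
nu = 10**6                                  # stands in for an unlimited index
x = [1.0 / n for n in range(1, nu + 1)]     # the sequence x_n = 1/n

def nearly_equal(a, b, tol=1e-4):
    """Stand-in for a ~= b; true near equality is an external relation."""
    return abs(a - b) <= tol

# "x_m ~= x_n for all unlimited m and n": compare two elements at a time.
large = [nu // 2, (2 * nu) // 3, nu - 1]    # indices "large" relative to nu
for m in large:
    for n in large:
        assert nearly_equal(x[m], x[n])

print(x[nu // 2], x[nu - 1])                # tail values, both nearly 0
```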


4.2 Continuity

If S is a subset of R, then a function g : S → R is nearly continuous at the point x ∈ S if

    (∀y ∈ S)( x ≃ y −→ g(x) ≃ g(y) )    (4.2)

and nearly continuous on a set T ⊂ S if (4.2) holds for all x ∈ T.

As with near convergence, these concepts are analogous to some conventional concepts. Near continuity at a point x is analogous to conventional continuity at x, and near continuity on a set T is analogous to conventional uniform continuity on T.

For example, the function g : x ↦ 1/x is continuous (but not uniformly continuous) on (0, ∞), and for infinitesimal positive ε we have g(ε) − g(2ε) = 1/(2ε), which is unlimited, so g is not nearly continuous on (0, ∞).

Conversely, if g is nearly continuous on T, then for any ε ≫ 0

    (∀x ∈ T)(∀y ∈ T)( |x − y| ≤ δ −→ |g(x) − g(y)| ≤ ε )    (4.3)

holds for all infinitesimal δ, hence, by overspill, for some non-infinitesimal δ. Thus for every ε ≫ 0 there exists a δ ≫ 0 such that (4.3) holds, which looks like the conventional notion of uniform continuity (but isn't exactly, because it has ≫ where the conventional notion has >).

We need to know that familiar functions from calculus are nearly continuous. The following lemma says they are.

Lemma 4.1. Suppose g is a differentiable function T → R, where T is an open interval, and g′ is limited on T. Then g is nearly continuous on T.

Proof. For x and y in T, by the mean value theorem,

    g(y) − g(x) = g′(ξ)(y − x)

for some ξ between x and y. By assumption g′(ξ) is limited, so if y − x ≃ 0 we have g(y) − g(x) ≃ 0 by Theorem 3.10.

This lemma is not sharp, because functions that are nearly continuous but not differentiable are easily defined. Moreover, even if g is differentiable, we do not need g′ to be everywhere limited in order for g to be nearly continuous. The lemma does, however, handle familiar "nice" functions. For example, it implies that the sine and cosine functions and the identity function x ↦ x are nearly continuous on all of R.

Some care is required. For example, the exponential function is its own derivative, and Theorem 3.11 says e^x ≪ ∞ if and only if x ≪ ∞. Hence the lemma only implies that x ↦ e^x is nearly continuous at points x such that x ≪ ∞. This particular application of the lemma is sharp, as direct calculation shows. For unlimited x, by

    e^y − e^x = e^x(e^{y−x} − 1) ≥ e^x(y − x)


we see that this cannot be infinitesimal when y − x is a sufficiently large infinitesimal, say, y − x = e^{−x/2} (the inequality here is the same as the one in the proof of Theorem 3.11).

Thus we see that the exponential function is not (nearly) continuous on all of R, and again we see the analogy between near continuity and conventional uniform continuity (the exponential function is not uniformly continuous on R in conventional mathematics).

4.3 Summation

The "radically elementary" analog of infinite series and integrals studied in calculus is a sum with an unlimited number of terms. Consider two sums

    ∑_{i=1}^ν x_i   and   ∑_{i=1}^ν y_i.

When will they be nearly equal? If ν is unlimited, then x_i ≃ y_i for all i is not a sufficient condition. Consider x_i = 1/ν and y_i = 2/ν.

A sufficient condition is given by Nelson (1987, Chapter 5). It uses the following notion. If x and y are real numbers with y nonzero, we say x is asymptotic to y, written x ∼ y, when x/y ≃ 1, in which case x is also nonzero.

Lemma 4.2. The external relation ∼ on R \ {0} is an equivalence.

Proof. Reflexivity, x ∼ x, is obvious. Symmetry is also obvious, because x/y ≃ 1 implies y/x ≃ 1. Transitivity is also obvious, because x/y ≃ 1 and y/z ≃ 1 imply x/z ≃ 1.

Then we have the following, which is Theorem 5.3 in Nelson (1987).

Theorem 4.3. If x_i > 0, y_i > 0, and x_i ∼ y_i for each i, then

    ∑_{i=1}^ν x_i ∼ ∑_{i=1}^ν y_i.    (4.4)

Nelson's proof is instructive, so we copy it here.

Proof. Fix ε ≫ 0. Then we have 1 − ε ≤ x_i/y_i ≤ 1 + ε for each i and

    (1 − ε) ∑_{i=1}^ν y_i ≤ ∑_{i=1}^ν x_i ≤ (1 + ε) ∑_{i=1}^ν y_i.

Hence

    1 − ε ≤ (∑_{i=1}^ν x_i) / (∑_{i=1}^ν y_i) ≤ 1 + ε.    (4.5)

Since ε ≫ 0 was arbitrary, this implies the fraction in (4.5) is nearly equal to one, which implies (4.4).
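A numerical caricature of Theorem 4.3 (plain Python; the large ν stands in for an unlimited number of terms, and the tiny random perturbations stand in for infinitesimally close ratios):

```python
import random

random.seed(0)
nu = 10**6                      # stands in for an unlimited number of terms
y = [1.0 / nu] * nu             # y_i = 1/nu, so sum(y) is exactly 1
# x_i / y_i = 1 + (tiny), mimicking x_i ~ y_i, termwise asymptotic ratios
x = [yi * (1.0 + random.uniform(-1e-6, 1e-6)) for yi in y]

print(sum(x) / sum(y))          # 1.0000000..., the sums are "asymptotic"

# Contrast: x_i = 1/nu vs y_i = 2/nu have infinitesimal termwise differences
# but ratio 1/2, and indeed the sums (1 and 2) are not nearly equal.
```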


The conclusion of the theorem is not exactly what is wanted. We wanted ≃ in place of ∼ in (4.4). The following simple lemma, which is Theorem 5.2 in Nelson (1987), usually gives us what we want.

Lemma 4.4. Suppose x or y is appreciable. Then x ∼ y if and only if x ≃ y.

Proof. Suppose y is appreciable. First, suppose x ∼ y. Then x/y − 1 is infinitesimal, and multiplying by y leaves it infinitesimal. Hence x ≃ y. Conversely, suppose x ≃ y. Then x − y is infinitesimal, and dividing by y leaves it infinitesimal. Hence x ∼ y.

Thus, so long as one side of (4.4) is appreciable, the other side is appreciable too, and there is no difference between ∼ and ≃.

4.4 Integration

The title of this section should be in quotation marks. In "radically elementary" nonstandard analysis and probability theory, we replace integrals with finite sums. But Nelson (1987) often writes sums in a form that makes them look like integrals for easy comparison with conventional mathematics.

Following Nelson (1987, Chapter 9), we use the following notation. Let T be a finite subset of R. For t ∈ T such that t ≠ max(T), we write dt to mean the difference between t and its successor in T, that is,

    dt = min{ u ∈ T : u > t } − t.    (4.6)

For t = max(T), we adopt the convention dt = 0. We call dt the spacing of T at t and collectively refer to these dt as the spacings of T.

If all the spacings dt are infinitesimal and each t ∈ T is limited, then we say T is a near interval. If all the spacings dt are infinitesimal and min(T) ≃ −∞ and max(T) ≃ ∞, then we say T is a near line.

This "dt" notation allows us to write things like

    ∑_{t∈T} e^{−t²/2} dt ≃ √(2π)    (4.7)

when T is a near line. (This will be proved in this section.)

Theorem 4.5. Suppose T is a near interval and suppose f and g are limited-valued functions on T such that f(t) ≃ g(t) for all t ∈ T. Then

    ∑_{t∈T} f(t) dt ≃ ∑_{t∈T} g(t) dt.    (4.8)

Proof. Let L be the maximum of all |f(t)| and |g(t)| for t ∈ T. The maximum is achieved because T is finite, and it is limited because f and g are limited-valued by assumption. Fix ε ≫ 0


and define

    T⁺ = { t ∈ T : f(t) ≥ ε }
    T⁰ = { t ∈ T : |f(t)| < ε }
    T⁻ = { t ∈ T : f(t) ≤ −ε }

Then by Lemma 4.4 we have

    f(t) ∼ g(t),  t ∈ T⁺ ∪ T⁻

and hence by Theorem 4.3 and Lemma 4.4 again we have

    ∑_{t∈T⁺} f(t) dt ≃ ∑_{t∈T⁺} g(t) dt

and the same with T⁺ replaced by T⁻. Also

    |∑_{t∈T⁰} f(t) dt| ≤ ∑_{t∈T⁰} |f(t)| dt ≤ ε(b − a)

where a and b are the endpoints of T, and the same with f replaced by g and ε replaced by 2ε, because |f(t)| ≤ ε implies |g(t)| ≤ 2ε. Thus from the triangle inequality and the sum of infinitesimals being infinitesimal we obtain

    |∑_{t∈T} f(t) dt − ∑_{t∈T} g(t) dt| ≤ 3ε(b − a).

Since ε ≫ 0 was arbitrary and a and b are limited (by the near interval assumption), this establishes (4.8).

Theorem 4.6. Suppose T is a near interval having endpoints a and b, and suppose f is a function [a, b] → R that is Riemann integrable, has a limited bound, and is nearly continuous on [a, b]. Then

    ∑_{t∈T} f(t) dt ≃ ∫_a^b f(t) dt.

Proof. Fix ε ≫ 0. By definition of Riemann integrability, there exists a finite subset S of R with endpoints a and b such that

    |∑_{s∈S} f(s) ds − ∫_a^b f(t) dt| ≤ ε

(where ds is the spacing of S at s), and the same holds when S is replaced by a finer partition, in particular,

    |∑_{u∈U} f(u) du − ∫_a^b f(t) dt| ≤ ε


where U = S ∪ T (and where du is the spacing of U at u).

Define h : U → R by

    h(u) = f(t),  t ∈ T and t ≤ u < t + dt.

By near continuity of f we have f(u) ≃ h(u) for u ∈ U, and hence by Theorem 4.5

    ∑_{u∈U} f(u) du ≃ ∑_{u∈U} h(u) du = ∑_{t∈T} f(t) dt.

Hence by the triangle inequality, we have

    |∑_{t∈T} f(t) dt − ∫_a^b f(t) dt| ≤ 2ε.

Since ε ≫ 0 was arbitrary, this finishes the proof.

Corollary 4.7. Suppose T is a near line and suppose f and g are limited-valued functions on T such that f(t) ≃ g(t) for all t ∈ T. Also suppose there exist M ≪ ∞ and α ≫ 1 such that

    |f(t)| ≤ M|t|^{−α},  |t| ≃ ∞    (4.9)

and similarly with f replaced by g. Then (4.8) holds.

Proof. For any appreciable δ we have

    ∑_{t∈T, |t|>a} |f(t)| dt ≤ M ∑_{t∈T, |t|>a} |t|^{−α} dt

hence

    ∑_{t∈T, |t|>a} |f(t)| dt ≤ M ( ∫_{−∞}^{−a} |t − δ|^{−α} dt + ∫_a^∞ |t − δ|^{−α} dt ) = 2M (a − δ)^{−(α−1)} / (α − 1)    (4.10)

and the right hand side is infinitesimal when a ≃ ∞ because α − 1 is non-infinitesimal, so (a − δ)^{α−1} is the case X^u or X^Y in the table for powers in Section 3.3, hence (a − δ)^{−(α−1)} is infinitesimal, M is limited, and 1/(α − 1) is limited, and the product is infinitesimal. Similarly (4.10) holds with f replaced by g. Fix ε ≫ 0. Then

    ∑_{t∈T, |t|>a} |f(t)| dt ≤ ε

and

    ∑_{t∈T, |t|>a} |g(t)| dt ≤ ε


holds for every a ≃ ∞ hence, by overspill, for some limited a. Hence by Theorem 4.5 and the triangle inequality

    |∑_{t∈T} f(t) dt − ∑_{t∈T} g(t) dt| ≤ 3ε.

Since ε ≫ 0 was arbitrary, this finishes the proof.

Corollary 4.8. Suppose T is a near line and suppose f is a function R → R that is absolutely Riemann integrable, has a limited bound, and is nearly continuous at each point of T. Also suppose there exist M ≪ ∞ and α ≫ 1 such that (4.9) holds. Then

    ∑_{t∈T} f(t) dt ≃ ∫_{−∞}^∞ f(t) dt.

Proof. As in the preceding proof, condition (4.9) implies

    ∫_{−∞}^{−a} |f(t)| dt + ∫_a^∞ |f(t)| dt ≤ 2M a^{−(α−1)} / (α − 1)

and the right hand side is infinitesimal when a ≃ ∞. Fix ε ≫ 0. Then

    ∫_{−∞}^{−a} |f(t)| dt + ∫_a^∞ |f(t)| dt ≤ ε    (4.11)

holds for every a ≃ ∞ hence, by overspill, for some limited a.

Also by the argument in the preceding proof

    ∑_{t∈T, |t|>a} |f(t)| dt ≤ ε    (4.12)

holds for some limited a. Moreover we can choose one limited a so that (4.11) and (4.12) both hold.

By Theorem 4.6

    ∑_{t∈T, |t|≤a} f(t) dt ≃ ∫_{−a}^a f(t) dt.

Now apply the triangle inequality and the arbitrariness of ε.

The condition (4.9) is not sharp, but a sharp condition, such as asserting that the left hand sides of (4.11) and (4.12) are infinitesimal for all a ≃ ∞, would leave a lot of work to be done in applying the corollary.

The inequality exp(x) ≥ 1 + x implies exp(−t²/2) ≤ 2|t|^{−2}. By Lemma 4.1, t ↦ exp(−t²/2) is nearly continuous, and it is bounded by 1. We know from calculus that ∫_{−∞}^∞ exp(−t²/2) dt = √(2π). Applying the corollary gives (4.7).
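The identity (4.7) just proved is easy to probe numerically. In the sketch below (plain Python), a fine, wide, regular grid plays the role of the near line T, with the spacings defined exactly as in (4.6); the spacing and endpoints are of course only stand-ins for infinitesimal and unlimited quantities:

```python
import math

# A fine regular grid standing in for a near line T: the spacing eps mimics
# an infinitesimal and the endpoints +-50 mimic unlimited numbers.
eps = 1e-3
T = [k * eps for k in range(-50_000, 50_001)]

# Spacings as in (4.6): successor minus t, with dt = 0 at max(T); on this
# regular grid every spacing is eps except the last, so we drop max(T).
total = sum(math.exp(-t * t / 2) * eps for t in T[:-1])

print(total, math.sqrt(2 * math.pi))   # agree to about six decimal places
```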


4.5 Derivability

If T is a subset of R, then a function g : T → R is derivable at the point x ∈ T if there exists a limited real number L such that

    (g(x₂) − g(x₁)) / (x₂ − x₁) ≃ L,  whenever x₁ ≃ x ≃ x₂ and x₁ ≠ x₂    (4.13)

(it being understood that x₁ and x₂ are elements of T).²

It is clear that an analogous characterization is the following: g is derivable at x if there exists a limited real number L such that whenever x₁ ≃ x and h ≃ 0 there exists an α ≃ 0 such that

    g(x₁ + h) − g(x₁) = Lh + αh    (4.14)

(it being understood that x₁ and x₁ + h are elements of T). To see the connection between the two characterizations take x₂ = x₁ + h and α the difference between the two sides of (4.13).

Like other external notions, derivability is different from its internal analog (differentiability) and is actually a stronger and more useful property. Suppose g is derivable at every point of T and we define a limited-real-valued function L on T such that

    (g(x₂) − g(x₁)) / (x₂ − x₁) ≃ L(x),  whenever x₁ ≃ x ≃ x₂ and x₁ ≠ x₂    (4.15)

(it being understood that x, x₁, and x₂ are elements of T). Clearly the function L is not unique, since it can be changed by an infinitesimal amount at any x without affecting the validity of (4.15). However, it is clear from (4.15) that L is nearly continuous on T and from (4.14) that g is nearly continuous on T.

Lemma 4.9. Suppose g is a differentiable function I → R, where I is an open interval, and g′ is limited and nearly continuous on I. Then g is derivable on I and we may take L(x) = g′(x) in (4.15).

Proof. For x, x₁, and x₂ in I such that x₁ ≃ x ≃ x₂ and x₁ ≠ x₂, by the mean value theorem

    (g(x₂) − g(x₁)) / (x₂ − x₁) = g′(ξ)

where ξ is between x₁ and x₂ and hence ξ ≃ x and g′(ξ) ≃ g′(x) by the near continuity assumption.

²This notion is taken from Nelson (1977, Section 5). It does not appear in Nelson (1987). In Nelson (1977), the formula (4.13) actually defines the derivative g′(x) because for standard g and x there is a unique standard L (by the standardization axiom of IST), which we denote g′(x), that satisfies (4.13), and this implicitly defines derivability for nonstandard g or x (by the transfer axiom of IST). Our "radically elementary" nonstandard analysis is too weak to do that (having neither standardization nor transfer).
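Derivability asks that every difference quotient over a pair of points near x, not just quotients anchored at x, be close to L(x). A small numerical caricature for g = sin, where Lemma 4.9 gives L(x) = cos(x) (plain Python; the small offsets merely stand in for infinitesimals):

```python
import math

# Caricature of (4.15): difference quotients of sin over *any* pair of
# points near x hug cos(x), per Lemma 4.9.
x = 0.7
for h1, h2 in [(1e-7, 3e-7), (-2e-7, 5e-7), (1e-8, -4e-8)]:
    x1, x2 = x + h1, x + h2
    quotient = (math.sin(x2) - math.sin(x1)) / (x2 - x1)
    print(quotient, math.cos(x))   # agree to roughly seven decimal places
```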


Applying both Lemma 4.1 and Lemma 4.9 we see that a function that has two derivatives, both limited on I, is derivable on I. And so most familiar functions from calculus are derivable, at least on limited intervals.

Another way to look at derivability is to note that when L(x) in (4.15) is appreciable then ≃ can be replaced by ∼ by Lemma 4.4 and the result is equivalent to

    g(x₂) − g(x₁) ∼ L(x)(x₂ − x₁),  whenever x₁ ≃ x ≃ x₂    (4.16)

(which, we reiterate, only holds when L(x) is appreciable).

As an application of this theory, we prove the following useful identity.

Theorem 4.10. For limited real numbers x and unlimited natural numbers n

    (1 + x/n)^n ≃ e^x.    (4.17)

Proof. By (near) continuity of the exponential function, it is enough to show

    n log(1 + x/n) ≃ x.    (4.18)

By derivability of the logarithm function and because its derivative at one is appreciable, we have by (4.16) and Lemma 4.9

    log(1 + h) ∼ h,  h ≃ 0.

Hence for x and n as in the statement of the theorem,

    n log(1 + x/n) ∼ n · (x/n) = x

and this implies (4.18) by another application of Lemma 4.4.
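Theorem 4.10 is the familiar compound-interest limit, and a two-line check (plain Python; the large n only stands in for an unlimited natural number) shows how tight the near equality already is:

```python
import math

# (1 + x/n)^n vs e^x for a limited x and a huge n (Theorem 4.10).
x, n = 2.5, 10**9
print((1 + x / n) ** n, math.exp(x))   # agree to about eight significant digits
```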


Part II

Probability


Chapter 5

Radically Elementary Probability Theory

5.1 Introduction

Nelson (1987) invented a new formalism for probability theory in which all random variables are defined on probability spaces that have

(i) finite sample space and

(ii) no nonempty events of probability zero.

The main point of (ii) is to assure that conditional probabilities are always well defined. When not using conditional probability, it need not be imposed.

Point (i) implies two other restrictions.

(iii) We only use finite collections of random variables. Every stochastic process has a finite index set.

(iv) We only use finite families of probability models. Every statistical model has a finite parameter space.

Nelson (1987) uses (iii). Our justification of (iv) is that a likelihood is a stochastic process indexed by the parameter. Hence to obey (iii) the parameter must take values in a finite (though perhaps unlimited) set. The need for (iv) is even more obvious if one is a Bayesian. If the parameter is a random variable, then (i) implies (iv).

One might think (i) leaves no room for interesting advanced probability theory. Under (i) every probability distribution is discrete. There are no truly continuous random variables. Also under (i) it is not possible to have an infinite sequence of independent random variables (except for the trivial special case where the random variables are constant).

But Nelson combined this "radical" simplification with an innovation, the use of nonstandard analysis. In "Nelson-style" probability theory, a discrete


distribution in which every point has infinitesimal probability can behave much like a continuous distribution in conventional "Kolmogorov-style" probability theory.

In "Nelson-style" probability theory, a finite sequence X₁, ..., X_n of independent random variables (which can be defined on a finite sample space), where n is unlimited, can behave much like an infinite sequence in "Kolmogorov-style" probability theory. For example, the law of large numbers and the central limit theorem can hold for such sequences, where, of course, we are referring to "Nelson-style" analogs of the conventional theorems (Nelson, 1987, Chapters 16 and 18).

5.2 Unconditional Probability

5.2.1 Probability

Probability theory on finite sample spaces having no nonempty events of probability zero is very simple. Probability models consist of a finite set Ω (the sample space) and a strictly positive function pr (the probability mass function) on Ω such that ∑_{ω∈Ω} pr(ω) = 1.

Every subset of Ω is an event, and the probability of an event A is given by

    Pr(A) = ∑_{ω∈A} pr(ω).    (5.1a)

There are never any questions of measurability.

Note that we distinguish between pr and Pr, the relationship being (5.1a) going one way and

    pr(ω) = Pr({ω})    (5.1b)

going the other. We can think of Pr as a probability measure. It certainly is the Nelson-style analog of a Kolmogorov-style probability measure. However, it is much simpler. There is no sigma-algebra (every subset of Ω is an event). And countable additivity is vacuous (since Ω is finite, there are only finitely many events).

5.2.2 Expectation

Random Scalars

A real-valued function on the sample space is called a random variable, and the expectation of a random variable X is given by

    E(X) = ∑_{ω∈Ω} X(ω) pr(ω).    (5.2)

There are never any questions of existence of expectations. The set of all random variables is the finite-dimensional vector space R^Ω.
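Because the whole theory lives on a finite Ω, it fits in a few lines of code. A minimal sketch (plain Python; the names pr, Pr, and E are chosen here to match the text, and the two-coin sample space is made up for illustration):

```python
from fractions import Fraction

# A probability model: a finite sample space and a strictly positive pmf.
# Here Omega models two fair coin flips.
Omega = [(a, b) for a in (0, 1) for b in (0, 1)]
pr = {omega: Fraction(1, 4) for omega in Omega}   # strictly positive, sums to 1
assert sum(pr.values()) == 1

def Pr(A):
    """Probability of an event A (any subset of Omega), as in (5.1a)."""
    return sum(pr[omega] for omega in A)

def E(X):
    """Expectation of a random variable X : Omega -> R, as in (5.2)."""
    return sum(X(omega) * pr[omega] for omega in Omega)

heads_total = lambda omega: omega[0] + omega[1]   # a random variable
print(Pr({omega for omega in Omega if heads_total(omega) >= 1}))  # 3/4
print(E(heads_total))                                             # 1
```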


The indicator function of the set A is the function I_A : Ω → R defined by

    I_A(ω) = 0,  ω ∈ Ω \ A
             1,  ω ∈ A    (5.3)

The set Ω \ A is called the complement of A and is also denoted A^c. Using this notation

    Pr(A) = E(I_A)

(probability is expectation of indicator functions).

Random Vectors

A function from the sample space Ω to a vector space V is called a random vector and the expectation of a random vector X is given by (5.2) where X(ω) pr(ω) is interpreted as multiplication of the vector X(ω) by the scalar pr(ω) and the sum is interpreted as vector addition, which is well defined because Ω is finite.

The set of all (V-valued) random vectors is V^Ω. Note that even if V is infinite-dimensional, it has a finite-dimensional subspace that contains all the X(ω) for ω ∈ Ω because Ω is finite. Even if we are interested in a finite sequence X₁, ..., X_n of random variables, there exists a finite-dimensional subspace that contains all the X_i(ω) for 1 ≤ i ≤ n and ω ∈ Ω. We shall never be interested in infinite sequences or any other infinite collection of random vectors (restriction (iii) discussed in Section 5.1). Thus without loss of generality we may assume V is finite-dimensional.

Random Elements

A function from the sample space Ω to an arbitrary set S is called a random element of S. The set of all random elements (of S) is S^Ω. The same sort of argument as in the last paragraph of the preceding section says that without loss of generality we may take S to be finite.

Random elements need not have expectations. The addition and multiplication in (5.2) are not defined for an element X of an abstract set S. But if X is a random element of S and f is a function S → R, then f ∘ X, which is usually written f(X), is a random variable and does have expectation.

5.3 Conditional Probability

For any family 𝒳 of random variables define a relation ∼_𝒳 on Ω by

    ω₁ ∼_𝒳 ω₂ ←→ (∀X ∈ 𝒳)( X(ω₁) = X(ω₂) )    (5.4)

Clearly, this is an internal equivalence relation, and hence defines a partition S of Ω. We write S = at(𝒳) and call the elements of S the atoms of 𝒳. By


definition, every element of 𝒳 is constant on each element of S, and S is the coarsest partition of Ω that has this property.

The algebra generated by 𝒳 is the largest family of random variables 𝒜 such that at(𝒜) = at(𝒳). Clearly, it is the set of all random variables that are constant on each element of at(𝒳).¹

An algebra 𝒜 is closed under arbitrary operations. If f is a real-valued function with d real arguments and X₁, ..., X_d are elements of 𝒜, then Y = f(X₁, ..., X_d) is also an element of 𝒜, where this notation is defined (as is usual in probability theory) by

    Y(ω) = f( X₁(ω), ..., X_d(ω) ),  ω ∈ Ω.

(This is obvious from the definition.)

5.3.1 Conditional Expectation

Let 𝒜 be a family of random variables (not necessarily an algebra), and let A_ω denote the element of at(𝒜) containing ω. The conditional expectation of a random variable X given the family 𝒜 is the random variable Y defined by

    Y(ω) = (1 / Pr(A_ω)) ∑_{ω′∈A_ω} X(ω′) pr(ω′),  ω ∈ Ω.    (5.5)

There are never any questions of existence or uniqueness; Pr(A_ω) cannot be zero because of the assumption that pr is strictly positive.

Nelson mostly uses the notation E_𝒜 X to denote the random variable defined by (5.5) but also uses the notation E(X|𝒜) more common in Kolmogorov-style theory and also uses the notation E(X|Z₁, ..., Z_d) when 𝒜 = {Z₁, ..., Z_d}.

Theorem 5.1. Suppose 𝒜 and ℬ are algebras of random variables and 𝒜 ⊂ ℬ. Then

    E_𝒜(X + Y) = E_𝒜 X + E_𝒜 Y,  X, Y ∈ R^Ω    (5.6a)
    E_𝒜(XY) = X E_𝒜 Y,  X ∈ 𝒜, Y ∈ R^Ω    (5.6b)
    E_𝒜 E_ℬ = E_𝒜    (5.6c)
    E E_𝒜 = E    (5.6d)

The meaning of (5.6c) or (5.6d) is that the two sides of the equation are equal when applied to any random variable. The proofs are straightforward verifications directly from the definition (5.5).

¹Nelson (1987, p. 6) defines algebra differently: a family of random variables containing the constant random variables and closed under addition and multiplication. His definition is more "mathematical" because it justifies the name "algebra." But he then immediately proves that his definition characterizes the same notion as ours. As closure under addition and multiplication are not particularly interesting in light of the comments immediately following the footnoted text, we just take the characterization more relevant to probability theory as our definition.
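Definition (5.5) is directly computable on a finite Ω. A minimal sketch (plain Python; same made-up two-coin space as the earlier sketch):

```python
from fractions import Fraction

# Two fair coin flips again; condition on the first flip.
Omega = [(a, b) for a in (0, 1) for b in (0, 1)]
pr = {omega: Fraction(1, 4) for omega in Omega}

def atoms(family):
    """Partition of Omega by the values of a family of random variables (5.4)."""
    part = {}
    for omega in Omega:
        key = tuple(Z(omega) for Z in family)
        part.setdefault(key, set()).add(omega)
    return list(part.values())

def cond_exp(X, family):
    """E(X | family) as a function on Omega, following (5.5)."""
    Y = {}
    for atom in atoms(family):
        prob = sum(pr[o] for o in atom)                  # Pr(A_omega), never 0
        value = sum(X(o) * pr[o] for o in atom) / prob
        for o in atom:
            Y[o] = value
    return lambda omega: Y[omega]

total = lambda omega: omega[0] + omega[1]
first = lambda omega: omega[0]
Y = cond_exp(total, [first])
print(Y((0, 1)), Y((1, 1)))   # 1/2 and 3/2: first flip plus E(second) = 1/2
```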


5.3.2 Conditional Probability

As with unconditional probability, conditional probability is expectation of indicator functions: we define conditional probability by

    Pr(B|𝒜) = Pr_𝒜(B) = E_𝒜(I_B).    (5.7)

Consider the special case where 𝒜 = {I_A}, so at(𝒜) = {A, A^c}. Then for ω ∈ A, so that A_ω = A, the definition (5.5) gives

    (1 / Pr(A)) ∑_{ω∈A} I_B(ω) pr(ω) = Pr(A ∩ B) / Pr(A)

for the value of Pr(B|𝒜) at such an ω.

As in undergraduate probability theory we write this

    Pr(B|A) = Pr(A ∩ B) / Pr(A)    (5.8)

and regard it as an independent definition of what Pr(B| · ) means when the thingy behind the bar is an event rather than a family of random variables. Clearly, (5.8) is well defined whenever A is nonempty (because of our rule that nonempty events have nonzero probability).

It is clear that Pr(B|𝒜) evaluated at an ω ∈ A^c is Pr(B|A^c). Moreover, for any family of random variables 𝒜 (not just ones generated by a single indicator function), we have

    Pr(B|𝒜)(ω) = Pr(B|A_ω)

where the notation on the left hand side means the random variable Pr(B|𝒜) evaluated at the point ω and, as in the definition (5.5), A_ω is the element of at(𝒜) containing ω. And this connects the two notions of conditional probability: the undergraduate level (5.8) and the PhD level (5.7), at least "PhD level" when done Kolmogorov-style, our Nelson-style definition using (5.5) being actually only undergraduate level in difficulty.

Although one does not usually see the "PhD level" definition (5.7) of conditional probability in undergraduate courses, one does see the so-called regression function E(Y|X₁, ..., X_p), without which one cannot understand multiple regression. Of course, it is usually introduced without having the PhD-level Kolmogorov-style definition that makes it rigorous. Our Nelson-style definition serves just as well to make the concept rigorous and is much lower in level of difficulty.

5.4 Distribution Functions

A random variable X induces a distribution function F defined by

    F(x) = Pr(X ≤ x),  x ∈ R    (5.9a)


and probability mass function f defined by

    f(x) = Pr(X = x),  x ∈ R.    (5.9b)

These are related by

    f(x) = F(x) − max_{y<x} F(y)    (5.9c)

and

    F(x) = ∑_{y≤x} f(y),    (5.9d)

the max and the sum being well defined because X, being defined on a finite sample space, takes only finitely many values, so f(y) is nonzero for only finitely many y.
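A tiny sketch of (5.9c) and (5.9d) (plain Python; the random variable is hypothetical, chosen just to have a few atoms):

```python
from fractions import Fraction

# A made-up random variable with finite support, given by its pmf f.
f = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}
support = sorted(f)

def F(x):
    """Distribution function (5.9a)/(5.9d): sum of f(y) over y <= x."""
    return sum(f[y] for y in support if y <= x)

# Recover f from F as in (5.9c). F is constant between support points, so
# the max of F(y) over real y < x is attained at a support point below x.
for x in support:
    below = [F(y) for y in support if y < x]
    print(x, F(x) - (max(below) if below else 0))   # reproduces f(x)
```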


5.5 Probability Measures

A random element X of a set S induces a probability measure P defined by

    P(A) = Pr(X ∈ A),  A ⊂ S    (5.11a)

and probability mass function p defined by

    p(x) = Pr(X = x),  x ∈ S.    (5.11b)

These are related by

    p(x) = P({x})    (5.11c)

and

    P(A) = ∑_{x∈A} p(x),    (5.11d)

a relation denoted by p = dP. This relationship between X and P is denoted P = L(X), and we say P is the law of X.

The sum in (5.11d) is always well defined, even if S is an infinite set, because of the assumption (i) of Section 5.1 that all random elements are defined on finite sample spaces, so p(x) is zero for all but finitely many x. We say that the set of x such that p(x) is positive is the support of p and also the support of P. We shall never be interested in measures that do not have finite support.

If h is a real-valued function on S, X is a random element of S, and P = L(X), then we can write

    E{h(X)} = ∑_{x∈S} h(x) dP(x).    (5.11e)

As with (5.11d), the sum in (5.11e) is always well defined because of the assumption that P has finite support.
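A sketch of the law construction (plain Python; the sample space, random element, and function h are all made up for illustration):

```python
from fractions import Fraction

# A random element of S = {"red", "green"} defined on a four-point Omega.
Omega = [1, 2, 3, 4]
pr = {1: Fraction(1, 8), 2: Fraction(3, 8), 3: Fraction(1, 4), 4: Fraction(1, 4)}
X = {1: "red", 2: "red", 3: "green", 4: "red"}.get

def law(X):
    """p = dP for P = L(X): push pr forward through X, as in (5.11b)."""
    p = {}
    for omega in Omega:
        p[X(omega)] = p.get(X(omega), 0) + pr[omega]
    return p

p = law(X)
print(p)   # {'red': Fraction(3, 4), 'green': Fraction(1, 4)}

# E{h(X)} via (5.11e): sum h(x) dP(x) over the support.
h = {"red": 10, "green": 2}
print(sum(h[x] * px for x, px in p.items()))   # 8
```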




Chapter 6

More Radically Elementary Probability Theory

6.1 Almost Surely

The definition of "almost surely" appropriate in Nelson-style probability theory (Nelson, 1987, Chapter 7) goes as follows: a property A holds almost surely if for every ε ≫ 0 there exists an event N (which may depend on ε) such that Pr(N) ≤ ε and A(ω) is true except for ω ∈ N.

If the property A in question is internal, then the event

    { ω ∈ Ω : A(ω) }    (6.1)

has probability nearly equal to one, which more resembles the conventional (Kolmogorov-style) definition.

But if the property A is external, then (6.1) is an instance of illegal set formation (Section 2.4). It need not define a set, and hence need not have a probability. Thus we need the more complicated definition involving a different exception set N for every ε ≫ 0 when we want to say an external property holds almost surely.

Lemma 6.1. Suppose A₁, ..., A_n are properties that hold almost surely and n is limited. Then A₁, ..., A_n hold simultaneously almost surely.

Proof. For every ε ≫ 0, we have ε/n ≫ 0, hence there exist events N_i such that Pr(N_i) ≤ ε/n and A_i(ω) holds except for ω ∈ N_i. But then Pr(⋃_{i=1}^n N_i) ≤ ε and A_i(ω) holds for i = 1, ..., n, except for ω ∈ ⋃_{i=1}^n N_i.

6.1.1 Infinitesimal Almost Surely

When the property A in question is X ≃ 0, the last bit of the definition of almost surely becomes: and X(ω) ≃ 0 except for ω ∈ N.


Lemma 6.2. The following three conditions are equivalent.

(i) X is infinitesimal almost surely.

(ii) If λ ≫ 0, then Pr(|X| ≥ λ) ≃ 0.

(iii) There exists a λ ≃ 0 such that Pr(|X| ≥ λ) ≃ 0.

This is Theorem 7.1 in Nelson (1987). We do not give a proof here. The proof is very similar to that of the lemma in the next section.

6.1.2 Limited Almost Surely

When the property A in question is |X| ≪ ∞, the last bit of the definition of almost surely becomes: and |X(ω)| ≪ ∞ except for ω ∈ N.

Lemma 6.3. The following three conditions are equivalent.

(i) X is limited almost surely.

(ii) If x ≃ ∞, then Pr(|X| ≥ x) ≃ 0.

(iii) For every ε ≫ 0 there exists a limited x such that Pr(|X| ≥ x) ≤ ε.

Proof. Assume (i). Then for every ε ≫ 0 there exists an event N with Pr(N) ≤ ε such that X(ω) is limited except when ω ∈ N. Hence if x ≃ ∞ the event |X| ≥ x is contained in N, and Pr(|X| ≥ x) ≤ ε. Since ε was arbitrary, (ii) holds. Thus (i) ⇒ (ii).

Assume (ii). Fix ε ≫ 0. Then Pr(|X| ≥ x) ≤ ε holds for all unlimited positive x, hence by overspill for some limited x. Thus (ii) ⇒ (iii).

(iii) ⇒ (i) is obvious.

The concept in Kolmogorov-style probability theory analogous to "limited almost surely" is "tight." In Kolmogorov-style theory, tightness is an uninteresting concept when applied to single random variables, because by countable additivity every random variable is tight.

In conventional finitely-additive probability theory, non-tight random variables exist, although proof of their existence involves fancy mathematics like the Hahn-Banach theorem. In Nelson-style probability theory, random variables that are not limited almost surely are easily constructed. For example, take the (discrete) uniform distribution on the integers 1, ..., n, where n is unlimited.

6.2 L¹ Random Variables

In Nelson-style theory, every random variable has a well defined expectation given by (5.2). But, unlike the situation in Kolmogorov-style probability theory, the mere existence of expectation proves nothing (since every random variable has expectation, existence is vacuous).


One might guess that the Nelson-style property analogous to Kolmogorov-style existence of expectation is limited absolute expectation, but it turns out that an even stronger property is needed for a random variable to be "well behaved."

A random variable X is L¹ if E(|X| I_{(a,∞)}(|X|)) nearly converges to zero as a goes to infinity, meaning

    E(|X| I_{(a,∞)}(|X|)) ≃ 0,  a ≃ ∞.    (6.2)

The analogous Kolmogorov-style definition is easily seen to define L¹. If X is a Kolmogorov-style random variable that has expectation, then the left hand side of (6.2) converges to zero as a → ∞ by dominated convergence. On the other hand, if X is a Kolmogorov-style random variable that does not have expectation, then the left hand side of (6.2) is equal to +∞ for all a, because if there existed an a for which the left hand side was finite, then we would have

    E(|X|) = E(|X| I_{[0,a]}(|X|)) + E(|X| I_{(a,∞)}(|X|)) ≤ a + E(|X| I_{(a,∞)}(|X|))

finite, contrary to assumption.

6.2.1 The Radon-Nikodym Theorem

We don't usually use the definition of L¹ directly. More often we use the following characterization, which is Theorem 8.1 in Nelson (1987).

Theorem 6.4 (Radon-Nikodym and converse). A random variable X is L¹ if and only if E(|X|) is limited and, for all events M, if Pr(M) ≃ 0, then E(|X| I_M) ≃ 0.

Thus we see that L¹ is a stronger property than limited absolute expectation (however, see Theorem 6.8 below for a criterion based on limitedness of higher moments). Here is an example of a random variable that has limited absolute expectation but is not L¹. Let X be Bernoulli(p) and Y = bX with b > 0. Then Y is nonnegative, so we may omit absolute values.

    E(Y I_{(a,∞)}(Y)) = bp,  a < b
                        0,   a ≥ b

so, if b ≃ ∞ and bp ≫ 0, then Y is not L¹. But E(Y) = bp, so if bp ≪ ∞, then Y does have limited absolute expectation. For example, take p = 1/b.

The name of the theorem comes from Nelson (1987). He often labels his theorems with names of theorems from Kolmogorov-style theory. The names don't mean that his theorem is the same as the Kolmogorov-style theorem (or even a corollary of it). Rather they mean that his theorem is the analog in his theory of the named theorem in Kolmogorov-style theory.


In this case the analogy is the following. Suppose (Ω, 𝒜, P) is a Kolmogorov-style probability space and X a Kolmogorov-style L¹(P) random variable on this probability space. Define a positive measure ν on (Ω, 𝒜) by

    ν(A) = ∫_A |X| dP.

Then P(M) = 0 implies ν(M) = 0, a property called absolute continuity of ν with respect to P in Kolmogorov-style theory, which is the condition in the Radon-Nikodym theorem. The condition in the Nelson-style Radon-Nikodym theorem is P(M) ≃ 0 implies ν(M) ≃ 0.

Unlike the situation in conventional mathematics (Lebesgue-style measure theory as well as Kolmogorov-style probability theory) L¹ is not a vector space. It is not even a set. The following

    { X ∈ R^Ω : X is L¹ }    (6.3)

is illegal set formation because L¹ is an external property. In fact, we can prove that there does not exist a subset S of R^Ω such that X ∈ S if and only if X is L¹, because if this set did exist, then by Theorem 6.4, letting Y be the constant random variable everywhere equal to one,

    { a ∈ R : aY ∈ S }

would be the set of limited real numbers, which does not exist (Theorem 3.13).

However, it immediately follows from Theorem 6.4 that

    X and Y are L¹ −→ X + Y is L¹    (6.4a)
    X is L¹ and |a| ≪ ∞ −→ aX is L¹    (6.4b)
    Y is L¹ and |X| ≤ |Y| −→ X is L¹    (6.4c)

so Nelson-style L¹ behaves much like Kolmogorov-style L¹. The differences are that Kolmogorov-style theory would allow unlimited a in (6.4b) and would allow |X| ≤ |Y| to hold only almost surely in (6.4c).

In Nelson-style theory we cannot insert "almost surely" in (6.4c) if there exists any point ω having infinitesimal probability (that is, whenever "almost surely" is not vacuous), because if X is L¹ and we define Y_a to be equal to X everywhere except at ω, where we have Y_a(ω) = a, then EY_a → ∞ as a → ∞ and hence Y_a is not L¹ for sufficiently large a, but Y_a ≃ X almost surely.

6.2.2 The Lebesgue Theorem

Another important theorem about L¹ is the following, which gets its name because it is the Nelson-style analog of the Lebesgue dominated convergence theorem, monotone convergence theorem, and Fatou's lemma. It is Theorem 8.2 in Nelson (1987).


Theorem 6.5 (Lebesgue). If X and Y are L¹ and X ≃ Y almost surely, then E(X) ≃ E(Y).

Note that what is a convergence concept in Kolmogorov-style theory has become an external equivalence relation in Nelson-style theory. This is typical. As we shall see, the same thing happens with convergence in probability and convergence in distribution.

How can this one simple Nelson-style theorem replace all of that Lebesgue-Kolmogorov-style theory? In Nelson-style theory, we only look at finite sequences, and existence and uniqueness of limits are not at issue: we can always take the limit to be the last element of the sequence, but the limit is never unique. Nor can it be an issue to prove that the limit is L¹. By the comment following and concerning (6.4c), we can never prove a limit to be L¹, because all random variables almost surely equal to the limit are also limits but many of them are not L¹. Thus the only part of the dominated, monotone, and Fatou convergence theorems that is of interest is that almost sure convergence implies convergence of expectations, and this is assured by Theorem 6.5. Nelson-style theory is in some respects much simpler than the competition.

Lemma 6.6. Let Z be a nonnegative random variable. If E(Z) ≃ 0, then Z ≃ 0 almost surely. Conversely, if Z is L¹ and Z ≃ 0 almost surely, then E(Z) ≃ 0.

Proof. One direction is the Lebesgue theorem. The other direction is Markov's inequality

    Pr(Z > λ) ≤ E(Z)/λ.    (6.5)

If E(Z) ≃ 0, then for every λ ≫ 0 the right hand side of (6.5) is infinitesimal over non-infinitesimal equals infinitesimal. Hence by criterion (ii) of Lemma 6.2, Z ≃ 0 almost surely.

This lemma is the Nelson-style analog of the Kolmogorov-style theorem that a nonnegative random variable Z is zero almost surely if and only if E(Z) = 0.

For any random variable X and any positive real number a define another random variable

    X^(a)(ω) = −a,    X(ω) < −a
               X(ω),  −a ≤ X(ω) ≤ a    (6.6)
               a,     X(ω) > a

Lemma 6.7 (Approximation). Suppose X and Y are L¹ random variables and EX^(a) ≃ EY^(a) for all limited a. Then EX ≃ EY.

Proof. For every ε ≫ 0 we have

    |EX − EX^(a)| ≤ ε and |EY − EY^(a)| ≤ ε    (6.7)

for all unlimited a and hence by overspill for some limited a. But for this a we have EX^(a) ≃ EY^(a) and hence by the triangle inequality

    |EX − EY| ≤ 3ε.    (6.8)

Since (6.8) holds for every ε ≫ 0, the left hand side must be infinitesimal.
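The truncation (6.6) is easy to play with numerically, and doing so shows why Lemma 6.7 needs its L¹ hypothesis. A sketch (plain Python; the two-point heavy-tailed variable is made up, mimicking the Bernoulli example of Section 6.2.1):

```python
def truncate(x, a):
    """X^(a) as in (6.6): clamp x into [-a, a]."""
    return max(-a, min(a, x))

# Two-point variable: a huge value b with probability 1/b, else 0, so that
# E(X) = 1 exactly while X is "not L1" when b plays an unlimited number.
b = 10**9
support = [(0.0, 1 - 1 / b), (float(b), 1 / b)]   # (value, probability) pairs

def E(h):
    return sum(h(x) * prob for x, prob in support)

for a in (10.0, 10.0**6, float(b)):
    print(a, E(lambda x: truncate(x, a)))   # 1e-08, 0.001, then 1.0

# E(X^(a)) stays near 0 for "limited" a although E(X) = 1: with this X and
# with Y = 0, the conclusion of Lemma 6.7 fails, and the only hypothesis
# violated is that X be L1, which is exactly why the lemma requires it.
```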


6.2.3 L^p Random Variables

For 1 ≤ p < ∞, we say a random variable X is L^p if |X|^p is L¹, and we say a random variable X is L^∞ if it is limited, that is, if |X(ω)| ≪ ∞ for all ω (Nelson, 1987, p. 31). And for such variables we define the norms

    ‖X‖_p = E{|X|^p}^{1/p}    (6.9a)

when 1 ≤ p < ∞ and

    ‖X‖_∞ = max_{ω∈Ω} |X(ω)|.    (6.9b)

In (6.9b) we can write "max" instead of "sup" (the supremum is achieved) because Ω is a finite set. Hence, if X is L^∞, then ‖X‖_∞ is limited. The analogous property, if X is L^p, then ‖X‖_p is limited, holds for 1 ≤ p ≪ ∞ because the expectation in (6.9a) is limited by the Radon-Nikodym theorem.

Theorem 6.8. Suppose 1 ≤ p ≪ q ≪ ∞ and E(|X|^q) ≪ ∞; then X is L^p.

Proof. First

    |X|^p = |X|^q / |X|^{p(q/p−1)}

from which we obtain

    |X|^p I_{(a,∞)}(|X|^p) ≤ |X|^q / a^{q/p−1}

and

    E{|X|^p I_{(a,∞)}(|X|^p)} ≤ E{|X|^q} / a^{q/p−1}.

Now q/p − 1 = (q − p)/p is appreciable over appreciable equals appreciable, hence a^{q/p−1} ≃ ∞ whenever a ≃ ∞. Since E{|X|^q} ≪ ∞, we have

    E{|X|^p I_{(a,∞)}(|X|^p)} ≃ 0,  a ≃ ∞,

so |X|^p is L¹ and X is L^p.

6.2.4 Conditional Expectation

There are two important theorems about L¹ and conditional expectation (Theorems 8.3 and 8.4 in Nelson, 1987), which are repeated below.

Theorem 6.9. If 1 ≤ p ≤ ∞ and X is L^p and 𝒜 is a family of random variables, then E_𝒜 X is L^p.

Theorem 6.10. If X is L¹ and 𝒜 is a family of random variables, then X is L¹ on almost every atom of 𝒜.


The meaning of Theorem 6.10 is not entirely obvious. It means that for every ε ≫ 0 there exists a set N, which in this case may be taken to be a union of atoms of 𝒜, such that Pr(N) ≤ ε and

    E_𝒜 |X| I_{(a,∞)}(|X|) ≃ 0,  a ≃ ∞    (6.10)

except on N (the conditional expectation is an element of 𝒜, hence constant on atoms of 𝒜, and the assertion is that (6.10) holds except on those atoms of 𝒜 that are contained in N).

Theorem 6.11 (Conditional Lebesgue). If X and Y are L¹ random variables such that X ≃ Y almost surely and 𝒜 is a family of random variables, then E(X | 𝒜) ≃ E(Y | 𝒜) almost surely.

Proof. Let Z = X − Y. Then Z is L¹ and Z ≃ 0 almost surely. We are to show that E(Z | 𝒜) ≃ 0 almost surely. Define W = E(Z | 𝒜). W is L¹ by Theorem 6.9. By the conditional Jensen inequality (Nelson, 1987, p. 8) and iterated conditional expectation (5.6d)

    E(|W|) = E[ |E(Z | 𝒜)| ] ≤ E[ E(|Z| | 𝒜) ] = E(|Z|).    (6.11)

Now Z is L¹ if and only if |Z| is, by definition of L¹, and Z is infinitesimal almost surely if and only if |Z| is, because a number z is infinitesimal if and only if |z| is. Hence |Z| is L¹ and infinitesimal almost surely, so by Lemma 6.6, E(|Z|) ≃ 0. Hence (6.11) implies E(|W|) ≃ 0, and by the other direction of Lemma 6.6, |W| is infinitesimal almost surely, so W is infinitesimal almost surely.

6.2.5 The Fubini Theorem

Why do we have a section with this title? If expectations are always finite sums, isn't it obvious that we can interchange the order of summation? Yes, it is. But the Fubini theorem in Kolmogorov-style probability theory also makes measurability and integrability assertions (some authors put these in a preliminary lemma). The measurability assertions are vacuous in Nelson-style probability theory, but the integrability assertions are still important (Corollary to Theorem 8.4 in Nelson, 1987, repeated below).

Theorem 6.12 (Fubini). Suppose X is L¹ on a probability space with Ω = Ω₁ × Ω₂ and pr((ω₁, ω₂)) = pr₁(ω₁) pr₂(ω₂); then for pr₂-almost all ω₂ in Ω₂ the random variable ω₁ ↦ X(ω₁, ω₂) on (Ω₁, pr₁) is L¹.




Chapter 7

Stochastic Convergence

7.1 Almost Sure Convergence

A sequence X₁, ..., X_ν of random variables converges almost surely (Nelson, 1987, p. 26) if almost surely

    X_n ≃ X_ν,  n ≃ ∞.    (7.1)

More precisely, for every ε ≫ 0 there exists an event N (which may depend on ε) such that Pr(N) ≤ ε and

    X_n(ω) ≃ X_ν(ω),  n ≃ ∞, ω ∉ N.

7.2 Convergence in Probability

A sequence X₁, ..., X_ν of random variables converges in probability (Nelson, 1987, p. 26) if

    X_n ≃ X_ν almost surely,  n ≃ ∞.    (7.2)

More precisely, for every n ≃ ∞ and ε ≫ 0 there exists an event N (which may depend on n and ε) such that Pr(N) ≤ ε and

    X_n(ω) ≃ X_ν(ω),  ω ∉ N.

7.3 Almost Sure Near Equality

Let us define a notation for X ≃ Y almost surely. Random variables X and Y are nearly equal almost surely, written X ≃ᵃˢ Y, if X ≃ Y holds almost surely.

Using this notation, we can redefine convergence in probability, replacing (7.2): a sequence ... converges in probability if

    X_n ≃ᵃˢ X_ν,  n ≃ ∞.


Lemma 7.1. The external relation ≃ᵃˢ is an equivalence on R^Ω.

Proof. Symmetry and reflexivity come from the corresponding properties of ≃. If X ≃ᵃˢ Y and Y ≃ᵃˢ Z, then X ≃ᵃˢ Z by Corollary 3.9 and Lemma 6.1.

Note that the situation here is much like what we saw with (near) convergence of non-random sequences in Section 4.1. The sequence does very little work. It is enough to understand the external equivalence relation almost sure near equality. We never need to deal with the whole sequence; we always deal with two elements at a time (is X_m ≃ᵃˢ X_ν or not?).

It is also much like what we saw with the Lebesgue theorem in Section 6.2.2 replacing conventional theorems about sequences (monotone convergence, dominated convergence, Fatou).

The only forms of stochastic convergence that necessarily involve sequences are almost sure convergence (Section 7.1 above) and the invariance principle (Nelson, 1987, Theorem 18.1), both of which are sample path limit theorems. The point is that in (7.1) the exception sets must work for all unlimited n, whereas in (7.2) one may use different exception sets for each n. So (7.1) is a statement about the whole sequence and (7.2) isn't.

7.4 Near Equivalence

A real-valued function g is limited if every value is limited. Random variables X and Y defined on possibly different probability spaces are nearly equivalent (Nelson, 1987, Chapter 17), written X ≃ʷ Y, if

    E{g(X)} ≃ E{g(Y)}

for every limited (nearly) continuous function g.

Lemma 7.2. The external relation ≃ʷ is an equivalence.

Proof. Obvious from ≃ being an equivalence.

7.5 Convergence in Distribution

A sequence X₁, ..., X_ν of random variables defined on possibly different probability spaces converges in distribution (also called "weak convergence" or "convergence in law") if

    X_n ≃ʷ X_ν,  n ≃ ∞.    (7.3)

The situation here is much like what we have seen with every other form of convergence of sequences, with the sole exception of almost sure convergence. The sequence does very little work. It is enough to understand the external equivalence relation near equivalence. We never need to deal with the whole


sequence; we always deal with two elements at a time (is X_m ≃ʷ X_ν or not?). There is so little point to convergence in distribution (over and above near equivalence) that Nelson (1987) does not even bother to define it.

As we saw with (near) convergence of non-random sequences in Section 4.1, and for exactly the same reasons, Nelson-style convergence in distribution is a much stronger property than Kolmogorov-style convergence in distribution. We have, for instance, the same behavior of double sequences:

    X_ij ≃ʷ Y_j,  i ≃ ∞
    X_ij ≃ʷ Z_i,  j ≃ ∞

implies

    X_ij ≃ʷ X_mn,  i, j, m, n ≃ ∞.

That's just the way equivalence relations behave.

The analogous property does not hold for Kolmogorov-style convergence in distribution (it doesn't even hold when all the random variables are constant random variables and the convergence in distribution is convergence of non-random sequences in disguise).

Now it might be thought that Nelson-style convergence in distribution is too strong. Maybe it is too hard to get? But it turns out this is not the case. We get near equivalence when we expect to get it, in situations analogous to those where Kolmogorov-style convergence in distribution occurs. Thus it seems that Kolmogorov-style convergence in distribution is too weak. Nelson-style arguments are often simpler and easier.

Theorem 7.3. The following are the only implications that hold between the various modes of stochastic convergence.

(i) A sequence of random variables that converges almost surely also converges in probability.

(ii) A sequence of random variables that converges in probability also converges in distribution.

(iii) X ≃ᵃˢ Y implies X ≃ʷ Y.

(The reverse implications are, in general, false.)

Proof. (i) is obvious from the definitions. (iii) obviously implies (ii). Every limited function is L¹ by the Radon-Nikodym theorem (our Theorem 6.4) and hence (iii) holds by the Lebesgue theorem (our Theorem 6.5).

The converses to (ii) and (iii) need not hold, because X ≃ʷ Y does not even require that X and Y be defined on the same probability space, and X ≃ᵃˢ Y does. Moreover, consider X having the uniform distribution on the two-point set {−1, 1} and Y = −X; then X ≃ʷ Y is true, but X ≃ᵃˢ Y is false.

Nelson (1987, p. 26) gives a counterexample to the converse to (i). Let X₁, ..., X_ν be independent and identically distributed Bernoulli random variables


with Pr(X_n = 1) = c/ν, where ν is unlimited and c is appreciable. Note that, since X_n is zero-or-one-valued, we have X_n ≃ 0 if and only if X_n = 0. Since c/ν is infinitesimal, we have X_n ≃ᵃˢ 0 for all n. Hence X_n converges in probability to zero.

Let A_µ denote the event

    X_m = 0,  µ < m ≤ ν.

Then

    Pr(A_µ) = (1 − c/ν)^{ν−µ}

and if ν − µ is unlimited, we have

    Pr(A_µ) ≃ exp(−c · (ν − µ)/ν)    (7.4)

by Theorem 4.10.

This makes it impossible for X_n to converge almost surely to zero, because this requires that for every unlimited µ we have Pr(A_µ) ≥ 1 − ε for every ε ≫ 0, hence Pr(A_µ) ≃ 1, which happens only if the argument of the exponential function in (7.4) is infinitesimal for all unlimited µ, and this is not so.
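A numerical check of (7.4) (plain Python; the large ν only mimics an unlimited integer):

```python
import math

# Pr(A_mu) = (1 - c/nu)^(nu - mu) vs its approximation exp(-c (nu-mu)/nu).
nu, c = 10**8, 2.0
for mu in (0, nu // 2, nu - 10**6):
    exact = (1 - c / nu) ** (nu - mu)
    approx = math.exp(-c * (nu - mu) / nu)
    print(mu, exact, approx)

# With mu = 0 the probability is ~exp(-2) ~ 0.135, far from 1: the sequence
# cannot converge almost surely, even though each X_n is 0 with probability
# nearly 1 (convergence in probability).
```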


Chapter 8

The Central Limit Theorem

8.1 Independent and Identically Distributed

Nelson (1987, Chapter 18 and also the discussion on p. 57) gives a theorem that has the following obvious corollary.

Corollary 8.1 (The Central Limit Theorem). Suppose X₁, X₂, ..., X_ν are independent and identically distributed L² random variables with mean µ and variance σ², and suppose σ² ≫ 0 and ν ≃ ∞. Define

    X̄_ν = (1/ν) ∑_{i=1}^ν X_i.

Then the random variable

    Z = (X̄_ν − µ) / (σ/√ν)    (8.1)

is L² and nearly equivalent to every other such random variable.

The assertion of the theorem is that, no matter what independent and identically distributed L² sequence with appreciable variance is chosen, the distribution of (8.1) is the same up to near equivalence.

We say any random variable nearly equivalent to (8.1) is standard normal. Note that in Nelson-style theory the term "standard normal" does not name a distribution. It is an external property that distributions may or may not have. As with any external property, it is illegal set formation to try to form the set of all standard normal distributions (this set does not exist).

8.2 The De Moivre-Laplace Theorem

In this section we find out more about the limiting distribution in the central limit theorem. The Bernoulli distribution with success probability p has


probability mass function

    f(k) = q,  k = 0
           p,  k = 1

where q = 1 − p. This distribution is abbreviated Bernoulli(p).

The binomial distribution for n trials with success probability p has probability mass function

    f(k) = \binom{n}{k} p^k q^{n−k},  k = 0, ..., n.    (8.2)

This is the distribution of the sum of n IID Bernoulli(p) random variables. This distribution is abbreviated Binomial(n, p).

Theorem 8.2 (De Moivre-Laplace). Suppose X has the Binomial(n, p) distribution with

    n ≃ ∞    (8.3a)
    0 ≪ p ≪ 1    (8.3b)

and

    Z = (X − np) / √(npq).

Then there exists a near line T such that

    Pr(Z ≤ z) ≃ (1/√(2π)) ∑_{t∈T, t≤z} e^{−t²/2} dt.    (8.4)

Our proof closely follows Feller (1950, Section 7.2).

Proof. Stirling's approximation for n! is

    (2π)^{1/2} n^{n+1/2} e^{−n} ≤ n! ≤ (2π)^{1/2} n^{n+1/2} e^{−n+1/(12n)}    (8.5)

(Feller, 1950, pp. 41–44).¹ Dividing through by the left hand term in (8.5) gives

    1 ≤ n! / ((2π)^{1/2} n^{n+1/2} e^{−n}) ≤ e^{1/(12n)}    (8.6)

From the continuity of the exponential function for limited values of its argument, we conclude 1 ≃ e^{1/(12n)} whenever 1/(12n) is infinitesimal, which is whenever n is unlimited. Thus a nonstandard analysis version of Stirling's approximation is

    n! ∼ (2π)^{1/2} n^{n+1/2} e^{−n},  n ≃ ∞.    (8.7)

¹Actually, Feller's argument at this point in his book only establishes (8.5) with the factor (2π)^{1/2} replaced by an unknown constant. It is only toward the end of the proof of the De Moivre-Laplace theorem that we find out, by comparison with the normalizing constant for the normal distribution, what this unknown constant is. (See footnote 3.)
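The ratio squeezed in (8.6) is already essentially one for quite modest n, as a quick check shows (plain Python; working with logarithms via lgamma avoids overflow):

```python
import math

# Ratio n!/((2 pi)^(1/2) n^(n+1/2) e^(-n)) squeezed by (8.6): it lies
# between 1 and e^(1/(12n)), so it tends to 1 as n grows.
for n in (5, 50, 500):
    log_fact = math.lgamma(n + 1)                       # log n!
    log_stirling = 0.5 * math.log(2 * math.pi) + (n + 0.5) * math.log(n) - n
    print(n, math.exp(log_fact - log_stirling), math.exp(1 / (12 * n)))
# Already at n = 5 the ratio is ~1.0167, just under the bound e^(1/60) ~ 1.0168.
```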


Interestingly, Feller uses the same ∼ notation in his equation for Stirling's approximation, but, of course, he means something conventional (that the ratio of the two sides converges to one as n goes to infinity).

Plugging Stirling's approximation into (8.2) gives

    f(k) = \binom{n}{k} p^k q^{n−k}
         = (n! / (k!(n − k)!)) p^k q^{n−k}
         ∼ ((2π)^{1/2} n^{n+1/2} e^{−n} p^k q^{n−k}) / ((2π)^{1/2} k^{k+1/2} e^{−k} · (2π)^{1/2} (n − k)^{n−k+1/2} e^{−(n−k)})
         = (1/(2π)^{1/2}) · (n^{n+1/2} p^k q^{n−k}) / (k^{k+1/2} (n − k)^{n−k+1/2})
         = (n / (2πk(n − k)))^{1/2} · (np/k)^k · (nq/(n − k))^{n−k}

whenever k ≃ ∞ and n − k ≃ ∞ [this is our analog of equation (2.5) in Chapter 7 of Feller (1950)].

Now (still following Feller) we define δ = k − np, so k = np + δ and n − k = nq − δ, and

    f(k) ∼ (n / (2π(np + δ)(nq − δ)))^{1/2} · 1 / ((1 + δ/(np))^{np+δ} (1 − δ/(nq))^{nq−δ})    (8.8)

[this is our analog of equation (2.7) in Chapter 7 of Feller (1950)].

For δ = 0 the right hand side is exactly 1/√(2πnpq). For δ ≠ 0 we use the two-term Taylor series in η = δ/n with remainder to expand the logarithm of the denominator of the second term

    log[ (1 + δ/(np))^{np+δ} (1 − δ/(nq))^{nq−δ} ] = δ²/(2npq) − (1/p*² − 1/q*²) · δ³/(6n²)

where p* = p + δ*/n, q* = 1 − p*, and |δ*| ≤ |δ|.

By (8.3b) the term 1/p*² − 1/q*² on the right hand side is limited when δ/n is infinitesimal and hence we have

    log[ (1 + δ/(np))^{np+δ} (1 − δ/(nq))^{nq−δ} ] ∼ δ²/(2npq)    (8.9a)

whenever

    δ/n ≃ 0    (8.9b)


54 CHAPTER 8. THE CENTRAL LIMIT THEOREM[these are our analogs of equations (2.9) <strong>and</strong> (2.10) in Chapter 7 of Feller(1950)]. 2 . Equation (8.9b) together with n ≃ ∞ implies both k ≃ ∞ <strong>and</strong>n − k ≃ ∞. Thus the only conditions required for (8.8) <strong>and</strong> (8.9a) to hold are(8.3a), (8.3b), <strong>and</strong> (8.9b).Still following Feller, (8.9b) together with (8.3b) implies np + δ ∼ np <strong>and</strong>nq − δ ∼ nq. Hence the first term on the right h<strong>and</strong> side in (8.8) is asymptoticto 1/ √ 2πnpq, <strong>and</strong>f(k) ∼( ) 1/2 )1exp(− δ22πnpq2npq(8.10)[this is our analog of equation (2.11) in Chapter 7 of Feller (1950)].We now leave Feller <strong>and</strong> do some “calculus” nonst<strong>and</strong>ard analysis style. Weknow from conventional probability theory that we should be interested in thest<strong>and</strong>ardized variablez = k − np √ npqin terms of which<strong>and</strong>k = np + z √ npqδ = z √ npqNote that z takes values in the near lineT = { (k − np)ɛ : k = 0, . . . , n }with regular spacing ɛ = 1/ √ npq, which by (8.3a) <strong>and</strong> (8.3b) is infinitesimal.Rewriting (8.10) in terms of z givesf(k) ∼ 1 √2πe −z2 /2 dz = φ(z) dz (8.11)where dz = ɛ is the spacing of T <strong>and</strong> φ is the st<strong>and</strong>ard normal density functionfrom conventional probability theory. 3Now for any limited numbers a <strong>and</strong> b with a < b we have by Theorem 4.3Pr(a < Z < b) ∼∑ z∈Ta


8.2. THE DE MOIVRE-LAPLACE THEOREM 55is infinitesimal, hence (8.9b) holds <strong>and</strong> hence also (8.11). Actually, we havePr(a < Z < b) ≃∑ z∈Ta
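The approximation (8.4) is easy to check numerically for a large but standard n. The sketch below (hypothetical code, not from the report; Python standard library only) compares the exact binomial probability Pr(Z ≤ z) with the grid sum on the right hand side of (8.4) and with the conventional normal distribution function computed from math.erf.

    import math

    n, p = 1000, 0.3            # large n stands in for n ~ infinity; 0 << p << 1
    q = 1.0 - p
    sd = math.sqrt(n * p * q)
    eps = 1.0 / sd              # spacing of the near line T

    def binom_pmf(k):
        return math.comb(n, k) * p**k * q**(n - k)

    def lhs(z):
        """Exact Pr(Z <= z) where Z = (X - np)/sqrt(npq), X ~ Binomial(n, p)."""
        return sum(binom_pmf(k) for k in range(n + 1)
                   if (k - n * p) * eps <= z)

    def rhs(z):
        """Grid sum over T = {(k - np) * eps : k = 0..n} as in (8.4)."""
        ts = ((k - n * p) * eps for k in range(n + 1))
        return sum(math.exp(-t * t / 2) * eps
                   for t in ts if t <= z) / math.sqrt(2 * math.pi)

    def Phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

    for z in (-1.5, 0.0, 1.0):
        print(z, lhs(z), rhs(z), Phi(z))  # the three columns nearly agree

The agreement of the second and third columns also illustrates why the grid sum in (8.4) can be read as a Riemann sum for the conventional normal integral.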




Chapter 9

Near Equivalence in Metric Spaces

9.1 Metric Spaces

Let (S, d) be a metric space, that is, S is a set and d is a metric for S. We say points x and y of S are nearly equal and write x ≃ y when d(x, y) is infinitesimal. Note that our definition of x ≃ y in R is a special case of this generalization when we use the usual metric for R defined by d(x, y) = |x − y|.

Let (S, d) and (S′, d′) be metric spaces. A function h : S → S′ is nearly continuous at a point x ∈ S if

    y ∈ S and x ≃ y → h(x) ≃ h(y),    (9.1)

and nearly continuous on a set T ⊂ S if (9.1) holds at all x ∈ T.

Random elements X and Y of a metric space (S, d), defined on possibly different probability spaces, are nearly equivalent, written X ≃w Y, if

    E{g(X)} ≃ E{g(Y)}

for every limited (nearly) continuous function g : S → R.

Let d₁ and d₂ be two different metrics for the same set S. We say that d₁ and d₂ are equivalent if they agree as to the meaning of x ≃ y, that is, if

    d₁(x, y) ≃ 0 ↔ d₂(x, y) ≃ 0.

In this case, a function h : S → S′, where (S′, d′) is another metric space, is (nearly) continuous when d₁ is the metric for S if and only if it is continuous when d₂ is the metric. In this sense, near equivalence of random elements of metric spaces does not depend on the metric but only on the external equivalence relation ≃ induced by the metric.


9.2 Probability Measures

A random element X of a metric space (S, d) induces a probability measure P defined by (5.11a) and probability mass function p defined by (5.11b), a relation denoted by p = dP. This relationship between X and P is denoted P = L(X), and we say P is the law of X.

Since near equivalence is determined by expectations, and expectations are determined by measures, near equivalence really depends only on measures, not on random elements (except through their measures). Thus we make the definition: measures P and Q on a metric space (having finite support) are nearly equivalent, written P ≃w Q, if Ph ≃ Qh for every limited nearly continuous function h : S → R, where

    Ph = Σ_{x∈S} h(x) dP(x)

is a shorthand for the expectation of the random variable h(X) when P = L(X).

9.3 The Prohorov Metric

If d is a metric on S we define the notation

    d(x, A) = inf_{y∈A} d(x, y)

for any nonempty subset A of S. Then we define the notation

    A^ɛ = { x ∈ S : d(x, A) < ɛ }

for any nonempty subset A of S and any ɛ > 0. For completeness, we define ∅^ɛ = ∅ for ɛ > 0. Note that the "open" ball of radius ɛ centered at x can be denoted {x}^ɛ, and our definition for the empty set makes

    A^ɛ = ⋃_{x∈A} {x}^ɛ

hold for all A (rather than just nonempty A). The set A^ɛ is called the ɛ-dilation of A.

The triangle inequality implies (A^ɛ)^η ⊂ A^{ɛ+η}, because z ∈ (A^ɛ)^η when there exist x ∈ A and y ∈ S such that d(x, y) < ɛ and d(y, z) < η.

Let S be a finite set with metric d and let P(S, d) denote the set of all probability measures on S. The Prohorov metric on P(S, d) is the function π : P(S, d) × P(S, d) → R defined so that π(P, Q) is the infimum of all ɛ > 0 such that

    P(A) ≤ Q(A^ɛ) + ɛ and Q(A) ≤ P(A^ɛ) + ɛ,  A ⊂ S.    (9.2)

(When P and Q have finite support the infimum is achieved.)
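Because Nelson-style measures have finite support, the Prohorov distance can, for tiny state spaces, be computed by brute force directly from the definition (9.2). The following sketch (hypothetical code, not part of the report; Python standard library only) enumerates all subsets A of S and returns the smallest ɛ on a search grid for which (9.2) holds. It is exponential in |S| and meant only to illustrate the definition.

    from itertools import combinations

    def dilation(A, S, d, eps):
        """The eps-dilation of A: points of S within distance < eps of A."""
        return {x for x in S if A and min(d(x, y) for y in A) < eps}

    def prohorov(P, Q, d, grid):
        """Smallest eps in grid satisfying (9.2); P, Q map points to probabilities."""
        S = sorted(set(P) | set(Q))
        subsets = [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]
        prob = lambda mu, A: sum(mu.get(x, 0.0) for x in A)
        for eps in sorted(grid):
            if all(prob(P, A) <= prob(Q, dilation(A, S, d, eps)) + eps and
                   prob(Q, A) <= prob(P, dilation(A, S, d, eps)) + eps
                   for A in subsets):
                return eps
        return None

    # Two measures on a small subset of the real line.
    d = lambda x, y: abs(x - y)
    P = {0.0: 0.5, 1.0: 0.5}
    Q = {0.05: 0.5, 1.0: 0.4, 2.0: 0.1}
    print(prohorov(P, Q, d, [k / 100 for k in range(1, 101)]))  # 0.1 here

In this example the answer 0.1 is forced by the 0.1 of Q-mass sitting at the point 2.0, far from the support of P.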


Note that (9.2) holds for every ɛ > π(P, Q), because if (9.2) holds for some ɛ, then for any η > 0

    P(A) ≤ P(A^η) ≤ Q((A^η)^ɛ) + ɛ ≤ Q(A^{η+ɛ}) + ɛ,

and similarly with P and Q swapped, so (9.2) also holds with ɛ replaced by ɛ + η. Hence the set of ɛ for which (9.2) holds is a union of intervals of the form [ɛ, ∞) and hence has one of the forms (δ, ∞) or [δ, ∞), only the latter being possible when P and Q have finite support. In either case δ is the Prohorov distance between P and Q.

In the following theorem and the rest of this chapter we restrict our attention to Nelson-style probability theory: all probability measures have finite support.

Theorem 9.1. π is a metric.

Proof. The properties symmetry, nonnegativity, and π(P, P) = 0 are obvious. If P ≠ Q then p = dP ≠ q = dQ and there exists an x such that p(x) ≠ q(x). Define B = supp P ∪ supp Q, where supp P denotes the support of P (defined in Section 5.5). If B ≠ {x}, then define

    ɛ₁ = min{ d(x, y) : y ∈ B \ {x} },

otherwise (when B = {x}) define ɛ₁ = 1. Then ɛ₁ > 0 because B is a finite set, and for 0 < ɛ < ɛ₁ we have

    P({x}^ɛ) = p(x) and Q({x}^ɛ) = q(x),

hence π(P, Q) ≥ min(ɛ₁, |p(x) − q(x)|) > 0. The triangle inequality follows from

    P(A) ≤ Q(A^ɛ) + ɛ,  ɛ > π(P, Q)
    Q(A^ɛ) ≤ R(A^{ɛ+η}) + η,  η > π(Q, R)

hence

    P(A) ≤ R(A^{ɛ+η}) + ɛ + η,  ɛ + η > π(P, Q) + π(Q, R)

and similarly with P and R swapped.

Another useful fact about the Prohorov metric is the following (copied essentially verbatim from Billingsley (1999, p. 72), because the argument uses no measure theory, so the Kolmogorov-style and Nelson-style arguments are the same).

Lemma 9.2. The Prohorov distance between P and Q is the infimum over all ɛ such that

    P(A) ≤ Q(A^ɛ) + ɛ,  A ⊂ S.    (9.3)

Proof. First note that A ⊂ S \ B^ɛ if and only if B ⊂ S \ A^ɛ, because either is the same as (∀x ∈ A)(∀y ∈ B)(d(x, y) ≥ ɛ). If (9.3) holds, let B = S \ A^ɛ, and then

    P(A^ɛ) = 1 − P(B) ≥ 1 − Q(B^ɛ) − ɛ = Q(S \ B^ɛ) − ɛ ≥ Q(A) − ɛ

so (9.3) also holds with P and Q swapped, and hence (9.2) holds.


9.4 Near Equivalence and the Prohorov Metric

Let (S, d) and (S′, d′) be metric spaces. A function h : S → S′ is nearly Lipschitz continuous if there exists a limited number L such that

    d′(h(x), h(y)) ≤ L · d(x, y),  x, y ∈ S.

Note that, the product of limited and infinitesimal numbers being infinitesimal, a (nearly) Lipschitz continuous function is, as the name suggests, (nearly) continuous.

Lemma 9.3. Let (S, d) be a metric space. For any nonempty A ⊂ S and any ɛ ≫ 0, define h : S → R by

    h(x) = max(0, 1 − d(x, A)/ɛ).

Then h is limited and nearly Lipschitz continuous.

Proof. We first establish

    |h(x) − h(y)| ≤ |d(x, A) − d(y, A)| / ɛ.    (9.4)

(Case I) Suppose d(x, A) = 0 so h(x) = 1. Then

    h(x) − h(y) = 1 − h(y) = 1, if d(y, A) ≥ ɛ;  d(y, A)/ɛ, if d(y, A) < ɛ,

and in either case (9.4) holds.

(Case II) Suppose d(x, A) ≥ ɛ so h(x) = 0. Then

    h(y) − h(x) = h(y) = 0, if d(y, A) ≥ ɛ;  1 − d(y, A)/ɛ, if d(y, A) < ɛ,

and in either case (9.4) holds.

(Case III) Suppose 0 < d(x, A) < ɛ so 0 < h(x) < 1, and similarly with y replacing x. Then

    h(x) − h(y) = (d(y, A) − d(x, A)) / ɛ

and again (9.4) holds. This finishes the proof of (9.4) (the other cases being like I and II with x and y swapped).

Now for any z ∈ A we have

    d(x, y) + d(y, z) ≥ d(x, z) ≥ d(x, A);

taking the infimum over all z ∈ A gives

    d(x, y) + d(y, A) ≥ d(x, A)

and similarly with x and y swapped. Hence

    d(x, y) ≥ |d(y, A) − d(x, A)|

from which we see that (9.4) implies

    |h(x) − h(y)| ≤ |d(x, A) − d(y, A)| / ɛ ≤ (1/ɛ) · d(x, y).

And, 1/ɛ being limited, this establishes the lemma (and incidentally shows that the "Lipschitz constant" can be taken to be L = 1/ɛ).

Theorem 9.4. Assume P and Q are measures on (S, d) having finite support and that Ph ≃ Qh holds for every limited nearly Lipschitz continuous function h : S → R. Then π(P, Q) ≃ 0.

Proof. For any nonempty A ⊂ S and any ɛ ≫ 0, define h : S → R as in the lemma. Then h is limited and nearly Lipschitz continuous. So Ph ≃ Qh. But

    P(A) ≤ Ph ≤ P(A^ɛ)

and similarly with P replaced by Q. Hence P(A) ≤ Ph ≃ Qh ≤ Q(A^ɛ) holds for all A. Thus (9.3) holds for every ɛ ≫ 0 and hence π(P, Q) is infinitesimal.

Theorem 9.5. Assume P and Q are measures on (S, d) having finite support and π(P, Q) ≃ 0. Then P ≃w Q.

Proof. It is enough to prove Ph ≃ Qh for all nearly continuous h satisfying 0 < h < 1, because for any limited nearly continuous function g there exist limited a and b such that g = a + bh with 0 < h < 1, in which case Ph ≃ Qh implies Pg ≃ Qg.

But for such h we have by Lemma 5.2 the representation

    Ph = Σ_{t∈T} P{h > t} dt

where T is a finite subset of [0, 1] containing 0, 1, and h(supp P), and where the notation {h > t} denotes the event h(X) > t and P{h > t} denotes the probability of this event, where X is a random element such that P = L(X). Note that t ↦ 1 − P{h > t} is the distribution function of h(X).

For arbitrary t in [0, 1] define A_t = {h > t}. Fix an infinitesimal ɛ greater than π(P, Q). Then for x ∈ A_t^ɛ we have h(x) ≳ t and hence h(x) > t − δ for any δ ≫ 0. Hence A_t ⊂ A_t^ɛ ⊂ {h > t − δ}. Thus we have

    P(A_t) = P{h > t} ≤ P(A_t^ɛ) ≤ P{h > t − δ}
    P(A_t) ≲ Q(A_t^ɛ)
    Q(A_t) ≲ P(A_t^ɛ)
    Q(A_t) = Q{h > t} ≤ Q(A_t^ɛ) ≤ Q{h > t − δ}

from which we infer that

    P{h > t} ≲ Q{h > t − δ} and Q{h > t} ≲ P{h > t − δ}    (9.5a)

holds for all t in [0, 1] and all δ ≫ 0. To simplify notation, we define U(t) = P{h > t} and V(t) = Q{h > t} so (9.5a) becomes

    U(s) ≳ V(t) and V(s) ≳ U(t), whenever s ≪ t.    (9.5b)

We also define U₊(t) = P{h ≥ t} and V₊(t) = Q{h ≥ t}.

Now for any limited n choose s_i such that s₀ = 0, s_n = 1, and

    U(s_i) ≤ (n − i)/n < U₊(s_i),  i = 1, ..., n − 1.

It may be that some of the s_i are equal, so we remove duplicates. Let

    {s₀, s₁, ..., s_n} = {s*₀, s*₁, ..., s*_{m_n}}

maintaining s*₀ < s*₁ < · · · < s*_{m_n}.

Now for any δ ≫ 0 we have

    V(s*_i − δ) ≳ U(s*_i − δ/2) ≥ U₊(s*_i) > U(s*_i) ≳ V(s*_i + δ)    (9.6)

hence

    V(s*_i − δ) + δ ≥ U₊(s*_i) > U(s*_i) ≥ V(s*_i + δ) − δ    (9.7)

holds for all δ ≫ 0, and hence by overspill (9.7) must hold for some infinitesimal δ. For this infinitesimal δ, define r_i = s*_i − δ and t_i = s*_i + δ for i = 1, ..., m_n − 1, and also define r₀ = t₀ = 0 and r_{m_n} = t_{m_n} = 1.

Now

    Ph ≳ Σ_{i=1}^{m_n} (s*_i − s*_{i−1}) U₊(s*_i)    (9.8a)

and

    Qh ≲ Σ_{i=1}^{m_n} (r_i − t_{i−1}) V(t_{i−1})    (9.8b)

because the contributions from the infinitesimal intervals (r_i, t_i) are negligible. For some integer k

    V(t_{i−1}) ≲ U(s*_{i−1}) ≤ (n − k + 1)/n and (n − k)/n < U₊(s*_i).

Combining these gives

    V(t_{i−1}) ≲ U₊(s*_i) + 1/n.    (9.8c)

And combining (9.8c) with (9.8a) and (9.8b) and using the fact that s*_i − s*_{i−1} ≃ r_i − t_{i−1} for all i gives Qh ≲ Ph + 1/n. The same argument with P and Q swapped gives Ph ≲ Qh + 1/n. Since these hold for any limited n, we must have Ph ≃ Qh.


9.5 The Portmanteau Theorem

Let (S, d) be a metric space. Define for A ⊂ S

    A_ɛ = S \ (S \ A)^ɛ.

The set A_ɛ is called the ɛ-erosion of A.

Theorem 9.6. Let P and Q be measures on (S, d) having finite support and π the Prohorov metric. The following are equivalent.

(i) P and Q are nearly equivalent.

(ii) Ph ≃ Qh for every limited Lipschitz continuous function h : S → R.

(iii) π(P, Q) is infinitesimal.

(iv) For some ɛ ≃ 0 and for all A ⊂ S we have P(A) ≲ Q(A^ɛ).

(v) For every ɛ ≫ 0 and for all A ⊂ S we have P(A) ≲ Q(A^ɛ).

(vi) For some ɛ ≃ 0 and for all A ⊂ S we have P(A) ≳ Q(A_ɛ).

(vii) For every ɛ ≫ 0 and for all A ⊂ S we have P(A) ≳ Q(A_ɛ).

Proof. The implication (i) → (ii) is trivial. The implication (ii) → (iii) is Theorem 9.4. The implication (iii) → (i) is Theorem 9.5.

By Lemma 9.2, (iii) is equivalent to the existence of an ɛ ≃ 0 such that (9.3) holds, which implies (iv). The implication (iv) → (v) is trivial. If (v) holds, then (9.3) holds for every ɛ ≫ 0, and hence by Lemma 9.2 (iii) holds. At this point we have established the equivalence of (i) through (v).

The implications (iv) ↔ (vi) and (v) ↔ (vii) are just the complement rule: write B = S \ A so

    P(A) = 1 − P(B) and Q(A_ɛ) = 1 − Q(B^ɛ)

and

    P(A) ≲ Q(A^ɛ) ↔ P(B) ≳ Q(B_ɛ);

hence if the left hand side holds for all A, then the right hand side holds for all B and vice versa.

Let S and T be two different sets with S ⊂ T and d a metric for T and hence also for S. Pedantically, the restriction d_r of d to S × S is a metric for S and defines (S, d_r) as a metric subspace of (T, d). Then elements P and Q of P(S, d_r) are nearly equivalent if and only if they are nearly equivalent considered as elements of P(T, d). The enclosing superspace T is irrelevant. This is clear from (iv) of the portmanteau theorem. Since P(A) = P(A ∩ supp P), we need only check A ⊂ supp P in establishing (iv). So long as we are only interested in two measures P and Q, we can take S = supp P ∪ supp Q, if we like, but any enclosing metric space does as well.


9.6 Continuous Mapping

Let A be any property that may or may not hold at points of S and let P be a measure on (S, d) having finite support. We say A holds P-almost everywhere if for every ɛ ≫ 0 there exists a set N such that P(N) ≤ ɛ and A(x) holds except for x in N.

The following lemma is Theorem 17.3 in Nelson (1987). We reprove it here only to see how much shorter the proof gets with our extra apparatus.

Lemma 9.7. Let P and Q be elements of P(S, d) such that P ≃w Q, and let A be any property (internal or external) such that x, y ∈ S and x ≃ y implies A(x) ↔ A(y). Then A holds P-almost everywhere if and only if it holds Q-almost everywhere.

Proof. Suppose A holds P-almost everywhere. Fix ɛ ≫ 0 and choose N ⊂ S such that P(N) ≤ ɛ/2 and A(x) holds for all x ∈ S \ N. By (vi) of the portmanteau theorem there is a δ ≃ 0 such that P(N) ≳ Q(N_δ). So Q(N_δ) ≤ ɛ. By definition S \ N_δ = (S \ N)^δ. Hence y ∈ S \ N_δ if and only if there exists an x in S \ N such that d(x, y) < δ. This implies x ≃ y, and hence A(y) holds. Hence A holds on S \ N_δ. Since ɛ ≫ 0 was arbitrary, A holds Q-almost everywhere.

Let P be a measure on (S, d) having finite support. A function h : S → S′ is nearly continuous P-almost everywhere if the property A(x) in the definition of "almost everywhere" is "h is nearly continuous at x." Note that by the lemma, when P ≃w Q we have h nearly continuous P-almost everywhere if and only if h is nearly continuous Q-almost everywhere.

For an arbitrary map h : S → S′ and any measure P on (S, d), the image measure P′ ∈ P(S′, d′), denoted by P ∘ h⁻¹, is defined by

    P′(B) = P(h⁻¹(B)),  B ⊂ S′.

If a random element X has the distribution P, then the random element h(X) has the distribution P ∘ h⁻¹.

Theorem 9.8 (Continuous Mapping). Let (S, d) and (S′, d′) be metric spaces, and let P, Q ∈ P(S, d). Suppose P ≃w Q, and suppose h : S → S′ is nearly continuous P-almost everywhere. Then P ∘ h⁻¹ ≃w Q ∘ h⁻¹.

Proof. Fix ɛ ≫ 0 and choose N ⊂ S such that Q(N) ≤ ɛ/2 and h is nearly continuous on S \ N. Let B′ be an arbitrary subset of S′, and write B = h⁻¹(B′). Also define P′ = P ∘ h⁻¹ and Q′ = Q ∘ h⁻¹.

Then P(B) = P′(B′). By (iii) of the portmanteau theorem there exists a δ ≃ 0 such that P(B) ≲ Q(B^δ). By near continuity, h maps B^δ \ N into B′^ɛ, which implies Q(B^δ \ N) ≤ Q′(B′^ɛ). Hence

    P′(B′) = P(B) ≲ Q(B^δ) ≤ Q(B^δ \ N) + Q(N) ≤ Q′(B′^ɛ) + ɛ/2.

Reading from end to end we have

    P′(B′) ≤ Q′(B′^ɛ) + ɛ

and since ɛ ≫ 0 and B′ ⊂ S′ were arbitrary, this implies by Lemma 9.2 that the Prohorov distance between P′ and Q′ is infinitesimal.

Corollary 9.9. Let (S, d) and (S′, d′) be metric spaces, let X and Y be random elements of (S, d) such that X ≃w Y, and suppose h : S → S′ is nearly continuous P-almost everywhere, where P = L(X). Then h(X) ≃w h(Y).

9.7 Product Spaces

Let (S₁, d₁) and (S₂, d₂) be metric spaces; then we make S₁ × S₂ into a metric space by giving it the metric d* defined by

    d*((x₁, x₂), (y₁, y₂)) = d′(d₁(x₁, y₁), d₂(x₂, y₂))

where d′ is any metric on R² that is equivalent to one inducing the standard topology; for example, we may use any of

    d′(u, v) = |u| + |v|    (9.9a)
    d′(u, v) = √(u² + v²)    (9.9b)
    d′(u, v) = max(|u|, |v|)    (9.9c)

which are referred to as the L¹, L², and L∞ norms, respectively. We know from the comment at the end of Section 9.5 that it does not matter which d′ we use, since they all give the same notion of near equivalence in P(S₁ × S₂, d*).

For our purposes here (9.9c) is the most useful, because of its special property that

    (B₁ × B₂)^ɛ = B₁^ɛ × B₂^ɛ,  B₁ ⊂ S₁, B₂ ⊂ S₂,    (9.10)

making ɛ-dilations particularly easy to work with.

Let p_i : S₁ × S₂ → S_i denote the coordinate projection (x₁, x₂) ↦ x_i. Then for any P ∈ P(S₁ × S₂, d*) the marginals of P are the distributions P ∘ p_i⁻¹.

Theorem 9.10 (Slutsky). Suppose P, Q ∈ P(S₁ × S₂, d*) and suppose

    P ∘ p_i⁻¹ ≃w Q ∘ p_i⁻¹,  i = 1, 2,    (9.11)

and suppose supp(Q ∘ p₂⁻¹) is a singleton. Then P ≃w Q.

Proof. Suppose without loss of generality that (9.9c) is used to define d* so (9.10) holds. Write P ∘ p_i⁻¹ = P_i and Q ∘ p_i⁻¹ = Q_i, and write {c} = supp Q₂. Choose an infinitesimal ɛ greater than either of the Prohorov distances between opposite sides of (9.11). For B ⊂ S₁ × S₂ define

    B₁ = { x₁ ∈ S₁ : (x₁, c) ∈ B }.

Then Q(B) = Q(B₁ × {c}) = Q₁(B₁).

Also, Q₂({c}) = 1 implies P₂({c}^ɛ) ≃ 1 by the portmanteau theorem and hence

    P₂(S₂ \ {c}^ɛ) ≃ 0.    (9.12)

Moreover, B ⊃ B₁ × {c} implies

    B^ɛ ⊃ (B₁ × {c})^ɛ = B₁^ɛ × {c}^ɛ,

so

    P(B^ɛ) ≥ P(B₁^ɛ × {c}^ɛ) = P₁(B₁^ɛ) − P(B₁^ɛ × (S₂ \ {c}^ɛ)) ≳ P₁(B₁^ɛ),

the second term on the middle expression being infinitesimal because of (9.12). Hence we have

    Q(B) = Q₁(B₁) ≲ P₁(B₁^ɛ) ≲ P(B^ɛ),

the middle relation being an application of Lemma 9.2 and the other relations having already been established. Since B was arbitrary, we have (iv) of the portmanteau theorem.

Corollary 9.11. Suppose X = (X₁, X₂) and Y = (Y₁, Y₂) are random elements of (S₁ × S₂, d*) and Y₂ is a constant random element, and suppose

    X₁ ≃w Y₁ and X₂ ≃w Y₂;

then

    (X₁, X₂) ≃w (Y₁, Y₂).
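A conventional simulation shows the pattern Theorem 9.10 captures: when the second coordinate is nearly constant, marginal closeness already forces joint closeness. The sketch below (hypothetical code, assuming nothing beyond the Python standard library) compares E{g(X₁, X₂)} with E{g(Y₁, c)} for a bounded continuous g when X₂ is tightly concentrated near the constant c.

    import random, math

    rng = random.Random(7)
    reps, c = 50000, 2.0

    def expect_g(pairs):
        """Monte Carlo E{g} for the bounded continuous g(u, v) = cos(u * v)."""
        return sum(math.cos(u * v) for u, v in pairs) / len(pairs)

    # X = (X1, X2): X1 standard-normal-ish, X2 concentrated near c.
    X = [(rng.gauss(0, 1), c + rng.gauss(0, 0.01)) for _ in range(reps)]
    # Y = (Y1, Y2): Y1 with the same marginal law, Y2 exactly the constant c.
    Y = [(rng.gauss(0, 1), c) for _ in range(reps)]

    print(expect_g(X), expect_g(Y))  # nearly equal, as Corollary 9.11 predicts

No assumption about the joint dependence of X₁ and X₂ is used, which is exactly the point of the Slutsky pattern: a nearly constant coordinate carries no dependence information.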


Chapter 10

Distribution Functions

10.1 The Lévy Metric

The Lévy metric on the set of all distribution functions (with finite support) on R is the function λ defined so that λ(F, G) is the infimum of all ɛ > 0 such that

    F(x) ≤ G(x + ɛ) + ɛ and G(x) ≤ F(x + ɛ) + ɛ,  x ∈ R.    (10.1)

It is easy to see that λ actually is a metric.

Actually, the Lévy metric can be applied to any nondecreasing functions R → R. We shall do this in Corollary 10.6 below, where we consider the Lévy distance between the distribution function of a Nelson-style random variable and the distribution function Φ of the normal distribution in Kolmogorov-style probability theory.

We say nondecreasing functions F and G are nearly equal, written F ≃ G, if λ(F, G) ≃ 0. Caution: this does not necessarily imply that random variables having distribution functions F and G are nearly equivalent! See Theorem 10.4 below.

Lemma 10.1. Suppose F and G are nondecreasing functions and G is nearly continuous. Then F ≃ G if and only if F(x) ≃ G(x) for all x ∈ R.

Proof. One direction is trivial. Conversely, suppose λ(F, G) = ɛ ≃ 0. Then by near continuity of G

    G(x) ≃ G(x − ɛ) − ɛ ≤ F(x) ≤ G(x + ɛ) + ɛ ≃ G(x)

holds for all x.

Lemma 10.2. If λ is the Lévy metric and π the Prohorov metric, F and G are distribution functions, and P and Q are the corresponding measures, then

    π(P, Q) ≥ λ(F, G).


Proof. Fix ɛ > π(P, Q). Then

    F(x) = P{(−∞, x]}
         ≤ Q{(−∞, x]^ɛ} + ɛ
         = Q{(−∞, x + ɛ)} + ɛ
         = max_{y<x} G(y + ɛ) + ɛ

holds for all x. In particular we have

    F(x) ≤ G(x + ɛ) + ɛ    (10.2)

whenever x + ɛ is not a jump of G. If x + ɛ is a jump of G, choose δ > 0 sufficiently small so that G has no jump in (x + ɛ, x + ɛ + δ], and, applying (10.2) with ɛ replaced by ɛ + δ, we have

    F(x) ≤ G(x + ɛ + δ) + ɛ + δ = G(x + ɛ) + ɛ + δ,

which, since δ > 0 was arbitrary, implies (10.2) even when x + ɛ is a jump of G. The same argument with F and G swapped finishes the proof.

10.2 Near Equivalence

Corollary 10.3. If X and Y are random variables having distribution functions F and G, then X ≃w Y implies F ≃ G.

Theorem 10.4. If X and Y are random variables having distribution functions F and G, either X or Y is limited almost surely, and F ≃ G, then X ≃w Y.

The limited almost surely condition cannot be suppressed. Let X have the uniform distribution on the even integers between 1 and 2ν and Y have the uniform distribution on the odd integers between 1 and 2ν. If F and G are the corresponding distribution functions, then we have

    0 ≤ G(x) − F(x) ≤ 1/ν

for all x. So λ(F, G) is infinitesimal whenever ν is unlimited. But, if h is defined by h(x) = sin²(πx/2), then h is a limited continuous function, and h(X) = 0 and h(Y) = 1 (for all ω).

Proof. Fix arbitrary appreciable ɛ₁ and ɛ₂. By Lemma 9.7, if one of X and Y is limited almost surely and X ≃w Y, then the other is also limited almost surely. Hence, by Lemma 6.3, there exist limited a and b such that

    F(a) ≤ ɛ₁    (10.3a)
    F(b) ≥ 1 − ɛ₁    (10.3b)

and similarly with F replaced by G. Let h be a nearly continuous function with limited bound M. Then

    |E{h(X) I_{(−∞,a]}(X)}| ≤ M F(a) ≤ M ɛ₁    (10.4a)
    |E{h(X) I_{(b,∞)}(X)}| ≤ M [1 − F(b)] ≤ M ɛ₁    (10.4b)

and similarly with X replaced by Y and F replaced by G.

By near continuity of h there exists δ ≫ 0 such that |h(x) − h(y)| ≤ ɛ₂ whenever |x − y| ≤ δ. There exists a limited natural number n such that (b − a)/n ≤ δ/2. Define c_k = a + (k/n)(b − a) for integer k, noting that c₀ = a and c_n = b. Then

    |h(x) − h(c_k)| ≤ ɛ₂,  c_{k−2} ≤ x ≤ c_{k+2},    (10.5)

and this together with (10.4a) and (10.4b) implies

    |E{h(X)} − Σ_{k=1}^{n+1} h(c_k)[F(c_k) − F(c_{k−1})]| ≤ ɛ₂ + 2M ɛ₁.    (10.6)

The assumption λ(F, G) ≃ 0 implies that there exist b_k and d_k such that b_k ≤ c_k ≤ d_k and b_k ≃ c_k ≃ d_k and

    G(b_k) ≲ F(c_k) ≲ G(d_k), for all k.

The same reasoning that led to (10.6) implies that (10.6) holds with X replaced by Y and F replaced by G. But

    Σ_{k=1}^{n+1} h(c_k)[F(c_k) − F(c_{k−1})] − Σ_{k=1}^{n+1} h(c_k)[G(b_k) − G(b_{k−1})]
        = Σ_{k=1}^{n} [h(c_k) − h(c_{k+1})] · [F(c_k) − G(b_k)]
          + h(c_{n+1})[F(c_{n+1}) − G(b_{n+1})] − h(c₁)[F(c₀) − G(b₀)]    (10.7)

and 0 ≲ F(c_k) − G(b_k) ≲ G(d_k) − G(b_k), and the latter sum to less than or equal to one. Hence, a limited sum of infinitesimals being infinitesimal (Corollary 3.5), the sum in (10.7) is weakly less than ɛ₂ and weakly greater than zero. The other terms are less than or equal to 4Mɛ₁ in absolute value. That is,

    |Σ_{k=1}^{n+1} h(c_k)[F(c_k) − F(c_{k−1})] − Σ_{k=1}^{n+1} h(c_k)[G(b_k) − G(b_{k−1})]| ≤ ɛ₂ + 4M ɛ₁.

Hence by the triangle inequality

    |E{h(X)} − E{h(Y)}| ≤ 3ɛ₂ + 8M ɛ₁.

Since ɛ₁ and ɛ₂ were arbitrary appreciable numbers and M is limited, we actually have E{h(X)} ≃ E{h(Y)}.
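For distribution functions with finite support, the Lévy distance (10.1) can be computed by a direct search, since F and G only change values at finitely many points. The sketch below (hypothetical code, Python standard library only) evaluates the step functions from (point, probability) pairs and finds the smallest ɛ on a grid satisfying (10.1) at every break point; it also reproduces the even/odd-integer counterexample above, where λ(F, G) is tiny although the two laws are far from nearly equivalent.

    import bisect

    def cdf(atoms, x):
        """F(x) for a finite-support law given as sorted [(point, prob), ...]."""
        pts = [p for p, _ in atoms]
        i = bisect.bisect_right(pts, x)
        return sum(w for _, w in atoms[:i])

    def levy_ok(F, G, eps):
        """Does (10.1) hold for this eps? Check at the finitely many break points."""
        xs = sorted({p for p, _ in F} | {p - eps for p, _ in G} |
                    {p for p, _ in G} | {p - eps for p, _ in F})
        return all(cdf(F, x) <= cdf(G, x + eps) + eps and
                   cdf(G, x) <= cdf(F, x + eps) + eps for x in xs)

    def levy(F, G, grid):
        return next((e for e in sorted(grid) if levy_ok(F, G, e)), None)

    nu = 50
    evens = [(2 * k, 1 / nu) for k in range(1, nu + 1)]     # X uniform on 2, 4, ..., 2*nu
    odds = [(2 * k - 1, 1 / nu) for k in range(1, nu + 1)]  # Y uniform on 1, 3, ..., 2*nu - 1
    print(levy(evens, odds, [k / 1000 for k in range(1, 1001)]))  # about 1/nu: tiny

Checking only the break points suffices because both step functions are constant between them, so (10.1) holds everywhere once it holds there.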


10.3 General Normal Distributions

Standard normal random variables were defined in Section 8.1. They are the distributions that arise as limits in the central limit theorem (Corollary 8.1). We found out more about these distributions in Theorem 8.2 and Corollary 8.3. Here we finish the job.

Lemma 10.5. A standard normal random variable is limited almost surely.

Proof. It is established near the end of the proof of Theorem 8.2 that the random variable Z defined in the theorem statement, which is standard normal, is limited almost surely. Hence by Lemma 9.7 every standard normal random variable is limited almost surely.

Corollary 10.6. A random variable is standard normal if and only if its distribution function is nearly equal to Φ defined by

    Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt.    (10.8)

Proof. Theorem 8.2 and Corollary 8.3 assert that one particular standard normal random variable has distribution function nearly equal to Φ. Hence by Corollary 10.3, Theorem 10.4, and Lemma 10.5 a random variable is standard normal if and only if its distribution function is nearly equal to Φ.

Note that Φ is nearly continuous by Lemma 4.1, so by Lemma 10.1 and the preceding lemma a random variable is standard normal if and only if its distribution function F satisfies

    F(x) ≃ Φ(x),  x ∈ R.

If Z is a standard normal random variable, µ is a limited real number, and σ is a positive appreciable real number, then we say X = µ + σZ is general normal, and we also apply this terminology to the distribution of X. Like standard normality, general normality is an external property.

Analogies with Kolmogorov-style probability theory tempt us to call µ the mean and σ the standard deviation, but in Nelson-style probability theory, this is nonsense. A normal distribution, as we have defined the concept, need not have moments anywhere near those of a conventional normal distribution.

Theorem 10.7. If Z is an L² standard normal random variable and X = µ + σZ, where µ and σ are limited and σ ≥ 0, then X is L², E(X) ≃ µ, and var(X) ≃ σ².

Proof. Every standard normal random variable that arises in the central limit theorem (Corollary 8.1) is L². Moreover, such a random variable (8.1) has mean zero and standard deviation one by standardization. Since the map x ↦ x^(a) defined by (6.6) is limited and continuous for limited a, it follows by the approximation lemma (Lemma 6.7) that nearly equivalent L² random variables have nearly equal mean and variance. Hence E(Z) ≃ 0 and var(Z) ≃ 1.

By (6.4a) and (6.4b), X is L². It is elementary that E(X) = µ + σE(Z) and var(X) = σ² var(Z). The assertion about E(X) and var(X) then follows from Theorems 3.7 and 3.10.

Hence if we were to take L² as part of the definition of "normal" then µ and σ would be nearly equal to the mean and standard deviation. We have decided not to take L² as part of the definition, because it is easier to add it when wanted (say "L² and normal") than to remove it when not wanted (say "nearly equivalent to a normal random variable").

Theorem 10.8. If Z is a standard normal random variable and X = µ + σZ, where µ and σ are limited and σ ≥ 0, then the median of the distribution of X is nearly equal to µ and the Φ(1) quantile of X is nearly equal to µ + σ, where Φ is defined by (10.8).

Proof. Writing φ = Φ′, we have for a < b

    Φ(b) − Φ(a) ≥ (b − a) φ(max(|a|, |b|))

by the law of the mean and the unimodality of φ. From φ(z) = exp(−z²/2)/√(2π) and Theorems 3.10 and 3.11 we conclude

    Φ(a) ≪ Φ(b), whenever −∞ ≪ a ≪ b ≪ ∞.    (10.9)

Since Z is limited almost surely, its p-th quantiles for 0 ≪ p ≪ 1 are limited, and (10.9) implies these quantiles are unique up to near equality. In particular, the 0.5 and Φ(1) ≈ 0.8413 quantiles are unique up to near equality, and nearly equal to zero and one, respectively. Then Theorem 3.10 implies the 0.5 and Φ(1) quantiles of the distribution of X are nearly equal to µ and µ + σ, respectively.
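Theorem 10.8 is easy to check against a concrete nearly normal random variable. The sketch below (hypothetical code, standard library only) takes Z to be a standardized Binomial(n, 1/2) — nearly standard normal by Theorem 8.2 — forms X = µ + σZ, and reads off the 0.5 and Φ(1) quantiles of X from the exact binomial distribution; they land near µ and µ + σ.

    import math

    n = 10000
    sd = math.sqrt(n * 0.25)  # standard deviation of Binomial(n, 1/2)

    def quantile_X(p, mu, sigma):
        """Smallest value of X = mu + sigma * Z with Pr(X <= x) >= p,
        where Z = (K - n/2)/sd and K ~ Binomial(n, 1/2)."""
        acc = 0.0
        for k in range(n + 1):
            # log-scale pmf avoids overflow: log C(n, k) + n * log(1/2)
            acc += math.exp(math.lgamma(n + 1) - math.lgamma(k + 1)
                            - math.lgamma(n - k + 1) - n * math.log(2))
            if acc >= p:
                return mu + sigma * (k - n / 2) / sd
        return mu + sigma * (n / 2) / sd

    Phi1 = 0.5 * (1 + math.erf(1 / math.sqrt(2)))  # Phi(1), about 0.8413
    mu, sigma = 3.0, 2.0
    print(quantile_X(0.5, mu, sigma))    # near mu = 3
    print(quantile_X(Phi1, mu, sigma))   # near mu + sigma = 5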




Chapter 11

Characteristic Functions

11.1 Definitions

As always, R denotes the real number system. Now we introduce C for the complex number system.

The characteristic function of a random variable X is a function ϕ : R → C defined by

    ϕ(t) = E{e^{itX}},  t ∈ R.

Complex variables play only a limited, algebraic role in the theory. By the Euler formula

    e^{iy} = cos(y) + i sin(y)

characteristic functions are determined by two real-valued functions, which are t ↦ E{cos(tX)}, the real part of ϕ, and t ↦ E{sin(tX)}, the imaginary part of ϕ. The only virtue in using complex numbers is that certain identities, such as

    e^{i(u+v)} = e^{iu} e^{iv},

are "obvious" phrased in terms of complex exponentials and not "obvious" when phrased in terms of trigonometric identities.

For x and y in R^d we denote the standard inner product by ⟨·, ·⟩, that is,

    ⟨x, y⟩ = Σ_{i=1}^{d} x_i y_i,

where x = (x₁, ..., x_d) and similarly with x replaced by y.

The characteristic function of a random vector X taking values in R^d is a complex-valued function ϕ defined on all of R^d by

    ϕ(t) = E{e^{i⟨t,X⟩}},  t ∈ R^d.
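In Nelson-style theory every random variable has finite support, so a characteristic function is just a finite sum of complex exponentials. A minimal sketch (hypothetical code, not from the report) using Python's built-in complex arithmetic:

    import cmath, math

    def char_fn(support, t):
        """Characteristic function E{exp(itX)} of a finite-support law.
        `support` maps each point x to its probability."""
        return sum(p * cmath.exp(1j * t * x) for x, p in support.items())

    # A fair coin on {-1, +1}: phi(t) = cos(t), purely real by symmetry.
    coin = {-1.0: 0.5, +1.0: 0.5}
    print(char_fn(coin, 0.7), math.cos(0.7))  # real parts agree; imaginary part 0

The coin example shows the decomposition into real and imaginary parts concretely: the sine terms cancel for a distribution symmetric about zero.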


If d is limited, we say R^d is limited-dimensional. When R^d is limited-dimensional, an element of R^d is

• infinitesimal if and only if all its components are infinitesimal
• and limited if and only if all its components are limited.

As usual, unlimited means not limited, and appreciable means limited and noninfinitesimal.

Clearly, a limited-dimensional random vector is infinitesimal, appreciable, or unlimited if and only if its L∞ norm is, where this norm is defined by

    ‖x‖∞ = max{ |x_i| : i = 1, ..., d }.

As usual, we write x ≃ y to mean x − y ≃ 0. If we define the L^p norms by

    ‖x‖_p = (Σ_{i=1}^{d} |x_i|^p)^{1/p}

for 1 ≤ p < ∞, then the obvious inequality

    ‖x‖∞ ≤ ‖x‖_p ≤ d · ‖x‖∞

implies that a limited-dimensional random vector is infinitesimal, appreciable, or unlimited if and only if its L^p norm is likewise.

When d is unlimited, the L^p norms no longer agree about which vectors are infinitesimal, appreciable, and unlimited. Hence our original definition based on the behavior of components no longer makes sense either. Thus we shall see that the characteristic function theory we develop here is useful only for limited-dimensional random vectors.

11.2 Convergence I

A famous and very important theorem of Kolmogorov-style probability theory says that a sequence of random variables converges in distribution if and only if the characteristic functions converge pointwise. In this section we start to develop the Nelson-style analog, working on the easy direction of the "if and only if" (the other direction is dealt with in Section 11.6).

Theorem 11.1. If two limited-dimensional random vectors are nearly equivalent, then their characteristic functions are nearly equal at all limited argument values.

Proof. For limited t, the function x ↦ e^{i⟨t,x⟩} is limited and (nearly) continuous because of

    e^{i⟨t,x⟩} − e^{i⟨t,y⟩} = e^{i⟨t,x⟩}(1 − e^{i⟨t,y−x⟩}),

which implies

    |e^{i⟨t,x⟩} − e^{i⟨t,y⟩}| ≤ |1 − e^{i⟨t,y−x⟩}| = √(|1 − cos(⟨t, y − x⟩)|² + |sin(⟨t, y − x⟩)|²),

because ⟨t, y − x⟩ ≃ 0 whenever x ≃ y by the Cauchy-Schwarz inequality and because sine and cosine are nearly continuous at zero (and everywhere else) by Lemma 4.1.

We cannot ask for near equality at all t. Consider the random variable concentrated at zero, which has characteristic function identically equal to one, and another random variable concentrated at a nonzero infinitesimal ɛ, which has characteristic function t ↦ e^{itɛ}. These two random variables are nearly equivalent, but e^{itɛ} is not nearly equal to one for all t.

11.3 The Discrete Fourier Transform

Let

    T = { kɛ : k ∈ Z, |k| ≤ n }    (11.1)

where ɛ > 0, Z is the set of all integers, and n is a positive integer. We call such a set T a symmetric grid; we call ɛ the spacing of T and write ɛ = spac(T); we call N = 2n + 1 the cardinality of T and write N = card(T).

If T is defined as in (11.1) then we define

    T* = { kɛ* : k ∈ Z, |k| ≤ n }    (11.2)

where

    ɛ* = 2π/(Nɛ)    (11.3)

and call T* the symmetric grid conjugate to T. Of course spac(T*) = ɛ* and card(T*) = N.

If f : T → C is any function, then

    f*(t) = ɛ Σ_{x∈T} f(x) e^{itx}    (11.4)

defines a function T* → C that we call the discrete Fourier transform of f. The terminology "discrete Fourier transform" is also used for the mapping C^T → C^{T*} given by f ↦ f*.

Theorem 11.2. The discrete Fourier transform f ↦ f* is invertible with inverse f* ↦ f given by

    f(x) = (ɛ*/(2π)) Σ_{t∈T*} f*(t) e^{−itx}.    (11.5)


Proof. Plugging (11.4) into the right hand side of (11.5) we obtain

    (ɛ·ɛ*/(2π)) Σ_{t∈T*} Σ_{y∈T} f(y) e^{ity} e^{−itx} = (1/N) Σ_{t∈T*} Σ_{y∈T} f(y) e^{it(y−x)}
        = Σ_{y∈T} f(y) (1/N) Σ_{t∈T*} e^{it(y−x)}.

We now claim that

    (1/N) Σ_{t∈T*} e^{it(y−x)}    (11.6)

is zero if y ≠ x and one if y = x, which implies (11.5). To establish this claim, suppose y − x = kɛ, so k is an integer and |k| ≤ 2n. Then

    (1/N) Σ_{t∈T*} e^{it(y−x)} = (1/N) Σ_{m=−n}^{n} e^{iɛ*mɛk}
        = (1/N) Σ_{m=−n}^{n} e^{2πimk/N}
        = e^{−2πink/N} (1/N) Σ_{m=0}^{2n} e^{2πimk/N}.

Now we have two cases. If k = 0, which happens if and only if x = y, then the exponentials are all equal to one and this reduces to (1/N) Σ_{m=0}^{2n} 1 = 1. Otherwise, k ≠ 0, and, recall, |k| ≤ 2n < N. Define ω = exp(2πik/N). Then 0 < |k| < N implies ω ≠ 1, and

    Σ_{m=0}^{2n} e^{2πimk/N} = Σ_{m=0}^{2n} ω^m = (1 − ω^N)/(1 − ω).

Now ω^N = 1, so that finishes the proof of the claim that (11.6) is zero if y ≠ x and one otherwise.

If f : T^d → C is any function, then

    f*(t) = ɛ^d Σ_{x∈T^d} f(x) e^{i⟨t,x⟩}    (11.7)

defines a function (T*)^d → C that we call the multivariate discrete Fourier transform of f. This terminology is also used for the mapping f ↦ f*.

Theorem 11.3. The multivariate discrete Fourier transform f ↦ f* is invertible with inverse f* ↦ f given by

    f(x) = (ɛ*/(2π))^d Σ_{t∈(T*)^d} f*(t) e^{−i⟨t,x⟩}.    (11.8)


Proof. Plugging (11.7) into the right hand side of (11.8) we obtain

    (ɛ·ɛ*/(2π))^d Σ_{t∈(T*)^d} Σ_{y∈T^d} f(y) e^{i⟨t,y⟩} e^{−i⟨t,x⟩}
        = (1/N^d) Σ_{t∈(T*)^d} Σ_{y∈T^d} f(y) e^{i⟨t,y−x⟩}
        = Σ_{y∈T^d} f(y) (1/N^d) Σ_{t∈(T*)^d} e^{i⟨t,y−x⟩}
        = Σ_{y∈T^d} f(y) Π_{k=1}^{d} (1/N) Σ_{t_k∈T*} e^{it_k(y_k−x_k)}

where t = (t₁, ..., t_d) and similarly for x and y. Now we know from the proof of Theorem 11.2 that the innermost average is zero unless y_k = x_k, in which case it is one. Thus the product is zero unless y = x, in which case it is one. Thus the only nonzero term in the outermost sum is when y = x, in which case it is f(x).

Notice that if we have a random variable concentrated on T whose probability function is given by

    Pr(X = x) = ɛ f(x),  x ∈ T,

then the restriction of the characteristic function to T* is the discrete Fourier transform f*. Similarly, if we have a random vector concentrated on T^d whose probability function is given by

    Pr(X = x) = ɛ^d f(x),  x ∈ T^d,

then the restriction of the characteristic function to (T*)^d is the discrete Fourier transform f*.
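Theorem 11.2 is a statement about finite sums, so it can be verified directly. The sketch below (hypothetical code, standard library only) builds a symmetric grid, transforms an arbitrary function f by (11.4), inverts by (11.5), and confirms the round trip returns f up to floating-point error.

    import cmath, math

    n, eps = 8, 0.5                      # symmetric grid parameters
    N = 2 * n + 1
    eps_star = 2 * math.pi / (N * eps)   # spacing of the conjugate grid, (11.3)
    T = [k * eps for k in range(-n, n + 1)]
    T_star = [k * eps_star for k in range(-n, n + 1)]

    def dft(f):
        """(11.4): f*(t) = eps * sum_x f(x) e^{itx}."""
        return {t: eps * sum(f[x] * cmath.exp(1j * t * x) for x in T)
                for t in T_star}

    def idft(fs):
        """(11.5): f(x) = (eps*/2pi) * sum_t f*(t) e^{-itx}."""
        return {x: eps_star / (2 * math.pi) *
                   sum(fs[t] * cmath.exp(-1j * t * x) for t in T_star)
                for x in T}

    f = {x: math.exp(-abs(x)) for x in T}       # any function on T will do
    back = idft(dft(f))
    print(max(abs(back[x] - f[x]) for x in T))  # ~1e-15: the inversion is exact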


11.4 The Double Exponential Distribution

Let T be a symmetric grid that is also a near line, so spac(T) is infinitesimal and max(T) is unlimited. Let α > 0 be a real parameter. Then we call the distribution concentrated on T having unnormalized density

    h_α(x) = e^{−α|x|},  x ∈ T,

double exponential with rate parameter α. The normalizing constant for this distribution is calculated as follows:

    c(α) = ɛ Σ_{x∈T} e^{−α|x|}
         = ɛ Σ_{k=−n}^{n} e^{−αɛ|k|}
         = ɛ (2 Σ_{k=0}^{n} e^{−αɛk} − 1)
         = ɛ (2 · (1 − e^{−αɛ(n+1)})/(1 − e^{−αɛ}) − 1)
         = ɛ · (1 − 2e^{−αɛ(n+1)} + e^{−αɛ})/(1 − e^{−αɛ})
         = ɛ · (1 − 2A^{n+1} + A)/(1 − A)    (11.9)

where we have introduced A = e^{−αɛ}, and the normalized density is given by

    f_α(x) = h_α(x)/c(α),  x ∈ T.

Lemma 11.4. A double exponential distribution with unlimited rate parameter is infinitesimal almost surely. A double exponential distribution with noninfinitesimal rate parameter is limited almost surely.

Proof. Suppose X has the double exponential distribution with rate parameter α. Then for 0 < m < n

    Pr(|X| ≥ mɛ) = (2ɛ/c(α)) Σ_{k=m}^{n} e^{−αɛk}
                 = (2ɛ/c(α)) e^{−αɛm} Σ_{k=0}^{n−m} e^{−αɛk}
                 = (2ɛ/c(α)) e^{−αɛm} (1 − e^{−αɛ(n−m+1)})/(1 − e^{−αɛ})
                 = (2ɛ/c(α)) A^m (1 − A^{n−m+1})/(1 − A)
                 = 2A^m (1 − A^{n−m+1})/(1 − 2A^{n+1} + A).

If mɛ ≫ 0 and α is unlimited, then by the rules of exponentiation (Section 3.3) A ≤ 1 and A^m ≃ A^{n+1} ≃ 0. Since 0 ≤ A^{n−m+1} ≤ 1, we have

    Pr(|X| ≥ mɛ) ≃ 0.    (11.10)

So the first statement follows from condition (ii) of Lemma 6.2.

If mɛ ≃ ∞ and α ≫ 0, then A ≤ 1 and A^m ≃ A^{n+1} ≃ 0 again follow from the rules of exponentiation, and we obtain (11.10) again. So the second statement follows from condition (ii) of Lemma 6.3.

The characteristic function of the double exponential distribution with rate parameter α is given by

    ϕ_α(t) = (ɛ/c(α)) Σ_{x∈T} e^{−α|x|+itx}
           = (ɛ/c(α)) Σ_{k=−n}^{n} e^{−αɛ|k|+itɛk}
           = (ɛ/c(α)) (Σ_{k=0}^{n} e^{−(α−it)ɛk} + Σ_{k=0}^{n} e^{−(α+it)ɛk} − 1)
           = (ɛ/c(α)) ((1 − e^{−(α−it)ɛ(n+1)})/(1 − e^{−(α−it)ɛ}) + (1 − e^{−(α+it)ɛ(n+1)})/(1 − e^{−(α+it)ɛ}) − 1).

Since a double exponential distribution is symmetric about zero, the characteristic function is real, which is obvious from the fact that the two fractions in the large parentheses in the form above are complex conjugates of each other and hence their sum is real.

Using e^{iw} = cos(w) + i sin(w) and clearing complex quantities from the denominator by multiplying numerator and denominator by their complex conjugates gives (after much formula manipulation, which has been checked using a computer algebra system)

    ϕ_α(t) = (ɛ/c(α)) · (1 − 2e^{−αɛ(n+1)} cos[(n+1)ɛt] + 2e^{−αɛ(n+2)} cos(nɛt) − e^{−2αɛ}) / (1 − 2e^{−αɛ} cos(ɛt) + e^{−2αɛ})
           = (ɛ/c(α)) · (1 − 2A^{n+1} cos[(n+1)ɛt] + 2A^{n+2} cos(nɛt) − A²) / (1 − 2A cos(ɛt) + A²)
           = (1 + 2A^{n+1}[1 − cos((n+1)ɛt) − A + A cos(nɛt)] / [(1 − A)(1 − 2A^{n+1} + A)])
             / (1 + 2A[1 − cos(ɛt)] / (1 − A)²)    (11.11)
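Both the closed form (11.9) for c(α) and the closed form (11.11) for ϕ_α are finite-sum identities, so they can be checked numerically. A small sketch (hypothetical code, standard library only) comparing each closed form against direct summation over the grid:

    import cmath, math

    n, eps, alpha = 2000, 0.05, 1.5
    A = math.exp(-alpha * eps)

    # c(alpha): direct sum over T versus the closed form (11.9)
    c_direct = eps * sum(math.exp(-alpha * abs(k * eps)) for k in range(-n, n + 1))
    c_closed = eps * (1 - 2 * A ** (n + 1) + A) / (1 - A)
    print(c_direct, c_closed)

    # phi_alpha(t): direct sum versus the closed form (11.11)
    def phi_direct(t):
        s = sum(math.exp(-alpha * abs(k * eps)) * cmath.exp(1j * t * k * eps)
                for k in range(-n, n + 1))
        return (eps * s / c_closed).real  # imaginary part cancels by symmetry

    def phi_closed(t):
        num = 1 + 2 * A ** (n + 1) * (1 - math.cos((n + 1) * eps * t)
                                      - A + A * math.cos(n * eps * t)) \
                  / ((1 - A) * (1 - 2 * A ** (n + 1) + A))
        den = 1 + 2 * A * (1 - math.cos(eps * t)) / (1 - A) ** 2
        return num / den

    for t in (0.0, 0.5, 2.0):
        # with alpha appreciable, both are close to 1/(1 + t^2/alpha^2), cf. (11.17)
        print(t, phi_direct(t), phi_closed(t), 1 / (1 + t * t / alpha ** 2))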


We need the following bounds for sine and cosine:

    x − x³/3! ≤ sin(x) ≤ x,  x ≥ 0    (11.12a)
    1 − x²/2 ≤ cos(x) ≤ 1 − x²/2 + x⁴/4!,  x ≥ 0    (11.12b)

These are easily proved in order of degree of the bound. The function f₁(x) = x − sin(x) has derivative 1 − cos(x), which is nonnegative, hence f₁ is nondecreasing and greater than or equal to f₁(0) = 0 for x ≥ 0. The function f₂(x) = cos(x) − 1 + x²/2 has derivative f₁(x), which is nonnegative, hence f₂ is nondecreasing and greater than or equal to f₂(0) = 0 for x ≥ 0. The function f₃(x) = sin(x) − x + x³/3! has derivative f₂(x), which is nonnegative, hence f₃ is nondecreasing and greater than or equal to f₃(0) = 0 for x ≥ 0. The function f₄(x) = 1 − x²/2 + x⁴/4! − cos(x) has derivative f₃(x), which is nonnegative, hence f₄ is nondecreasing and greater than or equal to f₄(0) = 0 for x ≥ 0.

Since both sides of each inequality in (11.12b) are symmetric functions of x, we actually have

    1 − x²/2 ≤ cos(x) ≤ 1 − x²/2 + x⁴/4!,  x ∈ R,    (11.13)

hence

    1 − cos(x) ∼ x²/2,  x ≃ 0.    (11.14)

For non-infinitesimal x we use the bounds, also from (11.13),

    1 − x²/2 ≤ cos(x) ≤ 1 − (x²/2)(1 − π²/12),  |x| ≤ π.    (11.15)

Lemma 11.5. Let ϕ_α denote the discrete Fourier transform of the double exponential density with appreciable rate parameter α defined on a symmetric grid T that is also a near line and satisfies

    spac(T) · max(T) ≃ ∞.    (11.16)

Then

    ϕ_α(t) ≃ 1/(1 + t²/α²),  |t| ≪ ∞,    (11.17)

and

    0 < ϕ_α(t) ≤ 1/(1 + t²/(6α²)),  t ∈ T*.    (11.18)

The near equality (11.17) is mildly interesting, in that we derive it without reference to the right hand side being the Kolmogorov-style characteristic function of the Kolmogorov-style (continuous) double exponential distribution with rate parameter α. Of course, it could also be derived from Corollary 4.8 and the Kolmogorov-style result.

The bounds (11.18) will be crucial in our proof of the characteristic function convergence theorem in Section 11.6.

The condition (11.16) is inelegant but harmless, because we only use this lemma as a technical tool and can always assure that the condition is satisfied. For example, if we want max(T) = M, then we can choose n = ⌈M^{3/2}⌉ and ɛ = spac(T) = M/n ∼ M^{−1/2}.

Proof. Recall that ϕ_α(t) is given by (11.11), and that A = e^{−αɛ}. It follows from the rules of exponentiation (Section 3.3) that A ≃ 1 but A^{n+1} ≃ 0.

We first show that the numerator of (11.11) is nearly equal to one for t ∈ T*. From |cos(x)| ≤ 1 and A ≃ 1 we have |1 − cos((n+1)ɛt) − A + A cos(nɛt)| ≲ 2, and from A^{n+1} ≃ 0 we have 1 − 2A^{n+1} + A ≃ 2.

Since 1 − exp(−ɛα) ∼ αɛ by Lemma 4.9 and the following comments, and since exp(x) ≥ 1 + x for x ≥ 0 by the Maclaurin series for the exponential function, we have for any appreciable positive η

    A^n/(1 − A) = exp(−nɛα)/(1 − exp(−ɛα)) ≤ (1 + η)/(αɛ(1 + nαɛ)) ≤ n⁻¹α⁻²ɛ⁻²(1 + η).

Hence (α being appreciable) condition (11.16) implies A^n/(1 − A) ≃ 0. Putting all this together we have the numerator of (11.11) nearly equal to one.

Using A ≃ 1 and 1 − A ∼ αɛ, which were derived above, and (11.14) we obtain

    2A[1 − cos(ɛt)]/(1 − A)² ∼ t²/α²

and combining this with our result about the numerator of (11.11) we get (11.17), using Lemma 4.4 several times.

Since we only consider t ∈ T*, we have |t| ≤ nɛ*, hence |ɛt| ≤ nɛɛ* = 2πn/N < π. Thus we can apply (11.15) to the term 1 − cos(ɛt) in the denominator of (11.11), obtaining

    (ɛ²t²/2)(1 − π²/12) ≤ 1 − cos(ɛt).

We know from our earlier analysis that A/(1 − A)² ∼ 1/(αɛ)². Hence for any positive δ ≪ 1 we have

    (t²/α²) · δ(1 − π²/12) ≤ 2A[1 − cos(ɛt)]/(1 − A)²,

and, since 1/6 ≪ 1 − π²/12, we can choose δ so this becomes

    t²/(6α²) ≤ 2A[1 − cos(ɛt)]/(1 − A)².

Combining this with our result about the numerator of (11.11) gives one inequality in (11.18). The positivity assertion is obvious.

11.5 Convolution

For x and y in T we define x ⊕ y to be the sum "modulo T", that is, the unique element of T having the form x + y + kNɛ where N = card(T) and k is an integer. If nɛ = max(T) so N = 2n + 1 (notation we have been using all along), then, for example, nɛ ⊕ ɛ = −nɛ and nɛ ⊕ nɛ = −ɛ.


For x and y in T and t in T* we have

    exp[i(x ⊕ y)t] = exp[i(x + y + kNɛ)t]
                   = exp(ixt) exp(iyt) exp(ikNɛt)
                   = exp(ixt) exp(iyt) exp(ikmNɛɛ*)
                   = exp(ixt) exp(iyt) exp(2πikm)
                   = exp(ixt) exp(iyt)

for some integers k and m.

For any random variable Z concentrated on T let ϕ_Z denote the restriction of its characteristic function to T*. This conflicts with our earlier notation, which is still in force: if X is a random variable, then ϕ_X is its characteristic function, and if α is a real number, then ϕ_α is the characteristic function of the double exponential distribution with rate parameter α.

If X and Y are independent random variables concentrated on T, then

    ϕ_{X⊕Y}(t) = E{e^{it(X⊕Y)}} = E{e^{itX} e^{itY}} = ϕ_X(t) ϕ_Y(t).

Of course, except for ⊕ rather than +, this fact is familiar from classical probability theory. So long as we are interested only in random variables that are limited almost surely and so long as T is a near line, there is no practical difference between X + Y and X ⊕ Y. The fact proved in this section will serve just as well as its classical counterpart.

For vectors we define ⊕ to act coordinatewise:

    (x₁, ..., x_d) ⊕ (y₁, ..., y_d) = (x₁ ⊕ y₁, ..., x_d ⊕ y_d).

Then ϕ_{X⊕Y}(t) = ϕ_X(t) ϕ_Y(t) also holds where X and Y are random elements of T^d and all three characteristic functions are restricted to (T*)^d.

11.6 Convergence II

11.6.1 One-Dimensional

We continue to let T be a symmetric grid that is also a near line and satisfies (11.16). For any random variable X concentrated on T, we denote its density by f_X, so

    Pr{X = x} = ɛ f_X(x).

We also continue to let ϕ_X denote the characteristic function of X.

Lemma 11.6. Let X and Y be random variables and Z_α a double exponential random variable with appreciable rate parameter α independent of X and Y, all three random variables concentrated on T. Suppose

    ϕ_X(t) ≃ ϕ_Y(t),  t ∈ T*, |t| ≪ ∞.

Then

    f_{X⊕Z_α}(x) ≃ f_{Y⊕Z_α}(x),  x ∈ T.    (11.19)

Proof. By (11.5)

    f_{X⊕Z_α}(x) = (ɛ*/(2π)) Σ_{t∈T*} ϕ_{X⊕Z_α}(t) e^{−itx}
                 = (ɛ*/(2π)) Σ_{t∈T*} ϕ_X(t) ϕ_α(t) e^{−itx}    (11.20)

and similarly with X replaced by Y. Since ϕ_X(t) e^{−itx} is bounded in modulus by 1, and similarly with X replaced by Y, and by the bound (11.18) on ϕ_α(t), the sum (11.20) satisfies the conditions for Corollary 4.7, and similarly with X replaced by Y. Thus that corollary implies (11.19).

Lemma 11.7 (Scheffé). Let X and Y be almost surely limited random variables concentrated on the symmetric grid T that is also a near line, and suppose

    f_X(x) ≃ f_Y(x),  x ∈ T.    (11.21)

Then

    ɛ Σ_{x∈T} |f_X(x) − f_Y(x)| ≃ 0    (11.22)

and for any limited function h on T

    E{h(X)} = ɛ Σ_{x∈T} h(x) f_X(x) ≃ ɛ Σ_{x∈T} h(x) f_Y(x) = E{h(Y)}.    (11.23)

As usual when we take an eponym from classical probability theory, this is not what is usually called Scheffé's lemma (Lehmann, 1959, p. 351) but the lemma that plays the same role in radically elementary probability theory.

Note that the conclusion (11.23) is stronger than near equivalence of X and Y, since it holds for all limited h, not just limited nearly continuous h. In Kolmogorov-style probability theory this type of convergence (what Scheffé's lemma implies) is called convergence in total variation. We shall not use it enough to need a name for it.

Proof. By (iii) of Lemma 6.3, for any η ≫ 0 we can choose a limited a such that

    Pr{|X| > a} = ɛ Σ_{x∈T, |x|>a} f_X(x) ≤ η/3

and similarly with X replaced by Y. Hence

    ɛ Σ_{x∈T, |x|>a} |f_X(x) − f_Y(x)| ≤ 2η/3,

and by Theorem 4.5 and (11.21)

    ɛ Σ_{x∈T, |x|≤a} |f_X(x) − f_Y(x)| ≃ 0.

Hence by the triangle inequality

    ɛ Σ_{x∈T} |f_X(x) − f_Y(x)| ≲ 2η/3.

Since η ≫ 0 was arbitrary, this gives (11.22). And (11.22) immediately implies (11.23).

Theorem 11.8. Let X and Y be almost surely limited random variables, and suppose

    ϕ_X(t) ≃ ϕ_Y(t), for all limited t.    (11.24)

Then X and Y are nearly equivalent.

Proof. Choose a symmetric grid T that is also a near line and satisfies (11.16) such that max(T) is larger than the maximum of the supports of |X| and |Y|. Define g : R → T by g(x) = max{ t ∈ T : t ≤ x }. Then g(X) ≃ X always, hence almost surely, hence g(X) is nearly equivalent to X by part (iii) of Theorem 7.3, and similarly with X replaced by Y. By Theorem 11.1 we have ϕ_X(t) ≃ ϕ_{g(X)}(t) for limited t, and similarly with X replaced by Y; hence we have ϕ_{g(X)}(t) ≃ ϕ_{g(Y)}(t) for limited t, because ≃ is an external equivalence relation (Corollary 3.9).

Let Z_α have the double exponential distribution on T with rate parameter α and be independent of X and Y.

First suppose α is appreciable. It is clear that g(X) is limited almost surely. By Lemma 11.4 so is Z_α. Since x ⊕ y = x + y whenever x and y are limited, and a sum of limited numbers is limited, it is clear that g(X) ⊕ Z_α is also limited almost surely. Then by Lemma 11.6 and by Lemma 11.7 and the comment following it, g(X) ⊕ Z_α is nearly equivalent to g(Y) ⊕ Z_α.

Second suppose α is unlimited. Then by Lemma 11.4, Z_α ≃ 0 almost surely. Hence by Slutsky's theorem (Theorem 9.10) the random vector (g(X), Z_α) is nearly equivalent to the random vector (g(X), 0), and similarly with X replaced by Y. The map (x, y) ↦ x ⊕ y is continuous at points (x, y) such that x and y are limited. Hence by the continuous mapping theorem (Theorem 9.8), g(X) ⊕ Z_α is nearly equivalent to g(X) ⊕ 0 = g(X), and similarly with X replaced by Y.

Fix a limited nearly continuous function h : T → R, and fix η ≫ 0. Then

    |E{h(g(X) ⊕ Z_α)} − E{h(g(X))}| ≤ η/3

for all unlimited α and hence by overspill for some limited α, and similarly with X replaced by Y, and we may use the same limited α for both.


Since

    E{h(g(X) ⊕ Z_α)} ≃ E{h(g(Y) ⊕ Z_α)},

we have

    |E{h(g(X))} − E{h(g(Y))}| ≲ 2η/3

by the triangle inequality. Since η ≫ 0 was arbitrary,

    E{h(g(X))} ≃ E{h(g(Y))}.

Since h was an arbitrary limited nearly continuous function, g(X) is nearly equivalent to g(Y). Since near equivalence is an external equivalence relation (Lemma 7.2), X ≃w g(X) ≃w g(Y) ≃w Y implies X ≃w Y.

11.6.2 Limited-Dimensional

Theorem 11.9. Let X and Y be almost surely limited random vectors, taking values in R^d for limited d, and suppose (11.24) holds. Then X and Y are nearly equivalent.

Proof. The proof is very similar to that of Theorem 11.8. We just need the multivariate analogs of each tool used in that proof. We already have the inversion theorem for the discrete Fourier transform (Theorem 11.3).

The analog of the symmetric grid T in the proof of Theorem 11.8 is T^d where T is a symmetric grid that is also a near line and satisfies (11.16) such that max(T) is larger than all values of ‖X‖∞ and ‖Y‖∞. As usual, ɛ = spac(T) and ɛ* = spac(T*).

The analog of the function g in the proof of Theorem 11.8 is a map from R^d to T^d that operates coordinatewise, the action on each coordinate being the g in the proof of Theorem 11.8. We again call this map g. Then again we have g(X) ≃ X almost surely, hence g(X) and X are nearly equivalent, and similarly with X replaced by Y.

For the analog of Z_α in the proof of Theorem 11.8 we use here the random vector, also denoted Z_α, having independent and identically distributed components all of which have the double exponential distribution with rate parameter α. Lemma 6.1 and Lemma 11.4 imply Z_α ≃ 0 almost surely when α ≃ ∞ and Z_α is limited almost surely when α ≫ 0.

The multivariate analog of Lemma 11.6 is proved much the same way, the only additional wrinkle being external induction on the dimension. Now by Theorem 11.3 we have

    f_{g(X)⊕Z_α}(x) = (ɛ*/(2π))^d Σ_{t∈(T*)^d} ϕ_{g(X)}(t) Π_{j=1}^{d} ϕ_α(t_j) e^{−it_j x_j}
        = (ɛ*/(2π)) Σ_{t₁∈T*} ϕ_α(t₁) e^{−it₁x₁} (ɛ*/(2π)) Σ_{t₂∈T*} ϕ_α(t₂) e^{−it₂x₂}
          · · · (ɛ*/(2π)) Σ_{t_d∈T*} ϕ_α(t_d) e^{−it_d x_d} ϕ_{g(X)}(t)

where x and t denote vectors with components x_i and t_i and where the sum in the first line over (T*)^d is replaced by d iterated sums over T* in the next two lines, and similarly with X replaced by Y. Hence

    f_{g(X)⊕Z_α}(x) − f_{g(Y)⊕Z_α}(x) = (ɛ*/(2π)) Σ_{t₁∈T*} ϕ_α(t₁) e^{−it₁x₁}
        (ɛ*/(2π)) Σ_{t₂∈T*} ϕ_α(t₂) e^{−it₂x₂} · · ·
        (ɛ*/(2π)) Σ_{t_d∈T*} ϕ_α(t_d) e^{−it_d x_d} [ϕ_{g(X)}(t) − ϕ_{g(Y)}(t)].

Now by assumption ϕ_X(t) ≃ ϕ_Y(t) for limited t, and by Theorem 11.1 we have ϕ_X(t) ≃ ϕ_{g(X)}(t) for limited t, and similarly with X replaced by Y. Thus the term in brackets above is infinitesimal for limited t. Corollary 4.7 now shows that the bottom line above is infinitesimal when α is appreciable. Then external induction shows that f_{g(X)⊕Z_α}(x) − f_{g(Y)⊕Z_α}(x) is infinitesimal (for all x ∈ T^d).

The multivariate analog of Lemma 11.7 is proved much the same way, the only additional wrinkle being external induction on the dimension. Since X is limited almost surely, so is g(X) by Lemma 9.7, and similarly with X replaced by Y. Fix an appreciable α. Then g(X) ⊕ Z_α is also limited almost surely. For any η ≫ 0 there is a limited a such that the box B^d, where

    B = { x ∈ T : |x| ≤ a },

satisfies Pr{g(X) ⊕ Z_α ∈ B^d} ≥ 1 − η, and similarly with X replaced by Y. Hence

    ɛ^d Σ_{x∈T^d} |f_{g(X)⊕Z_α}(x) − f_{g(Y)⊕Z_α}(x)|
        ≤ 2η + ɛ^d Σ_{x∈B^d} |f_{g(X)⊕Z_α}(x) − f_{g(Y)⊕Z_α}(x)|
        = 2η + ɛ Σ_{x₁∈B} ɛ Σ_{x₂∈B} · · · ɛ Σ_{x_d∈B} |f_{g(X)⊕Z_α}(x) − f_{g(Y)⊕Z_α}(x)|

where x = (x₁, ..., x_d). Theorem 4.5 implies the innermost sum is infinitesimal. By external induction and Theorem 4.5 all sums on the right hand side are infinitesimal. Since η ≫ 0 was arbitrary, the left hand side is infinitesimal. As in the one-dimensional case, it is immediate that g(X) ⊕ Z_α and g(Y) ⊕ Z_α are nearly equivalent (when α is appreciable).

The rest of the proof follows that of Theorem 11.8 almost without change. The Slutsky argument is the same, and the rest is unchanged except now h maps T^d to R.

Corollary 11.10 (Cramér-Wold). Let X and Y be random vectors taking values in R^d where d is limited. Then X and Y are nearly equivalent if and only if ⟨t, X⟩ and ⟨t, Y⟩ are nearly equivalent for every limited t ∈ R^d.

Proof. For scalar s and vector t

    ϕ_{⟨t,X⟩}(s) = E{e^{is⟨t,X⟩}} = ϕ_X(st).


Suppose X and Y are nearly equivalent and t is limited. Then st is limited for all limited s, hence
$$\varphi_{\langle t,X\rangle}(s) \simeq \varphi_{\langle t,Y\rangle}(s), \qquad s \text{ limited}.$$
Conversely, suppose ⟨t, X⟩ and ⟨t, Y⟩ are nearly equivalent for all limited t. Then
$$\varphi_X(t) \simeq \varphi_Y(t), \qquad t \text{ limited},$$
and X and Y are nearly equivalent by Theorem 11.9.

Corollary 11.11. Suppose X_i and Y_i are independent, limited-dimensional random vectors for i = 1, 2, and $X_1 \overset{w}{\simeq} X_2$ and $Y_1 \overset{w}{\simeq} Y_2$. Then
$$(X_1, Y_1) \overset{w}{\simeq} (X_2, Y_2).$$
The simple proof using characteristic functions and Theorems 11.1 and 11.9 is left as an exercise.
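The identity φ_{⟨t,X⟩}(s) = φ_X(st) that drives the Cramér-Wold device is easy to check numerically. Here is a minimal sketch (Python with NumPy; the bivariate normal example and all parameter values are illustrative assumptions, not from the text) comparing the empirical characteristic function of a projection with the exact characteristic function of the vector:

```python
import numpy as np

rng = np.random.default_rng(0)

# Bivariate normal sample; mean zero, covariance cov (an arbitrary choice).
n = 200_000
cov = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)

t = np.array([0.5, -1.0])   # a limited vector t
s = 0.7                     # a limited scalar s

# Empirical characteristic function of the scalar projection <t, X> at s.
phi_proj = np.mean(np.exp(1j * s * (X @ t)))

# Exact characteristic function of X at the vector st:
# for a mean-zero Gaussian, phi_X(u) = exp(-u' cov u / 2).
u = s * t
phi_exact = np.exp(-0.5 * u @ cov @ u)

print(phi_proj, phi_exact)  # agree up to Monte Carlo error
```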




Part III

Statistics


Chapter 12

De Finetti's Theorem

12.1 Exchangeability

Fix a positive integer ν, and let I = {1, ..., ν}. Fix an arbitrary set S, and let X_1, ..., X_ν be random elements of S. Let Π be the set of all permutations (bijective functions) π : I → I. We say the sequence X_1, ..., X_ν is exchangeable if
$$\Pr(X_i = s_i,\ i \in I) = \Pr(X_i = s_{\pi(i)},\ i \in I) \tag{12.1}$$
for all π ∈ Π and any choice of s_1, ..., s_ν in S.

It simplifies discussion if we consider the sequence as a single object: a random tuple X = (X_1, ..., X_ν), which we take to be a random element of the space S^I. We will also consider S^I, as in set theory, to be the set of all functions I → S. Then, if s ∈ S^I and π ∈ Π, it makes sense to write s ∘ π, a composition of functions I → I → S and hence also an element of S^I.

The exchangeable algebra on S^I is the family F of functions f : S^I → R satisfying
$$f(s) = f(s \circ \pi), \qquad s \in S^I,\ \pi \in \Pi.$$
The atoms of F are the equivalence classes
$$[s] = \{\, s \circ \pi : \pi \in \Pi \,\}.$$
A random element X of S^I is exchangeable if
$$\Pr(X = s) = \Pr(X = s \circ \pi), \qquad s \in S^I,\ \pi \in \Pi, \tag{12.2}$$
which says the same thing as (12.1) in different notation, or, what is equivalent, if
$$\Pr(X = s_1) = \Pr(X = s_2), \qquad s_1 \in S^I,\ s_2 \in S^I,\ [s_1] = [s_2]. \tag{12.3}$$
The exchangeable algebra on the sample space induced by X is
$$\mathcal{E} = \{\, f \circ X : f \in F \,\}.$$
The atoms of E are the nonempty events of the form X^{−1}([s]), the nonempty preimages of atoms of F.
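Definition (12.2) is easy to check mechanically for a small example. The following sketch (Python; the two-component Bernoulli mixture, ν = 4, and S = {0, 1} are illustrative assumptions, not from the text) enumerates all tuples and all permutations:

```python
import itertools
import math

# An exchangeable law on S^I with S = {0, 1} and nu = 4: a 50/50 mixture of
# IID Bernoulli(0.2) and IID Bernoulli(0.8) coordinates (illustrative choice).
nu = 4

def pr(s):
    """Pr(X = s); for the mixture it depends on s only through sum(s)."""
    k = sum(s)
    return 0.5 * 0.2**k * 0.8**(nu - k) + 0.5 * 0.8**k * 0.2**(nu - k)

# Verify (12.2): Pr(X = s) = Pr(X = s o pi) for every s in S^I and pi in Pi.
for s in itertools.product([0, 1], repeat=nu):
    for pi in itertools.permutations(range(nu)):
        s_pi = tuple(s[pi[i]] for i in range(nu))  # the tuple s o pi
        assert abs(pr(s) - pr(s_pi)) < 1e-15

print("all", 2**nu, "tuples pass all", math.factorial(nu), "permutation checks")
```

The atoms [s] are the orbits under permutation; for this example an orbit is determined by sum(s), which is why pr depends on s only through that sum.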


12.2 A Simple De Finetti Theorem

We write X_i for the i-th component of X, defined by X_i(ω) = X(ω)(i), and indicate this relationship by X = (X_1, ..., X_ν), and this gets us back to the simple notation used in (12.1).

Theorem 12.1 (De Finetti). Let X = (X_1, ..., X_ν) be exchangeable with ν unlimited. Let E be the exchangeable algebra on the sample space induced by X. Then for any limited functions h_1, ..., h_n from S to R with n limited
$$E\Bigl\{\prod_{i=1}^n h_i(X_i) \Bigm| \mathcal{E}\Bigr\} \simeq \prod_{i=1}^n E\bigl\{h_i(X_i) \bigm| \mathcal{E}\bigr\} \tag{12.4a}$$
and
$$E\Bigl\{\prod_{i=1}^n h_i(X_i)\Bigr\} \simeq E\Bigl[\prod_{i=1}^n E\bigl\{h_i(X_i) \bigm| \mathcal{E}\bigr\}\Bigr]. \tag{12.4b}$$

The condition that the h_i are limited functions means that every value is limited. It also means they have a simultaneous limited bound because, the sample space being finite, the maximum of |h_i(X_j(ω))| over all i, j, and ω is achieved, hence limited.

In (12.4a) the two sides of the equation are random variables, call them Y_L and Y_R. What the equation means is Y_L(ω) ≃ Y_R(ω) for all ω ∈ Ω.

Lemma 12.2. Suppose X is an exchangeable random element of S^I and P is a uniformly distributed random element of Π independent of X. Then X and X ∘ P are equal in law.

Elements of S^I are maps I → S and elements of Π are maps I → I. Hence X ∘ P is a map I → I → S, which is an element of S^I. More precisely, X(ω) ∘ P(ω) is, for each ω, a map I → I → S. The equal in law assertion is that
$$\Pr(X = s) = \Pr(X \circ P = s), \qquad s \in S^I. \tag{12.5}$$

Proof. Fix s ∈ S^I and let K be the number of elements in [s]. Then it is clear from (12.3) that
$$\Pr(X = s) = \frac{1}{K} \Pr(X \in [s]).$$
From the bijectivity of permutations and (12.2) and the observation above
$$\Pr(X \circ P = s) = \Pr(X = s \circ P^{-1}) = \Pr(X = s \circ P^{-1} \circ \pi) = \frac{1}{K} \Pr(X \in [s \circ P^{-1} \circ \pi])$$
but Pr(X ∈ [s ∘ P^{−1} ∘ π]) = Pr(X ∈ [s]) by the definition of [·].
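Lemma 12.2 can also be seen in simulation. A sketch (Python; the mixture law, ν, and the sample size are illustrative assumptions, not from the text): draw X, draw an independent uniform permutation P, and compare the empirical distributions of X and X ∘ P:

```python
import random
from collections import Counter

random.seed(0)
nu, n_sims = 3, 200_000

def draw_x():
    """Draw X from an exchangeable law (illustrative): IID Bernoulli(theta)
    coordinates with theta itself drawn uniformly from {0.2, 0.8}."""
    theta = random.choice([0.2, 0.8])
    return tuple(int(random.random() < theta) for _ in range(nu))

counts_x, counts_xp = Counter(), Counter()
for _ in range(n_sims):
    x = draw_x()
    p = list(range(nu))
    random.shuffle(p)                        # uniform random P, independent of X
    x_p = tuple(x[p[i]] for i in range(nu))  # the tuple X o P
    counts_x[x] += 1
    counts_xp[x_p] += 1

for s in sorted(counts_x):                   # relative frequencies nearly agree
    print(s, counts_x[s] / n_sims, counts_xp[s] / n_sims)
```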


Lemma 12.3. With the setup of the theorem, let E_n denote the algebra generated by X_1, ..., X_n and E. Then
$$E\bigl\{h_n(X_n) \bigm| \mathcal{E}_{n-1}\bigr\} \simeq E\bigl\{h_n(X_n) \bigm| \mathcal{E}\bigr\} \tag{12.6}$$
holds for each limited integer n ≥ 2.

Proof. The conditional distributions involved in (12.6) are very simple. Fix an ω in Ω. Write X(ω) = s = (s_1, ..., s_ν). The atom of E containing ω is
$$A_\omega = \{\, \omega' \in \Omega : X(\omega') \in [s] \,\}. \tag{12.7a}$$
On this atom we have
$$E\bigl\{h_n(X_n) \bigm| \mathcal{E}\bigr\} = \frac{1}{\nu} \sum_{i=1}^{\nu} h_n(s_i) \tag{12.7b}$$
because by Lemma 12.2 h_n(X_n) has the same distribution on this atom as h_n(s_{P(n)}), where P is a uniformly distributed random permutation and P(n) puts equal probability on the points 1, ..., ν.

The atom of E_{n−1} containing ω is
$$B_\omega = \{\, \omega' \in A_\omega : X_i(\omega') = s_i,\ i = 1, \ldots, n-1 \,\}. \tag{12.8a}$$
On this atom,
$$E\bigl\{h_n(X_n) \bigm| \mathcal{E}_{n-1}\bigr\} = \frac{1}{\nu - n + 1} \sum_{i=n}^{\nu} h_n(s_i) \tag{12.8b}$$
by an argument similar to that establishing (12.7b). We must show that the right hand sides of (12.7b) and (12.8b) are nearly equal.

Each of the h_i is limited, hence by the comment following the statement of the theorem we may assume there is a simultaneous limited bound a for all of them, and
$$\Bigl| \sum_{i=1}^{n-1} h_n(s_i) \Bigr| \le (n-1)a,$$
which implies
$$\frac{1}{\nu} \sum_{i=1}^{\nu} h_n(s_i) \simeq \frac{1}{\nu} \sum_{i=n}^{\nu} h_n(s_i) \tag{12.9a}$$
because (n − 1)a/ν is infinitesimal.

Also,
$$\Bigl| \frac{1}{\nu - n + 1} \sum_{i=n}^{\nu} h_n(s_i) - \frac{1}{\nu} \sum_{i=n}^{\nu} h_n(s_i) \Bigr| = \frac{n-1}{\nu(\nu - n + 1)} \Bigl| \sum_{i=n}^{\nu} h_n(s_i) \Bigr| \le \frac{(n-1)a}{\nu}$$


is infinitesimal, which implies
$$\frac{1}{\nu} \sum_{i=n}^{\nu} h_n(s_i) \simeq \frac{1}{\nu - n + 1} \sum_{i=n}^{\nu} h_n(s_i). \tag{12.9b}$$
Putting (12.9a) and (12.9b) together finishes the proof.

Lemma 12.4. For any random variables X and Y and any algebra A, X ≃ Y implies E_A X ≃ E_A Y and EX ≃ EY.

Proof. For any ε ≫ 0 we have X − ε ≤ Y ≤ X + ε, which implies E_A X − ε ≤ E_A Y ≤ E_A X + ε, and this can hold for every ε ≫ 0 only if E_A X − E_A Y is infinitesimal. The proof for unconditional expectation is similar.

Proof of the theorem.
$$
E\Bigl\{\prod_{i=1}^{n} h_i(X_i) \Bigm| \mathcal{E}\Bigr\}
= E\Bigl[ E\Bigl\{\prod_{i=1}^{n} h_i(X_i) \Bigm| \mathcal{E}_{n-1}\Bigr\} \Bigm| \mathcal{E}\Bigr]
= E\Bigl[ \prod_{i=1}^{n-1} h_i(X_i) \cdot E\bigl\{h_n(X_n) \bigm| \mathcal{E}_{n-1}\bigr\} \Bigm| \mathcal{E}\Bigr]
\simeq E\Bigl[ \prod_{i=1}^{n-1} h_i(X_i) \cdot E\bigl\{h_n(X_n) \bigm| \mathcal{E}\bigr\} \Bigm| \mathcal{E}\Bigr]
= E\bigl\{h_n(X_n) \bigm| \mathcal{E}\bigr\} \cdot E\Bigl\{\prod_{i=1}^{n-1} h_i(X_i) \Bigm| \mathcal{E}\Bigr\}
\tag{12.10}
$$
the first equality being E_E E_{E_{n−1}} = E_E, the second and third equalities being E_A XY = Y E_A X when Y ∈ A for A = E_{n−1} and A = E, respectively, and the near equality being Lemmas 12.3 and 12.4. Then (12.4a) follows from (12.10) by external induction, and (12.4b) follows by another application of Lemma 12.4.

12.3 Philosophical Interpretation

The philosophical interpretation of the theorem is that the random variables X_1, X_2, ..., X_ν are "nearly" conditionally independent given the exchangeable algebra by (12.4a) and (exactly) marginally identically distributed (conditionally and unconditionally) by exchangeability, and their joint conditional distribution is "nearly" determined by the marginal distributions by (12.4b).

Hence, if we adopt a Bayesian statistical model for data X_1, ..., X_n that assumes they are exactly conditionally independent and identically distributed given the exchangeable algebra, then we make only infinitesimal errors.

This setup is a bit more abstract than most discussions of Bayesian theory (but no more abstract than most discussions of de Finetti theorems). It is hard to see the exchangeable algebra playing the role of a Bayesian parameter. But at


least in Nelson-style theory there is no heavy probabilistic abstraction to wade through. We know that the exchangeable algebra E induces a partition at(E) on the sample space. If θ is a random element of any space that induces the same partition, then we can write E{ · | θ} in place of E{ · | E} and things will look more Bayesian just from the change of notation.

Thus the philosophical import of de Finetti's theorem is that independent and identically distributed (IID) assumptions arise in quite different ways for the Bayesian and the frequentist. The frequentist must assume that the data actually are IID, which is a very strong assumption. The Bayesian only "assumes" that the data are exchangeable, a much weaker assumption. Moreover, for a subjective Bayesian, exchangeability is not really an assumption but merely a lack of knowledge that would allow one to treat X differently from X ∘ P (using the notation of Lemma 12.2).

12.4 A Fancier De Finetti Theorem

Theorem 12.5 (De Finetti II). Let X = (X_1, ..., X_ν) be exchangeable with ν unlimited. Let E be the exchangeable algebra on the sample space induced by X. Then for any functions h_1, ..., h_n from S to R with n limited such that h_1(X_1), ..., h_n(X_n) and ∏_{i=1}^n h_i(X_i) and ∏_{i=1}^n E{h_i(X_i) | E} are L^1,
$$E\Bigl\{\prod_{i=1}^{n} h_i(X_i) \Bigm| \mathcal{E}\Bigr\} \simeq \prod_{i=1}^{n} E\bigl\{h_i(X_i) \bigm| \mathcal{E}\bigr\}, \quad \text{almost surely}, \tag{12.11a}$$
and
$$E\Bigl\{\prod_{i=1}^{n} h_i(X_i)\Bigr\} \simeq E\Bigl[\prod_{i=1}^{n} E\bigl\{h_i(X_i) \bigm| \mathcal{E}\bigr\}\Bigr]. \tag{12.11b}$$

In (12.11a) the two sides of the equation are random variables, call them Y_L and Y_R. What the equation means is that for every ε ≫ 0 there exists an event N such that Pr(N) ≤ ε and Y_L(ω) ≃ Y_R(ω) except for ω in N.

Proof. Equation (12.11b) is implied by (12.4b) and Lemma 6.7. By Theorem 6.10, h_1(X_1), ..., h_n(X_n) and ∏_{i=1}^n h_i(X_i) are L^1 on almost every atom of E. On such atoms equation (12.11a) is implied by (12.4a) and Lemma 6.7.

Whether one prefers Theorem 12.1 or Theorem 12.5 is a matter of taste. The restriction to limited functions (the external analog of bounded functions) makes the statement of Theorem 12.1 much simpler, only involving nonstandard analysis through the basic idea of the existence of infinitesimals. Theorem 12.5 is stronger, but only in a rather trivial way involving Lemma 6.7, and involves the more complicated concepts almost surely and L^1, which use much more nonstandard analysis.

Theorem 12.1 only involves master's level calculations. Both the statement and the proof are elementary. The key issue is that (12.7b) is very close to (12.8b) whenever ν is much larger than n. Everything else is just applying


rules of probability theory that are no different in Nelson-style theory from Kolmogorov-style theory (master's level or PhD level).
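Theorem 12.1 can also be watched in action with a small simulation (a sketch; the Bernoulli-mixture law, ν, n, and the h_i are illustrative assumptions, not from the text). By (12.7b), E{h_i(X_i) | E} is computable on each atom as the average of h_i over the whole observed tuple, so both sides of (12.4b) can be estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Both sides of (12.4b), estimated by Monte Carlo.  The mixture law, nu, n,
# and the functions h_i are all illustrative; by (12.7b), E{h_i(X_i) | E} is
# the average of h_i over the whole tuple on each atom.
nu, n, n_sims = 2_000, 3, 20_000
h = [lambda x: x, lambda x: x, lambda x: 1.0 - x]   # limited functions on S = {0, 1}

lhs = rhs = 0.0
for _ in range(n_sims):
    theta = rng.choice([0.2, 0.8])                  # exchangeable: mixture of IID
    x = (rng.random(nu) < theta).astype(float)
    lhs += np.prod([h[i](x[i]) for i in range(n)])  # product h_1(X_1) ... h_n(X_n)
    rhs += np.prod([h[i](x).mean() for i in range(n)])  # product of atom averages

print(lhs / n_sims, rhs / n_sims)                   # nearly equal when nu >> n
```

With ν = 2000 and n = 3 the two estimates typically agree closely; shrinking ν toward n makes the discrepancy appreciable, which is the numerical face of the requirement that ν be unlimited while n is limited.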


Chapter 13

Almost Sure Convergence

We introduced almost sure convergence in Section 7.1. Now we apply it to a few things.

13.1 The Law of Large Numbers

Theorem 16.3 in Nelson (1987) is the law of large numbers.

Theorem 13.1 (The Law of Large Numbers). Suppose X_1, X_2, ..., X_ν are IID L^1 random variables with mean µ. Define
$$\bar X_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \tag{13.1}$$
Then X̄_n converges to µ almost surely.

We do not usually deal with almost sure convergence by directly applying the definition. What we usually use is the following theorem, which is Theorem 7.2 in Nelson (1987).

Theorem 13.2. Let X_1, ..., X_ν be random variables. Then X_n converges almost surely to zero if and only if
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} |X_n| \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ \lambda \gg 0. \tag{13.2}$$

13.2 The Glivenko-Cantelli Theorem

Suppose X_1, ..., X_ν are IID random variables defined on a finite sample space having common (marginal) distribution function F, and F̂_n is the empirical distribution function defined by
$$\hat F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I_{(-\infty,x]}(X_i), \qquad x \in \mathbb{R}.$$
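Before putting Theorem 13.2 to use, here is a quick numerical illustration of the criterion (13.2) for the sample means of Theorem 13.1 (a sketch; the exponential population and all constants are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate Pr{ sup_{m <= n <= nu} |Xbar_n - mu| >= lam } for several starting
# indices m (m plays the role of the unlimited mu in (13.2)).  The Exponential(1)
# population, nu, lam, and the grid of m values are illustrative assumptions.
nu, n_sims, lam, mu = 10_000, 500, 0.05, 1.0

for m in (10, 100, 1_000):
    exceed = 0
    for _ in range(n_sims):
        x = rng.exponential(1.0, size=nu)
        xbar = np.cumsum(x) / np.arange(1, nu + 1)  # running means Xbar_1, ..., Xbar_nu
        if np.abs(xbar[m - 1:] - mu).max() >= lam:
            exceed += 1
    print(m, exceed / n_sims)  # falls as m grows, as (13.2) requires in the limit
```

The exceedance probability visibly falls as the starting index grows, which is what (13.2) asserts. We now return to the empirical distribution function F̂_n just defined.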


Then for each fixed x the random variable F̂_n(x) is a "sample mean" of the form (13.1) with "population mean" F(x) = E{F̂_n(x)}. Also F̂_n(x) is a limited, hence L^1, random variable. Thus the law of large numbers applies and gives almost surely
$$\bigl| \hat F_n(x) - F(x) \bigr| \simeq 0, \qquad n \simeq \infty, \tag{13.3a}$$
which by Theorem 13.2 holds if and only if
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \bigl| \hat F_n(x) - F(x) \bigr| \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ \lambda \gg 0. \tag{13.3b}$$
These statements (13.3a) and (13.3b) are different statements for each fixed x. In (13.3a) the exception sets may depend on x.

The Glivenko-Cantelli theorem says these statements hold uniformly in x.

Theorem 13.3 (Glivenko-Cantelli). Suppose X_1, ..., X_ν are IID random variables defined on a finite sample space having common (marginal) distribution function F, and suppose F̂_n is the empirical distribution function for sample size n. Then almost surely
$$\sup_{x \in \mathbb{R}} \bigl| \hat F_n(x) - F(x) \bigr| \simeq 0, \qquad n \simeq \infty. \tag{13.4a}$$
By Theorem 13.2 the statement (13.4a) holds almost surely if and only if
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \sup_{x \in \mathbb{R}} \bigl| \hat F_n(x) - F(x) \bigr| \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ \lambda \gg 0. \tag{13.4b}$$
As in the usual proof of the Kolmogorov-style Glivenko-Cantelli theorem, the proof starts with the statements (13.3a), one for each fixed x, which are implied by the law of large numbers. These statements do not obviously imply (13.4a) because the exception sets in (13.3a) depend on x. Some device must be used to reduce an uncountable infinity of exception sets to a limited number. Actually we work with the statements that do not explicitly involve "almost surely" and want to derive the one statement (13.4b) from the many statements (13.3b), but here too, the implication is not obvious. As in the usual Kolmogorov-style proof, the device we use is monotonicity of distribution functions and compactness of the unit interval.

Proof. Let T be the support of F. To simplify notation choose ε > 0 smaller than any spacing in T so we always have
$$F(x - \epsilon) = \sup_{\substack{y \in T \\ y < x}} F(y).$$
Fix λ ≫ 0, choose a limited integer k such that 1/k ≤ λ/4 (possible because λ ≫ 0), and for i = 1, ..., k let x_i be the least element of T such that F(x_i) ≥ i/k


13.2. THE GLIVENKO-CANTELLI THEOREM 99(meaning for each i there is a unique element of T that can be x i <strong>and</strong> notmeaning that the x i are distinct—it may be that x i = x i ′ when i < i ′ , in whichcase x i = x j for i ≤ j ≤ i ′ ). Also, since T is finite, there exists a (nonunique)point x 0 satisfying F (x 0 ) = 0.Now for i = 1, . . ., k <strong>and</strong> x i−1 < x < x i we havêF n (x) ≥ F (x) + λ −→ ̂F n (x i − ɛ) ≥ F (x i−1 ) + λthe last implication holding because ofF (x i−1 ) ≥ i − 1k−→ ̂F n (x i − ɛ) ≥ F (x i − ɛ) + λ 2> F (x i ) − 2 k > F (x i − ɛ) − 2 k ≥ F (x i − ɛ) − λ 2 ,the interpretation of these statements being “omega by omega,” that is, ̂F n (x)is a r<strong>and</strong>om variable defined bŷF n (x)(ω) = 1 nn∑ (I (−∞,x] Xi (ω) )i=1<strong>and</strong> the implications mean that the set of ω such that ̂F n (x)(ω) ≥ F (x) + λholds is included in the set of ω such that ̂F n (x i − ɛ)(ω) ≥ F (x i − ɛ) + λ 2 holds.Similarly,̂F n (x) ≤ F (x) − λ −→ ̂F n (x i−1 ) ≤ F (x i − ɛ) − λthe last implication holding because of−→ ̂F n (x i−1 ) ≤ F (x i−1 ) − λ 2F (x i − ɛ) < i k ≤ F (x i−1) + 1 k ≤ F (x i−1) + λ 4 .Putting these together <strong>and</strong> using monotonicity <strong>and</strong> subadditivity of probabilitygives⎧⎫⎨∣ ⎬∣∣Pr⎩ supsup ̂Fn (x) − F (x) ∣ ≥ λµ≤n≤ν x∈R⎭x i−1


Summing over i and applying (13.3b) at each of the limited number of points x_i − ε and x_{i−1}, we get
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \sup_{x \in \mathbb{R}} \bigl| \hat F_n(x) - F(x) \bigr| \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,$$
the sup over x ∈ R reducing to the sup over x_0 ≤ x because F̂_n(x) = F(x) = 0 for x < x_0, and a sum of a limited number of infinitesimals being infinitesimal. Since λ ≫ 0 was arbitrary, this is (13.4b), which proves the theorem.

13.3 Prohorov Consistency

Suppose X_1, ..., X_ν are IID random elements of a metric space S having common (marginal) probability measure P, and P̂_n is the empirical measure for sample size n, defined by
$$\hat P_n(A) = \frac{1}{n} \sum_{i=1}^{n} I_A(X_i), \qquad A \subset S.$$
Then for each fixed A the random variable P̂_n(A) is a "sample mean" with "population mean" P(A) = E{P̂_n(A)}, and the law of large numbers gives almost surely
$$\bigl| \hat P_n(A) - P(A) \bigr| \simeq 0, \qquad n \simeq \infty, \tag{13.6a}$$


which holds if and only if
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \bigl| \hat P_n(A) - P(A) \bigr| \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ \lambda \gg 0. \tag{13.6b}$$
These statements (13.6a) and (13.6b) are different statements for each fixed A. In (13.6a) the exception sets may depend on A.

We investigate the conditions under which P̂_n converges to P. In order to discuss convergence, we need a metric for convergence of probability measures. The natural one to use is the Prohorov metric, which was defined in Section 9.3. We say P̂_n converges almost surely to P (in the Prohorov metric) if almost surely
$$\pi(\hat P_n, P) \simeq 0, \qquad n \simeq \infty, \tag{13.7a}$$
which holds if and only if
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \pi(\hat P_n, P) \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ \lambda \gg 0. \tag{13.7b}$$
We say P is nearly tight if for every ε ≫ 0 there exists a set F of limited cardinality such that P(F^ε) ≥ 1 − ε, where F^ε, the ε-dilation of F, was defined in Section 9.3.

Note that the distribution of a random variable X that is limited almost surely is nearly tight, because for any ε ≫ 0 there exists a limited a such that Pr(|X| ≥ a) ≤ ε. So, if we define
$$F = \{\, k\epsilon : k \in \mathbb{Z},\ |k\epsilon| \le a \,\},$$
then F has limited cardinality and F^ε covers [−a, a]. Hence Pr(X ∈ F^ε) ≥ 1 − ε.

Theorem 13.4. Suppose X_1, ..., X_ν are IID random elements of a metric space having common (marginal) probability measure P and P̂_n is the empirical measure for sample size n. Then a necessary and sufficient condition for P̂_n to converge almost surely to P (in the Prohorov metric) is that P is nearly tight.

Proof. Let π denote the Prohorov metric. Then, by Lemma 9.2 and the comments on page 58 about the behavior of this metric,
$$\pi(\hat P_n, P) \ge \lambda \;\longleftrightarrow\; (\forall \epsilon > 0)(\exists A \subset S)\bigl(\hat P_n(A) > P(A^\lambda) + \lambda - \epsilon\bigr) \;\longleftrightarrow\; \sup_{A \subset S}\bigl(\hat P_n(A) - P(A^\lambda)\bigr) \ge \lambda$$
the arrows meaning that the statements are equivalent "omega by omega," that is, if one holds at some point in the sample space, then so do the others. Hence (13.7b) is equivalent to
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \sup_{A \subset S} \bigl(\hat P_n(A) - P(A^\lambda)\bigr) \ge \lambda \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ \lambda \gg 0. \tag{13.8}$$
First, we prove necessity. Assume (13.8), and suppose to get a contradiction that P is not nearly tight. Hence there is an ε ≫ 0 such that we do not have


P(F^ε) ≥ 1 − ε for any set F of limited cardinality. Let A_n = {X_1, ..., X_n}, that is, A_n is the support of P̂_n, so P̂_n(A_n) = 1. Then a necessary condition that (13.8) hold is
$$\Pr\{1 - P(A_n^\epsilon) \ge \epsilon\} \simeq 0, \qquad n \simeq \infty. \tag{13.9}$$
This implies
$$\Pr\{1 - P(A_n^\epsilon) \ge \epsilon\} < \tfrac{1}{2} \tag{13.10}$$
holds for all unlimited n, hence by overspill for some limited n. But this is a contradiction because A_n has at most n points and, by our choice of ε, this implies P(A_n^ε) < 1 − ε always (for all omega) and hence Pr{1 − P(A_n^ε) > ε} = 1. That finishes the proof of the necessity of near tightness.

Now, we prove sufficiency. Assume near tightness. Fix λ ≫ 0, and choose a set F = {x_1, ..., x_m} of limited cardinality such that P(F^{λ/3}) ≥ 1 − λ/3.

Disjointify the balls {x_i}^{λ/3}, defining recursively
$$B_i = \{x_i\}^{\lambda/3} \setminus \bigcup_{j=1}^{i-1} B_j.$$
Then the nonempty B_i partition F^{λ/3}.

Let $\mathcal{I}$ be the set of all subsets of {1, ..., m}, and define
$$A_I = \bigcup_{i \in I} B_i, \qquad I \in \mathcal{I}.$$
We do not need A_∅ as a notation for the empty set, so let us reuse the notation, redefining this to be
$$A_\emptyset = S \setminus F^{\lambda/3}.$$
Note that the diameter of each B_i is less than λ, so if a set A contains even one point of B_i, then A^λ contains all of B_i. Hence
$$A \cap B_i \ne \emptyset \text{ for all } i \in I \;\longrightarrow\; A_I \subset A^\lambda, \qquad I \in \mathcal{I} \setminus \{\emptyset\}.$$
By the law of large numbers, as in (13.6b),
$$\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \bigl| \hat P_n(A_I) - P(A_I) \bigr| \ge \tfrac{\lambda}{3} \Bigr\} \simeq 0, \qquad \mu \simeq \infty,\ I \in \mathcal{I}.$$
Consider a point in the sample space (an omega) such that
$$\sup_{\mu \le n \le \nu} \sup_{I \in \mathcal{I}} \bigl| \hat P_n(A_I) - P(A_I) \bigr| < \tfrac{\lambda}{3}$$
holds. Then for any A ⊂ S define
$$I = \{\, i \in \{1, \ldots, m\} : A \cap B_i \ne \emptyset \,\}$$


so when µ ≤ n ≤ ν we have
$$
\hat P_n(A) - P(A^\lambda)
= \hat P_n(A \setminus F^{\lambda/3}) + \hat P_n(A \cap A_I) - P(A^\lambda)
\le \hat P_n(A_\emptyset) + \hat P_n(A_I) - P(A_I)
< \tfrac{2\lambda}{3} + \hat P_n(A_I) - P(A_I)
< \lambda
$$
Hence for µ ≃ ∞
$$
\Pr\Bigl\{ \sup_{\mu \le n \le \nu} \sup_{A \subset S} \bigl(\hat P_n(A) - P(A^\lambda)\bigr) \ge \lambda \Bigr\}
\le \sum_{I \in \mathcal{I}} \Pr\Bigl\{ \sup_{\mu \le n \le \nu} \bigl| \hat P_n(A_I) - P(A_I) \bigr| \ge \tfrac{\lambda}{3} \Bigr\}
$$
Since $\mathcal{I}$ has limited cardinality (at most 2^m, which is limited by Theorem 2.4), this sum is infinitesimal by Corollary 3.8, and, since λ ≫ 0 was arbitrary, we have (13.8) and are done.

13.4 Discussion

13.4.1 Kolmogorov-Style Pathology

The concept we call "near tightness" that plays such an important role in Theorem 13.4 is the Nelson-style analog of what van der Vaart and Wellner (1996, p. 17) call pre-tightness. Their Lemma 1.3.2 says that a Borel probability measure on a metric space is pre-tight if and only if it is separable. A Borel probability measure P on a metric space is separable when it has a separable support, that is, there exists a Borel measurable set A such that P(A) = 1 and A has a countable dense subset.

So far this is fairly simple, but, according to the discussion on p. 24 of van der Vaart and Wellner (1996), it is undecidable under the usual axioms of set theory whether nonseparable Borel probability measures can exist. It is known that one could add to mathematics the axiom that they do not exist without causing inconsistency (that is, if mathematics is inconsistent with this new axiom, then it is already inconsistent before this axiom is added). But it is "apparently unknown" (say van der Vaart and Wellner, 1996) whether nonseparable probability measures can consistently exist (whatever that means).

It is incredible what a morass we sink into with this notion. Countable additivity is supposed to make things simple. Here it seems to make things about as complicated as they can possibly be. Kolmogorov-style probability theory allows an incredible amount of pathology and requires extreme technical virtuosity to navigate around it.

In Nelson-style theory these things are simple. Non-tight (not nearly tight) probability measures exist, and there are very simple examples of them, such as


the discrete uniform distribution on the integers 1, ..., ν with ν unlimited. It is clear in Theorem 13.4 what (near) tightness does and why it is needed.

13.4.2 Near Tightness and Topology

On a slightly different point, why did we choose to make our concept near tightness the analog of Kolmogorov-style pre-tightness rather than tightness? The reason is that Kolmogorov-style tightness involves compact sets, and it is not possible to define compact sets in "radically elementary" nonstandard analysis. It is possible to define compact sets in the full theory of nonstandard analysis (Nelson, unpublished, p. 13 for the definition of compact subsets of R and p. 15 for the definition of compact subsets of general topological spaces), but we don't want to use the full theory. Thus we avoid topology and the notions compact, closed, and open. The same issue is why closed and open sets do not appear in our version of the portmanteau theorem (Theorem 9.6) and are replaced by ε-dilations and ε-erosions.

13.4.3 Prohorov Consistency and Glivenko-Cantelli

Although we drew an analogy between Theorems 13.3 and 13.4, and they are quite analogous in their conditions and their proofs, the Kolmogorov-style analog of Theorem 13.4 is not what is called a uniform Glivenko-Cantelli theorem (see van der Vaart and Wellner, 1996, Chapter 2.4).

13.4.4 Statistical Inference

Whatever the Prohorov-consistency theorem is called, it does show in what sense statistical estimation is possible with no assumptions about the true unknown distribution except near tightness. We know from Theorem 13.4 that, in the language of statistics, P̂_n is a consistent estimator of the true unknown distribution P.

This remains true for the "minimum Prohorov distance estimator." If we have a statistical model $\mathcal{P}$, which we take to be a finite family of probability distributions, and we let P̃_n be any estimator (that is, a function of the data X_1, ..., X_n taking values in $\mathcal{P}$) that satisfies
$$\pi(\tilde P_n, \hat P_n) = \min_{P \in \mathcal{P}} \pi(P, \hat P_n),$$
then P̃_n is a consistent estimator of the true unknown distribution P by the triangle inequality.

Admittedly, "minimum Prohorov distance estimators" are very hard to calculate and not used in practical applications. We will have to do a lot more to get a useful Nelson-style theory of statistics. But it's a start.
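To make "minimum Prohorov distance estimator" concrete on a toy scale, here is a brute-force sketch (Python with NumPy; the five-point support, the candidate family, the sample size, and the bisection tolerance are all illustrative assumptions, not from the text). On a finite space, π(P, Q) ≤ ε can be tested directly from the characterization P(A) ≤ Q(A^ε) + ε for all A, so the distance can be found by bisection and the estimator by enumeration over the model:

```python
import itertools
import numpy as np

# Illustrative finite metric space: points on a line, metric d(x, y) = |x - y|.
pts = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
m = len(pts)

def dilation(A, eps):
    """Indices within distance eps of the index set A (the eps-dilation A^eps)."""
    return {j for j in range(m) for i in A if abs(pts[j] - pts[i]) <= eps}

def prohorov_le(P, Q, eps):
    """Check P(A) <= Q(A^eps) + eps for every nonempty A (one half of pi <= eps)."""
    for r in range(1, m + 1):
        for A in itertools.combinations(range(m), r):
            if P[list(A)].sum() > Q[list(dilation(A, eps))].sum() + eps + 1e-12:
                return False
    return True

def prohorov(P, Q, tol=1e-4):
    """Bisect for the Prohorov distance between pmfs P and Q on pts."""
    lo, hi = 0.0, pts.max() - pts.min() + 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prohorov_le(P, Q, mid) and prohorov_le(Q, P, mid):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
truth = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
model = {                                   # a finite statistical model
    "truth": truth,
    "uniform": np.full(m, 0.2),
    "skewed": np.array([0.4, 0.3, 0.2, 0.05, 0.05]),
}

x = rng.choice(m, size=500, p=truth)        # data X_1, ..., X_n
p_hat = np.bincount(x, minlength=m) / len(x)    # empirical measure

dists = {name: prohorov(p, p_hat) for name, p in model.items()}
print(min(dists, key=dists.get), dists)     # minimum Prohorov distance estimator
```

The enumeration over all subsets makes this exponential in the support size, which is one concrete face of the remark that such estimators are very hard to calculate.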


Bibliography

Billingsley, P. (1999). Convergence of Probability Measures, second edition. Wiley.

Dawid, A. P., Stone, M. and Zidek, J. V. (1973). Marginalization paradoxes in Bayesian and structural inference (with discussion). Journal of the Royal Statistical Society, Series B, 35:189–233.

Eaton, M. L. (1992). A statistical diptych: Admissible inferences–recurrence of symmetric Markov chains. Annals of Statistics, 20:1147–1179.

Feller, W. (1950). An Introduction to Probability Theory and Its Applications, Vol. I. Wiley.

Goldblatt, R. (1998). Lectures on the Hyperreals: An Introduction to Nonstandard Analysis. Springer-Verlag.

Gordon, E. I., Kusraev, A. G., and Kutateladze, S. S. (2002). Infinitesimal Analysis. Kluwer Academic Publishers.

Halmos, P. R. (1958). Finite-Dimensional Vector Spaces, second edition. Van Nostrand (original publisher), Springer-Verlag (current publisher).

Kanovei, V. and Reeken, M. (2004). Nonstandard Analysis, Axiomatically. Springer-Verlag.

Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer. English translation (1950): Foundations of the Theory of Probability. Chelsea.

Lang, S. (1993). Real and Functional Analysis, third edition. Springer-Verlag.

Lehmann, E. L. (1959). Testing Statistical Hypotheses. Wiley.

Li, M. and Vitanyi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications, second edition. Springer.

Loeb, P. (1975). Conversion from nonstandard to standard measure spaces and applications to probability theory. Transactions of the American Mathematical Society, 211:113–122.


Moore, D. S. and McCabe, G. P. (1989). Introduction to the Practice of Statistics. W. H. Freeman.

Nelson, E. (1977). Internal set theory: a new approach to nonstandard analysis. Bulletin of the American Mathematical Society, 83:1165–1198.

Nelson, E. (1987). Radically Elementary Probability Theory. Princeton University Press.

Nelson, E. (unpublished). Internal Set Theory. Chapter 1 of an unfinished book on nonstandard analysis. Available on the web at the URL http://www.math.princeton.edu/~nelson/books/1.pdf.

Pearson, E. S. and Please, N. W. (1975). Relation between the shape of the population distribution and the robustness of four simple test statistics. Biometrika, 62:223–224.

Robert, A. M. (1988). Nonstandard Analysis. Wiley. Translated by the author from the 1985 French original. Republished by Dover, 2003.

Robinson, A. (1996). Non-standard Analysis, revised edition. Princeton University Press. Originally published by North-Holland, 1974. First edition, 1966.

Solovay, R. M. (1970). A model of set-theory in which every set of reals is Lebesgue measurable. Annals of Mathematics, Second Series, 92:1–56.

Sudderth, W. D. (1980). Finitely additive priors, coherence and the marginalization paradox. Journal of the Royal Statistical Society, Series B, 42:339–341.

van der Vaart, A. W. and Wellner, J. A. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag.

Whittle, P. (2005). Probability via Expectation, fourth edition. Springer.
