comparing hypotheses in different models. There the standard notion of statistical sufficiency breaks down, and the relevant statistics may not be sufficient in the sense required by the LTE.

The Asymmetry of Regression

The same challenge to LTE extends to the linear regression problem ('regression' is the statistician's name for curve fitting). In these examples, you are also told that one of the two hypotheses or models is true, and you are invited to say which one is true on the basis of the data. They are examples in which anyone can tell from the full data (with moral certainty) which is true, but nobody can tell from a knowledge solely of the likelihoods. It is not because Bayesians, or anyone else, are using likelihoods in the wrong way. It's because the relevant information is not there!

Suppose that data are generated by the 'structural' or 'causal' equation Y = X + U, where X and Y are observed variables, and U is a normal, or Gaussian, random variable with mean zero and unit variance, where U is probabilistically independent of X.[12] To use this to generate pairs of values (x, y), we must also provide a generating distribution for X, represented by the equation X = μ + W, where μ is the mean value of X, and W is another standard Gaussian random variable, probabilistically independent of U, also with zero mean and unit variance. Two hundred data points are shown in Fig. 3.

[Figure 3: There are two ways of generating the same data: the Forward method and the Backward method (see text). It is impossible to tell which method was used from the data alone.]
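The Forward generating process just described is easy to simulate. The following sketch (the function and variable names, such as forward_sample, are mine, not the author's) draws two hundred (x, y) pairs exactly as the text prescribes, taking μ = 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_sample(n, mu=0.0):
    """Forward method: X = mu + W, then Y = X + U, where W and U are
    independent standard Gaussians (zero mean, unit variance)."""
    w = rng.standard_normal(n)   # W ~ N(0, 1)
    u = rng.standard_normal(n)   # U ~ N(0, 1), independent of W
    x = mu + w                   # the generating distribution for X
    y = x + u                    # the 'structural' equation Y = X + U
    return x, y

x, y = forward_sample(200)       # two hundred points, as in Fig. 3
```

Note that since Y = μ + W + U, the marginal variance of Y is 2 while that of X is 1; the sample variances of y and x reflect this, which is the larger variation in the y values that Example 2 turns on.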
The vertical bar centered on the line Y = X represents the probability density of y given a particular value of x.

Example 1: Consider two hypotheses about how the data are generated. The Forward method randomly chooses an x value, and then determines the y value by adding a Gaussian error above or below the Forward line (Y = X). This is the method described in the previous paragraph. The Backward method randomly chooses a y value according to the equation Y = μ + √2·Z, where Z is a Gaussian variable with zero mean and unit variance. Then the x value is determined by adding a Gaussian error (with half the variance) to the left or the right of the Backward line Y = 2X (slope = 2). This probability density is represented by the horizontal bar centered on the line Y = 2X (see Fig. 3). In this case, the 'structural' equation is X = ½μ + ½Y + (1/√2)·V, where V is standard Gaussian (mean zero and unit variance) such that V is probabilistically independent of Y. It is impossible to tell from the data alone which method was used.

[12] Causal modeling of this kind has received a great deal of attention in recent years. See Pearl (2000) for a comprehensive survey of recent results, as well as Woodward (2003) for an introduction that is more accessible to philosophers.

Example 1 is not a counterexample to the likelihood theory of evidence (LTE). As it applies to this example, LTE says that if two simple hypotheses cannot be distinguished on the basis of their likelihoods (let's say that they are likelihood equivalent), then they cannot be distinguished on the basis of the full data. Why are the two hypotheses likelihood equivalent? The Forward hypothesis specifies a probability distribution for an x value, which we write as p_F(x), and then specifies the probability density of y given x; in symbols, p_F(y | x). This determines the joint distribution p_F(x, y) = p_F(x) p_F(y | x). Similarly, the Backward hypothesis determines a joint distribution for x and y of the form p_B(x, y) = p_B(y) p_B(x | y). Under the stated conditions, it is possible to prove that for all x and for all y, p_F(x, y) = p_B(x, y). So, they cannot be distinguished in terms of likelihoods. In this case, they also cannot be distinguished by the full data, which is why Example 1 is not a counterexample to LTE.

Example 2: The hypotheses considered in Example 1 are the standard ones in textbooks, but they are not the only ones possible. Consider the comparison of two simple hypotheses that are not likelihood equivalent with respect to the data in Fig. 3. This will not be a counterexample to LTE either, but it does raise some important worries, which will be exploited in subsequent examples. Let us specify the two hypotheses more concretely by assuming that μ = 0, so that the data are centered around the point (0, 0).
We are told that one of two hypotheses is true:

F2: Y = X + U, and X is standard Gaussian.
B2: X = Y + V, and Y is Gaussian with mean 0 and variance 2,

where again U and V are standard Gaussian, U is independent of X, and V is independent of Y. F2 is the same hypothesis as in Example 1. The difference is in the Backward hypothesis. The marginal y values are generated in the same way as before, but now B2 says that the x values are generated from the line Y = X (rather than Y = 2X) using a Gaussian error of mean zero and unit variance (as opposed to a variance of ½). It is intuitively clear that B2 will fit the data worse (have a lower likelihood) than the previous hypothesis, B1. What's remarkable in the present case is that, when compared to F2, the lower likelihood of B2 arises not from its generation of x values, but from the fact that there is larger variation in the y values than in the x values. This is strange because we intuitively regard the generation of the independent or exogenous variable as an inessential part of the causal hypothesis.

To demonstrate this phenomenon, consider an arbitrary data point (x, y). From the fact that F2 and B2 generate the y and x values, respectively, from the line Y = X, and the fact that an arbitrary point is equidistant from this line in both the vertical and the horizontal directions, it follows that p_F(y | x) = p_B(x | y). For the benefit of the technocrats amongst us, both are equal to (2π)^(−1/2) e^(−(x − y)²/2).
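Both identities, the likelihood equivalence of the Forward and Backward hypotheses in Example 1 and the equality p_F(y | x) = p_B(x | y) in Example 2, can be spot-checked numerically. A minimal sketch (the function names are mine, and μ = 0 throughout, as assumed above):

```python
import numpy as np

def npdf(z, mean, sd):
    """Gaussian probability density with the given mean and standard deviation."""
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Example 1: Forward joint p_F(x, y) = p_F(x) p_F(y | x);
# Backward joint p_B(x, y) = p_B(y) p_B(x | y).
def forward_joint(x, y):
    return npdf(x, 0.0, 1.0) * npdf(y, x, 1.0)  # X ~ N(0,1); Y|X ~ N(x, 1)

def backward_joint(x, y):
    # Y ~ N(0, 2); X|Y ~ N(y/2, 1/2), from X = ½Y + (1/√2)V
    return npdf(y, 0.0, np.sqrt(2.0)) * npdf(x, y / 2, np.sqrt(0.5))

# Example 2: F2's density of y given x versus B2's density of x given y.
def f2_cond(x, y):
    return npdf(y, x, 1.0)   # Gaussian error added vertically about Y = X

def b2_cond(x, y):
    return npdf(x, y, 1.0)   # Gaussian error added horizontally about Y = X

for x, y in [(-1.3, 0.4), (0.0, 0.0), (2.1, -0.7)]:
    assert np.isclose(forward_joint(x, y), backward_joint(x, y))  # likelihood equivalent
    assert np.isclose(f2_cond(x, y), b2_cond(x, y))               # p_F(y|x) = p_B(x|y)
    assert np.isclose(f2_cond(x, y),
                      (2 * np.pi) ** -0.5 * np.exp(-0.5 * (x - y) ** 2))
```

This is, of course, only a check at a few sample points, not a substitute for the proof the text alludes to; the algebraic identity holds for every (x, y).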