
sinning in the basement: what are the rules? the ten commandments ...



578 KENNEDY

peculiarities of that particular data set, and consequently will be misleading in terms of what it says about the underlying process generating the data. Furthermore, traditional testing procedures used to ‘sanctify’ the specification are no longer legitimate, because these data, since they have been used to generate the specification, cannot be judged impartial if used to test that specification.
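A small simulation can make this loss of impartiality concrete. The sketch below is an illustration only and is not taken from the paper: it assumes a dependent variable that is pure noise, a hypothetical pool of ten equally irrelevant candidate regressors, and a normal approximation (the Fisher transform of the sample correlation) in place of an exact t test. The function names `fisher_z` and `simulate` are invented for the example. The regressor that looks best in one half of the sample is tested once on that same half and once on the untouched half.

```python
import math
import random
from statistics import NormalDist


def fisher_z(x, y):
    """Approximate N(0,1) statistic for testing zero correlation.

    Uses the Fisher transform of the sample correlation; this is an
    approximation chosen for simplicity, not an exact t test.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    return math.atanh(r) * math.sqrt(n - 3)


def simulate(n_obs=80, n_candidates=10, n_trials=1000, alpha=0.05, seed=0):
    """Return rejection rates (same-data test, fresh-data test).

    Every candidate regressor is unrelated to y by construction, so each
    rejection is a type I error.
    """
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    half = n_obs // 2
    same_data = fresh_data = 0
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        xs = [[rng.gauss(0, 1) for _ in range(n_obs)]
              for _ in range(n_candidates)]
        # "Specification search": keep the regressor that looks best
        # in the first half of the sample.
        best = max(xs, key=lambda x: abs(fisher_z(x[:half], y[:half])))
        # Test on the data that generated the specification ...
        same_data += abs(fisher_z(best[:half], y[:half])) > crit
        # ... and on data that played no part in the search.
        fresh_data += abs(fisher_z(best[half:], y[half:])) > crit
    return same_data / n_trials, fresh_data / n_trials


if __name__ == "__main__":
    same, fresh = simulate()
    print(f"rejection rate, same data:  {same:.3f}")
    print(f"rejection rate, fresh data: {fresh:.3f}")
```

Under these assumptions the same-data rejection rate should come out far above the nominal 5 per cent, while the fresh-data rate should stay close to it, which is the sense in which data used to generate a specification cannot serve as an impartial judge of it.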

This objection to data mining is the basis for complaints about Hendry’s (1980, p. 403) edict that ‘The three golden rules of econometrics are test, test, and test’. Applied indiscriminately, these rules, and their associated ‘general to specific’ specification search methodology, can lead to problems such as pretest bias and distortion of type I error rates, jokes about having more test statistics than observations, and reference to Ronald Coase’s oft-cited comment that ‘If you torture the data long enough, Nature will confess’. A more sympathetic view recognizes (as Hendry has always maintained) that model specification should not blindly follow testing procedures, that it needs to be a well-thought-out combination of theory and data, and that testing procedures used in such specification searches should be designed to minimize the costs of data mining. Examples of such procedures are setting aside data for out-of-sample prediction tests, adjusting significance levels, and avoiding questionable criteria such as maximizing R². Hoover and Perez (1999), and associated commentary, provide a good summary of recent innovations on this front, and of related criticisms. Hendry and Mizon (1990) offer an excellent discussion of the issues surrounding the testing ingredient of model specification.
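The arithmetic behind ‘adjusting significance levels’ can be sketched briefly. The example below is an illustration under a simplifying assumption that is not made in the text, namely that the tests carried out during a specification search are independent; the Bonferroni and Šidák adjustments shown are standard textbook devices and are not necessarily the ones the cited authors have in mind. The function names are invented for the example.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false rejection across n_tests
    independent tests, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(alpha, n_tests):
    """Per-test level that keeps the overall rate at or below alpha."""
    return alpha / n_tests


def sidak_level(alpha, n_tests):
    """Per-test level giving an overall rate of exactly alpha when
    the tests are independent."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)


if __name__ == "__main__":
    alpha = 0.05
    for k in (1, 5, 10, 20):
        print(
            f"tests={k:2d}  "
            f"unadjusted overall rate={familywise_error(alpha, k):.3f}  "
            f"Bonferroni level={bonferroni_level(alpha, k):.4f}  "
            f"Sidak level={sidak_level(alpha, k):.4f}"
        )
```

In practice the tests in a specification search are rarely independent, so these figures are best read as a rough guide to how quickly an unadjusted type I error rate can drift away from its nominal value.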

An alternative view of ‘data mining’ is that it refers to experimenting with (or ‘fishing through’) the data to discover empirical regularities that can inform economic theory. This approach to data mining, likened by some to Exploratory Data Analysis (Tukey, 1977), has been welcomed into the mainstream of statistical analysis by the recent launching of the journal Data Mining and Knowledge Discovery. Hand et al. (2000) describe data mining as the process of seeking interesting or valuable information in large data sets. Its greatest virtue is that it can uncover empirical regularities that point to errors/omissions in theoretical specifications, an example of which is described by Kennedy (1998, p. 87).

The spirit of this approach is captured by Thaler’s (2000, p. 139) remark that ‘Some economists seem to feel that data-driven theory is, somehow, unscientific. Of course, just the opposite is true’. Pena (2001, p. 9) quotes George Box as emphasizing ‘the necessity to continually change the model as one’s understanding develops’. The art of the applied econometrician is to allow for data-driven theory while avoiding the considerable dangers inherent in data mining.

In summary, this second type of ‘data mining’ identifies regularities in or characteristics of the data that should be accounted for and understood in the context of the underlying theory. This may suggest the need to rethink the theory behind one’s model, resulting in a new specification founded on a more broad-based understanding. This is to be distinguished from a new specification created by mechanically remolding the old specification to fit the data; this would risk incurring the costs described earlier when discussing the first variant of ‘data mining’.

© Blackwell Publishers Ltd. 2002
