18.02.2015 Views

Berry


Putting Data Mining to Work

based on arithmetic operations. When data has many categorical variables, then decision trees are quite useful, although association rules and link analysis may be appropriate in some cases.

Number of Input Fields

In directed data mining applications, there should be a single target field or dependent variable. The rest of the fields (except for those that are either clearly irrelevant or clearly dependent on the target variable) are treated as potential inputs to the model. Data mining methods vary in their ability to successfully process large numbers of input fields. This can be a factor in deciding on the right technique for a particular application.

In general, techniques that rely on adjusting a vector of weights that has an element for each input field run into trouble when the number of fields grows very large. Neural networks and memory-based reasoning share that trait. Association rules run into a different problem. The technique looks at all possible combinations of the inputs; as the number of inputs grows, processing the combinations becomes impossible to do in a reasonable amount of time.
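
To make the growth in combinations concrete, the following sketch counts how many distinct combinations of inputs exist for a given number of inputs. The function name `count_combinations` and the sample sizes are illustrative assumptions, and the count is only the size of the unpruned search space, not a description of how any particular association-rule tool works.

```python
from math import comb


def count_combinations(n_inputs: int, max_size: int) -> int:
    """Count the distinct combinations of 1..max_size inputs drawn from n_inputs."""
    return sum(comb(n_inputs, k) for k in range(1, max_size + 1))


# With no limit on combination size, the count is 2**n - 1,
# so it roughly doubles with every input that is added.
for n in (10, 20, 40):
    print(n, "inputs:", count_combinations(n, n), "combinations")
```

Capping the size of the combinations considered (the `max_size` argument here) is one way to keep the count manageable, at the cost of missing larger combinations.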

Decision-tree methods are much less hindered by large numbers of fields. As the tree is built, the decision-tree algorithm identifies the single field that contributes the most information at each node and bases the next segment of the rule on that field alone. Dozens or hundreds of other fields can come along for the ride, but won’t be represented in the final rules unless they contribute to the solution.
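
The choice of a single field at a node can be sketched as follows. This is a minimal illustration that assumes entropy-based information gain as the measure of "most information" (other splitting criteria exist), and the records and field names (`plan`, `region`, `color`, `churned`) are invented for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, field, target):
    """Reduction in target entropy obtained by splitting rows on one field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def best_field(rows, fields, target):
    """Return the single field with the highest information gain at this node."""
    return max(fields, key=lambda f: information_gain(rows, f, target))


# Hypothetical records: only "plan" separates the target values.
rows = [
    {"plan": "basic", "region": "east", "color": "red", "churned": "yes"},
    {"plan": "basic", "region": "west", "color": "blue", "churned": "yes"},
    {"plan": "premium", "region": "east", "color": "blue", "churned": "no"},
    {"plan": "premium", "region": "west", "color": "red", "churned": "no"},
]
print(best_field(rows, ["plan", "region", "color"], "churned"))
```

The fields `region` and `color` are evaluated at the node but never appear in a rule, because splitting on them does not reduce the entropy of the target.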

TIP When faced with a large number of fields for a directed data mining problem, it is a good idea to start by building a decision tree, even if the final model is to be built using a different technique. The decision tree will identify a good subset of the fields to use as input to another technique that might be swamped by the original set of input variables.
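
One way to act on this tip is sketched below: grow a small tree greedily and keep only the fields it actually splits on. This is an assumption-laden illustration, not a production method. It uses entropy-based information gain on categorical fields, and the data, the field names (`a`, `b`, `noise_1`, `noise_2`), and the helper `fields_used_by_tree` are all hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gain(rows, field, target):
    """Information gain from splitting rows on one categorical field."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def fields_used_by_tree(rows, fields, target, min_gain=1e-9):
    """Grow a tree greedily and return the set of fields it splits on."""
    if not fields or len({row[target] for row in rows}) == 1:
        return set()  # nothing left to split on, or the node is already pure
    best = max(fields, key=lambda f: gain(rows, f, target))
    if gain(rows, best, target) < min_gain:
        return set()  # no field adds information at this node
    used = {best}
    remaining = [f for f in fields if f != best]
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        used |= fields_used_by_tree(subset, remaining, target, min_gain)
    return used


# Hypothetical data: the target depends only on fields "a" and "b".
fields = ["a", "b", "noise_1", "noise_2"]
rows = [
    dict(zip(fields, values), target="yes" if values[0] and values[1] else "no")
    for values in product([0, 1], repeat=len(fields))
]
selected = fields_used_by_tree(rows, fields, "target")
print(sorted(selected))
```

The resulting subset could then be passed as the input fields to another technique, such as a neural network, that would struggle with the full set.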

Free-Form Text

Most data mining techniques are incapable of directly handling free-form text. But clearly, text fields often contain extremely valuable information. In warranty claims submitted to an engine manufacturer by independent dealers, for example, the mechanic’s free-form notes explaining what went wrong and what was done to fix the problem are at least as valuable as the fixed fields that show the part numbers and hours of labor used.

One data mining technique that can deal with free text is memory-based reasoning, one of the nearest neighbor methods discussed in Chapter 8. Recall that memory-based reasoning is based on the ability to measure the distance
