Data Mining: Practical Machine Learning Tools and ... - LIDeCC

data in which what is to be predicted is not play or don’t play but rather is the time (in minutes) to play. With numeric prediction problems, as with other machine learning situations, the predicted value for new instances is often of less interest than the structure of the description that is learned, expressed in terms of what the important attributes are and how they relate to the numeric outcome.

2.2 What’s in an example?

The input to a machine learning scheme is a set of instances. These instances are the things that are to be classified, associated, or clustered. Although until now we have called them examples, henceforth we will use the more specific term instances to refer to the input. Each instance is an individual, independent example of the concept to be learned. In addition, each one is characterized by the values of a set of predetermined attributes. This was the case in all the sample datasets described in the last chapter (the weather, contact lens, iris, and labor negotiations problems). Each dataset is represented as a matrix of instances versus attributes, which in database terms is a single relation, or a flat file.

Expressing the input data as a set of independent instances is by far the most common situation for practical data mining. However, it is a rather restrictive way of formulating problems, and it is worth spending some time reviewing why. Problems often involve relationships between objects rather than separate, independent instances. Suppose, to take a specific situation, a family tree is given, and we want to learn the concept sister. Imagine your own family tree, with your relatives (and their genders) placed at the nodes. This tree is the input to the learning process, along with a list of pairs of people and an indication of whether they are sisters or not.

Figure 2.1 shows part of a family tree, below which are two tables that each define sisterhood in a slightly different way. A yes in the third column of the tables means that the person in the second column is a sister of the person in the first column (that’s just an arbitrary decision we’ve made in setting up this example).

The first thing to notice is that there are a lot of nos in the third column of the table on the left, because there are 12 people and 12 × 12 = 144 pairs of people in all, and most pairs of people aren’t sisters. The table on the right, which gives the same information, records only the positive instances and assumes that all others are negative. The idea of specifying only positive examples and adopting a standing assumption that the rest are negative is called the closed world assumption. It is frequently assumed in theoretical studies; however, it is not of
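The contrast between the two tables can be sketched in a few lines of code. This is a minimal illustration, not the full 12-person tree of Figure 2.1: the names and the positive pairs below are invented for the example, and the closed world assumption is applied by treating any pair not listed as positive as a negative instance.

```python
# Illustrative people and positive sister pairs (hypothetical data,
# standing in for the family tree of Figure 2.1).
people = ["Steven", "Graham", "Pam", "Grace", "Ray"]

# Right-hand-table style: record only the positive instances.
sister_of = {("Steven", "Pam"), ("Graham", "Pam")}

def is_sister(first, second):
    """Closed world assumption: any pair not listed as a positive
    instance is taken to be negative."""
    return (first, second) in sister_of

# Left-hand-table style: the full table over all pairs is implied.
# With 5 people there are 5 x 5 = 25 pairs, most of them "no".
table = [(a, b, "yes" if is_sister(a, b) else "no")
         for a in people for b in people]

print(len(table))                                        # 25 pairs
print(sum(1 for _, _, label in table if label == "yes"))  # 2 positives
```

Expanding the compact positive-only table into the exhaustive pair table shows why the closed world assumption is attractive: the two positive entries carry all the information that the 25-row table does.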
