01.11.2020 Views

Machine Learning in Python Essential Techniques for Predictive Analysis by Michael Bowles (z-lib.org).epub

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

A unique identifier may or may not be present in your problem. For

instance, websites typically tag site visitors with a user ID that is

associated with them for the duration of their visit. If a user does not

register with the site, the same user gets a different ID with each visit.

The ID is usually assigned to each observation, which will be the

subject of the prediction you’re going to build. Columns 2, 3, and 4

are called Attributes instead of being given more specific names like

Height or Gender. The point is to highlight their role in the prediction

process. Attributes are data available about the case that will be used

to make predictions.

Labels are the things you want to predict. In this example, UserID is a

simple number, Attribute 1 is height, Attribute 2 is gender, and

Attribute 3 is how many books the person read last year. The column

under Labels contains how much money the individual spent on

books online last year. What are the roles that these different

categories of data will play? What use does a machine learning

algorithm make of user ID, attributes, and labels? The short answer is

this: You ignore the user ID. You use the attributes to predict the

labels.

The unique ID is for bookkeeping purposes and allows you to refer

back to the other data available for the specific case. Generally, the

unique ID does not get used directly in a machine learning algorithm.

Attributes are the things that you’ve chosen to use for making

predictions. Labels are observed outcomes that the machine learning

algorithm will use to build a predictive model.

User ID doesn’t usually get used for making predictions because it is

too specific. It pertains to only a single example. The trick with

machine learning is to build a model that generalizes to new cases

(not merely memorizing past cases). To achieve that, the algorithm

must be derived so that it is forced to pay attention to more than one

row of data. One possible exception to excluding user ID is when the

user ID is numeric and assigned in the order that users are signed up.

Basically, it’s indicating signup date in that case and can be useful

because users with close IDs signed up at similar times and can be

considered as a group on that basis.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!