
Thinking Data Science: A Data Science Practitioner's Guide


1 Data Science Process

in model building itself. To become a modern data scientist, you need to understand how to handle these new data types and learn modern machine learning technologies.

So, let us begin our journey by first understanding the data science process. I will first introduce you to the traditional model-building process followed by old-school data scientists. Do not take this the wrong way: though these processes were developed many years ago, they still find their use in modern data science. I will provide you with definite guidelines on when to use the traditional approach and when to use a more advanced modern approach.

Traditional Model Building

All these years, a data scientist building an AI application would first start with exploratory data analysis (EDA). After all, understanding the data yourself is vital before you can tell the machine what it means. In technical terms, it is important for us to understand the features (independent variables) in our dataset so that we can do a predictive analysis on the target, the dependent variable. Using these features and targets, we would create a training dataset for training a statistical algorithm. Such EDA often requires deep domain knowledge, and that is why people with domain knowledge in various vertical industries strive to become data scientists. I will try to meet the aspirations of every such individual.
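This features-and-target split can be sketched in a few lines of Python with pandas; the toy dataset and its column names below are invented for illustration, not taken from the book:

```python
import pandas as pd

# A toy dataset; columns "age" and "income" are features,
# "purchased" plays the role of the target (dependent variable).
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40000, 52000, 80000, 91000],
    "purchased": [0, 0, 1, 1],
})

# Separate the independent variables (X) from the dependent variable (y):
# this (X, y) pair is the training dataset fed to a statistical algorithm.
X = df.drop(columns=["purchased"])
y = df["purchased"]
```

The same split works for any tabular dataset once you know, from your EDA and domain knowledge, which column is the target.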

As said earlier, in those days the data was mostly numeric. As a data scientist, you have to make sure the data is clean before feeding it to an algorithm; this is where data cleansing comes in. First, find out whether there are any missing values. If so, either remove those columns from your analysis or impute proper values into the missing fields. Once you ensure the data is clean, you need to do some preprocessing on it to make it ready for machine learning.
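Both options for handling missing values, dropping a column or imputing values, can be sketched with pandas; the toy DataFrame below is invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height": [170.0, np.nan, 165.0, 180.0],
    "weight": [70.0, 82.0, np.nan, 77.0],
    "notes": [np.nan, np.nan, np.nan, np.nan],  # an entirely empty column
})

# Option 1: remove columns that carry no information (all values missing)
df = df.dropna(axis=1, how="all")

# Option 2: impute the remaining gaps, here with each column's mean
df = df.fillna(df.mean(numeric_only=True))
```

Mean imputation is only one choice; the median, the mode, or a domain-informed constant may be more appropriate depending on the data.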

The various steps required in data preparation include studying the data variance in columns, scaling data, searching for correlations, dimensionality reduction, and so on. For this, the data scientist would use the many available tools for data exploration, get visual representations of data distributions and of correlations between columns, and apply several dimensionality reduction techniques. The list is endless; the process is time-consuming and laborious.
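The preparation steps just listed (variance, correlations, scaling, dimensionality reduction) can be sketched with pandas and scikit-learn; the random data below is a stand-in, with one column deliberately made redundant:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Random stand-in data; "e" is constructed to be perfectly correlated with "a"
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df["e"] = 2 * df["a"] + 0.1

# Study the variance in each column and the correlations between columns
variances = df.var()
correlations = df.corr()          # correlations.loc["a", "e"] is ~1.0

# Scale every feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(df)

# Dimensionality reduction: keep enough principal components
# to explain 95% of the variance (the redundant column drops out)
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)
```

Inspecting `variances` and `correlations` before scaling is the programmatic counterpart of the visual exploration described above.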

After the data scientist makes the dataset ready for machine learning, his next task is to select an appropriate algorithm based on his knowledge and experience. After the algorithm is trained, we say we have built the model. The data scientist now uses known performance evaluation methods to test the trained model on his test datasets. If the performance metrics do not give acceptable accuracies, he will try tweaking the hyper-parameters of the algorithm. If that does not work, he may have to go back to the data preparation stage, select new features, do additional feature engineering and further dimensionality reduction, and then retrain his algorithm for improved accuracy. If this too does not work out, he will try another statistical algorithm. The entire process continues over many iterations until he achieves an acceptable accuracy.
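This train-evaluate-tweak loop can be sketched with scikit-learn; the synthetic dataset, the choice of logistic regression, and the grid of `C` values are illustrative assumptions, not a prescription from the book:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# A synthetic stand-in for a dataset that has already been prepared
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the chosen algorithm: the fitted object is the "model"
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

# If the accuracy is unacceptable, tweak hyper-parameters,
# e.g. by searching over the regularization strength C
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
tuned = accuracy_score(y_test, search.predict(X_test))
```

If tuning still falls short, the loop continues as described: back to feature selection and engineering, or on to a different algorithm entirely.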
