
operations in real-time data streams. Today, we have machine learning applications, like traffic analysis, that work on live data streams. All these new models required machine learning engineers (ML engineers for short) and data scientists to develop and learn new methods for processing image data. After all, our computers understand only binary data; although image data is already binary, we still need to transform it into a machine-understandable format. Note that each single image comprises several binary data items (pixels), and each pixel in an image data file is an RGB representation.
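
To make this transformation concrete, here is a minimal sketch that reads an image and turns its RGB pixels into a numeric array a model can consume. The use of Pillow and NumPy, and the file name photo.jpg, are illustrative assumptions, not the book's own code.

```python
# Sketch: turning image pixels into machine-readable numbers.
# Pillow/NumPy and "photo.jpg" are assumptions for illustration.
from PIL import Image
import numpy as np

img = Image.open("photo.jpg").convert("RGB")

# Each pixel becomes an (R, G, B) triplet of integers in [0, 255].
pixels = np.array(img)                     # shape: (height, width, 3)
print(pixels.shape, pixels.dtype)

# Models typically expect floats scaled to [0, 1].
features = pixels.astype(np.float32) / 255.0
```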

The older classical ML technology failed to meet these new requirements, and the industry started looking at alternative approaches. Artificial neural network (ANN) technology, invented many decades ago, came to the rescue. Modern computing resources made this technology practical for developing such models. Neural network training requires several gigabytes of memory, GPUs, and many hours of training. Tech giants had those kinds of resources, and they trained networks that we can reuse and whose functionality we can extend for our own purposes. We call this new technology transfer learning, and I give it exhaustive coverage later in the book.
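
As a small preview of what that coverage looks like in practice, the sketch below reuses a pretrained network and extends it with a new classification head. The tf.keras stack, the MobileNetV2 base model, and the 10-class head are all assumptions made for illustration; the book's own examples may differ.

```python
# Illustrative transfer-learning sketch (framework and base model are
# assumed for this example, not taken from the book).
import tensorflow as tf

# Reuse a network a tech giant already trained on ImageNet.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,           # drop the original classifier head
    weights="imagenet",
)
base.trainable = False           # freeze the pretrained weights

# Extend it with a small head for our own task (10 classes is illustrative).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, epochs=5)  # train only the new head on our own data
```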

Model Development on Text Datasets

After seeing the success of ANN/DNN (deep neural network) technology in building image applications, researchers started exploring its application to text data. Thus came a new term and a field of its own: natural language processing (NLP). Preparing text data for machine learning requires a different kind of approach compared to numeric data. Numeric data is contained in databases having a few columns, and each column is a potential candidate for a feature. Thus, in numeric machine learning models, the number of features is typically very low.

Now, consider text data, where each word or sentence is a potential candidate for a feature. For an email-spam application, you use every word in your text as a feature, while for a document-summarization model, you will use every sentence as a feature. Considering the vocabulary of words that we have, you can easily imagine the number of features.
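
To see this feature explosion concretely, the sketch below builds a bag-of-words representation where every unique word becomes one feature. The use of scikit-learn's CountVectorizer and the tiny invented corpus are assumptions for illustration.

```python
# Sketch: every unique word becomes a feature (scikit-learn and the
# tiny corpus are assumptions for illustration).
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "win a free prize now",
    "meeting agenda for monday",
    "free tickets claim your prize",
]
vec = CountVectorizer()
X = vec.fit_transform(corpus)             # documents x vocabulary matrix

print(len(vec.get_feature_names_out()))   # vocabulary size = feature count
print(X.shape)                            # (3 documents, N word features)
```

Even this three-sentence corpus yields a dozen features; a realistic corpus with a full vocabulary yields tens of thousands.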

The traditional dimensionality reduction techniques that we have for numeric data do not apply to a text dataset. Instead, you need to cleanse the entire text corpus. Cleansing text data requires several steps: removing punctuation, removing stop words, removing or converting numbers, lowercasing, and so on. Beyond this data cleaning operation, to reduce the feature count, we need to apply a few more techniques, such as building a dictionary of unique words, stemming, and lemmatization. Finally, you need to understand tokenization, by which these words/sentences are transformed into machine-understandable formats.
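
A minimal cleansing and tokenization pipeline might look like the sketch below, assuming plain Python plus NLTK's PorterStemmer; the deliberately tiny stop-word list is illustrative, not a production list.

```python
# Minimal text-cleansing pipeline (assumed tooling; the stop-word
# list is a deliberately tiny illustrative subset).
import string
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and"}
stemmer = PorterStemmer()

def cleanse(text):
    text = text.lower()                                    # lowercasing
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = "".join(ch for ch in text if not ch.isdigit())  # remove numbers
    tokens = text.split()                                  # simple tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    return [stemmer.stem(t) for t in tokens]               # stemming

print(cleanse("The 2 runners were running in the Rain!"))
# e.g. ['runner', 'were', 'run', 'rain']
```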

For text data, the context in which a word appears also plays an important role in its analysis. So there came a new branch called natural language understanding (NLU).
