
Chapter 7

Analytics at Scale

Spark Core is the fundamental component of Spark. It can run on top of Hadoop or stand-alone. It abstracts the data set as a resilient distributed data set (RDD). An RDD is a collection of read-only objects; because it is read-only, there are no synchronization problems when it is shared across multiple parallel operations. Operations on an RDD are lazy, and they come in two types: transformations and actions. A transformation does not execute anything on the data set; Spark only records the sequence of operations as a directed acyclic graph called the lineage. Only when an action is called does the actual execution take place. After the first execution, the result is cached in memory, so when a new execution is requested, Spark traverses the lineage graph, reuses as much of the previous computation as possible, and performs only the minimum of new work. This makes data processing very fast and also makes the data fault tolerant: if any node fails, Spark looks up the lineage of the data held on that node and easily reproduces it.
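The distinction between lazy transformations and actions, and the effect of caching, can be seen in a minimal PySpark sketch like the following (the application name and the data are illustrative assumptions, not from the book):

from pyspark import SparkContext

sc = SparkContext("local", "lazy-eval-demo")

# Transformations: nothing runs yet; Spark only records the lineage.
numbers = sc.parallelize(range(1, 1001))          # RDD of 1..1000
squares = numbers.map(lambda x: x * x)            # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)      # transformation (lazy)

# Cache the result so later actions reuse it instead of recomputing.
evens.cache()

# Actions: these trigger execution of the lineage graph.
print(evens.count())   # first action runs the full computation
print(evens.take(5))   # second action reuses the cached partitions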

One limitation of the Hadoop framework is that it has no message-passing interface for parallel computation, yet there are several use cases where parallel jobs need to talk to each other. Spark addresses this with two kinds of shared variables: the broadcast variable and the accumulator. When one job needs to send a message to all other jobs, it uses a broadcast variable; when multiple jobs want to aggregate their results in one place, they use an accumulator.
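A small PySpark sketch of both shared variables follows; the stop-word set and the word list are made-up example data:

from pyspark import SparkContext

sc = SparkContext("local", "shared-variables-demo")

# Broadcast variable: read-only data shipped once to every executor.
stop_words = sc.broadcast({"the", "a", "an", "of"})

# Accumulator: a counter that executors add to and the driver reads.
skipped = sc.accumulator(0)

def keep(word):
    if word in stop_words.value:
        skipped.add(1)
        return False
    return True

words = sc.parallelize(["the", "quick", "brown", "fox", "a", "dog"])
kept = words.filter(keep).collect()

print(kept)           # ['quick', 'brown', 'fox', 'dog']
print(skipped.value)  # 2 stop words were filtered out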

An RDD splits its data set into units called partitions. Spark provides an interface for specifying how the data is partitioned, which is very effective for later operations such as join or lookup, and the user can also specify the storage level of the partitions.
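As a sketch of both ideas in PySpark (the key-value pairs and the choice of four partitions are arbitrary assumptions):

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local", "partition-demo")

# Hash-partition a key-value RDD into 4 partitions; a later join or
# reduceByKey on the same keys avoids a full shuffle.
pairs = sc.parallelize([("a", 1), ("b", 2), ("c", 3), ("a", 4)]).partitionBy(4)

# Choose where the partitions are kept: memory only, memory and disk, etc.
pairs.persist(StorageLevel.MEMORY_AND_DISK)

print(pairs.getNumPartitions())                          # 4
print(pairs.reduceByKey(lambda x, y: x + y).collect())   # summed per key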

Spark has programming interfaces in Python, Java, and Scala. The following code is an example of a word count program in Spark:
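A minimal PySpark version of such a word count is sketched below; the input path is a placeholder assumption:

from pyspark import SparkContext

sc = SparkContext("local", "word-count")

# Read the input file; each element of the RDD is one line of text.
lines = sc.textFile("input.txt")   # placeholder path

counts = (lines.flatMap(lambda line: line.split())   # split lines into words
               .map(lambda word: (word, 1))          # pair each word with 1
               .reduceByKey(lambda a, b: a + b))     # sum counts per word

for word, count in counts.collect():
    print(word, count)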
