Chapter 7
Analytics at Scale
Spark Core is the fundamental component of Spark. It can run on
top of Hadoop or stand-alone. It abstracts the data set as a resilient
distributed data set (RDD), a collection of read-only objects. Because
an RDD is read-only, there are no synchronization problems when it is
shared among multiple parallel operations. Operations on an RDD are
lazy and come in two types: transformations and actions. A transformation
does not execute anything on the data set; Spark merely records the
sequence of operations as a directed acyclic graph called a lineage. Only
when an action is called does the actual execution take place. After the
first execution, the result is cached in memory, so when a new execution
is requested, Spark traverses the lineage graph, reuses as much of the
previous computation as possible, and keeps the computation for the new
operation to a minimum. This makes data processing very fast and also
makes the data fault tolerant: if any node fails, Spark consults the lineage
graph for the data on that node and easily reproduces it.
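
The laziness of transformations is easy to observe from the Python API.
The following is a minimal sketch, assuming a local Spark installation; the
numbers in the RDD and the application name are illustrative:

from pyspark import SparkContext

sc = SparkContext("local[*]", "LazyEvalDemo")

rdd = sc.parallelize(range(1, 1001))

# Transformations: nothing executes here; Spark only records the
# sequence of operations as a lineage graph.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

evens.cache()  # ask Spark to keep the result in memory once computed

# Actions: these trigger the actual execution of the lineage.
print(evens.count())  # first action: runs the lineage and caches the result
print(evens.take(5))  # second action: reuses the cached result

sc.stop()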
One limitation of the Hadoop framework is that it has no message-passing
interface for parallel computation, yet there are several use cases where
parallel jobs need to talk to each other. Spark addresses this with two
kinds of shared variable: the broadcast variable and the accumulator.
When one job needs to send a message to all other jobs, it uses a
broadcast variable; when multiple jobs want to aggregate their results in
one place, they use an accumulator. An RDD splits its data set into units
called partitions. Spark provides an interface for specifying how the data
is partitioned, which is very effective for subsequent operations such as
join or lookup. The user can also specify the storage type of a partition in
Spark.
Spark has programming interfaces in Python, Java, and Scala. The
following code is an example of a word count program in Spark:
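A minimal PySpark version of such a program is sketched below; the input
path input.txt is an illustrative assumption:

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("input.txt")                # one RDD element per line
            .flatMap(lambda line: line.split())   # split each line into words
            .map(lambda word: (word, 1))          # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b))     # sum the counts for each word

for word, count in counts.collect():
    print(word, count)

sc.stop()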