Advanced Data Analytics Using Python: With Machine Learning, Deep Learning and NLP Examples (2023)
Chapter 7
Analytics at Scale
val conf = new SparkConf().setAppName("wiki_test") // Create a Spark configuration object
val sc = new SparkContext(conf) // Create a Spark context
val data = sc.textFile("/path/to/somedir") // Read the files under "somedir" into an RDD of lines
val tokens = data.flatMap(_.split(" ")) // Split each line into a list of tokens (words)
val wordFreq = tokens.map((_, 1)).reduceByKey(_ + _) // Pair each token with a count of one, then sum the counts per word
wordFreq.sortBy(s => -s._2).map(x => (x._2, x._1)).top(10) // Swap word and count, then take the 10 most frequent words
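The same word-count pipeline can be sketched in plain Python. This is a local, in-memory analogue rather than a Spark job: the sample lines are made up, a list comprehension plays the role of flatMap, and Counter stands in for reduceByKey:

```python
from collections import Counter

# Stand-in for the RDD of lines read from "somedir" (hypothetical sample data).
lines = [
    "to be or not to be",
    "that is the question",
    "to be is to do",
]

# flatMap(_.split(" ")): split every line into tokens.
tokens = [word for line in lines for word in line.split(" ")]

# map((_, 1)).reduceByKey(_ + _): sum a count of one per occurrence of each word.
word_freq = Counter(tokens)

# sortBy(-count) + top(10): take the 10 most frequent words.
top10 = word_freq.most_common(10)
print(top10)  # most frequent words first, e.g. ('to', 4) at the head
```

On a real cluster the same shape of computation is what Spark distributes: the per-line splitting runs in parallel on each partition, and the per-word summation is the shuffle step performed by reduceByKey.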
On top of Spark Core, Spark provides the following:
• Spark SQL, which provides a SQL interface through
the command line or through a database connector. It
also exposes a SQL interface for the Spark DataFrame
object.
• Spark Streaming, which enables you to process
streaming data in real time.
• MLlib, a machine learning library for building
analytical models on Spark data.
• GraphX, a distributed graph processing framework.
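Spark Streaming handles real-time data by grouping the incoming stream into small batches and applying the usual Spark operations to each batch. The micro-batch idea can be sketched in plain Python; this is a toy, non-Spark illustration, and the event stream and batch size are made up:

```python
from collections import Counter
from typing import Iterable, Iterator, List

def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group an unbounded stream of records into fixed-size batches."""
    batch: List[str] = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Hypothetical event stream; in Spark Streaming these records would arrive continuously.
stream = ["click", "view", "click", "click", "view", "buy", "click"]

# Process each batch independently, e.g. counting event types per batch.
for i, batch in enumerate(micro_batches(stream, batch_size=3)):
    print(i, dict(Counter(batch)))
```

In Spark Streaming the batching interval is a configuration parameter, and each batch is processed with the same transformations (map, reduceByKey, and so on) used on static RDDs.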
Analytics in the Cloud
Like many other fields, analytics is being impacted by the cloud, in
two ways. First, the big cloud providers are continuously releasing
machine learning APIs, so a developer can easily write a machine
learning application without worrying about the underlying algorithm.
For example, Google provides APIs for computer vision, natural language,