
All About Hadoop: The Big Data Lingo

associated with the analysis of large amounts of data, and lets the developer focus on analyzing the data rather than on the individual map and reduce tasks.

Hive

Although Pig can be quite a powerful and simple language to use, the downside is that it's something new to learn and master. Some folks at Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate. Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; be aware that HQL supports only a limited subset of SQL commands, but it is still quite useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.

If you have a SQL or relational database background, this section will look very familiar. As with any database management system (DBMS), you can run your Hive queries in many ways: from a command-line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application using the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user's machine (or in the middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix). The following example creates a table, populates it, and then queries it using Hive:

CREATE TABLE Tweets(from_user STRING, userid BIGINT, tweettext STRING,
                    retweets INT)
COMMENT 'This is the Twitter feed table'
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'hdfs://node/tweetdata' INTO TABLE TWEETS;
SELECT from_user, SUM(retweets)
FROM TWEETS
GROUP BY from_user;
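To make the Thrift-client path concrete, here is a minimal Python sketch that runs the SELECT above through the third-party PyHive client (one of several Python clients that speak to Hive over Thrift). The host name and port are placeholders, and PyHive itself is an assumption of this sketch, not something the text above prescribes; install it with `pip install pyhive` before use.

```python
# Hypothetical sketch: querying Hive from Python over Thrift via PyHive.
# Host and port below are placeholders for a real HiveServer deployment.
QUERY = "SELECT from_user, SUM(retweets) FROM Tweets GROUP BY from_user"

def total_retweets_by_user(host="hive-server.example.com", port=10000):
    """Run the aggregate query and return (from_user, sum) tuples."""
    from pyhive import hive  # imported lazily; requires `pip install pyhive`
    conn = hive.connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY)
        return cursor.fetchall()
    finally:
        conn.close()
```

The same query could equally be issued from the Hive shell or through the JDBC/ODBC drivers; the Thrift route is simply the natural fit for the scripting languages listed above.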
