
All About Hadoop: The Big Data Lingo

associated with the analysis of large amounts of data, and lets the developer focus on analyzing the data rather than on the individual map and reduce tasks.

Hive

Although Pig can be quite a powerful and simple language to use, the downside is that it's something new to learn and master. Some folks at Facebook developed a runtime Hadoop support structure that allows anyone who is already fluent with SQL (which is commonplace for relational database developers) to leverage the Hadoop platform right out of the gate. Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL) statements that are similar to standard SQL statements; be aware that HQL supports only a limited subset of SQL commands, but it is still quite useful. HQL statements are broken down by the Hive service into MapReduce jobs and executed across a Hadoop cluster.

If you have a SQL or relational database background, this section will look very familiar. As with any database management system (DBMS), you can run your Hive queries in many ways: from a command-line interface (known as the Hive shell), from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application using the Hive JDBC/ODBC drivers, or from what is called a Hive Thrift Client. The Hive Thrift Client is much like any database client that gets installed on a user's machine (or in the middle tier of a three-tier architecture): it communicates with the Hive services running on the server. You can use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much like you can use these client-side languages with embedded SQL to access a database such as DB2 or Informix). The following example creates a table, populates it, and then queries it using Hive:

CREATE TABLE Tweets(from_user STRING, userid BIGINT, tweettext STRING,
                    retweets INT)
COMMENT 'This is the Twitter feed table'
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'hdfs://node/tweetdata' INTO TABLE TWEETS;
SELECT from_user, SUM(retweets)
FROM TWEETS
GROUP BY from_user;
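To make the Thrift-client path concrete, here is a minimal Python sketch that runs the SELECT above through the third-party PyHive client (one of several Python clients that speak to Hive over Thrift). The host name and port are placeholders, and PyHive itself is an assumption of this sketch, not something the text above prescribes; install it with `pip install pyhive` before use.

```python
# Hypothetical sketch: querying Hive from Python over Thrift via PyHive.
# Host and port below are placeholders for a real HiveServer deployment.
QUERY = "SELECT from_user, SUM(retweets) FROM Tweets GROUP BY from_user"

def total_retweets_by_user(host="hive-server.example.com", port=10000):
    """Run the aggregate query and return (from_user, sum) tuples."""
    from pyhive import hive  # imported lazily; requires `pip install pyhive`
    conn = hive.connect(host=host, port=port)
    try:
        cursor = conn.cursor()
        cursor.execute(QUERY)
        return cursor.fetchall()
    finally:
        conn.close()
```

The same query could equally be issued from the Hive shell or through the JDBC/ODBC drivers; the Thrift route is simply the natural fit for the scripting languages listed above.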
