Hadoop: The Definitive Guide - Cdn.oreilly.com
Avro
A serialization system for efficient, cross-language RPC and persistent data
storage.
MapReduce
A distributed data processing model and execution environment that runs on large
clusters of commodity machines.
HDFS
A distributed filesystem that runs on large clusters of commodity machines.
Pig
A data flow language and execution environment for exploring very large datasets.
Pig runs on HDFS and MapReduce clusters.
Hive
A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL, which the runtime engine translates to
MapReduce jobs, for querying the data.
HBase
A distributed, column-oriented database. HBase uses HDFS for its underlying
storage, and supports both batch-style computations using MapReduce and point
queries (random reads).
ZooKeeper
A distributed, highly available coordination service. ZooKeeper provides primitives
such as distributed locks that can be used for building distributed applications.
Sqoop
A tool for efficient bulk transfer of data between structured data stores (such as
relational databases) and HDFS.
Oozie
A service for running and scheduling workflows of Hadoop jobs (including
MapReduce, Pig, Hive, and Sqoop jobs).
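To make the MapReduce entry above more concrete, here is a minimal single-process sketch of the data processing model in plain Python. It does not use any Hadoop API and does nothing distributed; the function names (map_phase, shuffle, reduce_phase, run_job, word_mapper, sum_reducer) are hypothetical and chosen only for illustration. It shows the three conceptual steps: mapping records to key-value pairs, grouping values by key, and reducing each group to a result.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect its (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, the step a MapReduce framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values, in sorted key order."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def run_job(records, mapper, reducer):
    """Run map, shuffle, and reduce in sequence inside one process (illustration only)."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical example job: count word occurrences across lines of text.
def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    lines = ["Pig runs on HDFS", "Hive runs on HDFS"]
    print(run_job(lines, word_mapper, sum_reducer))
    # prints {'hdfs': 2, 'hive': 1, 'on': 2, 'pig': 1, 'runs': 2}
```

In a real cluster the map and reduce functions run in parallel on many machines and the grouping step moves data across the network; this sketch only mirrors the shape of the computation.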
Hadoop Releases
Which version of Hadoop should you use? The answer to this question changes over
time, of course, and also depends on the features that you need. “Hadoop Releases”
on page 13 summarizes the high-level features in recent Hadoop release series.
There are a few active release series. The 1.x release series is a continuation of the 0.20
release series and contains the most stable versions of Hadoop currently available. This
series includes secure Kerberos authentication, which prevents unauthorized access to
Hadoop data (see “Security” on page 325). Almost all production clusters use these
releases or derived versions (such as commercial distributions).