Hadoop: The Definitive Guide - Cdn.oreilly.com

Avro

A serialization system for efficient, cross-language RPC and persistent data storage.

MapReduce

A distributed data processing model and execution environment that runs on large clusters of commodity machines.
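
To make the processing model concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps applied to a word count. It is an illustration only: it runs in a single process and does not use Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are assumptions chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, standing in for the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


# Hypothetical input; on a real cluster the lines would come from files in HDFS.
counts = run_job(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

In a real cluster the map and reduce functions run in parallel on many machines, and the framework performs the grouping step; the sketch only shows how the data flows between the phases.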

HDFS

A distributed filesystem that runs on large clusters of commodity machines.

Pig

A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.

Hive

A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
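
As a rough illustration of how an SQL-style aggregation can correspond to a map step and a reduce step, the sketch below evaluates a query of the form `SELECT year, MAX(temperature) FROM records GROUP BY year` in plain Python. This is not Hive's actual execution plan or API: the table contents, the `records` name, and the helper functions are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical rows of a table records(year, temperature).
records = [(1949, 111), (1949, 78), (1950, 0), (1950, 22), (1950, -11)]


def mapper(row):
    """Emit the GROUP BY column as the key and the aggregated column as the value."""
    year, temperature = row
    return year, temperature


def reducer(year, temperatures):
    """Apply the aggregate function (here MAX) to all values for one key."""
    return year, max(temperatures)


def group_by_max(rows):
    """Group mapped values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in map(mapper, rows):
        groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


result = group_by_max(records)
print(result)
```

The grouping column becomes the map output key and the aggregate function becomes the reduce logic, which is the general shape such a translation can take.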

HBase

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).

ZooKeeper

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

Sqoop

A tool for efficient bulk transfer of data between structured data stores (such as relational databases) and HDFS.

Oozie

A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs).

Hadoop Releases

Which version of Hadoop should you use? The answer to this question changes over time, of course, and also depends on the features that you need. “Hadoop Releases” on page 13 summarizes the high-level features in recent Hadoop release series.

There are a few active release series. The 1.x release series is a continuation of the 0.20 release series and contains the most stable versions of Hadoop currently available. This series includes secure Kerberos authentication, which prevents unauthorized access to Hadoop data (see “Security” on page 325). Almost all production clusters use these releases or derived versions (such as commercial distributions).
