Analytics for Enterprise Class Hadoop and Streaming Data


InfoSphere BigInsights: Analytics for Big Data at Rest

replicated data and metadata on disks belonging to different failure groups. Should a set of disks become unavailable, GPFS-SNC will retrieve the data from the other replicated locations.

If you select the GPFS-SNC component for installation, the BigInsights graphical installer will handle the creation and configuration of the GPFS-SNC cluster for you. The installer prompts you to choose the nodes to which it should assign the Cluster Manager and Quorum Node services. This installation approach uses default configurations that are typical for BigInsights workloads. GPFS-SNC is highly customizable, so for specialized installations, you can install and configure it outside of the graphical installer by modifying the template scripts and configuration files (although some customization is available within the installer itself).

GPFS-SNC Failover Scenarios

Regardless of whether you’re using GPFS-SNC or HDFS for your cluster, the Hadoop MapReduce framework runs on top of the file system layer. A running Hadoop cluster depends on the TaskTracker and JobTracker services, which run on the GPFS-SNC or HDFS storage layers to support MapReduce workloads. Although these services are not specific to the file system layer, they do introduce a single point of failure (SPOF) in a Hadoop cluster: if the JobTracker node fails, all executing jobs fail as well. However, this kind of failure is rare and is easily recoverable.

A NameNode failure in HDFS is far more serious and has the potential to result in data loss if its disks are corrupted and not backed up. In addition, for clusters with many terabytes of storage, restarting the NameNode can take hours, as the cluster’s metadata needs to be fetched from disk and read into memory, and all the changes since the previous checkpoint must be replayed. In the case of GPFS-SNC, there is no need for a NameNode (it is solely an HDFS component).
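To make the restart cost concrete, here is a minimal sketch of the "load the checkpoint, then replay the changes" pattern described above. It is a toy model and not HDFS code: the `Namespace` class, the `restart` function, and the operation names (`create`, `add_block`, `delete`) are all hypothetical, chosen only for illustration. The point is that recovery time grows with both the size of the checkpoint and the number of logged changes made since it was taken.

```python
from dataclasses import dataclass, field


@dataclass
class Namespace:
    """Toy in-memory file system metadata: path -> list of block IDs."""
    files: dict = field(default_factory=dict)

    def apply(self, op):
        """Apply one logged change (hypothetical operation names)."""
        kind, path, *rest = op
        if kind == "create":
            self.files[path] = []
        elif kind == "add_block":
            self.files[path].append(rest[0])
        elif kind == "delete":
            self.files.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {kind}")


def restart(checkpoint, edit_log):
    """Rebuild metadata: load the last checkpoint, then replay later edits."""
    # Copy the checkpoint so replaying edits never mutates the saved image.
    ns = Namespace(files={path: list(blocks) for path, blocks in checkpoint.items()})
    for op in edit_log:
        ns.apply(op)
    return ns


# Metadata as of the last checkpoint, plus the changes logged since then.
checkpoint = {"/data/a.log": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/data/b.log"),
    ("add_block", "/data/b.log", "blk_3"),
    ("delete", "/data/a.log"),
]

recovered = restart(checkpoint, edit_log)
print(recovered.files)  # {'/data/b.log': ['blk_3']}
```

Every logged operation has to be applied in order before the metadata is usable again, which is why a long interval between checkpoints lengthens the restart.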
Different kinds of failures can occur in a cluster. GPFS-SNC handles each of these failure scenarios as follows:

• Cluster Manager Failure: When the Cluster Manager fails, Quorum Nodes detect this condition and elect a new Cluster Manager (from the pool of Quorum Nodes). Cluster operations continue with a very small interruption in overall cluster performance.

• File System Manager Node Failure: Quorum Nodes detect this condition and ask the Cluster Manager to pick a new File System Manager Node from among any of the nodes in the cluster. This kind of failure results in a very small interruption in overall cluster performance.

• Secondary Cluster Configuration Server Failure: Quorum Nodes will detect the failure, but the cluster administrator must manually designate a new node as the Secondary Cluster Configuration Server. Cluster operations will continue even if this node is in a failure state, but some administrative commands that require both the primary and secondary servers might not work.

• Rack Failure: The remaining Quorum Nodes will decide which part of the cluster is still operational and which nodes went down with the rack. If the Cluster Manager was on the rack that went down, Quorum Nodes will elect a new Cluster Manager in the healthy part of the cluster. Similarly, the Cluster Manager will pick a new File System Manager Node if the old one was on the failed rack. The cluster will employ standard recovery strategies for each of the individual data nodes lost on the failed rack.

GPFS-SNC POSIX Compliance

A significant architectural difference between GPFS-SNC and HDFS is that GPFS-SNC is a kernel-level file system, while HDFS runs on top of the operating system. As a result, HDFS inherently has a number of restrictions and inefficiencies. Most of these limitations stem from the fact that HDFS is not fully POSIX-compliant. On the other hand, GPFS-SNC is 100 percent POSIX-compliant. This makes your Hadoop cluster more stable, more secure, and more flexible.

Ease of Use and Storage Flexibility

Files stored in GPFS-SNC are visible to all applications, just like any other files stored on a computer. For instance, any authorized user can use traditional operating system commands to list, copy, and move files in GPFS-SNC. This isn’t the case in HDFS, where users need to log into Hadoop to see the files in the cluster. In addition, if you want to perform any file manipulations in HDFS, you need to understand how the Hadoop command shell environment works and know specific Hadoop file system commands.
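As a rough illustration of what POSIX compliance means for applications, the sketch below lists, copies, and moves a file using only Python's standard `os` and `shutil` modules, with no Hadoop-specific client involved. It is an assumption-laden example: the `stage_file` helper and the file names are hypothetical, and a temporary directory stands in for a GPFS-SNC mount point, whose actual path depends on your installation. On HDFS, the equivalent steps would instead go through the Hadoop file system shell (for example, `hadoop fs -ls`).

```python
import os
import shutil
import tempfile


def stage_file(mount_point, src_name, dst_dir):
    """List, copy, and move a file using only standard OS-level calls."""
    src = os.path.join(mount_point, src_name)
    target_dir = os.path.join(mount_point, dst_dir)
    os.makedirs(target_dir, exist_ok=True)

    backup = shutil.copy2(src, src + ".bak")  # comparable to `cp`
    moved = shutil.move(src, target_dir)      # comparable to `mv`
    listing = sorted(os.listdir(mount_point))  # comparable to `ls`
    return listing, backup, moved


# A temporary directory stands in for the (hypothetical) GPFS-SNC mount point.
with tempfile.TemporaryDirectory() as mount_point:
    with open(os.path.join(mount_point, "events.csv"), "w") as f:
        f.write("id,value\n1,42\n")

    listing, backup, moved = stage_file(mount_point, "events.csv", "archive")
    print(listing)  # ['archive', 'events.csv.bak']
```

Because the code relies only on ordinary file APIs, the same function would work unchanged against any POSIX-style mounted file system.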