
Hadoop In Cloud Computing

Amritha Kailesh
Department of Information Technology, Ranganthan Engineering College
REC Kalvi Nagar, Thondamuttur Via, Virailyur Post, Coimbatore-641109, Tamil Nadu, India
bvb_amrita@yahoo.com
amritha.ritu@gmail.com

INTRODUCTION

Apache Hadoop is an open-source software project that enables distributed processing of large data sets across clusters of servers. It is designed to scale from a single server to thousands of machines with a very high degree of fault tolerance. Hadoop was derived from Google's MapReduce and the Google File System. Yahoo was the originator and a major contributor, and uses Hadoop across its business. Other users include Twitter, Facebook, IBM, LinkedIn, American Airlines, The New York Times, Microsoft and many more. A key aspect of the resiliency of Hadoop clusters comes from the software's ability to detect and handle failures at the application layer. Hadoop has two main sub-projects.

1. MapReduce: the framework that understands and assigns work to the nodes in the cluster; basically, computation.

2. HDFS: the Hadoop Distributed File System, which spans the nodes in the Hadoop cluster for data storage; basically, storage. HDFS links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.

Hadoop is supplemented by an ecosystem of Apache projects such as Pig, Hive and ZooKeeper that extend the value of Hadoop and improve its usability.


How does Hadoop Work:

As mentioned before, Hadoop consists of two main layers.

1. HDFS
2. MapReduce

HDFS: All regular file system operations can be done in it, and it is well suited to large files and data sets running to terabytes and petabytes. Each file saved on HDFS is split into blocks. The default block size is 128 MB, but 512 MB or even 1 GB is also used; the last block simply holds whatever remains after the other blocks are filled. Each block is then copied three times; these copies are called the 3x replicas. The 3x factor is configurable and depends on how safely the file needs to be stored (a small sketch of these per-file settings appears at the end of this section). HDFS saves these blocks on the data nodes and distributes them across the data nodes as evenly as possible. The node that watches over all of this is called the name node. The name node is the machine in the cluster that watches over all the other machines in the cluster. It does not store any data itself; instead it keeps a map of where each block has been placed. So, basically, the name node tracks the blocks. The name node watches over all the data nodes, so when the inevitable happens and one of the data nodes goes away, the name node notices it, finds which other machines hold replicas of the lost blocks, and instructs one of them to copy those blocks to another machine. This is where the 3x replication comes into action: we can lose two machines at once and still recover the data by making new replicas, and the probability of losing all three machines is very low. The name node is a single point of failure; this is the main drawback of HDFS. Two points are worth noting about the name node.

i. Single point of failure: in practice the name node does not fail often; out of, say, 4,000 data nodes it is the one machine that must be kept up and running, and that is manageable.

ii. Secondary name node: it is not exactly a second name node, and that is the most important fact about it. The secondary name node keeps a checkpoint of what is going on, and if the name node goes down the secondary name node can be turned into a name node, but this is not an automatic procedure.

The Hadoop project is currently working on a distributed name node to remove this limitation. The data nodes continuously run checksums, as does the block verifier, so when a block is marked as corrupted it is considered deleted and new replicas are made.

Hadoop is suitable for both structured and unstructured data, and the size of a Hadoop system can be increased easily and affordably. The largest cluster today is about 4,000 data nodes holding about 22 petabytes of data. The data nodes generally cannot be at different geographical locations, because the name node keeps a very close watch over the data nodes, and if they are not watched over properly, problems occur. Different firms come up with different software packages for this. For example, Facebook has software called HighTide, in which the nodes are arranged like a boat so that more nodes can be added from both sides.
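To make the block-size and replication settings above concrete, here is a minimal sketch in Java against the standard Hadoop FileSystem API. The name node address, file path, and the particular replication and block-size values are illustrative assumptions, not something taken from the paper.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical name node address; point this at the real cluster.
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/sample.txt");
        short replication = 3;               // the "3x replicas" discussed above
        long blockSize = 128L * 1024 * 1024; // the default 128 MB block size

        // Per-file overrides of the cluster's dfs.replication and dfs.blocksize.
        try (FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // The name node tracks the blocks; here we just read back the file's metadata.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());
        System.out.println("block size  = " + status.getBlockSize());
    }
}
```

These per-file values simply override the cluster-wide defaults, which is the sense in which the 3x factor is described as configurable above.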


Map-Reduce: MapReduce is a way of taking the "distributed" out of distributed computing. Distributed computing is difficult mainly because programs have to deal with distributed state. In Hadoop every piece of data is treated as a key and a value; when working with a file, the key can simply be the byte offset into the file. Hadoop does a lot of log analysis, so the key often carries a timestamp. Everything in Hadoop is a key-value pair, which is the basic unit of currency in computation. Repeated values are treated as separate values. Every calculation or computation takes place in the task tracker, which sits above the data node. The computation does not take place on all three replicas; it runs on only one at a time, and if a particular node is found not to run fast enough, the job is switched to a node holding one of the other replicas. The task tracker is where every computation takes place; it works above the data node. The job tracker decides on which node the computation takes place and is the analogue of the name node. Using the simple idea of a mapper that takes in some keys and values and emits some intermediate ones, followed by a reducer, a lot of things can be done: filtering, sorting, and so on (a minimal word-count sketch follows this section).

MapReduce is basically like an assembly language, so people started coming up with other languages to write on top of Hadoop, of which Pig and Hive are the two main projects.

a. Pig: Pig was developed by Yahoo to save its programmers from writing tedious programs in MapReduce. Pig does the same processing as MapReduce, i.e. data are loaded from HDFS as lines of text, then filtered and grouped, but in a simpler manner. Though the actual Pig syntax takes a while to grasp, once it does, it makes every job given to the machine simpler, i.e. MapReduce need not be written by hand. Pig is very popular and is used by many others such as LinkedIn.

b. Hive: Hive was developed by Facebook and derived from the existing language SQL. Hive is again an open-source project. It looks very much like SQL, though it does not completely comply with it. Hive is another alternative to Pig; they are mostly used for the queries on these sites. The benefit of Pig and Hive is that programmers do not have to worry about keys and values; both Pig and Hive compile down to MapReduce jobs. Apache came up with more such projects, like:

c. Apache HBase: One thing Hadoop is good at is batch processing, and one thing it is poor at is real-time processing. Google wrote a paper called Bigtable, and HBase implements that design over Hadoop in order to get real-time access to Hadoop data. HBase is currently very popular and is now used by Yahoo; StumbleUpon is another major user of HBase. HBase basically came out of Powerset, a search-engine company that was taken over by Microsoft. The main advantage of HBase is that it can provide run-time (random) access to Hadoop files; a minimal client sketch appears after the word-count example below.
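The following sketch turns the mapper/reducer idea described above into the classic word-count job, written against Hadoop's Java MapReduce API. The input and output paths are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: the input key is the byte offset into the file, the value is one line.
    // For every word it emits an intermediate (word, 1) pair.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: receives every intermediate value for one word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Placeholder HDFS input and output paths.
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

As described above, the key handed to each map call is the byte offset of the line, the mapper emits intermediate (word, 1) pairs, and the framework groups them by key before the reducer sums them.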

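Returning to HBase and the random, real-time access it adds on top of Hadoop: the sketch below uses the HBase Java client to write and then read back a single row. The ZooKeeper quorum host, the table name "pages" and the column family "metrics" are assumptions for illustration, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical ZooKeeper quorum used to locate the HBase cluster.
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("pages"))) {

            // Write a single cell keyed by row; HBase serves this back with low
            // latency, unlike a MapReduce batch job.
            Put put = new Put(Bytes.toBytes("row-0001"));
            put.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("views"), Bytes.toBytes("42"));
            table.put(put);

            // Random read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row-0001")));
            byte[] views = result.getValue(Bytes.toBytes("metrics"), Bytes.toBytes("views"));
            System.out.println("views = " + Bytes.toString(views));
        }
    }
}
```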

‣ Flume: Flume is a log aggregation system. It runs collectors on each machine and feeds the collected logs into Hadoop, where they can be processed with Pig and Hive, coordinated with Oozie.

‣ Avro: Once access to Hadoop is established, it is important to think about the amount of data being stored in it; although the data does not need restructuring, it is good to have it serialized. One of the major projects used for data serialization is Avro, started by Doug Cutting, which is a new data serialization format.

‣ ZooKeeper: The next popular project is ZooKeeper, modeled on a Google paper describing a distributed coordination system called Chubby. ZooKeeper is very low level; it is rarely used directly, and instead systems are written on top of it, for example Apache Kafka, a real-time streaming system developed by LinkedIn (a minimal ZooKeeper client sketch follows this list).

‣ Whirr: Whirr is a set of libraries for running Hadoop clusters on cloud infrastructure.

‣ Sqoop: Sqoop fills the need of taking the information stored in expensive databases and transferring it into Hadoop; in this way structured data can be brought from a database into Hadoop.

‣ HCatalog: HCatalog basically combines the metastore from Hive, which keeps all the information about the data, with what Pig does. Once HCatalog is running, a single central system can hold all the metadata: where all the tables are, all the schemas and everything else, and it can be accessed via HCatalog from whichever system the user is using.

‣ Oozie: Oozie is basically a job scheduling system. Jobs on Hadoop depend on one another, hence scheduling is required. It was developed by Yahoo.

‣ Mahout: Mahout is designed to do machine learning on top of MapReduce, where tasks such as clustering can be carried out.

‣ Big Top: Hadoop works with an enormous amount of data across a number of projects; Big Top tests that this stack does not crash while running.
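To give a feel for the low-level coordination primitives ZooKeeper exposes, the sketch below connects to an ensemble and publishes a small piece of shared state as a znode that any worker in the cluster can read or watch. The ensemble address, the znode path and the payload are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkCoordinationDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to the ZooKeeper ensemble; the host:port is a placeholder.
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Publish a tiny piece of shared state under a top-level znode.
        String path = "/demo-config";
        byte[] data = "replication=3".getBytes(StandardCharsets.UTF_8);
        if (zk.exists(path, false) == null) {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            zk.setData(path, data, -1);
        }

        // Any other process can now read (or watch) the same znode.
        System.out.println(new String(zk.getData(path, false, null), StandardCharsets.UTF_8));
        zk.close();
    }
}
```

Higher-level facilities such as locks, leader election and the configuration lookups that systems like HBase and Kafka rely on are built from exactly these create/exists/getData calls.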


Pros & Cons: Hadoop changes the economics and dynamics of large-scale computing. Its impact can be boiled down to four characteristics.

a. Scalable: New nodes can be added as needed without changing data formats, how data is loaded, how jobs are written, or the applications on top.

b. Cost effective: Hadoop brings massively parallel computing to large clusters of ordinary servers. The result is a sizable decrease in the cost per terabyte of storage, which in turn makes analyzing all of the data affordable.

c. Flexible: Hadoop is schema-less. It can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and navigated in arbitrary ways, enabling deeper analysis than other systems can provide.

d. Fault tolerant: When a node is lost, the system redirects work to another location in the cluster and continues processing without missing a beat. All of this happens without the programmers having to write special code or be aware of the mechanics of the parallel processing infrastructure.

With Hadoop, data management and analytics can be taken to a whole new level. The disadvantage of Hadoop is that it is a bit complex.

Applications: The major users of Hadoop are:
a. Yahoo
b. Twitter
c. Facebook
d. IBM, LinkedIn
e. American Airlines
f. The New York Times
g. Microsoft
and many more.

ACKNOWLEDGEMENT:
This paper was prepared under the guidance of Asst. Prof. M. Kathiravan.


