
Hadoop Development - CSC


©2011 CSC

INNOVATE AND DELIVER

AN INTRODUCTION TO HADOOP AND MAP-REDUCE

Les Klein
Solution Architect, GBS EMEA
November 10, 2011


Agenda

• The Data Challenge
• What Is Hadoop?
• Hadoop Architectural Overview
• What is Map-Reduce?
• Real-World Example - A Cyber Problem
• Extras:
– How to get started
– Resources

TBSC 2009

11/10/2011 12:53 PM 0725-23_TBSC 2009 2


The Data Challenge

• 3 dimensions:
– Volume
– Variety
– Velocity



Volume

• In 2006, 161 exabytes of digital information were created, representing roughly 3 million times the information in all the books ever written!
– an exabyte is 1,000 petabytes, a petabyte is 1,000 terabytes, a terabyte is 1,000 gigabytes
• In March 2007, research firm IDC released a report, sponsored by EMC [1], forecasting that as much as 988 exabytes of digital information would be created in 2010, a six-fold increase from 2006.
• From 2007 until 2010, IDC said it expected information to grow at a compound annual rate of 57 percent to hit the 988 exabyte mark. IDC now forecasts that the total volume of data stored electronically in 2011 will be 1.8 zettabytes [2]
– a zettabyte is 1,000 exabytes
– As an example, the Large Hadron Collider, at CERN in Switzerland, is expected to generate ~15 petabytes of data per year

[1] The Expanding Digital Universe, A Forecast of Worldwide Information Growth Through 2010, John F Gantz et al, IDC, March 2007
[2] The Diverse and Exploding Digital Universe, An Updated Forecast of Worldwide Information Growth Through 2011, John F Gantz et al, IDC, March 2008
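
The growth figures quoted on this slide can be cross-checked with a little arithmetic. The sketch below assumes four years of compounding from the 2006 baseline of 161 exabytes; the function name is illustrative only.

```python
def compound_growth(start: float, annual_rate: float, years: int) -> float:
    """Value after compounding `start` at `annual_rate` for `years` years."""
    return start * (1.0 + annual_rate) ** years

EB_2006 = 161.0          # exabytes created in 2006 (from the slide)
FORECAST_2010 = 988.0    # IDC forecast for 2010 (from the slide)

# Assumption: 57% per year applied over the four years 2007 to 2010.
projected_2010 = compound_growth(EB_2006, 0.57, 4)
growth_factor = FORECAST_2010 / EB_2006

print(f"projected 2010 volume: {projected_2010:.0f} EB")
print(f"growth factor 2006 to 2010: {growth_factor:.1f}x")
```

Under that assumption the projection lands within about 1% of the 988 exabyte forecast, and the ratio is a little over six, consistent with the "six-fold" claim.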



Volume (cont’d)

• IDC now predicts that the digital universe will be 44 times bigger in 2020 than it was in 2009, totalling a staggering 35 zettabytes. [3]
• EMC reports that the number of customers storing a petabyte or more of data will grow from 1,000 (reached in 2010) to 100,000 before the end of the decade. [4]
– By 2012 it expects that some customers will be storing exabytes (1,000 petabytes) of information. [5]
• In 2010 Gartner reported that enterprise data growth will be 650 percent over the next five years, and that 80 percent of that will be unstructured. [6]

[3] “The Digital Universe Decade – Are You Ready?” IDC, sponsored by EMC Corporation, May 2010. See tab “The Digital Universe Decade,” http://www.emc.com/collateral/demos/microsites/idc-digital-universe/iview.htm
[4] “EMC’s Record Breaking Product Launch,” Chuck Hollis blog, 14 January 2011, http://chucksblog.emc.com/chucks_blog/2011/01/emcs-record-breaking-product-launch.html
[5] Ibid.
[6] “Technology Trends You Can’t Afford to Ignore,” Gartner Webinar, January 2010, slide 8, http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf



Volume (cont’d)

• If you have a few hundred terabytes of data then products like Teradata, Netezza, Oracle Database Machine, etc., can help you, at a price ($$$$$$$)
– Note these usually also require “structured” data
• If you have many (even tens) of petabytes of data that need to be stored and analyzed
– Products like those are cost prohibitive for most of us (assuming that the product can scale that far)
• Complexity of analytics is also now becoming a problem
– Difficult to express within the constraints of the tools (e.g. SQL)
– Time taken to get results is unacceptable



Variety

• Continuing increase in the types of data needing to be stored:
– Video, voice/sound, images, RFID tags, SMS messages, “chat”, etc.
• Many of these are not easy to store/process in relational databases for analysis
• Many sources of such data are “unstructured” and/or not easy to structure
– Often need to know at design time what kind of “questions” will need to be asked



Velocity

• The rates at which volume and variety are increasing are themselves increasing!
• “Real time” analysis needs
– Increasingly becoming necessary to process new data as it streams in rather than overnight



What Is Hadoop?

• Google famously encountered these problems when it wanted to index the World Wide Web. Their solution:
– Google File System (GFS) for storage
– Google Map-Reduce to be able to rapidly process, in a highly parallel way, data stored in GFS
• Google's solution is proprietary, and their success created demand for similar capabilities that other companies could use, including their competitors, most notably Yahoo!
• The result of that demand is an Apache Open Source project called Hadoop that provides:
– HDFS (Hadoop Distributed File System), equivalent in capability to GFS
– MapReduce to process the data stored in HDFS, equivalent in capability to Google Map-Reduce



What Is Hadoop? (cont’d)

• Hadoop itself is “free”
• There is already a large “ecosystem” around Hadoop
– Many projects and products to make it easier to use Hadoop (open source and commercial)
– Commercial support is available in a “RedHat” style model from Cloudera, Hortonworks and some others
– Commercial support is also available from Greenplum (EMC) and IBM (BigInsights), who both have Hadoop based “products” in their portfolio, as do some others (e.g. Platform Computing)
• Hadoop is widely used in industry today
– Yahoo!, Facebook, LinkedIn, eBay, Quantcast, and many others
• It is potentially “disruptive” technology
– One CSC customer had an existing OLAP application that it re-implemented with Hadoop and got a 3x performance improvement for 10% of the infrastructure cost!



What Is Hadoop? (cont’d)

• Hadoop can scale
– Yahoo! has many clusters, the largest is 4000 nodes providing 16PB of HDFS (4 x 1TB HDDs/server)
– Facebook has a 2000 node cluster providing 21PB of HDFS (12 x 1TB HDDs/server)
    • July 2011 Facebook announced a 30PB Hadoop cluster in a new “bleeding edge” data centre
• What Hadoop/HDFS is not
– A Database - it does not require structured data
– A POSIX file system
– Real-time – batch only
• Hadoop is map-reduce only
– Not all problems necessarily lend themselves to this type of solution



Hadoop Architectural Overview

[Diagram: Name Node, Job Tracker, Secondary Name Node, Data Nodes]

• Data Nodes are commodity servers
– Provide both storage and processing
• Name Node is a single point of failure (SPoF)
– HA/Resilient server is a good choice
• Secondary Name Node is at best a “warm” standby: it keeps periodic checkpoints of the Name Node metadata and does not take over automatically
– Fail-over requires some downtime
– Ideally the same server choice as the Name Node
• Job Tracker
– Loss does not incur data loss, but in-flight jobs are lost



Hadoop Architectural Overview (cont’d)

• Hardware failure is to be expected!
– The Facebook 2000 node cluster has 24,000 SATA HDDs
    • A 3% failure rate per annum => 720 HDDs fail per year = ~2 per day !!
– Cheapest commodity servers rather than “enterprise” class devices
    • Limited/No redundancy
• HDFS can accommodate disk/node loss without data loss
– Each data block is replicated 3 times (by default) and HDFS can be made rack-aware
    • 1st replica on different server in same rack
    • 2nd replica on server in a different rack
– If a disk or a Data Node is lost then HDFS automatically creates new replicas in background for all the lost data blocks (disk space permitting of course)
• MapReduce will pre-emptively start the same processing tasks on different copies of data blocks when it detects that some nodes appear to be running slowly (or have died).
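
The disk-failure arithmetic on this slide is easy to reproduce. The sketch below uses the slide's figures (24,000 disks, 3% annual failure rate); the function name is illustrative and a flat 365-day year is assumed.

```python
def expected_disk_failures(disks: int, annual_failure_rate: float):
    """Return (expected failures per year, expected failures per day)."""
    per_year = disks * annual_failure_rate
    per_day = per_year / 365  # assumption: failures spread evenly over a 365-day year
    return per_year, per_day

per_year, per_day = expected_disk_failures(24_000, 0.03)
print(f"{per_year:.0f} failures per year, about {per_day:.1f} per day")
```

At this scale a failed disk is a daily event, which is why HDFS treats re-replication as routine background work rather than an exception.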



Hadoop Architectural Overview (cont’d)

• HDFS is designed for Write Once/Read Many operations
• HDFS block sizes are big
– 64MB, 128MB and 256MB are common
– To maximise disk read throughput
• Hadoop runs one Map task for each HDFS block in the data to be processed and takes approx one minute to start a map task, so execution needs to take at least one minute
– Increase block size to increase task execution time
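
To see how block size drives the number of map tasks, here is a minimal sketch. It assumes a single input file, one map task per block as stated on the slide, and binary units; the 10GB file size is a made-up example.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def map_task_count(file_bytes: int, block_bytes: int) -> int:
    """Number of HDFS blocks, and so map tasks, needed to cover the file."""
    return math.ceil(file_bytes / block_bytes)

FILE_SIZE = 10 * GB  # hypothetical input size
task_counts = {mb: map_task_count(FILE_SIZE, mb * MB) for mb in (64, 128, 256)}

for block_mb, tasks in task_counts.items():
    print(f"{block_mb}MB blocks -> {tasks} map tasks")
```

Doubling the block size halves the number of tasks, so each task reads more data and the fixed start-up cost is spread over more useful work.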



Hadoop Architectural Overview (cont’d)

• Hadoop scales linearly
– Suppose there is 100TB of data to process. With one server and a read rate of 100MB/s, it would take:
    • (100 x 10^12) / (100 x 10^6) = 10^6 = 1,000,000 seconds to read the files.
– With 100 servers, each with a 100MB/s read rate, and an equal distribution of files across all 100 servers, i.e. each server has 1TB to read, it will take:
    • (1 x 10^12) / (100 x 10^6) = 10^4 = 10,000 seconds
    • Since all 100 read the data in parallel, total read time is 10,000 seconds, i.e. 100 times faster.
– With 1000 servers and an equal distribution of the data across the cluster, all 100TB can be read in just 1,000 seconds, i.e. less than 20 minutes.
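
The scaling calculation above can be written as a one-line function. This is only a sketch of the slide's idealised model: data spread evenly, every server reading at the same rate, and no coordination overhead. The names are illustrative.

```python
def read_time_seconds(total_bytes: float, servers: int,
                      rate_bytes_per_s: float = 100e6) -> float:
    """Time for `servers` machines to read equal shares of the data in parallel."""
    return (total_bytes / servers) / rate_bytes_per_s

TOTAL = 100e12  # 100TB, decimal units as on the slide
read_times = {n: read_time_seconds(TOTAL, n) for n in (1, 100, 1000)}

for servers, seconds in read_times.items():
    print(f"{servers:>4} servers: {seconds:>9,.0f} s ({seconds / 60:,.1f} min)")
```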



Hadoop Architectural Overview (cont’d)

• Hadoop/HDFS is a paradigm shift for Parallel Processing (HPC)
– moves the processing to the data as opposed to moving the data to the processing
• Developers only have 2 pieces of code to write
– The map algorithm
– The reduce algorithm
• The only other code needed is the Hadoop client code needed to get Hadoop to run the map-reduce job
– mostly boiler-plate and can be auto-generated
• Hadoop does the rest
– Decides which nodes will run the code
– Coordinates running all the mappers before it starts any reducers
– Developers no longer need to worry about rendezvous, semaphores, deadlocks, etc.!



What is Map-Reduce?

• Map-Reduce is a different way of thinking about a problem.
• Suppose that you want to count the number of times each word is used in the complete works of Shakespeare, and that you have loaded the complete works of Shakespeare into HDFS.
• The first thing to do is to use the map phase to output a key-value pair for each word that you find in the blocks of data that you read. So if you process “To be, or not to be, that is the question” you would get the following result:
    • [to, 1]
    • [be, 1]
    • [or, 1]
    • [not, 1]
    • [to, 1]
    • [be, 1]
    • [that, 1]
    • [is, 1]
    • [the, 1]
    • [question, 1]



What is Map-Reduce? (cont’d)

• Note that you could choose to optimise the code to aggregate locally before outputting the key values
– e.g. to get [to, 2] and [be, 2], but that is not something that has to be done (although in a non-trivial case doing so would minimise I/O so may well help).
• The “key” function can be anything you like that generates a unique key for each value that you will encounter
– e.g. we could convert all the characters to upper case and use that string as the key ([“TO”, 1] for example).
• The Hadoop MapReduce framework uses the key to decide which Data Node to send that data to for the reduce phase.
– In this example, the Data Node chosen to get the key value pairs for the “key for to” will get 2 items to process, as will the one for the “key for be”
– all the others will only get one pair to process.
• All the reduce code has to do is to aggregate all the input pairs into a single key value pair like [“TO”, 2] (if we are using upper case strings as the key).
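
How the key decides where a pair goes can be sketched as hashing the key modulo the number of reducers. The code below is an in-memory illustration, not Hadoop's implementation: `zlib.crc32` stands in for the key's hash, and `partition`, `shuffle` and the choice of three reducers are assumptions made for the example.

```python
import zlib
from collections import defaultdict

def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key; crc32 is a deterministic stand-in hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers: int):
    """Group values by key inside the bucket of the reducer that owns the key."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return buckets

# Map output for the example sentence, using upper-case strings as keys.
pairs = [(word, 1) for word in "TO BE OR NOT TO BE THAT IS THE QUESTION".split()]
buckets = shuffle(pairs, num_reducers=3)

for reducer_id, bucket in enumerate(buckets):
    print(reducer_id, dict(bucket))
```

Because the same key always hashes to the same bucket, both "TO" pairs end up with one reducer, which is what lets the reduce code simply sum what it is given.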



What is Map-Reduce? (cont’d)<br />

Map code for word count example<br />
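
As a minimal illustration of what a word-count map function does, here is a sketch in plain Python rather than the Hadoop Java API. The function name, the tokenising regular expression and the upper-case key are assumptions chosen to match the example on the earlier slides.

```python
import re

def map_words(line: str):
    """Emit one (key, 1) pair per word, using the upper-cased word as the key."""
    for word in re.findall(r"[A-Za-z']+", line):
        yield (word.upper(), 1)

pairs = list(map_words("To be, or not to be, that is the question"))
for pair in pairs:
    print(pair)
```

The map function keeps no state between words; it emits a pair for every occurrence and leaves all counting to the reduce phase.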



What is Map-Reduce? (cont’d)<br />

Reduce code for word count example<br />
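
The matching reduce step can be sketched the same way, again in plain Python rather than the Hadoop Java API. `reduce_counts` plays the role of the reduce function; `run_reduce` is a hypothetical helper that imitates the framework by sorting and grouping the map output by key.

```python
from itertools import groupby
from operator import itemgetter

def reduce_counts(key, values):
    """Collapse all the values seen for one key into a single (key, total) pair."""
    return (key, sum(values))

def run_reduce(pairs):
    """Stand-in for the framework: sort by key, group, and call the reducer per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [reduce_counts(key, (value for _, value in group))
            for key, group in groupby(ordered, key=itemgetter(0))]

mapped = [("TO", 1), ("BE", 1), ("OR", 1), ("NOT", 1), ("TO", 1),
          ("BE", 1), ("THAT", 1), ("IS", 1), ("THE", 1), ("QUESTION", 1)]
counts = dict(run_reduce(mapped))
print(counts)
```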



What is Map-Reduce? (cont’d)

• It is unlikely that you will have more nodes in your cluster than possible key values
– Hadoop will send pairs for more than a single key value to a given Data Node, and runs a separate Reduce task for each key value on that Data Node
    • your reduce code does not need to worry about that
• Similarly, if there are more data blocks than Data Nodes
– Hadoop starts a separate Map task for each block to be read by a Data Node
– Hadoop decides which blocks will be processed on which nodes
    • Usually it has 3 choices of Data Node for each block
– your map code doesn't need to worry about any of this

(Note that in general Hadoop will run multiple tasks concurrently on Data Nodes)



Real world example - a Cyber problem

• We would like to be able to trace the use of any Linux system calls back to the user who was ultimately responsible for them.
• Assume malicious users will try to hide their identity by spawning layers of child processes that make it difficult to track back to the original process that is their "terminal" (login) session.
• What we would like to know is whether Hadoop can be used to solve this problem by doing the track back for any (or all) SYSCALL (system call) events.
• As for the scale, consider a datacentre with:
– servers generating audit events at the rate of ~20 per second, i.e. ~72,000 per hour
– 1000 such servers, i.e. ~72 million events per hour, or ~1.7 billion events a day (!)
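
The event volume follows directly from the per-server rate. A quick sketch of the arithmetic, using the slide's figures of ~20 events per second and 1000 servers (the constant names are illustrative):

```python
EVENTS_PER_SECOND_PER_SERVER = 20
SERVERS = 1000

per_server_per_hour = EVENTS_PER_SECOND_PER_SERVER * 3600   # one server, one hour
per_hour = per_server_per_hour * SERVERS                    # whole datacentre, one hour
per_day = per_hour * 24                                     # whole datacentre, one day

print(f"{per_server_per_hour:,} events per server per hour")
print(f"{per_hour:,} events per hour")
print(f"{per_day:,} events per day")
```

The daily total works out to roughly 1.7 billion events, which is the volume the trace-back job has to cope with.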



Real world example - a Cyber problem (cont’d)

• The problem can be addressed by processing the audit logs produced by the auditd daemon that is supplied with Linux kernels 2.6.x (or newer).
• In order to get auditd to log the data we are interested in, it is necessary to set up rules in /etc/audit/audit.rules on each server as specified in the NSA document “Guide to the Secure Configuration of Red Hat Enterprise Linux 5, Revision 4, September 14, 2010”
• The data to process looks something like this:



Real world example - a Cyber problem (cont’d)<br />

node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11<br />

success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139<br />

auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295<br />

comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />

node=192.168.1.3 type=EXECVE msg=audit(1294649964.027:33673): argc=7 a0="iconv" a1="-f" a2="utf-8"<br />

a3="-t" a4="utf-8" a5="-o" a6="/dev/null"<br />

node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"<br />

node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=0 name="/usr/bin/iconv"<br />

inode=18564672 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0<br />

node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=1 name=(null) inode=9994257<br />

dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0<br />

node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500<br />

auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" :<br />

exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'<br />

node=192.168.1.3 type=USER_ACCT msg=audit(1294649964.060:33675): user pid=9836 uid=500<br />

auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: accounting acct="root" :<br />

exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'<br />



Real world example - a Cyber problem (cont’d)<br />

• The events we are interested in are the SYSCALL events such as:<br />

node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11<br />

success=yes exit=0 a0=9f20008 a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139<br />

auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295<br />

comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />

• The process id making this system call is 10139 and this process has a parent process 3696

• What we need to do is find which process created 3696 and, if that was not an event with a "terminal" entry such as:

node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500<br />

auid=4294967295 subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" :<br />

exe="/bin/su" (hostname=?, addr=?, terminal=pts/1 res=success)'<br />

• then we need to trace the ppid of that process, and so on, until we get to an event that has a "terminal" entry. An interesting problem, since we have no idea how deep the process creation hierarchy will be!
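
Pulling the pid and ppid out of a SYSCALL record is mostly a matter of splitting key=value fields. The sketch below is one possible way to do it, not the code used in the project. The regular expression and helper names are assumptions, and the sample line is abbreviated from the record shown above.

```python
import re

# key=value fields; values are either double-quoted or run to the next space
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_record(line: str) -> dict:
    """Turn one auditd record into a dict of field name to value (quotes stripped)."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}

def process_key(record: dict, field: str = "pid") -> str:
    """Build a '<process id>:<node>' key so that ids from different servers do not collide."""
    return f'{record[field]}:{record["node"]}'

# Abbreviated copy of the sample SYSCALL record.
line = ('node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): '
        'arch=40000003 syscall=11 success=yes exit=0 ppid=3696 pid=10139 '
        'uid=0 tty=(none) comm="iconv" exe="/usr/bin/iconv"')

record = parse_record(line)
print(record["type"], process_key(record), "child of", process_key(record, "ppid"))
```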



Real world example - a Cyber problem (cont’d)

• As always, there is more than one way to approach this, but one simple way would seem to be to have a map-reduce job that takes as input:
– The auditd logs, and
– A list of parent process ids that we want to trace back
• and which outputs a file that contains:
– A list of parent process-ids, each used as the key to a list of auditd events that all have this process-id as a ppid entry.
– A list of parent process-ids found that also have "terminal" entries (i.e. for which we have now found the user).
• We can then repeatedly run this map-reduce job, using the output of one run, together with the auditd logs, as input to the next, until there is no difference in the output file between two consecutive runs, or until the output file is empty (whichever occurs first).

(Note that since we will be dealing with audit logs from many different servers, we will need to use the IP address of the server with the process-id to form a unique key)
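
The repeat-until-nothing-changes idea can be illustrated in memory. This sketch is not the map-reduce job itself: each loop iteration stands in for one run, `parent_of` and `has_terminal` stand in for what would be extracted from the auditd logs, and all keys in the example are made up.

```python
def trace_back(parent_of: dict, has_terminal: set, start_keys: set,
               max_rounds: int = 100) -> dict:
    """Map each starting process key to its first ancestor with a "terminal" entry.

    A start key maps to None when the logs run out before a terminal is found.
    `max_rounds` is a safety limit (an assumption) in case reused process ids
    make the parent links loop.
    """
    frontier = {key: key for key in start_keys}  # origin -> process being examined
    resolved = {}
    for _ in range(max_rounds):
        next_frontier = {}
        for origin, current in frontier.items():
            if current in has_terminal:
                resolved[origin] = current        # found the login session
            elif current in parent_of:
                next_frontier[origin] = parent_of[current]  # go up one level
            else:
                resolved[origin] = None           # no record of this process
        if not next_frontier:                     # nothing left to trace: stop
            break
        frontier = next_frontier
    return resolved

# Hypothetical data: 300 was started by 200, which was started by login session 100.
parent_of = {"300:10.0.0.1": "200:10.0.0.1", "200:10.0.0.1": "100:10.0.0.1"}
has_terminal = {"100:10.0.0.1"}
result = trace_back(parent_of, has_terminal, {"300:10.0.0.1", "999:10.0.0.1"})
print(result)
```

The number of rounds needed equals the depth of the deepest process chain, which is why the stopping rule has to be data-driven rather than a fixed count.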



Real world example - a Cyber problem (cont’d)

• A first question is where to get the initial list of process IDs from, and there are two obvious options:
– Wait for the SOC staff to spot SYSCALL events that they are interested in, or
– Make a first pass through the audit logs and, for a given day, extract all the SYSCALL events on that day and then find the owner UIDs for all of them

(Note that since the dataset used for development was quite small, option 2 was practical)



Real world example - a Cyber problem (cont’d)<br />

Sample auditd log data<br />

• node=192.168.1.3 type=SYSCALL msg=audit(1294649964.027:33673): arch=40000003 syscall=11 success=yes exit=0 a0=9f20008<br />

a1=9efd5d8 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10139 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0<br />

fsgid=0 tty=(none) ses=4294967295 comm="iconv" exe="/usr/bin/iconv" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />

• node=192.168.1.3 type=EXECVE msg=audit(1294649964.027:33673): argc=7 a0="iconv" a1="-f" a2="utf-8" a3="-t" a4="utf-8" a5="-o"<br />

a6="/dev/null"<br />

• node=192.168.1.3 type=CWD msg=audit(1294649964.027:33673): cwd="/usr/share/man/man0p"<br />

• node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=0 name="/usr/bin/iconv" inode=18564672 dev=fd:00<br />

mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:bin_t:s0<br />

• node=192.168.1.3 type=PATH msg=audit(1294649964.027:33673): item=1 name=(null) inode=9994257 dev=fd:00 mode=0100755<br />

ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0<br />

• node=192.168.1.3 type=USER_AUTH msg=audit(1294649964.060:33674): user pid=9836 uid=500 auid=4294967295<br />

subj=system_u:system_r:unconfined_t:s0 msg='PAM: authentication acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1<br />

res=success)'<br />

• node=192.168.1.3 type=USER_ACCT msg=audit(1294649964.060:33675): user pid=9836 uid=500 auid=4294967295<br />

subj=system_u:system_r:unconfined_t:s0 msg='PAM: accounting acct="root" : exe="/bin/su" (hostname=?, addr=?, terminal=pts/1<br />

res=success)'<br />

• node=192.168.1.3 type=SYSCALL msg=audit(1294649964.063:33676): arch=40000003 syscall=11 success=yes exit=0 a0=9f12820<br />

a1=9f00c48 a2=9efcee0 a3=40 items=2 ppid=3696 pid=10141 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0<br />

fsgid=0 tty=(none) ses=4294967295 comm="gawk" exe="/bin/gawk" subj=system_u:system_r:crond_t:s0-s0:c0.c1023 key=(null)<br />

• node=192.168.1.3 type=EXECVE msg=audit(1294649964.063:33676): argc=6 a0="/usr/bin/gawk"<br />

• node=192.168.1.3 type=CWD msg=audit(1294649964.063:33676): cwd="/usr/share/man/man0p"<br />



Real world example - a Cyber problem (cont’d)<br />

Sample output from SYSCALL map-reduce run<br />

• 10530:192.168.1.3 audit(1294650031.384:34011)<br />

• 10531:192.168.1.3 audit(1294650031.471:34015) audit(1294650031.481:34016) audit(1294650031.490:34017)<br />

• 10551:192.168.1.3 audit(1294650031.564:34022)<br />

• 10555:192.168.1.3 audit(1294650061.348:34028) audit(1294650061.341:34027)<br />

• 10556:192.168.1.3 audit(1294650061.355:34029) audit(1294650061.447:34034) audit(1294650061.408:34031) audit(1294650061.446:34035)<br />

• 10557:192.168.1.3 audit(1294650061.405:34030)<br />

• 10559:192.168.1.3 audit(1294650061.413:34033) audit(1294650061.408:34032)<br />

• 10561:192.168.1.3 audit(1294650061.470:34036)<br />

• 10563:192.168.1.3 audit(1294650061.473:34037) audit(1294650061.477:34038)<br />

• 10668:192.168.1.3 audit(1294653661.604:34047) audit(1294653661.610:34048)<br />

• 10669:192.168.1.3 audit(1294653661.626:34049) audit(1294653661.723:34054) audit(1294653661.726:34055) audit(1294653661.676:34050)<br />

• 10670:192.168.1.3 audit(1294653661.682:34051)<br />

• 10672:192.168.1.3 audit(1294653661.686:34053) audit(1294653661.685:34052)<br />

• 10674:192.168.1.3 audit(1294653661.743:34056)<br />

• 10676:192.168.1.3 audit(1294653661.745:34057) audit(1294653661.748:34058)<br />

• 10702:192.168.1.3 audit(1294654578.626:34062)<br />

• 10787:192.168.1.3 audit(1294657261.887:34072) audit(1294657261.881:34071)<br />

• 10788:192.168.1.3 audit(1294657261.951:34074) audit(1294657261.992:34078) audit(1294657261.892:34073) audit(1294657261.997:34079)<br />

• 10789:192.168.1.3 audit(1294657261.956:34075)<br />

• 10791:192.168.1.3 audit(1294657261.974:34077) audit(1294657261.971:34076)<br />

• 10793:192.168.1.3 audit(1294657262.022:34080)<br />

• 10795:192.168.1.3 audit(1294657262.028:34081) audit(1294657262.031:34082)<br />

• 10898:192.168.1.3 audit(1294660861.189:34090) audit(1294660861.197:34091)<br />

• 10899:192.168.1.3 audit(1294660861.257:34093) audit(1294660861.258:34092)<br />

• ...<br />

• ...<br />

• 9833:192.168.1.3 audit(1294649959.818:33419)<br />

• 9836:192.168.1.3 audit(1294649964.126:33683) audit(1294649964.108:33679) audit(1294649964.115:33680)<br />

• 9840:192.168.1.3 audit(1294649959.993:33425)<br />

• 9841:192.168.1.3 audit(1294649959.999:33426)<br />



Real world example - a Cyber problem (cont’d)

Sample output from trackback map-reduce run

Although the output is a single file, it contains both lists: the keys prefixed with "*" belong to auditd events that contain "terminal" entries. Any lines that start with "*" will be ignored on the next run.

• ...<br />

• ...<br />

• *10530:192.168.1.3 audit(1294650031.326:34005)<br />

• *10555:192.168.1.3 audit(1294650061.485:34040) audit(1294650061.485:34039) audit(1294650061.337:34026) audit(1294650061.330:34025) audit(1294650061.324:34024) audit(1294650061.324:34023)<br />

• *10668:192.168.1.3 audit(1294653661.595:34045) audit(1294653661.601:34046) audit(1294653661.591:34043) audit(1294653661.592:34044) audit(1294653661.753:34060) audit(1294653661.752:34059)<br />

• *10787:192.168.1.3 audit(1294657262.072:34084) audit(1294657261.870:34069) audit(1294657261.878:34070) audit(1294657261.867:34068) audit(1294657261.865:34067) audit(1294657262.070:34083)<br />

• *10898:192.168.1.3 audit(1294660861.181:34086) audit(1294660861.186:34089) audit(1294660861.183:34088) audit(1294660861.181:34087) audit(1294660861.295:34103) audit(1294660861.295:34102)<br />

• *11010:192.168.1.3 audit(1294664461.538:34121) audit(1294664461.410:34108) audit(1294664461.403:34107) audit(1294664461.401:34106) audit(1294664461.400:34105) audit(1294664461.538:34122)<br />

• *11122:192.168.1.3 audit(1294668061.647:34124) audit(1294668061.792:34140) audit(1294668061.792:34141) audit(1294668061.656:34127) audit(1294668061.651:34126) audit(1294668061.649:34125)<br />

• *9836:192.168.1.3 audit(1294649964.119:33681) audit(1294649964.060:33675) audit(1294649964.123:33682) audit(1294649964.060:33674)<br />

• 10002:192.168.1.3:10003 audit(1294649962.248:33560)<br />

• 10008:192.168.1.3:10009 audit(1294649962.301:33565)<br />

• 10014:192.168.1.3:10015 audit(1294649962.368:33570)<br />

• 10020:192.168.1.3:10021 audit(1294649962.464:33575)<br />

• 10026:192.168.1.3:10027 audit(1294649962.540:33580)<br />

• 10032:192.168.1.3:10033 audit(1294649962.624:33585)<br />

• 10038:192.168.1.3:10039 audit(1294649962.693:33590)<br />

• 10044:192.168.1.3:10045 audit(1294649962.781:33595)<br />

• 10050:192.168.1.3:10051 audit(1294649962.857:33600)<br />

• 10056:192.168.1.3:10057 audit(1294649962.922:33605)<br />

• 10062:192.168.1.3:10063 audit(1294649963.007:33610)<br />

• 10068:192.168.1.3:10069 audit(1294649963.108:33615)<br />

• 10074:192.168.1.3:10075 audit(1294649963.188:33620)<br />

• 10080:192.168.1.3:10081 audit(1294649963.279:33625)<br />

• 10086:192.168.1.3:10087 audit(1294649963.343:33630)<br />

• 10092:192.168.1.3:10093 audit(1294649963.440:33635)<br />

• 10098:192.168.1.3:10099 audit(1294649963.500:33640)<br />



How to get started...

• Download and install a Java SDK (version 5+)
• Download and install Eclipse
• Download and install Karmasphere Studio Community Edition for Eclipse
• Grab a copy of my 2011 LEF Report – Hadoop, a Practitioner's Guide
• Get a copy of Tom White's book – Hadoop: The Definitive Guide (O'Reilly)
• Windows users also need to:
– Download and install Cygwin, with X windows
    • Open an xterm window in Cygwin and run Eclipse from there instead of Windows
• No need to download and install Hadoop !!
– Karmasphere Studio gives you all you need to develop and test Hadoop MapReduce jobs



Resources

• My LEF report:
– http://assets1.csc.com/lef/downloads/CSC_Grant_2011_A_Practioner_s_Guide_for_Hadoop_Development.pdf
• LEF Data rEvolution Report:
– http://www.csc.com/lef/ds/22182-reports
• Hadoop Wiki:
– http://wiki.apache.org/hadoop/
• Hadoop home page:
– http://hadoop.apache.org/#News
• Karmasphere Studio:
– http://www.karmasphere.com/Products-Information/karmasphere-studio.html
• If you want to contact me: lklein2@csc.com



Questions?<br />




Thank You

