
case of Hive, a technology that we have developed at Facebook. While the former is a popular distributed file system and map-reduce platform inspired by Google's GFS[4] and map-reduce[5] infrastructure, the latter brings traditional data warehousing tools and techniques such as SQL, metadata, partitioning etc. to the Hadoop ecosystem. The availability of such familiar tools has tremendously improved developer/analyst productivity and created new use cases and usage patterns for Hadoop. With Hive, the same tasks that would take hours if not days to program can now be expressed in minutes, and as a direct consequence we see more and more users using Hive and Hadoop for ad hoc analysis at Facebook – a usage pattern that is not easily supported by Hadoop alone. This expanded usage has also made data discovery and collaboration on analysis that much more important. In the following sections we also touch upon some systems that we have built to address those requirements.

Just as Hive and Hadoop are core to our storage and data processing strategies, Scribe[3] is core to our log collection strategy. Scribe is an open source technology created at Facebook that acts as a service for aggregating logs from thousands of web servers. It acts as a distributed and scalable data bus, and by combining it with Hadoop's distributed file system (HDFS)[6] we have come up with a log aggregation solution that can scale with the increasing volume of logged data.
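To make Scribe's role concrete, the following is a minimal sketch of how a web server process might hand a log line to a Scribe server over Scribe's Thrift interface. It is illustrative only: the Python module layout, the default port 1463 and the category name ad_clicks are assumptions about a typical Scribe deployment rather than details taken from this paper.

```python
# Illustrative sketch only: assumes the Thrift-generated Python bindings
# for Scribe's Log() call; module names, port and category are assumptions.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe  # code generated from scribe.thrift


def send_log_line(message, category="ad_clicks", host="localhost", port=1463):
    socket = TSocket.TSocket(host=host, port=port)
    transport = TTransport.TFramedTransport(socket)
    protocol = TBinaryProtocol.TBinaryProtocol(transport, strictRead=False,
                                               strictWrite=False)
    client = scribe.Client(protocol)

    transport.open()
    try:
        entry = scribe.LogEntry(category=category, message=message)
        # The Scribe server buffers and forwards the entry downstream;
        # the call returns OK or TRY_LATER.
        return client.Log(messages=[entry]) == scribe.ResultCode.OK
    finally:
        transport.close()
```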

In the following sections we present how these different systems come together to solve the problems of scale and job diversity at Facebook. The rest of the paper is organized as follows. Section 2 describes how the data flows from the source systems to the data warehouse. Section 3 talks about the storage systems, formats and optimizations done for storing large data sets. Section 4 describes some approaches taken towards making this data easy to discover, query and analyze. Section 5 talks about some challenges that we encounter due to different SLA expectations from different users using the same shared infrastructure. Section 6 discusses the statistics that we collect to monitor cluster health, plan out provisioning and give usage data to the users. We conclude in Section 7.

Figure 1: Data Flow Architecture

2. DATA FLOW ARCHITECTURE

In Figure 1, we illustrate how the data flows from the source systems to the data warehouse at Facebook. As depicted, there are two sources of data – the federated MySQL tier that contains all the Facebook site related data and the web tier that generates all the log data. An example of a data set that originates in the former is the information describing advertisements – their category, their name, the advertiser information etc. The data sets originating in the latter mostly correspond to actions such as viewing an advertisement, clicking on it, fanning a Facebook page etc. In traditional data warehousing terminology, more often than not the data in the federated MySQL tier corresponds to dimension data and the data coming from the web servers corresponds to fact data.

The d<strong>at</strong>a from the web servers is pushed to a set of Scribe-Hadoop<br />

(scribeh) clusters. These clusters comprise of Scribe servers<br />

running on Hadoop clusters. The Scribe servers aggreg<strong>at</strong>e the logs<br />

coming from different web servers <strong>and</strong> write them out as HDFS<br />

files in the associ<strong>at</strong>ed Hadoop cluster. Note th<strong>at</strong> since the d<strong>at</strong>a is<br />

passed uncompressed from the web servers to the scribeh clusters,<br />

these clusters are generally bottlenecked on network. Typically<br />

more than 30TB of d<strong>at</strong>a is transferred to the scribeh clusters every<br />

day – mostly within the peak usage hours. In order to reduce the<br />

cross d<strong>at</strong>a center traffic the scribeh clusters are loc<strong>at</strong>ed in the d<strong>at</strong>a<br />

centers hosting the web tiers. While we are exploring possibilities<br />

of compressing this d<strong>at</strong>a on the web tier before pushing it to the<br />

scribeh cluster, there is a trade off between compression <strong>and</strong> the<br />

l<strong>at</strong>encies th<strong>at</strong> can be introduced as a result of it, especially for low<br />

volume log c<strong>at</strong>egories. If the log c<strong>at</strong>egory does not have enough<br />

volume to fill up the compression buffers on the web tier, the d<strong>at</strong>a<br />

therein can experience a lot of delay before it becomes available<br />

to the users unless the compression buffer sizes are reduced or<br />

periodically flushed. Both of those possibilities would in turn lead<br />

to lower compression r<strong>at</strong>ios.<br />
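The latency concern can be made concrete with a back-of-the-envelope calculation; the 64 KB buffer size and the per-category log rates below are illustrative assumptions, not numbers from our deployment.

```python
def buffer_fill_latency(buffer_bytes, bytes_per_second):
    """Worst-case seconds a log line waits before its compression buffer
    fills and is shipped (ignoring any periodic flushes)."""
    return buffer_bytes / bytes_per_second


BUFFER = 64 * 1024  # assumed per-category compression buffer on the web tier

# A high volume category at ~10 MB/s fills the buffer almost instantly,
# so compressing before shipping adds negligible delay.
print(buffer_fill_latency(BUFFER, 10 * 1024 * 1024))  # ~0.006 seconds

# A low volume category at ~100 bytes/s holds data for roughly 11 minutes
# unless the buffer is shrunk or flushed early, which lowers compression.
print(buffer_fill_latency(BUFFER, 100))               # ~655 seconds
```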

Periodically the data in the scribeh clusters is compressed by copier jobs and transferred to the Hive-Hadoop clusters as shown in Figure 1. The copiers run at 5-15 minute intervals and copy out all the new files created in the scribeh clusters. In this manner the log data gets moved to the Hive-Hadoop clusters. At this point the data is mostly in the form of HDFS files. It gets published either hourly or daily in the form of partitions in the corresponding Hive tables through a set of loader processes and then becomes available for consumption.
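A minimal sketch of one copier plus loader cycle is shown below, assuming the files are moved with hadoop distcp and the partition is published with a HiveQL ALTER TABLE statement. The cluster paths, table name and ds/hr partition scheme are assumptions; the actual copier and loader processes are internal tools not described here at that level of detail.

```python
# Sketch of a copier + loader cycle; paths, table names and the ds/hr
# partition layout are assumptions, not Facebook's actual configuration.
import subprocess


def copy_and_publish(category, ds, hr):
    src = f"hdfs://scribeh-cluster/logs/{category}/{ds}/{hr}"
    dst = f"hdfs://hive-cluster/warehouse_staging/{category}/ds={ds}/hr={hr}"

    # Copier step: bulk-copy the new Scribe output files across clusters
    # (the real copiers also compress the data during this step).
    subprocess.check_call(["hadoop", "distcp", src, dst])

    # Loader step: publish the copied files as an hourly partition of the
    # corresponding Hive table so the data becomes queryable.
    ddl = (f"ALTER TABLE {category} ADD IF NOT EXISTS "
           f"PARTITION (ds='{ds}', hr='{hr}') LOCATION '{dst}'")
    subprocess.check_call(["hive", "-e", ddl])


copy_and_publish("ad_clicks", ds="2010-01-01", hr="08")
```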

The d<strong>at</strong>a from the feder<strong>at</strong>ed mysql tier gets loaded to the Hive-<br />

Hadoop clusters through daily scrape processes. The scrape<br />

processes dump the desired d<strong>at</strong>a sets from mysql d<strong>at</strong>abases,<br />

compressing them on the source systems <strong>and</strong> finally moving them<br />

into the Hive-Hadoop cluster. The scrapes need to be resilient to<br />

failures <strong>and</strong> also need to be designed such th<strong>at</strong> they do not put too<br />

much load on the mysql d<strong>at</strong>abases. The l<strong>at</strong>ter is accomplished by<br />

running the scrapes on a replic<strong>at</strong>ed tier of mysql d<strong>at</strong>abases thereby<br />

avoiding extra load on the already loaded masters. At the same<br />

time any notions of strong consistency in the scraped d<strong>at</strong>a is<br />

sacrificed in order to avoid locking overheads. The scrapes are<br />

retried on a per d<strong>at</strong>abase server basis in the case of failures <strong>and</strong> if<br />

the d<strong>at</strong>abase cannot be read even after repe<strong>at</strong>ed tries, the previous<br />

days scraped d<strong>at</strong>a from th<strong>at</strong> particular server is used. With<br />

thous<strong>and</strong>s of d<strong>at</strong>abase servers, there are always some servers th<strong>at</strong>
