Data Warehousing and Analytics Infrastructure at Facebook
case of Hive, a technology that we have developed at Facebook. While the former is a popular distributed file system and map-reduce platform inspired by Google's GFS[4] and map-reduce[5] infrastructure, the latter brings traditional data warehousing tools and techniques such as SQL, metadata and partitioning to the Hadoop ecosystem. The availability of such familiar tools has tremendously improved developer and analyst productivity and created new use cases and usage patterns for Hadoop. With Hive, tasks that would otherwise take hours, if not days, to program can be expressed in minutes. A direct consequence is that we see more and more users using Hive and Hadoop for ad hoc analysis at Facebook – a usage pattern that is not easily supported by Hadoop alone. This expanded usage has also made data discovery and collaboration on analysis that much more important. In the following sections we also touch upon some of the systems that we have built to address these requirements.
Just as Hive and Hadoop are core to our storage and data processing strategies, Scribe[3] is core to our log collection strategy. Scribe is an open source technology created at Facebook that acts as a service that can aggregate logs from thousands of web servers. It acts as a distributed and scalable data bus, and by combining it with Hadoop's distributed file system (HDFS)[6] we have come up with a log aggregation solution that can scale with the increasing volume of logged data.
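To make the aggregation idea concrete, the following is a minimal sketch of Scribe-style, category-based log collection. This is a hypothetical illustration, not the actual Scribe implementation: messages tagged with a category are buffered per category and flushed as file-sized chunks, mirroring how Scribe servers write logs out as HDFS files in the associated Hadoop cluster.

```python
# Hypothetical sketch of category-based log aggregation (not Scribe's
# real API): buffer messages per category, flush full buffers as files.
from collections import defaultdict

class LogAggregator:
    def __init__(self, flush_threshold=3):
        self.buffers = defaultdict(list)   # category -> pending messages
        self.files = defaultdict(list)     # stands in for HDFS files
        self.flush_threshold = flush_threshold

    def log(self, category, message):
        self.buffers[category].append(message)
        if len(self.buffers[category]) >= self.flush_threshold:
            self.flush(category)

    def flush(self, category):
        # Write out the pending buffer as one file-sized chunk.
        if self.buffers[category]:
            self.files[category].append("\n".join(self.buffers[category]))
            self.buffers[category].clear()

agg = LogAggregator()
for i in range(7):
    agg.log("ad_clicks", f"click {i}")
agg.flush("ad_clicks")                     # final partial flush
print(len(agg.files["ad_clicks"]))         # 3 chunks: 3 + 3 + 1 messages
```

The per-category buffering is the essential point: it decouples thousands of producers (web servers) from the file system, at the cost of the buffering latency discussed later for low-volume categories.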
In the following sections we present how these different systems come together to solve the problems of scale and job diversity at Facebook. The rest of the paper is organized as follows. Section 2 describes how the data flows from the source systems to the data warehouse. Section 3 talks about storage systems, formats and optimizations done for storing large data sets. Section 4 describes some approaches taken towards making this data easy to discover, query and analyze. Section 5 talks about some challenges that we encounter due to different SLA expectations from different users using the same shared infrastructure. Section 6 discusses the statistics that we collect to monitor cluster health, plan out provisioning and give usage data to the users. We conclude in Section 7.

Figure 1: Data Flow Architecture

2. DATA FLOW ARCHITECTURE
In Figure 1, we illustrate how the data flows from the source systems to the data warehouse at Facebook. As depicted, there are two sources of data – the federated mysql tier that contains all the Facebook site related data and the web tier that generates all the log data. An example of a data set that originates in the former is the information describing advertisements – their category, their name, the advertiser information etc. The data sets originating in the latter mostly correspond to actions such as viewing an advertisement, clicking on it, fanning a Facebook page etc. In traditional data warehousing terminology, more often than not the data in the federated mysql tier corresponds to dimension data and the data coming from the web servers corresponds to fact data.
The data from the web servers is pushed to a set of Scribe-Hadoop (scribeh) clusters. These clusters comprise Scribe servers running on Hadoop clusters. The Scribe servers aggregate the logs coming from different web servers and write them out as HDFS files in the associated Hadoop cluster. Note that since the data is passed uncompressed from the web servers to the scribeh clusters, these clusters are generally bottlenecked on the network. Typically more than 30TB of data is transferred to the scribeh clusters every day – mostly within the peak usage hours. In order to reduce cross data center traffic, the scribeh clusters are located in the data centers hosting the web tiers. While we are exploring possibilities of compressing this data on the web tier before pushing it to the scribeh clusters, there is a trade-off between compression and the latencies that it can introduce, especially for low volume log categories. If a log category does not have enough volume to fill up the compression buffers on the web tier, its data can experience significant delay before it becomes available to the users, unless the compression buffer sizes are reduced or flushed periodically. Both of those options would in turn lead to lower compression ratios.
Periodically, the data in the scribeh clusters is compressed by copier jobs and transferred to the Hive-Hadoop clusters as shown in Figure 1. The copiers run at 5-15 minute intervals and copy out all the new files created in the scribeh clusters. In this manner the log data gets moved to the Hive-Hadoop clusters. At this point the data is mostly in the form of HDFS files. It gets published either hourly or daily in the form of partitions in the corresponding Hive tables through a set of loader processes and then becomes available for consumption.
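The copier pattern described above boils down to incremental set difference: each run lists the source directory, compares it against what has already been transferred, and moves only the new files. The sketch below is a hypothetical helper, not the actual Facebook copier jobs; `copy_fn` stands in for whatever transfer-and-compress step a real copier would perform.

```python
# Hypothetical sketch of an incremental copier run (not the actual
# Facebook copier jobs): copy only files not yet transferred.
def copy_new_files(source_listing, already_copied, copy_fn):
    """Copy files present in source_listing but not yet transferred."""
    new_files = sorted(set(source_listing) - already_copied)
    for path in new_files:
        copy_fn(path)              # stands in for transfer + compression
        already_copied.add(path)
    return new_files

copied = set()
transfers = []
# First 5-15 minute run sees two files; the next run sees one more.
copy_new_files(["logs/00000", "logs/00001"], copied, transfers.append)
new = copy_new_files(["logs/00000", "logs/00001", "logs/00002"],
                     copied, transfers.append)
print(new)        # ['logs/00002']
print(transfers)  # ['logs/00000', 'logs/00001', 'logs/00002']
```

Tracking the already-copied set makes each run idempotent: re-listing files that were moved in an earlier interval copies nothing twice, which matters when runs happen every few minutes against the same directories.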
The data from the federated mysql tier gets loaded to the Hive-Hadoop clusters through daily scrape processes. The scrape processes dump the desired data sets from the mysql databases, compress them on the source systems and finally move them into the Hive-Hadoop clusters. The scrapes need to be resilient to failures and also need to be designed such that they do not put too much load on the mysql databases. The latter is accomplished by running the scrapes against a replicated tier of mysql databases, thereby avoiding extra load on the already loaded masters. At the same time, any notion of strong consistency in the scraped data is sacrificed in order to avoid locking overheads. The scrapes are retried on a per database server basis in the case of failures, and if a database cannot be read even after repeated tries, the previous day's scraped data from that particular server is used. With thousands of database servers, there are always some servers that