Making Sense of Big Data - Cloudera Blog
Technology Forecast
Making sense of Big Data
A quarterly journal
2010, Issue 3

In this issue
04 Tapping into the power of Big Data
22 Building a bridge to the rest of your data
36 Revising the CIO's data playbook
Contents

Features
04 Tapping into the power of Big Data
Treating it differently from your core enterprise data is essential.
22 Building a bridge to the rest of your data
How companies are using open-source cluster-computing techniques to analyze their data.
36 Revising the CIO's data playbook
Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.

Interviews
14 The data scalability challenge
John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.
18 Creating a cost-effective Big Data strategy
Disney's Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.
34 Hadoop's foray into the enterprise
Cloudera's Amr Awadallah discusses how and why diverse companies are trying this novel approach.
48 New approaches to customer data analysis
Razorfish's Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data.

Departments
02 Message from the editor
50 Acknowledgments
54 Subtext
Message from the editor

Bill James has loved baseball statistics ever since he was a kid in Mayetta, Kansas, cutting baseball cards out of the backs of cereal boxes in the early 1960s. James, who compiled The Bill James Baseball Abstract for years, is a renowned "sabermetrician" (a term he coined himself). He now is a senior advisor on baseball operations for the Boston Red Sox, and he previously worked in a similar capacity for other Major League Baseball teams.

James has done more to change the world of baseball statistics than anyone in recent memory. As broadcaster Bob Costas says, James "doesn't just understand information. He has shown people a different way of interpreting that information." Before Bill James, Major League Baseball teams all relied on long-held assumptions about how games are won. They assumed batting average, for example, had more importance than it actually does.

James challenged these assumptions. He asked critical questions that didn't have good answers at the time, and he did the research and analysis necessary to find better answers. For instance, how many days' rest does a reliever need? James's answer is that some relievers can pitch well for two or more consecutive days, while others do better with a day or two of rest in between. It depends on the individual. Why can't a closer work more than just the ninth inning? A closer is frequently the best reliever on the team. James observes that managers often don't use the best relievers to their maximum potential.

The lesson learned from the Bill James example is that the best statistics come from striving to ask the best questions and trying to get answers to those questions. But what are the best questions? James takes an iterative approach, analyzing the data he has, or can gather, asking some questions based on that analysis, and then looking for the answers. He doesn't stop with just one set of statistics. The first set suggests some questions, to which a second set suggests some answers, which then give rise to yet another set of questions. It's a continual process of investigation, one that's focused on surfacing the best questions rather than assuming those questions have already been asked.

Enterprises can take advantage of a similarly iterative, investigative approach to data. Enterprises are being overwhelmed with data; many enterprises each generate petabytes of information they aren't making best use of. And not all of the data is the same. Some of it has value, and some, not so much.

The problem with this data has been twofold: (1) it's difficult to analyze, and (2) processing it using conventional systems takes too long and is too expensive.
02 PricewaterhouseCoopers Technology Forecast
Addressing these problems effectively doesn't require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They've demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to treat different data differently.

Enterprises shouldn't treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they're generating. With this approach, they can do what Bill James does and find better questions to ask.

In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article, "Tapping into the power of Big Data," on page 04, we begin with a consideration of exploratory analytics—methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack.

The article, "Building a bridge to the rest of your data," on page 22 highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides high-volume, low-cost computing with the help of open-source software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn't have the means to analyze before, as well as enable innovative ways to analyze it.

The buzz around Big Data and "cloud storage" (a term some vendors use to describe less-expensive cluster-computing techniques) is considerable, but the article, "Revising the CIO's data playbook," on page 36 emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of "gray data," or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn't yet have established value. Gray data analysis requires a different skill set—people who are more exploratory by nature.

As always, in this issue we've included interviews with knowledgeable executives who have insights on the overall topic of interest:

• John Parkinson of TransUnion describes the data challenges that more and more companies will face during the next three to five years.
• Bud Albers, Scott Thompson, and Matt Estes of Disney outline an agile, open-source cloud data vision.
• Amr Awadallah of Cloudera explores the reasons behind Apache Hadoop's adoption at search engine, social media, and financial services companies.
• Mark Taylor and Ray Velez of Razorfish contrast newer, more scalable techniques of studying customer data with the old methods.

Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of the Technology Forecast as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe. We welcome your feedback and your ideas for future research and analysis topics to cover.

Tom DeGarmo
Principal
Technology Leader
thomas.p.degarmo@us.pwc.com
Tapping into the power of Big Data
Treating it differently from your core enterprise data is essential.
By Galen Gruman
Like most corporations, the Walt Disney Co. is swimming in a rising sea of Big Data: information collected from business operations, customers, transactions, and the like; unstructured information created by social media and other Web repositories, including the Disney home page itself and sites for its theme parks, movies, books, and music; plus the sites of its many big business units, including ESPN and ABC.

"In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence," observes Bud Albers, executive vice president and CTO of the Disney Technology Shared Services Group. "The challenge becomes what do you do with it all?"

Albers and his team are in the early stages of answering their own question with an economical cluster-computing architecture based on a set of cost-effective and scalable technologies anchored by Apache Hadoop, an open-source, Java-based distributed computing framework modeled on Google's MapReduce and Google File System papers and developed by the Apache Software Foundation. These still-emerging technologies allow Disney analysts to explore multiple terabytes of information without the lengthy time requirements or high cost of traditional business intelligence (BI) systems.

This issue of the Technology Forecast examines how Apache Hadoop and these related technologies can derive business value from Big Data by supporting a new kind of exploratory analytics unlike traditional BI. These software technologies and their hardware cluster platform make it feasible not only to look for the needle in the haystack, but also to look for new haystacks. This kind of analysis demands an attitude of exploration—and the ability to generate value from data that hasn't been scrubbed or fully modeled into relational tables.

Using Disney and other examples, this first article introduces the idea of exploratory BI for Big Data. The second article examines Hadoop clusters and technologies that support them (page 22), and the third article looks at steps CIOs can take now to exploit the future benefits (page 36). We begin with a closer look at Disney's still-nascent but illustrative effort.
Bringing Big Data under control

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. Like everyone else, Disney's Big Data is huge, more unstructured than structured, and growing much faster than transactional data.

The Disney Technology Shared Services Group, which is responsible for Disney's core Web and analysis technologies, recently began its Big Data efforts but already sees high potential. The group is testing the technology and working with analysts in Disney business units. Disney's data comes from varied sources, but much of it is collected for departmental business purposes and not yet widely shared. Disney's Big Data approach will allow it to look at diverse data sets for unplanned purposes and to uncover patterns across customer activities. For example, insights from Disney Store activities could be useful in call centers for theme park booking or to better understand the audience segments of one of its cable networks. The Technology Shared Services Group is even using Big Data approaches to explore its own IT questions to understand what data is being stored, how it is used, and thus what type of storage hardware and management the group needs.

Albers assumes that Big Data analysis is destined to become essential. "The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what's in there, and figuring out how we deal with it," he says.

The team stumbled upon an inexpensive way to improve the business while pursuing more IT cost-effectiveness through the use of private-cloud technologies. (See the Technology Forecast, Summer 2009, for more on the topic of cloud computing.) When Albers launched the effort to change the division's cost curve so IT expenses would rise more slowly than the business usage of IT—the opposite had been true—he turned to an approach that many companies use to make data centers more efficient: virtualization. Virtualization offers several benefits, including higher utilization of existing servers and the ability to move workloads to prevent resource bottlenecks. An organization can also move workloads to external cloud providers, using them as a backup resource when needed, an approach called cloud bursting. By using such approaches, the Disney Technology Shared Services Group lowered its IT expense growth rate from 27 percent to –3 percent, while increasing its annual processing growth from 17 percent to 45 percent.

While achieving this efficiency, the team realized that the ability to move resources and tap external ones could apply to more than just data center efficiency. At first, they explored using external clouds to analyze big sets of data, such as Web traffic to Disney's many sites, and to handle big processing jobs more cost-effectively and more quickly than with internal systems.

During that exploration, the team discovered Hadoop, MapReduce, and other open-source technologies that distribute data-analysis workloads across many computers, breaking the analysis into many parallel workloads that produce results faster. Faster results mean that more questions can be asked, and the low cost of the technologies means the team can afford to ask those questions.
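The split-then-merge pattern behind MapReduce can be sketched in miniature. In this toy Python example (illustrative only: the log format, URLs, and function names are invented, and a thread pool stands in for a cluster of machines), chunks of Web-log lines are mapped to partial per-page hit counts in parallel, then reduced into one total:

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool; stands in for cluster nodes

# Toy Web-server log lines (hypothetical format).
LOG_LINES = [
    "10.0.0.1 GET /parks/booking",
    "10.0.0.2 GET /store/item/42",
    "10.0.0.3 GET /parks/booking",
    "10.0.0.4 GET /espn/scores",
]

def map_chunk(lines):
    # Map step: each worker turns its chunk of the log
    # into partial per-URL hit counts.
    return Counter(line.split()[2] for line in lines)

def merge_counts(a, b):
    # Reduce step: combine the workers' partial counts.
    return a + b

def count_hits(lines, workers=2):
    # Split the input into one chunk per worker, run the map step
    # in parallel, then fold the partial results into one total.
    chunks = [lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)
    return reduce(merge_counts, partials, Counter())

print(count_hits(LOG_LINES)["/parks/booking"])  # 2
```

In a Hadoop cluster, the same two functions would run as map and reduce tasks spread across hundreds of commodity servers, with the framework handling the chunking, scheduling, and merging.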
Disney assembled a Hadoop cluster and set up a central logging service to mine data that the organization hadn't been able to before. It will begin to provide internal group access to the cluster in October 2010. Figure 1 shows how the Hadoop cluster will benefit internal groups, business partners, and customers.
Figure 1: Disney's Hadoop cluster and central logging service
[Diagram: site visitors, internal business partners, and affiliated businesses reach the D-Cloud data cluster (Hadoop, central logging service, metadata repository) through a MapReduce/Hive/Pig interface connected to core IT and business unit systems.]
Disney's new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more responsive and personalized user experience.
Source: Disney, 2010
Simply put, the low cost of a Hadoop cluster means freedom to experiment. Disney uses a couple of dozen servers that were scheduled to be retired, and the organization operates its cluster with a handful of existing staff. Matt Estes, principal data architect for the Disney Technology Shared Services Group, estimates the cost of the project at $300,000 to $500,000.

"Before, I would have needed to figure on spending $3 million to $5 million for such an initiative," Albers says. "Now I can do this without charging to the bottom line."

Unlike the reusable canned queries in typical BI systems, Big Data analysis does require more effort to write the queries and the data-parsing code for what are often unique inquiries of data sources. But Albers notes that "the risk is lower due to all the other costs being lower." Failure is inexpensive, so analysts are more willing to explore questions they would otherwise avoid.
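Because each inquiry is often unique, that data-parsing code tends to be short and disposable. A hypothetical Python sketch (the log format and field names are invented) of a one-off question, how many distinct visitors hit each page:

```python
import re
from collections import defaultdict

# Hypothetical raw log lines; a real inquiry would target whatever
# unscrubbed source is at hand.
RAW_LOGS = [
    "2010-07-01T09:12:03 visitor=alice page=/store/item/42",
    "2010-07-01T09:13:10 visitor=bob page=/store/item/42",
    "2010-07-01T09:14:55 visitor=alice page=/parks/booking",
]

LINE = re.compile(r"visitor=(?P<visitor>\S+)\s+page=(?P<page>\S+)")

def distinct_visitors_per_page(lines):
    # One-off parsing code: skip lines that don't match rather
    # than scrubbing the whole data set up front.
    seen = defaultdict(set)
    for line in lines:
        m = LINE.search(line)
        if m:
            seen[m.group("page")].add(m.group("visitor"))
    return {page: len(visitors) for page, visitors in seen.items()}

print(distinct_visitors_per_page(RAW_LOGS))
# {'/store/item/42': 2, '/parks/booking': 1}
```

If the answer turns out to be uninteresting, the script is thrown away; the failed experiment cost an hour, not a BI project.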
Even in this early stage, Albers is confident that the ability to ask more questions will lead to more insights that translate to both the bottom line and the top line. For example, Disney already is seeking to boost customer engagement and spending by making recommendations to customers based on pattern analysis of their online behavior.
How Big Data analysis is different

What should other enterprises anticipate from Hadoop-style analytics? It is a type of exploratory BI they haven't done much before. This is business intelligence that provides indications, not absolute conclusions. It requires a different mind-set, one that begins with exploration, the results of which create hypotheses, which are tested before moving on to validation and consolidation.

These methods could be used to answer questions such as, "What indicators might there be that predate a surge in Web traffic?" or "What fabrics and colors are gaining popularity among influencers, and what sources might be able to provide the materials to us?" or "What's the value of an influencer on Web traffic through his or her social network?" See the sidebar "Opportunities for Big Data insights" for more examples of the kinds of questions that can be asked of Big Data.

Opportunities for Big Data insights

Here are other examples of the kinds of insights that may be gleaned from analysis of Big Data information flows:

• Customer churn, based on analysis of call center, help desk, and Web site traffic patterns
• Changes in corporate reputation and the potential for regulatory action, based on the monitoring of social networks as well as Web news sites
• Real-time demand forecasting, based on disparate inputs such as weather forecasts, travel reservations, automotive traffic, and retail point-of-sale data
• Supply chain optimization, based on analysis of weather patterns, potential disaster scenarios, and political turmoil
Disney and others explore their data without a lot of preconceptions. They know the results won't be as specific as a profit-margin calculation or a drug-efficacy determination. But they still expect demonstrable value, and they expect to get it without a lot of extra expense.

Typical BI uses data from transactional and other relational database management systems (RDBMSs) that an enterprise collects—such as sales and purchasing records, product development costs, and new employee hire records—diligently scrubs the data for accuracy and consistency, and then puts it into a form the BI system is programmed to run queries against. Such systems are vital for accurate analyses of transactional information, especially information subject to compliance requirements, but they don't work well for messy questions, they've been too expensive for questions you're not sure there's any value in asking, and they haven't been able to scale to analyze large data sets efficiently. (See Figure 2.)
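That conventional pipeline, scrubbed data loaded into a relational store and interrogated with canned queries, can be illustrated with Python's sqlite3 module (the schema and figures here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
# Scrubbed, consistent transactional records: the kind of data BI is built for.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 1200.0), ("west", 800.0), ("east", 300.0)],
)

def revenue_by_region(conn):
    # A reusable "canned" query: the question is fixed in advance,
    # and the schema was designed so it can be answered reliably.
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cur.fetchall())

print(revenue_by_region(conn))  # {'east': 1500.0, 'west': 800.0}
```

The strength and the limitation are the same thing: the question must be designed into the schema before it can be asked, which is exactly what exploratory analysis avoids.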
Figure 2: Where Big Data fits in
[Quadrant chart: the vertical axis runs from small to large data sets, the horizontal axis from non-relational to relational data. Big Data (via Hadoop/MapReduce) occupies large, non-relational data sets; traditional BI occupies relational data but offers less scalability; small, non-relational data sets hold little analytical value.]
Source: PricewaterhouseCoopers, 2010
Other companies have also tapped into the excitement brewing over Big Data technologies. Several Web-oriented companies that have always dealt with huge amounts of data—such as Yahoo, Twitter, and Google—were early adopters. Now, more traditional companies—such as Disney and TransUnion, a credit rating service—are exploring Big Data concepts, having seen the cost and scalability benefits the Web companies have realized.

Specifically, enterprises are also motivated by the inability to scale their existing approach for working on traditional analytics tasks, such as querying across terabytes of relational data. They are learning that the tools associated with Hadoop are uniquely positioned to explore data that has been sitting on the side, unanalyzed. Figure 3 illustrates how the data architecture landscape appears in 2010. Enterprises with high processing power requirements and centralized architectures are facing scaling issues.

In contrast, Big Data techniques allow you to sift through data to look for patterns at a much lower cost and in much less time than traditional BI systems. Should the data end up being so valuable that it requires the ongoing, compliance-oriented analysis of regular BI systems, only then do you make that investment. Big Data approaches let you ask more questions of more information, opening a wide range of potential insights you couldn't afford to consider in the past.

"Part of the analytics role is to challenge assumptions," Estes says. BI systems aren't designed to do that; instead, they're designed to dig deeper into known questions and look for variations that may indicate deviations from expected outcomes.

Furthermore, Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That's different from the "single source of truth" approach to standard BI and data warehousing. The Disney team started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. "We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang," Albers says.
Figure 3: The data architecture landscape in 2010
[Quadrant chart: the vertical axis runs from low to high processing power, the horizontal axis from centralized to distributed compute architecture. Most enterprises sit in the centralized, high-processing-power quadrant, facing scaling and capacity/cost problems; Google, Amazon, Facebook, Twitter, etc. sit in the distributed, high-processing-power quadrant (all use non-relational data stores for reasons of scale); cloud users with low compute requirements occupy the distributed, low-processing-power quadrant.]
Source: PricewaterhouseCoopers, 2010
Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn't have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data–style exploratory tools.
The ways different enterprises approach Big Data

It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools.

"At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately," says John Parkinson, TransUnion's acting CTO. "We want to do accurate but approximate matching and categorization in very large low-structure data sets."

Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. "It also, at least in its theoretical formulation, is very amenable to highly parallelized execution," which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes.
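The appeal is easy to see in miniature: because each record is scored against the pattern independently, the scoring step parallelizes naturally across machines. This Python sketch is illustrative only; difflib's similarity ratio stands in for whatever matching algorithm TransUnion actually uses, and the records are invented:

```python
from difflib import SequenceMatcher

# Invented low-structure records to filter.
RECORDS = [
    "JON SMITH 123 MAIN ST",
    "JOHN SMITH 123 MAIN STREET",
    "JANE DOE 99 ELM AVE",
]

def similarity(a, b):
    # Cheap approximate-match score in [0, 1]. Each record is scored
    # independently, which is what makes the filter easy to parallelize.
    return SequenceMatcher(None, a, b).ratio()

def approx_matches(records, target, threshold=0.7):
    # The map step of a MapReduce-style filter: keep records whose
    # score against the target pattern clears the threshold.
    return [r for r in records if similarity(r, target) >= threshold]

print(approx_matches(RECORDS, "JOHN SMITH 123 MAIN ST"))
```

In a MapReduce job, `approx_matches` would run as the map step over partitions of the billions of rows, with reducers collecting the surviving candidates.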
However, Parkinson thinks Hadoop and MapReduce<br />
are too immature. “MapReduce really hasn’t evolved<br />
yet to the point where your average enterprise<br />
technologist can easily make productive use <strong>of</strong> it. As<br />
for Hadoop, they have done a good job, but it’s like a<br />
lot <strong>of</strong> open-source s<strong>of</strong>tware—80 percent done. There<br />
were limits in the code that broke the stack well before<br />
what we thought was a good theoretical limit.”<br />
Parkinson echoes many IT executives who are<br />
skeptical <strong>of</strong> open-source s<strong>of</strong>tware in general. “If I have<br />
a bunch <strong>of</strong> engineers, I don’t want them spending their<br />
day being the technology support environment for what<br />
should be a product in our architecture,” he says.<br />
That’s a legitimate point <strong>of</strong> view, especially considering<br />
the data volumes TransUnion manages—8 petabytes<br />
from 83,000 sources in 4,000 formats and growing—<br />
and its focus on mission-critical capabilities for this<br />
data. Credit scoring must run successfully and deliver<br />
top-notch credit scores several times a day. It’s an<br />
operational system that many depend on for critical<br />
business decisions that happen millions <strong>of</strong> times a<br />
day. (For more on TransUnion, see the interview with<br />
Parkinson on page 14.)<br />
Disney’s system is purely intended for exploratory<br />
efforts or at most for reporting that eventually may feed<br />
up to product strategy or Web site design decisions. If<br />
it breaks or needs a little retooling, there’s no crisis.<br />
But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren’t any worse than a lot of proprietary software. “I fully expect we will run on things that break,” he says, adding facetiously, “Not that any commercial product I’ve ever had has ever broken.”
Data architect Estes also sees laudable responsiveness in open-source development. “In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed.”
Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. “My software cost is zero. You still have the implementation, but that’s a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you’re probably paying a little more, but certainly, you’re not going out and approving a Teradata cluster. You’re talking about Tier 3 storage. You’re talking about a very low level of cost for the storage.”
Albers’ points are also valid. PricewaterhouseCoopers predicts these open-source tools will be solid sooner rather than later and that they are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps.
Asking new business questions

Saving money is certainly a big reward, but PricewaterhouseCoopers contends the biggest payoff from Hadoop-style analysis of Big Data is the potential to improve organizations’ top line. “There’s a lot of potential value in the unstructured data in organizations, and people are starting to look at it more seriously,” says Tom Urquhart, chief architect at PricewaterhouseCoopers. Think of it as a “Google in a box, which allows you to do intelligent search regardless of whether the underlying content is structured or unstructured,” he says.
10 PricewaterhouseCoopers Technology Forecast
The Google-style techniques in Hadoop, MapReduce, and related technologies work in a fundamentally different way from traditional BI systems, which use strictly formatted data cubes pulling information from data warehouses. Big Data tools let you work with data that hasn’t been formally modeled by data architects, so you can analyze and compare data of different types and of different levels of rigor. Because these tools typically don’t discard or change the source data before the analysis begins, the original context remains available for drill-down by analysts.
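The contrast is easiest to see in miniature. The sketch below shows the MapReduce pattern applied to raw, unmodeled records in plain Python (not the Hadoop API; the log format and field names are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Raw, unmodeled records of mixed shape: no schema was imposed up front.
records = [
    "2010-06-01 user=alice action=view page=/parks",
    "2010-06-01 user=bob action=buy sku=1234",
    "2010-06-02 user=alice action=view page=/movies",
]

def map_phase(record):
    """Emit (key, 1) pairs; here we count actions per user."""
    fields = dict(f.split("=", 1) for f in record.split() if "=" in f)
    if "user" in fields and "action" in fields:
        yield (fields["user"], fields["action"]), 1

def reduce_phase(pairs):
    """Group by key and sum the counts, as a reducer would."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

mapped = [pair for r in records for pair in map_phase(r)]
counts = reduce_phase(mapped)
print(counts)  # {('alice', 'view'): 2, ('bob', 'buy'): 1}
```

No data cube or up-front model is required; the raw lines stay intact, and only the map function decides, at read time, what to extract.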
These tools provide technology assistance to a very human form of analysis: looking at the world as it is and finding patterns of similarity and difference, then going deeper into the areas of interest judged valuable. In contrast, BI systems know what questions should be asked and what answers to expect; their goal is to look for deviations from the norm or changes in standard patterns deemed important to track (such as changes in baseline quality or in sales rates in specific geographies). Such an approach, absent an exploratory phase, results in a lot of information loss during data consolidation. (See Figure 4.)
Pattern analysis mashup services

There’s another use of Big Data that combines efficiency and exploratory benefits: on-the-fly pattern analysis from disparate sources to return real-time results. Amazon.com pioneered Big Data–based product recommendations by analyzing customer data, including purchase histories, product ratings, and comments. Albers is looking for similar value that would come from making live recommendations to customers when they go to a Disney site, store, or reservations phone line—based on their previous online and offline behavior with Disney.
O’Reilly Media, a publisher best known for technical books and Web sites, is working with the White House to develop mashup applications that look at data from various sources to identify patterns that might help lobbyists and policymakers. For example, by mashing together US Census data and labor statistics, they can see which counties have the most international and domestic immigration, then correlate those attributes with government spending changes, says Roger Magoulas, O’Reilly’s research director.
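At its core, such a mashup is a join of independently collected data sets on a shared key. A minimal sketch, with all counties and figures made up for illustration:

```python
# Hypothetical mashup: join two independently collected data sets on a
# shared county key, then rank counties by net migration.
census = {           # county -> net migration (made-up figures)
    "Travis, TX": 25000,
    "Wayne, MI": -18000,
    "King, WA": 12000,
}
spending = {         # county -> change in government spending, $M (made-up)
    "Travis, TX": 140,
    "Wayne, MI": 310,
    "King, WA": 95,
}

# Inner join on county, then sort by migration to surface the pattern.
mashup = [
    {"county": c, "migration": census[c], "spending_change": spending[c]}
    for c in census if c in spending
]
mashup.sort(key=lambda row: row["migration"], reverse=True)
for row in mashup:
    print(row["county"], row["migration"], row["spending_change"])
```

The analytical work is in choosing the key and the sources; the join itself is simple once both data sets speak the same identifier.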
Figure 4: Information loss in the data consolidation process. Consolidating all collected data into summary departmental data and then summary enterprise data loses information at each step (and some pre-consolidated data is never collected at all); adding an exploration phase over all collected data before consolidation means less information loss and greater insight.
Source: PricewaterhouseCoopers, 2010
Mashups like this can also result in customer-facing services. FlightCaster for iPhone and BlackBerry uses Big Data approaches to analyze flight-delay records and current conditions to issue flight-delay predictions to travelers.
Exploiting the power of human analysis

Big Data approaches can lower processing and storage costs, but we believe their main value is to perform the analysis that BI systems weren’t designed for, acting as an enabler and an amplifier of human analysis.
Ad hoc exploration at a bargain

Big Data lets you inexpensively explore questions and peruse data for patterns that may indicate opportunities or issues. In this arena, failure is cheap, so analysts are more willing to explore questions they would otherwise avoid. And that should lead to insights that help the business operate better.
Medical data is an example of the potential for ad hoc analysis. “A number of such discoveries are made on the weekends when the people looking at the data are doing it from the point of view of just playing around,” says Doug Lenat, founder and CEO of Cycorp and a former professor at Stanford and Carnegie Mellon universities.
Right now the technical knowledge required to use these tools is nontrivial. Imagine the value of extending the exploratory capability more broadly. Cycorp is one of many startups trying to make Big Data analytic capabilities usable by more knowledge workers so they can perform such exploration.
Analyzing data that wasn’t designed for BI

Big Data also lets you work with “gray data,” or data from multiple sources that isn’t formatted or vetted for your specific needs, and that varies significantly in its level of detail and accuracy—and thus cannot be examined by BI systems.
One analogy is Wikipedia. Everyone knows its information is not rigorously managed or necessarily accurate; nonetheless, Wikipedia is a good first place to look for indicators of what may be true and useful. From there, you do further research using a mix of information resources whose accuracy and completeness may be more established.
People use their knowledge and experience to appropriately weigh and correlate what they find across gray data to come up with improved strategies to aid the business. Figure 5 compares gray data and more normalized black data.
Figure 5: Gray versus black data
Gray data: raw; data and context co-mingled; noisy; hypothetical; unchecked; indicative; less trustworthy; managed by the business unit (e.g., Wikipedia).
Black data: classified; provenanced; cleaned; actual; reviewed; confirming; more trustworthy; managed by IT (e.g., financial system data).
Source: PricewaterhouseCoopers, 2010
Web analytics and financial risk analysis are two examples of how Big Data approaches augment human analysts. These techniques comb huge data sets of information collected for specific purposes (such as monitoring individual financial records), looking for patterns that might identify good prospects for loans and flag problem borrowers. Increasingly, they comb external data not collected by a credit reporting agency—for example, trends in a neighborhood’s housing values or in local merchants’ sales patterns—to provide insights into where sales opportunities could be found or where higher concentrations of problem customers are located.
The same approaches can help identify shifts in consumer tastes, such as for apparel and furniture. And, by analyzing gray data related to costs of resources and changes in transportation schedules, these approaches can help anticipate stresses on suppliers and help identify where additional suppliers might be found.
All of these activities require human intelligence, experience, and insight to make sense of the data, figure out the questions to ask, decide what information should be correlated, and generally conduct the analysis.
Why the time is ripe for Big Data

The human analysis previously described is old hat for many business analysts, whether they work in manufacturing, fashion, finance, or real estate. What’s changing is scale. As noted, many types of information are now available that never existed or were not accessible. What could once only be suggested through surveys, focus groups, and the like can now be examined directly, because more of the granular thinking and behaviors are captured. Businesses have the potential to discover more through larger samples and more granular details, without relying on people to recall behaviors and motivations accurately.
This potential can be realized only if you pull together and analyze all that data. Right now, there’s simply too much information for individual analysts to manage, increasing the chances of missing potential opportunities or risks. Businesses that augment their human experts with Big Data technologies could have significant competitive advantages by heading off problems sooner, identifying opportunities earlier, and performing mass customization at a larger scale.
Fortunately, the emerging Big Data tools should let businesspeople apply individual judgments to vaster pools of information, enabling low-cost, ad hoc analysis never before feasible. Plus, as patterns are discovered, the detection of some can be automated, letting the human analysts concentrate on the art of analysis and interpretation that algorithms can’t accomplish.
Even better, emerging Big Data technologies promise to extend the reach of analysis beyond the cadre of researchers and business analysts. Several startups offer new tools with familiar data-analysis interfaces—similar to those for SQL databases and Excel spreadsheets—for exploring Big Data sources, thus broadening that exploratory ability to a wider set of knowledge workers.
Finally, Big Data approaches can be used to power analytics-based services that improve the business itself, such as in-context recommendations to customers, more accurate predictions of service delivery, and more accurate failure predictions (such as for the manufacturing, energy, medical, and chemical industries).
Conclusion

PricewaterhouseCoopers believes that Big Data approaches will become a key value creator for businesses, letting them tap into the wild, woolly world of information heretofore out of reach. These new data management and storage technologies can also provide economies of scale in more traditional data analysis. Don’t limit yourself to the efficiencies of Big Data and miss out on the potential for gaining insights through its advantages in handling the gray data prevalent today.
Big Data analysis supplements, rather than replaces, the BI systems, data warehouses, and database systems essential to financial reporting, sales management, production management, and compliance systems. The difference is that those information systems deal with the knowns that must meet high standards for rigor, accuracy, and compliance, while the emerging Big Data analytics tools help you deal with the unknowns that could affect business strategy or its execution.
As the amount and interconnectedness of data vastly increases, the value of the Big Data approach will only grow. If the amount and variety of today’s information is daunting, think what the world will be like in 5 or 10 years. People will become mobile sensors—collecting, creating, and transmitting all sorts of information, from locations to body status to environmental information. We already see this happening as smartphones equipped with cameras, microphones, geolocation, and compasses proliferate. Wearable medical sensors, small temperature tags for use on packages, and other radio-equipped sensors are a reality. They’ll be the Twitter and Facebook feeds of tomorrow, adding vast quantities of new information that could provide context on behavior and environment never before possible—and a lot of “noise” certain to mask what’s important.
Insight-oriented analytics in this sea of information—where interactions cause untold ripples and eddies in the flow and delivery of business value—will become a critical competitive requirement. Big Data technology is the likeliest path to gaining such insights.
The data scalability challenge

John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.

Interview conducted by Vinod Baya and Alan Morrison

John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines TransUnion’s considerable requirements for less-structured data analysis, shedding light on the many data-related technology challenges TransUnion faces today—challenges he says more companies will face in the near future.
PwC: In your role at TransUnion, you’ve evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce?

JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it.
PwC: What class of problem would that be?

JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it’s very amenable to highly parallelized execution, which many of the other filtering algorithms we’ve used aren’t.
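The class of problem Parkinson describes parallelizes well because a map step can assign each row a coarse blocking key, so the expensive fuzzy comparisons happen only within each reducer's block. A toy sketch (the records, the key scheme, and the 0.8 threshold are all invented for illustration):

```python
from difflib import SequenceMatcher

# Low-structured rows from different sources; we want approximate matches.
rows = [
    "JOHN Q PUBLIC, 42 OAK ST, SPRINGFIELD",
    "John Public, 42 Oak Street, Springfield",
    "Jane Doe, 7 Elm Ave, Shelbyville",
]

def block_key(row):
    """Map step: a coarse key so candidate matches land on one reducer."""
    name = row.split(",")[0].upper()
    return name[0] + name.split()[-1][:4]   # first initial + surname prefix

def similar(a, b, threshold=0.8):
    """Reduce step: fuzzy comparison, run only within a block."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio() >= threshold

# Shuffle: group rows by blocking key.
blocks = {}
for row in rows:
    blocks.setdefault(block_key(row), []).append(row)

# Compare all pairs within each block only.
matches = [(a, b) for group in blocks.values()
           for i, a in enumerate(group) for b in group[i + 1:]
           if similar(a, b)]
print(matches)   # the two Springfield rows pair up; Jane Doe stands alone
```

The quadratic pairwise comparison, which would be hopeless over billions of rows, is confined to small blocks, and the blocks process independently in parallel.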
The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn’t what Google runs in production—it’s an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They’ve done a good job, but it’s like a lot of open-source software—80 percent done. The 20 percent that isn’t done—those are the hard parts.

From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It’s difficult to manage. It has some bugs in it, and it doesn’t behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.
We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit that we needed to achieve to make it worthwhile.
Now, the good news of course is that you get source code. But that’s also the bad news. You need to get the source code, and that’s not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there’s a pony there, but it’s going to be a while before it stabilizes to the point that I want to bet revenue on it.
PwC: Data warehousing appliance prices have dropped pretty dramatically over the past couple of years. When it comes to data that’s not necessarily on the critical path, how does an enterprise make sure that it is not spending more than it has to?
JP: We are probably not a good representational example of that because our business is analyzing the data. There is almost no price we won’t pay to get a better answer faster, because we can price that into the products we produce. The challenge we face is that the tools don’t always work properly at the edge of the envelope. This is a problem for hardware as well as software. A lot of the vendors stop testing their applications at about 80 percent or 85 percent of their theoretical capability. We routinely run them at 110 percent of their theoretical capability, and they break. I don’t mind making tactical justifications for technologies that I expect to replace quickly. I do that all the time. But having done that, I want the damn thing to work. Too often, we’ve discovered that it doesn’t work.
PwC: Are you forced to use technologies that have matured because of a wariness of things on the absolute edge?
JP: My dilemma is that things that are known to work usually don’t scale to what we need—for speed or full capacity. I must spend some time, energy, and dollars betting on things that aren’t mature yet, but that can be sufficiently generalized architecturally. If the one I pick doesn’t work, or goes away, I can fit something else into its place relatively easily. That’s why we like appliances. As long as they are well behaved at the network layer and have a relatively generalized or standards-based business semantic interface, it doesn’t matter if I have to unplug one in 18 months or two years because something better came along. I can’t do that for everything, but I can usually afford to do it in the areas where I have no established commercial alternative.
“I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture.” —John Parkinson of TransUnion
PwC: What are you using in place of something like Hadoop?

JP: Essentially, we use brute force. We use Ab Initio, which is a very smart brute-force parallelization scheme. I depend on certain capabilities in Ab Initio to parallelize the ETL [extract, transform, and load] in such a way that I can throw more cores at the problem.
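The approach Parkinson describes, partitioning record-at-a-time ETL transforms across cores, can be sketched with plain Python multiprocessing as a stand-in (Ab Initio's actual mechanism is proprietary; the record format here is invented):

```python
from multiprocessing import Pool

def transform(record):
    """The T in ETL: normalize one record independently of the others."""
    name, amount = record.split("|")
    return (name.strip().upper(), int(amount))

def run_etl(records, workers=4):
    # Because each record transforms independently, the work partitions
    # cleanly across a pool of processes: more cores, more throughput.
    with Pool(workers) as pool:
        return pool.map(transform, records, chunksize=1000)

if __name__ == "__main__":
    raw = [" alice |100", "bob |250", " carol|75"]
    print(run_etl(raw))  # [('ALICE', 100), ('BOB', 250), ('CAROL', 75)]
```

This works because the transform is embarrassingly parallel; as Parkinson notes later in the interview, the hard limit becomes moving the data to and from the workers, not the computation itself.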
PwC: Much of the data you see is transactional. Is it all structured data, or are you also mining text?

JP: We get essentially three kinds of data. We get accounts receivable data from credit loan issuers. That’s the record of what people actually spend. We get public record data, such as bankruptcy records, court records, and liens, which are semi-structured text. And we get other data, which is whatever shows up, and it’s generally hooked together around a well-understood set of identifiers. But this data is essentially free—we don’t pay for it. It’s also very noisy. So we have to spend computational time figuring out whether the data we have is right, because we must find a place to put it in the working data sets that we build.
At TransUnion, we suck in 100 million updates a day for the credit files. We update a big data warehouse that contains all the credit and related data. And then every day we generate somewhere between 1 and 20 operational data stores, which is what we actually run the business on. Our products are joined between what we call indicative data, the information that identifies you as an individual; structured data, which is derived from transactional records; and unstructured data that is attached to the indicative. We build those products on the fly because the data may change every day, sometimes several times a day.

One challenge is how to accurately find the right place to put the record. For example, we get a Joe Smith at 13 Main Street and a Joe Smith at 31 Main Street. Are those two different Joe Smiths, or is that a typing error? We have to figure that out 100 million times a day using a bunch of custom pattern-matching and probabilistic algorithms.
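One textbook building block for this kind of probabilistic matching is string edit distance (TransUnion's actual algorithms are custom and unpublished; this shows only the standard primitive):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# "13 Main Street" vs "31 Main Street" are two substitutions apart, so a
# matcher might treat them as a likely transposition typo, not two people.
print(edit_distance("13 Main Street", "31 Main Street"))  # 2
print(edit_distance("13 Main Street", "500 Elm Avenue"))  # clearly farther
```

A production matcher would weight this signal against others (name, date of birth, account identifiers) before deciding whether the two records describe one Joe Smith or two.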
PwC: Of the three kinds of data, which is the most challenging?

JP: We have two kinds of challenges. The first is driven purely by the scale at which we operate. We add roughly half a terabyte of data per month to the credit file. Everything we do has challenges related to scale, updates, speed, or database performance. The vendors both love us and hate us. But we are where the industry is going—where everybody is going to be in two to five years. We are a good leading indicator, but we break their stuff all the time. A second challenge is the unstructured part of the data, which is increasing.
PwC: It’s more of a challenge to deal with the unstructured stuff because it comes in various formats and from various sources, correct?

JP: Yes. We have 83,000 data sources. Not everyone provides us with data every day. It comes in about 4,000 formats, despite our data interchange standards. And, to be able to process it fast enough, we must convert all data into a single interchange format that is the representation of what we use internally. Complex computer science problems are associated with all of that.
PwC: Are these the kinds of data problems that businesses in other industries will face in three to five years?

JP: Yes, I believe so.
PwC: What are some of the other problems you think will become more widespread?

JP: Here are some simple practical examples. We have 8.5 petabytes of data in the total managed environment. Once you go seriously above 100 terabytes, you must replace the storage fabric every four or five years. Moving 100 terabytes of data becomes a huge material issue and takes a long time. You do get some help from improved interconnect speed, but the arrays go as fast as they go for reads and writes and you can’t go faster than that. And businesses down the food chain are not accustomed to thinking about refresh cycles that take months to complete. Now, a refresh cycle of PCs might take months to complete, but any one piece of it takes only a couple of hours. When I move data from one array to another, I’m not done until I’m done. Additionally, I have some bugs and new vulnerabilities to deal with.
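A back-of-envelope calculation shows why. Assuming an optimistic sustained 10 Gb/s effective rate (an assumed figure; array read and write limits are usually the tighter constraint), a single 100 terabyte pass takes roughly a day of uninterrupted transfer, before any verification or cutover:

```python
# Back-of-envelope: why moving 100 TB is a project, not a task.
terabytes = 100
bytes_total = terabytes * 10**12
rate_bytes_per_s = 10 * 10**9 / 8   # assumed sustained 10 Gb/s -> 1.25 GB/s
hours = bytes_total / rate_bytes_per_s / 3600
print(f"{hours:.0f} hours of sustained, error-free transfer")  # 22 hours
```

Scale that to 8.5 petabytes, with real-world stalls, checksums, and the live workload sharing the same arrays, and months-long refresh cycles stop looking surprising.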
Today, we don’t have a backup problem at TransUnion because we do incremental forever backup. However, we do have a restore problem. To restore a material amount of data, which we very occasionally need to do, takes days in some instances because the physics of the technology we use won’t go faster than that. The average IT department doesn’t worry about these problems. But take the amount of data an average IT department has under management, multiply it by a single decimal order of magnitude, and it starts to become a material issue.
We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It. For now, I don’t have a computational problem, but if I can’t shift the trend line on Store It and Move It, I will have a computational problem within a few years. To perform the computations in useful time, I must parallelize how I compute. Above a certain point, the parallelization breaks because I can’t move the data further.
PwC: Cloudera [a vendor offering a Hadoop distribution] would say bring the computation to the data.

JP: That works only for certain kinds of data. We already do all of that large-scale computation on a file system basis, not on a database basis. And we spend compute cycles to compress the data so there are fewer bits to move, then decompress the data for computation, and recompress it so we have fewer bits to store.
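The compress, move, decompress cycle trades CPU for bytes, and it is easy to demonstrate with a stock codec (zlib as a generic stand-in; TransUnion's codecs are not public, and the sample record is invented):

```python
import zlib

# Trading compute cycles for bytes: compress before moving or storing,
# decompress to operate. Repetitive record-style data compresses well,
# though the win is always workload-dependent.
records = b"name=joe smith;addr=13 main st;score=720\n" * 10000
packed = zlib.compress(records, 6)   # level 6: the speed/ratio middle ground

ratio = len(records) / len(packed)
print(f"{len(records)} bytes -> {len(packed)} bytes ({ratio:.0f}x smaller)")

assert zlib.decompress(packed) == records   # lossless round trip
```

Parkinson's wish for "computationally more-efficient" algorithms is about exactly this knob: higher compression levels shrink Store It and Move It further, but the extra CPU time eventually eats the savings.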
What we have discovered—because I run the fourth largest commercial GPFS [general parallel file system, a distributed computing file system developed by IBM] cluster in the world—is that once you go beyond a certain size, the parallelization management tools break. That’s why I keep telling people that Hadoop is not what Google runs in production. Maybe the Google guys have solved this, but if they have, they aren’t telling me how. •
“We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It.” —John Parkinson of TransUnion
Creating a cost-effective Big Data strategy

Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.

Interview conducted by Galen Gruman and Alan Morrison

Bud Albers joined what is now the Disney Technology Shared Services Group two years ago as executive vice president and CTO. His management team includes Scott Thompson, vice president of architecture, and Matt Estes, principal data architect. The Technology Shared Services Group, located in Seattle, has a heritage dating back to the late 1990s, when Disney acquired Starwave and Infoseek.

The group supports all the Disney businesses ($38 billion in annual revenue), managing the company’s portfolio of Web properties. These include properties for the studio, store, and park; ESPN; ABC; and a number of local television stations in major cities.

In this interview, Albers, Thompson, and Estes discuss how they’re expanding Disney’s Web data analysis footprint without incurring additional cost by implementing a Hadoop cluster. Albers and team freed up budget for this cluster by virtualizing servers and eliminating other redundancies.
PwC: Disney is such a diverse company, and yet there clearly is lots of potential for synergies and cross-fertilization. How do you approach these opportunities from a data perspective?

BA: We try to understand the best way to work with and to provide services to the consumer in the long term. We have some businesses that are very data intensive, and then we have some that are less so because of their consumer audience. One of the challenges always is how to serve both kinds of businesses and do so in ways that make sense. The sell-to relationships extend from the studio out to the distribution groups and the theater chains. If you're selling to millions, you're trying to understand the different audiences and how they connect.
One of the things I've been telling my folks from a data perspective is that you don't send terabytes one way to be mated with a spreadsheet on the other side, right? We're thinking through those kinds of pieces and trying to figure out how we move down a path. The net is that working with all these businesses gives us a diverse set of requirements, as you might imagine. We're trying to stay ahead of where all the businesses are.

In that respect, the questions I'm asking are, how do we get more agile, and how do we do it in a way that handles all the data we have? We must consider all of the new form factors being developed, all of which will generate lots of data. A big question is, how do we handle this data in a way that makes cost sense for the business and provides us an increased level of agility?
18 PricewaterhouseCoopers Technology Forecast
We hope to do in other areas what we've done with content distribution networks [CDNs]. We've had a tremendous amount of success with the CDN marketplace by standardizing, by staying in the middle of the road and not going to Akamai proprietary extensions, and by creating a dynamic marketplace. If we get a new episode of LOST, we can start streaming it, and I can be streaming 80 percent on Akamai and 20 percent on Level 3. Then we can decide we're going to turn it back, and I'm going to give 80 percent to Limelight and 20 percent to Level 3. We can do that dynamically.
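The dynamic split Albers describes amounts to weighted routing of viewer sessions across providers. A minimal sketch of that idea, not Disney's actual system; the provider weights come from his example and the routing function is invented:

```python
import random

random.seed(42)  # deterministic demo

def route_stream(weights: dict) -> str:
    """Pick a CDN for one viewer session in proportion to the current weights."""
    providers = list(weights)
    return random.choices(providers, weights=[weights[p] for p in providers])[0]

# Start the episode 80 percent on Akamai, 20 percent on Level 3...
weights = {"Akamai": 0.8, "Level 3": 0.2}
sessions = [route_stream(weights) for _ in range(10_000)]
print({p: sessions.count(p) for p in weights})

# ...then turn it back dynamically: 80 percent Limelight, 20 percent Level 3.
weights = {"Limelight": 0.8, "Level 3": 0.2}
```

Because each session consults the current weight table, shifting traffic between providers is just an update to that table rather than a redeployment.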
PwC: What are the other main strengths of the Technology Shared Services Group at Disney?

BA: When I came here a couple of years ago, we had some very good core central services. If you look at the true definition of a cloud, we had the very early makings of one—shared central services around registration, for example. On Disney, on ABC, or on ESPN, if you have an ID, it works on all the Disney properties. If you have an ESPN ID, you can sign in to KGO in San Francisco, and it will work. It's all a shared registration system. The advertising system we built is shared. The marketing systems we built are shared—all the analytics collection, all those things are centralized. Those things that are common are shared among all the sites.

Those things that are brand specific are built by the brands, and the user interface is controlled by the brands, so each of the various divisions has a head of engineering on the Web site who reports to me. Our CIO worries about it from the firewall back; I worry about it from the firewall to the living room and the mobile device. That's the way we split up the world, if that makes sense.
PwC: How do you link the data requirements of the central core with those that are unique to the various parts of the business?

BA: It's more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today. We typically pull down most of the analytics and add things in, and it's a constant struggle to answer the question, "Do we have everything?" We're headed toward this notion of one data element at a time, aggregate, and queue up the aggregate. It can get a little bit crazy because you wind up needing to pull the data in and run it through that whole food chain, and it may or may not have lasting value.

It may have only a temporal level of importance, and so we're trying to figure out how to better handle that. An awful lot of what we do in the data collection is pull it in, lay it out so it can be reported on, and/or push it back into the businesses, because the Web is evolving rapidly from a standalone thing to an integral part of how you do business.
"It's more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today."
—Bud Albers
Creating a cost-effective Big Data strategy 19
PwC: Hadoop seems to suggest a feasible way to analyze data that has only temporal importance. How did you get to the point where you could try something like a Hadoop cluster?

BA: Guys like me never get called when it's all pretty and shiny. The Disney unit I joined obviously has many strengths, but when I was brought on, there was a cost growth situation. The volume of the aggregate activity growth was 17 percent. Our server growth at the time was 30 percent. So we were filling up data centers, but we were filling them with CPUs that weren't being used. My question was, how can you go to the CFO and ask for a lot of money to fill a data center with capital assets that you're going to use only 5 percent of?

CPU utilization isn't the only measure, but it's the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization at five-minute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period.
Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn't have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service.

We call this our D-Cloud effort. Another step in this effort was moving to a REST [Representational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set. We can then use the Hadoop and MapReduce paradigm to go after that data.
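Going after a central event log with the MapReduce paradigm looks like this in miniature. The log format and service names are invented, and the three phases run here in one process purely for illustration; a real job would distribute them across a Hadoop cluster:

```python
from collections import defaultdict

# Hypothetical central-log records: timestamp, service, event type.
log_lines = [
    "2010-03-01T00:00:01 registration sign_in",
    "2010-03-01T00:00:02 advertising impression",
    "2010-03-01T00:00:03 registration sign_in",
    "2010-03-01T00:00:04 registration sign_up",
]

def map_phase(line):
    """Emit a (service, 1) pair for every event in the raw log."""
    _, service, _ = line.split()
    yield service, 1

def reduce_phase(service, counts):
    """Sum the counts for one key."""
    return service, sum(counts)

# Shuffle step: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # {'registration': 3, 'advertising': 1}
```

The raw log stays untouched; the job only derives an aggregate from it, which fits the "temporal importance" point above: derived views can be thrown away and recomputed.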
PwC: How does the central logging service fit into your overall strategy?

ST: As we looked at it, we said, it's not just about virtualization. To be able to burst and do these other things, you need to build a bunch of core services. The initiative we're working on now is to build some of those core services around managing configuration. This project takes the foundation we laid with virtualization and a REST and JSON data exchange standard, and adds those core services that enable us to respond to the marketplace as it develops. Piping that data back to a central repository helps you to analyze it, understand what's going on, and make better decisions on the basis of what you learned.
PwC: How do you evolve so that the data strategy is really served well, so that it's more of a data-driven approach in some ways?

ME: On one side, you have a very transactional OLTP [online transaction processing] kind of world, RDBMSs [relational database management systems], and major vendors that we're using there. On the other side of it, you have traditional analytical warehousing. And where we've slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There's a freedom that's derived from blending these two kinds of data.

Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances.

Then the key will be putting an expert system in place. That will give us the ability to really understand what's going on in the actual operational environment. We're starting to move again toward lower utilization trajectories. We need to scale the infrastructure back and get that utilization level up to the threshold.
PwC: This kind of information doesn't go in a cube. Not that data cubes are going away, but cubes are fairly well known now. The value you can create is exactly what you said, understanding the thinking behind it and the exploratory steps.

ST: We think storing the unstructured data in its raw format is what's coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer. Then you MapReduce the input, and you may send that off to a data cube and a thing that someone can dig around in, but you keep the data in its raw format and pull out only what you need.
BA: The wonderful thing about where we're headed right now is that data analysis used to be this giant, massive bet that you had to place up front, right? No longer. Now, I pull Hadoop off of the Internet, first making sure that we're compliant from a legal perspective with licensing and so forth. After that's taken care of, you begin to prototype. You begin to work with it against common hardware. You begin to work with it against stuff you otherwise might throw out. Rather than, I'm going to go spend how much for Teradata?

We're using the basic premise of the cloud, and we're using those techniques of standardizing the interface to virtualize and drive cost out. I'm taking that cost savings and returning some of it to the business, but then reinvesting some in new capabilities while the cost curve is stabilizing.
ME: Refining some of this reinvestment in new capabilities doesn't have to be put in the category of traditional "$5 million projects" companies used to think about. You can make significant improvements with reinvestments of $200,000 or even $50,000.

BA: It's then a matter of how you're redeploying an investment in resources that you've already made as a corporation. It's a matter of now prioritizing your work and not changing the bottom-line trajectory in a negative fashion with a bet that may not pay off. I can try it, and I don't have to get great big governance-based permission to do it, because it's not a bet of half the staff and all of this stuff. It's, OK, let's get something on the ground, let's work with the business unit, let's pilot it, let's go somewhere where we know we have a need, let's validate it against this need, and let's make sure that it's working. It's not something that must go through an RFP [request for proposal] and standard procurement. I can move very fast. •
"We think storing the unstructured data in its raw format is what's coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer."
—Scott Thompson
Building a bridge to the rest of your data

How companies are using open-source cluster-computing techniques to analyze their data.

By Alan Morrison
As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data—as if projects such as the Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn't exist. In a May 2008 blog post, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops1 of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. "Have the supercomputer folks been bypassed and don't even know it?" Turner wondered.2

Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google's research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google's wake. Many of them were Web companies that had data processing scalability challenges similar to Google's.
Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by the Apache Software Foundation; it also adopted MapReduce, Google's parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes.

By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means. Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business.
By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage.
Building a bridge to the rest of your data 23
"Hadoop will process the data set and output a new data set, as opposed to changing the data set in place." —Amr Awadallah of Cloudera
What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward what about them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data.
Hadoop clusters

Although cluster computing has been around for decades, commodity clusters are more recent, starting with UNIX- and Linux-based Beowulf clusters in the mid-1990s. These banks of inexpensive servers networked together were pitted against expensive supercomputers from companies such as Cray—the kind of computers that government agencies, such as the National Aeronautics and Space Administration (NASA), bought. It was no accident that NASA pioneered the development of Beowulf.3

Hadoop extends the value of commodity clusters, making it possible to assemble a high-end computing cluster at a low-end price. A central assumption underlying this architecture is that some nodes are bound to fail when computing jobs are distributed across hundreds or thousands of nodes. Therefore, one key to success is to design the architecture to anticipate and recover from individual node failures.4
Other goals of the Google Cluster Architecture and its expression in open-source Hadoop include:

• Price/performance over peak performance—The emphasis is on optimizing aggregate throughput; for example, sorting functions to rank the occurrence of keywords in Web pages. Overall sorting throughput is high. In each of the past three years, Yahoo's Hadoop clusters have won Gray's terabyte sort benchmarking test.5

• Software tolerance for hardware failures—When a failure occurs, the system responds by transferring the processing to another node, a critical capability for large distributed systems. As Roger Magoulas, research director for O'Reilly Media, says, "If you are going to have 40 or 100 machines, you don't expect your machines to break. If you are running something with 1,000 nodes, stuff is going to break all the time."

• High compute power per query—The ability to scale up to thousands of nodes implies the ability to throw more compute power at each query. That ability, in turn, makes it possible to bring more data to bear on each problem.

• Modularity and extensibility—Hadoop clusters scale horizontally with the help of a uniform, highly modular architecture.
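The failure-tolerance goal above can be illustrated with a toy scheduler that simply puts a failed node's task back on the queue. This is a sketch of the idea only, not Hadoop's actual JobTracker logic; the failure rate and node names are invented:

```python
import random

random.seed(7)  # deterministic demo

def run_job(tasks, nodes, failure_rate=0.2):
    """Assign each task to a node; on a simulated node failure, retry elsewhere."""
    completed = {}
    pending = list(tasks)
    while pending:
        task = pending.pop()
        node = random.choice(nodes)
        if random.random() < failure_rate:  # node died mid-task
            pending.append(task)            # reschedule automatically, no operator needed
        else:
            completed[task] = node
    return completed

done = run_job(tasks=[f"block-{i}" for i in range(100)],
               nodes=[f"node-{i}" for i in range(10)])
assert len(done) == 100  # every task completes despite individual failures
```

Because recovery is built into the scheduling loop, adding more (unreliable) machines increases throughput rather than fragility, which is the economic point of commodity clusters.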
Hadoop isn't intended for all kinds of workloads, especially not those with many writes. It works best for read-intensive workloads. These clusters complement, rather than replace, high-performance computing (HPC) and relational data systems. They don't work well with transactional data or records that require frequent updating. "Hadoop will process the data set and output a new data set, as opposed to changing the data set in place," says Amr Awadallah, vice president of engineering and CTO of Cloudera, which develops a version of Hadoop.

A data architecture and a software design that are frugal with network and disk resources are responsible for the price/performance ratio of Hadoop clusters. In Awadallah's words, "You move your processing to where your data lives." Each node has its own processing and storage, and the data is divided and processed locally in blocks sized for the purpose. This concept of localization makes it possible to use inexpensive serial advanced technology attachment (SATA) hard disks—the kind used in most PCs and servers—and Gigabit Ethernet for most network interconnections. (See Figure 1.)
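"Move your processing to where your data lives" can be sketched as a placement rule: prefer a node that already holds a replica of the block. The block map and node names below are invented, and real Hadoop schedulers also weigh rack locality and load:

```python
# Hypothetical map of which nodes hold a replica of each 64MB block.
block_locations = {
    "block-1": ["node-a", "node-c"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-a", "node-b"],
}

def schedule(block, free_nodes):
    """Run the task on a node that holds the block, avoiding a network copy."""
    for node in block_locations[block]:
        if node in free_nodes:
            return node, "local read"
    # Fall back: ship the block over the network to whichever node is free.
    return free_nodes[0], "remote read"

print(schedule("block-1", ["node-c", "node-b"]))  # ('node-c', 'local read')
print(schedule("block-2", ["node-a"]))            # ('node-a', 'remote read')
```

Local reads come off inexpensive SATA disks at each node, so the network only carries the (much smaller) results, which is what keeps Gigabit Ethernet sufficient.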
[Figure 1 shows a client connected through a 1000Mbps switch to racks of task tracker/DataNode machines, plus a JobTracker and a NameNode, on 100Mbps rack switches. Typical node setup: 2 quad-core Intel Nehalem processors, 24GB of RAM, 12 1TB SATA disks (non-RAID), and a 1 Gigabit Ethernet card; cost per node: $5,000; effective file space per node: 20TB. Claimed benefits: linear scaling at $250 per user TB (versus $5,000–$100,000 for alternatives); compute placed near the data and fewer writes limit networking and storage costs; modularity and extensibility.]

Figure 1: Hadoop cluster layout and characteristics
Source: IBM, 2008, and Cloudera, 2010
"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces." —Chris Wensel of Concurrent
The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, "The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative."6

The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, the Walt Disney Co.'s Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multi-terabyte cluster with Hadoop and other free open-source tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, "Tapping into the power of Big Data," on page 04.)
These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon's or Cloudera's distribution on the Amazon Elastic Compute Cloud (EC2) platform.

The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon's EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press.

Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.7

"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs]," says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) "I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That's extraordinarily powerful."
The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they're a "data operating system." This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. "It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it," Awadallah says.

HDFS, like all Hadoop tools, is Java based. An HDFS cluster contains two kinds of nodes:

• A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs

• Multiple DataNodes that create, manage, and process the 64MB blocks that contain pieces of Hadoop jobs, according to the instructions from the NameNode
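The division of labor between the two node types can be sketched as a lookup: the NameNode holds only the file-to-block metadata in memory, while the DataNodes hold the block payloads. File names, block IDs, and placements here are invented for illustration:

```python
# NameNode side: in-memory metadata only -- which blocks make up each file,
# and which DataNodes hold a replica of each block.
namenode = {"/logs/clickstream.log": ["blk_1", "blk_2"]}
block_map = {
    "blk_1": ["datanode-1", "datanode-3"],
    "blk_2": ["datanode-2", "datanode-3"],
}

# DataNode side: the block payloads themselves (64MB each in real HDFS).
datanodes = {
    "datanode-1": {"blk_1": b"first block of log data"},
    "datanode-2": {"blk_2": b"second block of log data"},
    "datanode-3": {"blk_1": b"first block of log data",
                   "blk_2": b"second block of log data"},
}

def read_file(path):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = b""
    for block_id in namenode[path]:
        holder = block_map[block_id][0]  # pick any replica holder
        data += datanodes[holder][block_id]
    return data

print(read_file("/logs/clickstream.log"))
```

Because the NameNode answers only metadata questions, the bulk data transfer happens directly between clients and DataNodes, keeping the single NameNode from becoming a bandwidth bottleneck.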
HDFS uses multi-gigabyte file sizes to reduce the management complexity of lots of files in large data volumes. It typically writes each copy of the data once, adding to files sequentially. This approach simplifies the task of synchronizing data and reduces disk and bandwidth usage.

Equally important are fault tolerance within the same disk and bandwidth usage limits. To accomplish fault tolerance, HDFS creates three copies of each data block, typically storing two copies in the same rack. The system goes to another rack only if it needs the third copy. Figure 2 shows a simplified depiction of HDFS and its data block copying method.
[Figure 2 depicts a client working with a NameNode, which keeps only metadata mapping each file to its list of blocks, and with multiple DataNodes, across which the numbered 64MB blocks of each file are replicated.]
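The replica-placement policy described above, two copies in one rack and the third in another, can be sketched directly. Rack and node names are assumptions; real HDFS also randomizes its choices and tracks node health:

```python
def place_replicas(block_id, racks):
    """Place three copies: two in one rack, the third in a different rack."""
    rack_names = list(racks)
    first_rack, other_rack = rack_names[0], rack_names[1]
    return [
        (first_rack, racks[first_rack][0]),  # copy 1
        (first_rack, racks[first_rack][1]),  # copy 2: same rack, cheap intra-rack write
        (other_rack, racks[other_rack][0]),  # copy 3: different rack, survives rack loss
    ]

racks = {"rack-1": ["node-1", "node-2"], "rack-2": ["node-3", "node-4"]}
placement = place_replicas("blk_1", racks)
print(placement)
assert len({rack for rack, _ in placement}) == 2  # exactly two racks used
```

Keeping two copies in one rack limits cross-rack traffic during writes, while the third copy guards against losing an entire rack (a switch or power failure).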
HDFS does not perform tasks such as changing specific numbers in a list or other changes on parts of a database. This limitation leads some to assume that HDFS is not suitable for structured data. "HDFS was never designed for structured data and therefore it's not optimal to perform queries on structured data," says Daniel Abadi, assistant professor of computer science at Yale University. Abadi and others at Yale have done performance testing on the subject, and they have created a relational database alternative to HDFS called HadoopDB to address the performance issues they identified.8
Some developers are structuring data in ways that are suitable for HDFS; they're just doing it differently from the way relational data would be structured. Nathan Marz, a developer at BackType, a company that offers a search engine for social media buzz, uses schemas to ensure consistency and avoid data corruption. "A lot of people think that Hadoop is meant for unstructured data, like log files," Marz says. "While Hadoop is great for log files, it's also fantastic for strongly typed, structured data." For this purpose, Marz uses Thrift, which was developed by Facebook for data translation and serialization purposes.9 (See the discussion of Thrift later in this article.) Figure 3 illustrates a typical Hadoop data processing flow that includes Thrift and MapReduce.
Figure 2: The Hadoop Distributed File System, or HDFS
Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008
[Figure 3 depicts less-structured input data (log files, messages, images) flowing through input applications such as Cascading, Thrift, Zookeeper, and Pig into core Hadoop data processing, where jobs are mapped over 64MB blocks and reduced into results that feed output applications such as mashups, RDBMS apps, and BI systems.]

Figure 3: Hadoop ecosystem overview
Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010
MapReduce

MapReduce is the base programming framework for Hadoop. It often acts as a bridge between HDFS and tools that are more accessible to most programmers. According to those at Google who developed the tool, “it hides the details of parallelization” and the other nuts and bolts of HDFS.10

MapReduce is a layer of abstraction, a way of managing a sea of details by creating a layer that captures and summarizes their essence. That doesn’t mean it is easy to use. Many developers choose to work with another tool, yet another layer of abstraction on top of it. “I avoid using MapReduce directly at all cost,” Marz says. “I actually do almost all my MapReduce work with a library called Cascading.”

The terms “map” and “reduce” refer to steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. (See Figure 4.) MapReduce works with key-value pairs. Frequently with Web data, the keys consist of URLs and the values consist of Web page content, such as HTML.
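As a back-of-the-envelope illustration of this key-value model, the classic word count can be simulated on a single machine in Python. This is a sketch of the programming model only, not Hadoop’s actual API: the map step emits an intermediate (word, 1) pair for every occurrence, and the reduce step sums the values for each key.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit an intermediate (key, value) pair for every word occurrence."""
    for url, content in documents:   # input keys are URLs, values are page text
        for word in content.split():
            yield word, 1

def reduce_phase(pairs):
    """Aggregate intermediate values by key, as the reduce step does."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

docs = [("http://a.example", "big data big clusters"),
        ("http://b.example", "big clusters")]
counts = reduce_phase(map_phase(docs))   # {'big': 3, 'data': 1, 'clusters': 2}
```

In a real cluster, the framework partitions the intermediate pairs so that all values for a given key arrive at the same reduce task; here the single dictionary plays that role.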
MapReduce’s main value is as a platform with a set of APIs. Before MapReduce, fewer programmers could take advantage of distributed computing. Now that user-accessible tools have been designed, simpler programming is possible on massively parallel systems without the need to adapt the programs as much. The following sections examine some of these tools.
[Figure 4: MapReduce phases. Map tasks read input key-value pairs from the data stores and emit intermediate key-value pairs; a barrier aggregates the intermediate values by output key; reduce tasks then produce the final values for each key. Source: Google, 2004, and Cloudera, 2009]

28 PricewaterhouseCoopers Technology Forecast
“You can code in whatever JVM-based language you want, and then shove that into the cluster.” —Chris Wensel of Concurrent
Cascading

Wensel, who created Cascading, calls it an alternative API to MapReduce, a single library of operations developers can tap. It’s another layer of abstraction that helps bring what programmers ordinarily do in non-distributed environments to distributed computing. With it, he says, “you can code in whatever JVM-based [Java Virtual Machine] language you want, and then shove that into the cluster.”

Wensel wanted to obviate the need for “thinking in MapReduce.” When using Cascading, developers don’t think in key-value pair terms—they think in terms of fields and lists of values called “tuples.” A Cascading tuple is simpler than a database record but acts like one. Each tuple flows through “pipe” assemblies, which are comparable to Java classes. The data flow begins at the source, an input file, and ends with a sink, an output directory. (See Figure 5.)
Rather than approach map and reduce phases large-file by large-file, developers assemble flows of operations using functions, filters, aggregators, and buffers. Those flows make up the pipe assemblies, which, in Marz’s terms, “compile to MapReduce.” In this way, Cascading smooths the bumpy MapReduce terrain so more developers—including those who work mainly in scripting languages—can build flows. (See Figure 6.)
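The flow-of-operations idea can be approximated with plain Python generators. This is an analogy to, not the actual Cascading API; the function names here are invented for illustration. Tuples of named fields flow from a source, through a filter, into an aggregator that acts as the sink.

```python
def source(lines):
    """Source: turn raw input lines into tuples of named fields."""
    for line in lines:
        user, word = line.split(",")
        yield {"user": user, "word": word}

def filter_short(tuples, min_len=4):
    """Filter operation: drop tuples whose 'word' field is too short."""
    return (t for t in tuples if len(t["word"]) >= min_len)

def count_by(tuples, field):
    """Aggregator (and sink here): group tuples by a field and count them."""
    counts = {}
    for t in tuples:
        counts[t[field]] = counts.get(t[field], 0) + 1
    return counts

# Assemble the "pipe": source -> filter -> aggregator
raw = ["alice,hadoop", "bob,pig", "alice,cascading"]
result = count_by(filter_short(source(raw)), "user")   # {'alice': 2}
```

In Cascading proper, an assembly like this is compiled into one or more MapReduce jobs rather than executed in-process.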
[Figure 5: A Cascading assembly. Tuples with field names, [f1, f2, ...], flow from a source (So) through a series of pipes (P) to a sink (Si). Source: Concurrent, 2010]

[Figure 6: Cascading assembly and flow. On the client, pipe assemblies (A) are grouped into a flow; on the cluster, Hadoop translates the flow into a job of MapReduce (MR) steps. Source: Concurrent, 2010]
Some useful tools for MapReduce-style analytics programming

Open-source tools that work via MapReduce on Hadoop clusters are proliferating. Users and developers don’t seem concerned that Google received a patent for MapReduce in January 2010. In fact, Google, IBM, and others have encouraged the development and use of open-source versions of these tools at various research universities.11 A few of the more prominent tools relevant to analytics, and used by developers we’ve interviewed, are listed in the sections that follow.
Clojure

Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that’s rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced “closure.” Clojure combines a LISP library with Java libraries. Clojure’s mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delay-prediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for “getting the right view into unstructured data from heterogeneous sources,” says Bradford Cross, FlightCaster co-founder.

LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he’s done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code “uses a huge amount of memory-resident data,” such as lists of proper nouns, text categories, common last names, and nationalities.
“Getting the right view into unstructured data from heterogeneous sources.” —Bradford Cross of FlightCaster
With LISP, Watson says, he can load the data once and test multiple times. In C++, he would need to use a relational database and reload each time for a program test. Using LISP makes it possible to create and test small bits of code in an iterative fashion, a major reason for the productivity gains.

This iterative, LISP-like program-programmer interaction with Clojure leads to what Hickey calls “dynamic development.” Any code entered in the console interface, he points out, is automatically compiled on the fly.
Thrift

Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can “define all the necessary data structures and interfaces for a complex service in a single short file.”

A more important aspect of Thrift, according to BackType’s Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called noSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis.

Marz uses Thrift’s serialization features, which turn objects into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes.

BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema.
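A schema of this kind might look like the following Thrift IDL sketch. The struct and field names are hypothetical, not BackType’s actual schema; they simply show typed fields, required versus optional markers, and a struct that models a relationship.

```thrift
// Hypothetical Thrift IDL sketch; names are illustrative only.
struct TwitterAccount {
  1: required string screen_name;  // required fields enforce type and presence
  2: required i64 account_id;      // a 64-bit integer, distinct from a string
  3: optional string location;     // optional fields let the schema evolve;
}                                  // old objects without them still deserialize

struct SharedInterestEdge {        // a relationship between two accounts
  1: required TwitterAccount a;
  2: required TwitterAccount b;
  3: required string interest;
}
```

Because fields are numbered, data serialized under an older version of such a schema can still be read after new optional fields are added, which is the evolution property Marz describes.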
Marz’s use of Thrift to model social graphs like the one in Figure 7 demonstrates the flexibility of the schema for Hadoop-style computing. Thrift essentially enables modularity in the social graph described in the schema. For example, to select a single age for each person, BackType can take into account all the raw age data. It can do this by a computation on the entire data set or a selective computation on only the people in the data set who have new data.
[Figure 7: An example of a social graph modeled using a Thrift schema. Nodes such as Alice (gender female, age 25), Bob (gender male, age 39), and Charlie (gender male, age 22) are linked by relationships defined in Apache Thrift. Source: Nathan Marz, 2010]
BackType doesn’t just work with raw data. It runs a series of jobs that constantly normalize and analyze new data coming in, and then other jobs that write the analyzed data to a scalable random-access database such as HBase or Cassandra.12
Open-source, non-relational data stores

Non-relational data stores have become much more numerous since the Apache Hadoop project began in 2007. Many are open source. Developers of these data stores have optimized each for a different kind of data. When contrasted with relational databases, these data stores lack many design features that can be essential for enterprise transactional data. However, they are often well tailored to specific, intended purposes, and they offer the added benefit of simplicity. Primary non-relational data store types include the following:

• Multidimensional map store—Each record maps a row name, a column name, and a time stamp to a value. Map stores have their heritage in Google’s Bigtable.

• Key-value store—Each record consists of a key, or unique identifier, mapped to one or more values.

• Graph store—Each record consists of elements that together form a graph. Graphs depict relationships. For example, social graphs describe relationships between people. Other graphs describe relationships between objects, between links, or both.

• Document store—Each record consists of a document. Extensible Markup Language (XML) databases, for example, store XML documents.
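A toy version of the first type, a Bigtable-style multidimensional map keyed by row name, column name, and time stamp, can be sketched in a few lines of Python. This is illustrative only, not any product’s API; class and method names are invented.

```python
import time

class MapStore:
    """Toy multidimensional map store: (row, column, time stamp) -> value,
    in the spirit of Bigtable-style stores. Illustrative only."""
    def __init__(self):
        self._cells = {}   # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        """Return the most recent value for a row/column pair."""
        versions = self._cells.get((row, column), [])
        return max(versions)[1] if versions else None

store = MapStore()
store.put("com.example/page1", "contents:html", "<html>v1</html>", ts=1)
store.put("com.example/page1", "contents:html", "<html>v2</html>", ts=2)
latest = store.get("com.example/page1", "contents:html")   # the v2 page
```

Keeping every timestamped version rather than overwriting in place is what distinguishes this model from a plain key-value store.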
Because of their simplicity, map and key-value stores can have scalability advantages over most types of relational databases. (HadoopDB, a hybrid approach developed at Yale University, is designed to overcome the scalability problems associated with relational databases.) Table 1 provides a few examples of the open-source, non-relational data stores that are available.

Table 1: Example open-source, non-relational data stores
Map:       HBase, Hypertable, Cassandra
Key-value: Tokyo Cabinet/Tyrant, Project Voldemort, Redis
Document:  MongoDB, CouchDB, Xindice
Graph:     Resource Description Framework (RDF), Neo4j, InfoGrid
Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010
“We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.” —Scott Thompson of Disney
Other related technologies and vendors

A comprehensive review of the various tools created for the Hadoop ecosystem is beyond the scope of this article, but a few of the tools merit brief description here because they’ve been mentioned elsewhere in this issue:

• Pig—A scripting language called Pig Latin, which is a primary feature of Apache Pig, allows more concise querying of data sets “directly from the console” than is possible using MapReduce, according to author Tom White.

• Hive—Hive is designed as “mainly an ETL [extract, transform, and load] system” for use at Facebook, according to Chris Wensel.

• Zookeeper—Zookeeper provides an interface for creating distributed applications, according to Apache.

Big Data covers many vendor niches, and some vendors’ products take advantage of the Hadoop stack or add to its capabilities. (See the sidebar “Selected Big Data tool vendors.”)
Conclusion

Interest in and adoption of Hadoop clusters are growing rapidly. Reasons for Hadoop’s popularity include:

• Open, dynamic development—The Hadoop/MapReduce environment offers cost-effective distributed computing to a community of open-source programmers who’ve grown up on Linux and Java, and scripting languages such as Perl and Python. Some are taking advantage of functional programming language dialects such as Clojure. The openness and interaction can lead to faster development cycles.

• Cost-effective scalability—Horizontal scaling from a low-cost base implies a feasible long-term cost structure for more kinds of data. Scott Thompson, vice president for infrastructure at the Disney Technology Shared Services Group, says, “We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.”

• Fault tolerance—Associated with scalability is the assumption that some nodes will fail. Hadoop and MapReduce are fault tolerant, another reason commodity hardware can be used.

• Suitability for less-structured data—Perhaps most importantly, the methods that Google pioneered, and that Yahoo and others expanded, focus on what Cloudera’s Awadallah calls “complex” data. Although developers such as Marz understand the value of structuring data, most Hadoop/MapReduce developers don’t have a DBMS mentality. They have an NLP mentality, and they’re focused on techniques optimized for large amounts of less-structured information, such as the vast amount of information on the Web.

The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze the Big Data they didn’t have the means to analyze before. This set of methods is separate from, yet complements, data warehousing. Understanding what Hadoop clusters do and how they do it is fundamental to deciding when and where enterprises should consider making use of them.
Selected Big Data tool vendors

Amazon
Amazon provides a Hadoop framework, which it calls Elastic MapReduce, on its Elastic Compute Cloud (EC2) and S3 storage service.

Appistry
Appistry’s CloudIQ Storage platform offers a substitute for HDFS, one designed to eliminate the single point of failure of the NameNode.

Cloudera
Cloudera takes a Red Hat approach to Hadoop, offering its own distribution on EC2/S3 with management tools, training, support, and professional services.

Cloudscale
Cloudscale’s first product, Cloudcel, marries an Excel-style front end to a back end that’s a modified HDFS. The product is designed to process stored, historical, or streamed data.

Concurrent
Concurrent developed Cascading, for which it offers licensing, training, and support.

Drawn to Scale
Drawn to Scale offers an HBase/HDFS storage platform and Hadoop ecosystem consulting and training.

IBM
IBM’s jStart team offers briefings and workshops on Hadoop pilots. IBM BigSheets acts as an aggregation, analysis, and visualization point for large amounts of Web data.

Microsoft
Microsoft Pivot uses the company’s Deep Zoom technology to provide visual data browsing capabilities for XML files. Azure Table services is in some ways comparable to Bigtable or HBase. (See the interview with Mark Taylor and Ray Velez of Razorfish on page 46.)

ParaScale
ParaScale offers software for enterprises to set up their own public or private cloud storage environments with parallel processing and large-scale data handling capability.
1 FLOPS stands for “floating point operations per second.” Floating point processors use more bits to store each value, allowing more precision and ease of programming than fixed point processors. One petaflop is upwards of one quadrillion floating point operations per second.

2 Brough Turner, “Google Surpasses Supercomputer Community, Unnoticed?” May 20, 2008, http://blogs.broughturner.com/communications/2008/05/google-surpasses-supercomputercommunity-unnoticed.html (accessed April 8, 2010).

3 See, for example, Tim Kientzle, “Beowulf: Linux clustering,” Dr. Dobb’s Journal, November 1, 1998, Factiva Document dobb000020010916dub100045 (accessed April 9, 2010).

4 Luis Barroso, Jeffrey Dean, and Urs Hoelzle, “Web Search for a Planet: The Google Cluster Architecture,” Google Research Publications, http://research.google.com/archive/googlecluster.html (accessed April 10, 2010).

5 See http://sortbenchmark.org/ and http://developer.yahoo.net/blog/ (accessed April 9, 2010).

6 Tom White, Hadoop: The Definitive Guide (Sebastopol, CA: O’Reilly Media, 2009), 4.

7 See Derek Gottfrid, “Self-service, Prorated Super Computing Fun!” The New York Times Open Blog, November 1, 2007, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/ (accessed March 23, 2010) and Bill Snyder, “Cloud Computing: Not Just Pie in the Sky,” CIO, March 5, 2008, Factiva Document CIO0000020080402e4350000 (accessed March 28, 2010).

8 See “HadoopDB” at http://db.cs.yale.edu/hadoopdb/hadoopdb.html (accessed April 11, 2010).

9 Nathan Marz, “Thrift + Graphs = Strong, flexible schemas on Hadoop,” http://nathanmarz.com/blog/schemas-on-hadoop/ (accessed April 11, 2010).

10 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Google Research Publications, December 2004, http://labs.google.com/papers/mapreduce.html (accessed April 22, 2010).

11 See Dean, et al., US Patent No. 7,650,331, January 19, 2010, at http://www.uspto.gov. For an example of the participation by Google and IBM in Hadoop’s development, see “Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges,” Google press release, October 8, 2007, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html (accessed March 28, 2010).

12 See the Apache site at http://apache.org/ for descriptions of many tools that take advantage of MapReduce and/or HDFS that are not profiled in this article.
Hadoop’s foray into the enterprise

Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach.
Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya

Amr Awadallah is vice president, engineering and chief technology officer at Cloudera, a company that offers products and services around Hadoop, an open-source technology that allows efficient mining of large, complex data sets. In this interview, Awadallah provides an overview of Hadoop’s capabilities and how Cloudera customers are using them.
PwC: Were you at Yahoo before coming to Cloudera?

AA: Yes. I was with Yahoo from mid-2000 until mid-2008, starting with the Yahoo Shopping team after selling my company VivaSmart to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search. I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together.

That is when I first experienced Hadoop. Its model of “mine first, govern later” fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it’s best to complement a well-governed data mart with a more agile complex data processing system like Hadoop.
PwC: How did Yahoo start using Hadoop?

AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today.

Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting.
PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]?

AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I’m going to be loading in, these are the types of columns I’m going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track.

“We are not talking about a replacement technology for data warehouses—let’s be clear on this. No customers are using Hadoop in that fashion.”

With Hadoop, it’s the other way around. You load all of your data, such as XML [Extensible Markup Language], tab delimited flat files, Apache log files, JSON [Javascript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on the fly when reading it out. This approach lets you extract the columns that map to the data structure you’re interested in.

Creating the structure on the read path like this can have its disadvantages; however, it gives you the agility and the flexibility to evolve your schema much quicker without normalizing your data first. In general, relational systems are not well suited for quickly evolving complex data types.
Another benefit is retroactive schemas. For example, an engineer launching a new product feature can add the logging for it, and that new data will start flowing directly into Hadoop. Weeks or months later, a data analyst can update their read schema on how to parse this new data. Then they will immediately be able to query the history of this metric since it started flowing in [as opposed to waiting for the RDBMS schema to be updated and the ETL processes to reload the full history of that metric].
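The schema-on-read pattern Awadallah describes can be sketched in Python. This is a single-machine illustration with invented field names, not Hive or Pig syntax: the raw log lines are stored as-is, and the "schema" is just a parsing function applied at read time, which can be updated later to expose fields that were being logged all along.

```python
import json

# Raw tab-delimited log lines, loaded with no schema applied up front.
RAW_LOG = [
    '2010-04-01\talice\tlogin\t{"latency_ms": 12}',
    '2010-04-02\tbob\tsearch\t{"latency_ms": 48}',
]

def read_schema_v1(line):
    """First read schema: only date and user are parsed out."""
    date, user, _rest = line.split("\t", 2)
    return {"date": date, "user": user}

def read_schema_v2(line):
    """Updated later: same raw data, now exposing event and latency too."""
    date, user, event, payload = line.split("\t", 3)
    return {"date": date, "user": user, "event": event,
            "latency_ms": json.loads(payload)["latency_ms"]}

v1 = [read_schema_v1(l) for l in RAW_LOG]
v2 = [read_schema_v2(l) for l in RAW_LOG]  # full history queryable at once
```

Because the raw lines were never discarded or normalized, switching to the second read schema makes the entire history of the new fields queryable immediately, with no reload step.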
PwC: What about the cost advantages?

AA: The cost basis is 10 to 100 times cheaper than other solutions. But it’s not just about cost. Relational databases are really good at what they were designed for, which is running interactive SQL queries against well-structured data. We are not talking about a replacement technology for data warehouses—let’s be clear on this.

No customers are using Hadoop in that fashion. They recognize that the nature of data is changing. Where is the data growing? It’s growing around complex data types. Is a relational container the best and most interesting place to ask questions of complex plus relational data? Probably not, although organizations still need to use, collect, and present relational data for questions that are routine and require, in some cases, a real-time response.
PwC: How have companies benefited from querying across both structured and complex data?

AA: When you query against complex data types, such as Web log files and customer support forums, as well as against the structured data you have already been collecting, such as customer records, sales history, and transactions, you get a much more accurate answer to the question you’re asking. For example, a large credit card company we’ve worked with can identify which transactions are most likely fraudulent and can prioritize which accounts need to be addressed.

PwC: Are the companies you work with aware that this is a totally different paradigm?

AA: Yes and no. The main use case we see is in companies that have a mix of complex data and structured data that they want to query across. Some large financial institutions that we talk to have 10, 20, or even hundreds of Oracle systems—it’s amazing. They have all of these file servers storing XML files or log files, and they want to consolidate all these tables and files onto one platform that can handle both data types so they can run comprehensive queries. This is where Hadoop really shines; it allows companies to run jobs across both data types. •
Revising the CIO’s data playbook

Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.
By Jimmy Guterman
Like pioneers exploring a new territory, a few<br />
enterprises are making discoveries by exploring <strong>Big</strong><br />
<strong>Data</strong>. The terrain is complex and far less structured<br />
than the data CIOs are accustomed to. And it is<br />
growing by exabytes each year. But it is also getting<br />
easier and less expensive to explore and analyze, in<br />
part because s<strong>of</strong>tware tools built to take advantage<br />
<strong>of</strong> cloud computing infrastructures are now available.<br />
Our advice to CIOs: You don’t need to rush, but do<br />
begin to acquire the necessary mind-set, skill set,<br />
and tool kit.<br />
These are still the early days. The prime directive<br />
for any CIO is to deliver value to the business through<br />
technology. One way to do that is to integrate new<br />
technologies in moderation, with a focus on the<br />
long-term opportunities they may yield. Leading<br />
CIOs pride themselves on waiting until a technology<br />
has proven value before they adopt it. Fair enough.<br />
However, CIOs who ignore the <strong>Big</strong> <strong>Data</strong> trends<br />
described in the first two articles risk being<br />
marginalized in the C-suite. As they did with earlier<br />
technologies, including traditional business<br />
intelligence, business unit executives are ready to<br />
seize the <strong>Big</strong> <strong>Data</strong> opportunity and make it their own.<br />
This will be good for their units and their careers,<br />
but it would be better for the organization as a whole<br />
if someone—the CIO is the natural person—drove a<br />
single, central, cross-enterprise <strong>Big</strong> <strong>Data</strong> initiative.<br />
With this in mind, PricewaterhouseCoopers<br />
encourages CIOs to take these steps:<br />
• Start to add the discipline and skill set for <strong>Big</strong> <strong>Data</strong><br />
to your organizations; the people for this may or<br />
may not come from existing staff.<br />
• Set up sandboxes (which you can rent or buy) to<br />
experiment with <strong>Big</strong> <strong>Data</strong> technologies.<br />
• Understand the open-source nature <strong>of</strong> the tools<br />
and how to manage risk.<br />
Enterprises have the opportunity to analyze more<br />
kinds <strong>of</strong> data more cheaply than ever before. It is<br />
also important to remember that <strong>Big</strong> <strong>Data</strong> tools did<br />
not originate with vendors that were simply trying to<br />
create new markets. The tools sprung from a real<br />
need among the enterprises that first confronted<br />
the scalability and cost challenges <strong>of</strong> <strong>Big</strong> <strong>Data</strong>—<br />
challenges that are now felt more broadly. These<br />
pioneers also discovered the need for a wider variety<br />
<strong>of</strong> talent than IT has typically recruited.<br />
Revising the CIO’s data playbook 37
<strong>Big</strong> <strong>Data</strong> lessons from Web companies<br />
Today’s CIO literature is full <strong>of</strong> lessons you can learn<br />
from companies such as Google. Some <strong>of</strong> the<br />
comparisons are superficial because most companies<br />
do not have a Web company’s data complexities and<br />
will never attain the original singleness <strong>of</strong> purpose<br />
that drove Google, for example, to develop <strong>Big</strong> <strong>Data</strong><br />
innovations. But there is no niche where the<br />
development <strong>of</strong> <strong>Big</strong> <strong>Data</strong> tools, techniques, mind-set,<br />
and usage is greater than in companies such as<br />
Google, Yahoo, Facebook, Twitter, and LinkedIn.<br />
And there is plenty that CIOs can learn from these<br />
companies. Every major service these companies<br />
create is built on the idea <strong>of</strong> extracting more and more<br />
value from more and more data.<br />
For example, the 1-800-GOOG-411 service, which<br />
individuals can call to get telephone numbers and<br />
addresses <strong>of</strong> local businesses, does not merely take<br />
an ax to the high-margin directory assistance services<br />
run by incumbent carriers (although it does that).<br />
That is just a by-product. More important, the<br />
800-number service has let Google compile what has<br />
been described as the world’s largest database <strong>of</strong><br />
spoken language. Google is using that database to<br />
improve the quality <strong>of</strong> voice recognition in Google<br />
Voice, in its sundry mobile-phone applications, and in<br />
other services under development. Some <strong>of</strong> the ways<br />
companies such as Google capture data and convert<br />
it into services are listed in Table 1.<br />
Service | Data that Web companies capture<br />
Self-serve advertising | Ad-clicking and -picking behavior<br />
Analytics | Aggregated Web site usage tracking<br />
Social networking | Sundry online<br />
Browser | Limited browser behaviors<br />
E-mail | Words used in e-mails<br />
Search engine | Searches and clicking information<br />
RSS feeds | Detailed reading habits<br />
Extra browser functionality | All browser behavior<br />
View videos | All site behavior<br />
Free directory assistance | Database of spoken words<br />
Table 1: Web portal <strong>Big</strong> <strong>Data</strong> strategy<br />
Source: PricewaterhouseCoopers, 2010<br />
38 PricewaterhouseCoopers Technology Forecast
Many Web companies are finding opportunity in “gray<br />
data.” Gray data is the raw and unvalidated data that<br />
arrives from various sources, in huge quantities, and not<br />
in the most usable form. Yet gray data can deliver value<br />
to the business even if the generators <strong>of</strong> that content<br />
(for example, people calling directory assistance) are<br />
contributing that data for a reason far different from<br />
improving voice-recognition algorithms. They just want<br />
the right phone number; the data they leave is a gift to<br />
the company providing the service.<br />
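The refinement of gray data can be sketched in a few lines of Python. The log format, field names, and cleanup rules below are invented for illustration; the point is only that raw, unvalidated input can be distilled into usable records:<br />

```python
import re

# Hypothetical raw "gray data" lines from a directory-assistance-style log;
# the format and fields are illustrative, not any real provider's data.
raw_calls = [
    "2010-03-14 09:02:11 | query='pizza palo alto' | result=650-555-0101",
    "2010-03-14 09:02:45 | query='pizza  palo alto ' | result=650-555-0101",
    "bad line with no structure",
]

LINE_RE = re.compile(
    r"(?P<ts>[\d-]+ [\d:]+) \| query='(?P<query>[^']*)' \| result=(?P<result>\S+)"
)

def refine(lines):
    """Turn raw log lines into normalized records, setting aside what can't be parsed."""
    records = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # gray data: unparseable lines are set aside, not trusted
        rec = m.groupdict()
        rec["query"] = " ".join(rec["query"].split())  # collapse stray whitespace
        records.append(rec)
    return records

print(refine(raw_calls))
```

The unparseable line is dropped rather than silently kept, which is the essence of treating gray data as unvalidated until proven otherwise.<br />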
The new technologies and services described in the<br />
article, “Building a bridge to the rest <strong>of</strong> your data,” on<br />
page 22 are making it possible to search for enterprise<br />
value in gray data in agile ways at low cost. Much <strong>of</strong><br />
this value is likely to be in the area <strong>of</strong> knowing your<br />
customers, a sure path for CIOs looking for ways to<br />
contribute to company growth and deepen their<br />
relationships with the rest <strong>of</strong> the C-suite.<br />
What Web enterprise use <strong>of</strong> <strong>Big</strong> <strong>Data</strong> shows<br />
CIOs, most <strong>of</strong> all, is that there is a way to think and<br />
manage differently when you conclude that standard<br />
transactional data analysis systems are not and<br />
should not be the only models. New models are<br />
emerging. CIOs who recognize these new models<br />
without throwing away the legacy systems that still<br />
serve them well will see that having more than one<br />
tool set, one skill set, and one set <strong>of</strong> controls makes<br />
their organizations more sophisticated, more agile,<br />
less expensive to maintain, and more valuable to<br />
the business.<br />
The business case<br />
Besides Google, Yahoo, and other Web-based<br />
enterprises that have complex data sets, there are<br />
stories of brick-and-mortar organizations that will be<br />
making more use <strong>of</strong> <strong>Big</strong> <strong>Data</strong>. For example, Rollin Ford,<br />
Wal-Mart’s CIO, told The Economist earlier this year,<br />
“Every day I wake up and ask, ‘How can I flow data<br />
better, manage data better, analyze data better?’”<br />
The answer to that question today implies a budget<br />
reallocation, with less-expensive hardware and s<strong>of</strong>tware<br />
carrying more <strong>of</strong> the load. “I see inspiration from the<br />
Google model and the notion <strong>of</strong> moving into<br />
commodity-based computing—just having lots <strong>of</strong><br />
cheap stuff that you can use to crunch vast quantities<br />
<strong>of</strong> data. I think that really contrasts quite heavily with<br />
the historic model <strong>of</strong> paying lots <strong>of</strong> money for really<br />
specialist stuff,” says Phil Buckle, CIO <strong>of</strong> the UK’s<br />
National Policing Improvement Agency, which oversees<br />
law enforcement infrastructure nationwide. That’s a new<br />
mind-set for the CIO, who ordinarily focuses on keeping<br />
the plumbing and the data it carries safe, secure,<br />
in-house, and functional.<br />
Seizing the <strong>Big</strong> <strong>Data</strong> initiative would give CIOs in<br />
particular and IT in general more clout in the executive<br />
suite. But are CIOs up to the task? “It would be a<br />
positive if IT could harness unstructured data<br />
effectively,” former Gartner analyst Howard Dresner,<br />
CEO <strong>of</strong> Dresner Advisory Services, observes. “However,<br />
they haven’t always done a great job with structured<br />
data, and unstructured is far more complex and exists<br />
predominately outside the firewall and beyond<br />
their control.”<br />
Tools are not the issue. Many evolving tools, as noted<br />
in the previous article, come from the open-source<br />
community; they can be downloaded and experimented<br />
with for low cost and are certainly up to supporting<br />
any pilot project. More important is the aforementioned<br />
mind-set and a new kind <strong>of</strong> talent IT will need.<br />
To whom does the future <strong>of</strong> IT belong?<br />
The ascendance <strong>of</strong> <strong>Big</strong> <strong>Data</strong> means that CIOs need a<br />
more data-centric approach. But what kind <strong>of</strong> talent<br />
can help a CIO succeed in a more data-centric business<br />
environment, and what specific skills do the CIO’s<br />
teams focused on the area need to develop<br />
and balance?<br />
Hal Varian, a University <strong>of</strong> California, Berkeley, pr<strong>of</strong>essor<br />
and Google’s chief economist, says, “The sexy job in<br />
the next 10 years will be statisticians.” He and others,<br />
such as IT and management pr<strong>of</strong>essor Erik Brynjolfsson<br />
at the Massachusetts Institute <strong>of</strong> Technology (MIT),<br />
contend this demand will happen because the amount<br />
<strong>of</strong> data to be analyzed is out <strong>of</strong> control. Those who<br />
can make sense <strong>of</strong> the flood will reap the greatest<br />
rewards. They have a point, but the need is not just<br />
for statisticians—it’s for a wide range <strong>of</strong> analytically<br />
minded people.<br />
Today, larger companies still need staff with expertise<br />
in package implementations and customizations,<br />
systems integration, and business process<br />
reengineering, as well as traditional data management<br />
and business intelligence that’s focused on<br />
transactional data. But there is a growing role for<br />
people with flexible minds to analyze data and suggest<br />
solutions to problems or identify opportunities from<br />
that data.<br />
In Silicon Valley and elsewhere, where businesses such<br />
as Google, Facebook, and Twitter are built on the<br />
rigorous and speedy analysis <strong>of</strong> data, programming<br />
frameworks such as MapReduce (which works with<br />
Hadoop) and NoSQL (a database approach for nonrelational<br />
data stores) are becoming more popular.<br />
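The MapReduce model behind these frameworks can be illustrated with a toy word count in plain Python. This is a single-process sketch of the map, shuffle, and reduce phases that Hadoop distributes across a cluster; it is not Hadoop's actual API:<br />

```python
from collections import defaultdict
from itertools import chain

# Map step: emit (key, value) pairs for each input record.
def map_phase(document):
    return [(word, 1) for word in document.lower().split()]

# Shuffle step: group all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce step: aggregate each key's group of values.
def reduce_phase(key, values):
    return key, sum(values)

docs = ["big data big tools", "big clusters"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'tools': 1, 'clusters': 1}
```

Because each map call touches only one document and each reduce call only one key, the phases parallelize naturally, which is what lets Hadoop spread the same logic over a cluster of cheap machines.<br />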
Chris Wensel, who created Cascading (an alternative<br />
application programming interface [API] to MapReduce)<br />
and straddles the worlds <strong>of</strong> startups and entrenched<br />
companies, says, “When I talk to CIOs, I tell them:<br />
‘You know those people you have who know about<br />
data. You probably don’t use those people as much<br />
as you should. But once you take advantage <strong>of</strong> that<br />
expertise and reallocate that talent, you can take<br />
advantage <strong>of</strong> these new techniques.’”<br />
The increased emphasis on data analysis does not<br />
mean that traditional programmers will be replaced by<br />
quantitative analysts or data warehouse specialists.<br />
“The talent demand isn’t so much for Java developers<br />
or statisticians per se as it is for people who know how<br />
to work with denormalized data,” says Ray Velez, CTO<br />
at Razorfish, an interactive marketing and technology<br />
consulting firm involved in many <strong>Big</strong> <strong>Data</strong> initiatives.<br />
“It’s about understanding how to map data into a format<br />
that most people are not familiar with. Most people<br />
understand SQL and the relational format, so the real<br />
skill set evolution doesn’t have quite as much to do with<br />
whether it’s Java or Python or other technologies.”<br />
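The shift Velez describes might be sketched as follows; the tables and field names here are invented. A normalized design splits facts across tables joined by keys, while a denormalized record is self-contained, so each one can be processed independently on any node:<br />

```python
# Normalized, relational-style data: facts split across tables, joined by keys.
customers = {101: {"name": "Ada"}}
orders = [
    {"order_id": 1, "customer_id": 101, "item": "book"},
    {"order_id": 2, "customer_id": 101, "item": "lamp"},
]

# Denormalized form: one self-contained record per customer, with the join
# already "baked in" -- the shape MapReduce-style tools favor, since each
# record can be handled on its own.
def denormalize(customers, orders):
    records = []
    for cid, cust in customers.items():
        records.append({
            "customer_id": cid,
            "name": cust["name"],
            "items": [o["item"] for o in orders if o["customer_id"] == cid],
        })
    return records

print(denormalize(customers, orders))
# [{'customer_id': 101, 'name': 'Ada', 'items': ['book', 'lamp']}]
```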
Velez points to Bill James as a useful case. James, a<br />
baseball writer and statistician, challenged conventional<br />
wisdom by taking an exploratory mind-set to baseball<br />
statistics. He literally changed how baseball<br />
management makes talent decisions, and even how<br />
they manage on the field. In fact, James became senior<br />
adviser for baseball operations in the Boston Red Sox’s<br />
front <strong>of</strong>fice.<br />
For example, James showed that batting average is<br />
less an indicator <strong>of</strong> a player’s future success than<br />
how <strong>of</strong>ten he’s involved in scoring runs—getting on<br />
base, advancing runners, or driving them in. In this<br />
example and many others, James used his knowledge<br />
<strong>of</strong> the topic, explored the data, asked questions no<br />
one had asked, and then formulated, tested, and<br />
refined hypotheses.<br />
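James's point can be made concrete with two hypothetical stat lines (the numbers are invented). Player A has the higher batting average, but once walks are counted through a measure such as on-base percentage, player B reaches base more often:<br />

```python
def batting_average(hits, at_bats):
    return hits / at_bats

def on_base_percentage(hits, walks, at_bats):
    # Simplified OBP: (hits + walks) / (at-bats + walks); the official formula
    # also counts hit-by-pitch and sacrifice flies.
    return (hits + walks) / (at_bats + walks)

a = {"hits": 150, "walks": 20, "at_bats": 500}  # .300 average, few walks
b = {"hits": 140, "walks": 80, "at_bats": 500}  # .280 average, many walks

for p in (a, b):
    print(round(batting_average(p["hits"], p["at_bats"]), 3),
          round(on_base_percentage(p["hits"], p["walks"], p["at_bats"]), 3))
```

The ranking flips depending on the metric chosen, which is exactly the kind of question an exploratory analyst asks of long-accepted numbers.<br />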
Says Velez: “Our analytics team within Razorfish has<br />
the James types <strong>of</strong> folks who can help drive different<br />
thinking and envision possibilities with the data. We<br />
need to find a lot more <strong>of</strong> those people. They’re not<br />
very easy to find. There is an aspect <strong>of</strong> James that<br />
just has to do with boldness and courage, a willingness<br />
to challenge those who are in the habit <strong>of</strong> using<br />
metrics they’ve been using for years.”<br />
The CIO will need people throughout the organization<br />
who have all sorts <strong>of</strong> relevant analysis and coding skills,<br />
who understand the value <strong>of</strong> data, and who are not<br />
afraid to explore. This does not mean the end <strong>of</strong> the<br />
technology- or application-centric organizational chart<br />
<strong>of</strong> the typical IT organization. Rather, it means the<br />
addition <strong>of</strong> a data-exploration dimension that is more<br />
than one or two people. These people will be using a<br />
blend <strong>of</strong> tools that differ depending on requirements,<br />
as Table 2 illustrates. More <strong>of</strong> the tools will be open<br />
source than in the past.<br />
Skills | Tools (a sampler) | Comments<br />
Natural language processing and text mining | Clojure, Redis, Scala, Crane, other Java functional language libraries, Python Natural Language ToolKit | To some extent, each of these serves as a layer of abstraction on top of Hadoop. Those familiar keep adding layers on top of layers. FlightCaster, for example, uses a stack consisting of Amazon S3 -> Amazon EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure.[1]<br />
Data mining | R, MATLAB | R is more suited to finance and statistics, whereas MATLAB is more engineering oriented.[2]<br />
Scripting and NoSQL database programming skills | Python and related frameworks, HBase, Cassandra, CouchDB, Tokyo Cabinet | These lend themselves to or are based on the functional languages mentioned above. CouchDB, for example, is written in Erlang,[3] another functional programming language comparable to LISP (see discussion of Clojure and LISP on page 30).<br />
Table 2: New skills and tools for the IT department<br />
Source: Cited online postings and PricewaterhouseCoopers, 2008–2010<br />
[1] Pete Skomoroch, “How FlightCaster Squeezes Predictions from Flight Data,” Data Wrangling blog, August 24, 2009, http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data (accessed May 14, 2010).<br />
[2] Brendan O’Connor, “Comparison of data analysis packages,” AI and Social Science blog, February 23, 2009, http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (accessed May 25, 2010).<br />
[3] Scripting languages such as Python run more slowly than Java, but developers sometimes make the tradeoff to increase their own productivity. Some companies have created their own frameworks and released these to open source. See Klaas Bosteels, “Python + Hadoop = Flying Circus Elephant,” Last.HQ Last.fm blog, May 29, 2008, http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant (accessed May 14, 2010).<br />
Where do CIOs find such talent? Start with your own<br />
enterprise. For example, business analysts managing<br />
the marketing department’s lead-generation systems<br />
could be promoted onto an IT data staff charged with<br />
exploring the data flow. Most large consumer-oriented<br />
companies already have people in their business units<br />
who can analyze data and suggest solutions to<br />
problems or identify opportunities. These people need<br />
to be groomed and promoted, and more <strong>of</strong> them hired<br />
for IT, to enable the entire organization, not just the<br />
marketing department, to reap the riches.<br />
Set up a sandbox<br />
Although the business case CIOs can make for <strong>Big</strong> <strong>Data</strong><br />
is inarguable, even inarguable business cases carry<br />
some risk. Many CIOs will look at the risks associated<br />
with <strong>Big</strong> <strong>Data</strong> and find a familiar canard. Many <strong>Big</strong> <strong>Data</strong><br />
technologies—Hadoop in particular—are open source,<br />
and open source is <strong>of</strong>ten criticized for carrying too<br />
much risk.<br />
The open-source versus proprietary technology<br />
argument is nothing new. CIOs who have tried to<br />
implement open-source programs, from the Apache<br />
Web server to the Drupal content-management system,<br />
have faced the usual arguments against code being<br />
available to all comers. Some <strong>of</strong> those arguments,<br />
especially concerns revolving around security and<br />
reliability, verge on the specious. Google built its internal<br />
Web servers atop Apache. And it would be difficult to<br />
find a <strong>Big</strong> <strong>Data</strong> site as reliable as Google’s.<br />
Clearly, one challenge CIOs face has nothing to do<br />
with data or skill sets. Open-source projects become<br />
available earlier in their evolution than do proprietary<br />
alternatives. In this respect, <strong>Big</strong> <strong>Data</strong> tools are less<br />
stable and complete than are Apache or Linux<br />
open-source tools kits.<br />
Introducing an open-source technology such as<br />
Hadoop into a mostly proprietary environment does<br />
not necessarily mean turning the organization upside<br />
down. A CIO at a small Massachusetts company says,<br />
“Every technology department has a skunkworks, no<br />
matter how informal—a sandbox where they can test<br />
and prove technologies. That’s how open source<br />
entered our organization. A small Hadoop installation<br />
might be a gateway that leads you to more open<br />
source. But it might turn out to be a neat little open-source<br />
project that sits by itself and doesn’t bother<br />
anything else. Either can be OK, depending on the<br />
needs <strong>of</strong> your company.”<br />
Bud Albers, executive vice president and CTO <strong>of</strong><br />
Disney Technology Shared Services Group, concurs.<br />
“It depends on your organizational mind-set,” he says.<br />
“It depends on your organizational capability. There<br />
is a certain ‘don’t try this at home’ kind <strong>of</strong> warning<br />
that goes with technologies like Hadoop. You have to<br />
be willing at this stage <strong>of</strong> its maturity to maybe have<br />
a little higher level <strong>of</strong> capability to go in.”<br />
PricewaterhouseCoopers agrees with those sentiments<br />
and strongly urges large enterprises to establish a<br />
sandbox dedicated to <strong>Big</strong> <strong>Data</strong> and Hadoop/<br />
MapReduce. This move should be standard operating<br />
procedure for large companies in 2010, as should a<br />
small, dedicated staff <strong>of</strong> data explorers and modest<br />
budget for the efforts. For more information on what<br />
should be in your sandbox, refer to the article, “Building<br />
a bridge to the rest <strong>of</strong> your data,” on page 22.<br />
And for some ideas on how the sandbox could fit in<br />
your org chart, see Figure 1.<br />
[Figure 1 depicts an organization chart: a VP of IT oversees a director of application development and a director of data analysis; a data analysis team and a data exploration team report to the director of data analysis, alongside the marketing, Web site, sales, operations, and finance managers.]<br />
Figure 1: Where a data exploration team might fit in an<br />
organization chart<br />
Source: PricewaterhouseCoopers, 2010<br />
Different companies will want to experiment with<br />
Hadoop in different ways, or segregate it from the rest<br />
<strong>of</strong> the IT infrastructure with stronger or weaker walls.<br />
The CIO must determine how to encourage this kind<br />
<strong>of</strong> experimentation.<br />
Understand and manage the risks<br />
Some <strong>of</strong> the risks associated with <strong>Big</strong> <strong>Data</strong> are<br />
legitimate, and CIOs must address them. In the case <strong>of</strong><br />
Hadoop clusters, security is a pressing question: it was<br />
a feature added as the project developed, not cooked<br />
in from the beginning. It’s still far from perfect. Many<br />
open-source projects start as cool projects intended to<br />
prove a concept or solve a particular problem. Some,<br />
such as Linux or Mozilla, become massive successes,<br />
but they rarely start with the sort <strong>of</strong> requirements a CIO<br />
faces when introducing systems to corporate settings.<br />
Beyond open source, regardless <strong>of</strong> which tools are<br />
used to manipulate data, there are always risks<br />
associated with making decisions based on the analysis<br />
<strong>of</strong> <strong>Big</strong> <strong>Data</strong>. To give one dramatic example, the recent<br />
financial crisis was caused in part by banks and rating<br />
agencies whose models for understanding value at risk<br />
and the potential for securities based on subprime<br />
mortgages to fail were flat-out wrong. Just as there is<br />
risk in data that is not sufficiently clean, there is risk<br />
in data manipulation techniques that have not been<br />
sufficiently vetted. Many times, the only way to<br />
understand big, complicated data is through the use<br />
<strong>of</strong> big, complicated algorithms, which leaves a door<br />
open to big, catastrophic mistakes in analysis. Table 3<br />
includes a list <strong>of</strong> the risks associated with <strong>Big</strong> <strong>Data</strong><br />
analysis and ways to mitigate them.<br />
Risk | Mitigation tactic<br />
Over-reliance on insights gleaned from data analysis leads to loss | Testing<br />
Inaccurate or obsolete data | Maintain strong metadata management; unverified information must be flagged<br />
Analysis leads to paralysis | Keep the sandbox related to the business problem or opportunity<br />
Security | Keep the Hadoop clusters away from the firewall, be vigilant, ask the chief security officer for help<br />
Buggy code and other glitches | Make sure the team keeps track of modifications and other implementation history, since documentation isn’t plentiful<br />
Rejection by other parts of the organization | Do change management to help improve the odds of acceptance, along with quick successes<br />
Table 3: How to mitigate the risks <strong>of</strong> <strong>Big</strong> <strong>Data</strong> analysis<br />
Source: PricewaterhouseCoopers, 2010<br />
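The “flag unverified information” tactic in the table might look like this in practice; the record shape and field names are illustrative only:<br />

```python
from datetime import date

# Every record carries metadata about where it came from and whether it has
# been validated, so downstream analysis can exclude or discount gray data.
def tag(record, source, verified):
    return {**record, "_meta": {"source": source,
                                "verified": verified,
                                "loaded": date.today().isoformat()}}

records = [
    tag({"customer": "C-1", "spend": 120.0}, source="billing_system", verified=True),
    tag({"customer": "C-2", "spend": 75.0}, source="web_scrape", verified=False),
]

# Only verified records survive a trust filter.
trusted = [r for r in records if r["_meta"]["verified"]]
print(len(trusted))
```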
By nature and by work experiences, most CIOs are<br />
risk averse. Blue-chip CIOs hold <strong>of</strong>f installing new versions <strong>of</strong> s<strong>of</strong>tware until they have been proven beyond a<br />
doubt, and these CIOs don’t standardize<br />
on new platforms until the risk for change appears to<br />
be less than the risk <strong>of</strong> stasis.<br />
“The fundamental issue is whether IT is willing to<br />
depart from the status quo, such as an RDBMS<br />
[relational database management system], in favor<br />
<strong>of</strong> more powerful technologies,” Dresner says.<br />
“This means massive change, and IT doesn’t<br />
always embrace change.” More forward-thinking IT<br />
organizations constantly review their s<strong>of</strong>tware<br />
portfolio and adjust accordingly.<br />
In this case, the need to manipulate larger and larger<br />
amounts <strong>of</strong> data that companies are collecting is<br />
pressing. Even risk-averse CIOs are exploring the<br />
possibilities <strong>of</strong> <strong>Big</strong> <strong>Data</strong> for their businesses. Bernard<br />
(Bud) Mathaisel, CIO <strong>of</strong> the outsourcing vendor<br />
Achievo, divides the risks <strong>of</strong> <strong>Big</strong> <strong>Data</strong> and their<br />
solutions into three areas:<br />
• Accessibility—The data repository used for data<br />
analysis should be access managed<br />
• Classification—Gray data should be identified<br />
as such<br />
• Governance—Who’s doing what with this?<br />
Yes, <strong>Big</strong> <strong>Data</strong> is new. But accessibility, classification,<br />
and governance are matters CIOs have had to deal<br />
with for many years in many guises.<br />
Conclusion<br />
At many companies, <strong>Big</strong> <strong>Data</strong> is both an opportunity<br />
(what useful needles can we find in a terabyte-sized<br />
haystack?) and a source <strong>of</strong> stress (<strong>Big</strong> <strong>Data</strong> is<br />
overwhelming our current tools and methods; they<br />
don’t scale up to meet the challenge). The prefix<br />
“tera” in “terabyte,” after all, comes from the Greek<br />
word for “monster.” CIOs aiming to use <strong>Big</strong> <strong>Data</strong> to<br />
add value to their businesses are monster slayers.<br />
CIOs don’t just manage hardware and s<strong>of</strong>tware now;<br />
they’re expected to manage the data stored in that<br />
hardware and used by that s<strong>of</strong>tware—and provide a<br />
framework for delivering insights from the data.<br />
From Amazon.com to the Boston Red Sox, diverse<br />
companies compete based on what data they collect<br />
and what they learn from it. CIOs must deliver easy,<br />
reliable, secure access to that data and develop<br />
consistent, trustworthy ways to explore and wrench<br />
wisdom from that data. CIOs do not need to rush, but<br />
they do need to be prepared for the changes that <strong>Big</strong><br />
<strong>Data</strong> is likely to require.<br />
Perhaps the most productive way for CIOs to frame the<br />
issue is to acknowledge that <strong>Big</strong> <strong>Data</strong> isn’t merely a<br />
new model; it’s a new way to think about all data<br />
models. <strong>Big</strong> <strong>Data</strong> isn’t merely more data; it is different<br />
data that requires different tools. As more and more<br />
internal and external sources cast <strong>of</strong>f more and more<br />
data, basic notions about the size and attributes <strong>of</strong> data<br />
sets are likely to change. With those changes, CIOs will<br />
be expected to capture more data and deliver it to the<br />
executive team in a manner that reveals the business—<br />
and how to grow it—in new ways.<br />
Web companies have set the bar high already. John<br />
Avery, a partner at Sungard Consulting, points to the<br />
YouTube example: “YouTube’s ability to index a data<br />
store <strong>of</strong> such immense size and then accrete additional<br />
analysis on top <strong>of</strong> that, as an ongoing process with no<br />
foresight into what those analyses would look like when<br />
the data was originally stored, is very, very impressive.<br />
That is something that has challenged folks in financial<br />
technology for years.”<br />
As companies with a history <strong>of</strong> cautious data policies<br />
begin to test and embrace Hadoop, MapReduce, and<br />
the like, forward-looking CIOs will turn to the issues that<br />
will become more important as <strong>Big</strong> <strong>Data</strong> becomes the<br />
norm. The communities arising around Hadoop (and the<br />
inevitable open-source and proprietary competitors that<br />
follow) will grow and become influential, inspiring more<br />
CIOs to become more data-centric. The pr<strong>of</strong>usion <strong>of</strong><br />
new data sources will lead to dramatic growth in the<br />
use and diversity <strong>of</strong> metadata. As the data grows, so<br />
will our vocabulary for understanding it.<br />
Whether learning from Google’s approach to <strong>Big</strong> <strong>Data</strong>,<br />
hiring a staff primed to maximize its value, or managing<br />
the new risks, forward-looking CIOs will, as always,<br />
be looking to enable new business opportunities<br />
through technology.<br />
New approaches to<br />
customer data analysis<br />
Razorfish’s Mark Taylor and Ray Velez discuss how new<br />
techniques enable them to better analyze petabytes <strong>of</strong><br />
Web data.<br />
Interview conducted by Alan Morrison and Bo Parker<br />
Mark Taylor is global solutions director and Ray Velez is CTO <strong>of</strong> Razorfish, an interactive<br />
marketing and technology consulting firm that is now a part <strong>of</strong> Publicis Groupe. In this<br />
interview, Taylor and Velez discuss how they use Amazon’s Elastic Compute Cloud<br />
(EC2) and Elastic MapReduce services, as well as Micros<strong>of</strong>t Azure Table services, for<br />
large-scale customer segmentation and other data mining functions.<br />
PwC: What business problem were you trying to solve with the Amazon services?

MT: We needed to join together large volumes of disparate data sets that both we and a particular client can access. Historically, those data sets could not be joined at the capacity we were able to achieve using the cloud.

In our traditional data environment, we were limited in the scope of real clickstream data we could actually access for processing, and in bandwidth, because we procured a fixed size of data. We managed and worked with a third party to serve that data center.

This approach worked very well until we wanted to tie together and use SQL servers with online analytical processing cubes, all in a fixed infrastructure. With the cloud, we were able to throw billions of rows of data together and really start categorizing that information, so that we could segment non-personally identifiable data from browsing sessions and from specific ways in which we think about segmenting the behavior of customers.

That capability gives us a much smarter way to apply rules to our clients’ merchandising approaches, so that we can achieve far more contextual focus for the use of the data. Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.

RV: It was slightly different from a traditional database approach. The traditional approach just isn’t going to work when dealing with the amounts of data that a tool like the Atlas ad server [a Razorfish ad engine that is now owned by Microsoft and offered through Microsoft Advertising] has to deal with.
PwC: The scalability aspect of it seems clear. But is the nature of the data you’re collecting such that it may not be served well by a relational approach?

RV: It’s not the nature of the data itself, but what we end up needing to deal with when it comes to relational data. Relational data has lots of flexibility because of the normalized format, and you can slice and dice and look at the data in lots of different ways. But until you put it into a data warehouse format or a denormalized EMR [Elastic MapReduce] or Bigtable type of format, you really don’t get the performance that you need when dealing with larger data sets.

“Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.” — Mark Taylor

So it’s really that classic tradeoff; the data doesn’t necessarily lend itself perfectly to either approach. When you’re looking at performance and the amount of data, even a data warehouse can’t deal with the amount of data that we would get from a lot of our data sources.
PwC: What motivated you to look at this new technology to solve that old problem?

RV: Here’s a similar example where we used a slightly different technology. We were working with a large financial services institution, and we were dealing with massive amounts of spending-pattern and anonymous data. We knew we had to scale to Internet volumes, and we were talking about columnar databases. We wondered, can we use a relational structure with enough indexes to make it perform well? We experimented with a relational structure and it just didn’t work.

So early on we jumped into what Microsoft Azure technology allowed us to do, and we put it into a Bigtable format, or a Hadoop-style format, using Azure Table services. The real custom element was designing the partitioning structure of this data to denormalize what would usually be five or six tables into one huge table with lots of columns, to the point where we started to bump up against the maximum number of columns they had.
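The wide-table denormalization Velez describes can be sketched in plain Python. This is a minimal illustration, not Razorfish’s actual schema: the table names, fields, and key layout below are hypothetical, chosen only to show the idea of joining once at write time and composing a partition key around the expected query.

```python
# Sketch: collapsing several normalized tables into one wide,
# partition-keyed record, in the style of Bigtable/Azure Table storage.
# All names and fields here are hypothetical.

customers = {101: {"name": "Ann", "region": "east"}}
accounts = {5001: {"customer_id": 101, "type": "checking"}}
spend = [{"account_id": 5001, "month": "2010-06", "total": 420.0}]

def denormalize(spend_rows):
    """Perform the joins once in code, so reads need no joins at query time."""
    wide_rows = []
    for s in spend_rows:
        acct = accounts[s["account_id"]]
        cust = customers[acct["customer_id"]]
        wide_rows.append({
            # Partition key groups rows that are scanned together;
            # row key makes each entity unique within its partition.
            "PartitionKey": f"{cust['region']}|{s['month']}",
            "RowKey": f"{acct['customer_id']}|{s['account_id']}",
            "name": cust["name"],
            "account_type": acct["type"],
            "total_spend": s["total"],
        })
    return wide_rows

rows = denormalize(spend)

# All reads for one region and month now hit a single partition.
by_partition = {}
for r in rows:
    by_partition.setdefault(r["PartitionKey"], []).append(r)
```

The duplication is deliberate: lookup values (name, account type) are copied into every row, trading storage for join-free reads at scale.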
We were able to build something that we never would have thought of exposing to the world, because it never would have performed well. It actually spurred a whole new business idea for us. We were able to take what would typically be a BusinessObjects or a Cognos application, which would not scale to Internet volumes, and make it work at consumer scale.

We did some sizing to determine how big the data footprint would be. Obviously, when you do that, you tend to need a ton more space, because you’re duplicating lots and lots of data that, in a relational database, would be lookup tables or other things like that. But it turned out that when I laid the indexes on top of the traditionally relational data, the resulting data set actually had even greater storage requirements than performing the duplication and putting the data set into a large denormalized format. That was a bit of a surprise to us: the size of the indexes got so large.

When you think about it, maybe that’s just how an index works anyway: it puts things into this denormalized format. An index file is just a self-contained structure in your database or memory space. The point is, we would never have tried to expose that to consumers, but we were able to expose it to consumers because of this new format.
MT: The first commercial benefits were the ability to aggregate large and disparate data into one place and the extra processing power. But the next phase of benefits really derives from the ability to identify true relationships across that data.
“The stat section on [the MLB] site was always the most difficult part of the site, but the business insisted it needed it.” — Ray Velez

Tiny percentages of these data sets have the most significant impact on our customer interactions. We are already developing new data measurement and KPI strategies as we’re starting to ask ourselves, “Do our clients really need all of the data and measurement points to solve their business goals?”
PwC: Given these new techniques, is the skill set that’s most beneficial to have at Razorfish changing?

RV: It’s about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and the relational format, so the real skill set evolution doesn’t have as much to do with whether the tool of choice is Java or Python or other technologies; it’s more about whether you understand normalized versus denormalized structures.

MT: From a more commercial viewpoint, there’s a shift away from a product type and skill set based on constraints and managing known parameters, and very much toward what else we can do. It changes the impact, not just in the technology organization, but in the other disciplines as well. I’ve already seen a profound effect on the old ways of doing things. Rather than thinking of doing the same things better, it really comes down to having the people and skills to meet your intended business goals.
Using the Elastic MapReduce service can have a ripple effect on all of the non-technical business processes and engagements across teams. For example, conventional marketing segmentation used to involve teams of analysts who waded through various data sets and stages of processing and analysis to make sense of how a business might view groups of customers. Using the Hadoop-style alternative and Cascading, we’re able to identify unconventional relationships across many data points with less effort, and in the process create new segmentations and insights.

This way, we stay relevant and respond more quickly to customer demand. We’re identifying, on a real-time basis, new variations and shifts in the data that would otherwise have taken weeks or months, or been missed completely, using the old approach. The analyst’s role in creating these new algorithms and designing new methods of campaign planning is clearly key to this type of solution design. The outcome of all this is really interesting, and I’m starting to see a subtle, organic response to different changes in customer behavior.
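The segmentation idea Taylor describes can be sketched as a map/reduce pair. In production this would run as a Cascading flow on a Hadoop cluster; the Python below is only a stand-in for the same pipeline shape, and the session fields and segment rules are invented for illustration.

```python
# Sketch of segmentation as map/reduce: a map step tags each anonymous
# session with segment keys, and a reduce step aggregates the counts.
# Session fields and segment rules are hypothetical.
from collections import defaultdict

sessions = [
    {"visitor": "v1", "pages": 12, "bought": True, "referrer": "search"},
    {"visitor": "v2", "pages": 2, "bought": False, "referrer": "email"},
    {"visitor": "v3", "pages": 9, "bought": False, "referrer": "search"},
]

def map_segments(session):
    # Emit (segment, 1) pairs; one session can fall into several segments.
    if session["pages"] >= 8:
        yield ("engaged_browser", 1)
    if session["bought"]:
        yield ("converter", 1)
    if session["referrer"] == "search" and not session["bought"]:
        yield ("search_window_shopper", 1)

def reduce_counts(pairs):
    # Sum the 1s per segment key, as a reducer would per group.
    counts = defaultdict(int)
    for segment, n in pairs:
        counts[segment] += n
    return dict(counts)

counts = reduce_counts(p for s in sessions for p in map_segments(s))
# counts: {"engaged_browser": 2, "converter": 1, "search_window_shopper": 1}
```

Because each mapper sees one session at a time, the same rules scale from three sessions to billions simply by adding machines, which is the point of the Hadoop-style alternative.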
PwC: Are you familiar with Bill James, a Major League Baseball statistician who has taken a rather different approach to metrics? James developed some metrics that turned out to be more useful than those used for many years in baseball. That kind of person seems to be the type that you’re enabling to hypothesize, perhaps even do some machine learning to generate hypotheses.

RV: Absolutely. Our analytics team within Razorfish has the Bill James type of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people; they’re not very easy to find. And we have some of the leading folks in the industry.

You know, a long, long time ago we designed the Major League Baseball site and platform. The stat section on that site was always the most difficult part of the site, but the business insisted it needed it. The number of people who really wanted to churn through that data was small. We were using Oracle at the time. We used the concept of temporary tables, which would denormalize lots of different relational tables for performance reasons, and that was a challenge. If I had had the cluster technology we do now back in 1999 and 2000, we could have scaled much further than the two measly servers we could cluster.
PwC: The Bill James analogy goes beyond batting averages, which have been the age-old metric for assessing the contribution of a hitter to a team, to measuring other things that weren’t measured before.

RV: Even crazy stuff. We used to do things like, show me all of Derek Jeter’s hits at night on a grass field.

PwC: There you go. Exactly.

RV: That’s the example I always use, because that was the hardest thing to get to scale. If you go to the stat section, you can do a lot of those things, but if too many people went to the stat section on the site, the site would melt down, because Oracle couldn’t handle it. If I were to rebuild that today, I could use EMR or a Bigtable and I’d be much happier.
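The “hits at night on a grass field” query is, in cluster terms, a map-side filter over denormalized at-bat records. A minimal sketch, with a hypothetical record layout (the real Atlas or MLB data obviously looked nothing like three dictionaries):

```python
# Sketch: an arbitrary-predicate query as a map-side filter. Because each
# denormalized row already carries the park surface and game time, every
# mapper can apply the same predicate to its own shard independently.
# Record fields are hypothetical.
at_bats = [
    {"batter": "jeter", "result": "single", "surface": "grass", "night": True},
    {"batter": "jeter", "result": "out", "surface": "grass", "night": True},
    {"batter": "jeter", "result": "double", "surface": "turf", "night": False},
]

HITS = {"single", "double", "triple", "home_run"}

def night_grass_hits(rows, batter):
    # The whole query is one linear scan; no joins against park or
    # schedule tables are needed at query time.
    return [r for r in rows
            if r["batter"] == batter
            and r["result"] in HITS
            and r["surface"] == "grass"
            and r["night"]]

hits = night_grass_hits(at_bats, "jeter")
```

The same scan that melts down a shared relational server parallelizes trivially on a cluster, because no row depends on any other.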
PwC: Considering the size of the Bigtable that you’re able to put together without using joins, it seems like you’re initially able to filter better and maybe do multistage filtering to get to something useful. You can take a cyclical approach to your analysis, correct?

RV: Yes, you’re almost peeling away the layers of the onion. But putting data into a denormalized format does restrict flexibility, because you have so much more power with a where clause than you do with a standard EMR or Bigtable access mechanism.

It’s like the difference between something built for exactly one task versus something built to handle tasks I haven’t even thought of. If you peel away a layer of the onion, you might decide, wow, this data’s interesting and we’re going in a very interesting direction, so what about this? You may not be able to slice it that way. You might have to step back and come up with a different partition structure to support it.
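The flexibility tradeoff Velez is describing can be made concrete. In this sketch (keys and fields hypothetical), the anticipated question is answered by a single partition lookup, while an unanticipated "where clause" degenerates to a scan of every partition, which is exactly why a new question can force a re-partition:

```python
# Sketch: partition-key access vs. an ad hoc where clause.
# Keys and fields are hypothetical.
rows = [
    {"region": "east", "month": "2010-06", "visitor": "v1", "spend": 40.0},
    {"region": "west", "month": "2010-06", "visitor": "v2", "spend": 75.0},
    {"region": "east", "month": "2010-07", "visitor": "v1", "spend": 15.0},
]

# Partitioned for the anticipated question: activity by region and month.
table = {}
for r in rows:
    table.setdefault((r["region"], r["month"]), []).append(r)

# Anticipated query: one direct partition read.
fast = table[("east", "2010-06")]

# Unanticipated query ("everything visitor v1 did") ignores the partition
# key, so it has to scan every partition -- the lost where-clause power.
slow = [r for part in table.values() for r in part if r["visitor"] == "v1"]
```

A relational where clause hides this cost behind the optimizer; a key-value layout makes you pay it explicitly or redesign the partition structure.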
PwC: Social media is helping customers become more active and engaged. From a marketing analysis perspective, it’s a variation on a Super Bowl advertisement, just scaled down to that social media environment. And if that’s going to happen frequently, you need to know what the impact is, who’s watching it, and how the people watching it are affected by it. If you just think about the data ramifications of that, it sort of blows your mind.

RV: Think about the popularity of Hadoop and Bigtable, which is really looking under the covers at the way Google does its search. At the end of the day, search really is recommendations; it’s relevancy. What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting. We used to say we could never re-create the infrastructure that Google has; Google is the second largest server manufacturer in the world. But now we have a way to create small, targeted ways of doing what Google does. I think that’s pretty exciting. •

“What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting.” — Ray Velez
Acknowledgments

Advisory
Sponsor & Technology Leader
Tom DeGarmo

US Thought Leadership
Partner-in-Charge
Tom Craren

Center for Technology and Innovation
Managing Editor
Bo Parker

Editors
Vinod Baya, Alan Morrison

Contributors
Larry Best, Galen Gruman, Jimmy Guterman, Larry Marion, Bill Roberts

Editorial Advisers
Markus Anderle, Stephen Bay, Brian Butte, Tom Johnson, Krishna Kumaraswamy, Bud Mathaisel, Sean McClowry, Rajesh Munavalli, Luis Orama, Dave Patton, Jonathan Reichental, Terry Retter, Deepak Sahi, Carter Shock, David Steier, Joe Tagliaferro, Dimpsy Teckchandani, Cindi Thompson, Tom Urquhart, Christine Wendin, Dean Wotkiewich

Copyedit
Lea Anne Bantsari, Ellen Dunn

Transcription
Paula Burns
Graphic Design
Art Director
Jacqueline Corliss

Designers
Jacqueline Corliss, Suzanne Lau

Illustrators
Donald Bernhardt, Suzanne Lau, Tatiana Pechenik

Photographers
Tim Szumowski, Marina Waltz

Online
Director, Online Marketing
Jack Teuber

Designer and Producer
Scott Schmidt

Reviewers
Dave Stuckey, Chris Wensel

Marketing
Bob Kramer

Special thanks to
Ray George, Page One
Rachel Lovinger, Razorfish
Mariam Sughayer, Disney
Industry perspectives

During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives and industry analysts:

Bud Albers, executive vice president and chief technology officer, Technology Shared Services Group, Disney

Matt Aslett, analyst, enterprise software, the451

John Avery, partner, SunGard Consulting Services

Amr Awadallah, vice president, engineering, and chief technology officer, Cloudera

Phil Buckle, chief information officer, National Policing Improvement Agency

Brian Donnelly, founder and chief executive officer, InSilico Discovery

Howard Dresner, president and founder, Dresner Advisory Services

Matt Estes, principal architect, Technology Shared Services Group, Disney

Jim Kobielus, senior analyst, Forrester Research

Doug Lenat, founder and chief executive officer, Cycorp

Roger Magoulas, research director, O’Reilly Media

Nathan Marz, lead engineer, BackType

Bill McColl, founder and chief executive officer, Cloudscale

John Parkinson, acting chief technology officer, TransUnion

David Smoley, chief information officer, Flextronics

Mark Taylor, director, global solutions, Razorfish

Scott Thompson, vice president, architecture, Technology Shared Services Group, Disney

Ray Velez, chief technology officer, Razorfish
pwc.com/us

To have a deeper conversation about how this subject may affect your business, please contact:

Tom DeGarmo
Principal, Technology Leader
PricewaterhouseCoopers
+1 267-330-2658
thomas.p.degarmo@us.pwc.com

This publication is printed on Coronado Stipple Cover, made from 30% recycled fiber, and Endeavor Velvet Book, made from 50% recycled fiber, a Forest Stewardship Council (FSC) certified stock using 25% post-consumer waste.

Recycled paper
Subtext

Big Data
Data sets that range from many terabytes to petabytes in size, and that usually consist of less-structured information such as Web log files.

Hadoop cluster
A type of scalable computer cluster inspired by the Google Cluster Architecture and intended for cost-effectively processing less-structured information.

Apache Hadoop
The core of an open-source ecosystem that makes Big Data analysis more feasible through the efficient use of commodity computer clusters.

Cascading
A bridge from Hadoop to common Java-based programming techniques not previously usable in cluster-computing environments.

NoSQL
A class of non-relational data stores and data analysis techniques that are intended for various kinds of less-structured data. Many of these techniques are part of the Hadoop ecosystem.

Gray data
Data from multiple sources that isn’t formatted or vetted for specific needs, but worth exploring with the help of Hadoop cluster analysis techniques.
Comments or requests? Please visit www.pwc.com/techforecast or send e-mail to techforecasteditors@us.pwc.com.

PricewaterhouseCoopers (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to develop fresh perspectives and practical advice.

© 2010 PricewaterhouseCoopers LLP. All rights reserved. “PricewaterhouseCoopers” refers to PricewaterhouseCoopers LLP, a Delaware limited liability partnership, or, as the context requires, the PricewaterhouseCoopers global network or other member firms of the network, each of which is a separate and independent legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.