Making Sense of Big Data - Cloudera Blog
Technology Forecast
Making sense of Big Data
A quarterly journal
2010, Issue 3

In this issue
04 Tapping into the power of Big Data
22 Building a bridge to the rest of your data
36 Revising the CIO's data playbook
Contents

Features
04 Tapping into the power of Big Data
Treating it differently from your core enterprise data is essential.
22 Building a bridge to the rest of your data
How companies are using open-source cluster-computing techniques to analyze their data.
36 Revising the CIO's data playbook
Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.

Interviews
14 The data scalability challenge
John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.
18 Creating a cost-effective Big Data strategy
Disney's Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.
34 Hadoop's foray into the enterprise
Cloudera's Amr Awadallah discusses how and why diverse companies are trying this novel approach.
48 New approaches to customer data analysis
Razorfish's Mark Taylor and Ray Velez discuss how new techniques enable them to better analyze petabytes of Web data.

Departments
02 Message from the editor
50 Acknowledgments
54 Subtext
Message from the editor

Bill James has loved baseball statistics ever since he was a kid in Mayetta, Kansas, cutting baseball cards out of the backs of cereal boxes in the early 1960s. James, who compiled The Bill James Baseball Abstract for years, is a renowned "sabermetrician" (a term he coined himself). He now is a senior advisor on baseball operations for the Boston Red Sox, and he previously worked in a similar capacity for other Major League Baseball teams.

James has done more to change the world of baseball statistics than anyone in recent memory. As broadcaster Bob Costas says, James "doesn't just understand information. He has shown people a different way of interpreting that information." Before Bill James, Major League Baseball teams all relied on long-held assumptions about how games are won. They assumed batting average, for example, had more importance than it actually does.

James challenged these assumptions. He asked critical questions that didn't have good answers at the time, and he did the research and analysis necessary to find better answers. For instance, how many days' rest does a reliever need? James's answer is that some relievers can pitch well for two or more consecutive days, while others do better with a day or two of rest in between. It depends on the individual. Why can't a closer work more than just the ninth inning? A closer is frequently the best reliever on the team. James observes that managers often don't use the best relievers to their maximum potential.

The lesson learned from the Bill James example is that the best statistics come from striving to ask the best questions and trying to get answers to those questions. But what are the best questions? James takes an iterative approach, analyzing the data he has, or can gather, asking some questions based on that analysis, and then looking for the answers. He doesn't stop with just one set of statistics. The first set suggests some questions, to which a second set suggests some answers, which then give rise to yet another set of questions. It's a continual process of investigation, one that's focused on surfacing the best questions rather than assuming those questions have already been asked.

Enterprises can take advantage of a similarly iterative, investigative approach to data. Enterprises are being overwhelmed with data; many enterprises each generate petabytes of information they aren't making best use of. And not all of the data is the same. Some of it has value, and some, not so much.

The problem with this data has been twofold: (1) it's difficult to analyze, and (2) processing it using conventional systems takes too long and is too expensive.
02 PricewaterhouseCoopers Technology Forecast
Addressing these problems effectively doesn't require radically new technology. Better architectural design choices and software that allows a different approach to the problems are enough. Search engine companies such as Google and Yahoo provide a pragmatic way forward in this respect. They've demonstrated that efficient, cost-effective, system-level design can lead to an architecture that allows any company to treat different data differently.

Enterprises shouldn't treat voluminous, mostly unstructured information (for example, Web server log files) the same way they treat the data in core transactional systems. Instead, they can use commodity computer clusters, open-source software, and Tier 3 storage, and they can process in an exploratory way the less-structured kinds of data they're generating. With this approach, they can do what Bill James does and find better questions to ask.

In this issue of the Technology Forecast, we review the techniques behind low-cost distributed computing that have led companies to explore more of their data in new ways. In the article, "Tapping into the power of Big Data," on page 04, we begin with a consideration of exploratory analytics—methods that are separate from traditional business intelligence (BI). These techniques make it feasible to look for more haystacks, rather than just the needle in one haystack.

The article, "Building a bridge to the rest of your data," on page 22 highlights the growing interest in and adoption of Hadoop clusters. Hadoop provides high-volume, low-cost computing with the help of open-source software and hundreds or thousands of commodity servers. It also offers a simplified approach to processing more complex data in parallel. The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze lots of data they didn't have the means to analyze before, as well as enable innovative ways to analyze it.

The buzz around Big Data and "cloud storage" (a term some vendors use to describe less-expensive cluster-computing techniques) is considerable, but the article, "Revising the CIO's data playbook," on page 36 emphasizes that CIOs have time to pick and choose the most suitable approach. The most promising opportunity is in the area of "gray data," or data that comes from a variety of sources. This data is often raw and unvalidated, arrives in huge quantities, and doesn't yet have established value. Gray data analysis requires a different skill set—people who are more exploratory by nature.

As always, in this issue we've included interviews with knowledgeable executives who have insights on the overall topic of interest:

• John Parkinson of TransUnion describes the data challenges that more and more companies will face during the next three to five years.
• Bud Albers, Scott Thompson, and Matt Estes of Disney outline an agile, open-source cloud data vision.
• Amr Awadallah of Cloudera explores the reasons behind Apache Hadoop's adoption at search engine, social media, and financial services companies.
• Mark Taylor and Ray Velez of Razorfish contrast newer, more scalable techniques of studying customer data with the old methods.

Please visit pwc.com/techforecast to find these articles and other issues of the Technology Forecast online. If you would like to receive future issues of the Technology Forecast as a PDF attachment, you can sign up at pwc.com/techforecast/subscribe. We welcome your feedback and your ideas for future research and analysis topics to cover.

Tom DeGarmo
Principal
Technology Leader
thomas.p.degarmo@us.pwc.com
Tapping into the power of Big Data
Treating it differently from your core enterprise data is essential.
By Galen Gruman
Like most corporations, the Walt Disney Co. is swimming in a rising sea of Big Data: information collected from business operations, customers, transactions, and the like; unstructured information created by social media and other Web repositories, including the Disney home page itself and sites for its theme parks, movies, books, and music; plus the sites of its many big business units, including ESPN and ABC.

"In any given year, we probably generate more data than the Walt Disney Co. did in its first 80 years of existence," observes Bud Albers, executive vice president and CTO of the Disney Technology Shared Services Group. "The challenge becomes what do you do with it all?"

Albers and his team are in the early stages of answering their own question with an economical cluster-computing architecture based on a set of cost-effective and scalable technologies anchored by Apache Hadoop, an open-source, Java-based distributed computing framework modeled on Google's MapReduce and Google File System papers and developed by the Apache Software Foundation. These still-emerging technologies allow Disney analysts to explore multiple terabytes of information without the lengthy time requirements or high cost of traditional business intelligence (BI) systems.

This issue of the Technology Forecast examines how Apache Hadoop and these related technologies can derive business value from Big Data by supporting a new kind of exploratory analytics unlike traditional BI. These software technologies and their hardware cluster platform make it feasible not only to look for the needle in the haystack, but also to look for new haystacks. This kind of analysis demands an attitude of exploration—and the ability to generate value from data that hasn't been scrubbed or fully modeled into relational tables.

Using Disney and other examples, this first article introduces the idea of exploratory BI for Big Data. The second article examines Hadoop clusters and technologies that support them (page 22), and the third article looks at steps CIOs can take now to exploit the future benefits (page 36). We begin with a closer look at Disney's still-nascent but illustrative effort.
Bringing Big Data under control

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. Like everyone else, Disney's Big Data is huge, more unstructured than structured, and growing much faster than transactional data.

The Disney Technology Shared Services Group, which is responsible for Disney's core Web and analysis technologies, recently began its Big Data efforts but already sees high potential. The group is testing the technology and working with analysts in Disney business units. Disney's data comes from varied sources, but much of it is collected for departmental business purposes and not yet widely shared. Disney's Big Data approach will allow it to look at diverse data sets for unplanned purposes and to uncover patterns across customer activities. For example, insights from Disney Store activities could be useful in call centers for theme park booking or to better understand the audience segments of one of its cable networks. The Technology Shared Services Group is even using Big Data approaches to explore its own IT questions to understand what data is being stored, how it is used, and thus what type of storage hardware and management the group needs.

Albers assumes that Big Data analysis is destined to become essential. "The speed of business these days and the amount of data that we are now swimming in mean that we need to have new ways and new techniques of getting at the data, finding out what's in there, and figuring out how we deal with it," he says.

The team stumbled upon an inexpensive way to improve the business while pursuing more IT cost-effectiveness through the use of private-cloud technologies. (See the Technology Forecast, Summer 2009, for more on the topic of cloud computing.) When Albers launched the effort to change the division's cost curve so IT expenses would rise more slowly than the business usage of IT—the opposite had been true—he turned to an approach that many companies use to make data centers more efficient: virtualization. Virtualization offers several benefits, including higher utilization of existing servers and the ability to move workloads to prevent resource bottlenecks. An organization can also move workloads to external cloud providers, using them as a backup resource when needed, an approach called cloud bursting. By using such approaches, the Disney Technology Shared Services Group lowered its IT expense growth rate from 27 percent to –3 percent, while increasing its annual processing growth from 17 percent to 45 percent.

While achieving this efficiency, the team realized that the ability to move resources and tap external ones could apply to more than just data center efficiency. At first, they explored using external clouds to analyze big sets of data, such as Web traffic to Disney's many sites, and to handle big processing jobs more cost-effectively and more quickly than with internal systems.

During that exploration, the team discovered Hadoop, MapReduce, and other open-source technologies that distribute data-analysis workloads across many computers, breaking the analysis into many parallel workloads that produce results faster. Faster results mean that more questions can be asked, and the low cost of the technologies means the team can afford to ask those questions.
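The split-then-merge pattern behind MapReduce can be sketched in miniature. In this toy Python example (illustrative only: the log format, URLs, and function names are invented, and a thread pool stands in for a cluster of machines), chunks of Web-log lines are mapped to partial per-page hit counts in parallel, then reduced into one total:

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool; stands in for cluster nodes

# Toy Web-server log lines (hypothetical format).
LOG_LINES = [
    "10.0.0.1 GET /parks/booking",
    "10.0.0.2 GET /store/item/42",
    "10.0.0.3 GET /parks/booking",
    "10.0.0.4 GET /espn/scores",
]

def map_chunk(lines):
    # Map step: each worker turns its chunk of the log
    # into partial per-URL hit counts.
    return Counter(line.split()[2] for line in lines)

def merge_counts(a, b):
    # Reduce step: combine the workers' partial counts.
    return a + b

def count_hits(lines, workers=2):
    # Split the input into one chunk per worker, run the map step
    # in parallel, then fold the partial results into one total.
    chunks = [lines[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)
    return reduce(merge_counts, partials, Counter())

print(count_hits(LOG_LINES)["/parks/booking"])  # 2
```

In a Hadoop cluster, the same two functions would run as map and reduce tasks spread across hundreds of commodity servers, with the framework handling the chunking, scheduling, and merging.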
Disney assembled a Hadoop cluster and set up a central logging service to mine data that the organization hadn't been able to before. It will begin to provide internal group access to the cluster in October 2010. Figure 1 shows how the Hadoop cluster will benefit internal groups, business partners, and customers.
Figure 1: Disney's Hadoop cluster and central logging service
[Diagram: site visitors, internal business partners, and affiliated businesses reach the D-Cloud data cluster (Hadoop, central logging service, metadata repository) through a MapReduce/Hive/Pig interface connected to core IT and business unit systems.]
Disney's new D-Cloud data cluster can scale to handle (1) less-structured usage data through the establishment of (2) a central logging service, (3) a cost-effective Hadoop data analysis engine, and a commodity computer cluster. The result is (4) a more responsive and personalized user experience.
Source: Disney, 2010
Simply put, the low cost of a Hadoop cluster means freedom to experiment. Disney uses a couple of dozen servers that were scheduled to be retired, and the organization operates its cluster with a handful of existing staff. Matt Estes, principal data architect for the Disney Technology Shared Services Group, estimates the cost of the project at $300,000 to $500,000.

"Before, I would have needed to figure on spending $3 million to $5 million for such an initiative," Albers says. "Now I can do this without charging to the bottom line."

Unlike the reusable canned queries in typical BI systems, Big Data analysis does require more effort to write the queries and the data-parsing code for what are often unique inquiries of data sources. But Albers notes that "the risk is lower due to all the other costs being lower." Failure is inexpensive, so analysts are more willing to explore questions they would otherwise avoid.
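Because each inquiry is often unique, that data-parsing code tends to be short and disposable. A hypothetical Python sketch (the log format and field names are invented) of a one-off question, how many distinct visitors hit each page:

```python
import re
from collections import defaultdict

# Hypothetical raw log lines; a real inquiry would target whatever
# unscrubbed source is at hand.
RAW_LOGS = [
    "2010-07-01T09:12:03 visitor=alice page=/store/item/42",
    "2010-07-01T09:13:10 visitor=bob page=/store/item/42",
    "2010-07-01T09:14:55 visitor=alice page=/parks/booking",
]

LINE = re.compile(r"visitor=(?P<visitor>\S+)\s+page=(?P<page>\S+)")

def distinct_visitors_per_page(lines):
    # One-off parsing code: skip lines that don't match rather
    # than scrubbing the whole data set up front.
    seen = defaultdict(set)
    for line in lines:
        m = LINE.search(line)
        if m:
            seen[m.group("page")].add(m.group("visitor"))
    return {page: len(visitors) for page, visitors in seen.items()}

print(distinct_visitors_per_page(RAW_LOGS))
# {'/store/item/42': 2, '/parks/booking': 1}
```

If the answer turns out to be uninteresting, the script is thrown away; the failed experiment cost an hour, not a BI project.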
Even in this early stage, Albers is confident that the ability to ask more questions will lead to more insights that translate to both the bottom line and the top line. For example, Disney already is seeking to boost customer engagement and spending by making recommendations to customers based on pattern analysis of their online behavior.
How Big Data analysis is different

What should other enterprises anticipate from Hadoop-style analytics? It is a type of exploratory BI they haven't done much before. This is business intelligence that provides indications, not absolute conclusions. It requires a different mind-set, one that begins with exploration, the results of which create hypotheses, which are tested before moving on to validation and consolidation.

These methods could be used to answer questions such as, "What indicators might there be that predate a surge in Web traffic?" or "What fabrics and colors are gaining popularity among influencers, and what sources might be able to provide the materials to us?" or "What's the value of an influencer on Web traffic through his or her social network?" See the sidebar "Opportunities for Big Data insights" for more examples of the kinds of questions that can be asked of Big Data.

Opportunities for Big Data insights

Here are other examples of the kinds of insights that may be gleaned from analysis of Big Data information flows:

• Customer churn, based on analysis of call center, help desk, and Web site traffic patterns
• Changes in corporate reputation and the potential for regulatory action, based on the monitoring of social networks as well as Web news sites
• Real-time demand forecasting, based on disparate inputs such as weather forecasts, travel reservations, automotive traffic, and retail point-of-sale data
• Supply chain optimization, based on analysis of weather patterns, potential disaster scenarios, and political turmoil
Disney and others explore their data without a lot of preconceptions. They know the results won't be as specific as a profit-margin calculation or a drug-efficacy determination. But they still expect demonstrable value, and they expect to get it without a lot of extra expense.

Typical BI uses data from transactional and other relational database management systems (RDBMSs) that an enterprise collects—such as sales and purchasing records, product development costs, and new employee hire records—diligently scrubs the data for accuracy and consistency, and then puts it into a form the BI system is programmed to run queries against. Such systems are vital for accurate analyses of transactional information, especially information subject to compliance requirements, but they don't work well for messy questions, they've been too expensive for questions you're not sure there's any value in asking, and they haven't been able to scale to analyze large data sets efficiently. (See Figure 2.)
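That conventional pipeline, scrubbed data loaded into a relational store and interrogated with canned queries, can be illustrated with Python's sqlite3 module (the schema and figures here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
# Scrubbed, consistent transactional records: the kind of data BI is built for.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 1200.0), ("west", 800.0), ("east", 300.0)],
)

def revenue_by_region(conn):
    # A reusable "canned" query: the question is fixed in advance,
    # and the schema was designed so it can be answered reliably.
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cur.fetchall())

print(revenue_by_region(conn))  # {'east': 1500.0, 'west': 800.0}
```

The strength and the limitation are the same thing: the question must be designed into the schema before it can be asked, which is exactly what exploratory analysis avoids.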
Figure 2: Where Big Data fits in
[Quadrant chart: the vertical axis runs from small to large data sets, the horizontal axis from non-relational to relational data. Big Data (via Hadoop/MapReduce) occupies large, non-relational data sets; traditional BI occupies relational data but offers less scalability; small, non-relational data sets hold little analytical value.]
Source: PricewaterhouseCoopers, 2010
Other companies have also tapped into the excitement brewing over Big Data technologies. Several Web-oriented companies that have always dealt with huge amounts of data—such as Yahoo, Twitter, and Google—were early adopters. Now, more traditional companies—such as Disney and TransUnion, a credit rating service—are exploring Big Data concepts, having seen the cost and scalability benefits the Web companies have realized.

Specifically, enterprises are also motivated by the inability to scale their existing approach for working on traditional analytics tasks, such as querying across terabytes of relational data. They are learning that the tools associated with Hadoop are uniquely positioned to explore data that has been sitting on the side, unanalyzed. Figure 3 illustrates how the data architecture landscape appears in 2010. Enterprises with high processing power requirements and centralized architectures are facing scaling issues.

In contrast, Big Data techniques allow you to sift through data to look for patterns at a much lower cost and in much less time than traditional BI systems. Should the data end up being so valuable that it requires the ongoing, compliance-oriented analysis of regular BI systems, only then do you make that investment. Big Data approaches let you ask more questions of more information, opening a wide range of potential insights you couldn't afford to consider in the past.

"Part of the analytics role is to challenge assumptions," Estes says. BI systems aren't designed to do that; instead, they're designed to dig deeper into known questions and look for variations that may indicate deviations from expected outcomes.

Furthermore, Big Data analysis is usually iterative: you ask one question or examine one data set, then think of more questions or decide to look at more data. That's different from the "single source of truth" approach to standard BI and data warehousing. The Disney team started with making sure they could expose and access the data, then moved to iterative refinement in working with the data. "We aggressively got in to find the direction and the base. Then we began to iterate rather than try to do a Big Bang," Albers says.
Figure 3: The data architecture landscape in 2010
[Quadrant chart: the vertical axis runs from low to high processing power, the horizontal axis from centralized to distributed compute architecture. Most enterprises sit in the centralized, high-processing-power quadrant, facing scaling and capacity/cost problems; Google, Amazon, Facebook, Twitter, etc. sit in the distributed, high-processing-power quadrant (all use non-relational data stores for reasons of scale); cloud users with low compute requirements occupy the distributed, low-processing-power quadrant.]
Source: PricewaterhouseCoopers, 2010
Wolfram Research and IBM have begun to extend their analytics applications to run on such large-scale data pools, and startups are presenting approaches they promise will allow data exploration in ways that technologies couldn't have enabled in the past, including support for tools that let knowledge workers examine traditional databases using Big Data–style exploratory tools.
The ways different enterprises approach Big Data

It should come as no surprise that organizations dealing with lots of data are already investigating Big Data technologies, or that they have mixed opinions about these tools.

"At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data, looking for things that match a pattern approximately," says John Parkinson, TransUnion's acting CTO. "We want to do accurate but approximate matching and categorization in very large low-structure data sets."

Parkinson has explored Big Data technologies such as MapReduce that appear to have a more efficient filtering model than some of the pattern-matching algorithms TransUnion has tried in the past. "It also, at least in its theoretical formulation, is very amenable to highly parallelized execution," which lets the users tap into farms of commodity hardware for fast, inexpensive processing, he notes.
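The appeal is easy to see in miniature: because each record is scored against the pattern independently, the scoring step parallelizes naturally across machines. This Python sketch is illustrative only; difflib's similarity ratio stands in for whatever matching algorithm TransUnion actually uses, and the records are invented:

```python
from difflib import SequenceMatcher

# Invented low-structure records to filter.
RECORDS = [
    "JON SMITH 123 MAIN ST",
    "JOHN SMITH 123 MAIN STREET",
    "JANE DOE 99 ELM AVE",
]

def similarity(a, b):
    # Cheap approximate-match score in [0, 1]. Each record is scored
    # independently, which is what makes the filter easy to parallelize.
    return SequenceMatcher(None, a, b).ratio()

def approx_matches(records, target, threshold=0.7):
    # The map step of a MapReduce-style filter: keep records whose
    # score against the target pattern clears the threshold.
    return [r for r in records if similarity(r, target) >= threshold]

print(approx_matches(RECORDS, "JOHN SMITH 123 MAIN ST"))
```

In a MapReduce job, `approx_matches` would run as the map step over partitions of the billions of rows, with reducers collecting the surviving candidates.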
However, Parkinson thinks Hadoop and MapReduce<br />
are too immature. “MapReduce really hasn’t evolved<br />
yet to the point where your average enterprise<br />
technologist can easily make productive use <strong>of</strong> it. As<br />
for Hadoop, they have done a good job, but it’s like a<br />
lot <strong>of</strong> open-source s<strong>of</strong>tware—80 percent done. There<br />
were limits in the code that broke the stack well before<br />
what we thought was a good theoretical limit.”<br />
Parkinson echoes many IT executives who are<br />
skeptical <strong>of</strong> open-source s<strong>of</strong>tware in general. “If I have<br />
a bunch <strong>of</strong> engineers, I don’t want them spending their<br />
day being the technology support environment for what<br />
should be a product in our architecture,” he says.<br />
That’s a legitimate point <strong>of</strong> view, especially considering<br />
the data volumes TransUnion manages—8 petabytes<br />
from 83,000 sources in 4,000 formats and growing—<br />
and its focus on mission-critical capabilities for this<br />
data. Credit scoring must run successfully and deliver<br />
top-notch credit scores several times a day. It’s an<br />
operational system that many depend on for critical<br />
business decisions that happen millions <strong>of</strong> times a<br />
day. (For more on TransUnion, see the interview with<br />
Parkinson on page 14.)<br />
Disney’s system is purely intended for exploratory<br />
efforts or at most for reporting that eventually may feed<br />
up to product strategy or Web site design decisions. If<br />
it breaks or needs a little retooling, there’s no crisis.<br />
But Albers disagrees about the readiness of the tools, noting that the Disney Technology Shared Services Group also handles quite a bit of data. He figures Hadoop and MapReduce aren’t any worse than a lot of proprietary software. “I fully expect we will run on things that break,” he says, adding facetiously, “Not that any commercial product I’ve ever had has ever broken.”
Data architect Estes also sees laudable responsiveness in open-source development. “In our testing, we uncovered stuff, and you get somebody on the other end. This is their baby, right? I mean, they want it fixed.”
Albers emphasizes the total cost-effectiveness of Hadoop and MapReduce. “My software cost is zero. You still have the implementation, but that’s a constant at some level, no matter what. Now you probably need to have a little higher skill level at this stage of the game, so you’re probably paying a little more, but certainly, you’re not going out and approving a Teradata cluster. You’re talking about Tier 3 storage. You’re talking about a very low level of cost for the storage.”
Albers’ points are also valid. PricewaterhouseCoopers predicts these open-source tools will be solid sooner rather than later and that they are already worthy of use in non-mission-critical environments and applications. Hence, in the CIO article on page 36, we argue in favor of taking cautious but exploratory steps.
Asking new business questions

Saving money is certainly a big reward, but PricewaterhouseCoopers contends the biggest payoff from Hadoop-style analysis of Big Data is the potential to improve organizations’ top line. “There’s a lot of potential value in the unstructured data in organizations, and people are starting to look at it more seriously,” says Tom Urquhart, chief architect at PricewaterhouseCoopers. Think of it as a “Google in a box, which allows you to do intelligent search regardless of whether the underlying content is structured or unstructured,” he says.
10 PricewaterhouseCoopers Technology Forecast
The Google-style techniques in Hadoop, MapReduce, and related technologies work in a fundamentally different way from traditional BI systems, which use strictly formatted data cubes pulling information from data warehouses. Big Data tools let you work with data that hasn’t been formally modeled by data architects, so you can analyze and compare data of different types and of different levels of rigor. Because these tools typically don’t discard or change the source data before the analysis begins, the original context remains available for drill-down by analysts.
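The contrast is easiest to see in miniature. The sketch below shows the MapReduce pattern applied to raw, unmodeled records in plain Python (not the Hadoop API; the log format and field names are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Raw, unmodeled records of mixed shape: no schema was imposed up front.
records = [
    "2010-06-01 user=alice action=view page=/parks",
    "2010-06-01 user=bob action=buy sku=1234",
    "2010-06-02 user=alice action=view page=/movies",
]

def map_phase(record):
    """Emit (key, 1) pairs; here we count actions per user."""
    fields = dict(f.split("=", 1) for f in record.split() if "=" in f)
    if "user" in fields and "action" in fields:
        yield (fields["user"], fields["action"]), 1

def reduce_phase(pairs):
    """Group by key and sum the counts, as a reducer would."""
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

mapped = [pair for r in records for pair in map_phase(r)]
counts = reduce_phase(mapped)
print(counts)  # {('alice', 'view'): 2, ('bob', 'buy'): 1}
```

No data cube or up-front model is required; the raw lines stay intact, and only the map function decides, at read time, what to extract.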
These tools provide technology assistance to a very human form of analysis: looking at the world as it is and finding patterns of similarity and difference, then going deeper into the areas of interest judged valuable. In contrast, BI systems know what questions should be asked and what answers to expect; their goal is to look for deviations from the norm or changes in standard patterns deemed important to track (such as changes in baseline quality or in sales rates in specific geographies). Such an approach, absent an exploratory phase, results in a lot of information loss during data consolidation. (See Figure 4.)
Pattern analysis mashup services

There’s another use of Big Data that combines efficiency and exploratory benefits: on-the-fly pattern analysis from disparate sources to return real-time results. Amazon.com pioneered Big Data–based product recommendations by analyzing customer data, including purchase histories, product ratings, and comments. Albers is looking for similar value that would come from making live recommendations to customers when they go to a Disney site, store, or reservations phone line—based on their previous online and offline behavior with Disney.
O’Reilly Media, a publisher best known for technical books and Web sites, is working with the White House to develop mashup applications that look at data from various sources to identify patterns that might help lobbyists and policymakers. For example, by mashing together US Census data and labor statistics, they can see which counties have the most international and domestic immigration, then correlate those attributes with government spending changes, says Roger Magoulas, O’Reilly’s research director.
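At its core, such a mashup is a join of independently collected data sets on a shared key. A minimal sketch, with all counties and figures made up for illustration:

```python
# Hypothetical mashup: join two independently collected data sets on a
# shared county key, then rank counties by net migration.
census = {           # county -> net migration (made-up figures)
    "Travis, TX": 25000,
    "Wayne, MI": -18000,
    "King, WA": 12000,
}
spending = {         # county -> change in government spending, $M (made-up)
    "Travis, TX": 140,
    "Wayne, MI": 310,
    "King, WA": 95,
}

# Inner join on county, then sort by migration to surface the pattern.
mashup = [
    {"county": c, "migration": census[c], "spending_change": spending[c]}
    for c in census if c in spending
]
mashup.sort(key=lambda row: row["migration"], reverse=True)
for row in mashup:
    print(row["county"], row["migration"], row["spending_change"])
```

The analytical work is in choosing the key and the sources; the join itself is simple once both data sets speak the same identifier.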
Figure 4: Information loss in the data consolidation process. Consolidating all collected data into summary departmental data and then summary enterprise data loses information at each step (and some pre-consolidated data is never collected at all); adding an exploration phase over all collected data before consolidation means less information loss and greater insight.
Source: PricewaterhouseCoopers, 2010
Mashups like this can also result in customer-facing services. FlightCaster for iPhone and BlackBerry uses Big Data approaches to analyze flight-delay records and current conditions to issue flight-delay predictions to travelers.
Exploiting the power of human analysis

Big Data approaches can lower processing and storage costs, but we believe their main value is to perform the analysis that BI systems weren’t designed for, acting as an enabler and an amplifier of human analysis.
Ad hoc exploration at a bargain

Big Data lets you inexpensively explore questions and peruse data for patterns that may indicate opportunities or issues. In this arena, failure is cheap, so analysts are more willing to explore questions they would otherwise avoid. And that should lead to insights that help the business operate better.
Medical data is an example of the potential for ad hoc analysis. “A number of such discoveries are made on the weekends when the people looking at the data are doing it from the point of view of just playing around,” says Doug Lenat, founder and CEO of Cycorp and a former professor at Stanford and Carnegie Mellon universities.
Right now the technical knowledge required to use these tools is nontrivial. Imagine the value of extending the exploratory capability more broadly. Cycorp is one of many startups trying to make Big Data analytic capabilities usable by more knowledge workers so they can perform such exploration.
Analyzing data that wasn’t designed for BI

Big Data also lets you work with “gray data,” or data from multiple sources that isn’t formatted or vetted for your specific needs, and that varies significantly in its level of detail and accuracy—and thus cannot be examined by BI systems.
One analogy is Wikipedia. Everyone knows its information is not rigorously managed or necessarily accurate; nonetheless, Wikipedia is a good first place to look for indicators of what may be true and useful. From there, you do further research using a mix of information resources whose accuracy and completeness may be more established.
People use their knowledge and experience to appropriately weigh and correlate what they find across gray data to come up with improved strategies to aid the business. Figure 5 compares gray data and more normalized black data.
Figure 5: Gray versus black data
Gray data: raw; data and context co-mingled; noisy; hypothetical; unchecked; indicative; less trustworthy; managed by the business unit (e.g., Wikipedia).
Black data: classified; provenanced; cleaned; actual; reviewed; confirming; more trustworthy; managed by IT (e.g., financial system data).
Source: PricewaterhouseCoopers, 2010
Web analytics and financial risk analysis are two examples of how Big Data approaches augment human analysts. These techniques comb huge data sets of information collected for specific purposes (such as monitoring individual financial records), looking for patterns that might identify good prospects for loans and flag problem borrowers. Increasingly, they comb external data not collected by a credit reporting agency—for example, trends in a neighborhood’s housing values or in local merchants’ sales patterns—to provide insights into where sales opportunities could be found or where higher concentrations of problem customers are located.
The same approaches can help identify shifts in consumer tastes, such as for apparel and furniture. And, by analyzing gray data related to costs of resources and changes in transportation schedules, these approaches can help anticipate stresses on suppliers and help identify where additional suppliers might be found.
All of these activities require human intelligence, experience, and insight to make sense of the data, figure out the questions to ask, decide what information should be correlated, and generally conduct the analysis.
Why the time is ripe for Big Data

The human analysis previously described is old hat for many business analysts, whether they work in manufacturing, fashion, finance, or real estate. What’s changing is scale. As noted, many types of information are now available that never existed or were not accessible. What could once only be suggested through surveys, focus groups, and the like can now be examined directly, because more of the granular thinking and behaviors are captured. Businesses have the potential to discover more through larger samples and more granular details, without relying on people to recall behaviors and motivations accurately.
This potential can be realized only if you pull together and analyze all that data. Right now, there’s simply too much information for individual analysts to manage, increasing the chances of missing potential opportunities or risks. Businesses that augment their human experts with Big Data technologies could have significant competitive advantages by heading off problems sooner, identifying opportunities earlier, and performing mass customization at a larger scale.
Fortunately, the emerging Big Data tools should let businesspeople apply individual judgments to vaster pools of information, enabling low-cost, ad hoc analysis never before feasible. Plus, as patterns are discovered, the detection of some can be automated, letting the human analysts concentrate on the art of analysis and interpretation that algorithms can’t accomplish.
Even better, emerging Big Data technologies promise to extend the reach of analysis beyond the cadre of researchers and business analysts. Several startups offer new tools with familiar data-analysis interfaces—similar to those for SQL databases and Excel spreadsheets—for exploring Big Data sources, thus broadening that exploratory ability to a wider set of knowledge workers.
Finally, Big Data approaches can be used to power analytics-based services that improve the business itself, such as in-context recommendations to customers, more accurate predictions of service delivery, and more accurate failure predictions (such as for the manufacturing, energy, medical, and chemical industries).
Conclusion

PricewaterhouseCoopers believes that Big Data approaches will become a key value creator for businesses, letting them tap into the wild, woolly world of information heretofore out of reach. These new data management and storage technologies can also provide economies of scale in more traditional data analysis. Don’t limit yourself to the efficiencies of Big Data and miss out on the potential for gaining insights through its advantages in handling the gray data prevalent today.
Big Data analysis supplements, rather than replaces, the BI systems, data warehouses, and database systems essential to financial reporting, sales management, production management, and compliance systems. The difference is that those information systems deal with the knowns that must meet high standards for rigor, accuracy, and compliance, while the emerging Big Data analytics tools help you deal with the unknowns that could affect business strategy or its execution.
As the amount and interconnectedness of data vastly increases, the value of the Big Data approach will only grow. If the amount and variety of today’s information is daunting, think what the world will be like in 5 or 10 years. People will become mobile sensors—collecting, creating, and transmitting all sorts of information, from locations to body status to environmental information. We already see this happening as smartphones equipped with cameras, microphones, geolocation, and compasses proliferate. Wearable medical sensors, small temperature tags for use on packages, and other radio-equipped sensors are a reality. They’ll be the Twitter and Facebook feeds of tomorrow, adding vast quantities of new information that could provide context on behavior and environment never before possible—and a lot of “noise” certain to mask what’s important.
Insight-oriented analytics in this sea of information—where interactions cause untold ripples and eddies in the flow and delivery of business value—will become a critical competitive requirement. Big Data technology is the likeliest path to gaining such insights.
The data scalability challenge

John Parkinson of TransUnion describes the data handling issues more companies will face in three to five years.

Interview conducted by Vinod Baya and Alan Morrison

John Parkinson is the acting CTO of TransUnion, the chairman and owner of Parkwood Advisors, and a former CTO at Capgemini. In this interview, Parkinson outlines TransUnion’s considerable requirements for less-structured data analysis, shedding light on the many data-related technology challenges TransUnion faces today—challenges he says more companies will face in the near future.
PwC: In your role at TransUnion, you’ve evaluated many large-scale data processing technologies. What do you think of Hadoop and MapReduce?

JP: MapReduce is a very computationally attractive answer for a certain class of problem. If you have that class of problem, then MapReduce is something you should look at. The challenge today, however, is that the number of people who really get the formalism behind MapReduce is a lot smaller than the group of people trying to understand what to do with it. It really hasn’t evolved yet to the point where your average enterprise technologist can easily make productive use of it.
PwC: What class of problem would that be?

JP: MapReduce works best in situations where you want to do high-volume, accurate but approximate matching and categorization in very large, low-structured data sets. At TransUnion, we spend a lot of our time trawling through tens or hundreds of billions of rows of data looking for things that match a pattern approximately. MapReduce is a more efficient filter for some of the pattern-matching algorithms that we have tried to use. At least in its theoretical formulation, it’s very amenable to highly parallelized execution, which many of the other filtering algorithms we’ve used aren’t.
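The class of problem Parkinson describes parallelizes well because a map step can assign each row a coarse blocking key, so the expensive fuzzy comparisons happen only within each reducer's block. A toy sketch (the records, the key scheme, and the 0.8 threshold are all invented for illustration):

```python
from difflib import SequenceMatcher

# Low-structured rows from different sources; we want approximate matches.
rows = [
    "JOHN Q PUBLIC, 42 OAK ST, SPRINGFIELD",
    "John Public, 42 Oak Street, Springfield",
    "Jane Doe, 7 Elm Ave, Shelbyville",
]

def block_key(row):
    """Map step: a coarse key so candidate matches land on one reducer."""
    name = row.split(",")[0].upper()
    return name[0] + name.split()[-1][:4]   # first initial + surname prefix

def similar(a, b, threshold=0.8):
    """Reduce step: fuzzy comparison, run only within a block."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio() >= threshold

# Shuffle: group rows by blocking key.
blocks = {}
for row in rows:
    blocks.setdefault(block_key(row), []).append(row)

# Compare all pairs within each block only.
matches = [(a, b) for group in blocks.values()
           for i, a in enumerate(group) for b in group[i + 1:]
           if similar(a, b)]
print(matches)   # the two Springfield rows pair up; Jane Doe stands alone
```

The quadratic pairwise comparison, which would be hopeless over billions of rows, is confined to small blocks, and the blocks process independently in parallel.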
The open-source stack is attractive for experimenting, but the problem we find is that Hadoop isn’t what Google runs in production—it’s an attempt by a bunch of pretty smart guys to reproduce what Google runs in production. They’ve done a good job, but it’s like a lot of open-source software—80 percent done. The 20 percent that isn’t done—those are the hard parts.

From an experimentation point of view, we have had a lot of success in proving that the computing formalism behind MapReduce works, but the software that we can acquire today is very fragile. It’s difficult to manage. It has some bugs in it, and it doesn’t behave very well in an enterprise environment. It also has some interesting limitations when you try to push the scale and the performance.
We found a number of representational problems when we used the HDFS/Hadoop/HBase stack to do something that, according to the documentation available, should have worked. However, in practice, limits in the code broke the stack well before what we thought was a good theoretical limit that we needed to achieve to make it worthwhile.
Now, the good news of course is that you get source code. But that’s also the bad news. You need to get the source code, and that’s not something that we want to do as part of routine production. I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture. Yes, there’s a pony there, but it’s going to be a while before it stabilizes to the point that I want to bet revenue on it.
PwC: Data warehousing appliance prices have dropped pretty dramatically over the past couple of years. When it comes to data that’s not necessarily on the critical path, how does an enterprise make sure that it is not spending more than it has to?
JP: We are probably not a good representational example of that because our business is analyzing the data. There is almost no price we won’t pay to get a better answer faster, because we can price that into the products we produce. The challenge we face is that the tools don’t always work properly at the edge of the envelope. This is a problem for hardware as well as software. A lot of the vendors stop testing their applications at about 80 percent or 85 percent of their theoretical capability. We routinely run them at 110 percent of their theoretical capability, and they break. I don’t mind making tactical justifications for technologies that I expect to replace quickly. I do that all the time. But having done that, I want the damn thing to work. Too often, we’ve discovered that it doesn’t work.
PwC: Are you forced to use technologies that have matured because of a wariness of things on the absolute edge?
JP: My dilemma is that things that are known to work usually don’t scale to what we need—for speed or full capacity. I must spend some time, energy, and dollars betting on things that aren’t mature yet, but that can be sufficiently generalized architecturally. If the one I pick doesn’t work, or goes away, I can fit something else into its place relatively easily. That’s why we like appliances. As long as they are well behaved at the network layer and have a relatively generalized or standards-based business semantic interface, it doesn’t matter if I have to unplug one in 18 months or two years because something better came along. I can’t do that for everything, but I can usually afford to do it in the areas where I have no established commercial alternative.
“I have a bunch of smart engineers, but I don’t want them spending their day being the technology support environment for what should be a product in our architecture.” —John Parkinson of TransUnion
PwC: What are you using in place of something like Hadoop?

JP: Essentially, we use brute force. We use Ab Initio, which is a very smart brute-force parallelization scheme. I depend on certain capabilities in Ab Initio to parallelize the ETL [extract, transform, and load] in such a way that I can throw more cores at the problem.
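The approach Parkinson describes, partitioning record-at-a-time ETL transforms across cores, can be sketched with plain Python multiprocessing as a stand-in (Ab Initio's actual mechanism is proprietary; the record format here is invented):

```python
from multiprocessing import Pool

def transform(record):
    """The T in ETL: normalize one record independently of the others."""
    name, amount = record.split("|")
    return (name.strip().upper(), int(amount))

def run_etl(records, workers=4):
    # Because each record transforms independently, the work partitions
    # cleanly across a pool of processes: more cores, more throughput.
    with Pool(workers) as pool:
        return pool.map(transform, records, chunksize=1000)

if __name__ == "__main__":
    raw = [" alice |100", "bob |250", " carol|75"]
    print(run_etl(raw))  # [('ALICE', 100), ('BOB', 250), ('CAROL', 75)]
```

This works because the transform is embarrassingly parallel; as Parkinson notes later in the interview, the hard limit becomes moving the data to and from the workers, not the computation itself.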
PwC: Much of the data you see is transactional. Is it all structured data, or are you also mining text?

JP: We get essentially three kinds of data. We get accounts receivable data from credit loan issuers. That’s the record of what people actually spend. We get public record data, such as bankruptcy records, court records, and liens, which are semi-structured text. And we get other data, which is whatever shows up, and it’s generally hooked together around a well-understood set of identifiers. But this data is essentially free—we don’t pay for it. It’s also very noisy. So we have to spend computational time figuring out whether the data we have is right, because we must find a place to put it in the working data sets that we build.
At TransUnion, we suck in 100 million updates a day for the credit files. We update a big data warehouse that contains all the credit and related data. And then every day we generate somewhere between 1 and 20 operational data stores, which is what we actually run the business on. Our products are joined between what we call indicative data, the information that identifies you as an individual; structured data, which is derived from transactional records; and unstructured data that is attached to the indicative. We build those products on the fly because the data may change every day, sometimes several times a day.

One challenge is how to accurately find the right place to put the record. For example, we get a Joe Smith at 13 Main Street and a Joe Smith at 31 Main Street. Are those two different Joe Smiths, or is that a typing error? We have to figure that out 100 million times a day using a bunch of custom pattern-matching and probabilistic algorithms.
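One textbook building block for this kind of probabilistic matching is string edit distance (TransUnion's actual algorithms are custom and unpublished; this shows only the standard primitive):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# "13 Main Street" vs "31 Main Street" are two substitutions apart, so a
# matcher might treat them as a likely transposition typo, not two people.
print(edit_distance("13 Main Street", "31 Main Street"))  # 2
print(edit_distance("13 Main Street", "500 Elm Avenue"))  # clearly farther
```

A production matcher would weight this signal against others (name, date of birth, account identifiers) before deciding whether the two records describe one Joe Smith or two.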
PwC: Of the three kinds of data, which is the most challenging?

JP: We have two kinds of challenges. The first is driven purely by the scale at which we operate. We add roughly half a terabyte of data per month to the credit file. Everything we do has challenges related to scale, updates, speed, or database performance. The vendors both love us and hate us. But we are where the industry is going—where everybody is going to be in two to five years. We are a good leading indicator, but we break their stuff all the time. A second challenge is the unstructured part of the data, which is increasing.
PwC: It’s more of a challenge to deal with the unstructured stuff because it comes in various formats and from various sources, correct?

JP: Yes. We have 83,000 data sources. Not everyone provides us with data every day. It comes in about 4,000 formats, despite our data interchange standards. And, to be able to process it fast enough, we must convert all data into a single interchange format that is the representation of what we use internally. Complex computer science problems are associated with all of that.
PwC: Are these the kinds of data problems that businesses in other industries will face in three to five years?

JP: Yes, I believe so.
PwC: What are some of the other problems you think will become more widespread?

JP: Here are some simple practical examples. We have 8.5 petabytes of data in the total managed environment. Once you go seriously above 100 terabytes, you must replace the storage fabric every four or five years. Moving 100 terabytes of data becomes a huge material issue and takes a long time. You do get some help from improved interconnect speed, but the arrays go as fast as they go for reads and writes and you can’t go faster than that. And businesses down the food chain are not accustomed to thinking about refresh cycles that take months to complete. Now, a refresh cycle of PCs might take months to complete, but any one piece of it takes only a couple of hours. When I move data from one array to another, I’m not done until I’m done. Additionally, I have some bugs and new vulnerabilities to deal with.
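A back-of-envelope calculation shows why. Assuming an optimistic sustained 10 Gb/s effective rate (an assumed figure; array read and write limits are usually the tighter constraint), a single 100 terabyte pass takes roughly a day of uninterrupted transfer, before any verification or cutover:

```python
# Back-of-envelope: why moving 100 TB is a project, not a task.
terabytes = 100
bytes_total = terabytes * 10**12
rate_bytes_per_s = 10 * 10**9 / 8   # assumed sustained 10 Gb/s -> 1.25 GB/s
hours = bytes_total / rate_bytes_per_s / 3600
print(f"{hours:.0f} hours of sustained, error-free transfer")  # 22 hours
```

Scale that to 8.5 petabytes, with real-world stalls, checksums, and the live workload sharing the same arrays, and months-long refresh cycles stop looking surprising.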
Today, we don’t have a backup problem at TransUnion because we do incremental forever backup. However, we do have a restore problem. To restore a material amount of data, which we very occasionally need to do, takes days in some instances because the physics of the technology we use won’t go faster than that. The average IT department doesn’t worry about these problems. But take the amount of data an average IT department has under management, multiply it by a single decimal order of magnitude, and it starts to become a material issue.
We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It. For now, I don’t have a computational problem, but if I can’t shift the trend line on Store It and Move It, I will have a computational problem within a few years. To perform the computations in useful time, I must parallelize how I compute. Above a certain point, the parallelization breaks because I can’t move the data further.
PwC: Cloudera [a vendor offering a Hadoop distribution] would say bring the computation to the data.

JP: That works only for certain kinds of data. We already do all of that large-scale computation on a file system basis, not on a database basis. And we spend compute cycles to compress the data so there are fewer bits to move, then decompress the data for computation, and recompress it so we have fewer bits to store.
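The compress, move, decompress cycle trades CPU for bytes, and it is easy to demonstrate with a stock codec (zlib as a generic stand-in; TransUnion's codecs are not public, and the sample record is invented):

```python
import zlib

# Trading compute cycles for bytes: compress before moving or storing,
# decompress to operate. Repetitive record-style data compresses well,
# though the win is always workload-dependent.
records = b"name=joe smith;addr=13 main st;score=720\n" * 10000
packed = zlib.compress(records, 6)   # level 6: the speed/ratio middle ground

ratio = len(records) / len(packed)
print(f"{len(records)} bytes -> {len(packed)} bytes ({ratio:.0f}x smaller)")

assert zlib.decompress(packed) == records   # lossless round trip
```

Parkinson's wish for "computationally more-efficient" algorithms is about exactly this knob: higher compression levels shrink Store It and Move It further, but the extra CPU time eventually eats the savings.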
What we have discovered—because I run the fourth largest commercial GPFS [general parallel file system, a distributed computing file system developed by IBM] cluster in the world—is that once you go beyond a certain size, the parallelization management tools break. That’s why I keep telling people that Hadoop is not what Google runs in production. Maybe the Google guys have solved this, but if they have, they aren’t telling me how. •
“We would like to see computationally more-efficient compression algorithms, because my two big cost pools are Store It and Move It.” —John Parkinson of TransUnion
Creating a cost-effective Big Data strategy

Disney’s Bud Albers, Scott Thompson, and Matt Estes outline an agile approach that leverages open-source and cloud technologies.

Interview conducted by Galen Gruman and Alan Morrison

Bud Albers joined what is now the Disney Technology Shared Services Group two years ago as executive vice president and CTO. His management team includes Scott Thompson, vice president of architecture, and Matt Estes, principal data architect. The Technology Shared Services Group, located in Seattle, has a heritage dating back to the late 1990s, when Disney acquired Starwave and Infoseek.

The group supports all the Disney businesses ($38 billion in annual revenue), managing the company’s portfolio of Web properties. These include properties for the studio, store, and park; ESPN; ABC; and a number of local television stations in major cities.

In this interview, Albers, Thompson, and Estes discuss how they’re expanding Disney’s Web data analysis footprint without incurring additional cost by implementing a Hadoop cluster. Albers and team freed up budget for this cluster by virtualizing servers and eliminating other redundancies.
PwC: Disney is such a diverse company, and yet there clearly is lots of potential for synergies and cross-fertilization. How do you approach these opportunities from a data perspective?

BA: We try to understand the best way to work with and to provide services to the consumer in the long term. We have some businesses that are very data intensive, and then we have some that are less so because of their consumer audience. One of the challenges always is how to serve both kinds of businesses and do so in ways that make sense. The sell-to relationships extend from the studio out to the distribution groups and the theater chains. If you're selling to millions, you're trying to understand the different audiences and how they connect.
One of the things I've been telling my folks from a data perspective is that you don't send terabytes one way to be mated with a spreadsheet on the other side, right? We're thinking through those kinds of pieces and trying to figure out how we move down a path. The net is that working with all these businesses gives us a diverse set of requirements, as you might imagine. We're trying to stay ahead of where all the businesses are.

In that respect, the questions I'm asking are, how do we get more agile, and how do we do it in a way that handles all the data we have? We must consider all of the new form factors being developed, all of which will generate lots of data. A big question is, how do we handle this data in a way that makes cost sense for the business and provides us an increased level of agility?
18 PricewaterhouseCoopers Technology Forecast
We hope to do in other areas what we've done with content distribution networks [CDNs]. We've had a tremendous amount of success with the CDN marketplace by standardizing, by staying in the middle of the road and not going to Akamai proprietary extensions, and by creating a dynamic marketplace. If we get a new episode of LOST, we can start streaming it, and I can be streaming 80 percent on Akamai and 20 percent on Level 3. Then we can decide we're going to turn it back, and I'm going to give 80 percent to Limelight and 20 percent to Level 3. We can do that dynamically.
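The dynamic split Albers describes amounts to weighted routing of viewer sessions across providers. A minimal sketch of that idea, not Disney's actual system; the provider weights come from his example and the routing function is invented:

```python
import random

random.seed(42)  # deterministic demo

def route_stream(weights: dict) -> str:
    """Pick a CDN for one viewer session in proportion to the current weights."""
    providers = list(weights)
    return random.choices(providers, weights=[weights[p] for p in providers])[0]

# Start the episode 80 percent on Akamai, 20 percent on Level 3...
weights = {"Akamai": 0.8, "Level 3": 0.2}
sessions = [route_stream(weights) for _ in range(10_000)]
print({p: sessions.count(p) for p in weights})

# ...then turn it back dynamically: 80 percent Limelight, 20 percent Level 3.
weights = {"Limelight": 0.8, "Level 3": 0.2}
```

Because each session consults the current weight table, shifting traffic between providers is just an update to that table rather than a redeployment.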
PwC: What are the other main strengths of the Technology Shared Services Group at Disney?

BA: When I came here a couple of years ago, we had some very good core central services. If you look at the true definition of a cloud, we had the very early makings of one—shared central services around registration, for example. On Disney, on ABC, or on ESPN, if you have an ID, it works on all the Disney properties. If you have an ESPN ID, you can sign in to KGO in San Francisco, and it will work. It's all a shared registration system. The advertising system we built is shared. The marketing systems we built are shared—all the analytics collection, all those things are centralized. Those things that are common are shared among all the sites.

Those things that are brand specific are built by the brands, and the user interface is controlled by the brands, so each of the various divisions has a head of engineering on the Web site who reports to me. Our CIO worries about it from the firewall back; I worry about it from the firewall to the living room and the mobile device. That's the way we split up the world, if that makes sense.
PwC: How do you link the data requirements of the central core with those that are unique to the various parts of the business?

BA: It's more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today. We typically pull down most of the analytics and add things in, and it's a constant struggle to answer the question, "Do we have everything?" We're headed toward this notion of one data element at a time, aggregate, and queue up the aggregate. It can get a little bit crazy because you wind up needing to pull the data in and run it through that whole food chain, and it may or may not have lasting value.

It may have only a temporal level of importance, and so we're trying to figure out how to better handle that. An awful lot of what we do in the data collection is pull it in, lay it out so it can be reported on, and/or push it back into the businesses, because the Web is evolving rapidly from a standalone thing to an integral part of how you do business.
"It's more art than science. The business units must generate revenue, and we must provide the core services. How do you strike that balance? Ownership is a lot more negotiated on some things today."
—Bud Albers
Creating a cost-effective Big Data strategy 19
PwC: Hadoop seems to suggest a feasible way to analyze data that has only temporal importance. How did you get to the point where you could try something like a Hadoop cluster?

BA: Guys like me never get called when it's all pretty and shiny. The Disney unit I joined obviously has many strengths, but when I was brought on, there was a cost growth situation. The volume of the aggregate activity growth was 17 percent. Our server growth at the time was 30 percent. So we were filling up data centers, but we were filling them with CPUs that weren't being used. My question was, how can you go to the CFO and ask for a lot of money to fill a data center with capital assets that you're going to use only 5 percent of?

CPU utilization isn't the only measure, but it's the most prominent one. To study and understand what was happening, we put monitors and measures on our servers and reported peak CPU utilization at five-minute intervals across our server farm. We found that on roughly 80 percent of our servers, we never got above 10 percent utilization in a monthly period.
Our first step to address that problem was virtualization. At this point, about 49 percent of our data center is virtual. Our virtualization effort had a sizable impact on cost. Dollars fell out because we quit building data centers and doing all kinds of redundant shuffling. We didn't have to lay off people. We changed some of our processes, and we were able to shift our growth curve from plus 27 to minus 3 on the shared service.

We call this our D-Cloud effort. Another step in this effort was moving to a REST [Representational State Transfer] and JSON [JavaScript Object Notation] data exchange standard, because we knew we had to hit all these different devices and create some common APIs [application programming interfaces] in the framework. One of the very first things we put in place was a central logging service for all the events. These event logs can be streamed into one very large data set. We can then use the Hadoop and MapReduce paradigm to go after that data.
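Going after a central event log with the MapReduce paradigm looks like this in miniature. The log format and service names are invented, and the three phases run here in one process purely for illustration; a real job would distribute them across a Hadoop cluster:

```python
from collections import defaultdict

# Hypothetical central-log records: timestamp, service, event type.
log_lines = [
    "2010-03-01T00:00:01 registration sign_in",
    "2010-03-01T00:00:02 advertising impression",
    "2010-03-01T00:00:03 registration sign_in",
    "2010-03-01T00:00:04 registration sign_up",
]

def map_phase(line):
    """Emit a (service, 1) pair for every event in the raw log."""
    _, service, _ = line.split()
    yield service, 1

def reduce_phase(service, counts):
    """Sum the counts for one key."""
    return service, sum(counts)

# Shuffle step: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        grouped[key].append(value)

results = dict(reduce_phase(k, v) for k, v in grouped.items())
print(results)  # {'registration': 3, 'advertising': 1}
```

The raw log stays untouched; the job only derives an aggregate from it, which fits the "temporal importance" point above: derived views can be thrown away and recomputed.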
PwC: How does the central logging service fit into your overall strategy?

ST: As we looked at it, we said, it's not just about virtualization. To be able to burst and do these other things, you need to build a bunch of core services. The initiative we're working on now is to build some of those core services around managing configuration. This project takes the foundation we laid with virtualization and a REST and JSON data exchange standard, and adds those core services that enable us to respond to the marketplace as it develops. Piping that data back to a central repository helps you to analyze it, understand what's going on, and make better decisions on the basis of what you learned.
PwC: How do you evolve so that the data strategy is really served well, so that it's more of a data-driven approach in some ways?

ME: On one side, you have a very transactional OLTP [online transaction processing] kind of world, RDBMSs [relational database management systems], and major vendors that we're using there. On the other side of it, you have traditional analytical warehousing. And where we've slotted this [Hadoop-style data] is in the middle with the other operational data. Some of it is derived from transactional data, and some has been crafted out of analytical data. There's a freedom that's derived from blending these two kinds of data.

Our centralized logging service is an example. As we look at continuing to drive down costs to drive up efficiency, we can begin to log a large amount of this data at a price point that we have not been able to achieve by scaling up RDBMSs or using warehousing appliances.

Then the key will be putting an expert system in place. That will give us the ability to really understand what's going on in the actual operational environment. We're starting to move again toward lower utilization trajectories. We need to scale the infrastructure back and get that utilization level up to the threshold.
PwC: This kind of information doesn't go in a cube. Not that data cubes are going away, but cubes are fairly well known now. The value you can create is exactly what you said, understanding the thinking behind it and the exploratory steps.

ST: We think storing the unstructured data in its raw format is what's coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer. Then you MapReduce the input, and you may send that off to a data cube and a thing that someone can dig around in, but you keep the data in its raw format and pull out only what you need.
BA: The wonderful thing about where we're headed right now is that data analysis used to be this giant, massive bet that you had to place up front, right? No longer. Now, I pull Hadoop off of the Internet, first making sure that we're compliant from a legal perspective with licensing and so forth. After that's taken care of, you begin to prototype. You begin to work with it against common hardware. You begin to work with it against stuff you otherwise might throw out. Rather than, I'm going to go spend how much for Teradata?

We're using the basic premise of the cloud, and we're using those techniques of standardizing the interface to virtualize and drive cost out. I'm taking that cost savings and returning some of it to the business, but then reinvesting some in new capabilities while the cost curve is stabilizing.
ME: Refining some of this reinvestment in new capabilities doesn't have to be put in the category of traditional "$5 million projects" companies used to think about. You can make significant improvements with reinvestments of $200,000 or even $50,000.

BA: It's then a matter of how you're redeploying an investment in resources that you've already made as a corporation. It's a matter of now prioritizing your work and not changing the bottom-line trajectory in a negative fashion with a bet that may not pay off. I can try it, and I don't have to get great big governance-based permission to do it, because it's not a bet of half the staff and all of this stuff. It's, OK, let's get something on the ground, let's work with the business unit, let's pilot it, let's go somewhere where we know we have a need, let's validate it against this need, and let's make sure that it's working. It's not something that must go through an RFP [request for proposal] and standard procurement. I can move very fast. •
"We think storing the unstructured data in its raw format is what's coming. In a Hadoop environment, instead of bringing the data back to your warehouse, you figure out what question you want to answer."
—Scott Thompson
Building a bridge to the rest of your data

How companies are using open-source cluster-computing techniques to analyze their data.

By Alan Morrison
As recently as two years ago, the International Supercomputing Conference (ISC) agenda included nothing about distributed computing for Big Data—as if projects such as the Google Cluster Architecture, a low-cost, distributed computing design that enables efficient processing of large volumes of less-structured data, didn't exist. In a May 2008 blog post, Brough Turner noted the omission, pointing out that Google had harnessed as much as 100 petaflops1 of computing power, compared to a mere 1 petaflop in the new IBM Roadrunner, a supercomputer profiled in EE Times that month. "Have the supercomputer folks been bypassed and don't even know it?" Turner wondered.2

Turner, co-founder and CTO of Ashtonbrooke.com, a startup in stealth mode, had been reading Google's research papers and remarking on them in his blog for years. Although the broader business community had taken little notice, some companies were following in Google's wake. Many of them were Web companies that had data processing scalability challenges similar to Google's.
Yahoo, for example, abandoned its own data architecture and began to adopt one along the lines pioneered by Google. It moved to Apache Hadoop, an open-source, Java-based distributed file system based on Google File System and developed by the Apache Software Foundation; it also adopted MapReduce, Google's parallel programming framework. Yahoo used these and other open-source tools it helped develop to crawl and index the Web. After implementing the architecture, it found other uses for the technology and has now scaled its Hadoop cluster to 4,000 nodes.

By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage. Big Data refers to data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis by traditional means. Many who are familiar with these new methods are convinced that Hadoop clusters will enable cost-effective analysis of Big Data, and these methods are now spreading beyond companies that mine the public Web as part of their business.
By early 2010, Hadoop, MapReduce, and related open-source techniques had become the driving forces behind what O'Reilly Media, The Economist, and others in the press call Big Data and what vendors call cloud storage.
Building a bridge to the rest of your data 23
"Hadoop will process the data set and output a new data set, as opposed to changing the data set in place." —Amr Awadallah of Cloudera
What are these methods and how do they work? This article looks at the architecture and tools surrounding Hadoop clusters with an eye toward what about them will be useful to mainstream enterprises during the next three to five years. We focus on their utility for less-structured data.
Hadoop clusters

Although cluster computing has been around for decades, commodity clusters are more recent, starting with UNIX- and Linux-based Beowulf clusters in the mid-1990s. These banks of inexpensive servers networked together were pitted against expensive supercomputers from companies such as Cray—the kind of computers that government agencies, such as the National Aeronautics and Space Administration (NASA), bought. It was no accident that NASA pioneered the development of Beowulf.3

Hadoop extends the value of commodity clusters, making it possible to assemble a high-end computing cluster at a low-end price. A central assumption underlying this architecture is that some nodes are bound to fail when computing jobs are distributed across hundreds or thousands of nodes. Therefore, one key to success is to design the architecture to anticipate and recover from individual node failures.4
Other goals of the Google Cluster Architecture and its expression in open-source Hadoop include:

• Price/performance over peak performance—The emphasis is on optimizing aggregate throughput; for example, sorting functions to rank the occurrence of keywords in Web pages. Overall sorting throughput is high. In each of the past three years, Yahoo's Hadoop clusters have won Gray's terabyte sort benchmarking test.5

• Software tolerance for hardware failures—When a failure occurs, the system responds by transferring the processing to another node, a critical capability for large distributed systems. As Roger Magoulas, research director for O'Reilly Media, says, "If you are going to have 40 or 100 machines, you don't expect your machines to break. If you are running something with 1,000 nodes, stuff is going to break all the time."

• High compute power per query—The ability to scale up to thousands of nodes implies the ability to throw more compute power at each query. That ability, in turn, makes it possible to bring more data to bear on each problem.

• Modularity and extensibility—Hadoop clusters scale horizontally with the help of a uniform, highly modular architecture.
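The failure-tolerance goal above can be illustrated with a toy scheduler that simply puts a failed node's task back on the queue. This is a sketch of the idea only, not Hadoop's actual JobTracker logic; the failure rate and node names are invented:

```python
import random

random.seed(7)  # deterministic demo

def run_job(tasks, nodes, failure_rate=0.2):
    """Assign each task to a node; on a simulated node failure, retry elsewhere."""
    completed = {}
    pending = list(tasks)
    while pending:
        task = pending.pop()
        node = random.choice(nodes)
        if random.random() < failure_rate:  # node died mid-task
            pending.append(task)            # reschedule automatically, no operator needed
        else:
            completed[task] = node
    return completed

done = run_job(tasks=[f"block-{i}" for i in range(100)],
               nodes=[f"node-{i}" for i in range(10)])
assert len(done) == 100  # every task completes despite individual failures
```

Because recovery is built into the scheduling loop, adding more (unreliable) machines increases throughput rather than fragility, which is the economic point of commodity clusters.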
Hadoop isn't intended for all kinds of workloads, especially not those with many writes. It works best for read-intensive workloads. These clusters complement, rather than replace, high-performance computing (HPC) and relational data systems. They don't work well with transactional data or records that require frequent updating. "Hadoop will process the data set and output a new data set, as opposed to changing the data set in place," says Amr Awadallah, vice president of engineering and CTO of Cloudera, which develops a version of Hadoop.

A data architecture and a software design that are frugal with network and disk resources are responsible for the price/performance ratio of Hadoop clusters. In Awadallah's words, "You move your processing to where your data lives." Each node has its own processing and storage, and the data is divided and processed locally in blocks sized for the purpose. This concept of localization makes it possible to use inexpensive serial advanced technology attachment (SATA) hard disks—the kind used in most PCs and servers—and Gigabit Ethernet for most network interconnections. (See Figure 1.)
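"Move your processing to where your data lives" can be sketched as a placement rule: prefer a node that already holds a replica of the block. The block map and node names below are invented, and real Hadoop schedulers also weigh rack locality and load:

```python
# Hypothetical map of which nodes hold a replica of each 64MB block.
block_locations = {
    "block-1": ["node-a", "node-c"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-a", "node-b"],
}

def schedule(block, free_nodes):
    """Run the task on a node that holds the block, avoiding a network copy."""
    for node in block_locations[block]:
        if node in free_nodes:
            return node, "local read"
    # Fall back: ship the block over the network to whichever node is free.
    return free_nodes[0], "remote read"

print(schedule("block-1", ["node-c", "node-b"]))  # ('node-c', 'local read')
print(schedule("block-2", ["node-a"]))            # ('node-a', 'remote read')
```

Local reads come off inexpensive SATA disks at each node, so the network only carries the (much smaller) results, which is what keeps Gigabit Ethernet sufficient.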
[Figure 1 shows a client connected through a 1000Mbps switch to racks of task tracker/DataNode machines, plus a JobTracker and a NameNode, on 100Mbps rack switches. Typical node setup: 2 quad-core Intel Nehalem processors, 24GB of RAM, 12 1TB SATA disks (non-RAID), and a 1 Gigabit Ethernet card; cost per node: $5,000; effective file space per node: 20TB. Claimed benefits: linear scaling at $250 per user TB (versus $5,000–$100,000 for alternatives); compute placed near the data and fewer writes limit networking and storage costs; modularity and extensibility.]

Figure 1: Hadoop cluster layout and characteristics
Source: IBM, 2008, and Cloudera, 2010
"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces." —Chris Wensel of Concurrent
The result is less-expensive large-scale distributed computing and parallel processing, which make possible an analysis that is different from what most enterprises have previously attempted. As author Tom White points out, "The ability to run an ad hoc query against your whole data set and get the results in a reasonable time is transformative."6

The cost of this capability is low enough that companies can fund a Hadoop cluster from existing IT budgets. When it decided to try Hadoop, the Walt Disney Co.'s Technology Shared Services Group took advantage of the increased server utilization it had already achieved from virtualization. As of March 2010, with nearly 50 percent of its servers virtualized, Disney had 30 percent server image growth annually but 30 percent less growth in physical servers. It was then able to set up a multi-terabyte cluster with Hadoop and other free open-source tools, using servers it had planned to retire. The group estimates it spent less than $500,000 on the entire project. (See the article, "Tapping into the power of Big Data," on page 04.)
These clusters are also transformative because cloud providers can offer them on demand. Instead of using their own infrastructures, companies can subscribe to a service such as Amazon's or Cloudera's distribution on the Amazon Elastic Compute Cloud (EC2) platform.

The EC2 platform was crucial in a well-known use of cloud computing on a Big Data project that also depended on Hadoop and other open-source tools. In 2007, The New York Times needed to quickly assemble the PDFs of 11 million articles from 4 terabytes of scanned images. Amazon's EC2 service completed the job in 24 hours after setup, a feat that received widespread attention in blogs and the trade press.

Mostly overlooked in all that attention was the use of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Using these open-source tools, after studying how-to blog posts from others, Times senior software architect Derek Gottfrid developed and ran code in parallel across multiple Amazon machines.7

"Amazon supports Hadoop directly through its Elastic MapReduce application programming interfaces [APIs]," says Chris Wensel, founder of Concurrent, which developed Cascading. (See the discussion of Cascading later in this article.) "I regularly work with customers to boot up 200-node clusters and process 3 terabytes of data in five or six hours, and then shut the whole thing down. That's extraordinarily powerful."
The Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) and the MapReduce parallel programming framework are at the core of Apache Hadoop. Comparing HDFS and MapReduce to Linux, Awadallah says that together they're a "data operating system." This description may be overstated, but there are similarities to any operating system. Operating systems schedule tasks, allocate resources, and manage files and data flows to fulfill the tasks. HDFS does a distributed computing version of this. "It takes care of linking all the nodes together to look like one big file and job scheduling system for the applications running on top of it," Awadallah says.

HDFS, like all Hadoop tools, is Java based. An HDFS cluster contains two kinds of nodes:

• A single NameNode that logs and maintains the necessary metadata in memory for distributed jobs

• Multiple DataNodes that create, manage, and process the 64MB blocks that contain pieces of Hadoop jobs, according to the instructions from the NameNode
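The division of labor between the two node types can be sketched as a lookup: the NameNode holds only the file-to-block metadata in memory, while the DataNodes hold the block payloads. File names, block IDs, and placements here are invented for illustration:

```python
# NameNode side: in-memory metadata only -- which blocks make up each file,
# and which DataNodes hold a replica of each block.
namenode = {"/logs/clickstream.log": ["blk_1", "blk_2"]}
block_map = {
    "blk_1": ["datanode-1", "datanode-3"],
    "blk_2": ["datanode-2", "datanode-3"],
}

# DataNode side: the block payloads themselves (64MB each in real HDFS).
datanodes = {
    "datanode-1": {"blk_1": b"first block of log data"},
    "datanode-2": {"blk_2": b"second block of log data"},
    "datanode-3": {"blk_1": b"first block of log data",
                   "blk_2": b"second block of log data"},
}

def read_file(path):
    """Ask the NameNode where the blocks are, then read them from DataNodes."""
    data = b""
    for block_id in namenode[path]:
        holder = block_map[block_id][0]  # pick any replica holder
        data += datanodes[holder][block_id]
    return data

print(read_file("/logs/clickstream.log"))
```

Because the NameNode answers only metadata questions, the bulk data transfer happens directly between clients and DataNodes, keeping the single NameNode from becoming a bandwidth bottleneck.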
HDFS uses multi-gigabyte file sizes to reduce the management complexity of lots of files in large data volumes. It typically writes each copy of the data once, adding to files sequentially. This approach simplifies the task of synchronizing data and reduces disk and bandwidth usage.

Equally important are fault tolerance within the same disk and bandwidth usage limits. To accomplish fault tolerance, HDFS creates three copies of each data block, typically storing two copies in the same rack. The system goes to another rack only if it needs the third copy. Figure 2 shows a simplified depiction of HDFS and its data block copying method.
[Figure 2 depicts a client working with a NameNode, which keeps only metadata mapping each file to its list of blocks, and with multiple DataNodes, across which the numbered 64MB blocks of each file are replicated.]
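The replica-placement policy described above, two copies in one rack and the third in another, can be sketched directly. Rack and node names are assumptions; real HDFS also randomizes its choices and tracks node health:

```python
def place_replicas(block_id, racks):
    """Place three copies: two in one rack, the third in a different rack."""
    rack_names = list(racks)
    first_rack, other_rack = rack_names[0], rack_names[1]
    return [
        (first_rack, racks[first_rack][0]),  # copy 1
        (first_rack, racks[first_rack][1]),  # copy 2: same rack, cheap intra-rack write
        (other_rack, racks[other_rack][0]),  # copy 3: different rack, survives rack loss
    ]

racks = {"rack-1": ["node-1", "node-2"], "rack-2": ["node-3", "node-4"]}
placement = place_replicas("blk_1", racks)
print(placement)
assert len({rack for rack, _ in placement}) == 2  # exactly two racks used
```

Keeping two copies in one rack limits cross-rack traffic during writes, while the third copy guards against losing an entire rack (a switch or power failure).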
HDFS does not perform tasks such as changing specific numbers in a list or other changes on parts of a database. This limitation leads some to assume that HDFS is not suitable for structured data. "HDFS was never designed for structured data and therefore it's not optimal to perform queries on structured data," says Daniel Abadi, assistant professor of computer science at Yale University. Abadi and others at Yale have done performance testing on the subject, and they have created a relational database alternative to HDFS called HadoopDB to address the performance issues they identified.8
Some developers are structuring data in ways that are suitable for HDFS; they're just doing it differently from the way relational data would be structured. Nathan Marz, a developer at BackType, a company that offers a search engine for social media buzz, uses schemas to ensure consistency and avoid data corruption. "A lot of people think that Hadoop is meant for unstructured data, like log files," Marz says. "While Hadoop is great for log files, it's also fantastic for strongly typed, structured data." For this purpose, Marz uses Thrift, which was developed by Facebook for data translation and serialization purposes.9 (See the discussion of Thrift later in this article.) Figure 3 illustrates a typical Hadoop data processing flow that includes Thrift and MapReduce.
Figure 2: The Hadoop Distributed File System, or HDFS
Source: Apache Software Foundation, IBM, and PricewaterhouseCoopers, 2008
[Figure 3 depicts less-structured input data (log files, messages, images) flowing through input applications such as Cascading, Thrift, Zookeeper, and Pig into core Hadoop data processing, where jobs are mapped over 64MB blocks and reduced into results that feed output applications such as mashups, RDBMS apps, and BI systems.]

Figure 3: Hadoop ecosystem overview
Source: PricewaterhouseCoopers, derived from Apache Software Foundation and Dion Hinchcliffe, 2010
MapReduce

MapReduce is the base programming framework for Hadoop. It often acts as a bridge between HDFS and tools that are more accessible to most programmers. According to those at Google who developed the tool, “it hides the details of parallelization” and the other nuts and bolts of HDFS.10

MapReduce is a layer of abstraction, a way of managing a sea of details by creating a layer that captures and summarizes their essence. That doesn’t mean it is easy to use. Many developers choose to work with another tool, yet another layer of abstraction on top of it. “I avoid using MapReduce directly at all cost,” Marz says. “I actually do almost all my MapReduce work with a library called Cascading.”

The terms “map” and “reduce” refer to steps the tool takes to distribute, or map, the input for parallel processing, and then reduce, or aggregate, the processed data into output files. (See Figure 4.) MapReduce works with key-value pairs. Frequently with Web data, the keys consist of URLs and the values consist of Web page content, such as HTML.
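As a back-of-the-envelope illustration of this key-value model, the classic word count can be simulated on a single machine in Python. This is a sketch of the programming model only, not Hadoop’s actual API: the map step emits an intermediate (word, 1) pair for every occurrence, and the reduce step sums the values for each key.

```python
from collections import defaultdict

def map_phase(documents):
    """Emit an intermediate (key, value) pair for every word occurrence."""
    for url, content in documents:   # input keys are URLs, values are page text
        for word in content.split():
            yield word, 1

def reduce_phase(pairs):
    """Aggregate intermediate values by key, as the reduce step does."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

docs = [("http://a.example", "big data big clusters"),
        ("http://b.example", "big clusters")]
counts = reduce_phase(map_phase(docs))   # {'big': 3, 'data': 1, 'clusters': 2}
```

In a real cluster, the framework partitions the intermediate pairs so that all values for a given key arrive at the same reduce task; here the single dictionary plays that role.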
MapReduce’s main value is as a platform with a set of APIs. Before MapReduce, fewer programmers could take advantage of distributed computing. Now that user-accessible tools have been designed, simpler programming is possible on massively parallel systems without the need to adapt the programs as much. The following sections examine some of these tools.
[Figure 4: MapReduce phases. Map tasks read input key-value pairs from the data stores and emit intermediate key-value pairs; a barrier aggregates the intermediate values by output key; reduce tasks then produce the final values for each key. Source: Google, 2004, and Cloudera, 2009]

28 PricewaterhouseCoopers Technology Forecast
“You can code in whatever JVM-based language you want, and then shove that into the cluster.” —Chris Wensel of Concurrent
Cascading

Wensel, who created Cascading, calls it an alternative API to MapReduce, a single library of operations developers can tap. It’s another layer of abstraction that helps bring what programmers ordinarily do in non-distributed environments to distributed computing. With it, he says, “you can code in whatever JVM-based [Java Virtual Machine] language you want, and then shove that into the cluster.”

Wensel wanted to obviate the need for “thinking in MapReduce.” When using Cascading, developers don’t think in key-value pair terms—they think in terms of fields and lists of values called “tuples.” A Cascading tuple is simpler than a database record but acts like one. Each tuple flows through “pipe” assemblies, which are comparable to Java classes. The data flow begins at the source, an input file, and ends with a sink, an output directory. (See Figure 5.)
Rather than approach map and reduce phases large-file by large-file, developers assemble flows of operations using functions, filters, aggregators, and buffers. Those flows make up the pipe assemblies, which, in Marz’s terms, “compile to MapReduce.” In this way, Cascading smooths the bumpy MapReduce terrain so more developers—including those who work mainly in scripting languages—can build flows. (See Figure 6.)
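The flow-of-operations idea can be approximated with plain Python generators. This is an analogy to, not the actual Cascading API; the function names here are invented for illustration. Tuples of named fields flow from a source, through a filter, into an aggregator that acts as the sink.

```python
def source(lines):
    """Source: turn raw input lines into tuples of named fields."""
    for line in lines:
        user, word = line.split(",")
        yield {"user": user, "word": word}

def filter_short(tuples, min_len=4):
    """Filter operation: drop tuples whose 'word' field is too short."""
    return (t for t in tuples if len(t["word"]) >= min_len)

def count_by(tuples, field):
    """Aggregator (and sink here): group tuples by a field and count them."""
    counts = {}
    for t in tuples:
        counts[t[field]] = counts.get(t[field], 0) + 1
    return counts

# Assemble the "pipe": source -> filter -> aggregator
raw = ["alice,hadoop", "bob,pig", "alice,cascading"]
result = count_by(filter_short(source(raw)), "user")   # {'alice': 2}
```

In Cascading proper, an assembly like this is compiled into one or more MapReduce jobs rather than executed in-process.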
[Figure 5: A Cascading assembly. Tuples with field names, [f1, f2, ...], flow from a source (So) through a series of pipes (P) to a sink (Si). Source: Concurrent, 2010]

[Figure 6: Cascading assembly and flow. On the client, pipe assemblies (A) are grouped into a flow; on the cluster, Hadoop translates the flow into a job of MapReduce (MR) steps. Source: Concurrent, 2010]
Some useful tools for MapReduce-style analytics programming

Open-source tools that work via MapReduce on Hadoop clusters are proliferating. Users and developers don’t seem concerned that Google received a patent for MapReduce in January 2010. In fact, Google, IBM, and others have encouraged the development and use of open-source versions of these tools at various research universities.11 A few of the more prominent tools relevant to analytics, and used by developers we’ve interviewed, are listed in the sections that follow.
Clojure

Clojure creator Rich Hickey wanted to combine aspects of C or C#, LISP (for list processing, a language associated with artificial intelligence that’s rich in mathematical functions), and Java. The letters C, L, and J led him to name the language, which is pronounced “closure.” Clojure combines a LISP library with Java libraries. Clojure’s mathematical and natural language processing (NLP) capabilities and the fact that it is JVM based make it useful for statistical analysis on Hadoop clusters. FlightCaster, a commercial-airline-delay-prediction service, uses Clojure on top of Cascading, on top of MapReduce and Hadoop, for “getting the right view into unstructured data from heterogeneous sources,” says Bradford Cross, FlightCaster co-founder.

LISP has attributes that lend themselves to NLP, making Clojure especially useful in NLP applications. Mark Watson, an artificial intelligence consultant and author, says most LISP programming he’s done is for NLP. He considers LISP to be four times as productive for programming as C++ and twice as productive as Java. His NLP code “uses a huge amount of memory-resident data,” such as lists of proper nouns, text categories, common last names, and nationalities.
“Getting the right view into unstructured data from heterogeneous sources.” —Bradford Cross of FlightCaster
With LISP, Watson says, he can load the data once and test multiple times. In C++, he would need to use a relational database and reload each time for a program test. Using LISP makes it possible to create and test small bits of code in an iterative fashion, a major reason for the productivity gains.

This iterative, LISP-like program-programmer interaction with Clojure leads to what Hickey calls “dynamic development.” Any code entered in the console interface, he points out, is automatically compiled on the fly.
Thrift

Thrift, initially created at Facebook in 2007 and then released to open source, helps developers create services that communicate across languages, including C++, C#, Java, Perl, Python, PHP, Erlang, and Ruby. With Thrift, according to Facebook, users can “define all the necessary data structures and interfaces for a complex service in a single short file.”

A more important aspect of Thrift, according to BackType’s Marz, is its ability to create strongly typed data and flexible schemas. Countering the emphasis of the so-called noSQL community on schema-less data, Marz asserts there are effective ways to lightly structure the data in Hadoop-style analysis.

Marz uses Thrift’s serialization features, which turn objects into a sequence of bits that can be stored as files, to create schemas between types (for instance, differentiating between text strings and long, 64-bit integers) and schemas between relationships (for instance, linking Twitter accounts that share a common interest). Structuring the data in this way helps BackType avoid inconsistencies in the data or the need to manually filter for some attributes.

BackType can use required and optional fields to structure the Twitter messages it crawls and analyzes. The required fields can help enforce data type. The optional fields, meanwhile, allow changes to the schema as well as the use of old objects that were created using the old schema.
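A schema of this kind might look like the following Thrift IDL sketch. The struct and field names are hypothetical, not BackType’s actual schema; they simply show typed fields, required versus optional markers, and a struct that models a relationship.

```thrift
// Hypothetical Thrift IDL sketch; names are illustrative only.
struct TwitterAccount {
  1: required string screen_name;  // required fields enforce type and presence
  2: required i64 account_id;      // a 64-bit integer, distinct from a string
  3: optional string location;     // optional fields let the schema evolve;
}                                  // old objects without them still deserialize

struct SharedInterestEdge {        // a relationship between two accounts
  1: required TwitterAccount a;
  2: required TwitterAccount b;
  3: required string interest;
}
```

Because fields are numbered, data serialized under an older version of such a schema can still be read after new optional fields are added, which is the evolution property Marz describes.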
Marz’s use of Thrift to model social graphs like the one in Figure 7 demonstrates the flexibility of the schema for Hadoop-style computing. Thrift essentially enables modularity in the social graph described in the schema. For example, to select a single age for each person, BackType can take into account all the raw age data. It can do this by a computation on the entire data set or a selective computation on only the people in the data set who have new data.
[Figure 7: An example of a social graph modeled using a Thrift schema. Nodes such as Alice (gender female, age 25), Bob (gender male, age 39), and Charlie (gender male, age 22) are linked by relationships defined in Apache Thrift. Source: Nathan Marz, 2010]
BackType doesn’t just work with raw data. It runs a series of jobs that constantly normalize and analyze new data coming in, and then other jobs that write the analyzed data to a scalable random-access database such as HBase or Cassandra.12
Open-source, non-relational data stores

Non-relational data stores have become much more numerous since the Apache Hadoop project began in 2007. Many are open source. Developers of these data stores have optimized each for a different kind of data. When contrasted with relational databases, these data stores lack many design features that can be essential for enterprise transactional data. However, they are often well tailored to specific, intended purposes, and they offer the added benefit of simplicity. Primary non-relational data store types include the following:

• Multidimensional map store—Each record maps a row name, a column name, and a time stamp to a value. Map stores have their heritage in Google’s Bigtable.

• Key-value store—Each record consists of a key, or unique identifier, mapped to one or more values.

• Graph store—Each record consists of elements that together form a graph. Graphs depict relationships. For example, social graphs describe relationships between people. Other graphs describe relationships between objects, between links, or both.

• Document store—Each record consists of a document. Extensible Markup Language (XML) databases, for example, store XML documents.
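A toy version of the first type, a Bigtable-style multidimensional map keyed by row name, column name, and time stamp, can be sketched in a few lines of Python. This is illustrative only, not any product’s API; class and method names are invented.

```python
import time

class MapStore:
    """Toy multidimensional map store: (row, column, time stamp) -> value,
    in the spirit of Bigtable-style stores. Illustrative only."""
    def __init__(self):
        self._cells = {}   # (row, column) -> list of (timestamp, value)

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        self._cells.setdefault((row, column), []).append((ts, value))

    def get(self, row, column):
        """Return the most recent value for a row/column pair."""
        versions = self._cells.get((row, column), [])
        return max(versions)[1] if versions else None

store = MapStore()
store.put("com.example/page1", "contents:html", "<html>v1</html>", ts=1)
store.put("com.example/page1", "contents:html", "<html>v2</html>", ts=2)
latest = store.get("com.example/page1", "contents:html")   # the v2 page
```

Keeping every timestamped version rather than overwriting in place is what distinguishes this model from a plain key-value store.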
Because of their simplicity, map and key-value stores can have scalability advantages over most types of relational databases. (HadoopDB, a hybrid approach developed at Yale University, is designed to overcome the scalability problems associated with relational databases.) Table 1 provides a few examples of the open-source, non-relational data stores that are available.

Table 1: Example open-source, non-relational data stores
Map:       HBase, Hypertable, Cassandra
Key-value: Tokyo Cabinet/Tyrant, Project Voldemort, Redis
Document:  MongoDB, CouchDB, Xindice
Graph:     Resource Description Framework (RDF), Neo4j, InfoGrid
Source: PricewaterhouseCoopers, Daniel Abadi of Yale University, and organization Web sites, 2010
“We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.” —Scott Thompson of Disney
Other related technologies and vendors

A comprehensive review of the various tools created for the Hadoop ecosystem is beyond the scope of this article, but a few of the tools merit brief description here because they’ve been mentioned elsewhere in this issue:

• Pig—A scripting language called Pig Latin, which is a primary feature of Apache Pig, allows more concise querying of data sets “directly from the console” than is possible using MapReduce, according to author Tom White.

• Hive—Hive is designed as “mainly an ETL [extract, transform, and load] system” for use at Facebook, according to Chris Wensel.

• Zookeeper—Zookeeper provides an interface for creating distributed applications, according to Apache.

Big Data covers many vendor niches, and some vendors’ products take advantage of the Hadoop stack or add to its capabilities. (See the sidebar “Selected Big Data tool vendors.”)
Conclusion

Interest in and adoption of Hadoop clusters are growing rapidly. Reasons for Hadoop’s popularity include:

• Open, dynamic development—The Hadoop/MapReduce environment offers cost-effective distributed computing to a community of open-source programmers who’ve grown up on Linux and Java, and scripting languages such as Perl and Python. Some are taking advantage of functional programming language dialects such as Clojure. The openness and interaction can lead to faster development cycles.

• Cost-effective scalability—Horizontal scaling from a low-cost base implies a feasible long-term cost structure for more kinds of data. Scott Thompson, vice president for infrastructure at the Disney Technology Shared Services Group, says, “We established that Hadoop does horizontally scale. This is what’s really exciting, because I’m an RDBMS guy, right? I’ve done that for years, and you don’t get that kind of constant scalability no matter what you do.”

• Fault tolerance—Associated with scalability is the assumption that some nodes will fail. Hadoop and MapReduce are fault tolerant, another reason commodity hardware can be used.

• Suitability for less-structured data—Perhaps most importantly, the methods that Google pioneered, and that Yahoo and others expanded, focus on what Cloudera’s Awadallah calls “complex” data. Although developers such as Marz understand the value of structuring data, most Hadoop/MapReduce developers don’t have a DBMS mentality. They have an NLP mentality, and they’re focused on techniques optimized for large amounts of less-structured information, such as the vast amount of information on the Web.

The methods, cost advantages, and scalability of Hadoop-style cluster computing clear a path for enterprises to analyze the Big Data they didn’t have the means to analyze before. This set of methods is separate from, yet complements, data warehousing. Understanding what Hadoop clusters do and how they do it is fundamental to deciding when and where enterprises should consider making use of them.
Selected Big Data tool vendors

Amazon
Amazon provides a Hadoop framework, which it calls Elastic MapReduce, on its Elastic Compute Cloud (EC2) and S3 storage service.

Appistry
Appistry’s CloudIQ Storage platform offers a substitute for HDFS, one designed to eliminate the single point of failure of the NameNode.

Cloudera
Cloudera takes a Red Hat approach to Hadoop, offering its own distribution on EC2/S3 with management tools, training, support, and professional services.

Cloudscale
Cloudscale’s first product, Cloudcel, marries an Excel-style front end to a back end that’s a modified HDFS. The product is designed to process stored, historical, or streamed data.

Concurrent
Concurrent developed Cascading, for which it offers licensing, training, and support.

Drawn to Scale
Drawn to Scale offers an HBase/HDFS storage platform and Hadoop ecosystem consulting and training.

IBM
IBM’s jStart team offers briefings and workshops on Hadoop pilots. IBM BigSheets acts as an aggregation, analysis, and visualization point for large amounts of Web data.

Microsoft
Microsoft Pivot uses the company’s Deep Zoom technology to provide visual data browsing capabilities for XML files. Azure Table services is in some ways comparable to Bigtable or HBase. (See the interview with Mark Taylor and Ray Velez of Razorfish on page 46.)

ParaScale
ParaScale offers software for enterprises to set up their own public or private cloud storage environments with parallel processing and large-scale data handling capability.
1 FLOPS stands for “floating point operations per second.” Floating point processors use more bits to store each value, allowing more precision and ease of programming than fixed point processors. One petaflop is upwards of one quadrillion floating point operations per second.

2 Brough Turner, “Google Surpasses Supercomputer Community, Unnoticed?” May 20, 2008, http://blogs.broughturner.com/communications/2008/05/google-surpasses-supercomputercommunity-unnoticed.html (accessed April 8, 2010).

3 See, for example, Tim Kientzle, “Beowulf: Linux clustering,” Dr. Dobb’s Journal, November 1, 1998, Factiva Document dobb000020010916dub100045 (accessed April 9, 2010).

4 Luis Barroso, Jeffrey Dean, and Urs Hoelzle, “Web Search for a Planet: The Google Cluster Architecture,” Google Research Publications, http://research.google.com/archive/googlecluster.html (accessed April 10, 2010).

5 See http://sortbenchmark.org/ and http://developer.yahoo.net/blog/ (accessed April 9, 2010).

6 Tom White, Hadoop: The Definitive Guide (Sebastopol, CA: O’Reilly Media, 2009), 4.

7 See Derek Gottfrid, “Self-service, Prorated Super Computing Fun!” The New York Times Open Blog, November 1, 2007, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-supercomputing-fun/ (accessed March 23, 2010) and Bill Snyder, “Cloud Computing: Not Just Pie in the Sky,” CIO, March 5, 2008, Factiva Document CIO0000020080402e4350000 (accessed March 28, 2010).

8 See “HadoopDB” at http://db.cs.yale.edu/hadoopdb/hadoopdb.html (accessed April 11, 2010).

9 Nathan Marz, “Thrift + Graphs = Strong, flexible schemas on Hadoop,” http://nathanmarz.com/blog/schemas-on-hadoop/ (accessed April 11, 2010).

10 Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Google Research Publications, December 2004, http://labs.google.com/papers/mapreduce.html (accessed April 22, 2010).

11 See Dean, et al., US Patent No. 7,650,331, January 19, 2010, at http://www.uspto.gov. For an example of the participation by Google and IBM in Hadoop’s development, see “Google and IBM Announce University Initiative to Address Internet-Scale Computing Challenges,” Google press release, October 8, 2007, http://www.google.com/intl/en/press/pressrel/20071008_ibm_univ.html (accessed March 28, 2010).

12 See the Apache site at http://apache.org/ for descriptions of many tools that take advantage of MapReduce and/or HDFS that are not profiled in this article.
Hadoop’s foray into the enterprise

Cloudera’s Amr Awadallah discusses how and why diverse companies are trying this novel approach.
Interview conducted by Alan Morrison, Bo Parker, and Vinod Baya

Amr Awadallah is vice president, engineering and chief technology officer at Cloudera, a company that offers products and services around Hadoop, an open-source technology that allows efficient mining of large, complex data sets. In this interview, Awadallah provides an overview of Hadoop’s capabilities and how Cloudera customers are using them.
PwC: Were you at Yahoo before coming to Cloudera?

AA: Yes. I was with Yahoo from mid-2000 until mid-2008, starting with the Yahoo Shopping team after selling my company VivaSmart to Yahoo. Beginning in 2003, my career shifted toward business intelligence and analytics at consumer-facing properties such as Yahoo News, Mail, Finance, Messenger, and Search. I had the daunting task of building a very large data warehouse infrastructure that covered all these diverse products and figuring out how to bring them together.

That is when I first experienced Hadoop. Its model of “mine first, govern later” fits in with the well-governed infrastructure of a data mart, so it complements these systems very well. Governance standards are important for maintaining a common language across the organization. However, they do inhibit agility, so it’s best to complement a well-governed data mart with a more agile complex data processing system like Hadoop.
PwC: How did Yahoo start using Hadoop?

AA: In 2005, Yahoo was faced with a business challenge. The cost of creating the Web search index was approaching the revenues being made from the keyword advertising on the search pages. Yahoo Search adopted Hadoop as an economically scalable solution, and worked on it in conjunction with the open-source Apache Hadoop community. Yahoo played a very big role in the evolution of Hadoop to where it is today.

Soon after the Yahoo Search team started using Hadoop, other parts of the company began to see the power and flexibility that this system offers. Today Yahoo uses Hadoop for data warehousing, mail spam detection, news feed processing, and content/ad targeting.
PwC: What are some of the advantages of Hadoop when you compare it with RDBMSs [relational database management systems]?

AA: With Oracle, Teradata, and other RDBMSs, you must create the table and schema first. You say, this is what I’m going to be loading in, these are the types of columns I’m going to load in, and then you load your data. That process can inhibit how fast you can evolve your data model and schemas, and it can limit what you log and track.

“We are not talking about a replacement technology for data warehouses—let’s be clear on this. No customers are using Hadoop in that fashion.”

With Hadoop, it’s the other way around. You load all of your data, such as XML [Extensible Markup Language], tab delimited flat files, Apache log files, JSON [Javascript Object Notation], etc. Then in Hive or Pig [both of which are Hadoop data query tools], you point your metadata toward the file and parse the data on the fly when reading it out. This approach lets you extract the columns that map to the data structure you’re interested in.

Creating the structure on the read path like this can have its disadvantages; however, it gives you the agility and the flexibility to evolve your schema much quicker without normalizing your data first. In general, relational systems are not well suited for quickly evolving complex data types.
Another benefit is retroactive schemas. For example, an engineer launching a new product feature can add the logging for it, and that new data will start flowing directly into Hadoop. Weeks or months later, a data analyst can update their read schema on how to parse this new data. Then they will immediately be able to query the history of this metric since it started flowing in [as opposed to waiting for the RDBMS schema to be updated and the ETL processes to reload the full history of that metric].
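The schema-on-read pattern Awadallah describes can be sketched in Python. This is a single-machine illustration with invented field names, not Hive or Pig syntax: the raw log lines are stored as-is, and the "schema" is just a parsing function applied at read time, which can be updated later to expose fields that were being logged all along.

```python
import json

# Raw tab-delimited log lines, loaded with no schema applied up front.
RAW_LOG = [
    '2010-04-01\talice\tlogin\t{"latency_ms": 12}',
    '2010-04-02\tbob\tsearch\t{"latency_ms": 48}',
]

def read_schema_v1(line):
    """First read schema: only date and user are parsed out."""
    date, user, _rest = line.split("\t", 2)
    return {"date": date, "user": user}

def read_schema_v2(line):
    """Updated later: same raw data, now exposing event and latency too."""
    date, user, event, payload = line.split("\t", 3)
    return {"date": date, "user": user, "event": event,
            "latency_ms": json.loads(payload)["latency_ms"]}

v1 = [read_schema_v1(l) for l in RAW_LOG]
v2 = [read_schema_v2(l) for l in RAW_LOG]  # full history queryable at once
```

Because the raw lines were never discarded or normalized, switching to the second read schema makes the entire history of the new fields queryable immediately, with no reload step.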
PwC: What about the cost advantages?

AA: The cost basis is 10 to 100 times cheaper than other solutions. But it’s not just about cost. Relational databases are really good at what they were designed for, which is running interactive SQL queries against well-structured data. We are not talking about a replacement technology for data warehouses—let’s be clear on this.

No customers are using Hadoop in that fashion. They recognize that the nature of data is changing. Where is the data growing? It’s growing around complex data types. Is a relational container the best and most interesting place to ask questions of complex plus relational data? Probably not, although organizations still need to use, collect, and present relational data for questions that are routine and require, in some cases, a real-time response.
PwC: How have companies benefited from querying across both structured and complex data?

AA: When you query against complex data types, such as Web log files and customer support forums, as well as against the structured data you have already been collecting, such as customer records, sales history, and transactions, you get a much more accurate answer to the question you’re asking. For example, a large credit card company we’ve worked with can identify which transactions are most likely fraudulent and can prioritize which accounts need to be addressed.

PwC: Are the companies you work with aware that this is a totally different paradigm?

AA: Yes and no. The main use case we see is in companies that have a mix of complex data and structured data that they want to query across. Some large financial institutions that we talk to have 10, 20, or even hundreds of Oracle systems—it’s amazing. They have all of these file servers storing XML files or log files, and they want to consolidate all these tables and files onto one platform that can handle both data types so they can run comprehensive queries. This is where Hadoop really shines; it allows companies to run jobs across both data types. •
Revising the CIO’s data playbook

Start by adopting a fresh mind-set, grooming the right talent, and piloting new tools to ride the next wave of innovation.
By Jimmy Guterman
Like pioneers exploring a new territory, a few<br />
enterprises are making discoveries by exploring <strong>Big</strong><br />
<strong>Data</strong>. The terrain is complex and far less structured<br />
than the data CIOs are accustomed to. And it is<br />
growing by exabytes each year. But it is also getting<br />
easier and less expensive to explore and analyze, in<br />
part because s<strong>of</strong>tware tools built to take advantage<br />
<strong>of</strong> cloud computing infrastructures are now available.<br />
Our advice to CIOs: You don’t need to rush, but do<br />
begin to acquire the necessary mind-set, skill set,<br />
and tool kit.<br />
These are still the early days. The prime directive<br />
for any CIO is to deliver value to the business through<br />
technology. One way to do that is to integrate new<br />
technologies in moderation, with a focus on the<br />
long-term opportunities they may yield. Leading<br />
CIOs pride themselves on waiting until a technology<br />
has proven value before they adopt it. Fair enough.<br />
However, CIOs who ignore the <strong>Big</strong> <strong>Data</strong> trends<br />
described in the first two articles risk being<br />
marginalized in the C-suite. As they did with earlier<br />
technologies, including traditional business<br />
intelligence, business unit executives are ready to<br />
seize the <strong>Big</strong> <strong>Data</strong> opportunity and make it their own.<br />
This will be good for their units and their careers,<br />
but it would be better for the organization as a whole<br />
if someone—the CIO is the natural person—drove a<br />
single, central, cross-enterprise <strong>Big</strong> <strong>Data</strong> initiative.<br />
With this in mind, PricewaterhouseCoopers<br />
encourages CIOs to take these steps:<br />
• Start to add the discipline and skill set for <strong>Big</strong> <strong>Data</strong><br />
to your organizations; the people for this may or<br />
may not come from existing staff.<br />
• Set up sandboxes (which you can rent or buy) to<br />
experiment with <strong>Big</strong> <strong>Data</strong> technologies.<br />
• Understand the open-source nature <strong>of</strong> the tools<br />
and how to manage risk.<br />
Enterprises have the opportunity to analyze more<br />
kinds <strong>of</strong> data more cheaply than ever before. It is<br />
also important to remember that <strong>Big</strong> <strong>Data</strong> tools did<br />
not originate with vendors that were simply trying to<br />
create new markets. The tools sprung from a real<br />
need among the enterprises that first confronted<br />
the scalability and cost challenges <strong>of</strong> <strong>Big</strong> <strong>Data</strong>—<br />
challenges that are now felt more broadly. These<br />
pioneers also discovered the need for a wider variety<br />
<strong>of</strong> talent than IT has typically recruited.<br />
Revising the CIO’s data playbook 37
<strong>Big</strong> <strong>Data</strong> lessons from Web companies<br />
Today’s CIO literature is full <strong>of</strong> lessons you can learn<br />
from companies such as Google. Some <strong>of</strong> the<br />
comparisons are superficial because most companies<br />
do not have a Web company’s data complexities and<br />
will never attain the original singleness <strong>of</strong> purpose<br />
that drove Google, for example, to develop <strong>Big</strong> <strong>Data</strong><br />
innovations. But there is no niche where the<br />
development <strong>of</strong> <strong>Big</strong> <strong>Data</strong> tools, techniques, mind-set,<br />
and usage is greater than in companies such as<br />
Google, Yahoo, Facebook, Twitter, and LinkedIn.<br />
And there is plenty that CIOs can learn from these<br />
companies. Every major service these companies<br />
create is built on the idea <strong>of</strong> extracting more and more<br />
value from more and more data.<br />
For example, the 1-800-GOOG-411 service, which<br />
individuals can call to get telephone numbers and<br />
addresses <strong>of</strong> local businesses, does not merely take<br />
an ax to the high-margin directory assistance services<br />
run by incumbent carriers (although it does that).<br />
That is just a by-product. More important, the<br />
800-number service has let Google compile what has<br />
been described as the world’s largest database <strong>of</strong><br />
spoken language. Google is using that database to<br />
improve the quality <strong>of</strong> voice recognition in Google<br />
Voice, in its sundry mobile-phone applications, and in<br />
other services under development. Some <strong>of</strong> the ways<br />
companies such as Google capture data and convert<br />
it into services are listed in Table 1.<br />
Service | Data that Web companies capture<br />
Self-serve advertising | Ad-clicking and -picking behavior<br />
Analytics | Aggregated Web site usage tracking<br />
Social networking | Sundry online<br />
Browser | Limited browser behaviors<br />
E-mail | Words used in e-mails<br />
Search engine | Searches and clicking information<br />
RSS feeds | Detailed reading habits<br />
Extra browser functionality | All browser behavior<br />
View videos | All site behavior<br />
Free directory assistance | Database of spoken words<br />
Table 1: Web portal <strong>Big</strong> <strong>Data</strong> strategy<br />
Source: PricewaterhouseCoopers, 2010<br />
38 PricewaterhouseCoopers Technology Forecast
Many Web companies are finding opportunity in “gray<br />
data.” Gray data is the raw and unvalidated data that<br />
arrives from various sources, in huge quantities, and not<br />
in the most usable form. Yet gray data can deliver value<br />
to the business even if the generators <strong>of</strong> that content<br />
(for example, people calling directory assistance) are<br />
contributing that data for a reason far different from<br />
improving voice-recognition algorithms. They just want<br />
the right phone number; the data they leave is a gift to<br />
the company providing the service.<br />
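The refinement of gray data can be sketched in a few lines of Python. The log format, field names, and cleanup rules below are invented for illustration; the point is only that raw, unvalidated input can be distilled into usable records:<br />

```python
import re

# Hypothetical raw "gray data" lines from a directory-assistance-style log;
# the format and fields are illustrative, not any real provider's data.
raw_calls = [
    "2010-03-14 09:02:11 | query='pizza palo alto' | result=650-555-0101",
    "2010-03-14 09:02:45 | query='pizza  palo alto ' | result=650-555-0101",
    "bad line with no structure",
]

LINE_RE = re.compile(
    r"(?P<ts>[\d-]+ [\d:]+) \| query='(?P<query>[^']*)' \| result=(?P<result>\S+)"
)

def refine(lines):
    """Turn raw log lines into normalized records, setting aside what can't be parsed."""
    records = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # gray data: unparseable lines are set aside, not trusted
        rec = m.groupdict()
        rec["query"] = " ".join(rec["query"].split())  # collapse stray whitespace
        records.append(rec)
    return records

print(refine(raw_calls))
```

The unparseable line is dropped rather than silently kept, which is the essence of treating gray data as unvalidated until proven otherwise.<br />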
The new technologies and services described in the<br />
article, “Building a bridge to the rest <strong>of</strong> your data,” on<br />
page 22 are making it possible to search for enterprise<br />
value in gray data in agile ways at low cost. Much <strong>of</strong><br />
this value is likely to be in the area <strong>of</strong> knowing your<br />
customers, a sure path for CIOs looking for ways to<br />
contribute to company growth and deepen their<br />
relationships with the rest <strong>of</strong> the C-suite.<br />
What Web enterprise use <strong>of</strong> <strong>Big</strong> <strong>Data</strong> shows<br />
CIOs, most <strong>of</strong> all, is that there is a way to think and<br />
manage differently when you conclude that standard<br />
transactional data analysis systems are not and<br />
should not be the only models. New models are<br />
emerging. CIOs who recognize these new models<br />
without throwing away the legacy systems that still<br />
serve them well will see that having more than one<br />
tool set, one skill set, and one set <strong>of</strong> controls makes<br />
their organizations more sophisticated, more agile,<br />
less expensive to maintain, and more valuable to<br />
the business.<br />
The business case<br />
Besides Google, Yahoo, and other Web-based<br />
enterprises that have complex data sets, there are<br />
stories of brick-and-mortar organizations that will be<br />
making more use <strong>of</strong> <strong>Big</strong> <strong>Data</strong>. For example, Rollin Ford,<br />
Wal-Mart’s CIO, told The Economist earlier this year,<br />
“Every day I wake up and ask, ‘How can I flow data<br />
better, manage data better, analyze data better?’”<br />
The answer to that question today implies a budget<br />
reallocation, with less-expensive hardware and s<strong>of</strong>tware<br />
carrying more <strong>of</strong> the load. “I see inspiration from the<br />
Google model and the notion <strong>of</strong> moving into<br />
commodity-based computing—just having lots <strong>of</strong><br />
cheap stuff that you can use to crunch vast quantities<br />
<strong>of</strong> data. I think that really contrasts quite heavily with<br />
the historic model <strong>of</strong> paying lots <strong>of</strong> money for really<br />
specialist stuff,” says Phil Buckle, CIO <strong>of</strong> the UK’s<br />
National Policing Improvement Agency, which oversees<br />
law enforcement infrastructure nationwide. That’s a new<br />
mind-set for the CIO, who ordinarily focuses on keeping<br />
the plumbing and the data it carries safe, secure,<br />
in-house, and functional.<br />
Seizing the <strong>Big</strong> <strong>Data</strong> initiative would give CIOs in<br />
particular and IT in general more clout in the executive<br />
suite. But are CIOs up to the task? “It would be a<br />
positive if IT could harness unstructured data<br />
effectively,” former Gartner analyst Howard Dresner,<br />
CEO <strong>of</strong> Dresner Advisory Services, observes. “However,<br />
they haven’t always done a great job with structured<br />
data, and unstructured is far more complex and exists<br />
predominately outside the firewall and beyond<br />
their control.”<br />
Tools are not the issue. Many evolving tools, as noted<br />
in the previous article, come from the open-source<br />
community; they can be downloaded and experimented<br />
with for low cost and are certainly up to supporting<br />
any pilot project. More important is the aforementioned<br />
mind-set and a new kind <strong>of</strong> talent IT will need.<br />
To whom does the future <strong>of</strong> IT belong?<br />
The ascendance <strong>of</strong> <strong>Big</strong> <strong>Data</strong> means that CIOs need a<br />
more data-centric approach. But what kind <strong>of</strong> talent<br />
can help a CIO succeed in a more data-centric business<br />
environment, and what specific skills do the CIO’s<br />
teams focused on the area need to develop<br />
and balance?<br />
Hal Varian, a University <strong>of</strong> California, Berkeley, pr<strong>of</strong>essor<br />
and Google’s chief economist, says, “The sexy job in<br />
the next 10 years will be statisticians.” He and others,<br />
such as IT and management pr<strong>of</strong>essor Erik Brynjolfsson<br />
at the Massachusetts Institute <strong>of</strong> Technology (MIT),<br />
contend this demand will happen because the amount<br />
<strong>of</strong> data to be analyzed is out <strong>of</strong> control. Those who<br />
can make sense <strong>of</strong> the flood will reap the greatest<br />
rewards. They have a point, but the need is not just<br />
for statisticians—it’s for a wide range <strong>of</strong> analytically<br />
minded people.<br />
Today, larger companies still need staff with expertise<br />
in package implementations and customizations,<br />
systems integration, and business process<br />
reengineering, as well as traditional data management<br />
and business intelligence that’s focused on<br />
transactional data. But there is a growing role for<br />
people with flexible minds to analyze data and suggest<br />
solutions to problems or identify opportunities from<br />
that data.<br />
In Silicon Valley and elsewhere, where businesses such<br />
as Google, Facebook, and Twitter are built on the<br />
rigorous and speedy analysis <strong>of</strong> data, programming<br />
frameworks such as MapReduce (which works with<br />
Hadoop) and NoSQL (a database approach for nonrelational<br />
data stores) are becoming more popular.<br />
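The MapReduce model behind these frameworks can be illustrated with a toy word count in plain Python. This is a single-process sketch of the map, shuffle, and reduce phases that Hadoop distributes across a cluster; it is not Hadoop's actual API:<br />

```python
from collections import defaultdict
from itertools import chain

# Map step: emit (key, value) pairs for each input record.
def map_phase(document):
    return [(word, 1) for word in document.lower().split()]

# Shuffle step: group all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce step: aggregate each key's group of values.
def reduce_phase(key, values):
    return key, sum(values)

docs = ["big data big tools", "big clusters"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 1, 'tools': 1, 'clusters': 1}
```

Because each map call touches only one document and each reduce call only one key, the phases parallelize naturally, which is what lets Hadoop spread the same logic over a cluster of cheap machines.<br />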
Chris Wensel, who created Cascading (an alternative<br />
application programming interface [API] to MapReduce)<br />
and straddles the worlds <strong>of</strong> startups and entrenched<br />
companies, says, “When I talk to CIOs, I tell them:<br />
‘You know those people you have who know about<br />
data. You probably don’t use those people as much<br />
as you should. But once you take advantage <strong>of</strong> that<br />
expertise and reallocate that talent, you can take<br />
advantage <strong>of</strong> these new techniques.’”<br />
The increased emphasis on data analysis does not<br />
mean that traditional programmers will be replaced by<br />
quantitative analysts or data warehouse specialists.<br />
“The talent demand isn’t so much for Java developers<br />
or statisticians per se as it is for people who know how<br />
to work with denormalized data,” says Ray Velez, CTO<br />
at Razorfish, an interactive marketing and technology<br />
consulting firm involved in many <strong>Big</strong> <strong>Data</strong> initiatives.<br />
“It’s about understanding how to map data into a format<br />
that most people are not familiar with. Most people<br />
understand SQL and the relational format, so the real<br />
skill set evolution doesn’t have quite as much to do with<br />
whether it’s Java or Python or other technologies.”<br />
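The shift Velez describes might be sketched as follows; the tables and field names here are invented. A normalized design splits facts across tables joined by keys, while a denormalized record is self-contained, so each one can be processed independently on any node:<br />

```python
# Normalized, relational-style data: facts split across tables, joined by keys.
customers = {101: {"name": "Ada"}}
orders = [
    {"order_id": 1, "customer_id": 101, "item": "book"},
    {"order_id": 2, "customer_id": 101, "item": "lamp"},
]

# Denormalized form: one self-contained record per customer, with the join
# already "baked in" -- the shape MapReduce-style tools favor, since each
# record can be handled on its own.
def denormalize(customers, orders):
    records = []
    for cid, cust in customers.items():
        records.append({
            "customer_id": cid,
            "name": cust["name"],
            "items": [o["item"] for o in orders if o["customer_id"] == cid],
        })
    return records

print(denormalize(customers, orders))
# [{'customer_id': 101, 'name': 'Ada', 'items': ['book', 'lamp']}]
```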
Velez points to Bill James as a useful case. James, a<br />
baseball writer and statistician, challenged conventional<br />
wisdom by taking an exploratory mind-set to baseball<br />
statistics. He literally changed how baseball<br />
management makes talent decisions, and even how<br />
they manage on the field. In fact, James became senior<br />
adviser for baseball operations in the Boston Red Sox’s<br />
front <strong>of</strong>fice.<br />
For example, James showed that batting average is<br />
less an indicator <strong>of</strong> a player’s future success than<br />
how <strong>of</strong>ten he’s involved in scoring runs—getting on<br />
base, advancing runners, or driving them in. In this<br />
example and many others, James used his knowledge<br />
<strong>of</strong> the topic, explored the data, asked questions no<br />
one had asked, and then formulated, tested, and<br />
refined hypotheses.<br />
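James's point can be made concrete with two hypothetical stat lines (the numbers are invented). Player A has the higher batting average, but once walks are counted through a measure such as on-base percentage, player B reaches base more often:<br />

```python
def batting_average(hits, at_bats):
    return hits / at_bats

def on_base_percentage(hits, walks, at_bats):
    # Simplified OBP: (hits + walks) / (at-bats + walks); the official formula
    # also counts hit-by-pitch and sacrifice flies.
    return (hits + walks) / (at_bats + walks)

a = {"hits": 150, "walks": 20, "at_bats": 500}  # .300 average, few walks
b = {"hits": 140, "walks": 80, "at_bats": 500}  # .280 average, many walks

for p in (a, b):
    print(round(batting_average(p["hits"], p["at_bats"]), 3),
          round(on_base_percentage(p["hits"], p["walks"], p["at_bats"]), 3))
```

The ranking flips depending on the metric chosen, which is exactly the kind of question an exploratory analyst asks of long-accepted numbers.<br />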
Says Velez: “Our analytics team within Razorfish has<br />
the James types <strong>of</strong> folks who can help drive different<br />
thinking and envision possibilities with the data. We<br />
need to find a lot more <strong>of</strong> those people. They’re not<br />
very easy to find. There is an aspect <strong>of</strong> James that<br />
just has to do with boldness and courage, a willingness<br />
to challenge those who are in the habit <strong>of</strong> using<br />
metrics they’ve been using for years.”<br />
The CIO will need people throughout the organization<br />
who have all sorts <strong>of</strong> relevant analysis and coding skills,<br />
who understand the value <strong>of</strong> data, and who are not<br />
afraid to explore. This does not mean the end <strong>of</strong> the<br />
technology- or application-centric organizational chart<br />
<strong>of</strong> the typical IT organization. Rather, it means the<br />
addition <strong>of</strong> a data-exploration dimension that is more<br />
than one or two people. These people will be using a<br />
blend <strong>of</strong> tools that differ depending on requirements,<br />
as Table 2 illustrates. More <strong>of</strong> the tools will be open<br />
source than in the past.<br />
Skills | Tools (a sampler) | Comments<br />
Natural language processing and text mining | Clojure, Redis, Scala, Crane, other Java functional language libraries, Python Natural Language ToolKit | To some extent, each of these serves as a layer of abstraction on top of Hadoop. Those familiar keep adding layers on top of layers. FlightCaster, for example, uses a stack consisting of Amazon S3 -> Amazon EC2 -> Cloudera -> HDFS -> Hadoop -> Cascading -> Clojure.[1]<br />
Data mining | R, MATLAB | R is more suited to finance and statistics, whereas MATLAB is more engineering oriented.[2]<br />
Scripting and NoSQL database programming skills | Python and related frameworks, HBase, Cassandra, CouchDB, Tokyo Cabinet | These lend themselves to or are based on the functional languages mentioned above. CouchDB, for example, is written in Erlang,[3] another functional programming language comparable to LISP (see discussion of Clojure and LISP on page 30).<br />
Table 2: New skills and tools for the IT department<br />
Source: Cited online postings and PricewaterhouseCoopers, 2008–2010<br />
[1] Pete Skomoroch, “How FlightCaster Squeezes Predictions from Flight Data,” Data Wrangling blog, August 24, 2009, http://www.datawrangling.com/how-flightcaster-squeezes-predictions-from-flight-data (accessed May 14, 2010).<br />
[2] Brendan O’Connor, “Comparison of data analysis packages,” AI and Social Science blog, February 23, 2009, http://anyall.org/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ (accessed May 25, 2010).<br />
[3] Scripting languages such as Python run more slowly than Java, but developers sometimes make the tradeoff to increase their own productivity. Some companies have created their own frameworks and released these to open source. See Klaas Bosteels, “Python + Hadoop = Flying Circus Elephant,” Last.HQ Last.fm blog, May 29, 2008, http://blog.last.fm/2008/05/29/python-hadoop-flying-circus-elephant (accessed May 14, 2010).<br />
Where do CIOs find such talent? Start with your own<br />
enterprise. For example, business analysts managing<br />
the marketing department’s lead-generation systems<br />
could be promoted onto an IT data staff charged with<br />
exploring the data flow. Most large consumer-oriented<br />
companies already have people in their business units<br />
who can analyze data and suggest solutions to<br />
problems or identify opportunities. These people need<br />
to be groomed and promoted, and more <strong>of</strong> them hired<br />
for IT, to enable the entire organization, not just the<br />
marketing department, to reap the riches.<br />
Set up a sandbox<br />
Although the business case CIOs can make for <strong>Big</strong> <strong>Data</strong><br />
is inarguable, even inarguable business cases carry<br />
some risk. Many CIOs will look at the risks associated<br />
with <strong>Big</strong> <strong>Data</strong> and find a familiar canard. Many <strong>Big</strong> <strong>Data</strong><br />
technologies—Hadoop in particular—are open source,<br />
and open source is <strong>of</strong>ten criticized for carrying too<br />
much risk.<br />
The open-source versus proprietary technology<br />
argument is nothing new. CIOs who have tried to<br />
implement open-source programs, from the Apache<br />
Web server to the Drupal content-management system,<br />
have faced the usual arguments against code being<br />
available to all comers. Some <strong>of</strong> those arguments,<br />
especially concerns revolving around security and<br />
reliability, verge on the specious. Google built its internal<br />
Web servers atop Apache. And it would be difficult to<br />
find a <strong>Big</strong> <strong>Data</strong> site as reliable as Google’s.<br />
Clearly, one challenge CIOs face has nothing to do<br />
with data or skill sets. Open-source projects become<br />
available earlier in their evolution than do proprietary<br />
alternatives. In this respect, <strong>Big</strong> <strong>Data</strong> tools are less<br />
stable and complete than are Apache or Linux<br />
open-source tools kits.<br />
Introducing an open-source technology such as<br />
Hadoop into a mostly proprietary environment does<br />
not necessarily mean turning the organization upside<br />
down. A CIO at a small Massachusetts company says,<br />
“Every technology department has a skunkworks, no<br />
matter how informal—a sandbox where they can test<br />
and prove technologies. That’s how open source<br />
entered our organization. A small Hadoop installation<br />
might be a gateway that leads you to more open<br />
source. But it might turn out to be a neat little open-source<br />
project that sits by itself and doesn’t bother<br />
anything else. Either can be OK, depending on the<br />
needs <strong>of</strong> your company.”<br />
Bud Albers, executive vice president and CTO <strong>of</strong><br />
Disney Technology Shared Services Group, concurs.<br />
“It depends on your organizational mind-set,” he says.<br />
“It depends on your organizational capability. There<br />
is a certain ‘don’t try this at home’ kind <strong>of</strong> warning<br />
that goes with technologies like Hadoop. You have to<br />
be willing at this stage <strong>of</strong> its maturity to maybe have<br />
a little higher level <strong>of</strong> capability to go in.”<br />
PricewaterhouseCoopers agrees with those sentiments<br />
and strongly urges large enterprises to establish a<br />
sandbox dedicated to <strong>Big</strong> <strong>Data</strong> and Hadoop/<br />
MapReduce. This move should be standard operating<br />
procedure for large companies in 2010, as should a<br />
small, dedicated staff <strong>of</strong> data explorers and modest<br />
budget for the efforts. For more information on what<br />
should be in your sandbox, refer to the article, “Building<br />
a bridge to the rest <strong>of</strong> your data,” on page 22.<br />
And for some ideas on how the sandbox could fit in<br />
your org chart, see Figure 1.<br />
[Figure 1 depicts an organization chart: a VP of IT oversees a director of application development and a director of data analysis; a data analysis team and a data exploration team report to the director of data analysis, alongside the marketing, Web site, sales, operations, and finance managers.]<br />
Figure 1: Where a data exploration team might fit in an<br />
organization chart<br />
Source: PricewaterhouseCoopers, 2010<br />
Different companies will want to experiment with<br />
Hadoop in different ways, or segregate it from the rest<br />
<strong>of</strong> the IT infrastructure with stronger or weaker walls.<br />
The CIO must determine how to encourage this kind<br />
<strong>of</strong> experimentation.<br />
Understand and manage the risks<br />
Some <strong>of</strong> the risks associated with <strong>Big</strong> <strong>Data</strong> are<br />
legitimate, and CIOs must address them. In the case <strong>of</strong><br />
Hadoop clusters, security is a pressing question: it was<br />
a feature added as the project developed, not cooked<br />
in from the beginning. It’s still far from perfect. Many<br />
open-source projects start as cool projects intended to<br />
prove a concept or solve a particular problem. Some,<br />
such as Linux or Mozilla, become massive successes,<br />
but they rarely start with the sort <strong>of</strong> requirements a CIO<br />
faces when introducing systems to corporate settings.<br />
Beyond open source, regardless <strong>of</strong> which tools are<br />
used to manipulate data, there are always risks<br />
associated with making decisions based on the analysis<br />
<strong>of</strong> <strong>Big</strong> <strong>Data</strong>. To give one dramatic example, the recent<br />
financial crisis was caused in part by banks and rating<br />
agencies whose models for understanding value at risk<br />
and the potential for securities based on subprime<br />
mortgages to fail were flat-out wrong. Just as there is<br />
risk in data that is not sufficiently clean, there is risk<br />
in data manipulation techniques that have not been<br />
sufficiently vetted. Many times, the only way to<br />
understand big, complicated data is through the use<br />
<strong>of</strong> big, complicated algorithms, which leaves a door<br />
open to big, catastrophic mistakes in analysis. Table 3<br />
includes a list <strong>of</strong> the risks associated with <strong>Big</strong> <strong>Data</strong><br />
analysis and ways to mitigate them.<br />
Risk | Mitigation tactic<br />
Over-reliance on insights gleaned from data analysis leads to loss | Testing<br />
Inaccurate or obsolete data | Maintain strong metadata management; unverified information must be flagged<br />
Analysis leads to paralysis | Keep the sandbox related to the business problem or opportunity<br />
Security | Keep the Hadoop clusters away from the firewall, be vigilant, ask the chief security officer for help<br />
Buggy code and other glitches | Make sure the team keeps track of modifications and other implementation history, since documentation isn’t plentiful<br />
Rejection by other parts of the organization | Do change management to help improve the odds of acceptance, along with quick successes<br />
Table 3: How to mitigate the risks <strong>of</strong> <strong>Big</strong> <strong>Data</strong> analysis<br />
Source: PricewaterhouseCoopers, 2010<br />
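The “flag unverified information” tactic in the table might look like this in practice; the record shape and field names are illustrative only:<br />

```python
from datetime import date

# Every record carries metadata about where it came from and whether it has
# been validated, so downstream analysis can exclude or discount gray data.
def tag(record, source, verified):
    return {**record, "_meta": {"source": source,
                                "verified": verified,
                                "loaded": date.today().isoformat()}}

records = [
    tag({"customer": "C-1", "spend": 120.0}, source="billing_system", verified=True),
    tag({"customer": "C-2", "spend": 75.0}, source="web_scrape", verified=False),
]

# Only verified records survive a trust filter.
trusted = [r for r in records if r["_meta"]["verified"]]
print(len(trusted))
```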
By nature and by work experiences, most CIOs are<br />
risk averse. Blue-chip CIOs hold <strong>of</strong>f installing new versions <strong>of</strong> s<strong>of</strong>tware until they have been proven beyond a<br />
doubt, and these CIOs don’t standardize<br />
on new platforms until the risk for change appears to<br />
be less than the risk <strong>of</strong> stasis.<br />
“The fundamental issue is whether IT is willing to<br />
depart from the status quo, such as an RDBMS<br />
[relational database management system], in favor<br />
<strong>of</strong> more powerful technologies,” Dresner says.<br />
“This means massive change, and IT doesn’t<br />
always embrace change.” More forward-thinking IT<br />
organizations constantly review their s<strong>of</strong>tware<br />
portfolio and adjust accordingly.<br />
In this case, the need to manipulate larger and larger<br />
amounts <strong>of</strong> data that companies are collecting is<br />
pressing. Even risk-averse CIOs are exploring the<br />
possibilities <strong>of</strong> <strong>Big</strong> <strong>Data</strong> for their businesses. Bernard<br />
(Bud) Mathaisel, CIO <strong>of</strong> the outsourcing vendor<br />
Achievo, divides the risks <strong>of</strong> <strong>Big</strong> <strong>Data</strong> and their<br />
solutions into three areas:<br />
• Accessibility—The data repository used for data<br />
analysis should be access managed<br />
• Classification—Gray data should be identified<br />
as such<br />
• Governance—Who’s doing what with this?<br />
Yes, <strong>Big</strong> <strong>Data</strong> is new. But accessibility, classification,<br />
and governance are matters CIOs have had to deal<br />
with for many years in many guises.<br />
Conclusion<br />
At many companies, <strong>Big</strong> <strong>Data</strong> is both an opportunity<br />
(what useful needles can we find in a terabyte-sized<br />
haystack?) and a source <strong>of</strong> stress (<strong>Big</strong> <strong>Data</strong> is<br />
overwhelming our current tools and methods; they<br />
don’t scale up to meet the challenge). The prefix<br />
“tera” in “terabyte,” after all, comes from the Greek<br />
word for “monster.” CIOs aiming to use <strong>Big</strong> <strong>Data</strong> to<br />
add value to their businesses are monster slayers.<br />
CIOs don’t just manage hardware and s<strong>of</strong>tware now;<br />
they’re expected to manage the data stored in that<br />
hardware and used by that s<strong>of</strong>tware—and provide a<br />
framework for delivering insights from the data.<br />
From Amazon.com to the Boston Red Sox, diverse<br />
companies compete based on what data they collect<br />
and what they learn from it. CIOs must deliver easy,<br />
reliable, secure access to that data and develop<br />
consistent, trustworthy ways to explore and wrench<br />
wisdom from that data. CIOs do not need to rush, but<br />
they do need to be prepared for the changes that <strong>Big</strong><br />
<strong>Data</strong> is likely to require.<br />
Perhaps the most productive way for CIOs to frame the<br />
issue is to acknowledge that <strong>Big</strong> <strong>Data</strong> isn’t merely a<br />
new model; it’s a new way to think about all data<br />
models. <strong>Big</strong> <strong>Data</strong> isn’t merely more data; it is different<br />
data that requires different tools. As more and more<br />
internal and external sources cast <strong>of</strong>f more and more<br />
data, basic notions about the size and attributes <strong>of</strong> data<br />
sets are likely to change. With those changes, CIOs will<br />
be expected to capture more data and deliver it to the<br />
executive team in a manner that reveals the business—<br />
and how to grow it—in new ways.<br />
Web companies have set the bar high already. John<br />
Avery, a partner at Sungard Consulting, points to the<br />
YouTube example: “YouTube’s ability to index a data<br />
store <strong>of</strong> such immense size and then accrete additional<br />
analysis on top <strong>of</strong> that, as an ongoing process with no<br />
foresight into what those analyses would look like when<br />
the data was originally stored, is very, very impressive.<br />
That is something that has challenged folks in financial<br />
technology for years.”<br />
As companies with a history <strong>of</strong> cautious data policies<br />
begin to test and embrace Hadoop, MapReduce, and<br />
the like, forward-looking CIOs will turn to the issues that<br />
will become more important as <strong>Big</strong> <strong>Data</strong> becomes the<br />
norm. The communities arising around Hadoop (and the<br />
inevitable open-source and proprietary competitors that<br />
follow) will grow and become influential, inspiring more<br />
CIOs to become more data-centric. The pr<strong>of</strong>usion <strong>of</strong><br />
new data sources will lead to dramatic growth in the<br />
use and diversity <strong>of</strong> metadata. As the data grows, so<br />
will our vocabulary for understanding it.<br />
Whether learning from Google’s approach to <strong>Big</strong> <strong>Data</strong>,<br />
hiring a staff primed to maximize its value, or managing<br />
the new risks, forward-looking CIOs will, as always,<br />
be looking to enable new business opportunities<br />
through technology.<br />
New approaches to<br />
customer data analysis<br />
Razorfish’s Mark Taylor and Ray Velez discuss how new<br />
techniques enable them to better analyze petabytes <strong>of</strong><br />
Web data.<br />
Interview conducted by Alan Morrison and Bo Parker<br />
Mark Taylor is global solutions director and Ray Velez is CTO <strong>of</strong> Razorfish, an interactive<br />
marketing and technology consulting firm that is now a part <strong>of</strong> Publicis Groupe. In this<br />
interview, Taylor and Velez discuss how they use Amazon’s Elastic Compute Cloud<br />
(EC2) and Elastic MapReduce services, as well as Micros<strong>of</strong>t Azure Table services, for<br />
large-scale customer segmentation and other data mining functions.<br />
PwC: What business problem were you trying to solve with the Amazon services?

MT: We needed to join together large volumes of disparate data sets that both we and a particular client can access. Historically, those data sets could not be joined at the capacity we were able to achieve using the cloud.

In our traditional data environment, we were limited in the scope of real clickstream data we could actually access for processing, and in bandwidth, because we procured a fixed size of data. We managed and worked with a third party to serve that data center.

This approach worked very well until we wanted to tie together and use SQL servers with online analytical processing cubes, all in a fixed infrastructure. With the cloud, we were able to throw billions of rows of data together and really start categorizing that information, so that we could segment non-personally identifiable data from browsing sessions and from specific ways in which we think about segmenting the behavior of customers.

That capability gives us a much smarter way to apply rules to our clients’ merchandising approaches, so that we can achieve far more contextual focus for the use of the data. Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.

RV: It was slightly different from a traditional database approach. The traditional approach just isn’t going to work when dealing with the amounts of data that a tool like the Atlas ad server [a Razorfish ad engine that is now owned by Microsoft and offered through Microsoft Advertising] has to deal with.
PwC: The scalability aspect of it seems clear. But is the nature of the data you’re collecting such that it may not be served well by a relational approach?

RV: It’s not the nature of the data itself, but what we end up needing to deal with when it comes to relational data. Relational data has lots of flexibility because of the normalized format, and you can slice and dice and look at the data in lots of different ways. But until you put it into a data warehouse format or a denormalized EMR [Elastic MapReduce] or Bigtable type of format, you really don’t get the performance that you need when dealing with larger data sets.

“Rather than using the data for reporting only, we can actually leverage it for targeting and think about how we can add value to the insight.” — Mark Taylor

So it’s really that classic tradeoff; the data doesn’t necessarily lend itself perfectly to either approach. When you’re looking at performance and the amount of data, even a data warehouse can’t deal with the amount of data that we would get from a lot of our data sources.
PwC: What motivated you to look at this new technology to solve that old problem?

RV: Here’s a similar example where we used a slightly different technology. We were working with a large financial services institution, and we were dealing with massive amounts of spending-pattern and anonymous data. We knew we had to scale to Internet volumes, and we were talking about columnar databases. We wondered, can we use a relational structure with enough indexes to make it perform well? We experimented with a relational structure and it just didn’t work.

So early on we jumped into what Microsoft Azure technology allowed us to do, and we put it into a Bigtable format, or a Hadoop-style format, using Azure Table services. The real custom element was designing the partitioning structure of this data to denormalize what would usually be five or six tables into one huge table with lots of columns, to the point where we started to bump up against the maximum number of columns they had.
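The wide-table denormalization Velez describes can be sketched in plain Python. This is a minimal illustration, not Razorfish’s actual schema: the table names, fields, and key layout below are hypothetical, chosen only to show the idea of joining once at write time and composing a partition key around the expected query.

```python
# Sketch: collapsing several normalized tables into one wide,
# partition-keyed record, in the style of Bigtable/Azure Table storage.
# All names and fields here are hypothetical.

customers = {101: {"name": "Ann", "region": "east"}}
accounts = {5001: {"customer_id": 101, "type": "checking"}}
spend = [{"account_id": 5001, "month": "2010-06", "total": 420.0}]

def denormalize(spend_rows):
    """Perform the joins once in code, so reads need no joins at query time."""
    wide_rows = []
    for s in spend_rows:
        acct = accounts[s["account_id"]]
        cust = customers[acct["customer_id"]]
        wide_rows.append({
            # Partition key groups rows that are scanned together;
            # row key makes each entity unique within its partition.
            "PartitionKey": f"{cust['region']}|{s['month']}",
            "RowKey": f"{acct['customer_id']}|{s['account_id']}",
            "name": cust["name"],
            "account_type": acct["type"],
            "total_spend": s["total"],
        })
    return wide_rows

rows = denormalize(spend)

# All reads for one region and month now hit a single partition.
by_partition = {}
for r in rows:
    by_partition.setdefault(r["PartitionKey"], []).append(r)
```

The duplication is deliberate: lookup values (name, account type) are copied into every row, trading storage for join-free reads at scale.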
We were able to build something that we never would have thought of exposing to the world, because it never would have performed well. It actually spurred a whole new business idea for us. We were able to take what would typically be a BusinessObjects or a Cognos application, which would not scale to Internet volumes, and make it work at consumer scale.

We did some sizing to determine how big the data footprint would be. Obviously, when you do that, you tend to need a ton more space, because you’re duplicating lots and lots of data that, in a relational database, would be lookup tables or other things like that. But it turned out that when I laid the indexes on top of the traditionally relational data, the resulting data set actually had even greater storage requirements than performing the duplication and putting the data set into a large denormalized format. That was a bit of a surprise to us: the size of the indexes got so large.

When you think about it, maybe that’s just how an index works anyway: it puts things into this denormalized format. An index file is just a self-contained structure in your database or memory space. The point is, we would never have tried to expose that to consumers, but we were able to expose it to consumers because of this new format.
MT: The first commercial benefits were the ability to aggregate large and disparate data into one place and the extra processing power. But the next phase of benefits really derives from the ability to identify true relationships across that data.
“The stat section on [the MLB] site was always the most difficult part of the site, but the business insisted it needed it.” — Ray Velez

Tiny percentages of these data sets have the most significant impact on our customer interactions. We are already developing new data measurement and KPI strategies as we’re starting to ask ourselves, “Do our clients really need all of the data and measurement points to solve their business goals?”
PwC: Given these new techniques, is the skill set that’s most beneficial to have at Razorfish changing?

RV: It’s about understanding how to map data into a format that most people are not familiar with. Most people understand SQL and the relational format, so the real skill set evolution doesn’t have as much to do with whether the tool of choice is Java or Python or other technologies; it’s more about whether you understand normalized versus denormalized structures.

MT: From a more commercial viewpoint, there’s a shift away from a product type and skill set based on constraints and managing known parameters, and very much toward what else we can do. It changes the impact, not just in the technology organization, but in the other disciplines as well. I’ve already seen a profound effect on the old ways of doing things. Rather than thinking of doing the same things better, it really comes down to having the people and skills to meet your intended business goals.
Using the Elastic MapReduce service can have a ripple effect on all of the non-technical business processes and engagements across teams. For example, conventional marketing segmentation used to involve teams of analysts who waded through various data sets and stages of processing and analysis to make sense of how a business might view groups of customers. Using the Hadoop-style alternative and Cascading, we’re able to identify unconventional relationships across many data points with less effort, and in the process create new segmentations and insights.

This way, we stay relevant and respond more quickly to customer demand. We’re identifying, on a real-time basis, new variations and shifts in the data that would otherwise have taken weeks or months, or been missed completely, using the old approach. The analyst’s role in creating these new algorithms and designing new methods of campaign planning is clearly key to this type of solution design. The outcome of all this is really interesting, and I’m starting to see a subtle, organic response to different changes in customer behavior.
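The segmentation idea Taylor describes can be sketched as a map/reduce pair. In production this would run as a Cascading flow on a Hadoop cluster; the Python below is only a stand-in for the same pipeline shape, and the session fields and segment rules are invented for illustration.

```python
# Sketch of segmentation as map/reduce: a map step tags each anonymous
# session with segment keys, and a reduce step aggregates the counts.
# Session fields and segment rules are hypothetical.
from collections import defaultdict

sessions = [
    {"visitor": "v1", "pages": 12, "bought": True, "referrer": "search"},
    {"visitor": "v2", "pages": 2, "bought": False, "referrer": "email"},
    {"visitor": "v3", "pages": 9, "bought": False, "referrer": "search"},
]

def map_segments(session):
    # Emit (segment, 1) pairs; one session can fall into several segments.
    if session["pages"] >= 8:
        yield ("engaged_browser", 1)
    if session["bought"]:
        yield ("converter", 1)
    if session["referrer"] == "search" and not session["bought"]:
        yield ("search_window_shopper", 1)

def reduce_counts(pairs):
    # Sum the 1s per segment key, as a reducer would per group.
    counts = defaultdict(int)
    for segment, n in pairs:
        counts[segment] += n
    return dict(counts)

counts = reduce_counts(p for s in sessions for p in map_segments(s))
# counts: {"engaged_browser": 2, "converter": 1, "search_window_shopper": 1}
```

Because each mapper sees one session at a time, the same rules scale from three sessions to billions simply by adding machines, which is the point of the Hadoop-style alternative.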
PwC: Are you familiar with Bill James, a Major League Baseball statistician who has taken a rather different approach to metrics? James developed some metrics that turned out to be more useful than those used for many years in baseball. That kind of person seems to be the type that you’re enabling to hypothesize, perhaps even do some machine learning to generate hypotheses.

RV: Absolutely. Our analytics team within Razorfish has the Bill James type of folks who can help drive different thinking and envision possibilities with the data. We need to find a lot more of those people; they’re not very easy to find. And we have some of the leading folks in the industry.

You know, a long, long time ago we designed the Major League Baseball site and platform. The stat section on that site was always the most difficult part of the site, but the business insisted it needed it. The number of people who really wanted to churn through that data was small. We were using Oracle at the time. We used the concept of temporary tables, which would denormalize lots of different relational tables for performance reasons, and that was a challenge. If I had had the cluster technology we do now back in 1999 and 2000, we could have scaled much further than the two measly servers we could cluster.
PwC: The Bill James analogy goes beyond batting averages, which have been the age-old metric for assessing the contribution of a hitter to a team, to measuring other things that weren’t measured before.

RV: Even crazy stuff. We used to do things like, show me all of Derek Jeter’s hits at night on a grass field.

PwC: There you go. Exactly.

RV: That’s the example I always use, because that was the hardest thing to get to scale. If you go to the stat section, you can do a lot of those things, but if too many people went to the stat section on the site, the site would melt down, because Oracle couldn’t handle it. If I were to rebuild that today, I could use EMR or a Bigtable and I’d be much happier.
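The “hits at night on a grass field” query is, in cluster terms, a map-side filter over denormalized at-bat records. A minimal sketch, with a hypothetical record layout (the real Atlas or MLB data obviously looked nothing like three dictionaries):

```python
# Sketch: an arbitrary-predicate query as a map-side filter. Because each
# denormalized row already carries the park surface and game time, every
# mapper can apply the same predicate to its own shard independently.
# Record fields are hypothetical.
at_bats = [
    {"batter": "jeter", "result": "single", "surface": "grass", "night": True},
    {"batter": "jeter", "result": "out", "surface": "grass", "night": True},
    {"batter": "jeter", "result": "double", "surface": "turf", "night": False},
]

HITS = {"single", "double", "triple", "home_run"}

def night_grass_hits(rows, batter):
    # The whole query is one linear scan; no joins against park or
    # schedule tables are needed at query time.
    return [r for r in rows
            if r["batter"] == batter
            and r["result"] in HITS
            and r["surface"] == "grass"
            and r["night"]]

hits = night_grass_hits(at_bats, "jeter")
```

The same scan that melts down a shared relational server parallelizes trivially on a cluster, because no row depends on any other.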
PwC: Considering the size of the Bigtable that you’re able to put together without using joins, it seems like you’re initially able to filter better and maybe do multistage filtering to get to something useful. You can take a cyclical approach to your analysis, correct?

RV: Yes, you’re almost peeling away the layers of the onion. But putting data into a denormalized format does restrict flexibility, because you have so much more power with a where clause than you do with a standard EMR or Bigtable access mechanism.

It’s like the difference between something built for exactly one task versus something built to handle tasks I haven’t even thought of. If you peel away a layer of the onion, you might decide, wow, this data’s interesting and we’re going in a very interesting direction, so what about this? You may not be able to slice it that way. You might have to step back and come up with a different partition structure to support it.
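The flexibility tradeoff Velez is describing can be made concrete. In this sketch (keys and fields hypothetical), the anticipated question is answered by a single partition lookup, while an unanticipated "where clause" degenerates to a scan of every partition, which is exactly why a new question can force a re-partition:

```python
# Sketch: partition-key access vs. an ad hoc where clause.
# Keys and fields are hypothetical.
rows = [
    {"region": "east", "month": "2010-06", "visitor": "v1", "spend": 40.0},
    {"region": "west", "month": "2010-06", "visitor": "v2", "spend": 75.0},
    {"region": "east", "month": "2010-07", "visitor": "v1", "spend": 15.0},
]

# Partitioned for the anticipated question: activity by region and month.
table = {}
for r in rows:
    table.setdefault((r["region"], r["month"]), []).append(r)

# Anticipated query: one direct partition read.
fast = table[("east", "2010-06")]

# Unanticipated query ("everything visitor v1 did") ignores the partition
# key, so it has to scan every partition -- the lost where-clause power.
slow = [r for part in table.values() for r in part if r["visitor"] == "v1"]
```

A relational where clause hides this cost behind the optimizer; a key-value layout makes you pay it explicitly or redesign the partition structure.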
PwC: Social media is helping customers become more active and engaged. From a marketing analysis perspective, it’s a variation on a Super Bowl advertisement, just scaled down to that social media environment. And if that’s going to happen frequently, you need to know what the impact is, who’s watching it, and how the people watching it are affected by it. If you just think about the data ramifications of that, it sort of blows your mind.

RV: Think about the popularity of Hadoop and Bigtable, which is really looking under the covers at the way Google does its search. At the end of the day, search really is recommendations; it’s relevancy. What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting. We used to say we could never re-create the infrastructure that Google has; Google is the second largest server manufacturer in the world. But now we have a way to create small, targeted ways of doing what Google does. I think that’s pretty exciting. •

“What are the impacts on the ability of people to create new ways to do search and to compete in a more targeted fashion with the search engine? If you look three to five years out, that’s really exciting.” — Ray Velez
Acknowledgments

Advisory
Sponsor & Technology Leader
Tom DeGarmo

US Thought Leadership
Partner-in-Charge
Tom Craren

Center for Technology and Innovation
Managing Editor
Bo Parker

Editors
Vinod Baya, Alan Morrison

Contributors
Larry Best, Galen Gruman, Jimmy Guterman, Larry Marion, Bill Roberts

Editorial Advisers
Markus Anderle, Stephen Bay, Brian Butte, Tom Johnson, Krishna Kumaraswamy, Bud Mathaisel, Sean McClowry, Rajesh Munavalli, Luis Orama, Dave Patton, Jonathan Reichental, Terry Retter, Deepak Sahi, Carter Shock, David Steier, Joe Tagliaferro, Dimpsy Teckchandani, Cindi Thompson, Tom Urquhart, Christine Wendin, Dean Wotkiewich

Copyedit
Lea Anne Bantsari, Ellen Dunn

Transcription
Paula Burns
Graphic Design
Art Director
Jacqueline Corliss

Designers
Jacqueline Corliss, Suzanne Lau

Illustrators
Donald Bernhardt, Suzanne Lau, Tatiana Pechenik

Photographers
Tim Szumowski, Marina Waltz

Online
Director, Online Marketing
Jack Teuber

Designer and Producer
Scott Schmidt

Reviewers
Dave Stuckey, Chris Wensel

Marketing
Bob Kramer

Special thanks to
Ray George, Page One
Rachel Lovinger, Razorfish
Mariam Sughayer, Disney
Industry perspectives

During the preparation of this publication, we benefited greatly from interviews and conversations with the following executives and industry analysts:

Bud Albers, executive vice president and chief technology officer, Technology Shared Services Group, Disney

Matt Aslett, analyst, enterprise software, the451

John Avery, partner, SunGard Consulting Services

Amr Awadallah, vice president, engineering, and chief technology officer, Cloudera

Phil Buckle, chief information officer, National Policing Improvement Agency

Brian Donnelly, founder and chief executive officer, InSilico Discovery

Howard Dresner, president and founder, Dresner Advisory Services

Matt Estes, principal architect, Technology Shared Services Group, Disney

Jim Kobielus, senior analyst, Forrester Research

Doug Lenat, founder and chief executive officer, Cycorp

Roger Magoulas, research director, O’Reilly Media

Nathan Marz, lead engineer, BackType

Bill McColl, founder and chief executive officer, Cloudscale

John Parkinson, acting chief technology officer, TransUnion

David Smoley, chief information officer, Flextronics

Mark Taylor, director, global solutions, Razorfish

Scott Thompson, vice president, architecture, Technology Shared Services Group, Disney

Ray Velez, chief technology officer, Razorfish
pwc.com/us

To have a deeper conversation about how this subject may affect your business, please contact:

Tom DeGarmo
Principal, Technology Leader
PricewaterhouseCoopers
+1 267-330-2658
thomas.p.degarmo@us.pwc.com

This publication is printed on Coronado Stipple Cover, made from 30% recycled fiber, and Endeavor Velvet Book, made from 50% recycled fiber, a Forest Stewardship Council (FSC) certified stock using 25% post-consumer waste.

Recycled paper
Subtext

Big Data
Data sets that range from many terabytes to petabytes in size, and that usually consist of less-structured information such as Web log files.

Hadoop cluster
A type of scalable computer cluster inspired by the Google Cluster Architecture and intended for cost-effectively processing less-structured information.

Apache Hadoop
The core of an open-source ecosystem that makes Big Data analysis more feasible through the efficient use of commodity computer clusters.

Cascading
A bridge from Hadoop to common Java-based programming techniques not previously usable in cluster-computing environments.

NoSQL
A class of non-relational data stores and data analysis techniques that are intended for various kinds of less-structured data. Many of these techniques are part of the Hadoop ecosystem.

Gray data
Data from multiple sources that isn’t formatted or vetted for specific needs, but worth exploring with the help of Hadoop cluster analysis techniques.
Comments or requests? Please visit www.pwc.com/techforecast or send e-mail to techforecasteditors@us.pwc.com.

PricewaterhouseCoopers (www.pwc.com) provides industry-focused assurance, tax and advisory services to build public trust and enhance value for its clients and their stakeholders. More than 155,000 people in 153 countries across our network share their thinking, experience and solutions to develop fresh perspectives and practical advice.

© 2010 PricewaterhouseCoopers LLP. All rights reserved. “PricewaterhouseCoopers” refers to PricewaterhouseCoopers LLP, a Delaware limited liability partnership, or, as the context requires, the PricewaterhouseCoopers global network or other member firms of the network, each of which is a separate and independent legal entity. This document is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.