Testing 3Vs (Volume, Variety and Velocity) of Big Data - QAI


A lot happens in the Digital World in 60 seconds…

What is Big Data 

• Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
• Big Data is a generic term used to describe the voluminous amount of unstructured, structured and semi-structured data.

Big Data Characteristics 

• 3 key characteristics of big data:
  o Volume: High volume of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources
  o Variety: Data is in structured, semi-structured and unstructured formats
  o Velocity: Data is generated at high speed
    o High volume of data needs to be processed within seconds

Big Data Processing using Hadoop framework

❶ Load source data files into HDFS
❷ Perform Map/Reduce operations
❸ Extract the output results from HDFS
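The three processing steps can be sketched as standard Hadoop command lines. The snippet below only builds and prints the commands (it does not run them), so it can be read without a cluster. All paths, the JAR name, and the driver class are hypothetical placeholders, not part of the original slides.

```python
import shlex


def build_pipeline_commands(local_input, hdfs_input, hdfs_output,
                            local_output, job_jar, main_class):
    """Return the command lines for load -> Map/Reduce -> extract.

    Every argument is a hypothetical placeholder chosen for illustration.
    """
    return [
        # 1. Load source data files into HDFS
        ["hdfs", "dfs", "-put", local_input, hdfs_input],
        # 2. Run the Map/Reduce job packaged in the JAR
        ["hadoop", "jar", job_jar, main_class, hdfs_input, hdfs_output],
        # 3. Extract the output results from HDFS, merged into one local file
        ["hdfs", "dfs", "-getmerge", hdfs_output, local_output],
    ]


commands = build_pipeline_commands(
    local_input="sales.csv",
    hdfs_input="/data/in/sales.csv",
    hdfs_output="/data/out/sales_summary",
    local_output="sales_summary.txt",
    job_jar="sales-job.jar",
    main_class="com.example.SalesSummary",
)
for cmd in commands:
    print(shlex.join(cmd))
```

A test harness could pass each list to `subprocess.run` on an edge node; building the commands separately keeps them easy to log and review first.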

Hadoop Map/Reduce processing – Overview

Map/Reduce is a distributed computing and parallel processing framework that has the advantage of pushing the computation to the data.

• Distributed computing
• Parallel computing
• Based on Map & Reduce tasks
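To make the Map and Reduce tasks concrete, here is a minimal single-process sketch of a word count. It imitates the map, shuffle, and reduce phases in plain Python and is not Hadoop's actual API; the function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    # On a real cluster each input split would be mapped on the node that
    # stores it; here the "nodes" are simulated by a plain loop.
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


word_counts = run_job(["big data big volume", "data velocity"])
print(word_counts)
# {'big': 2, 'data': 2, 'volume': 1, 'velocity': 1}
```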

Hadoop Eco-System 

• HDFS – Hadoop Distributed File System
• HBase – NoSQL data store (non-relational distributed database)
• Map/Reduce – Distributed computing framework
• Sqoop – SQL-to-Hadoop database import and export tool
• Hive – Hadoop data warehouse
• Pig – Platform for creating Map/Reduce programs for analyzing large data sets

Unique Testing Opportunities in Big Data Implementations

Testing Opportunities for ‘Independent Testing’

• Early Validation of the Requirements
• Preparation of Big Test Data
• Early Validation of the Design
• Configuration Testing
• Incremental Load Testing
• Functional Testing

Early Validation of the Requirements

• Enterprise Data Warehouses integrated with Big Data
• Business Intelligence Systems integrated with Big Data
• Are the requirements mapped to the right data sources?
• Are there any data sources that are not considered? Why?
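The two questions on requirement-to-source mapping can be partly automated once the mapping is written down. The sketch below is one possible check, assuming a hypothetical dictionary of requirements and a hypothetical inventory of data sources; neither comes from the original slides.

```python
def check_requirement_mapping(requirement_sources, available_sources):
    """Return (unmapped requirements, unknown sources per requirement,
    sources that no requirement uses)."""
    available = set(available_sources)
    used = set()
    unmapped, unknown = [], {}
    for requirement, sources in requirement_sources.items():
        if not sources:
            unmapped.append(requirement)      # no data source mapped at all
        missing = sorted(set(sources) - available)
        if missing:
            unknown[requirement] = missing    # mapped to a source we do not have
        used.update(sources)
    unused = sorted(available - used)         # sources nobody considered
    return unmapped, unknown, unused


# Hypothetical example data
requirement_sources = {
    "REQ-1 churn report": ["crm_db", "web_logs"],
    "REQ-2 sentiment dashboard": ["twitter_feed"],
    "REQ-3 fraud alerts": [],
}
available_sources = ["crm_db", "web_logs", "call_center_audio"]

unmapped, unknown, unused = check_requirement_mapping(
    requirement_sources, available_sources
)
print("Unmapped requirements:", unmapped)
print("Unknown sources:", unknown)
print("Sources not considered:", unused)
```

Each item in the report still needs a human answer to the "Why?" on the slide; the script only finds the gaps.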

Early Validation of the Design

• Is the ‘Unstructured Data’ stored in the right place for analytics?
• Is the ‘Structured Data’ stored in the right place for analytics?
• Is the data duplicated in multiple storage systems? Why?
• Are the data synchronization needs adequately identified and addressed?

Test Data 

Preparation of Big Test Data:

• Replicate data intelligently, with tools
• How big should the data files be to ensure near-real volumes of data?
• Create data with incorrect schema
• Create erroneous data
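A small generator illustrates the test data ideas above: replicate a few seed records to reach a target volume, then deliberately corrupt a fraction of them. The three-column schema, the error rate, and the two corruption types are assumptions chosen for illustration.

```python
import random

FIELDS = ["id", "name", "amount"]  # hypothetical schema


def replicate(seed_rows, copies):
    """Replicate seed rows with fresh ids to approach near-real volumes."""
    rows = []
    for copy in range(copies):
        for i, row in enumerate(seed_rows):
            new_row = dict(row)
            new_row["id"] = str(copy * len(seed_rows) + i + 1)
            rows.append(new_row)
    return rows


def inject_errors(rows, error_rate, rng):
    """Return CSV lines plus the indexes of the lines corrupted on purpose."""
    lines, bad_indexes = [], []
    for index, row in enumerate(rows):
        values = [row[field] for field in FIELDS]
        if rng.random() < error_rate:
            if rng.random() < 0.5:
                values = values[:-1]        # incorrect schema: column missing
            else:
                values[2] = "not-a-number"  # erroneous data: invalid amount
            bad_indexes.append(index)
        lines.append(",".join(values))
    return lines, bad_indexes


seed = [
    {"id": "1", "name": "alice", "amount": "10.50"},
    {"id": "2", "name": "bob", "amount": "7.25"},
]
rows = replicate(seed, copies=500)
lines, bad_indexes = inject_errors(rows, error_rate=0.05, rng=random.Random(42))
print(len(lines), "lines generated,", len(bad_indexes), "corrupted on purpose")
```

Keeping the list of corrupted indexes lets the tester verify later that the system under test rejected exactly those records.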

Cluster Setup 

Cluster Setup Testing:

• Does the system behave as expected when a node is removed from the cluster?
• Does the system behave as expected when a node is added to the cluster?

Big Data Testing 


Volume Testing: Challenges 

Testing Challenges

• Terabytes and petabytes of data
• Data is stored in HDFS in file formats
• Data files are split and stored in multiple data nodes
• 100% coverage cannot be achieved
• Data consolidation issues

Volume Testing: Approach 

Testing Approach

• Use a data sampling strategy
• Base the sampling on data requirements
• Convert raw data into the expected result format to compare with actual output data
• Prepare ‘compare scripts’ to compare the data present in HDFS file storage
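The sampling and compare-script approach can be sketched as follows. The business rule (total = quantity × unit price), the record layout, and the in-memory `actual_output` dictionary are assumptions for illustration; in practice the actual output would be read from the job's files in HDFS.

```python
import random


def sample_records(records, fraction, seed=0):
    """Pick a reproducible random sample of the raw input records."""
    rng = random.Random(seed)
    size = max(1, int(len(records) * fraction))
    return rng.sample(records, size)


def to_expected(record):
    """Convert one raw record into the expected result format.

    Assumed rule for this sketch: total = quantity * unit_price.
    """
    key, quantity, unit_price = record
    return key, round(quantity * unit_price, 2)


def compare(sampled_raw, actual_output):
    """Compare expected results for the sample against the actual output."""
    mismatches = []
    for record in sampled_raw:
        key, expected = to_expected(record)
        actual = actual_output.get(key)
        if actual != expected:
            mismatches.append((key, expected, actual))
    return mismatches


# Hypothetical raw input and job output
raw = [(f"order-{i}", i, 2.5) for i in range(1, 101)]
actual_output = {key: round(qty * price, 2) for key, qty, price in raw}
actual_output["order-7"] = 0.0  # simulated defect in the job output

sample = sample_records(raw, fraction=0.2)
print(len(sample), "records sampled")
print("Mismatches in sample:", compare(sample, actual_output))
```

A random sample may miss a rare defect such as the one simulated here, which is why the slides tie the sampling strategy to the data requirements rather than to chance alone.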

Variety Testing: Challenges 


• Manually validating semi-structured and unstructured data
• Unstructured data is hard to validate because it has no defined format
• A lot of scripting is required to process semi-structured and unstructured data
• Sampling unstructured data is challenging

Variety Testing: Approach 

Testing Approach

• Structured Data: Compare data using compare tools and identify the discrepancies
• Semi-structured Data:
  • Convert semi-structured data into structured format
  • Format converted raw data to expected results
  • Compare expected result data with actual results
• Unstructured Data:
  • Parse unstructured text data into data blocks and aggregate the computed data blocks
  • Validate aggregated data against the data output
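For the semi-structured case, the convert-then-compare steps might look like the sketch below. It assumes the raw data is JSON with nested objects and that the system under test outputs flat rows with dotted column names; both are illustrative assumptions.

```python
import json


def flatten(record, prefix=""):
    """Convert one nested JSON object into a flat, structured row."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def find_discrepancies(expected_rows, actual_rows, key_field):
    """Compare expected and actual rows by key; return the keys that differ."""
    actual_by_key = {row[key_field]: row for row in actual_rows}
    return sorted(
        row[key_field]
        for row in expected_rows
        if actual_by_key.get(row[key_field]) != row
    )


# Hypothetical raw input lines and job output
raw_lines = [
    '{"id": 1, "user": {"name": "alice", "city": "Pune"}}',
    '{"id": 2, "user": {"name": "bob", "city": "Delhi"}}',
]
expected = [flatten(json.loads(line)) for line in raw_lines]
actual = [
    {"id": 1, "user.name": "alice", "user.city": "Pune"},
    {"id": 2, "user.name": "bob", "user.city": "Mumbai"},  # simulated defect
]
print(find_discrepancies(expected, actual, key_field="id"))
# [2]
```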

Velocity Testing: Challenges 


• Setting up a production-like environment for performance testing
• Simulating production job runs
• Setting up high-velocity, high-volume streaming test data
• Simulating node failures

Velocity Testing: Approach 

Validation Points

• Performance of Pig/Hive jobs, capturing:
  • Job completion time, validated against the benchmark
  • Throughput of the jobs
• Impact of background processes on performance of the system
• Memory and CPU details of the task tracker
• Availability of name node and data nodes

Metrics Captured

• Job completion time
• Throughput
• Memory utilization
• No. of spills and spilled records
• Job failure rate
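Several of these metrics can be derived from per-job records once they are collected. The sketch below assumes a hypothetical record format (status, duration in seconds, records processed, spilled records) and a hypothetical 10% tolerance over the benchmark; real values would come from the cluster's job history and counters.

```python
def summarize_jobs(jobs):
    """Compute completion time, throughput, spills and failure rate."""
    completed = [job for job in jobs if job["status"] == "SUCCEEDED"]
    total_seconds = sum(job["seconds"] for job in completed)
    total_records = sum(job["records"] for job in completed)
    return {
        "avg_completion_seconds":
            total_seconds / len(completed) if completed else None,
        "throughput_records_per_second":
            total_records / total_seconds if total_seconds else None,
        "spilled_records": sum(job["spilled_records"] for job in jobs),
        "failure_rate":
            (len(jobs) - len(completed)) / len(jobs) if jobs else None,
    }


def within_benchmark(summary, benchmark_seconds, tolerance=0.10):
    """Check average completion time against the benchmark plus tolerance."""
    average = summary["avg_completion_seconds"]
    if average is None:
        return False
    return average <= benchmark_seconds * (1 + tolerance)


# Hypothetical job records
jobs = [
    {"status": "SUCCEEDED", "seconds": 120, "records": 600000, "spilled_records": 0},
    {"status": "SUCCEEDED", "seconds": 80, "records": 400000, "spilled_records": 1500},
    {"status": "FAILED", "seconds": 30, "records": 0, "spilled_records": 0},
    {"status": "SUCCEEDED", "seconds": 100, "records": 500000, "spilled_records": 0},
]
summary = summarize_jobs(jobs)
print(summary)
print("Within benchmark of 95 s:", within_benchmark(summary, benchmark_seconds=95))
```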

Questions? 


References 

• http://en.wikipedia.org/wiki/Big_data 

• www.cloudera.com/ 

• http://developer.yahoo.com/hadoop/tutorial/index.html 

• http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond 


THANK YOU 

www.infosys.com 

The contents of this document are proprietary and confidential to Infosys Limited and may not be disclosed in whole or in part at any time, to any third party without the prior written consent of Infosys Limited.

© 2012 Infosys Limited. All rights reserved. Copyright in the whole and any part of this document belongs to Infosys Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part, in any manner or form, or in any media, without the prior written consent of Infosys Limited.
