
Testing 3Vs (Volume, Variety and Velocity) of Big Data



A lot happens in the Digital World in 60 seconds…



What is Big Data

• Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

• Big Data is a generic term used to describe the voluminous amount of unstructured, structured and semi-structured data.



Big Data Characteristics

• 3 key characteristics of big data:
Volume: High volume of data created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources.
Variety: Data comes in structured, semi-structured and unstructured formats.
Velocity: Data is generated at high speed.
o High volumes of data need to be processed within seconds.



Big Data Processing using the Hadoop framework

❶ Load source data files into HDFS
❷ Perform Map/Reduce operations
❸ Extract the output results from HDFS

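To make steps ❶ and ❸ concrete, here is a minimal Java sketch against the standard Hadoop FileSystem API. The local and HDFS paths are placeholders for illustration, and fs.defaultFS is assumed to point at the cluster's name node.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLoadExtract {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes fs.defaultFS (e.g. hdfs://namenode:8020) is set in core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Step 1: load a local source data file into HDFS
        fs.copyFromLocalFile(new Path("/local/data/source.log"),
                             new Path("/bigdata/input/source.log"));

        // Step 3: after the Map/Reduce job (step 2) has run, pull the results back out
        fs.copyToLocalFile(new Path("/bigdata/output/part-r-00000"),
                           new Path("/local/results/part-r-00000"));
        fs.close();
    }
}
```

Step ❷, the Map/Reduce operations themselves, is sketched after the next slide.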


Hadoop Map/Reduce processing – Overview

Map/Reduce is a distributed computing and parallel processing framework with the advantage of pushing the computation to the data.

• Distributed computing
• Parallel computing
• Based on Map & Reduce tasks

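As an illustration of Map and Reduce tasks, here is the canonical word-count pair written against the Hadoop MapReduce API; the class names are illustrative. Each mapper runs against the input split stored on its node, which is what "pushing the computation to the data" means in practice.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: reads one line of input and emits a (word, 1) pair per token
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);
        }
    }
}

// Reduce task: receives all counts for one word and sums them
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}
```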


Hadoop Eco-System

• HDFS – Hadoop Distributed File System
• HBase – NoSQL data store (non-relational distributed database)
• Map/Reduce – Distributed computing framework
• Sqoop – SQL-to-Hadoop database import and export tool
• Hive – Hadoop data warehouse
• Pig – Platform for creating Map/Reduce programs for analyzing large data sets

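As a small, hedged example of how the Hive layer of this ecosystem is commonly queried, the sketch below uses the HiveServer2 JDBC interface. The host, port, credentials, and the web_logs table are assumptions, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (older driver versions need this explicitly)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 endpoint; host, port, database and credentials are placeholders
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```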


Unique Testing Opportunities in Big Data Implementations



Testing Opportunities for ‘Independent Testing’

• Early Validation of the Requirements
• Preparation of Big Test Data
• Early Validation of the Design
• Configuration Testing
• Incremental Load Testing
• Functional Testing



Early Validation of the Requirements

Applies to Enterprise Data Warehouses integrated with Big Data and to Business Intelligence Systems integrated with Big Data.

• Are the requirements mapped to the right data sources?
• Are there data sources that were not considered? Why?



Early Validation of the Design

• Is the ‘Unstructured Data’ stored in the right place for analytics?
• Is the ‘Structured Data’ stored in the right place for analytics?
• Is the data duplicated in multiple storage systems? Why?
• Are the data synchronization needs adequately identified and addressed?



Preparation of Big Test Data

• Replicate data intelligently with tools (see the generator sketch below).
• How big should the data files be to ensure near-real volumes of data?
• Create data with an incorrect schema.
• Create erroneous data.

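A minimal sketch of one way to prepare such test data: generate well-formed records at volume, and deliberately inject rows with an incorrect schema or erroneous values. The record layout and error rates are assumptions for illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

public class TestDataGenerator {
    public static void main(String[] args) throws IOException {
        long rows = 1_000_000L;      // scale up toward near-real volumes
        Random rnd = new Random(42); // fixed seed so test runs are reproducible
        try (PrintWriter out = new PrintWriter("customer_events.csv")) {
            for (long i = 0; i < rows; i++) {
                double roll = rnd.nextDouble();
                if (roll < 0.01) {
                    // incorrect schema: the amount column is missing entirely
                    out.printf("%d,EVT%05d%n", i, rnd.nextInt(100000));
                } else if (roll < 0.02) {
                    // erroneous data: non-numeric value in the amount field
                    out.printf("%d,EVT%05d,not-a-number%n", i, rnd.nextInt(100000));
                } else {
                    // well-formed record: id, event code, amount
                    out.printf("%d,EVT%05d,%.2f%n", i, rnd.nextInt(100000),
                               rnd.nextDouble() * 500);
                }
            }
        }
    }
}
```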


Cluster Setup Testing

• Is the system behaving as expected when a node is removed from the cluster?
• Is the system behaving as expected when a node is added to the cluster?

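One way to support such checks is to script cluster-state inspection around the add/remove operation. The sketch below, assuming an HDFS cluster and a known test file, reads the live data-node count and a file's replication factor via the Hadoop API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterHealthCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Count live data nodes before and after adding/removing a node
        if (fs instanceof DistributedFileSystem) {
            DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
            System.out.println("Live data nodes: " + nodes.length);
        }

        // Spot-check that a known file still meets its expected replication factor
        Path sample = new Path("/bigdata/input/source.log"); // assumed test file
        short replication = fs.getFileStatus(sample).getReplication();
        System.out.println("Replication factor: " + replication);
    }
}
```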


Big Data Testing



Volume Testing: Challenges

• Terabytes and petabytes of data
• Data is stored in HDFS in varied file formats
• Data files are split and stored across multiple data nodes
• 100% coverage cannot be achieved
• Data consolidation issues



Volume Testing: Approach

• Use a data sampling strategy
• Base the sampling on the data requirements
• Convert raw data into the expected result format to compare with the actual output data
• Prepare ‘compare scripts’ to compare the data present in HDFS file storage (a sketch follows below)

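A minimal sketch of such a compare script, reading an expected-results file (built from sampled raw data) and the actual job output directly from HDFS. The paths are placeholders, and records are assumed to be sorted identically on both sides.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCompare {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Expected results derived from sampled raw data; actual is the job output
        Path expected = new Path("/bigdata/expected/part-r-00000");
        Path actual   = new Path("/bigdata/output/part-r-00000");

        try (BufferedReader exp = new BufferedReader(new InputStreamReader(fs.open(expected)));
             BufferedReader act = new BufferedReader(new InputStreamReader(fs.open(actual)))) {
            long line = 0, mismatches = 0;
            String e, a;
            // Walks both files in lockstep; a length difference ends the comparison
            while ((e = exp.readLine()) != null && (a = act.readLine()) != null) {
                line++;
                if (!e.equals(a)) {
                    mismatches++;
                    System.out.println("Mismatch at line " + line + ": " + e + " vs " + a);
                }
            }
            System.out.println("Compared " + line + " lines, " + mismatches + " mismatches");
        }
    }
}
```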


Variety Testing: Challenges

• Manually validating semi-structured and unstructured data
• Validating unstructured data is hard because it has no defined format
• A lot of scripting is required to process semi-structured and unstructured data
• Sampling unstructured data is challenging



Variety Testing: Approach

• Structured Data: Compare data using compare tools and identify the discrepancies.
• Semi-structured Data (see the sketch after this list):
• Convert semi-structured data into a structured format
• Format the converted raw data into expected results
• Compare the expected result data with the actual results
• Unstructured Data:
• Parse unstructured text data into data blocks and aggregate the computed data blocks
• Validate the aggregated data against the data output

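As one concrete way to convert semi-structured data into a structured format, the sketch below flattens a JSON array of events into CSV using the Jackson library. The field names, file names, and the assumption that the input is a JSON array are all illustrative.

```java
import java.io.File;
import java.io.PrintWriter;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToCsv {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Semi-structured input: a JSON array with one object per record
        JsonNode records = mapper.readTree(new File("events.json"));

        try (PrintWriter out = new PrintWriter("events.csv")) {
            out.println("id,event,amount"); // target structured layout
            for (JsonNode rec : records) {
                // path() returns a "missing" node instead of failing, so optional
                // fields in the semi-structured source become empty CSV cells
                out.printf("%s,%s,%s%n",
                        rec.path("id").asText(""),
                        rec.path("event").asText(""),
                        rec.path("amount").asText(""));
            }
        }
    }
}
```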


Velocity Testing: Challenges

• Setting up a production-like environment for performance testing
• Simulating production job runs
• Setting up high-velocity, high-volume streaming test data
• Simulating node failures



Velocity Testing: Approach

Validation Points
• Performance of Pig/Hive jobs: capture job completion time and validate it against the benchmark
• Throughput of the jobs
• Impact of background processes on the performance of the system
• Memory and CPU details of the task tracker
• Availability of the name node and data nodes

Metrics Captured (see the capture sketch after this list)
• Job completion time
• Throughput
• Memory utilization
• Number of spills and spilled records
• Job failure rate

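A sketch of how some of these metrics can be captured programmatically: time the job run and read Hadoop's built-in counters after completion. It reuses the word-count classes from the earlier sketch, and the HDFS paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMetricsCapture {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "velocity-test");
        job.setJarByClass(JobMetricsCapture.class);
        job.setMapperClass(WordCountMapper.class);   // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/bigdata/input"));
        FileOutputFormat.setOutputPath(job, new Path("/bigdata/output"));

        long start = System.currentTimeMillis();
        boolean ok = job.waitForCompletion(true);    // blocks until the job finishes
        long elapsedMs = System.currentTimeMillis() - start;

        // Read spill behavior from Hadoop's built-in task counters
        Counters counters = job.getCounters();
        long spilled = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();

        System.out.println("Job succeeded: " + ok);
        System.out.println("Completion time (ms): " + elapsedMs);
        System.out.println("Spilled records: " + spilled);
    }
}
```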


Questions?



References

• http://en.wikipedia.org/wiki/Big_data
• http://www.cloudera.com/
• http://developer.yahoo.com/hadoop/tutorial/index.html
• http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond



THANK YOU

www.infosys.com

The contents of this document are proprietary and confidential to Infosys Limited and may not be disclosed in whole or in part at any time, to any third party without the prior written consent of Infosys Limited.

© 2012 Infosys Limited. All rights reserved. Copyright in the whole and any part of this document belongs to Infosys Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part, in any manner or form, or in any media, without the prior written consent of Infosys Limited.
