Testing 3Vs (Volume, Variety and Velocity) of Big Data - QAI
Testing 3Vs (Volume, Variety and Velocity) of Big Data
1
A lot happens in the Digital World in 60 seconds…
2
What is Big Data?
• Big Data refers to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
• Big Data is a generic term for large volumes of structured, semi-structured and unstructured data.
3
Big Data Characteristics
• Three key characteristics of big data:
  o Volume: A high volume of data is created both inside and outside corporations via the web, mobile devices, IT infrastructure, and other sources.
  o Variety: Data arrives in structured, semi-structured and unstructured formats.
  o Velocity: Data is generated at high speed, and high volumes of it need to be processed within seconds.
4
Big Data Processing using the Hadoop framework
❶ Load source data files into HDFS
❷ Perform Map/Reduce operations
❸ Extract the output results from HDFS
5
Hadoop Map/Reduce processing: Overview
Map/Reduce is a distributed computing and parallel processing framework whose advantage is that it pushes the computation to the data.
• Distributed computing
• Parallel computing
• Based on Map and Reduce tasks
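To make the Map and Reduce tasks concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the word-count job are assumptions chosen for illustration, and a real cluster would run the map and reduce tasks in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: collapse all values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `run_job(["big data big tests", "data tests"])` counts each word across both lines. A tester can use the same small model to compute expected results for a sample input before comparing them with the output of the real job.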
6
Hadoop Eco-System
• HDFS - Hadoop Distributed File System
• HBase - NoSQL data store (non-relational distributed database)
• Map/Reduce - Distributed computing framework
• Sqoop - SQL-to-Hadoop database import and export tool
• Hive - Hadoop data warehouse
• Pig - Platform for creating Map/Reduce programs for analyzing large data sets
7
Unique Testing Opportunities in Big Data Implementations
8
Testing Opportunities for ‘Independent Testing’
• Early Validation of the Requirements
• Preparation of Big Test Data
• Early Validation of the Design
• Configuration Testing
• Incremental Load Testing
• Functional Testing
9
Early Validation of the Requirements
• Enterprise Data Warehouses integrated with Big Data
• Business Intelligence Systems integrated with Big Data
• Are the requirements mapped to the right data sources?
• Are there any data sources that are not considered? Why?
10
Early Validation of the Design
• Is the ‘Unstructured Data’ stored in the right place for analytics?
• Is the ‘Structured Data’ stored in the right place for analytics?
• Is the data duplicated in multiple storage systems? Why?
• Are the data synchronization needs adequately identified and addressed?
11
Preparation of Big Test Data
• Replicate data, intelligently, with tools
• How big should the data files be to ensure near-real volumes of data?
• Create data with incorrect schema
• Create erroneous data
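The test data ideas above can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the three-column schema in `FIELDS`, the helper names, and the choice to damage every tenth row. A real project would replace these with its own schema and data generation tooling.

```python
import csv
import io
import random

FIELDS = ["id", "customer", "amount"]  # hypothetical schema for illustration


def make_rows(count, seed=0):
    """Create `count` well-formed seed rows; the seed keeps runs repeatable."""
    rng = random.Random(seed)
    return [
        [str(i), f"cust{rng.randint(1, 50)}", f"{rng.uniform(1, 500):.2f}"]
        for i in range(count)
    ]


def replicate(rows, copies):
    """Repeat the seed rows, re-keying the id so that replicas stay unique."""
    size = len(rows)
    return [
        [str(int(row[0]) + copy * size)] + row[1:]
        for copy in range(copies)
        for row in rows
    ]


def corrupt(rows, every=10):
    """Return a copy in which every `every`-th row is deliberately damaged."""
    damaged = []
    for index, row in enumerate(rows):
        row = list(row)
        if index % every == 0:
            if (index // every) % 2 == 0:
                row = row[:-1]  # incorrect schema: the last column is missing
            else:
                row[2] = "not-a-number"  # erroneous data: amount is not numeric
        damaged.append(row)
    return damaged


def to_csv(rows):
    """Serialize rows, with a header line, to CSV text ready to load into HDFS."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(FIELDS)
    writer.writerows(rows)
    return buffer.getvalue()
```

Raising the `copies` argument of `replicate` is one simple way to approach near-real volumes from a small, well-understood seed file, while `corrupt` produces the negative cases the job is expected to reject or report.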
12
Cluster Setup Testing
• Does the system behave as expected when a cluster is removed?
• Does the system behave as expected when a cluster is added?
13
Big Data Testing
14
Volume Testing: Challenges
Testing challenges
• Terabytes and petabytes of data
• Data is stored in HDFS in file formats
• Data files are split and stored across multiple data nodes
• 100% coverage cannot be achieved
• Data consolidation issues
15
Volume Testing: Approach
Testing approach
• Use a data sampling strategy
• Base the sampling on the data requirements
• Convert raw data into the expected result format so it can be compared with the actual output data
• Prepare ‘compare scripts’ to compare the data present in HDFS file storage
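One possible shape for such a compare script is sketched below. It assumes the output has already been copied out of HDFS and parsed into dictionaries; the per-customer total in `expected_from_raw` is a hypothetical stand-in for whatever transformation is under test, and hashing the key is just one way to make the sample repeatable between runs.

```python
import hashlib


def in_sample(key, percent):
    """Deterministically select about `percent`% of keys, the same on every run."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def expected_from_raw(raw_records):
    """Hypothetical transformation under test: total amount per customer."""
    totals = {}
    for customer, amount in raw_records:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def compare(expected, actual, percent=100):
    """Return mismatches among sampled keys as {key: (expected, actual)}."""
    diffs = {}
    for key in sorted(set(expected) | set(actual)):
        if not in_sample(key, percent):
            continue
        if expected.get(key) != actual.get(key):
            diffs[key] = (expected.get(key), actual.get(key))
    return diffs
```

Because the sample is derived from the key itself, a failed comparison can be rerun later against exactly the same records, which helps when 100% coverage is not practical.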
16
Variety Testing: Challenges
Testing challenges
• Semi-structured and unstructured data has to be validated manually
• Unstructured data is hard to validate because it has no defined format
• A lot of scripting is required to process semi-structured and unstructured data
• Sampling unstructured data is a challenge
17
Variety Testing: Approach
Testing approach
• Structured data: Compare data using compare tools and identify the discrepancies
• Semi-structured data:
  o Convert semi-structured data into a structured format
  o Format the converted raw data into expected results
  o Compare the expected result data with the actual results
• Unstructured data:
  o Parse unstructured text data into data blocks and aggregate the computed data blocks
  o Validate the aggregated data against the data output
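The semi-structured and unstructured steps above might look like the following sketch. It assumes JSON lines as the semi-structured input and blank-line-separated log text as the unstructured input; the dotted column names, the `LOG_PATTERN` levels and the function names are all illustrative choices, not part of any Hadoop tool.

```python
import json
import re
from collections import Counter


def flatten(record, prefix=""):
    """Convert nested JSON objects into one flat dict with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def semi_structured_to_rows(json_lines):
    """Turn JSON lines into flat rows that a structured compare tool can handle."""
    return [flatten(json.loads(line)) for line in json_lines if line.strip()]


LOG_PATTERN = re.compile(r"\b(ERROR|WARN|INFO)\b")  # hypothetical log levels


def aggregate_unstructured(text):
    """Split free text into blank-line-separated blocks and count levels overall."""
    counts = Counter()
    for block in re.split(r"\n\s*\n", text.strip()):
        counts.update(LOG_PATTERN.findall(block))
    return dict(counts)
```

The flat rows can then go through the same compare step used for structured data, and the aggregated counts can be checked against the totals reported in the job output.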
18
Velocity Testing: Challenges
Testing challenges
• Setting up a production-like environment for performance testing
• Simulating production job runs
• Setting up high-velocity, high-volume streaming test data
• Simulating node failures
19
Velocity Testing: Approach
Validation points
• Performance of Pig/Hive jobs: capture job completion time and validate it against the benchmark
• Throughput of the jobs
• Impact of background processes on the performance of the system
• Memory and CPU details of the task tracker
• Availability of the name node and data nodes
Metrics captured
• Job completion time
• Throughput
• Memory utilization
• Number of spills and spilled records
• Job failure rate
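A few of these metrics can be derived from basic job-run records, as in the sketch below. The `JobRun` record, its field names and the 10% tolerance are assumptions for illustration; in practice the record counts, timings and status would come from the job history or monitoring tooling of the cluster.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One observed job execution (hypothetical record layout)."""
    name: str
    records: int
    seconds: float
    succeeded: bool = True


def throughput(run):
    """Records processed per second for one job run."""
    return run.records / run.seconds


def within_benchmark(run, benchmark_seconds, tolerance=0.10):
    """True when completion time is at most `tolerance` over the benchmark."""
    return run.seconds <= benchmark_seconds * (1 + tolerance)


def failure_rate(runs):
    """Fraction of job runs that failed."""
    return sum(not run.succeeded for run in runs) / len(runs)
```

For example, a run of 1,000,000 records in 200 seconds has a throughput of 5,000 records per second and passes a 190-second benchmark at the default tolerance, because the allowed limit is about 209 seconds.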
20
Questions?
21
THANK YOU
www.infosys.com
The contents of this document are proprietary and confidential to Infosys Limited and may not be disclosed in whole or in part at any time, to any third party without the prior written consent of Infosys Limited.
© 2012 Infosys Limited. All rights reserved. Copyright in the whole and any part of this document belongs to Infosys Limited. This work may not be used, sold, transferred, adapted, abridged, copied or reproduced in whole or in part, in any manner or form, or in any media, without the prior written consent of Infosys Limited.