Luncheon Webinar Series
May 13, 2013
InfoSphere DataStage is Big Data Integration
Sponsored By:
Presented by: Tony Curcio, InfoSphere Product Management
InfoSphere DataStage is Big Data Integration
• Questions and suggestions regarding presentation topics? Send them to editor@dsxchange.net
• Downloading the presentation
– Click YES on the Presentation poll question
– A replay will be available within one day; an email with details will follow
• Bonus offer – free premium membership for your DataStage management! Submit your manager's email address and we will offer him/her access on your behalf.
– Email Info@dsxchange.net with the subject line "Managers special".
– Join us all at LinkedIn: http://tinyurl.com/DSXmembers
– DSXchange will sponsor a trial membership for new requests at the LinkedIn DSX members site
InfoSphere DataStage is Big Data Integration
Tony Curcio
InfoSphere Product Management
Bigger Data Integration Challenges
New types of data stores
• Big Data introduces additional data stores that need to be integrated – both Hadoop-based and noSQL-based
• These data stores don't easily lend themselves to conventional methods of data movement
New data types and formats
• Unstructured data; poly-structured data stores; JSON, Avro, and more still to come
• Video, documents, web logs, …
Larger volumes
• Solutions need to move, transform, cleanse, and otherwise prepare huge data volumes
• Big Data requires data scalability
Benefits of InfoSphere DataStage
Speeds Productivity
Graphical design is easier to use than hand coding
Simplifies Heterogeneity
A common method for diverse data sources
Shortens Project Cycles
Pre-built components reduce cost and timelines
Promotes Object Reuse
Build once, share, and run anywhere (ETL/ELT/real-time)
Reduces Operational Cost
Provides a robust framework to manage data integration
Protects from Changes
Isolation from underlying technologies as they continue to evolve
Big Data is part of the Information Supply Chain
[Diagram: transactional & collaborative applications, external information sources, and streaming information feed an integrate / manage / analyze flow spanning master data, content, Big Data, data warehouses, cubes, and streams, delivering to business analytics applications, all under information governance (quality, lifecycle, security & privacy, standards).]
Gartner Magic Quadrant: "IBM is the only DBMS vendor that can offer an information architecture across the entire organization, covering information on all systems."
4 Key Analytical Use Cases for Big Data
• Big Data Exploration – find, visualize, and understand all big data to improve decision making
• Data Warehouse Augmentation – integrate big data and data warehouse capabilities to increase operational efficiency
• Enhanced 360° View of the Customer – extend existing customer views by incorporating additional information sources
• Operations Analysis – analyze a variety of machine data for improved business results
Data Warehouse Augmentation
Integrate big data and data warehouse capabilities to increase operational efficiency
Challenges
• Leveraging structured, unstructured, and streaming data sources for deep analysis
• Low latency requirements
• Query access to data
• Optimizing the warehouse for big data volumes
• Metadata management to support impact analysis and data lineage
Required capabilities
• Data Integration Hub Processing – high-speed, massively scalable reads from and writes to big data sources and new data types
• Big Data Expert – automatically build MapReduce logic through simple data flow design and coordinate workflow across traditional and big data platforms
Data Integration Hub Processing
"Connectivity Hub"
InfoSphere DataStage
Effectively handle the complexity of enterprise information sources and types with a common design paradigm across a heterogeneous landscape, using a high-speed, scalable solution to speed the delivery of analytics.
InfoSphere DataStage is Big Data Integration
[Diagram: the same Source → Transform → Cleanse → Enrich → EDW data flow running sequentially on a uniprocessor, 4-way parallel on an SMP system, and 64-way parallel on an MPP clustered system.]
• Dynamic – instantly get better performance as hardware resources are added to any topology
• Extendable – add a new server to scale out through a simple text file edit (or, in a grid configuration, automatically via integration with grid management software)
• Data Partitioned – in true MPP fashion (like Hadoop), data persisted in the data integration platform is stored in parallel to scale out the I/O
• Hadoop Integrated – push all or parts of the process out to Hadoop, in ELT fashion, to take advantage of its scalability
Big Data Source Types
• Hadoop Distributed File System (HDFS) – massively scalable and resilient storage
• noSQL (not-only SQL) – record storage optimized for read (or write)
• InfoSphere Streams – massive real-time analytics
Blazing Fast HDFS
• Available since v8.7 in 2011
• Extends the simple flat file paradigm – just add your Hadoop server name and port number
• Parallelization techniques pipe data in and out at massive scale
• A performance study ran at up to 5.2 TB/hr before the HDFS disks were completely saturated (5-node Hadoop cluster)
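Outside of the DataStage connector itself, the same "just a host name and port" idea can be tried from plain Python with the open-source pyarrow HDFS client; the sketch below is illustrative only, and the host, port, and file paths are made-up placeholders.

```python
# Minimal sketch (not the DataStage HDFS stage): read one HDFS file and write
# another using pyarrow. Requires the Hadoop client libraries (libhdfs) on the
# machine running the script. Host, port, and paths are placeholders.
from pyarrow import fs

hdfs = fs.HadoopFileSystem(host="hadoop-master.example.com", port=8020)

# Read an existing HDFS file into memory
with hdfs.open_input_stream("/landing/clickstream.csv") as src:
    raw = src.read()

# Write a new HDFS file
with hdfs.open_output_stream("/curated/clickstream_copy.csv") as dst:
    dst.write(raw)
```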
Simple data flow design for HDFS
[Job canvas: read from an HDFS file in parallel, transform/restructure the data, join two HDFS files, and create a new HDFS file, fully parallelized.]
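The job above is assembled graphically in DataStage; purely as a point of comparison, a rough PySpark equivalent of the same read-transform-join-write flow is sketched below. The file paths, column names, and join key are hypothetical.

```python
# Rough PySpark equivalent of the slide's flow: read two HDFS files in
# parallel, restructure one of them, join them, and write a new HDFS output.
# Paths and column names are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hdfs_flow_sketch").getOrCreate()

orders = spark.read.option("header", True).csv("hdfs:///data/orders.csv")
customers = spark.read.option("header", True).csv("hdfs:///data/customers.csv")

# Transform/restructure one input before the join
orders = orders.withColumn("order_total", F.col("order_total").cast("double"))

# Join the two HDFS files and create the new output, parallelized by partition
joined = orders.join(customers, on="customer_id", how="inner")
joined.write.mode("overwrite").parquet("hdfs:///data/orders_enriched")
```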
Agile Connector Accelerators for noSQL
• New connectors available as open code on developerWorks
• Plug into InfoSphere DataStage and operate just like any other stage
• Include features to exploit specific data sources
Sample Job with MongoDB and Hive
[Job canvas: a stage selects which HDFS data to send downstream, accepts MongoDB-specific directives, and writes data to MongoDB and to Hive.]
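For orientation only, the snippet below shows roughly what writing the same kinds of records to MongoDB and Hive looks like from plain Python using pymongo and PyHive; the hosts, database, collection, and table names are invented, and row-at-a-time Hive inserts are shown purely for illustration.

```python
# Illustrative only, not the DataStage connectors: write records to MongoDB
# with pymongo and to Hive with PyHive. All connection details are placeholders.
from pymongo import MongoClient
from pyhive import hive

records = [{"customer_id": 1, "status": "active"},
           {"customer_id": 2, "status": "lapsed"}]

# MongoDB target
mongo = MongoClient("mongodb://mongo.example.com:27017")
mongo["crm"]["customers"].insert_many(records)

# Hive target (row-at-a-time inserts only for illustration; bulk loads are
# the realistic path for big data volumes)
conn = hive.Connection(host="hive.example.com", port=10000)
cur = conn.cursor()
for r in records:
    cur.execute("INSERT INTO TABLE customers VALUES (%d, '%s')"
                % (r["customer_id"], r["status"]))
```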
Parse and Compose JSON (beta)
• Parsing and composing of the JSON data format
• Builds on the advanced transformation framework already provided for XML capabilities
• Beta available on InfoSphere DataStage 9.1 FP1
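As background, the parse/compose round trip itself is simple; the sketch below shows it with Python's standard json module, using a made-up payload. The DataStage feature adds the same capability inside its graphical transformation framework.

```python
# Minimal parse-and-compose round trip with Python's standard json module.
# The payload is a made-up example record.
import json

raw = '{"customer_id": 42, "events": [{"type": "click", "ts": "2013-05-13T10:00:00Z"}]}'

# Parse: JSON text -> native Python structures
doc = json.loads(raw)
doc["events"].append({"type": "purchase", "ts": "2013-05-13T10:05:00Z"})

# Compose: native Python structures -> JSON text
print(json.dumps(doc, indent=2))
```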
Big Data Expert
"Big Data Expert"
InfoSphere DataStage
Automatically push transformational processing close to where the data resides (SQL for DBMSs, MapReduce for Hadoop), leveraging the same simple data flow design process and coordinating workflow across all platforms.
Automated MapReduce Job Generation
• New in 9.1: leverage the same UI and the same stages to build MapReduce.
• Drag and drop stages onto the canvas to create a job, rather than having to learn MapReduce programming.
• Push the processing to Hadoop for patterns where you don't want to transport the data over the network.
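For contrast with the drag-and-drop approach, the sketch below shows the kind of hand-written logic the generator spares you from: a minimal Hadoop Streaming word-count mapper and reducer in Python (a standard example, not code produced by DataStage).

```python
# mapper.py - minimal Hadoop Streaming mapper: emit (word, 1) per input word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)
```

```python
# reducer.py - minimal Hadoop Streaming reducer: sum counts per word, relying
# on the framework sorting mapper output by key before it reaches the reducer
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```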
Automated MapReduce Job Generation
[Screenshot: integration jobs are built with the same data flow tool and stages, and the MapReduce code is created automatically.]
Automated MapReduce Job Generation
[Screenshot: a job that also includes a database on a separate system; the engine recognizes what processing can run natively in Hadoop and what requires the DataStage engine to move the data.]
Architecture for Warehouse Landing Zone
Use case requirements for a data warehouse landing zone:
• Large scale – large data volumes; scale-out requires an open MPP platform
• Low cost – low-cost storage, compute, and commodity hardware
• Many data types – coverage of unstructured, semi-structured, and social data types
• Many access patterns – exploratory, iterative, and discovery oriented
[Architecture diagram: all sources (clickstream, sensors, transactions, content) flow through replication and ETL into a BigInsights/Hadoop landing zone (JAQL, Hive, HBase, custom MapReduce) that feeds the analytics and operational warehouse zones; Information Server provides ETL, lineage, and quality, Optim provides masking, and Guardium provides data protection.]
Combined Workflows for Big Data
• Oozie Integration
– Same design paradigm for workflows as for job design.
– Directly call an Oozie activity that is invoking custom MapReduce code.
• End-to-end Workflows
– Sequence right alongside other data integration and analytics activities.
– Allows users to have the data sourcing, ETL, analytics, and delivery of information all controlled through a single process.
– Monitor all stages through the Operations Console's web-based interface.
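As a rough illustration of what calling Oozie directly involves (the step the DataStage sequencer wraps for you), the sketch below starts a workflow job through Oozie's REST API; the Oozie host/port, user, and HDFS application path are placeholders, and the workflow is assumed to already exist.

```python
# Illustrative only: start an Oozie workflow (which could wrap custom
# MapReduce code) through Oozie's REST API. Host, port, user, and the HDFS
# application path are placeholders.
import requests

config = """<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>user.name</name>
    <value>etl_user</value>
  </property>
  <property>
    <name>oozie.wf.application.path</name>
    <value>hdfs:///apps/workflows/custom-mapreduce</value>
  </property>
</configuration>
"""

resp = requests.post(
    "http://oozie.example.com:11000/oozie/v1/jobs?action=start",
    data=config,
    headers={"Content-Type": "application/xml"},
)
print(resp.json())  # typically returns the new workflow job id
```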
Cross-Tool Impact Analysis and Traceability
• Understand how traditional and big data sources are being used
• Assess the impact of change and mitigate risks
• Show impact on downstream applications and BI reports
• Navigate through impacted areas and drill down
Wrap-up
The IBM Big Data Platform
New analytic applications drive the requirements for a big data platform:
• Integrate and manage the full variety, velocity, and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and governance
[Platform diagram: the big data platform combines a Hadoop system, stream computing, and data warehouse engines with application development, systems management, accelerators, and discovery, all built on information integration & governance across data, media, content, machine, and social sources.]
Information Integration & Governance for Big Data
Integrate & Link Big Data
• Big Data as a Source
• Big Data as a Target
• Data Transformations
• Data Movement
• Integrate with the existing Enterprise
• Data Lineage & Impact Analysis
• Metadata Integration with Analytics
• Real-time & Data Federation
Cleanse and Validate Big Data
• Accuracy and Entity Matching with Social Data
• De-duplication and Standardization of Machine Data
• In-line Cleansing with Integration
• Trusted Data Dashboard and Reporting on Data Quality
Protect Big Data
• Activity Monitoring
• Data Masking
• Data Encryption
• On-Demand / In-Place Protection
• In-Line Protection (with ETL etc.)
• Active Detection & Alerting
Audit & Archive Big Data
• Queryable Archive
• Structured and Semi-Structured
• Optimized Connectors to Existing Apps
• Hot-Restorable On-the-Fly
• Immutable and Secure Access
• Automated Legal Hold Capability for Data Freeze
Master Big Data
• Big Data as a Supplier
• Big Data as a Consumer
• Links between Big Data and Trusted Golden Records
• Leverage Master Data in Big Data Analytics
• Entity Resolution at Extreme Scale-Out Levels
• Probabilistic Entity Matching
Where to go to learn more…
• If you'd like to explore this topic further…
– Contact your IBM account team or your preferred IBM Partner.
• If you'd like to explore more about InfoSphere DataStage and the Information Server platform…
– http://www-01.ibm.com/software/data/integration/info_server/
• If you're looking for an enterprise-level Hadoop distribution…
– InfoSphere BigInsights: http://www-01.ibm.com/software/data/infosphere/biginsights/
Thanks